Joining Datasets | Azure ML Tutorial Part 6

In the last video, we prepared for dataset joining. In this video we’ll use the join data module inside of Azure ML to cross reference each airport id with the airport table to find the airport’s city, state and name. We will briefly go over the different types of joins, then combine the three tables together. Each time we join, we will add 3 columns to our dataset.

Hey.
Welcome back to Data Mining with Azure Machine Learning Studio,
brought to you by Data Science Dojo.
Today, we’re going to learn how to join two data
sets together by a key.
If you’ve ever used SQL before, it’s the same thing.
If you’ve ever used Excel before,
it’s a more powerful version of a VLOOKUP.
So joining allows us to have multiple tables share
relevant information with each other,
or it can be used to cross-reference additional
information about each transaction.
Now, joining usually expands the number of columns or features
that we have access to, which may give the machine learning
model access to more relevant features to learn from.
So we’re going to continue where we left off yesterday,
so make sure you watched the previous video, because it’s
going to be a prerequisite for the current video.
So last video, we took the airport codes,
and then we basically cloned the data and then we renamed–
on one side, we prefixed everything
with the column name, and on the right side,
we went ahead and prefixed everything with destinations.
So now when we do a join, we’ll know that
are we talking about a destination state or an origin
state?
So what we got to do is, first of all,
we got to bring in our main transaction data.
So our main transaction data, as you remember,
will be the flight delay data.
So if I open up the sample data files, scroll down,
I will go ahead and find the flight on time performance raw.
I’ll drag that in, OK?
So we have to join on a key.
So if we look at either of these two data sets,
so if we visualize, right-click and visualize on the output
node, we can see that airport ID will be the unique identifier
used to cross-reference.
And if we go into flight on time performance and visualize,
you’ll see that there are two columns that are
using these airport ID keys.
So an origin airport ID and a destination airport ID.
So when we combine these two data sets to this transaction
table, we will end up with at least three additional columns
per join.
So this is going to add one, two, three,
and this is also going to add one, two, three.
So this will hopefully improve the quality
that our machine learning model can learn from later.
So our machine learning model can learn, oh, man, this flight
is coming from Alaska.
It’s always going to be late if it’s coming from Alaska,
for example.
It can learn those details like that.
So to do that, go to the very top left corner of your toolbar
and then basically expand it, and then, in the search box,
type in the word join.
You will have a module called join data.
So I’m going to drag that in here.
And I like to put it so it’s in the middle of the two data
sets that I’m going to try to join together.
So notice that the main transaction table, so the table
that I want the transactions to be based off of,
is going to be the table on the left.
Azure ML is left-join based, and I’ll
explain what that means in a little bit.
But the idea is you want the main transaction
table on the left side, and you want the cross-reference table
to be on the right side.
And notice that the join module has two input nodes, a left
and right.
So I’m going to drag this over here,
and I’ll drag this over here.
So notice these two data sets are now
funneling into this join data module.
So if I click on this join data module,
you’ll see that there’s a value required.
So that means that I have to expand this.
I have to select it, make sure it’s highlighted in blue,
and then I have to expand the properties table for it.
All right, so now it’s asking us what
key do we want to join on on the left,
and what key do we want to cross-reference
with that join on the right?
So I will go ahead and launch the Column Selector
and say I want to join the origin airport ID.
You see that?
I’ve selected the origin airport ID
on the left table, the left table
being the main transaction table.
And on the right side, the right side,
this is the origin airport.
So notice that I’m going to try to join these IDs together.
So I’m going to launch the Column Selector
and notice that airport ID, I’m going to drag that in.
So I want the origin airport ID to be cross-referenced
with the origin airport ID.
Now, notice that we can match the case.
In this case, I think they’re the same
because they’re numbers, but I’m just
going to uncheck this anyways.
It’s usually a habit of mine.
And then the next thing I’m going to do is see this?
See this that says keep right key columns in joined table?
Now, if you need to keep the airport ID of this data set,
this is what you would do, you would keep it checked.
But since they’re going to be redundant information for me,
I’m going to go ahead and uncheck this box, which
means I’m only going to keep the origin airport
ID on the left side, and I’m going to drop the origin
airport ID on the right table.
And now, this part’s going to be a little bit confusing.
What join type do you want?
So join type is referring to, basically,
when you do the join, what happens when there is a match
or when there’s not a match?
So when you do an inner join, let’s look at the inner
join first.
When you join two tables together,
there’s a table and a right table.
When you do an inner join, you only
want to see rows that have a match on both sides, right?
The same primary key shows up on the left table
as the right table, and then you want
to exclude anything that doesn’t have a pair.
That’s an inner join.
We don’t want that right now, because remember,
our main transaction table is on the left.
We care to keep every transaction table.
We’re only hoping to add additional information
to our table.
So that’s where we want to do what’s
called a left outer join.
So the left outer join is going to keep every key
on the left table, regardless of whether it
has a match to the right table or not.
And notice that everything in the right table that
does not have a match is going to be dropped.
And then the opposite is true for right outer join,
and then for a full outer join, you keep everything.
Whether there is a match or not a match, you keep everything.
So what we’re going to do is we’re
going to do a left outer join to ensure that all transactions
are kept in the left table.
So we’re going to go ahead and select left outer join here.
And then what we’re going to do is join data module,
we’re going to add some documentation.
So I’m going to say look up origin airport location.
So this is going to bring in airport name, airport
state, airport city.
So I’m going to run this module now,
so it’s going to do that join for us.
OK and it’s done.
So if we go ahead and visualize the output
of this brand-new table that is now
an amalgamation of these two tables combined,
we can then visualize.
So if the join happened properly,
we shouldn’t have dropped any rows.
So we have 504,000 rows after the join, and before the join,
if I visualize, I should also have
504,000 rows, which is true.
So the join so far has not corrupted any data.
And if we go back into the data set,
we can go ahead and see if we can find–
so we can see that the destination airport
ID is right there.
That was one of our keys that we were joining.
And notice at the end, it’s added
that the airport ID again, origin city, origin state,
and origin airport.
So what I did was I forgot to uncheck this box right here,
so that’s why the airport ID came in from this table,
too, even though it’s redundant because we already
have a destination airport ID.
I’m just going to go back to this join and uncheck this box.
But I also know that this did wonderfully.
This join did what it was supposed to do.
So now I can go ahead and join this table to the second table.
So I can go ahead and select this join.
I can either right-click and say a copy
and then right-click and say paste,
or I can just go to my keyboard and say Control +
C while selecting this and then Control + V
and then it will copy the join data module so that I
can connect my new table.
So remember, the primary transaction table
is going to be this table.
It’s going to be the left table.
And then we want to now join the destination airport
to the airport ID.
So I am going to drag this join module
until it’s in the center of these two data sets,
just so I’m crystal on what’s happening here.
All right, so notice that it’s on the left table.
It brought in our original value, which was origin airport
ID, which is wrong.
I want to launch this Column Selector again,
and instead of saying origin airport ID,
I want to say destination airport ID.
I want to check this box.
So I’m joining destination airport ID with the airport ID
inside of this table on the right side, which
is where we renamed the columns to be
prefixed with destination.
And I want to keep everything the same,
so I want it to also be a left outer join.
And we’re not going to keep the right key,
and we’re not going to match casing.
All right, so I’m going to run that.
So I’ve done and I’ve successfully–
once this is running, once this is finished running,
I would have successfully added six columns to this data set,
and these six columns will be probably very invaluable to us
when we go and build our machine learning models.
And that’s how you join data, two data
sets, together in Azure ML.
So let’s look at it real quickly,
and then I think that’s it for time.
So if we go here, yup.
So destination, destination, destination, origin, origin,
origin, and then we’re left with 504,000 rows
still, which is good.
No data got lost during the join.
So join us next time, when I will show you how to summarize,
show you some summary statistics from your data,
and how to drop columns.
Hey, if you liked that video you want
to see more videos like this in the future, go ahead
and like and subscribe, and I will look forward
to seeing you at our boot camp.

You can get a free trial of Azure here.

Here is the link to the Azure Portal.

Part 7:
Dropping and Selecting Columns 

Part 5:
Data Exploration

Complete Series:
Introduction to Azure Machine Learning

More Data Science Learning Material:
[Video] Introduction to Azure ML and Cloud Computing
[Blog] Which Machine Learning Tools Should I Learn?

 

(998)

Phuc H Duong
About The Author
- Phuc holds a Bachelors degree in Business with a focus on Information Systems and Accounting from the University of Washington.

Avatar

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>