Renaming Columns and Replicating Data | Azure ML Tutorial Part 5

Renaming Columns and Replicating Data – Now that we have a better understanding of our data-set. Let’s go out and gather more data. There is an additional data-set inside of Azure ML we can use to look up and cross reference the airport name with airport ID, then bring with it, which state and city the airport belongs in. These features will add more information to our data-set. Before we can do this join, let’s mark the columns as being part of the arrival airport or origin airport. We will use the edit meta data module for renaming columns and to fork the data-set into two duplicate but differently named projections.

Hey, and welcome back to data mining with Azure
Machine Learning Studio, brought to you by Data Science Dojo.
Today we’re going to learn how to rename columns.
Renaming columns is a very mundane, but necessary, task
in data science.
These are some of the scenarios that might make
it necessary to rename columns.
So let’s say you’ve read in a data set,
and there’s no headers.
Now you have to go in and manually provide it, right?
Another scenario– someone gave you a bad column name.
Now you need to rename it so that it has some more
clarity among you and your team, so you can actually
look at a column and instantly know what it is for.
Another scenario is a column has transformed into something
different completely.
Let’s say you had a table of customer names
and that you’ve normalized all ages.
Now that column is completely different.
The column should actually say normalized ages, for clarity.
So we’re just going to begin where we left off last time.
So last time we went ahead and did some exploration.
Through data exploration, we looked at these column names.
And then what we found out was if we visualize the column
names again, we can see that there is origin airport
ID and destination airport ID.
So, it just so happens that there is another table
inside of Azure ML.
So if you look at sample data sets,
and yes, if you drag in a data set
called airport codes data set, if you drag that in, OK,
and visualize.
So what is in here?
So these are the airport IDs.
So the idea is if we can cross-reference
these IDs, the airport IDs from the actual flight table
to this table, we can actually then
get what state did the flight depart from?
What city did the flight depart from?
And also what is the name of that airport,
which actually might be more clear than airport ID.
OK?
So we’re going to do a join on this table later.
But before we do that though, something
is going to cause that to be a little bit
weird, because we actually need to do this join twice, right?
Because there is an airport ID for destination
and there is an airport ID for origin.
All right?
So we’re going have to make that join twice.
But if we join twice on the same thing,
the column names are going to be the same.
OK?
So we’re going to have two states.
We’re going to have two cities.
And that’s going to make everything very, very
confusing.
So what we’re going to do is actually
we’re going to go ahead and clone this data set twice
and rename the columns both times on this.
So that way, we can do a join on the left
and we know this is the origin state.
And then when we join on the right, now we can say,
OK, this is the destination state.
OK, so to rename columns, what we’re going to do
is we’re going to use a module called the METADATA EDITOR.
So if you drag in the METADATA EDITOR, OK,
and if you expand the toolbar on the right-hand side,
you can see that we can launch what’s
called a COLUMN SELECTOR.
So these are the columns that we want to make the changes to.
And notice at the very bottom, there
is a section called NEW COLUMN NAMES.
The NEW COLUMN NAMES accepts basically a list of new column
names to be overwritten.
OK?
And it’s in the same order.
All right.
So the idea is if I connect the airport codes
to this edit metadata workflow right here.
So notice that the output of this module
is going to the input of this module.
So the data is flowing out of the data module,
and it’s flowing into the metadata module.
OK?
And now, because this is connected,
if I launch this column selector,
it’s going to see the columns already there
for me, which is really nice.
OK?
And then I can go ahead and select which of the columns
I want to rename.
So I want to rename city, state, and name.
OK?
So that’s what I’m going to do.
And then I’m going say check, OK?
So notice it’s going to give me an array right here, or a list,
of the column names, OK?
And if I paste that into here and I just say city name one,
city name–
or state two, and name three, that’s
what it’s going to rename the columns to in that order,
right?
So the first index matches with the first index,
the second index matches the second index, and so forth.
But what I’m going to do is I’m actually
going to rename it to be something more clear.
OK?
So notice I’m going to bring this into a text editor
and I’m going to rename the beginning of each of the column
names.
So the first one I’m going to say is origin, OK?
So I will know that there is an origin city, origin state,
and origin name.
OK?
That’s what I’m going to rename this one to.
So if I click back into here and paste the new column names back
in, and if I hit the RUN button once again,
it’s going to go ahead and go in and change all the column
names for me.
And noticed that I’ve ignored, OK, I’ve ignored airport ID.
OK?
So we’ll show you in the next episode
why we’re going to ignore airport ID.
So notice that if I look at this data set now, OK,
origin city has been renamed.
City has been named to origin city, same with state,
and same with name.
OK?
And it might be more clear if instead of saying origin name,
I would say origin airport.
I think that’s much more clear.
OK?
So I want to run this again.
And notice that our other data set, the flight delay data set,
has been removed from our workspace.
That’s because it didn’t have any execution
modules associated with that.
OK.
So if we visualize the output of the edit metadata now,
we can see that now it’s origin airport, origin state, origin
city, OK?
But notice that didn’t change the original data set.
If I go back to the original data set output module
and right click and visualize, the data set is still the same.
Notice that the transformation only
happens once it goes through this edit metadata editor.
And notice the output of the edit metadata
is the changed names.
OK?
And also, you can document what this thing is doing.
So if you double click on a module,
notice it opens up a text box for me.
Now I can type in a quick statement
about what this is going to do.
So I can say, RENAME AIRPORTS AS ORIGIN.
OK?
I think that’s much more clear as to what this is doing.
And, OK, that’s only one side of the story, right?
I need a secondary data set where the columns actually
say destination, right?
So I’m going to connect this over here.
And I forget what it’s actually called.
So I’m going to drag in the flight data again
and actually I have the list of column names right over here.
OK.
So it’s called DEST.
OK?
So I will keep this naming convention
so I can be more consistent about it.
So inside of this, I’ll go back to my original notes.
This works in every note editor, note editor, like Notepad++,
Sublime, or even Word.
And also in Exel.
OK?
So you can do a FIND AND REPLACE.
So I can find anything that starts with ORIGIN_ and rename
it to DEST_.
OK?
So I can hit and do a REPLACE ALL.
This is a universal function in most text editors.
I’m going to copy that now.
And notice that because all I did
was copy and paste this module, it automatically
brought in original column names that I had before.
So all I have to do now is say destination city, destination
state, and destination name, and if I run this,
I now have a workflow where the data starts off as one
and then it branches into two different things.
So actually, these are two separate data sets
right now, which is really, really cool, right?
So this is one data set over here,
where everything is called origin,
and there’s another data set over here, where everything’s
called DEST, OK?
So now we are very clear that when
we do these joins in next episode
that this is a destination airport, not an origin airport.
OK, let’s also do a little bit of housekeeping.
Down here– actually, over here, I renamed it origin airport.
Yet over here, it’s called destination name.
So I’m going to renamed this real quick so they match up.
So it said it would say DESTINATION NAME,
I’ll say DESTINATION ORIGIN, OK?
And then the comment over here, instead of saying ORIGIN,
I’ll just say DEST, OK?
So now we know that this site is renaming differently
than this side, OK?
So I’m going to hit this RUN button again
to finalize the results.
And then we’re going to do some safekeeping now,
or some housekeeping now.
So before we couldn’t save the experiment
because we didn’t have an execution module.
So an execution module is any module that performs an action.
You can’t just save an experiment
with just a data object in it.
It has to have some kind of execution module.
So I get to name the experiment here.
So I get to rename this PREDICTING FLIGHT DELAYS and I
can hit the SAVE button now.
OK?
And if I hit this EXPERIMENT button again,
this will list all of my experiments.
And notice that it’s actually called PREDICTING FLIGHT
DELAYS now.
I can go one step further than that
and also add this to the project that I
created in the previous video.
So if I click on this PROJECT button on this side over here.
So if you can’t see this window it’s
because this right toolbar is minimized.
So what you want to do is you’re want
to not to select any modules.
So basically click in a blank spot and then click PROJECTS,
right?
And then you’ll want to add that to an existing project.
So in the last video, I made a project called
PREDICTING FLIGHT DELAYS, and I’m
going to check that, and say yep, yep.
So this experiment and any data that I build,
and any model that I train, will be
associated with this project.
OK?
And so later on, when you have lots
of different basically versions of this experiment,
they will all be collected together,
so you’ll know what it’s doing.
It’s not that big of a deal now, because we only
have one problem that we’re working on.
But the idea is as the number of projects you work on increases,
things get really confusing.
All right and that concludes it for today’s video.
Join us next time where I will show you how to join two data
sets together.
Hey, if you liked that video and you
want to see more videos like this in the future, go ahead
and like and subscribe.
And I will look forward to seeing you at our boot camp.

You can get a free trial of Azure here.

Here is the link to the Azure Portal.

Part 6:
Joining Data Sets

Part 4:
Data Exploration

Complete Series:
Introduction to Azure Machine Learning

More Data Science Learning Material:
[Video] Creating a Titanic Model in Azure ML
[Blog] Custom R Models in Azure Machine Learning
[Blog] A Comprehensive Tutorial on Classification using Decision Trees

(520)

Phuc H Duong
About The Author
- Phuc holds a Bachelors degree in Business with a focus on Information Systems and Accounting from the University of Washington.

Avatar

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>