Dropping & Selecting Columns | Azure ML Tutorial Part 7

The machine learning model will learn from the data it has access to. Sometimes it becomes necessary to shed columns from our dataset so our machine learning model does not learn from them. In this video we’ll drop a few columns that do not add value in their current form to our Azure ML model.

Hey, welcome back to Data Mining with Azure Machine Learning
Studio, brought to you by Data Science Dojo.
Today, we’re going to go and learn
how to drop or remove columns or features from our dataset.
When you have features that may lead the model astray
or features that the model doesn’t know how to work with,
like text or image data, dropping them
becomes necessary to guide the model’s learning.
Now dropping columns will also improve
the efficiency of the operations, especially
within Azure ML since every time you run an execution module,
it goes ahead and caches that next data set in its own module
on a separate computer.
So it makes our payloads lighter and it speeds up our workflow
by shedding those columns that we don’t need.
But you want to make sure that those columns that we’re
dropping do indeed add no value in their current form,
because we want the model to learn from as much data
as it can have access to.
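A minimal sketch of the payload point, written in pandas rather than Azure ML Studio (the video uses only the visual modules). The toy table and the column names `Carrier` and `DepDelay` are made up for illustration; only the idea of comparing size before and after a drop comes from the discussion above.

```python
import pandas as pd

# Hypothetical toy table standing in for the flight data.
flights = pd.DataFrame({
    "Year": [2013] * 4,
    "Quarter": [1, 1, 2, 2],
    "Carrier": ["AA", "DL", "AA", "UA"],   # assumed column name
    "DepDelay": [5.0, -2.0, 30.0, 0.0],    # assumed column name
})

# Bytes held by the table before dropping anything.
before = flights.memory_usage(deep=True).sum()

# Drop two columns that add no value in their current form.
slim = flights.drop(columns=["Year", "Quarter"])

# Bytes held by the lighter table that gets passed downstream.
after = slim.memory_usage(deep=True).sum()

print(before, after)
```

The saving on four rows is tiny, but the same comparison on a wide table shows how much lighter each cached intermediate data set becomes.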
And let’s go ahead and get started.
So two videos ago, I did some data exploration
and then we identified columns that
would not add value to our machine learning
model in its current form.
Now if you want to hear my rationale on why we’re dropping
these particular columns in this iteration,
go ahead and watch that video.
It’s called data exploration.
And we did it in about two videos.
And this is the list that I ended up with.
So notice I have a list of columns
I want to drop like quarter, or month, day of the week,
et cetera, et cetera.
Not that these columns aren’t useful,
they’re just not useful right now in their current form.
OK, so I’m going to show you three different ways
to drop columns in Azure ML.
They all do the same thing,
but they approach it differently.
And there might be times where one
is more optimal than the others.
All right, so let’s go into that.
So you want to search in your toolbox
for a module called
Select Columns.
So I will drag in the Select Columns in Dataset module,
and I will connect it directly after my joins.
So after I’ve joined all of my data
sets together in the last video, I
will then take the output of that join.
So this table that now has six extra columns on it
will now be fed into the Select Columns module.
So this Select Columns module will
let me decide, if I launch this column selector–
so there’s three ways.
So the first way is this window pops up right here.
So this window will only pop up if there is a green checkbox
in the previous module.
If there is not a green checkbox in the previous module, what
you want to do is you want to hover over this–
you want to select the module that is the dependency.
You want to hover over Run and then hit the Run Selected.
So it’s going to go ahead and run everything up
until this module.
So it needs to have a green checkbox for you
to see this particular window where you have Available
Columns and Selected Columns.
So this is the first method, which
is you have this column selector, OK?
So what this column selector does is you select the column names
that you want to keep and throw them
onto the Selected Columns side.
So you can either do–
you can throw the columns you want to keep onto the right
side– for example, one at a time–
or what you can do is you can select all.
Say you want to start with every column
and then start by dropping a particular column.
So I’m going to start doing that real quick.
So in this case, Year, Quarter, Month,
and I think Day of the Month is something
that I want to leave behind.
So these are the columns that are being left behind
right now.
Because I have Airport Name in the last join for both
the origin and the destination, I no longer
need both the origin and the destination airport ID.
So I can drop those now.
So let’s go back to this too.
So it also says I should drop CRS departure time and CRS
arrival time.
So let’s go ahead and do that real quick.
So CRS arrival time–
I’m going to hold down the Shift button so I
can select multiple things.
CRS arrival time and CRS departure time–
so I’m holding down the control button.
That’s how I’m selecting multiple things
at the same time.
And then I can go ahead and tell it
that I want to leave these columns behind.
And then I think that’s it.
Now remember we have four response classes.
And we only want to keep one response class.
That’s the Arrival Delay 15 column, so we
want this to be an easy classification machine learning
problem.
So I’m going to go ahead and go and find Arrival Delay.
I’ll leave that behind.
I’m also going to leave behind the Cancelled and the Diverted
columns.
I want to leave those behind too.
So these are the columns I’m going
to be left with going forward.
I’m going to leave behind 11 columns.
These are columns that I’m dropping,
and I’m going to bring forward 13 columns.
I’m going to go ahead and hit this Run button now.
So notice that once I’ve hit the Run button,
there is a list of columns that I’m going to keep inside
of the Launch Column selector.
So the output of this Select Columns in Dataset module
will be another data set.
It will be cached within this module, and it should have–
if we did it right– it should have far fewer columns
than it had before.
So it should have 13 columns, if we look at it this way.
Here we are.
So that’s one method, which is we had a window where
we selected which columns we wanted to keep
and which columns we wanted to leave behind.
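The by-name method can be sketched in pandas as follows. This is an analogy, not what the Select Columns in Dataset module runs internally, and the names `Carrier`, `DepDelay`, and `ArrDel15` are assumptions made for the toy table.

```python
import pandas as pd

# Hypothetical stand-in for the joined flight table; only a few columns shown.
flights = pd.DataFrame({
    "Year": [2013, 2013],
    "Quarter": [1, 1],
    "Carrier": ["AA", "DL"],     # assumed name, not confirmed by the video
    "DepDelay": [5.0, -2.0],     # assumed name
    "ArrDelay": [12.0, -4.0],
    "ArrDel15": [0, 0],          # assumed name for the Arrival Delay 15 label
})

# Option 1: name the columns to keep, one at a time.
keep = ["Carrier", "DepDelay", "ArrDel15"]
kept_by_name = flights[keep]

# Option 2: select all, then move some columns back to the "left behind" side.
left_behind = ["Year", "Quarter", "ArrDelay"]
kept_after_select_all = flights[[c for c in flights.columns if c not in left_behind]]

print(list(kept_by_name.columns))
print(kept_by_name.equals(kept_after_select_all))
```

Both options end with the same table; which one is less work depends on whether the keep list or the leave-behind list is shorter.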
Let me show you another method.
So I’m going to copy this Select Columns in Dataset module.
And then I’m going to drag it over here.
I’m also going to show you that it’s just a parallel workflow.
It does the same thing.
So it’s up to you which one you want to keep,
I’m just showing you an alternate method.
So let’s say you had thousands of columns.
That could be a problem sometimes.
You don’t want to specify individual columns that you
want to keep one at a time.
Maybe there’s a thousand columns,
and you only want to drop four of them.
So here’s a way to do that.
So if you launch the Column Selector,
you can filter columns by name–
that’s what we did last time– or we can do it by rule.
So on the left side, there is by rules or by names.
So with rules, there’s two modes.
I can begin with no columns selected
and then I can add individual columns to this list.
For example, notice that I can X these out or I can add them in.
The secondary method I can do is–
I want to say Begin with All Columns, Begin
with All Columns.
And then, instead of saying include, I would say exclude.
All right, so I want to begin with all of the columns– so 13
plus 11.
And these are the particular columns
that I’m about to list to be excluded.
So I’m going to exclude column names: Year, Quarter, Month,
Day of the Month.
I think we’re going to drop also–
well, you get the idea.
So that’s the secondary method of doing it.
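The two rule modes can be sketched in pandas like this. `select_by_rule` is a hypothetical helper written for illustration, not an Azure ML function, and `Carrier` is an assumed column name.

```python
import pandas as pd

def select_by_rule(df, begin_with, names):
    """Hypothetical helper mirroring the two rule modes described above."""
    if begin_with == "all":
        # Begin with all columns, then exclude the named ones.
        return df.drop(columns=names)
    if begin_with == "none":
        # Begin with no columns, then include the named ones.
        return df[list(names)]
    raise ValueError("begin_with must be 'all' or 'none'")

# Toy table; imagine it had a thousand columns instead of five.
wide = pd.DataFrame({
    "Year": [2013], "Quarter": [1], "Month": [4],
    "DayofMonth": [19], "Carrier": ["AA"],
})

excluded = select_by_rule(wide, "all", ["Year", "Quarter", "Month", "DayofMonth"])
included = select_by_rule(wide, "none", ["Carrier"])

print(list(excluded.columns))
print(excluded.equals(included))
```

With a thousand columns and four to drop, the exclude mode needs a list of four names, while the include mode would need the other 996.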
So I’m going to delete this for less confusion.
And then I’m going to document what this is doing.
So in the Select Columns, I’m going
to say this is dropping columns.
And then I’m going to expand it.
And that’s how you drop columns in Azure ML.
So join us next time where I’ll show you
how to clean missing values from our data set
and also how to get summary statistics out of our dataset.
Hey, if you liked that video and you
want to see more videos like this in the future, go ahead
and like and subscribe.
And I will look forward to seeing you at our boot camp.

In this video we will drop:
Year, Quarter, Month, DayofMonth, OriginAirportID, DestAirportID, CRSDepTime, CRSArrTime, ArrDelay, Cancelled, and Diverted.
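For readers who prefer code, the same drop could be sketched in pandas. The function below is named `azureml_main` on the assumption that it would run inside Studio's Execute Python Script module, which passes in and returns pandas DataFrames. The video itself uses only Select Columns in Dataset, so treat this as an illustrative alternative rather than the method shown. `ArrDel15` in the sample row is an assumed name for the Arrival Delay 15 label.

```python
import pandas as pd

# Columns listed above as dropped in this video.
COLUMNS_TO_DROP = [
    "Year", "Quarter", "Month", "DayofMonth",
    "OriginAirportID", "DestAirportID",
    "CRSDepTime", "CRSArrTime",
    "ArrDelay", "Cancelled", "Diverted",
]

def azureml_main(dataframe1=None, dataframe2=None):
    # errors="ignore" skips any listed column that is not present in the input.
    trimmed = dataframe1.drop(columns=COLUMNS_TO_DROP, errors="ignore")
    # Return a one-element tuple of DataFrames.
    return trimmed,

# Local usage with a one-row sample (sample values are made up).
sample = pd.DataFrame([{**{name: 0 for name in COLUMNS_TO_DROP}, "ArrDel15": 1}])
(result,) = azureml_main(sample)
print(list(result.columns))
```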

If you want to know the rationale as to why these columns are being dropped, watch this video.

Part 8:
Summary Statistics & Cleaning Missing Data

Part 6:
Data Exploration

Complete Series:
Introduction to Azure Machine Learning

More Data Science Learning Material:
[Video] Building Robust Machine Learning Models
[Blog] MLaaS: Deploy & Host Predictive Models Through Webservices using Azure ML

Phuc H Duong
About The Author
- Phuc holds a Bachelor’s degree in Business with a focus on Information Systems and Accounting from the University of Washington.
