Splitting Data & Categorical Casting | Azure ML Tutorial Part 9

In this tutorial we will make sure all the categorical features are treated as categories using the Edit Metadata module. We will also set up a holdout dataset by randomly sampling our dataset into two partitions, a training set and a test set.

Before we can feed this dataset into a machine learning model in Azure ML, there are two things we have to take care of. First, we have to make sure all the categorical features are treated as categories. We’ll use the Edit Metadata module once again to cast these features. Then we need to set up a holdout dataset for future evaluation of any model that we build. We will randomly sample our dataset into two partitions, a training set and a test set. The test set we will lock away and pretend that it’s future world data. The assumption is that if the model we built can predict well on this test set, which it has never been exposed to before, it will do moderately just as well on the future world data.

Welcome back to Data Mining with Azure Machine Learning Studio,
brought to you by Data Science Dojo.
So last time what we did was we cleaned all of our data.
We made it nice and pristine for a machine learning model,
so now it won’t scream at us about any unknown values.
And today what we’re going to do,
is we’re going to make sure, before we feed this
into the machine learning model, that all of our features
are cast in their proper data types.
So the machine learning model will behave differently
based upon how the categories are typed.
And then the next thing we’ve got to do
is, we’ve got to split our data into two partitions, a test
set, and a training set.
So the first thing we’re getting to is categorical data.
So why does categorical data need to be treated differently?
So numerics you can leave as numerics.
Just make sure they’re listed as numerics.
But let’s look at this data right here,
where it’s flight ID and state.
So clearly state in this case is going to be a category.
Arizona, Washington, Arizona, Texas.
But the whole backbone of machine learning
is based upon math, algorithms, and things like that.
And you can’t do math on this category.
You can’t divide for example, Arizona by Washington.
You can’t add Washington to Arizona
and get something else out.
So the next part of that backbone is that you can’t do distance calculations.
Distance calculation is the core principle
by which a machine learning algorithm determines
that something is similar or
not similar to something else.
So what normally has to be done for a machine learning model,
or for any computer, to understand data
in the form of categories, is you
have to create a separate column for each category.
This is also called one hot encoding.
This is also called binarization.
So this is what we’re going to–
an example here.
So notice that every category gets its own column,
and then we have a one where it’s present in that row.
So notice that there is a column called is Arizona,
and because flight ID one was Arizona, there’s a one here.
So notice that it’s going to spawn four columns.
So the number of categories you have
is how many columns you’re going to end up with,
and these columns are going to be mutually
exclusive of one another.
So notice that if you’re Washington, you can’t be Texas.
And if you’re Texas, you can’t be California, for example.
So this will be the same thing for male and female,
and this would be same thing for time zones.
This will be the same thing for zip codes.
Anything that’s a category.
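To make the one-hot idea above concrete, here is a minimal sketch in plain Python. It is not what Azure ML runs internally; the function name `one_hot_encode` and the `is_<category>` column naming are assumptions for illustration, and the input is just the four state values read out above.

```python
def one_hot_encode(values):
    """Return (column_names, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    columns = [f"is_{category}" for category in categories]
    # Each row gets a 1 in the column matching its category and 0 elsewhere.
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


# The state values mentioned above, in flight ID order.
states = ["Arizona", "Washington", "Arizona", "Texas"]
columns, rows = one_hot_encode(states)

for flight_id, row in enumerate(rows, start=1):
    print(flight_id, dict(zip(columns, row)))
```

Every row sums to exactly one, which is the mutual exclusivity described above. These four values contain only three distinct states, so the sketch spawns three columns; a table with more states would spawn more.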
So this is really prevalent in other data
mining platforms such as Excel and things like that.
But in Azure ML, Azure ML actually
has a data type called categorical,
which actually will do this binarization transformation
for you, without you having to think about it.
So what we have to do is, we have to go into Azure ML
and cast all of our categories into categorical data types,
so that our computer treats them properly.
So let’s go into our Azure ML workspace,
and we’ll continue where we left off last time.
So if you look under clean missing data
here, that’s the last thing that we did,
you should have also had the summarized data from last time
as well.
If not, go ahead and drag it in.
So what I’m going to do is I’m going to right click on this,
and I’m going to visualize the summarized data.
And summarized data has what’s called the unique value count.
So it basically tells you how many categories are in
this column.
And how you can tell that something should be a category
is to look at the ratio of the unique value
count to the total count.
So you’ll notice that there are only seven possible values
for day of the week, out of almost 500,000 rows.
That tells you that hey, this is probably a category,
because there are so few unique values
relative to the count.
And if we look at all of this, probably everything
should be a category.
For most of them, you can tell because the ratio is low,
but there are other ones where it’s
kind of higher, like origin_city.
But it just comes from domain knowledge
that we established earlier.
We know that with a city, you’re either in that city
or not in that city.
So these cities should be distinct buckets
that a flight can fall into.
Right. So the flights can be, basically,
put into different buckets here.
So there are 268 different cities that you can land in.
Next thing is departure delay and arrival delay.
Notice that there’s only two here,
so this is a binary feature.
So we should definitely convert them into categories.
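The unique-value-count heuristic above can be sketched in a few lines of Python. The data here is synthetic: a made-up day-of-week column with 500,000 rows, standing in for the real dataset. The helper name `unique_value_ratio` is an assumption, not an Azure ML function.

```python
def unique_value_ratio(values):
    """Number of distinct values divided by the number of rows."""
    return len(set(values)) / len(values)


# Synthetic stand-in for a day-of-week column: values 1..7 over 500,000 rows.
day_of_week = [(i % 7) + 1 for i in range(500_000)]

unique_count = len(set(day_of_week))
ratio = unique_value_ratio(day_of_week)

print("unique values:", unique_count)
print("ratio of unique values to rows:", ratio)
```

A ratio this close to zero is a strong hint that the column is a category. As noted above, the ratio is only a hint; domain knowledge still decides borderline columns.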
And specifically, it is very important
that we cast our response class into the correct data type.
So notice our response class is arrival delay, whether or not
you’ll be late by 15 minutes.
So if we left this the way it is right now,
it is a numeric right now.
And how you can tell that is, if you visualize the data right
now, and then click on the column
itself, so if you mouse over and click on
arrival delay right now, so you’ll see
it is a numeric feature.
So in regard to supervised learning,
there’s two types of supervised learning.
There is regression, which is, you’re
trying to predict a number.
So in this case, if you ran this through a machine learning
model right now, it would try to do regression,
and you’ll get weird numbers out,
like the flight will be two.
Arrival delay will be two.
The arrival delay might be negative one,
because it’s trying to do an extrapolation upon a line.
And that wouldn’t make sense, because it can only be zero or one.
So it is very important, that for a classification problem,
that the response class is converted
into a categorical data type in Azure ML.
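Here is a small sketch of why leaving a zero/one response as a numeric is a problem. It fits an ordinary least squares line to a handful of made-up rows (the delay values and labels below are hypothetical, not from the flight dataset) and then predicts outside the range it was trained on. This only illustrates the extrapolation issue; it is not the algorithm Azure ML would pick.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy rows: departure delay in minutes and a 0/1 late label.
dep_delay_minutes = [0, 5, 10, 20, 30, 60]
arrival_delay = [0, 0, 0, 1, 1, 1]

slope, intercept = fit_line(dep_delay_minutes, arrival_delay)


def predict(minutes):
    """Treats the 0/1 label as a number, which is exactly the mistake."""
    return slope * minutes + intercept


print("prediction at 200 minutes:", round(predict(200), 2))
print("prediction at -30 minutes:", round(predict(-30), 2))
```

Both predictions fall outside the zero-to-one range, which is the kind of weird number described above. Casting the response to a category tells the tool to treat this as classification instead.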
And the next thing is, basically this entire data type,
or this entire data set, if you look at it,
every feature should be a category.
The only feature that should not be a category
is departure delay in minutes.
So that actually is on a numeric spectrum.
So let me teach you how to do that real quick.
So you can change things into the proper data types
here by typing in the metadata editor.
So we used this earlier to actually rename our columns
right above here, if you remember this
from one of the earlier videos.
But this can be used to edit the data about the data.
So metadata is data about data.
So we’re going to edit the data around the data.
So what data types and things like that.
So if you connect that to your current workflow,
so connect the output of the clean missing data
to the input of the Edit Metadata module.
Now we can launch the column selector
and select which columns we want to be transformed.
And remember, our transform in this case
is, we’re going to convert everything to categorical.
So since we only have one thing that
isn’t a category, which is this guy right here,
departure delay, what I’m actually going to do,
is I’m going to do a Control A, which selects everything.
You can also do a Shift–
hold down Shift after clicking the first one,
and then clicking the last one.
So shift will go ahead and select the rest of them
as well.
So you can do a Control A, or you can do a Shift selection.
And then you want to say, I want all the columns to be–
or all the features to be part of the transformation.
And then you can say go ahead, OK, I
want everything except departure delay.
So notice that departure delay is not
going to be affected in this transformation,
but every other column will be.
So I’m gonna hit Check and say yes, these
are the columns I want to be transformed.
And then the transformation itself I will select here
and say, make categorical.
And this will go ahead and cast all the columns
into a categorical data type.
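The "select everything, then exclude departure delay" step can be sketched as a schema transformation in plain Python. The column names and starting types below are hypothetical placeholders, and `cast_to_categorical` is an illustrative helper, not the Edit Metadata module's API.

```python
def cast_to_categorical(schema, exclude):
    """Mark every column as categorical except the ones listed in exclude."""
    return {
        name: (dtype if name in exclude else "categorical")
        for name, dtype in schema.items()
    }


# Hypothetical schema before casting; the real dataset has more columns.
schema = {
    "day_of_week": "numeric",
    "origin_city": "string",
    "departure_delay_minutes": "numeric",
    "arrival_delay": "numeric",
}

# Everything becomes a category except the one truly numeric feature.
casted = cast_to_categorical(schema, exclude={"departure_delay_minutes"})

for name, dtype in casted.items():
    print(f"{name}: {dtype}")
```

The excluded column keeps its original type, and the response column (arrival delay in this sketch) ends up categorical along with the predictors.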
So remember earlier when I showed you that table.
When it comes time to build a machine learning model,
it’s actually going to extrapolate and expand out
the table as we see.
But to you as the user, you’ll still see it as one column.
That’s really useful, because let’s say
you had a column with, for example, city,
you’d have a column for every city.
That’s very inconvenient, because now your data set
is spanned by a whole bunch of columns that’s basically
representing one feature.
So this is a really nice data type
to work with, because the entirety of the feature
is represented in one column.
So for example, if you look at origin,
it is now called a categorical feature.
And when it comes time for machine learning,
it’s going to do that transformation for us,
but to us, while we’re working with it,
as humans, we only see one column, which is really, really
nice for understanding.
So now that everything is properly cast,
the data set is actually ready for a machine learning model.
Before we move on, let us zoom out and see
where we are in the data mining framework,
to understand why we’re doing some of the things
that we’re doing.
So first thing is, in the past couple of videos,
we’ve explored and we’ve understood our data,
to try to develop a better understanding of data,
so we can process and clean our data better and better.
And we’re at a situation where our data is model ready.
So it’s ready to be fed into a machine learning model.
So this is where we are right now.
And this is where we’re going to be.
This is where we’re going to go.
So the next thing we’re going to do
is, we’re going to select an algorithm by which
we’re going to use.
And the next thing is we’re going
to go ahead and build a model.
And the most important thing that we’re
going to do, actually, is we’re going
to evaluate whether or not the model that we built
is any good or not.
But that’s a little bit trickier than you would think,
because that is, if you built a model,
how do you tell if the model is good or not?
Well, ideally, what you would do is,
if the model can predict future values correctly, well
then it’s a good model.
But the problem is, that’s its job, right?
Its job is to predict the future.
So if you’re going to evaluate on the future data,
then at that point the model has failed its job,
because it’s past its useful shelf life.
So if the model is predicting after the future happens,
I think that’s a bit useless.
So what we have to do in the lab is,
we have to synthetically create future world data.
And we’ll teach you some methodologies
by which to do that.
So the first methodology, and this is one of many, by the way,
the first methodology I’m going to teach you
is the train/test split.
So the idea is, we start with 100% of our data.
So this is where we have 499,000 rows or something like that.
The next thing we need to do is, we
need to build two partitions, a training set and a test set.
In this case, we’re going to use a 70/30 ratio:
70% of the data will randomly go into the training set,
and 30% of the data will randomly go into the test set.
And for those of you who know sampling,
this is sampling without replacement.
So we’re going to put each row into one of two bags.
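A minimal sketch of a randomized 70/30 split, assuming the rows fit in a Python list. The function name `train_test_split`, the seed value, and the 1,000 placeholder rows are all assumptions for illustration; Azure ML's Split Data module does this for you.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows, then cut it into two disjoint partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder rows: just row IDs standing in for flights.
rows = list(range(1000))
train, test = train_test_split(rows)

print("training rows:", len(train))
print("test rows:", len(test))
```

Because the shuffled list is cut once, every row lands in exactly one bag, which is what sampling without replacement means here.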
So the idea with the test set is,
we’re going to take this data set and hide it away.
We’re going to pretend that it’s future world data.
And this is really important, because it
has the labels of the actuals, the ground truth.
The actual labels.
So the idea is, if we build our model,
so we’re going to take our model,
and we’re going to build it using the 70% training set.
And at the end of the day, it’s not going to see that 30%.
So to the model, that test set, is
new world data to that model.
The model has never been exposed to this data set.
And the assumption is, if this model that
was built is a generalizable model that
found the ground truth in the underlying data, the idea
is if it can do well on data that it’s never seen before,
if it can predict on this test set it’s never
seen before, then it should do moderately just
as well on future world data.
So that’s what we’re going to use.
So basically 70% of this data set
is going to be part of training set.
In my mind, I think it’s going to be–
I like to think of the 70% training set–
it’s going to be sacrificed to produce this model.
And it’s going to learn from the past, what
resulted in the current labels being the way they are.
And then the idea is, once the model has been built,
we would run it through and have it predict on the test set.
And because it predicts on the test set,
now we have another column called predictions.
So we have a prediction.
And in our case, it’s going to predict whether or not
the flight will be late or not.
It just so happens in the test set,
in the past we know if the flight was late or not.
So we have, basically, we can build
a comparison between predicted versus actual.
We can go in one at a time, line item, and say, are you right?
Is this row right?
Was this flight correctly predicted upon?
Yes or no?
We can go ahead and do that.
And if we aggregate all of the rights,
and we aggregate all the wrongs, eventually we
can get some pretty good measures of performance
out of this model.
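The predicted-versus-actual comparison above can be sketched like this. The ten labels are invented for illustration (1 means late, 0 means not late); no model has been built yet at this point in the series.

```python
# Hypothetical ground-truth labels and model predictions for ten test rows.
actual = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
predicted = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]

# Go row by row and ask: was this flight correctly predicted?
correct = sum(a == p for a, p in zip(actual, predicted))
wrong = len(actual) - correct
accuracy = correct / len(actual)

# Break the rights and wrongs down by type.
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
false_positives = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
false_negatives = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
true_negatives = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))

print("correct:", correct, "wrong:", wrong, "accuracy:", accuracy)
print("TP:", true_positives, "FP:", false_positives,
      "FN:", false_negatives, "TN:", true_negatives)
```

Aggregating the rights and wrongs this way is what later measures such as accuracy are built from.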
So this is a high level road map of where we’re going to go.
So what we’re going to do today is,
we’re not going to build any models today.
We’re going to actually set up the training set and the test
set today in Azure ML.
So if you will go back into Azure ML with me,
and go where we left off.
So in the Edit Metadata, I’m going
to go ahead and add some documentation to this Edit
Metadata here before we move on.
So I’m going to say, this is casting to categorical data.
And then the next thing is, I’m going to build this 70/30 split.
So in this case, if you type in the word split,
there is a split data module.
So go ahead and drag this into the Azure ML workspace,
and connect the output of the Edit Metadata.
So the clean data that’s model ready,
it’s going to flow into the split data.
We’re going to split it by rows.
And we’re going to say, so notice this percentage here?
It says, fraction of rows in the first output data set,
the first output data set being this guy.
So the remaining part of the data will go out here.
So if you put, for example, 0.7 here, 70% of the data
will go out here.
30% of the data will go out here.
And yes, you want the split to be randomized.
So randomization is very important in machine learning.
It will help improve the model itself.
And then there is an idea of stratified splits.
Before I go into what stratified split is,
we have to look at something real quick.
So what stratified split does, is
it keeps the ratios the same on both the test
set and the training set.
So if you look at arrival delay,
in this case, there’s 86% not late, and there’s 14% late.
If you want to keep the ratios the same, basically 86/14,
the same on both sides, you would stratify it.
Now for the most part, you only want
to stratify, and care to stratify, your response labels.
You do not usually care about stratifying
the rest of your predictors, unless there is something
that you really care about that is a rare class.
So for example, if 99% of one of your predictor features
is really common, and the other one is not common,
like let’s say, less than 1%.
So through sheer randomization, you
can actually end up with a split that doesn’t have one
of the categories, for example.
So if you want to prevent that, you would stratify that.
But for the most part, we only care
about stratifying what’s
called the response class here.
So I want to keep this ratio the same, 86/14.
So I’m going to go ahead and in the split module,
I want to say Stratify True.
Now if you have your categories in your response class
being basically really close to each other,
let’s say 50-50, or 60-40, or something like that,
I would go ahead and just not stratify.
But in this case, it’s getting to the point where,
through just sheer randomization alone,
I can severely undersample the thing
that I actually care about, which is whether or not
the flight is late or not.
Remember, the one label, being late, is only 14% of the data.
So I’m going to launch this column selector and say,
I want you to split, but I want you
to also stratify arrival delay.
And I’m going to go ahead and hit this Run button right here.
So what this is going to do is going
to split 70% of data over here, and 30% of my data over here.
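Here is one way a stratified 70/30 split could be sketched in plain Python: split each class separately, then recombine. The function `stratified_split` is an illustrative helper, not the Split Data module's implementation, and the 1,000 labels below are synthetic with the same 86/14 mix described above.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.7, seed=42):
    """Split each label group separately so both partitions keep the class ratio."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])

    # Shuffle again so rows are not ordered by class inside each partition.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Synthetic labels: 14% late (1) and 86% not late (0).
labels = [1] * 140 + [0] * 860
train, test = stratified_split(labels, label_of=lambda row: row)

train_late_share = sum(train) / len(train)
test_late_share = sum(test) / len(test)
print("late share in training set:", train_late_share)
print("late share in test set:", test_late_share)
```

With a purely random split the two shares would only be approximately equal; stratifying on the response pins them to the original ratio.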
So 70/30 tends to be the industry standard,
but there is no right percentage anyway.
So the idea is data beats algorithm.
So your training set should always
have the most data.
So the idea is, the model will learn
better if it has more data.
So there’s that.
But then there’s also the other side of it,
which is the test set, which is well,
why can’t you just give everything to the training set?
Well then you’d have nothing left to evaluate with.
So we’d have to keep something.
But the thing is, later, we’re going
to do what’s called aggregate measures of evaluation.
Things like accuracy, precision, and recall.
We have to have enough representation,
enough observations, to basically trust those numbers.
So for example, if you had 500,000 rows in your training
set, but only 10 rows in your test set, now
are you going to trust the accuracy measure of 10 values?
Probably not, because each value that’s right or wrong
is an extra plus or minus 10% from that measure.
So that measure is going to be very unstable.
So I tend to want it to be enough so that I
trust those numbers coming out.
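The plus-or-minus 10% point above is just arithmetic, sketched here. The helper name `accuracy_step` is an assumption, and 150,000 is a rough stand-in for 30% of about 500,000 rows.

```python
def accuracy_step(test_rows):
    """How much accuracy moves when a single test row flips between right and wrong."""
    return 1 / test_rows


for test_rows in (10, 150_000):
    step = accuracy_step(test_rows)
    print(f"{test_rows} test rows: one row shifts accuracy by {step:.4%}")
```

With only 10 test rows each row is worth a tenth of the accuracy measure, while with a test set in the hundreds of thousands a single row barely moves it, which is why the larger test set gives numbers you can trust.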
OK, and just to double check, let’s see if this
did what we wanted it to do.
So if you click on the Edit Metadata from before,
so this is the data before the split.
So notice that we start off with 499,000 rows.
So the idea is, after the split, we
should have 70% of the data in the first output node,
so we have about 349,000 rows.
So let’s go ahead and take out our calculator, and take this
and divide it.
So 349,776 divided by, and then I
think I can just paste the original value here,
and that’s 70%.
So that’s correct.
So that’s the first thing we need to validate.
The next thing we should validate
is whether or not it did the stratification correctly.
So if I click on arrival delay, I should have the same ratio.
14 and 86.
So it kept the number of rows, basically,
or it kept the number of response class labels
in the same ratio as it was before.
So let’s also go ahead and look at our second result data
set, which is 30% of our data.
So this should be the remaining rows of data.
The next thing to check is,
did it stratify this correctly as well?
14 and 86 as well.
So we’ll go ahead and confirm that.
Yes it did what we wanted it to do.
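The two sanity checks above (row fraction and class ratio) can be written as a small sketch. The training row count comes from the walkthrough; the total is rounded because the walkthrough only says about 499,000 rows, and the late/not-late counts are hypothetical. The helper names and the 0.01 tolerance are assumptions.

```python
def fraction_of(part, total):
    """Share of the total that a partition represents."""
    return part / total


def close_to(value, expected, tolerance=0.01):
    """True when value is within the tolerance of the expected share."""
    return abs(value - expected) <= tolerance


# Check 1: did roughly 70% of the rows land in the first output?
train_rows = 349_776  # row count read out in the walkthrough
total_rows = 499_000  # ASSUMPTION: rounded, the exact total is not stated
train_fraction = fraction_of(train_rows, total_rows)
print("training fraction close to 70%:", close_to(train_fraction, 0.70))

# Check 2: did both partitions keep the 14% late share?
# Hypothetical counts, not taken from the real dataset.
train_late_share = fraction_of(98, 700)
test_late_share = fraction_of(42, 300)
print("training late share close to 14%:", close_to(train_late_share, 0.14))
print("test late share close to 14%:", close_to(test_late_share, 0.14))
```

Using a tolerance rather than exact equality matters for the row fraction, since a split of a real row count rarely lands on exactly 70%.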
And we’ve just about run out of time,
and that concludes how to cast your data in Azure ML
and how to set up a train and test split inside of Azure ML.
Now if you liked what you just saw, remember
to hit that like button.
It will help support us in creating future content
for free.
And remember to subscribe for future content,
and to share this video to spread the glorious word
of data science.
And before we build this model, and before I go,
I have a question for you.
What kind of surprising things do you
think we’ll find out about the aviation industry, or flights
in general, once we build this model?
Go ahead and leave your hypothesis in the comments.
My name is Phuc Duong, and I’ll see you next time.
Happy modeling.

Part 10:
Building a Machine Learning Model

Part 8:
Summary Statistics & Cleaning Missing Data

Complete Series:
Introduction to Azure Machine Learning



Phuc H Duong
About The Author
- Phuc holds a Bachelor’s degree in Business with a focus on Information Systems and Accounting from the University of Washington.

