Building a Machine Learning Model | Azure ML Tutorial Part 10

Machine Learning Model Building – Let’s build our first machine learning model in Azure ML. First, we have to go shopping for a machine learning model. We must identify which type of machine learning algorithm we want to use. We ended up using a decision tree algorithm because we have lots of categorical data. We’ll build a very simplistic model so that we can visualize the decision tree model and understand its results.

Hey, welcome back to data mining with Azure Machine Learning
Studio brought to you by Data Science Dojo.
So today I’m excited.
Today is the day where we get to build our model.
So after all that work that we’ve done,
we finally get to take our data set,
feed it into our machine learning model,
have the model iteratively learn by itself
from historical data what are the kind of things
that brought together the circumstances for the results,
right.
So basically whether or not a flight is going to be delayed
or not.
What are the factors that will contribute to a flight being
delayed or not?
So we get to answer those questions soon.
So let’s go back to the data mining framework
and remind ourselves where we are in the data mining
framework.
So in the last video we set up a train-test partition.
So we have model-ready data.
So that 70% of the data will be the model ready data.
Today what we’re going to do is we’re
going to select an algorithm to train on.
And then the next thing we’re going to do
is we’re going to build ourselves a model.
And then we’ll leave evaluation for next time.
So to reiterate where we are
in the methodology of the train test split,
so where we are right now is we’re right here.
So we’re going to build our model.
And then we’re going to see what’s going to go on today.
And for the most part, we’re going
to ignore that test set today.
It’s not until predictions that we care about the test set.
So let’s go ahead and go back into our Azure Machine Learning
workspace.
Last time what we went ahead and did was, we split the data.
So 70% went on the side.
30% went on this side.
The 30% we’re going to basically ignore for a while.
We’re going to pretend that it’s tomorrow’s sales
data, tomorrow’s flight data, for example.
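
A minimal sketch of that 70/30 idea in plain Python, for anyone following along outside the Studio. This is not how the Azure ML split module is implemented; the function name `split_rows`, the seed, and the stand-in row ids are all assumptions made for illustration.

```python
import random

def split_rows(rows, train_percent=70, seed=0):
    """Shuffle a copy of the rows, then cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100  # integer math avoids float rounding
    return shuffled[:cut], shuffled[cut:]

# Stand-in data: 1,000 row ids instead of real flight records.
flight_ids = list(range(1000))
train_rows, test_rows = split_rows(flight_ids)

# The model only ever sees train_rows; test_rows plays "tomorrow's flights".
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two lists, which is the property that lets the held-out 30% stand in for unseen data.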
OK, so the next thing we’ve got to do
is train a machine learning model. In Azure Machine
Learning, you have a module called Train Model.
Just go to your top left bar and type in the word training.
If you don’t see this bar, go ahead and minimize
or expand it out.
So you want to type in train model.
And then just drag in this Train Model module.
And then what you want to do is, you want
to hover over the input nodes.
The input nodes will tell you what it wants.
The one on the left side means it wants an untrained model.
An untrained model means an algorithm, right.
The model is the result of training.
And then the model is the applicable form
of the algorithm.
The algorithm is just a blank set of instructions
on how to build that model.
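
To make the algorithm-versus-model distinction concrete, here is a toy sketch. It is deliberately not a decision tree: the "algorithm" just learns the most common label, and the class names `MajorityClassAlgorithm` and `MajorityClassModel` are made up for this illustration.

```python
from collections import Counter

class MajorityClassAlgorithm:
    """Untrained: a set of instructions that has not seen any data yet."""

    def train(self, labels):
        # Learning step: find the label that occurred most often in the past.
        most_common_label = Counter(labels).most_common(1)[0][0]
        return MajorityClassModel(most_common_label)

class MajorityClassModel:
    """Trained: the result of running the algorithm on historical data."""

    def __init__(self, label):
        self.label = label

    def predict(self, features=None):
        # This toy model ignores the features and always answers the same way.
        return self.label

algorithm = MajorityClassAlgorithm()          # the blank instructions
model = algorithm.train([0, 0, 1, 0, 1])      # the applicable form
print(model.predict())
```

The algorithm object can be reused on any data set; the model object only exists after training and carries what was learned.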
The next thing is, it wants a data set.
So it wants to learn from the past here.
So notice that we want to give it
the training set, the 70% of data.
So notice if I mouse over this data set,
it will connect here, as well.
But also notice if I mouse over here,
it will accept it as well.
So notice it just wants a data set.
It doesn’t matter which one.
But you know and I know we need to give it the training set.
So go on and connect that.
And it’s still not happy.
It’s still not happy.
It wants an algorithm.
Now what we’re going to do is, we need to select an algorithm.
So if you go into AzureML and look at–
You should actually just close all
of the extra features for now.
And if you look at just the tab that says Machine Learning.
This is where all the algorithms inside of AzureML are kept.
And if you open this out, there is a thing
called Initialize Model.
Go ahead and expand that.
Now we get into these four families of machine learning
models.
So once you identify what your machine learning problem is,
you will find out what your machine learning
algorithm type you need.
So there’s four types of machine learning algorithms inside
of AzureML.
So the first thing you need to figure out
is this a supervised learning data set,
meaning do you have labels.
Labels being what is it that you want to know from the past?
In this case I want to know if a flight was
delayed in the past.
Do I have that in my data set?
Do I already have whether or not from the past
this flight was delayed or not, yes or no.
If it’s yes, then it’s supervised learning.
I have labels to the past.
I know the answers in the past.
I know the stock price in the past.
So that’s supervised learning.
The next thing you have to figure out
is what data type it is.
So just because it’s supervised learning,
there’s two types of supervised learning algorithms.
There is classification type algorithms.
And then there’s regression type algorithms.
If your feature– if the response
class is a label, if it is a category,
it is a classification task.
You are trying to predict is this pixel red, blue, or green.
In this case, we’re not going to predict how many minutes it
will be late by.
We’re going to predict whether or not it will be late
at all, past 15 minutes.
So that tells us it’s classification.
Now regression would have been if I
want to predict how many minutes it would be late.
So there was a column at the beginning
that we dropped called, I think, Arrival Delay and that was
in minutes.
So if we want to predict that later,
that would be a regression problem.
So now that we know what type of algorithm we need,
we go in and expand the classification task.
And then the next thing it wants you to know
is how many classes are there in the response class.
So notice that we have two classes, you’re late
or you’re not late, zero or one.
That is a two class type algorithm.
So basically we are stuck with these types of algorithms
right here.
Now if you have more than two classes, if you were late,
kind of late, super late.
If you had that kind of tiering in your data set,
then you would have a multiclass classification problem.
But in this case, we know we have a two class classification
problem.
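
The questions above (do I have labels, is the label a category or a number, how many classes) can be written down as a small helper. This is only a sketch of the reasoning, not anything built into Azure ML; the function name and the returned strings are assumptions.

```python
def pick_algorithm_family(labels, label_is_categorical):
    """Walk through the three questions in order and name the task type."""
    if not labels:
        # No known answers from the past, so this is not supervised learning.
        return "not supervised (no labels)"
    if not label_is_categorical:
        # Predicting a number, such as minutes of delay.
        return "regression"
    n_classes = len(set(labels))
    if n_classes <= 2:
        return "two-class classification"
    return "multiclass classification"

# Late or not late, coded as 0/1, like the flight data in this series.
print(pick_algorithm_family([0, 1, 1, 0], label_is_categorical=True))

# Delay in minutes would be a regression problem instead.
print(pick_algorithm_family([12.5, 40.0, 0.0], label_is_categorical=False))
```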
And now this is the cool part, we
get to basically go shopping for a machine learning model.
This is kind of nice.
This is also the curse of machine learning,
because you don’t need to know what these things are.
You can drag them in and they’ll work.
But that’s not how a good practitioner does things.
They should probably understand a little bit
before they start doing something with it.
So first thing we’re going to do,
so we’re not going to really get into the differences
between these algorithms.
If you want to know the differences,
I would join the Data Science Dojo,
the five day Data Engineering and Data Science Boot Camp.
We will teach you about most of these algorithms.
But for now, I know based upon my experience as a data
scientist, that this data contains a huge amount
of categorical data.
So if you visualize this data set, most of it is categories.
So if it is a situation where most of your data set
is categorical, we need to select
what’s called a nonparametric algorithm.
So if we have lots of categories,
decision trees are really, really good
at discerning categories apart from one another.
So if we had lots of numeric data,
that would have been a different issue.
But we have lots of categories.
Basically, there are three families
of decision tree algorithms inside of Azure Machine Learning
Studio.
And the simplest one is the decision forest.
So we’re going to go ahead and drag this in first.
And if you want to know what these algorithms are,
we might make a video about it in the future.
But definitely take our Boot Camp,
we will teach you everything you need
to know about these algorithms.
But for now, just go ahead and slide in the decision forest.
So the next thing you need to do is hook up the decision forest.
So select the decision forest.
And then this window on the side will pop up.
If it doesn’t pop up, go ahead and expand it.
And what you need to do is you got to connect this to here.
So notice that this could have taken in
basically any other two class decision algorithm.
I could have just put in another forest here.
I can connect a decision jungle here.
But that’s just showing you as an example.
So I’m going to connect this forest
and then inside of this forest, there
are what’s called tuning parameters.
We’ll go over these in a little bit.
But for now, notice that the Train Model module is–
there’s a red mark next to it.
It’s angry at you.
It wants something from you.
Every time you see this red mark, just click on it.
There should be some kind of launch button on the right side
that will tell you what to do.
So this time it says Value Required.
So it’s kind of cryptic.
But what it actually means is it wants to know
what are you trying to predict.
Is this a state predictor?
Because you could actually take your data
set and predict on any column: what type of carrier
it is, what the departure time is, what the departure
place is.
So you can go ahead and predict any of these columns.
So in this case we know that we want to predict,
so launch the column selector, we
know that we want to predict arrival delay 15, yes or no.
So we’re going to go ahead and say arrival delay will
be the response class.
So now our Train Model module knows what to do.
So now it’s going to cast the rest of them
as predictors or as features to be
used in regard to the response class, the response class
being arrival delay.
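
Picking the response class amounts to pulling one column out as the label and treating everything else as features. A rough sketch, assuming each flight is a dictionary; the column names `Carrier`, `DepTimeBlk`, `DepDel15`, and `ArrDel15` are stand-ins chosen for this example, not confirmed names from the data set.

```python
def split_features_and_label(row, label_column):
    """Return (features, label): the label column is removed from the features."""
    features = {name: value for name, value in row.items() if name != label_column}
    return features, row[label_column]

# One made-up flight record with hypothetical column names.
flight = {"Carrier": "AA", "DepTimeBlk": "1700-1759", "DepDel15": 0, "ArrDel15": 1}

features, label = split_features_and_label(flight, label_column="ArrDel15")
print(sorted(features))  # the predictors
print(label)             # what the model is asked to predict
```

Choosing a different `label_column` would turn the same data into a different prediction problem, which is why the module asks for it explicitly.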
So the next thing you want to do is
you want to look at your algorithm module.
So the algorithm module for me, in this case,
is a two class decision forest.
Once you click on it, you will notice
that there is a toolbar that pops up on the right hand side.
So this toolbar will go ahead and let
us tune how the algorithm builds the model.
These are knobs and levers.
So you will see that the number of decision trees
right now is eight.
So I want to build this tree.
And I want us to look at this tree and explore this tree,
so we can kind of get the mechanics
of how these trees work.
So we’re going to build actually a very, very bad model.
And bad because we’re going to build
it to be a very simplistic model,
so that humans can understand it.
So what we’re about to do here should not
be used in production.
I’m doing this for educational purposes.
The first thing I want to do is I
want to reduce the number of trees down to one.
I want only to zoom in on one tree right now.
By the way, never deploy a single tree
in production in the real world.
You will regret it.
Trees have a habit of overfitting.
That’s why you want to use lots of different trees.
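
As a rough illustration of why many trees help, here is a sketch where three tiny hand-written "trees" vote and the majority wins. The trees, the feature names, and the simple majority vote are all assumptions for this example; the decision forest in Azure ML may combine its trees differently.

```python
from collections import Counter

# Three hypothetical one-question trees; each returns 1 (late) or 0 (on time).
def tree_a(flight):
    return 1 if flight["departed_late"] else 0

def tree_b(flight):
    return 1 if flight["evening_departure"] else 0

def tree_c(flight):
    return 1 if flight["departed_late"] and flight["evening_departure"] else 0

def forest_predict(trees, flight):
    """Let every tree vote and return the most common answer."""
    votes = Counter(tree(flight) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [tree_a, tree_b, tree_c]
flight = {"departed_late": True, "evening_departure": False}
print(forest_predict(trees, flight))
```

Here one tree says late and two say on time, so a single tree's quirk gets outvoted.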
The next thing is maximum depth of decision trees, which
tells you how deep the tree can actually
grow, so in this case 32.
That’s going to be a huge tree.
I might not even be able to look at it,
even if I had a big screen monitor.
So if I want to look at the tree,
I will change this to like five or six.
I’m going to change it to five.
And then the next thing is number of random splits
is left at 128.
Leave that alone for now.
We’ll tune these parameters in a different video.
The next thing you look at is the minimum number
of samples per leaf node.
So basically this is the minimum number
of observations I must have after a split,
if I want to split on it.
So the idea is I don’t want to split and then
have all of a sudden one observation in a single node
all by itself.
That is basically the definition of overfitting.
So let’s turn this number up a little bit.
So I want to make this number 34.
So 34 is roughly about 0.01% of the training set which
is 349,000 rows right now.
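
The settings chosen so far, plus the arithmetic behind the leaf-size number, can be checked in a few lines. The dictionary keys and the `split_allowed` helper are illustrative names only, not Azure ML parameter names.

```python
# The four knobs as set in this walkthrough (key names are made up).
settings = {
    "number_of_trees": 1,
    "max_depth": 5,
    "random_splits_per_node": 128,
    "min_samples_per_leaf": 34,
}

training_rows = 349_000
share = settings["min_samples_per_leaf"] / training_rows
print(f"{share:.4%} of the training set")  # how big 34 rows is as a share

def split_allowed(left_count, right_count, min_samples_per_leaf):
    """A split is only kept if both resulting nodes are large enough."""
    return left_count >= min_samples_per_leaf and right_count >= min_samples_per_leaf

print(split_allowed(5000, 1, settings["min_samples_per_leaf"]))    # lone observation
print(split_allowed(5000, 200, settings["min_samples_per_leaf"]))  # both sides big enough
```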
So once you’ve set all that, go ahead and hit the Run button.
And this will go ahead and build us a decision tree
based upon 70% of the data.
So remember we’re ignoring the 30% for now.
And now what we’re going to do is
we’re going to build a single decision
tree, max depth of five.
And we have to have enough representation
in order for a tree to split on that decision.
And again, I want to state that this is a really
dumb and simplistic model.
Don’t actually use this in production.
Now this is so we can actually do
what we’re about to do now, which is right click
and visualize on the model.
So the output of a train model module
is actually the model itself.
So notice that this guy right here was–
You can think of it as an algorithm.
You can think of it as a blank set of blueprints
to build a model, to build a tree.
And the output of this is a model itself.
So in this case the tree has been built
based upon historical data.
So if I visualize this, I can then
get a graphic of basically the tree that I built.
So this tree notice that it’s got one, two, three, four,
five depth.
It’s got five depth, because remember,
I set that at five depth.
Now remember, earlier the default was 32.
Can you imagine how basically hairy
that gets as it goes down.
And the next thing I want to look at
is how do I interpret this tree.
So the decision tree, what you want to do
is think of it as, OK, I want to take
in a new data set, a new observation, a new flight.
Basically, I could print this out
and I could read it word for word what it’s going to do.
So if I look at this, the first thing it’s going to ask me,
the first question that this decision tree
is going to ask me, if this was a brand new flight route.
Let’s say I’m building a prediction for a brand
new flight.
Is this flight going to be delayed or not based
upon what I’m about to tell it?
So the first thing the model is going to ask me
is, was the departure time between 1700 and 1759?
In this case did your flight leave between the hours
of 5:00 PM and 6:00 PM?
And if you say yes, you’re one, which means
you are greater than zero.
So you go over here.
If you’re less than zero, you go over here.
Less than or equal to zero, you go on the left side.
So let’s just say, no, we did not
leave between 5:00 and 6:00 PM.
The next question it would then ask you
is wherever that node leads you.
So we’ve gone to the left side.
So next thing it’ll ask you, OK, was your flight already
delayed by 15 minutes before you even left the original airport?
And if you are one, you go over here.
If you’re zero, you go over here.
So let’s just say our flight was on time at the very beginning
of the origin airport.
The next thing it’s going to ask you,
hey, was that airport Phoenix, Phoenix Sky Harbor?
If you click on this, it will say Phoenix Sky Harbor.
So in this so far, we’re in a situation
where, let’s just say, no, we did not
come from Phoenix Sky Harbor.
It would ask you the next question.
We’re not going to Phoenix Sky Harbor.
Sorry, the destination is where we’re going to.
The next thing is origin city.
Are you going to San Fran, yes or no?
And now we give it a decision, right.
So notice that it’s zero or one down here.
So notice that if we are going to San Fran,
we will be on time.
If we’re not going to San Fran, we will be late.
And that is basically how you interpret it.
So for this, let’s assume that this is
a brand new data set coming in.
So if you did not leave between 5:00 and 6:00
and we went ahead and said if the plane was not
late on departure and it wasn’t from Sky Harbor and we
weren’t going to San Francisco.
We’re going to go ahead and be late.
And that’s how you interpret that tree.
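
The path just walked through can be written out as a tiny tree and followed in code. This sketch only encodes the one path described in the video; branches that were not explored are left as `None`, and the question names are invented labels for the four yes/no checks.

```python
LATE, ON_TIME = 1, 0

def node(question, no, yes):
    return {"question": question, "no": no, "yes": yes}

def leaf(label):
    return {"leaf": label}

# Only the left-hand path from the walkthrough is filled in.
tree = node("departed_1700_1759",
            no=node("delayed_at_departure",
                    no=node("phoenix_sky_harbor",
                            no=node("san_francisco",
                                    no=leaf(LATE),
                                    yes=leaf(ON_TIME)),
                            yes=None),
                    yes=None),
            yes=None)

def predict(tree, flight):
    """Answer each question in turn: greater than zero goes right, otherwise left."""
    current = tree
    while current is not None and "leaf" not in current:
        answer = flight[current["question"]]
        current = current["yes"] if answer > 0 else current["no"]
    return None if current is None else current["leaf"]

new_flight = {"departed_1700_1759": 0, "delayed_at_departure": 0,
              "phoenix_sky_harbor": 0, "san_francisco": 0}
print(predict(tree, new_flight))
```

Answering no to all four questions ends at the late leaf, matching the reading of the tree given above; flipping `san_francisco` to 1 ends at the on-time leaf.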
Now remember this is a very simplistic tree.
And also you never want to use a single tree in production.
But that was just me showing you so you
can see what the tree is doing, what the model is
doing to your brand new data.
We have about run out of time.
And if you like what you just saw,
remember to hit that like button.
This will help support us in creating future content
for free.
Remember to subscribe for future content
and share this video to spread the glorious word of data
science.
And I have a question for you before we leave.
Now how well do you think this model is
going to do on the test set?
Now I have some opinions.
But I want to hear from you.
I have another question for you.
Was your tree different from my tree?
And if you were paying attention in the previous video,
you’ll know what the result of that is.
And why do you think that tree may or may not be different?
Go ahead and leave your responses in the comments.
My name is Phuc Duong and I’ll see you next time.
Happy modeling.

You can get a free trial of Azure here.

Here is the link to the Azure Portal.

Part 1:
What is Azure Machine Learning?

Complete Series:
Introduction to Azure Machine Learning

More Data Science Learning Material:
[Video] Beginning R Programming Series
[Blog]  Azure Machine Learning- Predicting the Value of Your House


Phuc H Duong
About The Author
- Phuc holds a Bachelors degree in Business with a focus on Information Systems and Accounting from the University of Washington.
