Data Exploration | Introduction to Azure ML Part 4

Data Exploration – Now that we have Azure Machine Learning Studio set up, let’s begin an end-to-end data science project in Azure Machine Learning. We’ll choose the flight delay data, and use it to predict whether or not a flight will be late on arrival based upon the flight’s circumstances.

In this video we will begin our preliminary exploration into the dataset using Azure Machine Learning’s dataset module.

Hello, and welcome back to Data Mining
with Azure Machine Learning Studio, brought to you
by Data Science Dojo.
All right.
So today we’re going to give you an introduction
to projects inside of Azure ML.
So basically how do you create a project
to bind a bunch of assets together?
And then we’re going to explore a data set using Azure ML.
And then we’re going to build ourselves a data mining
strategy on how we’re going to approach our action
plan for this data set.
OK, so the first thing you want to do
is navigate to your Azure ML workspace
by going to studio.azureml.net.
And then we’re going to begin the process of creating
an end-to-end project where you will learn Azure ML basically
through trial by fire.
So we will take a data set and bring it
into a predictive model, and then deploy that model.
OK so to begin, go to Project.
So we’re going to start a brand new project,
and this project will contain all of our experiments
that we’re going to use with this.
This is just to be organized.
This step is completely optional.
It’s just later on your workspace
might have a bunch of different experiments in it,
and you might not be aware, or you might get confused
which project is which.
I’m going to create a new project,
and I’m going to call this project Predicting Flight
Delays.
And it has a lot to do with what we’re about to do, so just name
it right now.
So Predicting Flight Delays.
And then the description, I’ll just name it the same thing.
OK, so this is going to basically create
a project folder for me.
So if I go into Projects, you’re going
to notice that there is a project called Predicting
Flight Delays.
And it’s going to ask me to add assets.
I don’t have assets yet, but we’re
about to go and make some assets.
The first thing you’re going to do is go to New,
and create a brand new experiment.
So go to New, and then Blank Experiment.
And the data set that we’re going to be
working with is under Samples.
And there is a data set down here called
Flight On-Time Performance Raw.
OK, so go ahead and drag that in.
So this data set, if you go ahead and visualize–
it’s data from 2011.
And it’s basically, can we use this data?
Can we use the past to predict the future?
And the past being each row in this data set
refers to a flight.
And then each column is an attribute of that flight.
And towards the end of it, we’re trying to predict this column.
Was the arrival delayed by more than 15 minutes,
yes or no?
So if there’s a 1 here, it means that it was
delayed more than 15 minutes.
And if it wasn’t delayed, then it’s 0 here.
And this column is actually derived,
so it gets called ArrDel15.
So arrival delay is at 15.
Let me zoom in here, so you guys can
read it a little bit better.
And this column is actually–
basically, what I think it is is it’s
based off this column right here, which
is by how many minutes was the flight on time, or delayed,
or early.
All right, so if it was negative 6,
it means the flight was six minutes early.
The flight was 12 minutes early, and so forth, and so forth.
So if a number in here is greater than 15,
that means it was more than 15 minutes late.
So this would trigger a 0 or a 1.
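As a rough illustration of how a flag like ArrDel15 could be derived from the delay-in-minutes column, here is a minimal Python sketch. The function name and sample values are made up for illustration, and the strict "greater than 15" boundary follows the description in the video; check it against the real data before relying on it.

```python
# Derive a 0/1 "late" flag from an arrival delay measured in minutes.
# Assumption: strictly more than 15 minutes counts as late, as described in the video.
DELAY_THRESHOLD_MINUTES = 15


def delayed_flag(arrival_delay_minutes, threshold=DELAY_THRESHOLD_MINUTES):
    """Return 1 if the flight arrived more than `threshold` minutes late, else 0."""
    return 1 if arrival_delay_minutes > threshold else 0


# Hypothetical arrival delays; negative means the flight arrived early.
sample_delays = [-6, -12, 0, 15, 17, 42]
flags = [delayed_flag(d) for d in sample_delays]
print(flags)
```

The raw minutes would be the target for a regression model, and the derived flag would be the target for a classification model.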
So this makes this problem really cool,
because we can treat it as either a regression
problem or we can treat it as a classification problem.
So a classification problem being
predicting whether it was late or not.
Or we can predict by how many minutes
it will be late or early.
So we’re going to choose classification.
It’s a much simpler problem to tackle.
But go ahead and do regression if you know how.
And then there’s also these other two columns–
whether the flight was canceled or diverted.
So we’re going to ignore these columns,
but in production, you would actually
build predictive models to predict
whether it is going to be canceled or diverted
at the same time.
There’s many ways you can approach it.
If you build a regression model, and if it’s
more than 15 minutes late, or a certain threshold,
then business logic would kick in
and say, if more than 60 minutes late, or whatever,
say canceled or diverted.
So we’re going to ignore these two columns.
The column we’re going to focus on
is the Arrival Delay 15, meaning that the flight is
delayed by 15 minutes or not– yes or no?
We don’t care if the flight’s early.
We don’t care if the flight’s five minutes late.
We only care if the flight is more than 15 minutes late.
So let’s go ahead and explore this data set.
So if you look at the data set, it’s got 504,000 rows,
and there’s 18 columns.
So this is a pretty sizable medium-ish data
set to work with.
If we look at the year, everything is the same one value,
so this column doesn’t seem useful right now.
So as I’m doing this, I’ll write it down.
And I really recommend that you do this for every data
set you will ever work on.
Build an attack strategy during your data
exploration phase: what are you
going to do with each column?
What are some notes that you will take away?
So there’s a column here called year.
And basically, I’m going to drop this column, because everything
in this is 2011.
So it looks like whoever gave me this data
did a query inside of that database
and only took out the 2011 flight data.
And if we look at quarter–
oh, and I can tell that because
the number of unique values is 1.
And because the number of unique values is 1,
well, that tells you everything is locked into one value.
If you hover over this histogram over here of 2011,
it says the count is 100%.
So I suspect the same thing is true here with quarter
so we can see that quarter is also
100% in the fourth quarter.
So in quarter, I will also go ahead and drop this column.
As far as month is concerned, let’s take a look at month.
Month looks like it is the same story.
So whoever queried this data pulled it
all from the same month, in the same year,
and in the same quarter.
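The "number of unique values is 1" check can also be done in code. The sketch below is plain Python over a handful of made-up rows; the column names are assumptions modeled on the columns discussed here, not read from the actual data set.

```python
def constant_columns(rows):
    """Return the names of columns whose value is identical in every row."""
    if not rows:
        return []
    return [
        name for name in rows[0]
        if len({row[name] for row in rows}) == 1
    ]


# Hypothetical sample rows; the real data set has about 504,000 of them.
rows = [
    {"Year": 2011, "Quarter": 4, "Month": 10, "DayofMonth": 1, "Carrier": "WN"},
    {"Year": 2011, "Quarter": 4, "Month": 10, "DayofMonth": 6, "Carrier": "DL"},
    {"Year": 2011, "Quarter": 4, "Month": 10, "DayofMonth": 6, "Carrier": "WN"},
]
to_drop = constant_columns(rows)
print(to_drop)
```

A column with a single unique value carries no information a model can learn from, which is why these columns go on the drop list.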
So we’re going to build a predictive model that’s only
going to be good for October.
And flight delays are very seasonal.
So I would imagine that you would
build maybe different models to accommodate
different seasons, too.
And then there’s day of the month.
Day of the month being what day is it?
Is it October 1 through 31?
So the problem with day of the month
is that if I’m going to use this feature
to build a predictive model, it’s
not going to be very useful, because if I’m
going to predict the future, what happens on October 6
in the future, that might not mean anything because October 6
might be on a different day.
It might be a Tuesday instead of a Thursday,
or it might land on a different holiday or something like that.
So this feature– it’s too granular
to this particular entry.
It is good historical information,
because I can use this feature to basically determine
is it a holiday or not, or what day of the week it is.
But it looks like someone already did that for us
in here, over here, which is day of the week, which is there
are seven unique values here.
So that tells me this is Sunday through Saturday.
So day of the month–
for now, let’s drop it.
I’m not saying it’s not useful.
I’m saying, in its current form, we’re
not going to be able to do much with it.
It’s too specific to this one month.
So we don’t want the model learning, OK,
if it’s October 6, then every time in the future,
it’s going to do this.
No, that’s not how it works, because the way
the calendar works, it’s going to keep
shifting days of the week and things like that.
So day of the month I’m going to go ahead and drop.
But also note, it might be useful for finding holidays:
we can derive holidays from it.
All right.
And then there’s day of the week.
So day of the week covers the seven days of the week,
but we don’t know which number lines up with which day.
So what does 4 mean?
Is 4 a Wednesday, or is 4 a Thursday?
It depends where day of the week lines up.
So we’d have to go do some domain research
and basically look up what day October 6 was.
What day of the week was October 6 back in 2011?
And then we can figure out what this day is.
So we’ll do that later.
So day of week.
And we want to do that, because if we want a feature later
that says, is weekend, is not weekend, that would
become very useful for us.
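The calendar lookup can be done with Python's standard `datetime` module. This is only a sketch: the helper names are hypothetical, and whether the data set numbers its days with Monday as 1 still has to be confirmed by comparing a known date against the values in the column.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def day_name(year, month, day):
    """Return the weekday name for a calendar date."""
    # date.weekday() counts Monday as 0 and Sunday as 6.
    return DAY_NAMES[date(year, month, day).weekday()]


def is_weekend(year, month, day):
    """Return True when the date falls on a Saturday or Sunday."""
    return date(year, month, day).weekday() >= 5


print(day_name(2011, 10, 6))                # Thursday
print(date(2011, 10, 6).isoweekday())       # 4 when Monday is counted as 1
print(is_weekend(2011, 10, 8))              # True, a Saturday
```

If the rows for October 6 show a 4 in the day-of-week column, that would be consistent with Monday = 1 numbering, and an is-weekend feature would then be days 6 and 7.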
All right, so day of the week–
it’s going to be useful.
And right now it is cast as a numeric column.
I would say it is not numeric, it is actually categorical.
Remember, categorical means distinct bins or buckets
that a value could fall into.
Numeric assumes that there is some kind of progression
between 1 and 7 even though, yes,
you’re progressing through time, but the jump between 7 and 1–
it doesn’t make sense there, because it’s a cyclical loop.
And this data set currently doesn’t encompass that.
So we have to cast this into a category.
OK so we have to cast this into a category,
because it’s in numeric right now.
We don’t want the model to treat it as a number.
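To make the numeric-versus-categorical distinction concrete, here is a small Python sketch of one-hot encoding, one common way to represent a category as independent 0/1 indicators. It is an illustration only, not a description of what Azure ML Studio does internally when a column is cast to a category, and the function name is hypothetical.

```python
def one_hot(value, categories):
    """Encode `value` as a list of 0/1 indicators, one per known category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


DAYS = [1, 2, 3, 4, 5, 6, 7]

# As a number, day 7 and day 1 look six units apart even though they are adjacent days.
numeric_gap = abs(7 - 1)
print(numeric_gap)

# As a category, each day is just its own bucket with no implied order or distance.
print(one_hot(4, DAYS))
print(one_hot(7, DAYS))
```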
All right.
So the carrier– looks like these are carrier codes.
So a quick search: I just double-click on WN,
for example,
and I type in carrier here.
WN stands for Southwest.
So the Southwest code is WN.
So we can look up these codes later,
but we have to ask ourselves a question.
When we consider which features to use,
will it help predict whether or not it will be late or on time?
So yes, I would say that the carrier will probably
be a very important feature in determining whether
a flight is going to be on time or not.
So I’m going to go ahead and copy carrier down.
And carrier has to be a category.
Right now it is a string feature.
And we don’t want it to be a string.
We want it to be treated as a category,
as a discrete value to be used in the predictive model.
All right.
The next thing is the origin airport ID and the destination airport ID.
What airport did they come from?
And then what airport did they arrive at?
Now what is weird about this is there are 279 unique airport
origins, and there are 280 unique airport destinations,
which means the destinations have one more
airport than the origin.
So that might be weird later.
So these codes will be very important,
because remember, behind each code, maybe there
is some inherent lateness or inefficiencies associated
with certain airports.
For example, I can just imagine that if you had anything
to do with Chicago’s O’Hare Airport, then
you would probably be late.
Or JFK Airport, or one of those really busy hub airports.
So yes, let’s include this.
But also, because they’re numeric codes right now,
it’s treating them as a numeric feature.
No, no, we should not do that, because there’s
no rhyme or reason to the order of the numbers–
they’re like postcodes, zip codes.
So they should be treated not as numbers, but as codes.
So origin airport ID–
we’re going to go ahead and cast that to category.
So I’m just building up an attack strategy right now.
So destination airport ID, same thing.
We will cast it into a category, as well.
And then CRS departure time–
CRS departure time, and there should also be an arrival time.
So what time did they leave from the airport?
So it’s listed in, I’m assuming, 24-hour time, so 0 to 2400.
So for example, this flight left at 2:35 PM.
And again, right now, as a numeric,
because this is a cyclical feature,
basically after 2400 it resets back to 0,
it doesn’t make sense to keep this as a numeric feature.
It should be something else.
But if we cast it into a category,
that’s going to cause too many unique values, in my opinion.
So there’s going to be 1,100 unique values here.
So the idea here is we would bucket these time stamps,
so we would have fewer categories.
Now, if we look over here, there’s
the departure time block.
So it looks like whoever did this took all the time stamps
and put them into 19 bins–
so almost even bins is what it looks like here.
So basically if the flight departure time
was between 2 and 3, you would be in this bin.
So we’re not going to use CRS departure time,
and we’re not going to use CRS arrival time,
but it’s not that they’re not useful.
In their current forms, they’re not useful.
But you notice that we have the time block here.
So the time block is what we’re going to use.
And that will encode basically the same information,
but without the fine granularity that we don’t want the machine learning
model to know about.
We want some generality with our machine learning model.
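For a sense of how such time blocks could be produced, here is a Python sketch that buckets a 24-hour HHMM value into hourly bins. The label format and the single overnight bucket before 6 AM are assumptions, chosen so that the result has 19 bins like the column in the data set; the function name is hypothetical.

```python
def time_block(hhmm):
    """Map a 24-hour HHMM integer (for example 1435) to a time-block label."""
    hour, minute = divmod(hhmm, 100)
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"not a valid HHMM time: {hhmm}")
    if hour < 6:
        # Assumption: all early-morning times share one overnight block.
        return "0001-0559"
    return f"{hour:02d}00-{hour:02d}59"


print(time_block(1435))   # a 2:35 PM departure
print(time_block(215))    # a 2:15 AM departure

# One overnight block plus 18 hourly blocks from 06 through 23.
all_blocks = {time_block(h * 100) for h in range(24)}
print(len(all_blocks))
```

Bucketing trades precision for generality: over a thousand distinct minute values collapse into 19 categories a model can learn from.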
All right.
So on another note, if I take the CRS departure time
and subtract it from the arrival time,
I can build a brand new feature that is basically
the number of minutes that the flight took.
But the problem with that is I think
these time stamps are in the time zone of the airports
that they land in.
So we’d have to convert all of these into the same time zone,
and then we can do that subtraction.
But that feature might take more labor
than it’s worth right now.
So we won’t consider that for the time being.
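To show why the time zone conversion matters, here is a Python sketch that computes a flight duration from two local HHMM times. It assumes you already know each airport's UTC offset, which this data set does not provide, and that no flight lasts 24 hours or more. The function name and the offsets in the example are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def scheduled_minutes(dep_hhmm, dep_utc_offset_hours, arr_hhmm, arr_utc_offset_hours):
    """Minutes between a local departure time and a local arrival time."""
    year, month, day = 2011, 10, 6  # placeholder date; only the clock times matter
    dep = datetime(year, month, day, dep_hhmm // 100, dep_hhmm % 100,
                   tzinfo=timezone(timedelta(hours=dep_utc_offset_hours)))
    arr = datetime(year, month, day, arr_hhmm // 100, arr_hhmm % 100,
                   tzinfo=timezone(timedelta(hours=arr_utc_offset_hours)))
    if arr < dep:
        # The arrival is on the next calendar day (an overnight flight).
        arr += timedelta(days=1)
    return int((arr - dep).total_seconds() // 60)


# Leave at 9:00 AM in a UTC-7 zone, land at 3:00 PM in a UTC-5 zone.
naive_minutes = (15 * 60) - (9 * 60)
actual_minutes = scheduled_minutes(900, -7, 1500, -5)
print(naive_minutes, actual_minutes)
```

The naive subtraction overstates this flight by two hours, which is exactly the distortion that would leak into the feature without the conversion.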
But just note that it’s a thing that you can do.
So CRS departure time, we’re going to drop it.
Departure time block, we’re going to go ahead and keep.
It’s a string right now.
We need to cast it later into a category.
And then departure delay–
this is going to be a very important feature,
I think, because if you already start off late
or if you start off really early,
I think that is a very strong indicator of whether
or not you’re going to arrive late.
So in this one, departure delay– and notice
it’s in numeric.
It’s fine.
We want it– keep numeric.
We want it to be numeric, because it’s in minutes.
So notice the negative numbers– this flight left early.
Then there’s whether the departure delay is more than 15 minutes,
and again, this is derived.
So if the departure was delayed–
so in this one, it took off 17 minutes behind schedule.
So that’s why it was set to 1 here,
because it’s greater than 15.
So I think this is going to be a very good indicator of
whether or not a flight is going to be delayed.
So if it’s 1, the flight was already delayed before it left.
So that means that for the flight
to not be 15 minutes late, it actually has to make up time along the way.
So it has to jump through an extra hoop here.
And also, if you see here,
there are only two unique values, 0 or 1.
That tells you that it needs to be a category, because it’s
a binary feature right now.
So that’s what we’re going to have to do to it.
So cast into category.
We don’t want it to be a number.
CRS arrival time– I think we discussed this,
that we will do the same thing here, which is,
for now, we will drop it.
You can build some crazy features from this,
just so you know, if you like time series analysis.
Arrival time block– what we’re going to do here–
let’s see here.
We’re going to do the same thing that we did with the departure
time block, which is we need to cast it into a category,
because I believe it’s a string right now.
And then the next thing is these four columns right here
are basically what we’re trying to predict.
So these four could be our response classes.
And the rest of them will be our predictors.
Now, we could build four different predictive models,
but once you know how to build one predictive model,
you’ll know how to do it for all of them.
So what we’re going to do here is
we’re only going to care about Arrival Delay 15.
And we’re going to build the predictive model for that.
So arrival delay, it could be a response class,
but we’re going to drop it.
Same thing with canceled, and same thing with diverted.
So these three columns, they’re all response classes.
They’re all what we would want to predict.
But for now, we’re only going to predict arrival delay
by 15 minutes.
So this is going to be our response class.
And this also needs to be cast into category.
So that is basically our attack strategy:
what we are going to do as far as data manipulation,
transformation, and all of that good stuff.
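The notes taken during this walkthrough can be written down as a simple column-to-action plan. The Python sketch below is one way to record it. Apart from ArrDel15, the column names are assumptions about how the 18 columns are labeled, and `apply_plan` is a hypothetical helper, not an Azure ML module.

```python
# Column-by-column attack strategy from the exploration notes.
# Actions: "drop", "category" (cast to categorical), "numeric" (keep as a number).
PLAN = {
    "Year": "drop",              # single unique value
    "Quarter": "drop",           # single unique value
    "Month": "drop",             # single unique value
    "DayofMonth": "drop",        # too specific; holidays could be derived later
    "DayOfWeek": "category",
    "Carrier": "category",
    "OriginAirportID": "category",
    "DestAirportID": "category",
    "CRSDepTime": "drop",        # the time block carries the same information
    "DepTimeBlk": "category",
    "DepDelay": "numeric",
    "DepDel15": "category",
    "CRSArrTime": "drop",
    "ArrTimeBlk": "category",
    "ArrDelay": "drop",          # alternative response, not used here
    "ArrDel15": "category",      # the response class
    "Cancelled": "drop",         # alternative response, not used here
    "Diverted": "drop",          # alternative response, not used here
}


def apply_plan(row, plan):
    """Drop unwanted columns and turn categorical values into string labels."""
    result = {}
    for name, value in row.items():
        action = plan.get(name, "numeric")
        if action == "drop":
            continue
        result[name] = str(value) if action == "category" else value
    return result


# Hypothetical partial row for illustration.
sample_row = {"Year": 2011, "DayOfWeek": 4, "Carrier": "WN", "DepDelay": 17, "ArrDel15": 1}
cleaned = apply_plan(sample_row, PLAN)
kept_columns = [name for name, action in PLAN.items() if action != "drop"]
print(cleaned)
print(len(kept_columns))
```

Writing the plan down this way keeps the exploration notes and the later transformation steps in one place.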
All right, so join me next time.
We will go ahead and start making some of these changes
to the data set.
Hey, if you liked that video, and you
want to see more videos like this in the future, go ahead
and like and subscribe.
And I will look forward to seeing you at our boot camp.

Part 5:
Renaming Columns and Replicating Data

Part 3:
Modules and Experiments

Complete Series:
Introduction to Azure ML Studio

Phuc H Duong
About The Author
- Phuc holds a Bachelors degree in Business with a focus on Information Systems and Accounting from the University of Washington.
