Summary Statistics & Cleaning Missing Data | Azure ML Tutorial Part 8

Summary Statistics & Cleaning Missing Data – Let’s understand the aggregate behavior of our features further by looking at summary statistics. Azure Machine Learning gives us easy access to mean, median, mode, min, and max. Let’s look at each measure to see what it means to the interpretation of the data.

The summarize data module also gives us a count for each feature with missing values. We can then formulate a strategy for cleaning missing data. The cleaning functions used in this tutorial is not the optimal way to clean data, but we must learn to crawl before we walk. We’ll drop each row that has a missing value in our response class. Then use one of the measures of central tendency to fill in the other features; median for numeric features and mode for categorical features.

Hello internet, and welcome back to a data mining
with Azure Machine Learning Studio brought
to you by Data Science Dojo.
So today we will explore our data set further
by looking at some summary statistics inside of Azure ML
on how to clean missing values from our data set
using the clean missing values module inside of Azure ML.
So let’s go to Azure ML.
So the last time what we did was we
used select columns in the datasets module
to select the columns that we wanted
to use for machine learning, which
columns do we want to learn from,
and which model do we want to leave behind.
And not that they’re not useful, but we
want them to be useful in their current form,
We’ll talk about how to talk to do that transformation on some
of it columns later.
So if we look at these columns, you
should have 13 columns just like me, if you click on any column,
you’ll see on the side a bunch of summary statistics.
So if we can click on it, we can see that one of the values
is, if you look at missing values,
we can see a summary of all the missing values in the dataset.
So if you click on, for example, departure time book,
it is completely filled in, every value is there.
Same with this one right here.
But if you want to go and if you have hundreds
of columns or dozens of columns, even thousands of columns,
this would be a huge pain to click on every column them
one at a time and then to view the missing values
counts themselves.
Now luckily, inside of Azure ML there’s
a module that crunches all of that for you.
So we go to the left side, search for something
called summarize data.
So there is a summarize data module,
which if you drag in here and connect it, OK,
you can go ahead and run it.
So I notice that I’m connecting to summarize
data module to decide.
I am making a conscious note that this is not
part of the workflow.
This is like a separate dipstick test.
This is a separate test of quality, if you will.
So if I right-click and visualize this,
I can see basically the summary of all of my columns.
So what are my column names, and then what are the observations?
These are the counts.
then I have a list of unique values.
So notice days of the week should have seven days.
If you look at the missing values
count, and that’s probably what we
care about the most for this video, which is how
many missing values are there?
So in departure delay and departure 15
it’s 3,697 missing values.
And that should be expected, because departure 15 is
derived from departure delay.
So if the flight was late shorter than 15 minutes,
departure delay becomes 1 instead of 0.
And then we also look at what type they are.
So in this case, the missing values
that we have are these two are features,
so they’re predictors.
But then in arrival delay, arrival delay
has 417 missing values.
So this is also a problem, because arrival delay 15
is also our response class.
This is what we’re going to try to predict.
So remember, we’re going to use the past
to predict the future where in the past
we knew the result that brought about that past,
this current future.
And so in this case, we don’t know the result of the past,
so we can’t learn from the past.
So these rows will have to be dropped in a little bit,
but while we’re in here, let’s look at some more summery
statistics.
So we get the min and the max.
We get the range of the data set.
What are the extremes of the dataset?
What is the hottest day?
What is the coldest day?
So if you look at departure delay,
so this is a numeric value, so you
look at departure delay, one flight left 56 minutes early.
This flight left 56 minutes ahead of advertised.
And then if you look at the max, there was a flight in here
and that was 1,401 minutes late to even depart.
So remember departure delay is what was your initial take-off.
Was the initial take-off early or late?
So this initial takeoff was 1,400 minutes delayed.
So that’s bad.
If you divide this by 60, basically they
were waiting for the flight for almost 24 hours.
So it’s like 23 point something hours.
If you look at this one right here,
day of the week, the min and max day
a week, it doesn’t really make sense,
because the reason like what is one?
Is one Sunday?
Is it Monday?
So the thing is day of the week is also a cyclical pattern.
So one to seven means, oh this is beginning
of the week to the end week.
But remember, it this number currently
does not include that this number will start
over and over and over again.
So it’s kind of meaningless.
That tells us that we should probably
treat this not as a number but probably as a category.
We’re going to do that in the next video.
OK, next thing is if we look at departure delay.
There’s only 0 from 1 for arrival delay 15
and departure delay 15.
So that tells you that there’s two unique values.
It’s a binary feature.
It can only be one thing or the other.
So let’s take a look at some other things here.
So these are some measurements of central tendency.
So if you look at this, this is the mean,
so this is the average.
So if you look at this, the average a day of the week
doesn’t really make sense.
The average of the day of the week is what,
is this Wednesday, Thursday?
I’m not sure here.
And what does that even mean?
It doesn’t mean anything.
So if we look at the next thing, the departure delay
is a number, so it would make sense.
So on average, flights that left from their original airport
were about 4.8 minutes late.
That makes sense.
But also keep in mind with the mean
that it’s pulled by extraneous outliers.
So if I looked at this number, I would probably
see that this is probably an extraneous outlier.
because it’s too past the standard deviation.
So there is the mean of the deviation,
where how far do you deviate away from this number?
So notice it’s two past the standard deviation
maybe if three past, so even three past,
so it’s definitely an outlier here.
So the mean is pulled up in this case by this extraneous number
right here.
Then next thing is if you look at the next row, which
his departure delay 15, departure delay 15
is the mean of that is 0.128.
But the weird part is it’s only between 0 and 1.
So what this is measuring is out of all the observations,
out of 500,000 flights, what is the propensity of this column
to be 0 or 1.
That’s what that mean is measuring.
So if you look at this number, it’s very close to zero
and very far away from one.
So that means that most of the time, you can also kind of say
that 88% or 87% of the time, based
upon this data and this sample, that flights are only
going to be 12% that 13% late, which is, I would say that’s
still pretty bad.
Because let’s say I was 10% late to work,
I would probably get fired.
The next thing his arrival delay.
So arrival delay is when you actually
arrive from your flight.
so we see that this is 0.136.
And what’s really interesting is that if you
take this number in contrast to this number,
they are very, very close.
So what does that mean?
To me It means that if you’re already late,
chances are you’re going to stay late.
That’s what I see that this means.
And if you’re already late, there’s
very little chance that you will make up that flight.
So the idea is I think if you take off
and you’re already late, you can’t just go faster.
You’re going to stay late.
So it will have a rippling effect.
So you notice that because this number is
so close to this number, there’s not many factors
once you’re in the air that seem to make you even more late.
So just because you left early or just
because you left on time, it almost
basically assures that you will arrive on time.
So it looks like there’s very little delays once the plane
actually gets in the air.
And that’s what this number is trying to tell me.
The next thing is if you look at the first quartile
and the third quartile, so this is if you took the data
and sorted it, assuming it’s a numeric number,
so if you look at departure delay, which
is a numeric value.
So if you shorted it, the first quartile is 25% of the data
if you sorted it, and the third quartile is to 75th percentile.
So if you took between this range and this range,
that’s 50% of the data.
So you can say that 50% of all flights
were between five minutes early and three minutes late.
So that’s very good, I think.
So to be that tight with your first and third quartiles
ranges within eight minutes.
So if you look at it, I think that’s amazing
that they can do that.
Because these planes are flying from different time zones
to different states to different airports.
And they have different carriers, different management,
and hundreds of different people along the way.
And somehow every plane, or 50% of all planes
are landing between these values, which is
a difference of eight minutes.
I think that is awesome.
And then if you look at the median,
the median is the middle value.
So basically 50% of it is above and below this point.
So notice that 50% of the time the plane
is going to be less than two minutes or earlier.
And 50% of the time the plane will be less than two minutes
or greater.
OK, so let’s go back and look at this.
so the missing values is something
that we should take care of, something we have to do.
So if we exit out, we can go ahead
and start thinking about a cleaning strategy now.
So if you go into the left top left-hand corner and type
in “clean missing data,” just drag that in.
So this is the module which we’re
going to use to clean missing values,
and just go ahead and connect your data.
So notice that this is an actual transformation.
That’s why I’m making it part of this workflow
it’s just a style thing.
I like to keep my experiments nice and neat,
because I work with people, and I want them to also understand
what I’m doing.
So we select the columns we want to clean.
So in this case, I had a response class here.
So if I go by name, what I did was click by name
and I said arrival delay is what I’m going to clean.
So remember arrival delay had 4,700 missing values
and it’s a response class.
So just a very quick rule of thumb here.
So the response class, so if missing response class,
we have to drop the row.
And there’s nothing we can do here.
The idea is we can’t we can’t learn from the past
if we don’t know what resulted in the past
being the way it was.
If we go back even further, the idea
is machines can’t learn from missing values.
The missing values have to be cleaned up.
We can’t ignore them, because you can’t do math
on missing values, and machine learning
is all about mathematical operations.
So missing values, therefore, become like roadblocks
for machine learning.
So they have to be taken care of before the model can
learn from them.
Now the next question is, these packages are smart.
Can’t they just clean missing values for me?
Yes, they can, and they will.
Most packages in R, Python, and even Azure ML,
they won’t complain if you feed it missing data.
It will just take care of it.
You don’t even need to worry about it.
It won’t even tell you that it clean in the missing values.
However, you as the data science practitioner
will always want to retain the power that
is cleaning missing values, because it becomes
a tuning parameter for you.
And remember, these are experiments.
It’s a scientific method, which is
you’re trying to isolate certain causes and performances.
So if you built two different models and one performed
better, is it because it cleaned better because
of the algorithm, or did it clean better because of the way
you cleaned the data?
You can’t isolate that effect if it’s doing that for you.
And the next part is how well you
clean missing data will affect the predict model,
because the model is going to learn from the values
that you filled in the missing data with because you’re going
to learn from that data set.
So cleaning missing values just tries to get back
at the idea of data quality.
The completeness of the data will improve
the quality of your model.
You can think of data as the raw materials for your machine
learning model.
If you feed the model higher-quality raw materials,
you will forge a higher-quality model.
So the problem is missing data is a natural occurrence in data
science.
So many things in life that happen
in the real world that prevent data collection from being
a perfect system.
So data could have been failed to be recorded,
or failed to have been given.
Or the data was never recorded because it doesn’t make sense,
and it’s not applicable for that situation.
So failure give or failure to record.
So some examples.
If you look at survey data, which is probably
the messiest form of data there is, let’s look at let’s say age
is missing.
Why is age missing?
And maybe the surveyor never got time to ask that question,
maybe he didn’t ask that question,
maybe there was a discriminatory law that prevented you
from asking the age.
What about the user side?
Maybe the user didn’t want to tell you their age,
they don’t know their age, or they had
to leave early or something.
So there could be a multitude of reasons why that was missing.
This happens all the time on online
when you make an account on basically any website
now they just ask you for your username or email and then
a password.
But an account still has a bunch of other data columns
ready for you to fill out, like your first name,
your last name, your address, and all of these things.
But they don’t want to ask that right away, because they
want you to go ahead and have an account with them right
away because you can use their platform.
So the rest of that is blank.
This happens all the time on social media.
For example, if you sent out a tweet
and you don’t have a hashtag, the hashtag column
is now blank.
Maybe you queried data and you don’t have access
to a certain row, so to you it looks blank.
So it depends if you have access to certain parts of your data
or not.
If you were looking at manufacturing equipment or IoT
data, the monitoring equipment could have been down,
so maybe each column in that data set
could have represented a different measurement
device, a different sensor, and all
of a sudden a sensor goes down.
And then it would just be submitting
nulls for the whole time.
And then let’s just say maybe you are recording something one
day, maybe heights of students, and that student was absent
that day, so that’s just failed to record.
And then maybe you do something, but not something else.
Let’s say you go to Amazon, you rate something
but you didn’t review it, or you go to Netflix,
you review a movie without ever having watched it,
so watch time is blank.
Another thing is that maybe the situation is just not
applicable.
Maybe some conditions just aren’t met.
But there’s still a column in that dataset.
So for example the spouse will probably
blank unless you have a spouse, unless you’re married.
So in that case, if you’re cleaning missing values there,
you want to fill it in with whatever makes sense.
So instead of just saying blank, you
might want to just say single.
The next thing is maybe the refund date, the refund date
column, or the refund value column
is only populated if someone returns a purchase.
So in this case, if it’s blank you
might want to just fill it in with zero,
just assuming that they never returned anything
for that transaction.
And then maybe graduation is blank because they dropped out,
they never got to the graduation.
So there are many reasons in the real world
why something could be missing.
So let’s go back to Azure ML.
Let’s just finish now with the cleaning missing data.
So you notice there’s two output modes here.
One is for the cleaning transformation,
and one is for the clean datasets.
For the most part, you don’t want
to care about the right output mode.
The right output mode is later on when
you want to deploy this pipeline and you clean the missing
values with a derived number like the median,
the median of something of a column was 40.
Who’s to say that the median of future date will also be 40?
So the idea is if you want to build this as a pipeline
and preserve what the derived values are,
you would save it as a transformation.
But for the most part you don’t need to care about this side.
The next one is this is where the data actually gets cleaned.
So if you visualize out here and you visualize inside of it,
we’ve gone ahead and cleaned in the missing values of I
think arrival delay.
So if I mouse over this, there should be zero missing values.
So everything got dropped.
So notice I had 500,000 rows earlier,
but I dropped almost 5,000 rows.
So I’m down to 4,899 rows.
so I’ve lost quite a little bit, like less than 3%
or so of my data.
That’s not bad.
But I couldn’t have learned from that data
anyway, because the response class was empty.
If you look at these two columns,
they also now don’t have any more missing values in them
either.
And just to be sure, I can just take the summarized data
module, Control C, Control V to copy paste it,
and this connects over here.
Just to double check, do I have any missing values
left in my data?
Now I know I don’t, but just in case I can check it.
So that is if your response class is empty.
So what if your response class was–
what if it was a feature that was missing?
So in this case, remember, these are some very general rules
of thumb to start off.
So what if a feature was missing?
What if your feature was missing,
which is basically these are predictors?
Then it depends on the data type.
If your feature is a category, if categorical, or if it’s
a numeric, answer if it’s numeric,
then you have access to basically a bunch
of numbers, the numbers of central tendency.
So you have mean, you have median, and in both cases
you have the mode.
In this case, when you clean the categories,
the mode makes a lot more sense.
But these are what’s called global cleaning functions.
These are actually really bad.
In production, you generally don’t
want to clean with the mean, median, or mode for numeric,
just because it puts a blanket operation on the whole thing.
For the most part, what you actually want to do
is you want to impute missing values.
So imputation of missing values is all
about you build up basically a predictive model
to predict the missing values of data to clean the data such
that the data can be cleaned to feed
into the actual machine learning model.
But again, you have to crawl before you can walk,
and you have to walk before you can run.
So let’s put that on the back burner for now.
So let’s just clean it with the mode.
If you have a data set and you want
to just get going with machine learning
you can just start off withe median, the mode,
and mean just to get going so the machine learning
model doesn’t scream at you because there’s missing values.
But also remember you can also do things
like can drop the row, or you can drop the column.
now for the most part, these are really sub-par.
Because remember data’s new oil.
The idea is data feeds algorithms.
The more data you feed the algorithm, the better.
So if you drop all the rows, then you’re
going to lose a lot of data that it could have learned from.
So maybe that column could have had
a missing value for that cell, but maybe the rest of the row
was fine.
So you might be punishing the other columns
in that row for the bad behavior of a single cell.
So it’s better just to clean the missing values for the most
part.
And you don’t want to drop a column unless basically you’ve
decided that it’s useless.
And maybe there’s 99% missing, and there’s
no signal in that 1%.
So those are very brief situations
where you want to drop the columns, but almost never would
you want to drop the column unless it’s irrelevant
if it even had missing values.
All right, so that will conclude how
to clean missing data in Azure ML.
And let me just show you really quickly.
So you remember in the cleaning function mode
there were some other modes.
I talked about the mean, median, and the mode.
So the mean and the median will break unless it’s
an actual numeric column.
And then there’s also this MICE thing here.
So MICE is all about loading a mini in-mind regression.
So you select it and click on the More Help function,
it will link you to a white paper released by Microsoft
on what the MICE equation looks like.
You can also do a custom substitution.
Custom substitution is very cool in case
you want to replace things with whatever
you think it should be, like for example 0 or something.
So that’s also good for categories.
If you have a situation where a category is missing,
you can use custom substitution to create
a separate categorical value.
What that will do is it will create a separate category,
and rather than filling the mode,
you can actually isolate out the individual contributions
that the mode had on the dataset, on the model,
on the response class itself, without being
tainted by all the missing values being filled in there.
So the next thing is you can remove the row, remove
the columns, and then you can do basically a pre-CA calculation
to fill in missing values as well.
And that will conclude how to clean missing data in Azure ML.
So during this next time I will show you
how to cast columns into categories
so the computer knows how to treat them correctly
and how to split your data into multiple parts,
pull partitions as we begin to build
machine learning on this clean and ready to go dataset.
So now that this dataset is clean, it’s ready to go,
it is good to be fed into a machine learning model.
And if you like this video and you
want to see more videos like this in the future,
remember hit that Like button and share with your friends
and spread the good word of data science.
It helps a lot.
And before I go, I just got a question to ask you guys.
What do you use as your machine learning forum.
Let me know in the comments my name is Phuc Duong,
and I’ll see you next time

You can get a free trial of Azure here.

Here is the link to the Azure Portal.

Part 9:
Splitting Data & Categorical Casting 

Part 7:
Dropping and Selecting Columns 

Complete Series:
Introduction to Azure Machine Learning

More Data Science Learning Material:
[Video] Unstructured Text With Python, MS Cognitive Services & PowerBI
[Blog] Math for Machine Learning: Math for Aspiring Data Scientists

(837)

Phuc H Duong
About The Author
- Phuc holds a Bachelors degree in Business with a focus on Information Systems and Accounting from the University of Washington.

Avatar

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>