Completing the Titanic Kaggle Competition in Azure ML

In this Kaggle tutorial we will show you how to complete the Titanic Kaggle competition in Azure ML (Microsoft Azure Machine Learning Studio). It is helpful to have prior knowledge of Azure ML Studio, as well as an Azure account. A Kaggle competition is one in which companies and researchers post data, and statisticians and data miners compete to produce the best models for predicting and describing that data.

Hello, ladies and gentlemen of the internet.
My name is Phuc Duong, and I’m here to show you how to do
the Kaggle competition in Azure Machine Learning Studio.
Now I’ve gone ahead and uploaded the blueprints to a solution
that I’ve come up with in Azure Machine Learning Studio,
and I have posted it on the Cortana Intelligence Gallery,
where you can clone and replicate this experiment.
I’m going to post the link to this experiment in the description,
or you can pause the video, look at the URL in this notepad,
and type it in yourself.
But I really recommend clicking the link in the description, OK?
OK, so now that we’re done with that,
there is a quick description on this page.
You can read it if you want, or you
can listen to my beautiful voice as I walk you
through how to do this.
So if I click on the Open in Studio button,
it’s going to go ahead and try to open it in your Azure
Machine Learning workspace.
The first thing it’s going to ask you
is, which workspace do I want to clone it to?
You choose the workspace that you want to clone it to,
which region you want it to clone to, and say yes.
So it’s going to go ahead and bring
in my datasets, my modules, and then we’re
going to go ahead and run the experiment.
I’m going to walk you through what it’s doing.
Now, remember that this is a solution.
It is definitely nowhere near the optimal solution,
but it is a good solution to get running,
and a useful template for most machine learning models,
not just the Titanic model.
So once it’s gone ahead and loaded,
you should see something that looks like this.
It kind of looks like an octopus.
So, the first thing you do is to hit run.
So, right now this is just like a blank set of instructions,
or a set of blueprints.
Once you hit run, though, it’s going to go through and execute
and actually do the transformation on the datasets.
So at the very top, we begin with our two datasets.
So the Titanic dataset will be our training set.
So 891 rows where we know from the past whether or not people
lived or died when they stepped on the Titanic.
And our job is to build a predictive model
based upon these demographic attributes:
how many children they had, whether they’re male or female, et cetera.
The test set is our assignment.
This is what we need to predict on,
because it doesn’t include the Survived column.
So if we feed brand-new data to the model that we’ll build,
can the model predict on this dataset?
Can we predict whether or not these people will live or die?
And that’s what the Kaggle competition
is going to go ahead and grade you on.
So let’s ignore the test set workflow for a minute.
I’m going to mouse over here
so you can ignore it for now.
And I’m going to go through this workflow.
So we’re going to drop four columns right now:
PassengerId, Name, Ticket, and Cabin,
because they’re not useful in their current forms.
We can come back later and improve this model
by going into Name, for example, and extracting
titles, or the cabin letter, or something else.
And then we’re going to go through
and we’re going to cast columns into categories.
So the categories are basically Survived,
Pclass, Sex, and Embarked.
So we want the data set to treat– especially Survived–
as a category, because remember by default,
it treats 0 and 1 as numeric.
If it goes in as numeric, it then
becomes a regression problem, not a classification problem.
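Azure ML does all of this with drag-and-drop modules, but if you want a rough code equivalent, here is a minimal pandas sketch. It assumes the standard train.csv column names from the Kaggle competition and illustrates the same idea, not what the Studio modules literally run.

```python
import pandas as pd

# Load the Kaggle training set (standard train.csv from the competition)
train = pd.read_csv("train.csv")

# Drop the four columns that aren't useful in their current form
train = train.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

# Cast the label and the low-cardinality columns to categorical, so Survived
# is treated as a class label rather than the numbers 0 and 1
for col in ["Survived", "Pclass", "Sex", "Embarked"]:
    train[col] = train[col].astype("category")
```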
And then we have two cleaning modules, so in this case,
we’re going to go ahead and do a quick and dirty clean.
We’re going to clean all of the numeric columns
with the median.
In this case I think the median translates to 28
for the 177 rows that are missing an age.
So if I visualize before the cleaning,
we can see that there are 177 missing values of Age,
but after we clean all the numerics,
we should be left with no missing values of Age.
But notice that at around 28, our histogram is
pulled way up.
So the shape of our distribution stays the same,
it just gets tighter.
Remember, this is a very subpar cleaning function.
You definitely want to eventually build
a machine learning model to predict these missing values.
The next thing we want to do is we’re
going to do the same thing, but for categories.
And for categories, we’re going to clean it with the mode.
So if we visualize the data before the cleaning,
Embarked has two missing values, for example,
but if we visualize it after this module,
there are no missing values of Embarked,
and there should be no missing values for any of the remaining
columns.
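Continuing the pandas sketch, the same quick-and-dirty imputation might look like this; the specific medians and modes are simply whatever the training data gives you.

```python
# Quick-and-dirty imputation, mirroring the two cleaning modules:
# numeric columns get the median, categorical columns get the mode
age_median = train["Age"].median()            # about 28 on the training set
train["Age"] = train["Age"].fillna(age_median)

embarked_mode = train["Embarked"].mode()[0]   # the most common port
train["Embarked"] = train["Embarked"].fillna(embarked_mode)
```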
So now what I want to do
is show you the methodologies in here.
There are really three methodologies here,
but I want to show you the first one:
the train/test split.
And notice that all three methodologies
derive from the same algorithm,
so this Two-Class Decision Forest
is feeding into each and every branch.
So on the left side, we do a train/test split:
we hold out 30% of the data, and we
build the model using only the other 70%.
So notice that 70% of the data is going in,
and it’s going to sacrifice itself to build
this forest, build this model.
And once the model has been built,
then we bring in the 30% of the dataset that we hid away.
We hid that data away to pretend
that it’s brand-new, real-world data.
So if you visualize the Score Model
module, you’re going to see
that it has taken every row,
thrown it at the model,
and produced a prediction for each
and every single row, line by line.
So, 445 predictions here.
And it looks like– oh, this is a mistake on my part.
I described this as a 70/30 split,
but it’s actually a 50/50 split.
So ignore that; it’s a mistake.
You can change it if you want.
It currently doesn’t match the documentation.
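Azure ML’s Two-Class Decision Forest isn’t the same implementation as scikit-learn’s random forest, but as a stand-in for the same train/test methodology, a sketch might look like this (continuing the earlier code; the 30% holdout and random_state values are just illustrative):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One-hot encode the categorical features so scikit-learn can consume them
X = pd.get_dummies(train.drop(columns=["Survived"]),
                   columns=["Pclass", "Sex", "Embarked"])
y = train["Survived"].astype(int)

# Hold out 30% of the rows; train on the remaining 70%
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.30, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score the held-out 30% as if it were brand-new data
holdout_probs = model.predict_proba(X_holdout)[:, 1]   # probability of survival
holdout_preds = model.predict(X_holdout)               # 0/1 predictions
```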
All right, so if you go ahead and visualize this,
we can see that, based upon the cleaning function and all
this stuff, the model predicted a 50% chance
that this person would live, and because it’s
rounding up from 50%,
the model is going to say this person will live.
In actuality, this person died.
So the model was wrong in this case.
The model was not in agreement
with reality in this situation.
And if you go down here, the model
thought this person would have a 23% chance of living,
which rounds down from 50%.
So, zero.
This person is predicted to have died.
Ooh, OK, so this model is wrong again.
So notice that we already found two wrong answers.
But you can also do a comparison between the two.
So if you click on the predicted result,
and then compare that to Survived,
which is the ground truth, we’ll have predicted versus actual,
and we get a really cool confusion matrix.
So 103 times the model was right one way,
254 times the model was right the other way,
and the rest of these are wrong.
We could go through and add those up one by one,
but there’s actually an evaluation module
that will calculate accuracy, precision, and recall for us.
So if you visualize the Evaluate Model module and scroll down,
it says that this model has an 80% accuracy, which
is actually not bad.
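In the code sketch, the same comparison of predicted versus actual on the holdout set is a few lines of scikit-learn metrics:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Predicted vs. actual on the 30% holdout -- the same summary the
# evaluation module gives us
print(confusion_matrix(y_holdout, holdout_preds))
print("accuracy: ", accuracy_score(y_holdout, holdout_preds))
print("precision:", precision_score(y_holdout, holdout_preds))
print("recall:   ", recall_score(y_holdout, holdout_preds))
```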
So the Titanic competition wants you
to maximize this particular metric, accuracy,
and hopefully we won’t overfit too much
by trying to maximize it.
So the idea is, we will build a new feature
and go ahead and run this again.
So we’ll add a new feature, we’ll clean differently,
we’ll tune the parameters of the algorithm,
we’ll do tweaks to make the model better.
And every time we’re going to go back and see,
does that tweaking improve the accuracy of the model or not?
So that’s the first methodology.
This is how we iterate during development.
Now, it could be that through sheer randomization alone,
we could end up with a series of splits that make us look good.
So then that’s where cross-validation comes in.
So once we get an accuracy that we’re happy with,
the idea is then we can go ahead and cross-validate
to see whether or not we will trust that accuracy measure.
And so if we look at this number, 80%,
is that 80% on a good day, or is that 80% on a bad day?
It could just be sheer randomization.
Some of the test set could have just been easy to classify,
and the harder-to-classify rows might not have been in that test set.
So to avoid that, what we’re going to do
is we’re going to build 10 models
in this cross-validation module.
So first of all, I’m not going to trust any number that comes
out of this evaluate module.
I’m going to be skeptical and I’m
going to do a cross-validation check on that number.
So this model claims that it will
get me an 80% accurate model.
This cross-validation is basically
like a separate dipstick test.
So notice it’s reading from the same algorithm,
the same dataset, the same everything, because I
want to test all of the current conditions.
So if I visualize this cross-validation module,
it’s going to go ahead and build me 10 models.
It’s going to take my dataset, chop it up into 10 folds,
train a model on a different set of folds every time,
and test on a different held-out fold every time,
until the entire dataset has been seen.
And if we look at the accuracy column,
this is a table that summarizes how the model did
on each and every fold
of the cross-validation.
So notice fold 7:
basically it trained on every fold except 7,
and then tested on fold 7,
and the model only
got 66% accuracy on that.
I think that’s very bad.
So that 80% seems to have been on a good day.
Other folds are 82, 89, which is pretty good,
but you notice that the model is jumping around.
It is not a stable model.
So this is a model that could eventually betray us.
It could fool us into thinking that it’s a good model,
but then we deploy it and we start
losing a lot of money on its predictions
because we’re wrong a lot.
So you notice that, on average,
the model will get about 80% when we retrain it.
And the standard deviation from the mean is about 6% here,
which is a very high standard deviation.
If you take two standard deviations,
that’s about plus or minus 12,
so the model could land anywhere between roughly 68% and 92% accuracy.
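In the code sketch, the same stability check is a 10-fold cross-validation over the full training data; the per-fold numbers you get will depend on your own features and cleaning:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation: same algorithm, same data, same settings,
# mirroring what the cross-validation module does
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=10, scoring="accuracy")

print("per-fold accuracy:", np.round(scores, 3))
print("mean: %.3f  std: %.3f" % (scores.mean(), scores.std()))

# Two standard deviations on either side of the mean gives a feel for
# how much the accuracy can swing between retrains
print("range: %.3f to %.3f" % (scores.mean() - 2 * scores.std(),
                               scores.mean() + 2 * scores.std()))
```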
So remember, we don’t want to get attached
to any individual model.
We’re basically evaluating a process,
a methodology.
Will this methodology produce a good model every time?
The idea is, cross-validation doesn’t care
about the sharpness of a single knife or blade.
It cares that the factory
produces a sharp blade every time.
I don’t care about the individual sharpness
of one blade, I care about the overall sharpness
of all the blades that come out of this factory.
And the same thing is true of this machine learning process.
So that tells me first of all, that this is not a good model.
It’s a very unstable model.
So I would go back to the drawing board.
I would engineer more features,
I would go ahead and build better cleaning functions,
maybe use different parameters for the algorithm,
maybe use a different algorithm altogether.
But the idea is once I’m happy with the standard deviation
in the cross-validation, once it’s low enough,
I can go ahead and decide that I want to deploy this model.
So deployed means I want to use it on my production data.
So in this world, the Titanic Kaggle competition,
the production data is the Kaggle test set:
the other 418 rows that they
don’t give you Survived for.
So once I’m happy with my process,
I’m going to go ahead and retrain
the model on 100% of the data.
And notice that once my cross-validation has validated
my process, my process being the way I treat my data,
the way I clean my data, the way I’ve engineered my features,
the algorithm I chose,
and the parameters of that algorithm, once all of that
is settled, and I have a spreadsheet somewhere
keeping track of all this, then I retrain the model
on 100% of the data.
And notice that because we’ve trained on 100% of the data,
we have no holdout set, and because we have no holdout
set, we can’t run an evaluation to tell how well this model did.
But we have some assurance, because we
know from the earlier evaluation and cross-validation
how well this process did on this particular set of parameters,
this algorithm, and this dataset.
The assumption is that if I feed it more data,
and keep everything else the same, it will perform better,
because it will learn more from that data.
So this is my production model:
notice I’m feeding it 100% of the data, with no evaluation,
because the evaluation was done over here.
If I had two weeks to put together a predictive model,
this would basically be day 13 out of the 14 days.
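In the code sketch, the retrain-on-everything step is just a fit with no split; nothing else about the process changes:

```python
# Once the process has been validated, retrain on 100% of the training data.
# This is the production model -- no holdout, no evaluation at this stage.
production_model = RandomForestClassifier(n_estimators=100, random_state=42)
production_model.fit(X, y)
```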
And notice that if you go back to the test set,
I cleaned the test set exactly the same way.
I basically took this workflow,
right clicked, copied, and pasted it over here.
So I cleaned it exactly the same way.
The only difference is I actually
don’t drop PassengerId.
I let PassengerId flow forward through here.
So if I visualize this, I actually keep PassengerId,
because remember, Kaggle wants PassengerId.
But why doesn’t it error when it gets down here,
to the prediction?
Because we never built a model that uses passenger ID.
Well, the idea is that Azure ML looks for the column names
that it was trained on.
If it sees extra columns, it’s
going to go ahead and ignore them.
It just passes them through.
So that’s why PassengerId is fine over here in scoring,
but it wasn’t fine over here in training.
The next thing is, the test set doesn’t have Survived,
so I’m not going to keep Survived at all.
Also, this documentation is wrong:
it should not say dropping PassengerId,
it should just say dropping Name, Ticket, and Cabin.
And notice that we’re going to cast everything except
Survived into a category.
Survived is still listed here,
but that was left over because I copied and pasted this from over there.
So notice that I’m casting these columns into categories,
so everything stays consistent.
If it was a category when you trained,
it has to be a category when you predict on it,
or in this case, score on it.
And then I clean it the same way.
Now this is also really subpar cleaning,
because I think the median over here for Age is 29,
and the median over here is actually 28.
If you were really serious about keeping everything
the same, you would do a custom substitution over here
and just say "replace with 28".
So that’s probably what you would do here: a hard replace with 28.
Same thing over here.
I think the mode is the same on both sides,
so we’re not really affected there.
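In the code sketch, keeping the test-set cleaning consistent just means reusing the values computed on the training data rather than recomputing them. This block is my own addition, assuming the standard test.csv, which also happens to have one missing Fare value:

```python
# Clean the Kaggle test set with the training-set statistics
# (a hard "replace with 28" for Age, and the training mode for Embarked)
test = pd.read_csv("test.csv")
passenger_ids = test["PassengerId"]                  # Kaggle wants this column back
test = test.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

test["Age"] = test["Age"].fillna(age_median)         # training median, ~28
test["Embarked"] = test["Embarked"].fillna(embarked_mode)
test["Fare"] = test["Fare"].fillna(train["Fare"].median())  # one missing Fare in test.csv

for col in ["Pclass", "Sex", "Embarked"]:
    test[col] = test[col].astype("category")
```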
But now what happens is,
it’s going to take this model,
which is now trained on 100% of the data,
and once that model’s been built,
it’s going to run through the test set
and build predictions.
It’s going to derive a prediction from
that production model
for each
and every single row.
So go ahead and visualize it when it’s done.
You can see that, starting from passenger 892
and going onward, these
are the predictions that we get,
and the scored labels are derived from the scored probabilities:
anything higher than 50% rounds up,
anything lower than 50% rounds down.
So notice that this person is rounded up to one,
this person is rounded down to zero, and so on.
And remember this is what Kaggle wants from you.
Kaggle just wants a straight answer.
Will this person live or will this person die?
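Continuing the sketch, the scoring step and the 0.5 rounding might look like this; the column re-alignment here plays the role of Azure ML ignoring columns the model was never trained on:

```python
# One-hot encode the test features and align them to the training columns;
# extra columns are dropped, and any dummy column missing from the test
# set is filled with zeros
X_test = pd.get_dummies(test, columns=["Pclass", "Sex", "Embarked"])
X_test = X_test.reindex(columns=X.columns, fill_value=0)

# Scored Probabilities and Scored Labels: the label is just the
# probability rounded at the 0.5 threshold
test_probs = production_model.predict_proba(X_test)[:, 1]
test_preds = (test_probs >= 0.5).astype(int)
```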
Now what we need to do is
format our dataset so we can upload it
to the Kaggle website.
So we take the scored label,
and we take the PassengerId– that’s all Kaggle wants,
so that’s what we’re going to do here.
We’re going to use this Select Columns in Dataset module,
choose PassengerId and Scored Labels,
and get only those two columns out of it.
The next thing we’re going to do is
rename the scored labels column.
We call it Survived, because Kaggle is looking for two
columns with particular names.
So in this case, the column was called Scored Labels,
and we’re going to rename it.
I said I want Scored Labels to be renamed to Survived,
and then I want to convert the whole thing to CSV.
So if I right click and download,
it’ll go ahead and download the dataset to my computer.
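In the code sketch, the select-columns, rename, and convert-to-CSV steps collapse into a few lines; the output filename is just an example:

```python
# Build the two-column submission file Kaggle expects: PassengerId and Survived
submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": test_preds,
})
submission.to_csv("titanic_submission.csv", index=False)
```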
So the next thing I want to do is
I want to upload this to the Kaggle competition.
So once you have downloaded that file, go to “Make a Submission”,
and this is what you should be looking at.
Notice I have a Titanic Kaggle competition file
that I’ve downloaded from Azure ML,
from that Convert to CSV module.
I’m just going to go ahead and drag that in,
and then I will say submit.
It’s going to go ahead and score my submission,
and based upon this model, I got a 77% accuracy
on the test set.
And basically that’s how you do it.
So to raise your rank from this competition,
you basically go back to the drawing board.
You change the cleaning functions,
you engineer different features, you
bring in different data, more data,
and you do what you need to do to improve the model.
Now if you’re in the public Kaggle competition,
bringing in more data might be bad,
because it’s against the terms and conditions.
But if you’re doing this with the Data Science Dojo cohort,
go ahead and by any means necessary,
make the model better.
So let’s talk about this workflow real quick.
So notice that I’ve set everything up
in the same workspace.
And notice that I will engineer features,
I will build different features, I will clean the data differently,
I will change these parameters, and every time I’ll
go ahead and do the train/test split and check:
if I clean it with MICE, did that improve the accuracy?
If I clean it with the median, would that
increase the accuracy?
And I’ll keep doing that until I get an accuracy that’s good.
Once the accuracy is good enough,
then I’ll go ahead and check the stability
of that accuracy using the cross-validate module.
And if it’s not stable,
if the cross-validation brings me back an unstable model,
I’ll go back and I’ll find different features,
different cleaning algorithms, different machine learning
algorithms, different parameters, until my model is
good and stable.
And once my model is stable enough,
then I go ahead and train on 100% of my data,
and just right click over here and say “download”.
And notice that this does everything in one workflow.
I can train, then
check the evaluation and see if I’m happy,
and once I’m fully happy with the whole thing,
I can just right click, say download here,
and then submit that to my Kaggle competition.
And now we’re about out of time.
Thanks for joining me today as we went through an end-to-end
solution for the Titanic Kaggle competition in Azure
Machine Learning Studio.
If you like what you just saw, remember to like this video.
It’ll help me produce more videos
like this in the future for free,
and remember to subscribe to get the latest tutorials.
And if you know someone who’s getting into data mining,
why don’t you share this video with them
and spread the good word of data science.
All right, now if you do use this experiment,
let me know how you did in the comments below.
And if you thought of a great way to improve the model,
like, let’s say you used a different cleaning
function, a different feature, a different algorithm,
a different parameter for the algorithm,
or a different methodology altogether,
remember that data science is only powerful
when it’s collaborative.
So go ahead and share your ideas and methodologies
and help each other out in the comments below.
For me in particular, I found that this dataset
is best suited to the Two-Class Decision Jungle.
So my name is Phuc Duong with Data Science
Dojo, and happy modeling.
I’ll see you guys later.

The experiment can be found here:
Kaggle Titanic Experiment

Full Kaggle Competition Series:
Kaggle Competition Series

More Data Science Material:
[Video] Solving the Kaggle Competition in R Part 1
[Blog] Kaggle Competitions and Data Science Portfolios


About The Author
Phuc H Duong holds a Bachelor's degree in Business with a focus on Information Systems and Accounting from the University of Washington.
