Introduction to Kaggle – My First Kaggle Submission

As an introduction to Kaggle and your first Kaggle submission, we will explain what Kaggle is, how to create a Kaggle account, and how to submit your model to the Kaggle competition. This is your first homework assignment for our Data Science Bootcamp.

Hello, all right, my name is Phuc Duong.
I’m the senior data engineer of Data Science Dojo,
and I’m here to walk you through day two’s homework.
I hope you’ve been enjoying the boot camp so far.
OK, so quickly just go to portal.datasciencedojo.com.
All the homework is laid out there for you.
If you prefer video, this video walks you
through the homework.
But just so you know, the homework
is explained in detail in both places.
So there are two parts of the homework.
The first part of the homework is
to apply what you learned today.
So basically take the Titanic dataset
and apply a predictive model to it.
So go ahead and use rpart.
Or if you’ve gotten to random forest,
go ahead and use a random forest model.
That’s the first part of homework.
The second part of the homework is to actually enter
into a Kaggle competition.
So both parts of the homework are explained here.
I’m just going to talk and show you
how to do that Kaggle competition real quick.
OK, so this really is your data science capstone project
for this course.
So by the end of Friday, basically
you’ll be working from Tuesday all the way to Friday
to perfect your model, and then you’ll
be ranked among your peers.
Your peers being basically everyone at boot camp.
And then there are prizes on the line,
so I’ll talk more about the prizes later.
But for now, this whole page talks you
through how to create a Kaggle account, how to submit,
and how to do all that good stuff.
I will also talk you through that here.
OK, so what you want to do is you want
to Google Kaggle Titanic, OK?
So notice that we’ve actually entered you into a Kaggle
competition since day one.
So that Titanic dataset actually comes
from this Kaggle competition.
And what is Kaggle?
Well, Kaggle is a crowdsourced way of doing data science.
So real companies like Home Depot, Liberty Mutual,
Allstate, Netflix, they come together and post
real datasets.
And with each of these real datasets, there is a data mining problem.
And you’re ranked among your peers
as you do these data mining problems on what
are called leaderboards, OK?
So the Titanic competition is basically
the introductory Kaggle competition homework
that we’ll do together.
And then, if you notice, if you go to Data in this thing,
there are a bunch of data sets that are associated
with this Kaggle competition.
So if you notice here, there is a train.csv,
and let me tell you what that is real quick.
So you notice that, throughout this Kaggle competition,
you’ve been given this data set with 891 rows, right?
This is the training set.
This is the set that you’ve been working with,
although some of you should have been kind of suspicious
if you’ve been paying attention to history.
The Titanic boat actually housed about 2,000 people,
yet we only have 891 passengers.
I wonder where the rest of the other passengers went?
Well, it turns out Kaggle actually
is withholding the other passengers in this test set.
So your homework is actually to basically build
a predictive model.
Your capstone is to build a predictive model
on this training set and to apply it to this test set.
So I’m going to go ahead and download this test set.
We can see what is inside of it, OK?
So if you open up this test set, you
will notice that the passenger ID starts at 892.
So these are the remaining passengers
that were on the Titanic.
But you’ll also notice that we have one fewer column.
You notice that Survived is now missing.
That is your job, OK?
You’re supposed to predict whether or not
these people will survive or die.
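The difference between the two files can be checked in a few lines. This is a minimal sketch, not part of the original walkthrough: the two CSV snippets are tiny made-up stand-ins for train.csv and test.csv (only a handful of columns, illustrative values), and the helper name `header` is hypothetical.

```python
import csv
import io

# Tiny stand-ins for train.csv and test.csv (assumed, illustrative values only).
TRAIN_CSV = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
"""
TEST_CSV = """PassengerId,Pclass,Sex,Age
892,3,male,34.5
893,3,female,47
"""

def header(csv_text):
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))

train_cols = header(TRAIN_CSV)
test_cols = header(TEST_CSV)

# Columns present in the training set but absent from the test set.
missing_in_test = [c for c in train_cols if c not in test_cols]
print(missing_in_test)  # ['Survived']
```

With the real files you would read the headers from disk instead of from strings; the comparison itself stays the same.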
So notice that that is all the Kaggle competition is.
Kaggle wants your answers.
They want to know whether or not individual passengers lived
or died.
For example, passenger 897, did they
live or die based upon these demographical conditions that
are going to be read in by your predictive model?
So I’m going to show you real quick how
to submit to Kaggle, just for the purposes of
an introduction.
So for tonight’s homework, you don’t
need to hook up a predictive model and submit it to Kaggle,
you just need to submit something.
And I’m going to show you how to submit.
So Kaggle wants two things from you.
It wants passenger ID, and it wants basically survived.
Did the person that is corresponding with that
passenger ID live or die?
So Kaggle just wants two columns from you.
So the other columns that are here are irrelevant,
so we’re going to delete them.
So Kaggle wants a column called PassengerId,
and notice that the I is capitalized
and the P is capitalized, and it’s one word.
And it also wants a column called Survived.
Notice that it’s past tense, and there’s a capital
S. Kaggle will check for that.
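Because the column names have to match exactly, it can help to check the header yourself before uploading. The sketch below is a local sanity check only; it is not how Kaggle validates files, and the function name `header_is_valid` is hypothetical.

```python
import csv
import io

# The two column names described above, with exact capitalization.
EXPECTED_HEADER = ["PassengerId", "Survived"]

def header_is_valid(csv_text):
    """Return True if the first row of the CSV text is exactly the expected header."""
    rows = csv.reader(io.StringIO(csv_text))
    return next(rows, None) == EXPECTED_HEADER

good = "PassengerId,Survived\n892,0\n"
bad = "passenger id,survive\n892,0\n"   # wrong case, extra space, wrong tense

print(header_is_valid(good))  # True
print(header_is_valid(bad))   # False
```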
And we’re going to build a very simple model, a model where
everyone dies.
So you notice that, if everyone dies,
then this is basically
not even a predictive model.
We’re just going to say, if you step on a boat you will die.
But notice that, if you remember from day one
when we did exploration, when we looked at the class
distribution of survived versus died,
we noticed that there was about a 62% chance of death
just by stepping on the boat.
So actually by saying everyone died,
we would have a statistical likelihood
of doing better than a coin flip, doing better than 50%.
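The arithmetic behind that baseline can be sketched as follows. The class counts used here (549 died, 342 survived, 891 rows in total) are an assumption about train.csv that matches the roughly 62% figure above; with the real file you would count the Survived column instead.

```python
from collections import Counter

# Assumed class counts for the 891-row training set: 0 = died, 1 = survived.
labels = [0] * 549 + [1] * 342

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the majority class on this same data.
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 3))  # 0 0.616
```

This is the accuracy on the training labels; the score on the withheld test set can differ somewhat.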
So I’m going to go ahead and say everyone dies here,
and I’m going to save that as a CSV.
So I’m going to go and save this as my own model, everyone
dies.csv, and I’m going to save that.
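The same file can be produced without a spreadsheet. This is a minimal sketch under stated assumptions: the ids are assumed to run from 892 to 1309 (in practice you would read them from test.csv), the function name and the file name `everyone_dies.csv` are arbitrary, and the file is written to the system temp directory only to keep the example self-contained.

```python
import csv
import os
import tempfile

def write_everyone_dies(path, passenger_ids):
    """Write a two-column submission that predicts Survived = 0 for every id."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["PassengerId", "Survived"])
        for pid in passenger_ids:
            writer.writerow([pid, 0])

# Assumption: the test set's passenger ids run from 892 to 1309 inclusive.
ids = range(892, 1310)
out_path = os.path.join(tempfile.gettempdir(), "everyone_dies.csv")
write_everyone_dies(out_path, ids)

# Read the file back to confirm its shape.
with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows[0], rows[1], len(rows) - 1)
```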
All right, and what you need to do
is you need to go to Kaggle and upload this file.
So go ahead and make a submission.
There’s a Make Submission button here.
So click on Make Submission, and then go ahead
and we’ll upload a submission in here.
So everyone dies.csv, and we’ll go ahead and submit that.
All right, so notice that
we don’t even give Kaggle our predictive model.
We just give Kaggle the answers.
That makes it so we can build a predictive model in Python,
Azure ML, it doesn’t matter.
It is now platform agnostic.
They only care about your answers.
And notice that we are just submitting predictions
to Kaggle.
And Kaggle is actually going to score this.
And Kaggle is actually going to be able to give you
an accuracy out of this.
That’s because they actually hold the true labels.
Kaggle actually knows whether or not the person lived or died.
And if you remember from evaluation,
if you compare predicted versus actual,
you’ll get a confusion matrix.
So you submitted predictions.
Kaggle has the actual.
From that, Kaggle builds a confusion matrix.
From the confusion matrix, you get accuracy.
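The chain from predictions to accuracy can be sketched in a few lines. The ten "actual" labels below are made up for illustration, since Kaggle keeps the real ones hidden, and the function names are hypothetical. This shows the general technique, not Kaggle's scoring code.

```python
def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for binary labels 0 and 1."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

def accuracy(matrix):
    """Fraction of cases where the prediction matches the actual label."""
    correct = matrix[(0, 0)] + matrix[(1, 1)]
    return correct / sum(matrix.values())

# Made-up labels for ten passengers: 0 = died, 1 = survived.
actual = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1]
predicted = [0] * len(actual)   # the "everyone dies" model

matrix = confusion_matrix(actual, predicted)
print(matrix)            # {(0, 0): 6, (0, 1): 0, (1, 0): 4, (1, 1): 0}
print(accuracy(matrix))  # 0.6
```

Every survivor lands in the (1, 0) cell because this model never predicts survival, so its accuracy equals the share of passengers who died.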
And notice that it spits out an accuracy,
and says my submission got a 62% accuracy,
and I rank 5,517th in the world.
All right, so the capstone here is basically
we’re going to enter all of you guys into a Kaggle competition
within the class, OK?
And to enter yourself into this Kaggle competition,
save the name that appears on the Kaggle leaderboard.
So notice that I’m Phuc H Duong,
so I’ll save my username as Phuc H Duong.
And then I’ll go ahead and go back to that Kaggle submission
homework, and I’ll paste it into this form down here.
So this form down here will actually
go ahead and enter your Kaggle username
into our internal leaderboard.
And on Friday, after lunch, we’re
going to end the Kaggle competition wherein
the first place winner, basically the person that ranks
highest, will get a prize.
The prize will be an advanced statistical R book,
and it’s a very good book.
If you want to do some of these extra advanced data mining
processes in R, that’s in there.
And notice that we only can teach you so much.
That book actually contains a lot of the other stuff
that we couldn’t teach you.
For example, there is actually more than one way
to cross validate, right?
We taught you just k-fold cross-validation,
but there’s also leave-one-out cross-validation, right?
So there are four other ways to cross-validate
that we were not able to cover in class,
and that book covers that.
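To make the two named schemes concrete, here is a minimal sketch of how they split row indices. The function names are hypothetical, the folds are contiguous and unshuffled for readability, and real work would typically shuffle the rows first or use a library routine.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    # Spread any remainder over the first n % k folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def leave_one_out_indices(n):
    """Leave-one-out is k-fold with k equal to the number of rows."""
    return k_fold_indices(n, n)

for train, test in k_fold_indices(6, 3):
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each row is held out exactly once in both schemes; leave-one-out simply takes that to the extreme of one row per fold, at the cost of training the model once per row.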
And then the second and third place winner
will get an O’Reilly book called Doing Data Science.
I also really enjoy that book.
I was raised by O’Reilly, and hopefully you will be as well.
OK, and more importantly, yes I know you can buy these books,
I know you can go ahead and just kind of pass this off,
but this is really important.
You want to do this Kaggle competition
and be able to ask the instructor
questions while you’re still in class, right?
Because there’s actually a lot of minute little steps
to go along the way here that might basically cripple you
when you go back to work and you try
to work on your own Kaggle competition
or your own datasets, OK?
But more importantly, your honor is on the line.
You have to defend your honor.
And you will get big bragging rights from all of this, OK?
All right, now, happy modeling.

Data Set Used:
Titanic Data Set

More Data Science Material:
[Video] Creating a Titanic Model in R Part 1
[Video] Creating a Titanic Model in Azure
[Blog] Getting Started with Kaggle Competitions


Phuc H Duong
About The Author
- Phuc holds a Bachelors degree in Business with a focus on Information Systems and Accounting from the University of Washington.
