The Kaggle competition requires you to create a model out of the Titanic data set and submit it. We will show you how you can begin by using RStudio. This Kaggle Competition in R series gets you up to speed so you are ready for our data science bootcamp.
My name is Phuc Duong, and I’m here
to show you how to do the Kaggle Competition in R. OK,
so the first thing you want to do
is, you want to quickly go to Google
and just type in Titanic Kaggle.
So this goes directly to the Titanic competition on Kaggle.
Now we want to grab our data files that we’re
going to be working with.
So we have the train set, that’s going to be
our supervised labeled data.
And then we have the test set, so that is our blind holdout data
set where we don’t know if they lived or died.
So I’m going to save this to a working folder called Kaggle,
so all of my submissions and my project files
are going to be in this folder.
So I’m going to save it to there.
I’m also going to save the test set.
So that’s the first thing I’m going to do.
So I have two data files.
So the training set is the one I’m
going to use to build a predictive model on.
And then the test set is basically
the one I’m going to score.
All right, so let’s do this real quickly.
So I have R open, so this is RStudio.
And I’m just going to do the rest of this in RStudio.
So the first thing I want to do is,
I want to set a working directory.
I can do a setwd or, if I’m in RStudio,
I can go to Session, and then Set Working Directory,
Choose Directory, and then from that I can go to my Kaggle
folder and select that as my folder.
So you notice that it automatically
typed in set working directory for me.
So I’m going to be typing this as a script, basically.
So if you guys remember, we can just
save things, and then paste them,
and execute them as needed.
Hold on, I need to make a new R script.
So the next thing is, now that I have a working directory,
now that it shows that I’m in the folder
that I want to work with, now I can do read.csv.
So I can read in these data sets now.
So I have two data files, and I’m
going to read them both in separately.
So on one hand, I have the training data.
So I’ll call that titanic.train.
And I’ll do a read.csv of it where the file will
equal, in this case, train.csv.
And I also want to do something special here,
so I want to make sure that what’s called stringsAsFactors is set to false.
So by default, read.csv is going to read in a data file,
build a data frame out of that, and then convert all
the strings into categories.
And we don’t want that, because we
want to do some manipulation.
If we want to do some manipulation of the factors,
that’s also going to be a thing.
And also, I’m going to use a different methodology, in which
I am actually going to combine these two data files together.
And if you’ve ever worked in R before, you’re
going to know very quickly that to combine two data
files together, the factors basically
have to line up perfectly.
And then, just to basically remove
that as a barrier, what we’ll do is, we’ll keep them as strings.
Once we combine the two data files together,
then we can go ahead and cast them
as factors when we need to.
And then, just as a style thing, I
like to do header equals true.
Now, by default, read.csv has header equals true,
but it’s just a good habit that I’ve
gotten into, because I sometimes do read.table.
And sometimes I get confused, and then my models break
because it read in the first line as a header or it didn’t.
So it read in 891 rows of that.
And I just want to really quickly check
the tail of this file, so titanic.train.
I’m checking the tail of the file,
because if there was an error on the read
in later down on the file, it will ripple through and error
through the rest of the file.
So that’s why I do a tail.
So if the last line seems OK, then
that means the rest of the file is probably OK.
So for now, it seems that it read in correctly.
So I’m going to go ahead and do that for the other one as well,
but instead of saying titanic.train,
I’m going to go ahead and do titanic.test, so titanic.test.
And then I’ll do a read.csv of test.csv.
So we’re going to go ahead and execute that.
And I’m executing the line from the script
by just using Control-Enter here from R Studio.
So notice that the minute I press Control-Enter,
it copies the code and executes it in the console.
So notice that I have 418 observations of the test set.
So these are the things that we’re trying to predict on.
So if I do an str of titanic.test,
we should find that survived is missing.
So I have between PassengerId and Pclass,
there should have been survived, but there is no survived.
So that’s kind of what we intended for.
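The two read-in steps described above might look like the sketch below. So that it runs anywhere, it first writes two tiny made-up files into a temporary folder; those rows are toy values, not the real Kaggle data, and the folder name is hypothetical. On your machine you would point read.csv at the train.csv and test.csv you downloaded. Also note that since R 4.0 stringsAsFactors already defaults to FALSE, so spelling it out mainly matters on older versions.

```r
# Toy stand-ins for the Kaggle files (made-up rows, NOT the real data).
data.dir <- tempfile("kaggle")   # hypothetical working folder
dir.create(data.dir)
writeLines(c("PassengerId,Survived,Pclass,Name,Sex,Age",
             "1,0,3,\"Doe, Mr. John\",male,22",
             "2,1,1,\"Roe, Mrs. Jane\",female,38"),
           file.path(data.dir, "train.csv"))
writeLines(c("PassengerId,Pclass,Name,Sex,Age",
             "3,3,\"Poe, Mr. Jack\",male,",
             "4,2,\"Moe, Miss. Ann\",female,27"),
           file.path(data.dir, "test.csv"))

# Read both files, keeping text columns as plain strings.
titanic.train <- read.csv(file = file.path(data.dir, "train.csv"),
                          stringsAsFactors = FALSE, header = TRUE)
titanic.test  <- read.csv(file = file.path(data.dir, "test.csv"),
                          stringsAsFactors = FALSE, header = TRUE)

tail(titanic.train)   # a clean last row suggests the read went well
str(titanic.test)     # note that Survived is absent from the test set
```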
And now I’m going to go and try to combine these two
files together, because I want to clean them together.
Now there’s nothing stopping you from cleaning them separately,
you just have to run each function twice.
So if you clean with the median on the training set one way,
you have to find out what the median is and then reuse it.
So if the median was 29 on the training set,
you have to clean it the same way on the test set.
So then you have to save the 29 and then
insert that as a hard-coded value into the test set.
But if I combine them together, I
can actually get the global median.
So the global median actually might
be different from the actual non-global median.
So what we’ll do is, actually, if you type in median,
just to figure out what the median is real
quick of titanic.train of, let’s see, for example, Age.
And I think I have to do na.rm is equal to true.
So basically, it’s going to calculate the median
in the absence of missing values.
So the missing values, I think there were 177 missing values.
So it’s gone ahead and calculated the median
for basically the non-missing values, so 28.
So that is the median of the training set.
Now the median of the test set might be something different.
So let’s just try that real quick,
and hopefully it will be different.
So notice that the medians of the two data sets are different.
So when I combine them together, there
might be a third different median as well, what’s
known as the global median.
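To make the three medians concrete, here is a small sketch with made-up ages (the vector names and values are illustrative only, not taken from the Titanic files). It shows the na.rm argument and how the combined, or global, median can differ from either set on its own.

```r
# Toy ages (made up); NA marks a missing value.
train.age <- c(22, 38, NA, 35, 28)
test.age  <- c(34, NA, 47, 62, 27)

median(train.age)                 # NA, because of the missing value
median(train.age, na.rm = TRUE)   # median of the non-missing values only
median(test.age, na.rm = TRUE)

# The "global" median comes from both sets stacked together.
global.median <- median(c(train.age, test.age), na.rm = TRUE)
global.median
```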
So now I need to combine these two data files together.
So when I combine these two data files together, and later on I
want to split them apart, I’ll need a way
to differentiate whether or not it is part of the training set
or part of the test set.
Now one can very easily just do if PassengerId
is greater than 891 or not.
But you want to get in the best practice in case
there is no PassengerId in your work data set.
So normally, what I like to do is, I like to create my own column.
And then I’d like to build it into true or false.
So we’ll build a column by which we’ll
mark whether a row is part of the train set.
And we’ll mark it true for the train set
and false for the test set.
So in this case, I would do a titanic.train.
And we’ll build a brand new column that doesn’t exist,
and we’ll call it isTrainSet.
So R is going to go ahead and create a column for us.
And we’re going to fill that entire column with true.
So we’ve gone through it, and we’re
going to mark 891 rows here to be true.
So if I go to the tail of titanic.train of–
actually, titanic.train of isTrainSet,
you will notice that everything is true.
And we’ll do the same thing for the test set.
However, we’ll flip it, and we’ll say it’s false.
So later on, we’ll just do a very quick check
to see if it’s true or false.
And we’ll split it that way.
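The flag column described above can be sketched like this, using two toy data frames in place of the real train and test sets (the rows are made up for illustration).

```r
# Toy frames standing in for the real train and test sets.
titanic.train <- data.frame(PassengerId = 1:3, Age = c(22, 38, 26))
titanic.test  <- data.frame(PassengerId = 4:5, Age = c(34, 47))

# Assigning to a column that does not exist yet creates it;
# the single value is recycled down every row.
titanic.train$isTrainSet <- TRUE
titanic.test$isTrainSet  <- FALSE

tail(titanic.train$isTrainSet)
tail(titanic.test$isTrainSet)
```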
So now that I have my label set on both the training
set and the test set, I can now go ahead and combine them.
However, when I combine them together,
they actually have to line up.
So if I do an ncol of titanic.train,
I will find that there’s 13 columns.
And then if I do an ncol of titanic.test,
I’ll notice that there’s 12 columns.
So you notice that already there is one column missing.
So first of all, I have to make sure
that the Survived column exists in the test set.
And then secondly, I have to make sure
that the column names, the headers, line up.
Lining up meaning they’re spelled exactly the same.
And let’s see, if I do names of titanic.train,
and if I do names of titanic.test,
we can go ahead and compare these two.
Check the spelling and the capitalization, so basically the capital S in Sex.
Now it just so happens that they are the same.
You can check it if you want to, but I
know that they’re the same.
So yes, we don’t have to go ahead and line up
the headers in this case, but we do
have to add a column called Survived to the test set.
The test set doesn’t have that.
So basically, to add a Survived column, what we’re going to do
is, we’re going to do titanic.test and call Survived.
So notice that there is no column called Survived.
We’re calling a column that doesn’t exist right now.
But R is smart enough to create the column if you
call a column that doesn’t exist and fill it with something,
in this case, we’re going to fill it with NA.
So basically, in the test set, we’re
going to build an entire column, and we’re
going to fill that entire column with NA,
so basically 418 rows of NA for Survived.
So we’re going to go ahead and execute that.
So now if I do ncol of titanic.test, we’ll get 13.
And if I go with names of titanic.test,
we’ll go ahead and see that there is a column now called Survived.
And the next thing is– now we’re
going to go ahead and combine these two data sets together
into one now that the headers have lined up
and the columns have lined up.
So I’m going to call this new data set called titanic.full.
I’m going to go ahead and do what’s called an rbind.
We’re going to do a simple row bind.
If you’re more used to SQL, this is called a union.
So basically, we’re going to take the first data
set, titanic.train, and then we’re
going to merge it with the second one, titanic.test.
We’re just basically going to do a vertical join on these two data sets.
So titanic.full has 1,309 rows.
So just to check the math on that–
891 plus 418 rows is 1,309.
So that’s fine, so no rows got skipped.
And I also want to check the tail of this file just
to be sure that everything came out fine.
So you notice that there, at the tail of the file,
isTrainSet is all set to false, correct?
This is ordered right now.
So just to double check real quickly–
if I do a table of isTrainSet, I should
have 891 trues and 418 falses.
So let’s do that real quick.
So if you do a table of titanic.full of isTrainSet,
I can find that there’s 891 true and 418 false.
This is exactly what I want, so later on I can split them apart again.
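Here is a minimal sketch of the combine step, again on toy rows rather than the real files. It adds the missing Survived column to the test frame, stacks the two frames with rbind, and then repeats the row-count and table checks.

```r
# Toy frames (made-up rows) that already carry the isTrainSet flag.
titanic.train <- data.frame(PassengerId = 1:3, Survived = c(0, 1, 1),
                            Sex = c("male", "female", "female"),
                            isTrainSet = TRUE, stringsAsFactors = FALSE)
titanic.test  <- data.frame(PassengerId = 4:5,
                            Sex = c("male", "female"),
                            isTrainSet = FALSE, stringsAsFactors = FALSE)

ncol(titanic.train)   # one more column than the test set
ncol(titanic.test)

# Add the missing label column, filled with NA.
titanic.test$Survived <- NA

# rbind on data frames matches columns by name, so the column order
# may differ, but the names themselves must be identical.
titanic.full <- rbind(titanic.train, titanic.test)

# No rows should be lost, and the flag should count both groups.
nrow(titanic.full) == nrow(titanic.train) + nrow(titanic.test)
table(titanic.full$isTrainSet)
```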
The next thing is, there are some missing values.
So if I quickly do a table of titanic.full
of Embarked, I can see that there
is a category of basically an empty string.
So there’s C, Q, S, and an empty string.
So I’ll go ahead and clean the missing
values of this real quick.
Now what I’m going to show you is not,
I would say, the optimal cleaning method.
It is a cleaning method, but we need to clean the data in a way
so that we can go ahead and build a model.
Because the model is not going to like us
having basically null values.
I’ll quickly build a filter, so titanic.full of Embarked
is equal equal to this double quote, double quote, an empty string.
So if I run this, I’ll get a series of true or false.
And in there, somewhere, should be two trues.
I want to query just the Embarked column.
So of titanic.full, I’m going to query the rows where
Embarked is equal to the empty string, and I only
want Embarked to come back.
So I should get 2 values back, and they should be, basically,
and let’s just run it real quick, they should be empty.
So once I’ve selected these two values,
I’m going to replace them with something.
So it just so happens that if I do table of titanic.full
of Embarked– and if I do a table of that again, let’s
figure out what the mode is and just replace it with the mode.
That’s the quick and easy way of doing it.
So in this case, it’s S, I’m going
to replace it with S. So with that,
let’s see if that went ahead and did exactly what we wanted.
So if I do a table again, the empty strings should be gone now.
So you notice that those two empty values have been added to S now.
So the S before had 914, now S has 916.
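The Embarked clean-up can be sketched as below on a toy column. One difference from the walkthrough, which simply types in S after reading the table: this version looks the mode up from the data, and the variable names counts and embarked.mode are my own.

```r
# Toy column (made up) with two empty strings standing in for missing ports.
titanic.full <- data.frame(Embarked = c("S", "C", "", "S", "Q", "", "S"),
                           stringsAsFactors = FALSE)
table(titanic.full$Embarked)

# Find the most common non-empty value instead of hard-coding it.
counts <- table(titanic.full$Embarked[titanic.full$Embarked != ""])
embarked.mode <- names(counts)[which.max(counts)]

# Filter the rows where Embarked is empty and overwrite just that column.
titanic.full[titanic.full$Embarked == "", "Embarked"] <- embarked.mode
table(titanic.full$Embarked)
```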
I think age had missing values.
So if I do an is.na of titanic.full of Age,
I should get a pretty sizable amount there.
So if I do a quick table of this,
I should get a count of trues versus falses.
And we can see that there is 1,046 false and 263 true,
so there is a lot of missing values.
So I think in the training set alone, there’s 177.
So the test set also brought with it another 86.
So that tells me that about a fifth of the 418 rows in the test
set had missing values of Age.
So how we clean Age is actually going
to become extremely important.
But I’ll let you do that as your homework
for day four, which is going to be tomorrow.
So but for now, let’s just replace it with the median.
This is going to be a quick and easy way of doing it.
So this right here represents a query of true-falses.
So that is already our filter.
The next thing we want to do is, we only
want to query the column Age.
Now we want to query which data set? titanic.full.
And we’re going to really quickly just replace everything with the median.
So before I run this query, I’m actually
going to define what the median is.
So in this case, the median will be,
if you remember, if I just do a quick median of Age.
If I ran it right now, it would come back as NA,
because there’s missing values.
So I would have to do na.rm.
So I’m telling it to do the calculation of median
in the absence of missing values.
So notice that it’s also 28.
So luckily, the global median is the same as the median
of the training set.
So I’m going to assign that to basically age.median.
I’m going to assign that to a variable.
Now I could have just very simply,
when I’d done this query of the missing values of Age,
so notice, I get, I think, these 400 rows back,
or 200 something rows back, where Age is missing.
Now I could have very easily just said,
replace everything with 28.
I didn’t have to go through basically this long calculation.
But if we’re in the process of writing
a script that does an automated process,
the median of tomorrow might be different than the median of today.
So I want to build my script in a very extendable manner, so
basically that’s why I use this variable here
that was calculated elsewhere from the data itself.
So if I run this line, it’s going
to go ahead and calculate the median for me
and then insert that into the missing values of Age.
So if I run this query again, this query
again finds missing values.
There should be true-falses–
basically trues where there is a missing value.
So if I do a table right now of is.na of titanic.full of Age,
there should be no trues.
Basically all the missing values have been filled in.
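The Age step above, sketched on a toy column (values made up): count the missing values, compute the median from the data, fill it in, and count again.

```r
# Toy ages with two missing values.
titanic.full <- data.frame(Age = c(22, NA, 38, 26, NA, 35))
table(is.na(titanic.full$Age))   # how many are missing before cleaning

# Compute the median from the data rather than typing in a number.
age.median <- median(titanic.full$Age, na.rm = TRUE)

# is.na(...) is the row filter, "Age" is the column being overwritten.
titanic.full[is.na(titanic.full$Age), "Age"] <- age.median

table(is.na(titanic.full$Age))   # no TRUE entries should remain
```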
There is another column that has missing values,
so that needs to be done.
So if I do a titanic.full of Fare,
if I run that real quick, that gives me the fare,
but is.na will give back true and false, true where it’s NA.
And if I wrap that entirely in a table,
I should get the number of trues and falses.
Notice that there is one missing value of Fare.
So we’ll go ahead and fill in real quickly.
So let’s just do the median again.
So day four, your homework is to build a predictive model that
uses regression to actually predict
the missing value of Fare.
But for now, let’s just do it with the median.
So we’ll grab this line that we wrote up here
where we calculated the median, but instead of Age, we’ll
just go ahead and say the median of Fare and call it fare.median.
So we’ll figure out what the median of fare is.
So if I go to fare, the median fare is $14.45.
So I’ll go ahead and also do this replacement strategy here.
So I’m going to just basically copy and paste
the code I had above except change
everything from Age to Fare.
And then we’ll do fare.median instead here.
So it’s going to replace everything missing a fare.
So I’m going to push up a couple of times
to go back to where I queried how many missing values of Fare there were.
I’m going to run that query again, and notice
that that one true is missing.
That one true has been replaced.
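Since the Fare step is the Age step with the column name swapped, one possible variation is to wrap the pattern in a small function instead of copying and pasting. The helper name fill.with.median is hypothetical and not part of the walkthrough, and the rows below are made up.

```r
# Hypothetical helper: fill the NAs of one numeric column with its median.
fill.with.median <- function(df, column) {
  col.median <- median(df[[column]], na.rm = TRUE)
  df[is.na(df[[column]]), column] <- col.median
  df
}

# Toy data with one missing Age and one missing Fare.
titanic.full <- data.frame(Age = c(22, NA, 38), Fare = c(7.25, 71.28, NA))
titanic.full <- fill.with.median(titanic.full, "Fare")
titanic.full <- fill.with.median(titanic.full, "Age")

table(is.na(titanic.full$Fare))   # the one TRUE should be gone
```

The copy-and-paste version in the walkthrough does the same thing; a helper just keeps the two steps from drifting apart if you change one later.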
So now we are ready to build a predictive model.
But before we can even do that, we
need to go ahead and split our data back out into train
and basically test set.
So if you remember, we can do a query now.
So titanic.full of isTrainSet is equal equal to true.
This represents a query that will find us
basically the train set, the 891 rows.
So if I do titanic.full, I can throw this data back into titanic.train.
So with any luck, this should give me back what I wanted,
891 rows, but this time it has been cleaned.
So I’m going to add that in the script up here.
And then I’m going to do the same thing for the test set,
so the test set, same query except false.
Now, for you programmers out there,
I also could have just flipped the trues to falses
by just adding an exclamation here, using a not operator.
But if you don’t understand what that means, ignore it,
just say is equal equal to false.
And instead of saying train here, I’ll say test.
So I’ll throw this into the test set.
So I’ve gone ahead and run that.
Actually, before we did
that, we should have cast everything
we needed to into categories.
Well, I forgot to do that.
So there were a couple of things if I go back to titanic.full.
So notice, because it’s in a script, I can rerun this later.
So I’m actually going to insert some lines in here
where I’m going to do categorical casting.
So I’m going to do some commenting here.
So this will say, “split data set back out into train and test.”
And this was “clean missing values of fare.”
And then going forward in that, we’re going to also,
before we do that, we’re going to do categorical casting.
Now we’re going to do categorical casting
for every column except Survived.
Because if we do categorical casting of Survived now,
there’s actually three categories in Survived.
So let me just show you, so titanic.full of Survived.
And then I think there should be a bunch of NAs.
I think something went wrong here.
Yeah, there was a bunch of NAs at the very end.
So if we did a categorical casting now,
we would lose the binary classification.
We would actually have three classifications, an NA, a 0,
and a 1.
So what we need to do is, actually
we need to cast everything else except Survived.
So if I quickly do an str of titanic.full,
we will see the columns that we have at our disposal.
So basically we’re going to do as.factor,
we need to do that no matter what.
And I’m just going to copy this, because we’re
So as.factor titanic.full– and actually, while I’m here,
I might as well just copy that as well.
I think Pclass needs to be a category.
And by the way, you should also convert Pclass
into an ordinal category.
But I’ll let you guys figure out how
to do that, you have to pass it the order.
Pclass is going to need that.
Next thing is, we will also cast Sex into a factor.
And then if we go down further, Embarked
should definitely be a factor.
Now also keep in mind, there could
be a case made for SibSp, sibling-spouse,
and Parch, parent-child, to be ordinal categories.
I’ll let you guys experiment with that
to see if that improves the performance of the model.
And also, we can’t just cast them into factor,
we have to actually assign them back into the data itself.
So I’ve got to basically take the same column
and basically assign it back into itself.
So after the factor has been casted,
I will go ahead and load it back in.
So I’m going to run these three lines
to do my categorical casting.
So a titanic.full– str of that, so it
gives me the structure of this.
So notice that now Embarked is a factor.
And there’s only three levels, before there
would have been four levels, the fourth level being the missing value.
And then notice that we have gender, Sex, as a factor as well.
Now we can run these two lines again
and split the data back up.
So I’m going to rerun this, and then it’ll split again.
So if I do an str of titanic.train now,
we’ll notice that the factors have
been retained in the same order and in the same types.
So that’s nice, that’s what we wanted.
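The casting and the split can be sketched together on a toy combined frame (rows made up). Casting happens on the combined data so that both halves end up with identical factor levels.

```r
# Toy combined data: three train rows and two test rows.
titanic.full <- data.frame(
  Survived   = c(0, 1, 1, NA, NA),
  Pclass     = c(3, 1, 2, 3, 1),
  Sex        = c("male", "female", "female", "male", "female"),
  Embarked   = c("S", "C", "S", "Q", "S"),
  isTrainSet = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# as.factor returns a new vector, so assign it back to the column.
titanic.full$Pclass   <- as.factor(titanic.full$Pclass)
titanic.full$Sex      <- as.factor(titanic.full$Sex)
titanic.full$Embarked <- as.factor(titanic.full$Embarked)

# Split back out; the factor levels travel with each piece.
titanic.train <- titanic.full[titanic.full$isTrainSet == TRUE, ]
titanic.test  <- titanic.full[titanic.full$isTrainSet == FALSE, ]
# Equivalent, using the not operator: titanic.full[!titanic.full$isTrainSet, ]

str(titanic.train)
```

Note that the toy test rows contain no "C" in Embarked, yet the level is still there after the split, which is the point of casting before splitting.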
So now we have a situation where we
can build a predictive model.
But before we can build a predictive model,
you guys remember, we have to cast Survived into a category.
So if we go back into here, if we
do titanic.train of Survived.
So we need to cast this into a factor, so as.factor.
And notice I’m doing this after I’ve split apart
my dataset, because I don’t want that NA to be inside
of this Survived category.
So titanic.train Survived here.
So that’s going to cast my Survived into a category, which
is going to tell me that, yes, this
is going to be a binary classification problem.
It’s not going to try to do regression.
It’s not going to try to do a multiclass classification either.
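A small sketch of why the label is cast only after the split, using toy labels (made up). Counting NA as its own value, the combined column holds three distinct values, while the train-only labels give a clean two-level factor.

```r
survived.full  <- c(0, 1, 1, NA, NA)   # toy labels; the NA rows are the test set
survived.train <- survived.full[!is.na(survived.full)]

# With NA counted, the combined column shows three distinct values...
table(survived.full, useNA = "ifany")

# ...while the train-only labels become a two-level factor,
# which is what signals a binary classification problem.
survived.train <- as.factor(survived.train)
levels(survived.train)
```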
Now I’m going to show you something new.
So up until this point in the boot camp,
you’ve probably just dropped the columns you didn’t need
when you built your predictive model.
Now I’m going to show you how to mark, for R, which ones are
predictors and which ones to ignore when you
send it in to a predictive model.
This is useful because for our test set,
we need to keep PassengerId.
Yeah, we need to keep PassengerId.
So instead of dropping, because we can’t drop PassengerId,
I’m going to show you how to basically define the predictors.
So in this case, we’re going to actually build a formula.
The formula, if you guys remember,
the normal formula was just, if you remember, randomForest.
So that was the initial formula, basically.
We told it to use everything except
Survived to predict Survived, and that required us to drop
everything we didn’t need.
And that’s not the form that we want.
So now I will just simply do–
I will explicitly call out which columns
I want to build a predictive model out of.
So if I do basically str of titanic.train,
I can then find out what my predictors need to be.
So notice, I’m not going to use PassengerId,
but Pclass is going to be one of the ones that I want to use.
So basically, we’re going to work with a string.
And we want to tell it, we need to predict Survived.
So actually, I’m going to copy paste,
that’s a much safer way here.
I don’t want to mess up the column names,
because if I mess up the column names,
I’m going to have to debug it.
So basically Pclass plus–
I’m going to tell it to use Pclass,
I’m going to tell it to use Sex, I’m
going to tell it to use Age.
This plus sign is required because it tells it to use it.
Age, then I’m going to tell it to use SibSp, siblings
and spouses, tell it to use, oops, need a plus sign
here, plus sign again, Parch.
We need to tell it to also use Fare.
And we can’t use Cabin now, and we need to use Embarked.
So that is our survived.equation.
So we’ll go ahead and say that this will be our equation
before we send it into the model.
But we also need to cast that as a formula, because R expects
this to be in a format called a formula, it’s a data type.
So as.formula will take the survived.equation.
And then we’ll assign it to a new variable, survived.formula.
So that’s going to build us basically
a set of relationships– predict Survived given these columns.
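The string-to-formula step can be sketched as follows, using the column names called out above. No data is needed to build the formula itself.

```r
# The relationship as plain text: the label on the left of the tilde,
# the predictors joined by plus signs on the right.
survived.equation <- "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"

# Cast the string into R's formula data type.
survived.formula <- as.formula(survived.equation)

class(survived.formula)      # "formula"
all.vars(survived.formula)   # the label first, then the predictors
```

Because PassengerId, Name, Ticket and Cabin are not named in the formula, they can stay in the data frame without being used as predictors.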
So now I can go ahead and do install.packages of, let’s say, randomForest.
It’s going to go ahead and install randomForest for me.
And now I can do a library of randomForest.
Awesome, now I can actually call a random forest.
So randomForest where my formula is
equal to my survived.formula, so that’s where I just
defined what the relationships are, predict
Survived given Pclass, Sex, et cetera.
The data that it will train on will be the titanic.train.
And notice, I’m skipping the 70/30 split here.
I’m also skipping cross-validations.
You guys should definitely be doing this on your data sets.
I’m just simply showing you how to build a predictive model
And then I think I can just do ntree is equal to 500,
I think that’s the default.
And then mtry will equal 3.
The square root of, I think, 7 is 2 point something,
but we’re going to round it up to 3.
And then I like my node sizes, basically the minimum samples
I’d like to have at least be 1% of how big my test set is
or how big my train set is, so titanic.test.
So basically 891 rows, it needs to see around 9 to 8
for it to consider it as a good split.
So we’ll go ahead and build this.
And we’ll call this a titanic.model.
So there it is, it built me a predictive model.
And now I need to go ahead and apply the predictive model.
So we’ll go ahead, and we can also specify features now.
So in this one, I’ll use the same thing,
I’ll define what features are being used.
So in this case, we don’t have a Survived anymore,
so I’m just going to remove that.
It’s very important that you define this,
because we want to use PassengerId.
So in this case, we’ll do a survived.features
Now we can go ahead and use the predict function.
So in this case, I can run a prediction from my model.
So my model is the titanic.model, so basically
titanic, random forest, model.
And then my new data, the data I’m going to predict on,
will be the titanic.test set.
So this is going to score each of them.
And then I’m going to go ahead and say–
hold on just a second, there’s something I’m missing.
Oh right, I need to assign it.
So these are just predictions.
So I’m going to call it Survived, because that’s going
to be the name of the column.
We’re going to do a cbind later,
and it’s going to take the name of the variable as the column name.
So technically, it should be called titanic.predictions,
but I’m just skipping that step for now.
That’s because I selected just Survived.
So if I run the entire line, it should work.
So if I type in Survived right now,
I should get a bunch of 0s and 1s of whether
or not these people lived or died.
It basically went through my random forest
and did this, which is really, really cool.
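The training and scoring calls described above can be sketched end to end as below. This assumes the randomForest package is installed. The data is randomly generated toy data with a made-up survival rule, so the predictions mean nothing; only the shape of the calls matters. As one more assumption, nodesize here is 1% of the train set's row count.

```r
library(randomForest)   # install.packages("randomForest") if needed

set.seed(42)            # the toy data below is random, so fix the seed
n <- 300
toy <- data.frame(
  PassengerId = 1:n,
  Pclass   = factor(sample(1:3, n, replace = TRUE)),
  Sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age      = round(runif(n, 1, 70)),
  SibSp    = sample(0:3, n, replace = TRUE),
  Parch    = sample(0:2, n, replace = TRUE),
  Fare     = round(runif(n, 5, 100), 2),
  Embarked = factor(sample(c("C", "Q", "S"), n, replace = TRUE))
)
# Made-up rule so that there is something to learn.
toy$Survived <- as.factor(ifelse(toy$Sex == "female" | toy$Pclass == "1", 1, 0))

titanic.train <- toy[1:200, ]
titanic.test  <- toy[201:n, names(toy) != "Survived"]

survived.formula <- as.formula(
  "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked")

# 7 predictors, so mtry = 3 is the square root of 7 rounded up.
titanic.model <- randomForest(formula = survived.formula,
                              data = titanic.train,
                              ntree = 500, mtry = 3,
                              nodesize = 0.01 * nrow(titanic.train))

# PassengerId is still in titanic.test but is ignored, because it is
# not named in the formula.
Survived <- predict(titanic.model, newdata = titanic.test)
table(Survived)
```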
So this is what Kaggle wants from us.
Kaggle wants these answers from us, these 0s and 1s.
So the next part of that is–
I need to build a data frame to write that out as a CSV.
And the CSV only needs to have two columns–
PassengerId and Survived.
So we’re going to go ahead and do that real quick.
So basically, let’s isolate PassengerId, so titanic.test of PassengerId.
And we’ll throw that into, I don’t know,
a variable called PassengerId.
And notice, I’m going to call it basically by the same name,
because later on when we cbind these two
things, it will take the column name to be the variable name,
so later on I don’t have to rename the column names.
But Kaggle wants it in a very particular way.
It wants capital P Passenger, and then Id with a capital I.
So that’s why I’m calling it that way.
I want to convert that into a dataframe.
So as.data.frame– so this will be my initial data
frame that I’m going to submit.
So I’m going to throw the PassengerId vector into there.
And then this will be called, I guess, an output dataframe.
So it is a one-dimensional dataframe
right now with only one column in it.
Now what we’ve got to do is, if you remember,
if we call output dot dataframe and call a column that doesn’t
exist, such as Survived, we can throw Survived actually
in there.
So the Survived vector, basically all the predictions
of 0s and 1s, we’re going to throw it
on as a secondary column in here.
So if we do a tail of basically output dot data frame,
we can see that PassengerId and Survived are both there.
And this is what Kaggle wants.
So I’m going to do a write.csv here of the output dot dataframe.
Basically, I’m going to write this dataframe out to a file,
and this is where you get to name the file.
So I’m going to call it kaggle_submission.csv.
And notice I just get to call it kaggle_submission,
because I already have my working directory that I set
way in the beginning up here.
And it’s going to write to this same directory.
And there’s something here that isn’t very intuitive.
We have to set row.names is equal to false.
Because if we don’t, it’s actually
going to write this column right here.
So this column right here, see 413, 414, 415–
it’s actually going to write that into the CSV by default.
We don’t want that, so this is what we’re going to do.
If I check my folder, there should be a brand new file in there.
So if I open that up, we can see that it correctly
did that for us, beautiful.
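The submission file step can be sketched like this. The ids and predictions are toy stand-ins, the name output.df is my own shorthand for the output data frame, and the file goes to a temporary folder here rather than the working directory so the sketch does not leave files behind.

```r
# Toy stand-ins for the real ids and predictions.
PassengerId <- 892:896
Survived    <- factor(c(0, 1, 0, 0, 1))

# The column takes its name from the variable, so no renaming is needed.
output.df <- as.data.frame(PassengerId)
output.df$Survived <- Survived

out.file <- file.path(tempdir(), "kaggle_submission.csv")

# row.names = FALSE keeps the row numbers out of the file.
write.csv(output.df, file = out.file, row.names = FALSE)

readLines(out.file)[1]   # header line of the written file
```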
Now it’s time to go ahead and submit this to Kaggle.
So basically, I’m going to go to make a submission,
and I’m going to upload that dataset now.
So Kaggle submission– so before, I think I submitted,
and everyone dies in the model, and I got a 62%.
Let’s see how well I do today.
Now keep in mind, this model is probably not good,
because I cleaned everything with the median.
Remember, the median age is different
for different genders.
And maybe, once you learn regression on day four,
you can build a regression model, build a predictive model
to predict missing values of Age,
missing values of Fare, et cetera.
So look at that, that’s awesome.
My model boosted me up to 77% accuracy,
which, if you remember, my rank was 5,000 something.
I jumped up 1,000 points with the help of this random forest.
And notice, I haven’t done parameter tuning,
I haven’t done cross-validation, I haven’t done the 70/30 split,
So this is definitely not the best model
that I could have built by any means,
but that is your homework.
That will go ahead and conclude our really quick data science walkthrough.
Now if you go to your GitHub repository,
I’m going to post the homework solutions to this, basically
the thing that I’ve been working on here with you.
I’m going to upload that as an R file.
So you can follow along, and if you don’t
want to follow along with this video, you also have a script.
So if you go to github.com/datasciencedojo/bootcamp,
under homework solutions, I’ve gone ahead and posted
the Kaggle Titanic example dot R.
Also keep in mind that I’ll show you how to do this in Azure ML.
So if you like Azure ML, that’s coming.
Just remember, the Kaggle competition ends at 1:00 p.m.
All right, happy modeling.
The Data set used:
Titanic Data Set