The Kaggle competition for the Titanic dataset using RStudio is further explored in this tutorial. We will show you more advanced cleaning functions for your model. This Kaggle competition in R series is part of our homework at our in-person data science bootcamp.
This is Phuc Duong again.
This is part two of how to do a Kaggle competition in R.
So last video, I showed you how to do a very simple model in R
and submit it to Kaggle, but we also
used some very subpar cleaning functions, OK.
So this video is going to depend on basically the code
base from that previous video.
If you have not watched that video yet,
you’re going to be very confused.
So basically click here on the screen
to watch the previous video and then come back.
I’m going to wait a little bit so you guys can click it.
If you’re still here, then that means
that you are ready for part two of this video.
So I have basically saved what I’ve
written in the last video as a script,
so I’m basically just going to rerun all of it.
Well, I don’t need to write the series.
And maybe I shouldn’t have run install.packages again.
All right, so we’re going to go back up
to where we cleaned the data.
Where did we clean the data?
We cleaned the data somewhere up here.
So notice that the age, we cleaned it with the median.
We also cleaned fare with the median.
It’s very suboptimal, because if you actually
do bucketing and segmentation, you’ll
find that the fares of different Pclasses are different.
The median fares of different Pclasses are different,
the fares of different genders are different.
Maybe fare of females in third class
is higher than the median of fare of females in first class.
So you just start stacking a lot of things
and then the median will start to change dramatically.
So let us build a predictive model
to actually clean the missing values of our data.
Let’s not just make blanket assumptions
on the data set by filling in with the median.
Let’s take a much more educated guess,
and then we’ll feed that into the actual predictive model.
So notice that we’re building a predictive model
to clean missing data so that we can actually
get a more accurate predictive model on all of it.
I’m going to go ahead and comment out this line.
Basically this line is what we don’t want to do,
I no longer want to clean with the median.
Now, this is an example for fare but you can apply this for age
or anything else that is numeric.
Last time I already showed you how to build a classification model.
You would apply the same concept.
In this case, I want to build a regression model.
Now this is also kind of a waste, because,
actually, if you look at it there’s only–
if I run this line, which is basically
going to query all the missing values of fare,
and there is actually only going to be–
In this case I ran the script through and it went ahead
and it cleaned the missing values for me.
I want to run the script and have it stop right here, right
So I’m going to go ahead and, see this brush up here?
It’s going to clear all of my objects
so I can restart again fresh.
Basically I’m going to select everything up until that point,
and I’m going to run basically everything else.
So notice that it cleaned Embarked
and it cleaned Age with the median.
So I still have to clean the missing values or else
the model’s not going to like that I’m missing values.
And I also have to do this categorical casting here,
which I will do, actually, now.
So now what do we want to do?
All right, let’s build a predictive model
to predict fare, and because our response class
is going to be numeric, it will be a regression model.
If I run this statement, which is basically
find me all the missing values of fare, there’s only one.
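As a rough sketch of that query, assuming the combined data frame is called titanic.full and uses the Kaggle column name Fare (the video says "fare"), the check might look like this. The five rows below are hypothetical stand-ins, not the real data.

```r
# Toy stand-in for titanic.full; values are hypothetical.
titanic.full <- data.frame(
  PassengerId = 1:5,
  Pclass      = c(3, 1, 3, 2, 3),
  Fare        = c(7.25, 71.28, NA, 13.00, 8.05)
)

# Logical vector: TRUE wherever Fare is missing.
missing.fare <- is.na(titanic.full$Fare)

# Show the rows with a missing Fare, then count them.
titanic.full[missing.fare, ]
sum(missing.fare)
```

On the real data the same two lines should report the single missing fare mentioned above.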
So basically this is going to be a very wasteful
predictive model, in a sense that we’re
going to build a predictive model just to predict one row.
Now I’m only showing you that because your homework should
actually be how do I build a predictive model to predict age.
Age has 200 and something missing values.
That would be much more useful there.
So I’ll let you guys do that for homework.
But the idea is, you can take this code base
and convert it into an age predictor very easily.
You just switch out the names of the columns.
So if we look at Titanic fare, we
can build a linear regression model.
So I can easily just call this lm function,
but, here’s a big but, there’s two types of linear regression
There is an online gradient descent variant,
which we’re not going to cover in this video.
But there’s also another version which
is an ordinary least squares linear regression model.
That’s a very simple one that I tend to like to go with.
So that one is very susceptible to outliers, so basically
before we do this linear model, we have to filter the outliers.
So if we simply do a box plot of titanic.full$fare.
So anything beyond this, basically,
this whisker, is going to be considered
what you would call an outlier to this model.
So with that in mind, we want to filter these guys out.
So we want to build a linear regression model just
So these outliers, we don’t want them
because if we built a model to predict on them,
this guy, for example, would, basically, completely throw off
our model and, basically, might pull up our regression line,
and make it seem like everyone is richer than they
actually are, or paying more for a fare than they actually did.
So how do I get this cutoff?
What is the value of this whisker?
So if I move over, I can kind of guess it
and say that, OK, if you paid more than, let’s say,
and this is me guessing, I’m eyeballing, right?
We’ll, however, go ahead and filter it out.
But as it turns out, R actually stores it.
So if I just do boxplot.stats, I can actually figure that out.
So basically, titanic.full$fare again.
So notice that this brings me back out all the stats that I
Now, there is one stat that I want,
which is this guy right here, 65.
So this tells me, basically, the first whisker,
the first quartile, the median, the third quartile,
and the upper whisker.
So anything above– anyone who paid more than $65 for a fare
would go ahead and be, in this case, an outlier.
And we would filter those guys out.
Now, I can very quickly and very easily just
Or I could do titanic.full$fare is less than,
less than or equal to, 65.
Now that builds me my filter right away.
And I filtered out the outliers.
But that is not how we build scripts.
Because if this was sales data, the outliers might be–
that whisker might be moving.
The upper bound might be moving on us.
So tomorrow’s sales data could change things.
So let’s actually derive what that is.
So if I just type in that same command, boxplot.stats,
notice that if I hit enter here, it brings me, actually–
That means I can reference these things.
So let’s see what happens.
If I want something in stats, I would call
the dollar sign of stats here.
And notice I get this back.
So notice I can call that, and I can get the fifth element of that vector back.
And notice I get 65 back.
So this gives me my upper bound.
So I’ll call this upper.whisker is equal to the fifth element of stats.
So that number is equal to 65.
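Here is a small sketch of deriving the upper whisker instead of hard-coding it. The fares vector is made up so that the result matches the 65 from the video; on the real data you would pass titanic.full$Fare (column name assumed from the Kaggle file).

```r
# Hypothetical fares, chosen so the upper whisker lands on 65.
fares <- c(0, 7.25, 7.9, 8.05, 13, 14.45, 26, 31.28, 65, 263, 512.33)

fare.stats <- boxplot.stats(fares)

# $stats holds five numbers: lower whisker, lower hinge, median,
# upper hinge, upper whisker.
fare.stats$stats

# The fifth element is the upper whisker.
upper.whisker <- fare.stats$stats[5]
upper.whisker

# $out lists the points that fall beyond the whiskers.
fare.stats$out
```

Because the cutoff is recomputed from the data every time the script runs, it keeps up if the data changes.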
Now I can go ahead and build my filter.
So I can do outlier.filter is equal to titanic.full$fare that
is less than upper.whisker.
And notice that I’m only, basically,
cleaning the upper bound whisker.
There is also a bottom whisker.
But notice that it’s also 0.
So we don’t have any outliers that go below the minimum
So we’ll go ahead and do that filter.
So this gives me a series of true falses.
Oh, I have to run this code first.
So basically, this upper whisker should be 65.
And now, we will go ahead and do a filter here.
So if I run this, this should be a series of true falses.
So in this case, someone paid more.
Someone paid more for a fare.
So I’m going to put that into, basically, a filter.
Next thing is I’m going to go ahead and do
the actual filtration of the data now.
So titanic.full of outlier.filter.
And notice that we only want the rows that
are basically not an outlier.
So if I run this, this will give me all the rows
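A minimal sketch of the filter and the row query, on hypothetical rows. One thing the video does not cover: comparing a missing Fare gives NA rather than TRUE or FALSE, so this sketch wraps the filter in which() to drop that row. The which() call is my addition, not part of the video's script.

```r
# Toy stand-in for titanic.full; values are hypothetical.
titanic.full <- data.frame(
  PassengerId = 1:6,
  Fare        = c(7.25, 71.28, NA, 13.00, 512.33, 8.05)
)
upper.whisker <- 65  # hard-coded only to keep this sketch short

# TRUE for fares under the whisker, FALSE for outliers, NA for the missing fare.
outlier.filter <- titanic.full$Fare < upper.whisker
outlier.filter

# Keep only the rows that are not outliers; which() skips the NA entry.
non.outliers <- titanic.full[which(outlier.filter), ]
non.outliers
```

Without which(), the NA entry would come back as a row of NAs; lm() drops rows with missing values by default, so the model fit would not be affected either way.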
The next thing is now we can go ahead and build our model.
What I said, we’ll go ahead and do an lm here.
An lm, where the formula–
we have not defined a formula yet, actually.
We haven’t told it how to predict yet.
So in this case, fare.equation.
So what do we want to do?
So let’s do an str real quick.
So basically, an str of titanic.full.
How do we want to build this predictive model?
What is their relationship?
So we’re going to build a model, not to predict survive,
but we’re going to build a model to predict fare.
So build me a model to predict fare.
And then this tilde will mean “given”.
So basically, it’s like y equals a bunch of stuff.
So we want it to use everything else, OK?
So notice that we’re going to use Pclass.
We’re going to use gender here, so Sex, plus
Age plus SibSp (sibling/spouse) plus Parch (parent/child), OK?
And then plus embarked here.
All right, that will be our equation.
So fare.equation will be inserted there.
So that’s telling me that build me a predictive model based
upon these other predictors.
And notice I’m not using survived, right?
Because our future data will not have this column.
So we can’t rely on it as a predictor.
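The equation described above can be sketched as a string that is then converted to a formula. The column names follow the Kaggle file (Fare, Pclass, Sex, Age, SibSp, Parch, Embarked), which is an assumption since the video only speaks them aloud; fare.formula is a name introduced here for illustration.

```r
# Fare is the response; everything after the tilde is a predictor.
fare.equation <- "Fare ~ Pclass + Sex + Age + SibSp + Parch + Embarked"
fare.formula  <- as.formula(fare.equation)

# List the variables the formula uses; Survived is deliberately absent.
all.vars(fare.formula)
```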
And then the data will be the data
in the absence of the outlier.
So earlier, here and here, I went ahead and did that filter.
I said titanic.full, where I only want to see non outliers.
So that’s what’s contained in this outlier filter, OK?
So I want to go ahead and run the equation line.
And I want to run this lm line.
But I also want to stuff this bottom line into, basically,
All right, so I’m going to run this,
and it’s going to build me a predictive model using
an ordinary least squares linear model.
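To show the shape of the lm call without needing the Kaggle files, this sketch fits the same equation to synthetic passengers. Everything in the passengers data frame is generated, and the relationship between Fare and Pclass is invented for the example; only the lm() usage mirrors the video.

```r
set.seed(42)
n <- 60

# Synthetic stand-in for the non-outlier rows of titanic.full.
passengers <- data.frame(
  Pclass   = sample(1:3, n, replace = TRUE),
  Sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age      = round(runif(n, 1, 70)),
  SibSp    = sample(0:3, n, replace = TRUE),
  Parch    = sample(0:2, n, replace = TRUE),
  Embarked = factor(sample(c("C", "Q", "S"), n, replace = TRUE))
)

# Invented relationship: lower classes pay less, plus a little noise.
passengers$Fare <- 80 - 20 * passengers$Pclass + 2 * passengers$SibSp +
  rnorm(n, sd = 2)

fare.equation <- "Fare ~ Pclass + Sex + Age + SibSp + Parch + Embarked"

# Ordinary least squares fit of Fare on the other columns.
fare.model <- lm(formula = as.formula(fare.equation), data = passengers)
coef(fare.model)
```

In the real script the data argument would be the filtered rows, something like titanic.full[outlier.filter, ].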
Next thing is I want to apply this.
I want to fill in the values that
are missing using a predictive model,
using the rest of the data on that row that
has the missing value of fare to fill in the missing value.
So we’re going to go ahead and do, in this case, a prediction.
OK, so the problem now is where is our model?
So our model will be fare.model.
And what is our new data?
So new data will be any row that has missing values of fare.
And then the next thing is we have to define
And that gets tricky because we have other things in here,
We have embarked or not embarked.
We need to tell it not to use those things.
So this is where it gets a little tricky.
So we have to now query our data to basically isolate the things
So titanic.full, we’re going to do a quick query.
So how do I find if something is missing in fare?
So remember, is.na of titanic.full$fare.
This will find all the missing values
and give me back a vector of true falses.
Notice that I can query like this.
All right, the next thing is what columns do I want.
So notice that not every column is needed.
So I only want to query specific columns.
So in this case, I want to query the columns that
So I’m going to go ahead and do this.
And actually, I’m going to do some text processing in Excel.
Now you can do this manually if you want.
But I know that a vector is going to require a comma right
here and quotes in these.
So basically, I’m going to do a find replacement.
So I want to find all plus signs with a space before
and a space after and replace it with a quote before, comma,
So notice if I replace this, this
will build me my vector initialization command.
So notice that I want Pclass, sex, age, sibling, spouse,
So I’m going to go ahead and close that.
And now I want to tell it I only want it to query those columns.
So I’ll paste that in here.
So I want to query only the rows that
have missing values of fare.
And I only want to see Pclass, sex, age, sibling, spouse,
All right, now we can go ahead and fill this in.
Because basically, this is going to return me the rows that
So this will give me a series of rows.
But let’s just run this first, OK?
OK, so notice that this brings me
back the row that has the missing value, which
So this row, notice I didn’t query fare.
That’s the job of the model.
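A sketch of that two-part query, rows first and columns second, on hypothetical data. The column names are the Kaggle ones and fare.predictors and fare.row are names made up for this example.

```r
# Toy stand-in for titanic.full; values are hypothetical.
titanic.full <- data.frame(
  PassengerId = 1:4,
  Pclass      = c(3, 1, 3, 2),
  Sex         = c("male", "female", "male", "female"),
  Age         = c(22, 38, 60.5, 27),
  SibSp       = c(1, 1, 0, 0),
  Parch       = c(0, 0, 0, 0),
  Fare        = c(7.25, 71.28, NA, 13.00),
  Embarked    = c("S", "C", "S", "S")
)

# Only the columns that appear on the right-hand side of the equation.
fare.predictors <- c("Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked")

# Rows: missing Fare. Columns: the predictors only, so Fare itself is left out.
fare.row <- titanic.full[is.na(titanic.full$Fare), fare.predictors]
fare.row
```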
Now we’re going to go ahead and predict on this.
It’s kind of a waste because it’s only
going to run a prediction on one row, this 1,044 row.
So I’m going to go ahead and run this prediction.
I’m going to store that into a label too.
But actually, let’s run it before we store it
Oop, I have not run that filter yet.
So I’m going to go ahead and run that filter.
So that predict is going to go ahead and predict.
So notice that it’s going to predict
for that person, passenger 1,044,
that he or she might have paid $8.25 for a fare, if he paid.
So that prediction actually needs
to be thrown back into the data set as a replacement.
So notice that we just called a prediction
and printed it to the console.
We didn’t store it anywhere.
So we got to just call this fare.prediction.
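Here is a simplified sketch of running the prediction and storing it. To stay self-contained it uses synthetic data and only two predictors (Pclass and SibSp), so the numbers will not match the video; the predict() call with newdata is the part that carries over.

```r
set.seed(7)
n <- 40

# Synthetic stand-in for titanic.full with an invented Fare relationship.
titanic.full <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  SibSp  = sample(0:2, n, replace = TRUE)
)
titanic.full$Fare <- 80 - 20 * titanic.full$Pclass +
  2 * titanic.full$SibSp + rnorm(n)
titanic.full$Fare[10] <- NA  # one missing fare, as in the real data

# lm() drops the row with the missing Fare by default.
fare.model <- lm(Fare ~ Pclass + SibSp, data = titanic.full)

# Predict only for the rows where Fare is missing.
fare.row <- titanic.full[is.na(titanic.full$Fare), c("Pclass", "SibSp")]
fare.prediction <- predict(fare.model, newdata = fare.row)
fare.prediction
```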
And then we’re going to go ahead and replace it.
So earlier, we showed you that is.na of titanic.full$fare.
So now we need to query which rows
have missing values of fare.
So this brings us back a true false vector.
This is going to be our query.
And we’re going to go ahead and tell it we only
want to query the fare column.
So we wrap all of that in these brackets.
And we’re going to query that from the titanic.full data set.
And this should bring me back that one
And I want to replace that na with this prediction thing
So if this query had brought back like 10 lines–
and the fare prediction should have 10 lines–
It will, basically, just replace them in order.
So that’s how we’ll go in and mark all these predictions.
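The in-order replacement can be sketched like this. The data and the two predicted values are hypothetical; two missing fares are used here, rather than one, so the ordering is visible.

```r
# Toy stand-in for titanic.full with two missing fares.
titanic.full <- data.frame(
  PassengerId = 1:5,
  Fare        = c(7.25, NA, 13.00, NA, 8.05)
)

# Hypothetical predictions, one per missing row, in row order.
fare.prediction <- c(8.25, 30.10)

# Assign the predictions into the Fare column of the missing rows.
missing.fare <- is.na(titanic.full$Fare)
titanic.full[missing.fare, "Fare"] <- fare.prediction
titanic.full
```

The first prediction lands in the first missing row and the second in the next, which is why the prediction has to be made on the same rows, in the same order, as the query.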
All right, and that’s all there is to it.
We have gone ahead and filled in missing values for that.
So that value is now gone.
Now it should be 8.2 or something like that.
So if I go into my titanic.full and view the 1,044th row,
I should see that the fare is $8.25, OK?
And now I can go ahead and run the rest of the model.
Go ahead and run the rest of the model,
and treat everything else the same.
Do the categorical casting back again.
Go ahead and split out back into train and test.
Cast Survived as a factor in the training set.
Tell it that we want to predict survived
given all this other stuff.
Use the random forest package.
Do a library(randomForest).
And now, we can go ahead and predict on survived, right?
And now, hopefully, our model is making a better educated guess
than it was when it was just filling in the median for fare before.
So now go back and do that for all of the other columns.
So I think age had a bunch of missing values.
So go forth and build yourself a model to predict their age.
This video assumes you have watched part one, if you have not, view it here:
Creating a Titanic Model in R Part 1
Data Set Used:
Titanic Data Set
Download RStudio here:
Full Kaggle Competition Series:
Kaggle Competition Series