Feature Engineering | Introduction to dplyr Part 4

In this final tutorial of our Introduction to dplyr series, we will cover ways to do feature engineering both with dplyr (“mutate” and “transmute”) and base R (“ifelse”). You’ll learn how to impute missing values as well as create new values based on existing columns. In addition, we’ll go over four different ways to combine datasets. If you’ve followed all the videos in the series, you should be ready to get up and running with dplyr and use it to tackle a range of data manipulation tasks.

Hello everyone today we’re going to talk about how to do feature engineering with
dplyr, including how to create new columns based on the existing features
in the original data set we’re also going to cover how to combine different
data sets into a new entity using different methods so let’s get started
we’re going to continue using the wine ratings data set from Kaggle. If you don’t
know what I’m talking about please refer to previous videos in this series. Make
sure that you already have the wine data set as we produced it last time if
you’re not sure we’re going to post this data set on our Github page so you could
always download it from there. So I just opened up my RStudio and I have the
data loaded on my machine as you can see over here it’s the original data set
after dropping two columns. So now it has nine variables. We can do a quick view of
what this dataset that looks like. And make sure that you do have the exact same set
as I do
so we, I believe we drop the description column and everything else should remain
the same it still has the same number of rows as as the original dataset from
So let’s get started first I’m going to load dplyr. You
should already have dplyr installed on your machine. So I can, I already do, so
I can just call the library. The first thing we’re going to introduce is how to
create a new column based on existing columns using the function called mutate
so I’ll quickly demonstrate how that works. If I do the dataset’s name, and then
use a pipe operator I can do mutate here I’m gonna create a new variable let’s
call it PPratio because I want to find out the points to price ratio if I’m
interested in the value in a wine so in other words if the points variable is
high and price is low, then this PPratio is going to be higher. So I just ran that
and you can see I’m gonna scroll up here. We just created a new column called PP
ratio and these numbers are the division of points by price
one thing to bear in mind is that the mutate function is a non-mutating
function, ironically, meaning we just did this line of code but if we examine our
original data set named wine, it’s not changed even though we used a mutate
function on it. So if we want to make sure, if we do want to append this new
column to our original data set we have to overwrite it by assigning this line
of code to the original object name. So if I run this line of code now it’s
going to overwrite the original data set here so after I hit enter you should see
that this dataset called “wine” is going to have 10 instead of 9 variables and let’s
do that now. So there we go. Another function that’s similar to mutate is
called transmute, and the only difference here is that when you do
transmute instead of mutate, it’s only going to produce the new
variable that we’re, that we’re producing here so we won’t, instead of all these
other columns: regions, winery names, countries, it’s only going to contain
this new variable called PPratio. I’m going to run that here
Oh it’s pretty big. But as you can see it’s only going to, instead of a data
frame that has all, that retains all the original columns from our original data
frame, transmute only produces a new column and if we want to. Again transmute
is also a non-mutating function. So you can see over here. Just to confirm that
our wine object is not changed even though we just called this function
transmute on it. So if you do want to produce a new object let’s create a new
object called wine1 and then run this function on it. Checking over here this
new variable, new object called wine1 is, has the same number of rows as our
original data frame but it only contains one variable which is the PPratio that
we just produced. So now that you understand how transmute and mutate work
now let’s go ahead and find out how many observations there are for each country
using our old friend group-by aggregation. So if I’m just curious and I
want to find out which countries produce more or fewer wines in this set, I can do
something like group_by country and then summarize. Inside the summarize
function I can create a new column as well I can call it whatever but let’s
just call it “total” and I’m gonna do the “n” function here which is going to
count the number of observations by each country since that’s a variable we’re
grouping by here
Alright so now we’re running into a problem because there are five
observations that have the the country variable missing. Missing values are a
common problem when we deal with any kind of real-world data set but don’t
worry too much here because we’re going to learn how to impute the missingness
here so if you want to find out what these five observations are that don’t
have a country specified we can subset this data set by using this syntax so
here I’m saying that out of the wine data set I want to subset all the rows that
have the country variable missing or all the rows that have the country variable
set to nothing. So there’s nothing in these quotes and then I put nothing
after the comma because I’m saying I want all the rows that satisfied this
condition and all the columns. So I’m putting nothing after the comma. So after
I run that I can see that not surprisingly I’m getting five
observations since the previous line of code told me that there are five
missing observations that don’t have country specified. So here we have the
these these observations listed and one reason I decided to subset these rows
while retaining all the columns is that usually we can try to glean some
information from the other columns so even though for these observations the
country column is missing, they do have most of the other columns filled out so
we can always, it’s always a good practice to try to impute the missing
values based on the existing values so in this case we can see that, let’s
start with designation. So if I don’t know what this word “Askitikos” means
remember to always be resourceful and just Google around or ask a friend who
you might, who might be knowledgeable in this area and just, you know, be
resourceful and try to look for information online and come up with
with an answer. So I just copied this word and if I go to Google I’m gonna
search. Alright, so I guess, some results here. If I just click on some of them
looks like this is a Greek wine or it might be a region in Greece, from what I can
gather. So here it’s confirming what I was suspecting so looks like this
designation implies that the country should be Greece. Going down the line
“Shah” might be a little more difficult so we’ll deal with that later
and if we want to find out where Piedra Feliz is referring to, again just google
there we go so looks like this is a region or a place in Chile so now we
know that these last three observations probably have Chile as their country so
“Shah” may be a little more ambiguous but if we go over here we see that the
specific wineries’ names listed. I don’t even know how to pronounce that but I’ll
just copy and paste into Google
and it’s telling me that this is a winery in Turkey. So even though we started
out with five missing values for country we have just come up with a way to
impute all of them. And the syntax for filling out these missing values in
country, one way to do that is using the function called “ifelse.” This is not a
dplyr function; it’s just a base R function but it’s very useful as well
for feature engineering so I’ll quickly show you how that works
I’m gonna go over here to copy and paste
so to explain what this line of code is saying “ifelse” takes usually three
arguments inside the parentheses, so the first one is a condition. So here we’re
saying that in this wine data set if the designation column is this word, if that
condition is true the second argument is saying that we want this
country column, so in front of the equal sign here, to be Greece and remember to
put the word in quotes. So here’s a condition, here is what you do if that
condition is true, and the third argument is what you do you if the condition is
not satisfied. So we’re saying that in this data set wine if the country
column contains, sorry, or if the designation column contains this word
we’re going to fill the country column with Greece but otherwise just set it as
the country column. So in other words leave it unchanged. I’m gonna run that
I see that nothing’s showing up but after we run all three lines of code we
can check to see that everything worked and we’re gonna do the same thing for
Chile. I just press the up arrow on my keyboard to copy the last line of code I
ran, just to save some time and then I can just change the country and
region names
so here I’m saying if an observation’s designation is set to Piedra Feliz I want
the country name to be Chile and otherwise the country name would be the
the country name, so don’t change it and the third one because we arrived at the
country name by winery instead of designation I’m going to change
designation to winery and just copy and paste the winery’s name. Notice that we
probably could have found out Greece and Chile based on the winery names as well
because these tend to be pretty specific but designation worked for us so I’m
gonna leave that. Okay so we just imputed all five missing values for
country and if we, I want to confirm that everything worked I’m gonna just keep
pressing the up arrow button on my keyboard and then do another group_by
summarize and just to make sure that everything worked. So here we go, we see
that there are no missing values any more. One thing to note here is that even though we
just imputed the missing values for country and that was a mutating function
meaning our original dataset called wine has been changed after we filled in the
blanks for the country variable. The actual group-by aggregation functions are
non-mutating. So we did something like this. Every time we just do a group_by and
summarize series it’s not going to impact the original data set we started
out with. So here if we want to create a new data frame
that contains this new column called total and we’re going to arrange that
from top to, from the highest to the lowest values. So if you remember from
our previous videos or if you already know this function “arrange” we’re passing
descending into “arrange” in order to sort from the highest to lowest values we’re
going to name this new data frame. Just call it “newwine” I forgot to actually
pass in a variable within “arrange” so we’re saying
arranging by total from the highest to the lowest. Come over here to make sure
that this new data frame has been created and it has 48 observations and
only two columns if we take a look at what we just created we can do View. So
here you can see that group-by aggregation produces new objects that
only have whatever variable we grouped by and the new variable we create within
summarize function so even though the original wine data set had ten variables
after the group-by aggregation we only have two now; so all the other variables that we didn’t group by
or summarize, like winery names, regions, prices, points, they just all went away so
that’s something to bear in mind. And because of the volume of this data
set we’re going to just create a subset of it
Let’s call it subset1 and we’re just gonna select the first six rows
using the head function and we’re gonna create a second subset and this time
we’re gonna choose the first six rows of the new wine data frame that we just
created over here and you can see that subset1 has six
observations of ten variables because we chose the first six rows of the original
wine data frame that has ten variables and whereas subset2 also has six
variables, observations, excuse me, but only two variables because again this
new wine data frame only has two columns so we’re going to start talking about
how to join data sets now using four different methods to join these two
subsets. The first one is called full_join. Remember that these join
functions are also non-mutating so we have to assign them to a new variable if
we want to, if you want an object to actually contain the product of the
function otherwise if we just use the function, it’s gonna go away. And it’s hard to call it later
on. So the first function is called “full_join”
and I’m just gonna call it, create a new variable called “full.” I’m gonna pass in the
data set names, subset1 and subset2
so here we see that we just create a new data frame. If we want to examine what it
looks like we’ll just call it by its name called “full”
here you can see that we have nine observations but some of these columns
are missing and that’s because we do a full join, RStudio is looking
RStudio is going to include all the observations in both subset1 and subset2
even though these subsets have some distinct or non-overlapping values. So to
demonstrate what I mean here let’s just take a look at what subset1 and subset2
actually contain. So you see that subset1 has only has three different, three
unique values for country, whereas subset2 has not surprisingly six because we
grouped by country. So when we do a full join the result is going to include all
these rows and remember that the columns have to have the exact same name so in
subset1 it’s called country and that’s the exact same name in subset2 as well
and even though subset2 has 3 countries: Italy, Chile, and Argentina that don’t
appear in subset1, when we do a full join the resulting data frame as we can see
here is going to include all these countries. And not only that, it’s only
going to, it’s also going to include all the observations from subset1 so you
see that even though subset1 and subset2 both only have six observations when
we did a full join we get nine observations and that’s because in
addition to subset1 that only has three unique countries we’re adding
Italy, Chile, and Argentina on top of subset1
and because subset2 only has two columns and it doesn’t have all these
other features: designation, point, price, etc, it only has, but it does have this
new column called total for these three new rows that don’t exist in subset1
it’s just going to show n/a for all these columns that don’t appear in
subset2 because we don’t have values for Italy, Chile, or Argentina. So you see
that all these columns are NAs except the very last one called total. The second
type of join is called inner join I’m gonna create a new variable called inner
and assign that to the inner join function so exactly like the syntax for
for full join. I’m just gonna pass in the data set names. So here you can see that
whereas the full object has nine observations, the inner object only had
six. And let’s just take a look at what they are. So here you see that we don’t
have any NA values and notice that we also don’t have Italy, Chile, or Argentina
any more like we did in full. And that’s because whereas full join
will combine everything in these two datasets regardless of missing values
inner join is only going to join the datasets by common values. So because
US, Spain, and France are the only overlapping countries that appear in
both subset1 and subset2, they’re the only ones that will be
retained after an inner join. So pretty much we just lost the values for Italy
Chile and Argentina. The final two join functions are called left_join and
right_join. And similarly I can do something like this and create a new
variable called left and the syntax is exactly the same. You simply pass the
dataframes’ names into it. I’m also gonna do right_ join. So left and right joins
simply means, left_join will mean retain everything from subset1 and right_join
will mean retain everything from subset2. So come over here we can see
that left_join has six observations whereas right has nine
observations and that’s because we created, subset1 is in a way a
subset of subset2, if that makes sense. Because subset2 has six
countries where a subset1 only has three. So when we do a left_join
retaining only unique values from subset1, we get fewer observations than
if we do a right_join. That’s because when we do a right_join we’re retaining
every unique value from subset2 and our subset2 has more unique country
values than subset1. That’s how we end up getting more values from doing a
right_join than a left_join. And the result of the right_join should be the
same as a full_join. If we want to confirm that we can just call it and see that we
have all six countries with NA values for Italy, Chile, and Argentina and if we
want to look at the left variable it looks the same as the inner_join
product. So it doesn’t have the three values that only appear in subset2
and remember that we decided to subset from the two different data frames wine and
newwine, or wine1, rather. It’s because we have over 150,000 observations and
doing joins would take too much time and computing power so in order to
illustrate how the join functions work I selected these subsets just to
make it more time efficient for the purposes of this tutorial
That’s it for today’s tutorial. Thank you for watching the dplyr series. If
you’ve liked what you’ve seen so far please check our other videos on YouTube
as well as follow our blog page. If you have any questions be sure to ask in the
comments below. Thanks for watching!

To get setup with dplyr, watch our first tutorial.

Be sure to also check our accompanying blog post here.

Full Series:
Introduction to dplyr

More Data Science Material:
[Video] Feature Engineering for Bot Detection
[Blog]  Feature Engineering and Data Wrangling in R


About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.


You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>