Subset, Reshape, and Summarize Data | Introduction to dplyr Part 2

We go over some basic functions of dplyr including the mighty group_by and summarize combo that makes dividing up datasets a breeze, as well as arrange, select, and filter that help get the data in a cleaner and more organized format. Group-by aggregation is one of the most powerful, yet simple, tools you can use to truly make a difference in the way you manage and approach data. We’ll also walk you through how to subset columns and rows, as well as how to reorder data the way you like it.

dplyr is a a great tool to perform data manipulation. It makes your data analysis process a lot more efficient. Even better, it’s fairly simple to learn and start applying immediately to your work! Oftentimes, with just a few elegant lines of code, your data becomes that much easier to dissect and analyze. For these reasons, it is an essential and foundational skill to master for any aspiring data scientist.

Hi everyone, this is Ningxi with Data Science Dojo
Today we’re going to introduce some very useful functions including group_by, summarize, and arrange
to start manipulating your data. We’re also going to walk
you through some basic visualization techniques with ggplot. As you can see, just a few simple lines of code
can create beautiful plots
So one of my favorite tools to use in dplyr, and I think
this applies not just with, to dplyr but data analysis in general
is the group-by aggregation combo tool that we can use
So it basically groups the dataset by a certain feature
and then it performs some aggregation function on the subgroups
which comes in really handy if we want summary statistics on the different subgroups to do a quick
exam of how the different groups are different in our dataset. Then group-by aggregation is one of the most
powerful tools you’re going to use
So the first function we’re going to demonstrate with that is “arrange”
Also note that this syntax in dplyr is called the “pipe operator.” It’s very handy when it comes to passing
multiple functions sequentially, since it takes whatever object that comes immediately before it as the object
that it’s going to use. So I’m going to do the group-by aggregation, which in dplyr syntax will be “group_by”
that’s its own function, followed by “summarize.” So here we’re going to group by “country” because we want
summary statistics on the different countries. Followed by another pipe operator. So whatever comes, whatever
functions comes after this operator is going to take whatever object created by this function as its argument
So on, so forth, which is very useful
We’ll do “group_by” and I’ll summarize. We want to count how many wines each country produced in this dataset
So when I do this it means I’m creating a new variable, and I could have named this variable whatever
but I’m naming it “count” because that makes sense
I’ll do another function inside the “summarize” function named “n.” And it does not take any argument
in the parentheses. So here it’s counting all the observations grouped by country and then creating a
new variable called “count.” So right now we’re creating a new dataframe that only has two columns
The column that we selected by “group_by,” which is “country,” and the column we just created called “count”
inside the “summarize” function. And we want to know what are the top producers in this dataset
So we’ll do something like “arrange.” That’s going to sort
through all the values. Because “arrange” has its default setting as ascending. It means it’s going to start from
the lowest value to the highest. We want to do something like “desc,” standing for “descending”
by “count,” since we want to know, we want the top to bottom count, so from the largest to the smallest value
So when we do that, we get, like I said, the two-column
dataframe that only has “country” and “count.” And we see that the top ten producers are listed with 30 more
countries not listed, so there are 49 countries in total in this set
Because of the magnitude of this dataset we want to focus our attention on, let’s say, the top ten producers
So I’m going to actually create a new variable called “selected countries” that’s going to do just that
Again, I’m going to start with our original dataset, “wine,” group by “country”
and summarize by “count”
Note that the “n” function is actually the function that’s counting, and this count is just any name that I chose to
give it
So this is just repeating my previous line of code
Here I’m introducing a new function called “top_n”
And inside it I can pass any number I wish. Here I can do 12 or 15, but I’m going to select the top 10 producers
and I’m passing 10
Finally, a “select” function that’s going to let us select
which columns we want. And note that it only works
with subsetting columns, not rows. We’ll see how to
subset rows shortly
Ok, so here I actually forgot to pass “count” inside the
“arrange” function because otherwise it does not know
which variable I want the count to take. So if it was
taking the variable “country” then it doesn’t make any
sense, so remember to pass in “count” into the “arrange” function
And now selecting by count
And if we examine this new object we just created
It is actually a dataframe. If we check out its structure
Here we go
Which we do not want, because as you can see later
we want this object to be in a vector structure so we can
directly reference it
So the way to do that is overwriting this object
using the “as.character” function
And because this object is still a dataframe I can use
this syntax to reference which column, even though it only has one column
I still need to do the dollar sign syntax to explicitly say that I am selecting the “country” column
So now if I check this out again, it’s now a character vector
Just to make sure, there we go
So if we want to create yet another dataframe containing only data on these top ten producers
and we want to find out how their wines were rated in this set. And that’s, as you can see here, accomplished
by the “points” function. So we’re going to subset again
Let’s create a new object. Let’s call it “select point”
Starting from our original dataframe, and here we’re
using this function called “filter,” which is used for
filtering rows. So as opposed to “select,” which works with columns, “filter” works with rows
We’re saying that we only want the rows that appear
in the “selected countries” list
So note the syntax here, this function is basically saying
that we only want the rows that have their value in the “country” column, over here
If their country is listed in this vector, which is here
as you can see
the top ten producers list that we came up with a moment ago
So we’re gonna select only certain columns out of this dataframe. So we only want the “country” and “points”
columns
And we’ll do another “arrange”
by country
And if we check this object out, we’ll see that it is a new
dataframe containing only the top ten countries we
selected and their points
I’m skipping most of the values because there are too many rows
If we want to do a quick examination of how the variable “points” moves with price, so in other words, if we want
to know if a more expensive bottle of wine really is better, or at least so judged by Wine Enthusiast where
this dataset is obtained from. I think that’s a question most of us encounter when we try to pick
a good bottle of wine, at a store or at a bar, because oftentimes we don’t know if more expensive really does
mean better. So let’s do a quick exam there using ggplot
I’ll do a scatterplot so passing the geom_point function
after the ggplot function, and also add a smooth line
just to make it visually easier to see the trajectory of the correlation
And it’s taking a second here
As you can see the system’s still running
And go over to “Plot”
We can see that overall there is a positive correlation between points and price, even though it’s not a huge
jump. And a few outliers here, you can see the most expensive wines. I don’t know this is taking a while
The most expensive wines located towards the top of
the graph do seem to be highly rated as well, except this guy right here. So rest assured, if you’re
paying more for a bottle of wine, chances are, I wouldn’t say very likely, but
likelier than average, say 50%, that you’re actually
getting higher quality than average
So that’s interesting to know
And if we want to delve into the country-specific details of how their wines were rated, we can do a more specific
plot. Or in this case, actually a series of plots – a seires
of boxplots, and see what we get
So I’ll pause here before I type in more code. Here I’m saying that I want the variable on the Y-axis to be points
pretty straightforward. On the X-axis I want country
However, this variable is going to be reordered by the “points” column, and we’re saying that we want the
median of points to be the reordering metric when we’re placing these country values on the X-axis
So moving on to calling the boxplot aesthetic
We’re gonna fill these boxplots by country. Remember to use “fill” instead of “color” because “color”
only does color the outline of the boxplots. It does not fill the entire box
Finally, we’re adding some labels to our plot so it looks pretty. Don’t forget to use quotes
And we can also add a title. Let’s say “Distribution of Top Ten Wine Producing Countries.” Here I’m just gonna
make sure that the title appears right in the center. So I’m gonna do something like “element_text”
since we’re gonna move the textual element, and pass in the “hjust” argument, which stands for horizontal
justification. I’m gonna set it to 0.5 since this is usually treated as a value between 0 and 1 So .5 means it’s
gonna be horizontally centered, essentially. Let’s see what this gives us
So here we have a series of boxplots of our top ten producing countries, ranked by their median point. We
can see that on average, or by the median metric, France is the highest rated country in terms of wine quality even
though it’s not, it does not vary a lot from the runner up, Austria. One caveat when we’re using this dataset is that
all the observations of wines that received a rating of 80 or more, since Wine Enthusiast did not even include
wines that received a rating below 80. So the distribution may be a little tighter, since all these have been
essentially pre-filtered before they were even included in the dataset. So here we can see that we have some
countries that have a wider spread, like the US and Spain. And some have tighter distributions like
New Zealand, Italy, and Austria. And some outliers across the board, really. If we want to find out if there
were any countries that produced high quality wines but are not mass producers, meaning if they’re not in the top
ten producing countries like we found here, but nonetheless produced wines that were rated high
We can do something like the following. So this time around we’re gonna filter by countries that do not appear
on the “selected_countries” list so we’ll do this syntax. So the exclamation point like in many other programming
languages just means that we’re saying exclude, that we’re selecting countries that are not in this list
So, continuing the piping sequence, we’re gonna do another group-by aggregation function. Grouping by
country again and then summarize. This time around we’re gonna create an new variable called “median”
That’s going to be created by doing the “median” function, don’t be confused here, on the “points” variable
I’m gonna do another “arrange.” Again descending order since we want the top rated wines to come up first
So here you go. After filtering for countries that do not appear on the top ten list, and arranging by their median
point, we see that the top high quality wine producing countries are England, India – who would’ve thought
Right? Germany, ok, Slovenia, Canada, and Morocco – that’s an interesting one as well. So on, so forth
So keep an open mind when you go wine shopping next time. If we wanna find out which countries not only
produced a large quantity of wine but also do offer high quality products, we can do an “intersect” function and
find out where the intersecting rows are in two dataframes. Also it’s a handy tool to use with dplyr
So here I’ll quickly create another variable called “top.” And here I’m not filtering by any rows. I’m just gonna
do a quick group-by aggregation again. Group by and summarize, same thing as before. We want to use the
median point as a metric. So here, if you check that out, it’s actually gonna be a dataframe structure
We’re gonna transform that into a vector. I’ll overwrite it. We’re only selecting the “country” column
So there you are. These are the countries that are most highly rated in terms of their wine quality
So this includes both the top producers and otherwise. By now you should be able to properly select columns
and filter rows, and well as use ggplot to work with dplyr to create seamless visualizations. In Part 3 of this
Introduction to dplyr, we’re going to close out by introducing “intersect” and “setdiff” functions to show
you how to find overlapping and distinct values from two different data sources. We’ll see you soon
I know I did “count” as my new count variable before, so
just to switch it up, I’m gonna use the word “number” now
and doing this exact same thing
another pipe operator, and “arrange” function. Again I
want to set it to descending order since I want to find
out, I want the largest value to appear on top
And then I’m selecting the top, say, ten grapes
Oh, looks like there’s a typo
There we go. Let’s check out this variable. It’s again
another dataframe, and I want to set it to a vector
So let’s do something like this to overwrite this object
And now it’s a character vector. So these are the top ten
most represented grape varietals in this dataset
If we want to find out which varietals tend to be higher
rated than others, we can again, with the help of ggplot
visualize this pretty easily
We’re gonna do another “filter” from dplyr
and use the same syntax and filter for rows that only
have values of variety only in the “top wine” list that
we just created
And do another “group_by summarize”
We’ll call this “median” again, but I could name it anything I like
And you see that ggplot works seamlessly with the pipe
operator syntax in dplyr, so here I’m passing it directly
after all the other functions, and again, each function
takes the variable immediately preceding it as the argument. So it’s like a waterfall structure, so to speak
So here I want to order my x-axis, which is gonna be “variety”
but we want to make sure that the order is dictated by
the median of the “median” variable. Don’t be confused here
I’ll do a geom_col
function. Just make sure this is, it’s a column structure, simply counting how many observations appear in each
category
I’m gonna fill it by variety
I’m also adding some labels and title. Let’s say “variety”
and “median point.” It’s a straightforward label
Since there are some grape names that are longer
For example, sauvignon blanc, Bordeaux-style red blend
they’re not going to all appear on the graph, on the x-axis
So I’m gonna say abbreviate the x-axis
And because it is a discrete variable – I’m spelling that
out here, it’s “scale_x_discrete” and pretty straightforward
and the labels are gonna be abbreviated
Let’s see what this gives us
So not a huge dispersion here, but then again, one big
caveat when working with this set is that all the wines
were already pre-selected and they were all rated above
80. But we do see that Bordeaux-style red blend tends
to be the highest rated grape, followed by pinot noir and
syrah. So on, so forth
Finally, if you’re a savvy consumer, not just a curious
oenophile, you might want to know where the best
values are. So in other words, if you want to find bottles
of wine that are of as high quality as possible but don’t
cost so much, we can do another “intersect” function
and find the rows that appear on the top, towards the
top on the ranking list but appear towards the lower end
of things on the price list
So let’s just do that now
Let’s say we want to find out the top 15% of all wines
in terms of ranking, you should know by heart how to do
this now. Making sure it’s gonna be in descending order
according to the “points” variable
Gonna filter
Since we’re selecting the top 15%, the probability in the
quantile function is gonna be .85
So we’re saying that we want the observations that have
a value higher than 85% of the whole set. So in other
words, we’re filtering for the top 15%
There we have it
Now we’re gonna find the cheapest 15%
Now we’re arranging by price instead of points
We only do an “intersect” we want to make sure that
our two dataframes have the same numbers of rows
and without looking at how many rows there are in my
“top15p” object that I just created, I’m just gonna say for
the “cheap15percent” object I want to select however
many rows there are in my other dataframe
So instead of the “top_n” syntax I’m now using the
“head” function, which achieves a similar goal
But instead of passing, explicitly passing a number, I’m
now saying that I want to find however many rows there
are in this other dataset
So there we have it
And finally, we want to know where all the good value
wines are. So let’s do an “intersect” function
and pass the two new dataframes that I just created
There we go
And if we want to examine this object, let’s do
Let’s just call it
There you go
Looks like we have 16 bottles that come from Portugal
France, US, Italy, and Austria
that are among the top 15% in terms of rating and the
bottom 15% in terms of price. So, get more bang for
your buck, so to speak next time you go shopping you
know where to look for good values of wine
I might actually put these on my personal shopping list
before I go to the store
So that’s it for our tutorial today
There are many many other things dplyr can do
But I think the most important takeaway for today’s
video is that group-by aggregation, or “group_by,”
“summarize” functions used as a combo is great way to
slice and dice a dataset. And it really provides a lot of
insight into whatever data you have. I think a lot of
people may not realize, but this simple syntax and tool
that almost anyone could use is so powerful
So you might learn a lot of other more advanced
machine learning techniques, like neural networks
support vector machines, but always remember that a simple group-by aggregation oftentimes can achieve a
lot more than you would otherwise realize. So it really is
a foundation of skill to have and hone continuously
Well if you enjoyed today’s video, check out our other videos on YouTube and follow us on social media
We’re on Facebook, YouTube, LinkedIn, and Meetup
And be sure to check our blog at blog.datasciencedojo.com
as well. We routinely post interesting articles and tutorials such as this one on there
So, thank you for your time and see you next time

To get setup with dplyr, watch our first tutorial.

Be sure to also check our accompanying blog post here.

Full Series:
Introduction to dplyr

More Data Science Material:
[Video Series] Data Mining Fundamentals
[Blog]  Making Your Data Science Skills Marketable

(475)

Avatar
About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.

Avatar

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>