Intersect & Setdiff – Combining Datasets | Introduction to dplyr Part 3

We introduce functions that make it easy to find overlapping and distinct values from two different data sources, intersect and setdiff. These two functions let you see the shared and unique elements from different vectors, making it easy to spot commonalities and differences. After watching this video, you’ll walk away feeling more empowered to tackle large datasets and pinpoint how much similarity they share.

dplyr is a a great tool to perform data manipulation. It makes your data analysis process a lot more efficient. Even better, it’s fairly simple to learn and start applying immediately to your work! Oftentimes, with just a few elegant lines of code, your data becomes that much easier to dissect and analyze. For these reasons, it is an essential skill to master for any aspiring data scientist.

Hello everyone today we’re going to talk about how to find overlapping and non-
overlapping values from two different data sources using dplyr this is
going to build upon the same data set on wine ratings that we’ve used in
previous videos. If we wanted to find out which rows appear in both the selected
countries list and this top list the global top list and we can do like I
mentioned an intersect function and pass the two vectors in and let’s check out
who the winners are. So these ten countries are both in the top list and
the selected country list. So if you want to further refine the top list and
select only the top, let’s say, let’s do ten again contenders. We’ll do, I’ll overwrite
this top object and only select the first ten entries. So if we examine the
top object again, here are the ones that still remain and let’s do the intersect
function again
and check out this variable. Oh sorry I should have done top instead of both
we see that only three countries appear both on the top 10 producing list and the
top 10 global best list. So next time you go shopping pay attention to French
Austrian and Australian wines. As a complement to intersect we can find out
which observations are not in both lists so in other words which entries only
appear in one of the lists. You do setdiff, which is now searching for
variables that that don’t belong to two bills lists so here we could have
probably guessed that it would be the remaining 7 countries but I typed out
the syntax just so you can see clearly what I meant here so these seven
countries are in the top 10 producing country or rather they are not in the
top 10 producing countries list but they do produce high-quality wines
nonetheless. So after dissecting this data frame from a country perspective we
can now turn to focusing more on the grapes. So in other words if you want to
find out which are the most represented grape varietals in this data set we can
do something as follows. Let’s say top wine, creating a new object again
starting from all the way from the original dataset wine, and do another
group-by aggregation. This time we’re gonna group by variety over here instead
of country and then summarize as usual I know I did count as my new count
variable before so just to switch it I’m gonna use the word
number now and doing the exact same thing
another pipe operator and arrange function again I want to set it to
descending order since I want to find out
I want the largest value to appear on top and then I’m selecting the top say
10 grapes. Oh looks like there is a typo
There we go. Let’s check out this variable it’s again another data frame
and I want to set it to a vector so let’s do something like this to
overwrite this object
and now it’s a character vector so these are the top ten most represented grape
varietals in this data set. If we want to find out which varietals tend to be
higher rated than others we can again with the help of ggplot visualize this
pretty easily. We’re gonna do another filter from dplyr and use the same
syntax and filter for rows that only that have values of variety only in the
top wine list that we just created and do another group_by summarize
I’m gonna call this median again but I could name it anything I like and you
see that ggplot works seamlessly with the pipe operator syntax in dplyr
so here I’m passing it directly after all the other functions and again each
function takes the variable immediately preceding it as the argument
so it’s like waterfall structure so to speak
so here I want to order my x-axis which is gonna be variety but we want to
make sure that the order is dictated by the median of the median variable
don’t be confused here and then to a geom_col function but just make sure this is
it’s gonna be column structure simply counting how many observations appear in
each category. I’m gonna fill it by variety also adding some labels and
title, say variety
and median point it’s a straightforward label. Since there’s some grape names
that are longer. For example sauvignon blanc, Bordeaux-style red blend, they’re
not going to all appear on the graph on the x-axis so I’m gonna say abbreviate
the x-axis and because it is a discrete variable I’m spelling that out here it’s
a scale X discreet and pretty straightforward and the labels are gonna
be abbreviated let’s see what this gives us
so not a huge dispersion here but then again one big caveat when working with this
set is that all the wines were already pre-selected
and they were all rated above 80 but we do see that Bordeaux-style red blend
tends to be the highest rated grape followed by Pinot Noir and Syrah so on
so forth Finally, if you’re a savvy consumer not
just a curious oenophile, you might want to know where the best values are. So in
other words if you want to find bottles of wine that are of as high quality as
possible but don’t cost that much we can do another intersect function and find
the rows that appear on the top towards the top on the, on the ranking list but
appear towards the lower end of things on the price list so let’s do that now
let’s say we want to find out the top 15% of all wines in terms of ranking you
should know by heart how to do this now
making sure it’s gonna be descending order according to the points variable
I’m gonna filter
since we’re selecting the top 15% of the probability, in the quantile function it’s
gonna be .85 so we’re saying that we want the observations that have a
value higher than 85% of the whole set so in other words we’re
filtering for the top 15%. There we have it. Now we’re gonna find the cheapest 15%
now we’ll arrange by price instead of points. When we do an intersect we
want to make sure that our two data frames have the same numbers of rows and
without looking at how many rows there are in my top15p object that I just
created I’m just gonna say for the cheap 15 percent object I want to select
however many rows there are in my other data frame. So instead of the “top_n”
syntax I’m now using the head function which achieves a similar goal
by instead of passing explicitly passing a number I’m now saying that I want to
find however many rows there are in this other data set. So there we
have it and finally we want to know where all the good value wines are so
let’s do an intersect function and pass the two new data frames that I just
created. There we go. And if we want to examine this object
let’s do I just call it so there you go it looks like we have 16 bottles
that come from Portugal, France, US, Italy, and Austria that are among the top 15%
in terms of rating and the bottom 50% in terms of price. So get
get more bang for your buck so to speak. Next time you go shopping you know where
to look for good values of wine. i might actually put these on my personal
shopping list before I go to the store. That’s it for our tutorial today. There
are many many other things dplyr can do but I think the most important
takeaway for today’s video is that group-by aggregation, or group_by summarize
functions, used as a combo is a great way to slice and dice a data set and it
really provides a lot of insight into whatever data you have. In the next video
we’re going to talk about how to do feature engineering using dplyr
including how to create new columns based on existing features in a data set
we also learn different ways to join data sets including full join, inner join
left join, and right join. Thank you for watching today’s video and see you next time

To get setup with dplyr, watch our first tutorial.

Be sure to also check our accompanying blog post here.

Full Series:
Introduction to dplyr

More Data Science Material:
[Video] Experiment Management for Machine Learning
[Blog]  Importance of Data Normalization Prior to Analytics

(342)

Avatar
About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.

Avatar

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>