dplyr Introduction Part 1 | Setup & Data Preparation

dplyr is a great tool for data manipulation. It makes your data analysis process a lot more efficient. Even better, it’s fairly simple to learn and start applying immediately to your work! Oftentimes, with just a few elegant lines of code, your data becomes that much easier to dissect and analyze. For these reasons, it is an essential and foundational skill to master for any aspiring data scientist. In Part 1, we cover how to get set up with R, load the wine dataset, and install the ggplot2 and dplyr packages.

Often one may be surprised how a few easy-to-learn functions can make the data analysis process that much more efficient. That is certainly the case with dplyr. In this series, we will teach you how to use this incredibly useful package to munge data, demonstrating with a Kaggle dataset on wine ratings.

Hello everyone, this is Ningxi with Data Science Dojo. Today we’re going to start
introducing you to a very powerful tool in R called dplyr that’s widely
used for data manipulation and analysis. It’s going to make your data munging
process that much more efficient and easier. We’re going to start talking
about how to set up dplyr and also cover the data preparation process in
this series. As an overview of this series, we’re going to talk about why we
use dplyr and what it does. We’re also going to introduce you to some
basic functions that dplyr provides, including “arrange,” “group_by,” “summarize,”
“select,” “filter,” “intersect,” and “setdiff.” Through the series you’re going to learn
how to arrange data, how to do group-by aggregation, as well as how to subset
columns and rows and how to find overlapping and non-overlapping values
from two different data sources. The goal of watching this dplyr series is
that you should be able to use the functions we introduce to perform basic
data manipulation tasks at hand. You should also be able to at a high level
start thinking about the data analysis as a process that you can divide and
conquer into subgroups. So instead of looking at one massive dataset as one
standalone entity, after learning group-by aggregation you should start thinking about how to
divide up the data you have into subparts and dissect it that way. It’s only
going to make your job easier. We’re going to demonstrate how to use dplyr while
working with a real-world dataset from Kaggle on wine ratings.
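
To make the divide-and-conquer idea concrete, here is a minimal sketch of a group-by aggregation. It assumes dplyr is already installed (installation is covered later in this part), and `toy_wine` with its values is a made-up stand-in, not the Kaggle data.

```r
library(dplyr)  # assumes dplyr is already installed

# Toy stand-in for the wine data; the values are made up for illustration.
toy_wine <- data.frame(
  country = c("France", "France", "Spain", "Spain", "Italy"),
  points  = c(90, 88, 86, 92, 87),
  stringsAsFactors = FALSE
)

# Split the rows by country, then collapse each group into one summary row.
by_country <- toy_wine %>%
  group_by(country) %>%
  summarize(avg_points = mean(points), n_wines = n())

print(by_country)
```

Each country becomes its own subgroup, and the summary is computed once per subgroup rather than once for the whole table.
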
The goal of this dplyr series is to get beginners up to speed quickly and help
you guys select segments that you find most useful so you don’t have to watch
every single video in the series. You could just pick and choose whatever
segment you find most relevant to the task you have at hand. We don’t have hard
prerequisites for you prior to watching the series, but you should
have some familiarity with basic R syntax. You should be able to code very
simple commands in R before watching this series, since learning how to use
packages such as dplyr and ggplot2 builds upon a foundation of
understanding how the R programming language works. If you are not familiar
with it, please check out our YouTube page. We have a whole series on
introduction to R in our channel. So please go to that page and
watch that series before watching the dplyr series here. In this video as
Part 1 of the series we’re going to demonstrate how to get R, RStudio, as
well as the wine ratings dataset from Kaggle. We’ll also walk you through how to
install and load dplyr and ggplot2, including how to properly load foreign
characters into RStudio. A little bit about myself: I started my data science
journey after a career in finance because I wanted to learn more
data-driven techniques. I create content for Data Science Dojo as well as teach
part of our 5-day in-person bootcamp that we host around the world.
These are tailored toward working professionals and they’re meant to get
you up to speed so you can apply data science techniques in your daily
work immediately upon graduation. So I encourage you to check that out. I enjoy
using data to uncover interesting and fun things in life. So as you can see
we’re going to talk about wine ratings today. I hope you guys get a lot out of
it. OK, so now that you know what to expect, let’s get right into it. First we’re
going to show you how to download RStudio and R if you don’t have that
on your computer. Just go to Google and type in RStudio
and the first link that comes up should be the place where we can get it from. So I
already typed this in my search bar. Come over here, click Download RStudio. Just
choose the first option. Click Download. Make sure that you also download R
before RStudio, because RStudio is just an IDE that supports the underlying
R language. So make sure you download both. Come over here to download R.
Choose your respective operating system and go through the prompts. I’m gonna go
back to the page for RStudio. So once you have downloaded R from CRAN,
the network that hosts the R programming language, you’re going to
come back to the RStudio page, choose your operating system, and
download the IDE. So for instance, for Mac you click here and for Windows here, and so on
and so forth. I already have it on my computer so I’m not going to click
through it. But you pretty much just hit “return” or “enter” continuously until
you have it installed on your computer. So once you have downloaded
both R and RStudio, we’re going to get our dataset from Kaggle.
So just go back to Google, type in “wine ratings Kaggle,” and that should be the
first link that comes up. So clicking this link will take you to the Kaggle
page where this dataset resides. If you’re interested you can scroll down
and just read about the background of this set and why this user decided to provide it here.
It also shows you the different features that are in the set and some related
links. But we’re going to go to the Data tab,
choose the second option from the left-hand side toolbar, and click
Download. I’m gonna rename it “wine” just because it’s easier. Save and unzip the file.
So because I’m going to load this into my RStudio later and I don’t want to
have to type in all these words I’m just gonna rename this CSV file and just call
it “wine,” making it easier for myself to type in RStudio later on. So once you
have both RStudio and the dataset downloaded, open up RStudio. Make sure to
go into the directory where this dataset is saved. So for me that’s on my desktop,
so I’m gonna go to my desktop. And this is very important: make sure to go over here,
click “More” and “Set As Working Directory.” So this will make sure that your current
working directory is set to where the dataset is saved, so when we read the
CSV into RStudio, R will know where to find the file. Otherwise it
won’t be able to find it. So once I’ve done that I’m gonna create a new object. Let’s just
call it “wine” for simplicity. I’m gonna set this object to the content of the CSV.
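
The working-directory step also has a console equivalent, which can be handy if you prefer typing to clicking. This is a minimal sketch: the `~/Desktop` folder is only an assumption taken from where the file was saved in the video, and `data_dir` and `has_file` are illustrative names.

```r
# Hypothetical folder; replace it with wherever you saved wine.csv.
data_dir <- path.expand("~/Desktop")

# Only switch if the folder exists, so the sketch is safe to run anywhere.
if (dir.exists(data_dir)) {
  setwd(data_dir)
}

getwd()                              # prints the current working directory
has_file <- file.exists("wine.csv")  # TRUE once the renamed file is in that folder
has_file
```

If `has_file` comes back FALSE, R will not find the CSV by its bare file name, which is the situation the "Set As Working Directory" step is meant to avoid.
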
And because I have renamed the dataset previously, now I can just do “wine.csv”
instead of that whole long name that we started out with. Also make
sure to set this parameter “stringsAsFactors” to FALSE because otherwise
R will treat all the character columns as factors, which we don’t want, since there are
a lot of columns that contain text: different country names, tasting
notes, and different wineries’ names. There’s just no need to treat them
as factors, so setting “stringsAsFactors = FALSE” makes sure that all
the text is loaded as characters instead of factors. There’s one more thing because of the
nature of this set. Go back to the Kaggle page. If you just play around and
quickly browse the first 100 rows, you’ll see that a lot of these wines are from
European countries, and the foreign languages use different accents. So for
instance here: French, Spanish, Italian, etc. all have accents on certain
letters. And if we don’t do anything and just read the dataset as is, it’s going
to mess up all these words that have accents. So we need to do something
special here: pass another parameter and set it to
“encoding = "UTF-8".” So this makes sure that all the characters are loaded
correctly. So I’m going to run that in my console. As you can see here, it means
the dataset is being loaded. And now it’s done. If you want to take a quick
look at what this dataset looks like, you can do “View.” Make sure the V is in uppercase.
So do another quick scan of the dataset we just loaded, and everything looks good.
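
The loading step described above can be sketched as follows. So that it runs without the Kaggle download, the sketch writes a tiny made-up CSV to a temporary file and reads that instead; `toy_lines`, `toy_path`, and `wine_toy` are illustrative names, and the rows are invented. Note that in R 4.0.0 and later, “stringsAsFactors” already defaults to FALSE, so passing it explicitly is harmless there and necessary on older versions.

```r
# Toy stand-in for wine.csv so the sketch runs anywhere; the rows are made up.
toy_lines <- c("country,winery,points",
               "France,Ch\u00e2teau Exemple,90",
               "Spain,Bodega Ejemplo,88")
toy_path <- tempfile(fileext = ".csv")
writeLines(enc2utf8(toy_lines), toy_path, useBytes = TRUE)  # write UTF-8 bytes

# Same two arguments as in the video, pointed at the toy file.
wine_toy <- read.csv(toy_path,
                     stringsAsFactors = FALSE,  # keep text columns as character
                     encoding = "UTF-8")        # mark input strings as UTF-8

str(wine_toy)      # text columns come in as chr rather than Factor
# View(wine_toy)   # capital V; opens the data viewer in an interactive RStudio session
```

With the real file in the working directory, the equivalent call would be `wine <- read.csv("wine.csv", stringsAsFactors = FALSE, encoding = "UTF-8")`.
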
All the accents are imported correctly. This probably goes without
saying but this is a dplyr video so we’re going to need to install dplyr
into RStudio. We’re also going to use ggplot2 to do some basic
visualizations, so let’s do “install.packages” and just type in dplyr in quotes.
Also gonna install ggplot2.
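
The install step, together with the loading step the video turns to next, can be sketched like this. The `requireNamespace` check is an addition that is not in the video; it simply skips the download when a package is already present. The sketch assumes an internet connection and a configured CRAN mirror, which RStudio normally provides.

```r
# Install each package only if it is missing (needs an internet connection).
for (pkg in c("dplyr", "ggplot2")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load the packages into the current session; installing alone does not do this.
library(dplyr)
library(ggplot2)
```

Installing is a one-time step per machine, while the two `library` calls need to be run again in every new R session.
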
So once these two packages are installed, we also need to load them explicitly. So use
the library function, and this time around you don’t need quotes because
we already have these packages installed. If you go
over here you can see that these are available in RStudio but they are not
loaded; there’s no check mark here. So we’re gonna actually load the package by
using the library function. I’m gonna do the same thing for ggplot2,
and now we’re good to go. You’ll also see that we have this weird
extra column named “X,” so we’re going to get rid of that momentarily. We also have
this column named “description” that seems to be sommeliers’ textual statements
about how these wines taste. And because this tutorial is not going to focus on
natural language processing, we will soon drop this column as well. So this is a
fairly large dataset. You can see over here, it has over 150,000 observations of
11 variables, and we’re not going to use all the columns, so let’s clean the set
now. I will overwrite our original data frame by keeping only the columns we
want. So we can do this with bracket indexing. Leaving the row index blank means
we want all the rows, and the minus sign means we don’t
want these columns: the first column and the third, which was “description,” I believe.
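
The exact command typed on screen is not captured in the transcript, so the following is a reconstruction from the narration (all rows, minus the first and third columns) rather than a verbatim copy. `wine_toy` is a made-up stand-in whose first three columns mirror the order described in the video.

```r
# Toy stand-in with the column order described in the video; values are made up.
wine_toy <- data.frame(
  X           = c(0, 1),
  country     = c("France", "Spain"),
  description = c("Toy tasting note A", "Toy tasting note B"),
  points      = c(90, 88),
  stringsAsFactors = FALSE
)

# Blank before the comma keeps all rows; -c(1, 3) drops columns 1 and 3.
wine_toy <- wine_toy[, -c(1, 3)]

names(wine_toy)
```

On the real data frame the same idea would read `wine <- wine[, -c(1, 3)]`. Dropping by position depends on the column order, so it is worth checking `names(wine)` before running it.
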
Now if we do View again, you can see that the weird “X” column and the “description”
column have been removed. I hope that by watching today’s video you’re able to
get up and running with using dplyr. In Part 2 of this series
we’re going to cover how to select and filter rows as well as perform some
basic visualizations with ggplot2. You will see that ggplot2 and dplyr often
work seamlessly together to create neat data analysis and visualization results
all in one. So thank you for watching and stay tuned for Part 2 of this series.

Items needed:
R Programming Language

dplyr Package:

ggplot2 Package:

Be sure to also check our accompanying blog post here.

Never used R? Watch our series on:
Introduction to R

More Data Science Material:
[Video] Getting started with Python and R for Data Science
[Blog]  R vs Python: Which is better for Data Science?


About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.

