Getting started with Python and R for Data Science

In this video tutorial, we will take you through some common Python and R packages used for machine learning and data analysis, and go through a simple linear regression model. We will also help you set up Python and R on your Windows/Mac/Linux machine, run your code locally, and push your code to a GitHub repository.

Hi, welcome to this Data Science Dojo beginner tutorial on getting started
with Python and R for data science. In this beginner tutorial we’ll take you
through some common Python and R packages and libraries used for machine
learning and data analysis, as well as go through a simple linear regression model.
We’ll also help you set up Python and R on your Windows, Mac, or Linux machine,
run your code locally, and push your code to a GitHub repository.
So let’s get started with installing Python and R.
To install Python on a Windows machine, we first
need to check if our machine is 64-bit or 32-bit, as this will determine the
appropriate Python program to install. To do this, search for “about your PC”
and you’ll see if your machine is 64-bit or 32-bit. In my case, it’s 64-bit.
Next, in your web browser, type “python.org/downloads/windows” and scroll down to
the version of Python you wish to download. In my case, I’ll choose the
64-bit executable installer for the latest version.
You can go with the default
installation, or you can do a custom installation to include optional
features such as “pip”, or you can specify your path directly under C so it’s
easier to locate your Python program later on.
Then just click install.
Once Python has installed on your computer, you’ll need to add Python to your path
to be able to run Python scripts in a directory or folder.
Download Git for Windows to set your path and run the Python command. The commands used in this
program are basically the same as when using the terminal on Mac or Linux.
Alternatively, for Windows, you can use the default command prompt by searching “CMD”.
You can also set your local path by searching “environment variables”
and setting your path there.
Here’s an example of a Python script
saved in my “Documents/project 1” folder.
Using a text editor of my choice, such as
Notepad++, to write my Python code, I saved my file as a “.py” file.
Then, I open my terminal, which is in “C:\Program Files\Git\git-cmd”.
I navigate to “Documents/project 1”
and I set my local Python path.
We’ll set this up permanently using a “.bashrc” file
with the path to my Python program directly under “C:”.
Now, I simply type “py”
followed by the name of the file and extension.
If using Python 2.7,
just type “python” followed by the name of the file and extension.
If we hit enter to run this,
it produces the output of my code, which has predicted heights
using a linear regression model.
The final part of this Python Windows setup
is installing pip to be able to easily install Python packages and libraries.
pip might not have come with your installation if you didn’t customize
your installation, or it might not be included in an older version of Python.
So to get pip, type in your web browser “bootstrap.pypa.io/get-pip.py”,
right click to save it in your Python program folder, and
then run the command “python get-pip.py”.
My Python program is under “C:”.
Moving on to installing R for Windows, simply type in your browser
“cran.r-project.org/bin/windows/base” and select the 32-bit or 64-bit version.
Once it is downloaded, press “OK”
and click “next” through all the prompts.
Once R has installed on your computer, you can simply open the program
on your desktop and start typing R commands or code.
I recommend downloading RStudio, as it makes the process of editing and debugging your code easier.
Otherwise, you’re welcome to use the R command line.
To save an R file, click on “file”, “file history”, and this will save your code so you can run it later if you wish.
To set your path or working directory, simply type “setwd”
followed by the path to where you would like to store your R files locally.
You might need to use double backslashes on Windows, as these are
understood to mean separators in the path.
Now, let’s install Python on a Mac.
Go to the Mac terminal in “Finder”, “Applications”, “Utilities”.
Now we’re going to install the Xcode command line utilities,
as this will help with the installation.
So type “xcode-select --install”,
click “install”,
and “agree”.
Now, we’re going to use Homebrew to install Python.
So type “/usr/bin/ruby”
and we’re going to use curl
and we’re going to type the URL to Homebrew on GitHub.
Press return
and enter your password if need be.
Next, add the path. We will create a “.bashrc” file to permanently add the path.
If you get an error message stating “cannot write to path”, try the “sudo chown”
command accompanying this video. All commands can be copied and pasted, as
they accompany this video.
Next we’ll install Python, so just type “brew install python”,
or “python3” if you’re using Python 3.
We’ll also add this to our path,
so we’ll create another “.bashrc” file.
Now, to check if pip is installed as part of your
Python program, simply type “which pip” and it’ll show you the location where
pip is installed. If you want to check the version, just type “pip -V”
and it’ll show you which version of pip you’ve installed. As mentioned, pip
is useful for easily installing Python packages and libraries.
Moving on to R: to install this on a Mac after installing Homebrew, simply type
“brew tap homebrew/science”
and then type “brew install r”.
To open the R command line, simply type “r” and enter.
Now let’s install Python and R on Linux.
I’m using Ubuntu. Later versions of Ubuntu might already have
Python installed, but I’ll take you through the process anyway.
So open your terminal.
Now we’re going to type “sudo apt-get install python3.6” (or “python2.7”).
Next we’re going to type “sudo apt-get install python-setuptools”.
Lastly, install pip to easily install Python libraries and packages by typing
“sudo easy_install pip”.
To install R on Linux, simply type
“sudo apt-get -y install r-base”.
Now type uppercase “R” and enter to open the R command line.
Now that we’ve got the setup and installation part of this tutorial out of the way, we can move
on to more fun stuff. Let’s have a quick play with some data to get you familiar
with some key data analysis and linear regression concepts, as well as basic
scripting for this. I’m going to go through an example of a simple linear
regression in Python and R using simulated data on people’s height in
centimeters and their weight in kilograms. The model is based on a
formula, which can be produced using Python and R functions, that gives a
predicted outcome or estimated y-value given a certain x-value, a certain
constant, and a slope. Here is what’s called the “regression line”. I like to think of
it as a line of predicted values. Along the x-axis, for a given x-value, the line
predicts the y-value to fall about here in height. The actual values are slightly
above and below the line, but the model is generalized enough to take into
account where most cases would probably fall. The formula gives a constant value
here, which we add to a given x-value multiplied by a given coefficient
or slope. The constant means when x is at 0, y is at this value, and the slope means
for every one-unit increase in x, y increases by this number of units. So we
can use this formula to plug in any new x-value of a person’s weight to predict
their height or y-value. Of course, there are many other factors, not only weight,
that could influence a person’s height; we’re just looking at a very
simple model to get started with.
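
The formula described here (a constant plus a slope multiplied by x) can be written as a small Python function. This is only a minimal sketch: "predict_height" is a hypothetical helper name, and the constant of 100 and slope of 1.0 are made-up illustrative values, not the ones the tutorial's model produces.

```python
def predict_height(weight_kg, constant, slope):
    """Return the predicted height (y) for a given weight (x).

    Implements y = constant + slope * x, the equation of the regression line.
    """
    return constant + slope * weight_kg


# Illustrative values only (assumptions, not taken from the tutorial's data).
constant = 100.0  # predicted height in cm when weight is 0
slope = 1.0       # extra cm of height per additional kg of weight

print(predict_height(70, constant, slope))  # 100.0 + 1.0 * 70 = 170.0
```

Raising the weight by one unit raises the prediction by exactly the slope, which is the "for every one-unit increase in x" reading of the coefficient.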
To implement linear regression in Python, we first need to install a few commonly
used packages. We’ll open our terminal and install “sklearn” for modeling.
If using Python 2.7, just type “python -m pip install”.
Now, we’re going to pip install pandas for data importing.
We’ll also install matplotlib for plotting.
The last package we need to install is “scipy”.
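
Once the four packages are installed, one way to confirm that Python can find them is a short check like the one below. This is a sketch rather than part of the tutorial's own script: "check_packages" is a hypothetical helper, and it only tests whether each import name can be located, not whether the package works.

```python
import importlib.util

# Import names of the four packages installed above.
# Note: the "sklearn" import name is provided by the scikit-learn distribution.
REQUIRED = ["sklearn", "pandas", "matplotlib", "scipy"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, found in status.items():
    print(name, "OK" if found else "MISSING")
```

Any package reported as MISSING can be installed with pip as described above and the check run again.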
Next, go to your text editor and save a new Python file in “Documents/project 1” or a folder of your choice.
I’ll just call my file “LM model” and save it as a Python file.
Also, don’t forget to cd into this folder in the terminal so you can run your script later.
Now we’re going to import these packages at the beginning of the script
when it runs, so at the top of the file we’ll type “from sklearn import linear_model”.
That’s our linear regression tool.
We’re also going to import DataFrame from pandas.
We also want to import pandas and use it as “pd”,
and we want to import matplotlib and use it as “plt”.
Now we need to read in our data, which you can download as part of this
tutorial and save in your current folder. We’ll use the pandas “read_table” function for this.
So we’ll put our data in a variable, and we’ll just call it input data,
and we’ll use the “read_table” function.
We’ll give it the data file name and extension in our folder.
It’s comma separated, as it’s a CSV file,
and we have headers, and they start at line 0, and we’ll give our x and y headers specific names.
This automatically infers the data types for each column too.
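
As a rough sketch of this step, the snippet below reads comma-separated text with a header row and renames the x and y columns. To keep it self-contained it reads a few made-up rows from a string instead of the tutorial's downloadable file, so the values and the "weight" and "height" column names are assumptions; with the real data you would pass the file name in place of the "io.StringIO" object.

```python
import io

import pandas as pd

# A few made-up rows standing in for the tutorial's downloadable CSV file.
csv_text = """x,y
50,150
60,161
70,169
80,181
"""

# sep="," because the file is comma separated; header=0 says the first line
# holds the column headers; names= replaces them with more descriptive ones.
input_data = pd.read_table(
    io.StringIO(csv_text), sep=",", header=0, names=["weight", "height"]
)

print(input_data.dtypes)  # column types are inferred automatically
print(input_data.head())
```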
Before applying a linear regression model, let’s plot the data using matplotlib’s
plot function to see if the data naturally follows a linear pattern and
the normal distribution, as linear regression is not appropriate or useful
for datasets that don’t follow this assumption.
So we’ll use a scatter plot,
and we’re just plotting weight versus height. So weight is on our x-axis
and height is on our y-axis.
We’ll need to show this graph so it can render on our screen.
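
A scatter plot along these lines might look like the sketch below. The sample values and axis labels are assumptions made for illustration, and the call that opens the plot window is left commented out so the snippet can also run where no display is available.

```python
import matplotlib.pyplot as plt

# Made-up sample values (assumptions) in place of the tutorial's dataset.
weights_kg = [50, 60, 70, 80, 90]
heights_cm = [150, 161, 169, 181, 190]

fig, ax = plt.subplots()
ax.scatter(weights_kg, heights_cm)  # weight on the x-axis, height on the y-axis
ax.set_xlabel("Weight (kg)")
ax.set_ylabel("Height (cm)")
ax.set_title("Weight versus height")

# In a script run from the terminal, uncomment the next line to open the
# plot window (it blocks until the window is closed).
# plt.show()
```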
Now save and run the script.
As we can see, the data is linear and follows a normal distribution, making
linear regression appropriate to use on these data.
Now we’ll define our x predictor variable, weight, and our y outcome variable, height.
So we’ll use “pd” as pandas and we’ll use the DataFrame function,
and we’ll use weight as our predictor,
and we’ll make height our outcome variable.
Now we’ll fit a model to the
data using the fit function and use this to predict height for a given weight.
So we’re using a linear regression model,
and we’ll fit the model to the data.
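
The define-and-fit steps can be sketched as follows. The data here is made up and perfectly linear (height = 100 + 1.0 × weight) so that the fitted constant and slope are easy to check by eye; the variable names are assumptions, and the tutorial's real dataset will give different numbers.

```python
import pandas as pd
from sklearn import linear_model

# Made-up, perfectly linear sample data (an assumption for illustration):
# height = 100 + 1.0 * weight.
input_data = pd.DataFrame(
    {"weight": [50, 60, 70, 80, 90], "height": [150, 160, 170, 180, 190]}
)

x = pd.DataFrame(input_data["weight"])  # predictor, one column
y = pd.DataFrame(input_data["height"])  # outcome, one column

model = linear_model.LinearRegression()
model.fit(x, y)

# With a one-column outcome, intercept_ and coef_ are arrays; take the scalars.
constant = float(model.intercept_[0])
slope = float(model.coef_[0][0])
print(round(constant, 2), round(slope, 2))
```

The fitted constant and slope are the two numbers the regression formula needs to turn any new weight into a predicted height.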
We can now compare the first, say, six predicted values using the predict
function with the actual height values to see if they’re on par.
So first we’re going to get all the predicted values,
and we’re going to use our predictor variable to predict the outcome.
We’ll just print some subheadings to differentiate the list of predicted
values from the actual,
and we’ll have a look at the first 0 to 6 predictions and we’ll compare
with the first 0 to 6 actual values.
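
Here is a minimal sketch of that comparison, again on made-up data (an assumption) with a little scatter around a straight line. It fits the model, gets all the predictions, and prints the first six next to the first six actual heights under simple subheadings.

```python
import pandas as pd
from sklearn import linear_model

# Made-up sample data (an assumption), with some scatter around a line.
input_data = pd.DataFrame(
    {
        "weight": [50, 55, 60, 65, 70, 75, 80, 85],
        "height": [151, 154, 161, 164, 171, 174, 181, 184],
    }
)
x = pd.DataFrame(input_data["weight"])
y = pd.DataFrame(input_data["height"])

model = linear_model.LinearRegression().fit(x, y)

# Get all predicted heights, then look at the first six next to the actuals.
predictions = model.predict(x)

print("Predicted heights:")
print(predictions[0:6])
print("Actual heights:")
print(y[0:6])
```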
All right, we’ll save and run the script.
A quick eyeball of the first few predictions against the actual values shows the model was not far off
the mark, which is good. However, to properly assess a model, we can use
measures such as R-squared, which is the percentage of explained variance.
So we’ll go back to our script and we’re going to use the score function to get
the R-squared,
and we want to print this, obviously.
Now we’re just going to comment out the
above lines, as we no longer want to view these.
We’ll save and run our script again.
As we can see, a high R-squared shows the model explained most or nearly
all of the variance, which is good. However, relying solely on R-squared is
probably not good enough when assessing and measuring our model’s predictions.
Sometimes it can be misleading to look at the R-squared, but the course will go
through other measures you can use.
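
To make "percentage of explained variance" concrete, the sketch below computes R-squared by hand in plain Python, as 1 minus the residual sum of squares divided by the total sum of squares. The tutorial itself gets this number from the model's score function; "r_squared" is a hypothetical helper and the values are made up for illustration.

```python
def r_squared(actual, predicted):
    """Return R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Made-up actual and predicted heights in cm (assumptions for illustration).
actual = [150, 160, 170, 180, 190]
predicted = [151, 159, 170, 181, 189]

print(r_squared(actual, predicted))  # 1 - 4 / 1000 = 0.996
```

A value close to 1 means the predictions leave very little of the variation in height unexplained.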
To perform the same analysis in R, we’ll
first install a commonly used R package, ggplot2, which is used for effectively
visualizing and analyzing data.
I’ll select a CRAN mirror that’s close to me.
We need to load ggplot2 whenever we want to use it.
We’ll read in our data using the “read.table” function.
We’ll put our data in a variable,
we use “read.table”,
we’ll give it our file in our current working directory,
it’s comma separated,
and we do have headers, and we’ll just use the default header names x and y.
This automatically infers data types too.
We’ll also attach our data frame so we
can refer to column headers or variable names without having to refer to the
name of our data each time, making this more convenient.
Now we’ll plot the data to see its normal distribution, but we can also use
ggplot2 to plot the regression line or the line of best fit.
So we’ll plot our x and y, which is weight and height,
and in the smooth function, we’ll specify a linear model.
As we could see before,
the actual heights are close to the predictions of the line.
Implementing a simple linear regression in R
is quite easy using the “lm” function.
Now, to see the first few predictions of height, we’ll use the predict function.
We first need to get all of the predictions,
and we’re just going to print the first few to have a quick look,
so the first 0 to 6,
and we’ll compare with our actual values.
As seen before, for the first few cases, the predictions are pretty close.
To print the R-squared, or percentage of explained variance, for assessing the
model, we’ll use summary.
As seen before, it explains nearly all the variance, but it’s a good idea to
also look at errors or other measures for this. Finally, now that we’re finished,
we’ll detach our data.
In the last part of this tutorial, we’ll push our code to a GitHub repository so
you can share your code publicly or store it privately if you wish. You can
create a GitHub account for free. You can also follow Data Science Dojo to clone
or access a copy of the code provided as part of the course material.
Once you have created an account, add a new repository, without initializing it, via
the GitHub website. The instructions to push your code to GitHub are on the website,
but I’ll take you through the process anyway. First, open your terminal and cd
into your current project directory. You’ll need to configure your user name
and user email.
Now configure your username.
We’ll initialize our project directory as our git repository.
Then we’ll add all
files in our project folder. We’re not pushing it live yet; it’s just selecting the files.
Commit your files to track the first submission, with a message, should
you wish to publish updates later on.
So I’m just gonna say “first go at implementing
simple linear regression”.
As you can see, all the files in the “project 1” folder are there.
Now we’re going to give the URL of our main repository,
so go to the main page of your GitHub repo
and copy the URL, and we’re going to paste it
into the terminal when adding a remote repo.
Finally, we’re going to push our code to the repo’s master branch on GitHub.
Now, if you have a look at your GitHub repo, you can see all your files are there.
All the work we have done in this tutorial is here.
Alternatively, after
initializing your GitHub repo via the site, you can simply drag and drop your
project folder onto the main page of your repo.
Now that you’ve gone through the basics, you should feel ready to dive into the
course and gain a deeper and wider understanding of data science.
You know how to set up Python and R on your machine, how to do basic scripting for
reading and visualizing data, how to apply a model and assess it, and now you
can share your hacks and projects on GitHub. The data used in this tutorial,
the coded examples, the commands, the URLs to programs, and so on are all
accompanying this video. My name is Rebecca Merrett. Feel free to reach out
to me by commenting on this video. I’m more than happy to help you get ready
before you start your course. Thanks for watching, and happy analyzing!

Table of Contents:
– Installing Python on Windows: 1:09
– Installing R on Windows: 4:16

– Installing Python on Mac: 5:39
– Installing R on Mac: 8:10

– Installing Python on Linux: 8:41
– Installing R on Linux: 9:48

– Simple linear regression model in Python: 11:59
– Simple linear regression model in R: 21:01

– Pushing code to GitHub Repository: 25:26

Repository:
All commands, scripts, and data

Downloads:
· Python Programming Language
· get-pip.py Script
· R Programming Language
· RStudio
· Git for Windows

Text Editor:
Notepad++
or
Sublime Text

More Data Science Material:
[Video Series] Beginning R Programming
[Video Series] Introduction to Data Mining
[Video] Time Series in Python Part1: Read and Transform your Data
[Blog] R Programming: An Introduction


Rebecca Merrett
About The Author
- Rebecca holds a bachelor’s degree of information and media from the University of Technology Sydney and a post graduate diploma in mathematics and statistics from the University of Southern Queensland. She has a background in technical writing for games dev and has written for tech publications.
