Web Scraping in R Part 1 | Writing your Script in rvest

In Part 1 of this introduction to web scraping in R, you will learn how to write standard web scraping commands in R, filter timely data based on time differences, analyze or summarize key information in the text, and send an email alert of the results of your analysis.

Hi there, welcome to this Data Science Dojo video tutorial on automating the
tasks of web scraping for analyzing text. In this video tutorial we’ll be web
scraping for text analysis periodically. There are many applications that require
text analysis on recent or timely content, or capturing changing events and
commentary, or analyzing trends in real time, and so on. As fun as it is to do web scraping as an academic exercise or one-off analysis, it’s not useful when you want to use timely or frequently updated data. So I’ll take you
through the process of writing standard web scraping commands in R,
filtering timely data based on time diffs, analyzing and summarizing key
information in the text and sending an email alert of the results of your
analysis. Then in Part 2 I’ll show you how to automate running your script
every hour or periodically so you can run this in the background of your
computer and free yourself to work on more interesting tasks. So let’s imagine
you would like to tap into news sources to analyze the events happening in Bitcoin, or in any other area that frequently changes by the hour. These events could be analyzed to summarize the key changes or movements, read the overall sentiment of recent discussions, capture important events, and so on. So you need a collection of recent Bitcoin events or news scraped every hour so you can analyze them. We’re going to use MarketWatch’s Bitcoin articles as an example data source, but it could be any other news website or any
other topic area. If you check out our blog tutorial linked below this
video, you’ll see an example of using Reddit to scrape political news and
events every hour, however just keep in mind whenever using a repository such as
Reddit, it’s easy to filter pages that were published within a time frame as
they usually marked as published X hours ago, X minutes ago but when you’re
dealing with dates in different time zones, it’s not that simple.
So we’ll also tackle this problem in this video tutorial. So let’s start with
a quick demonstration of scraping the main head and body text of a single web
page, just to get familiar with the basic commands. We’re going to use the rvest library for this, so if you just uncomment this and run it you’ll install the library; this also applies to all the other libraries that we’re going to use in the tutorial. And don’t forget to load it into R. So we’re just going to scrape a single web page. If we look at our example page here, we’re just going to use this one as an example, so copy that URL and we’re going to call it our MarketWatch webpage. I’m going to use the read_html function and we’re going to feed it our data source. Okay, great.
Now that we’ve read in our data source we want to scrape the title of the
webpage, so how do we get the title? Basically, look at the source code and
search for the title text, Bitcoin jumps. So as you can see the title lies within
the title tags. So what this means is the program’s going to search for the title
tags and then grab everything in between these tags. So let’s go ahead and write
this so we’ll refer to our source. And we want to look for the HTML node called
“title” and we want to grab the text. Okay, great.
Now when we run this command it should output the title of the webpage, so let’s go ahead and do this. And as you can see, it successfully grabbed the title.
Now what we want to do is grab the body. The body text usually lies within the paragraph tags of a web page, so we’re going to write a similar command here on our MarketWatch webpage: we want to get all nodes, or all instances, of the paragraph tag, and we want to grab the text. Okay, great. When we run this command we should get all the text that lies within the paragraph tags, so let’s go ahead and run this, and I’ll show you up here. Okay, great, it looks like we grabbed all the body text that was lying within those paragraph tags.
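In case you’re following along in your own editor, here’s a minimal sketch of this single-page demo. The article URL is just a placeholder for whatever MarketWatch story you’re using.

# install.packages("rvest")   # uncomment to install the library
library(rvest)

# Read in the example article (placeholder URL - substitute your own)
marketwatch_wbpg <- read_html("https://www.marketwatch.com/story/your-example-bitcoin-article")

# Title: grab everything between the <title> tags
marketwatch_wbpg %>%
  html_node("title") %>%
  html_text()

# Body: grab the text inside every paragraph (<p>) tag
marketwatch_wbpg %>%
  html_nodes("p") %>%
  html_text()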
Now that we’ve had a quick play, let’s get right into it. We’re going to read in our source; our source in this case is basically a search results page of everything on Bitcoin that was recently published, so let’s go ahead and run this. Okay, great. Now we want to get the URLs on this webpage. Our URLs are our articles, and they basically lie within this specific div tag here; we also want to get the href attribute, so we don’t want to get the text per se, we want to get the href attribute for our URLs. So let’s go ahead and run this, and let’s check our URLs. Okay, great. As you can see, there are 15 URLs, or 15 news articles, that we’re interested in.
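Here’s a sketch of what those two commands can look like; the search URL and the CSS selector for the result divs are assumptions, so inspect the page source and substitute the actual query and div class you find.

# Read in the search results page of recent Bitcoin articles
# (placeholder search URL - adjust to the real MarketWatch query)
marketwatch_bitcoin_articles <- read_html("https://www.marketwatch.com/search?q=bitcoin")

# Grab the href attribute of each article link inside the result divs
# ("div.searchresult" is an assumed selector - check the page source)
urls <- marketwatch_bitcoin_articles %>%
  html_nodes("div.searchresult a") %>%
  html_attr("href")

urls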
Now we want to get the published date-times of these news articles. Depending on the time of day, some of them are made invisible and some of them are not, so let’s go ahead and check this. Okay, it looks like they’re all visible now, so what we’re going to do is modify this code a little bit: we’re just going to get rid of “invisible” and rerun this. Okay, great. We have all 15 of our date-times. If some had been made invisible, what we would do is run two commands and join the invisible and visible date-times together.
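A hedged sketch of pulling the date-times from the same search results page; the selector here is an assumption from inspecting the page, and if some date-times are wrapped in an “invisible” class you would run a second command with that selector and join the two sets together.

# Published date-times of the search results
# ("div.deemphasized span" is an assumed selector - check the page source)
datetime <- marketwatch_bitcoin_articles %>%
  html_nodes("div.deemphasized span") %>%
  html_text()

datetime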
Now what we need to do is treat these date-times accordingly: we need to clean them up a bit, convert them into standard date-time formats, and take time differences. A good package for all of these kinds of tasks is lubridate; it’s designed for exactly this, so we’re going to install it and load it into R. Okay, great. As you can see here, lubridate finds it difficult to interpret a.m. and p.m. with periods in them, so what we need to do is remove the periods to make it easy for lubridate to understand a.m. or p.m. We’re just going to remove them by replacing them with an empty string. So let’s go ahead and run this command, and now it’s ready to pass into the parse_date_time function. This is going to create a standard date-time format, which just makes it easier to work with later on. So let’s go ahead and do this and have a look at it. Okay, great, now all our date-times are in standard date-time formats.
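Here’s a sketch of that clean-up and parsing step. The example timestamp format and the orders string passed to parse_date_time are assumptions about how the site prints its dates, so adjust them to match what you actually scraped.

# install.packages("lubridate")   # uncomment to install the library
library(lubridate)

# Assumed raw format: "Dec. 5, 2018 3:15 p.m. ET"
# Remove the periods so lubridate can recognise am/pm,
# and drop the trailing time-zone label
datetime_clean <- gsub("\\.", "", datetime)
datetime_clean <- gsub(" ET$", "", datetime_clean)

# Parse into a standard date-time; the orders string is an assumption
# about the month-day-year hour:minute am/pm layout
datetime_parsed <- parse_date_time(datetime_clean, orders = "bdY IMp")

datetime_parsed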
Before we go further, let’s have a look at our example article. You can see that all the articles are published in Eastern Standard Time, but what if we’re not in Eastern Standard Time? We’re going to take time differences between the date-time of the article and our current time, and it’s going to be difficult to take those differences if we’re working in different time zones. So what we need to do is take these date-times, first declare them as Eastern Time, and then convert them into our local time; in my case that’s US Pacific Time. So let’s go ahead and run these and have a look at the converted date-times. Okay, great, everything’s in Pacific Time.
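A sketch of that conversion using lubridate’s force_tz and with_tz; swap in your own local time zone if you’re not in US Pacific.

# Declare the parsed date-times as Eastern Time...
datetime_eastern <- force_tz(datetime_parsed, tzone = "America/New_York")

# ...then view the same instants in local (US Pacific) time
datetime_convert <- with_tz(datetime_eastern, tzone = "America/Los_Angeles")

datetime_convert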
Now we need to create a data frame. We have our date-times here and we have the webpage URLs that we grabbed before, and we’re just going to stick them into a data frame with one column called webpage and another column called date-time. So let’s go ahead and run this; we should have 15 rows with two columns. Let’s check that this is the case, and it is. Now we’re going to create another column in our data frame called diff hours: we’re basically going to take our current time, or our system time, compare it with the date-time that the article was published, and get the differences in hours. You could get them in minutes or another unit of measure, but we’re just interested in hours, so let’s go ahead and run this and have a look at our differences. Okay, great. Now, it’s not clear whether these are in their proper double data type, so let’s just make sure they’re treated as proper doubles here. Let’s go ahead and run this command and have a look at it. Okay, great, they’re treated as doubles now, which is what we need. Now that we’ve got these values, we’re going to stick them into the diff hours column and add it to our data frame, so let’s go ahead and run this command.
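Pulling those steps together, here’s a sketch of the data frame and the diff-hours column; the variable names simply follow on from the earlier sketches.

# Stick the URLs and converted date-times into a data frame
marketwatch_webpgs_datetimes <- data.frame(WebPg = urls,
                                           DateTime = datetime_convert,
                                           stringsAsFactors = FALSE)
dim(marketwatch_webpgs_datetimes)   # expect 15 rows, 2 columns

# Difference in hours between the current system time and each
# article's publish time, coerced to a plain double
diff_in_hours <- difftime(Sys.time(), marketwatch_webpgs_datetimes$DateTime,
                          units = "hours")
diff_in_hours <- as.double(diff_in_hours)

# Add it as a new column
marketwatch_webpgs_datetimes$DiffHours <- diff_in_hours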
Now that we’ve got our data frame of webpage URLs, date-times, and differences in hours, we’re going to use those differences in hours to subset it down to everything that was published, say, one hour ago or two hours ago, however long you would like. In my case I’m just interested in everything that happened within, say, the last seven hours, so what we’re going to do is look at everything with a date difference of less than seven hours and filter the rows down to the ones we’re interested in. So let’s go ahead and run this and have a look at it. Okay, we’ve got one article that was published within the last seven hours.
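The filtering itself is a one-liner; seven hours is just the window used in this walkthrough.

# Keep only the articles published within the last 7 hours
marketwatch_latest_data <- subset(marketwatch_webpgs_datetimes, DiffHours < 7)

marketwatch_latest_data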
What we’re going to do now is take this filtered list, or this new data frame with the filtered webpages, read each one in, grab the title of each webpage, grab all the paragraph tags and collapse them into a single body, and place them into their respective title and body vectors. So let’s go ahead and run this. Now we’re going to add all the titles into the title column of our data frame and do the same with the body, so let’s go ahead and run these. Okay, great.
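Here’s a sketch of that loop, reusing the rvest commands from the single-page demo at the start:

titles <- c()
bodies <- c()

for (i in marketwatch_latest_data$WebPg) {
  webpg <- read_html(i)

  # Title of the article
  title <- webpg %>% html_node("title") %>% html_text()
  titles <- append(titles, title)

  # All paragraph text, collapsed into a single body string
  body <- webpg %>% html_nodes("p") %>% html_text() %>% paste(collapse = " ")
  bodies <- append(bodies, body)
}

# Add the new columns to the data frame
marketwatch_latest_data$Title <- titles
marketwatch_latest_data$Body <- bodies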
Now if we have a look at the names, or the column names, of our data set, we should see a more complete data set. Okay, great: we’ve got the webpage URL, the date-time, the differences in hours, the title, and the body. What we’re going to do now is inspect the body text a little more, since this is what we’re going to analyze or summarize later; if there are any major issues here, we want to know about them now rather than later when we’re ready to analyze. So let’s have a look at the first case (well, we’ve only got one case anyway). Okay, as you can see, there are a few problems with the text: we’ve got these random newlines and carriage returns, and random whitespace happening here. That’s going to make it difficult to analyze, so what we’re going to do is clean out the major junk. We’re not going to go too far into cleaning in the sense of normalizing the text by lowercasing it, removing stop words, and so on; we’re just going to get rid of all the obvious junk. A good package for this is stringr, so let’s install it and load it into R, and we’re going to use this function here to get rid of all the major junk in the text, those newlines, carriage returns, random whitespace, and so on. We’re going to apply that to the body, so let’s go ahead and run this and have a look at our text to make sure it’s clean. Okay, great. We did a pretty good job of cleaning it up and got rid of all those major problems, which is going to make it easier for us to summarize later.
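The exact cleaning function isn’t named on screen here, but stringr’s str_squish, which collapses newlines, carriage returns and runs of whitespace into single spaces, does the job described, so here’s a sketch using it:

# install.packages("stringr")   # uncomment to install the library
library(stringr)

# Squeeze out newlines, carriage returns and repeated whitespace
marketwatch_latest_data$Body <- str_squish(marketwatch_latest_data$Body)

marketwatch_latest_data$Body[1]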
Okay, in this part of the tutorial we’re ready to actually summarize the text. We’re going to use the LSAfun library, which uses a simple ranking algorithm to summarize the text. There are more sophisticated ways to summarize text and extract information, and other types of analysis can be done on the body or the title text; the Data Science Dojo bootcamp covers text analytics and how to write programs to make sense of text if you want to take this further. But in the meantime we’re just going to use a simple summarizer. What we’re going to do is loop through each body text and grab the top three sentences with the most relevant information, the most relevant in terms of having the most keywords and the most information-rich sentences, and then stick them into their respective summary vector. So let’s go ahead and run this; first of all, we need to install and load the library. And let’s have a look at our summary here, or summaries if we had more than one case. Okay, great, now we have a much more condensed version of the text, so when I email this to myself later I don’t have to email the entire body text of the article; I just want a quick snapshot of the key events that happened in the article without having to read the entire thing. So what we’re going to do now is add it to our data frame under a column called summary. Let’s go ahead and do this.
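Here’s a sketch of the summarization loop using LSAfun’s genericSummary, which ranks the sentences and returns the top k:

# install.packages("LSAfun")   # uncomment to install the library
library(LSAfun)

summaries <- c()
for (i in marketwatch_latest_data$Body) {
  # Top 3 most information-rich sentences, pasted into one summary
  top_sentences <- genericSummary(i, k = 3)
  summaries <- append(summaries, paste(top_sentences, collapse = " "))
}

# Add the summaries as a new column
marketwatch_latest_data$Summary <- summaries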
Now we’re basically ready to email it to ourselves, so we’re going to use this email library here; we’ll install and load that into R. This is basically going to allow us to email text to ourselves. Now, I could simply take the data frame and email that to myself, but I’m going to go a bit beyond that. I’m only really interested in the title and the summary from this data frame, so what I’m going to do is create a vector that prints things in a certain order: I want it to print title 1 followed by summary 1, then title 2 followed by summary 2, and so on and so forth. So let’s run this code here so we have things printing in order, and let’s just check that it does print in order. Okay, great: the body of my email is going to have the title followed by the summary, and if there were more than one case it would be title 2 followed by summary 2, and so on. Okay, now we’re ready to set up the parameters. These are the parameters we’re going to input into this function here to send our email: we’ve got the from email, the to email, the subject line, and the body text, which is going to be our titles and summaries; and if you’re using Gmail you want to specify Google’s SMTP server. You could also send it to more than one email, so you can create a vector or a list of emails here, but for the purposes of this demonstration I’m just going to send it to myself. So let’s run these.
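The email package isn’t listed in the packages section below, so as one possible approach here’s a hedged sketch using the mailR package’s send.mail function; the addresses, password and subject line are placeholders, and with Gmail you’d typically authenticate with an app password.

# install.packages("mailR")   # uncomment to install the library
library(mailR)

# Interleave titles and summaries: title 1, summary 1, title 2, summary 2, ...
email_body <- paste(rbind(marketwatch_latest_data$Title,
                          marketwatch_latest_data$Summary),
                    collapse = "\n\n")

send.mail(from = "your.address@gmail.com",             # placeholder
          to = "your.address@gmail.com",               # placeholder
          subject = "Latest Bitcoin news summaries",   # placeholder
          body = email_body,
          smtp = list(host.name = "smtp.gmail.com", port = 465,
                      user.name = "your.address@gmail.com",
                      passwd = "your-app-password", ssl = TRUE),
          authenticate = TRUE,
          send = TRUE)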
Okay, great. When you see this here, you know it was successfully sent, and here’s one that I sent to myself not too long ago: you’ve got the title of the article followed by the summary, and basically this gives me a nifty little snapshot of everything that happened in the Bitcoin world in the last few hours or so. Thanks for watching. In the next part of this video tutorial I’ll show you how to run your script hourly, so you can automate this process even further.
If you found this video useful, give us a like. You can also check out other videos at Data Science Dojo Tutorials.

Packages used:
rvest – for downloading website data
lubridate – for cleaning, converting date-time data
stringr – for cleaning text in R
LSAfun – for ranking/summarizing the text

Recommended for intermediate-level R users. See our Introduction to R to get up to speed with basic R commands:

The full R script for this video tutorial can be accessed here.

To see an example of web scraping timely political news events and commentary from Reddit, check out Data Science Dojo’s blog tutorial on KDnuggets: https://www.kdnuggets.com/2018/12/automated-web-scraping-r.html

More Data Science Material:
[Video] Web Scraping in R Part 2 | Scheduling your Script using taskscheduleR
[Video] Web Scraping in Python using BeautifulSoup
[Blog] Web Scraping in 30-minutes


Rebecca Merrett
About The Author
- Rebecca holds a bachelor’s degree in information and media from the University of Technology Sydney and a postgraduate diploma in mathematics and statistics from the University of Southern Queensland. She has a background in technical writing for game development and has written for tech publications.

