In part two of our introduction to web scraping in R, we will use taskscheduleR to set up our automated web scraping script to run as a background task on our computer. This means we do not have to manually run the script in R. Our script will be scheduled to run hourly, as we are grabbing text on Bitcoin events from the last hour or so.
Hi, welcome to this Data Science Dojo tutorial on part two
of automating the task of web scraping for analyzing text.
In part one of this video tutorial, you followed along while
we wrote an R script to scrape hourly Bitcoin news,
summarize the text, and send the summary in an email alert to ourselves.
In part two, we'll set up our web scraping script to run as a background task on our computer,
so you don't need to manually run the script in R yourself.
The script will be scheduled to run hourly, as we're grabbing text on Bitcoin events from the last hour or so.
You can access the full script via the link below the video.
You can set this up as a task in the Task Scheduler in Windows, or you can do it through RStudio itself.
I'll show you how to use the taskscheduleR package to easily schedule
your web scraping script in RStudio.
Otherwise, you can check out our KDnuggets tutorial (link below) on
how to set this up using the Task Scheduler interface.
Now, if you're using macOS, Automator would be the equivalent tool for this, and for Linux
it would be GNOME Schedule. To do this in RStudio,
the equivalent package to taskscheduleR for Mac
and Linux users is called cronR. The installation of cronR is fairly simple, and a link is
provided below on how to do this.
The functions in cronR are similar to those in taskscheduleR, too.
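For Mac and Linux users, a minimal cronR sketch of the same hourly setup might look like this (the script path and job id are placeholders, not taken from the video):

```r
# Hourly scheduling on macOS/Linux with cronR
# (the path and id below are placeholders; substitute your own)
library(cronR)

# cron_rscript() builds the shell command that runs the script via Rscript
cmd <- cron_rscript("/home/me/web_auto_scripts/webscrape_bitcoin.R")

# Register the job with the system's crontab to run every hour
cron_add(command = cmd, frequency = "hourly",
         id = "r_web_scraping",
         description = "Hourly Bitcoin web scraping")

# Later, cron_rm(id = "r_web_scraping") removes the job again
```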
Let's go ahead and install and load taskscheduleR into R. Just uncomment and run this line to install it,
and don't forget to also run this line to load it into R.
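For reference, the two lines in question are simply the standard install-and-load pair:

```r
# Run once to install taskscheduleR from CRAN, then keep it commented out
# install.packages("taskscheduleR")

# Load the package into the current R session
library(taskscheduleR)
```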
Now you can use the add-in interface. If you prefer the add-in interface,
just install these packages here.
Then, once installed, go to the Add-ins drop-down menu at the top,
select "Schedule R scripts on Windows",
and you can upload your script there and
schedule it hourly, daily, weekly, or whatever time frame you're interested in.
Now, you might want to have
your output data and logs go into another directory on your computer; otherwise,
by default they go into the taskscheduleR extension data folder inside your R folder.
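The helper packages the add-in interface depends on are roughly these (a sketch; this list is an assumption based on the taskscheduleR documentation, so check the package README for the current set):

```r
# Packages used by the taskscheduleR RStudio add-in
# (assumed from the package's suggested dependencies)
install.packages(c("miniUI", "shiny", "shinyFiles"))
```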
These interfaces are very similar in cronR, too.
So let's just use some taskscheduleR functions to schedule your web scraping script
to have it run every hour.
We'll use the taskscheduler_create() function,
and now we're going to give it all our inputs, so we'll start
by giving our task a name.
I think I'll just call it "R web scraping".
Okay, cool. And we'll give it the full path to where the R script sits; in my case,
it sits in my web auto scripts folder.
And if you're on Windows, don't forget to use double backslashes in the path.
Okay, cool. And we want to schedule our script to run every hour,
and we'll input the start time. Now, you can specify a start time, but I'm just happy
to go with the default,
which is my current system time, and have it kick off within 62 seconds.
I'll just follow the hour/minute format.
You can also specify a date,
but once again, I'm happy just to go with the default, which is my current date.
You just need to make sure that this matches your computer system's date format.
So in my case, it's month followed by day.
We're also going to give it the R executable file to run our R script,
and this usually sits within the bin folder of your R installation.
Okay, cool, let's run this.
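Putting the inputs above together, the call looks roughly like this (the script path is a placeholder for your own; the starttime and startdate values shown mirror the package's documented defaults):

```r
library(taskscheduleR)

# Full path to the web scraping script; on Windows, use double backslashes
# (this path is a placeholder, substitute your own)
myscript <- "C:\\Users\\me\\web_auto_scripts\\webscrape_bitcoin.R"

taskscheduler_create(
  taskname  = "R_web_scraping",                   # a name to identify the task
  rscript   = myscript,                           # script to run
  schedule  = "HOURLY",                           # run every hour
  starttime = format(Sys.time() + 62, "%H:%M"),   # default: now plus 62 seconds
  startdate = format(Sys.Date(), "%m/%d/%Y"),     # must match your system's date format
  Rexe      = file.path(Sys.getenv("R_HOME"),     # the R executable, in R's bin folder
                        "bin", "Rscript.exe")
)
```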
Okay, the output states this was successful, so we have now set up our
web scraping script to run every hour.
It's also just a good idea, and I'll show you what I mean, to check out your logs here.
The reason you want to check your logs is to see whether any errors occurred
while the script was running, causing it to halt or anything like that.
So, any data saved as output, along with the logs, is stored in the
same directory path where your script lies,
which we gave to the taskscheduler_create() function up here.
Basically, I've put little print statements in my web scraping script to help with debugging and such,
and my actual output is the email alert.
So I'll either receive an updated email summarizing Bitcoin events within the last hour, or not.
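To check on things from R, you can list the registered tasks and read the run log; a sketch (the log file name here is an assumption — taskscheduleR writes the log next to the scheduled script):

```r
library(taskscheduleR)

# List all tasks registered in the Windows Task Scheduler
tasks <- taskscheduler_ls()
head(tasks)

# The run log lands in the same folder as the scheduled script;
# the exact file name below is an assumption based on the script name
log_file <- "C:\\Users\\me\\web_auto_scripts\\webscrape_bitcoin.log"
if (file.exists(log_file)) {
  cat(readLines(log_file), sep = "\n")
}
```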
Just a couple of other things that I wanted to show you.
You might decide to stop running your script later, or delete it altogether.
In that case, you simply want to use the taskscheduler_stop() function
and just feed it the task name.
So let's go ahead and do that.
I'll just copy and paste the task name I used up here.
Okay, great, we've successfully stopped our task,
and to delete it, we'll just use the taskscheduler_delete() function.
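In code, stopping and then deleting the task reduces to two calls, each taking the name the task was created under:

```r
library(taskscheduleR)

# Stop the scheduled task (it stays registered but no longer runs);
# use whatever taskname you passed to taskscheduler_create()
taskscheduler_stop(taskname = "R_web_scraping")

# Remove the task from the Windows Task Scheduler entirely
taskscheduler_delete(taskname = "R_web_scraping")
```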
And that's it. After successfully creating a task,
you can close out of RStudio and it will run in the background of your computer.
Something to take note of: you do need to keep your computer on in order to have
the script run in the background,
so you can't just let it go to sleep.
Power consumption is something you might want to think about.
You can change your power and sleep settings in Windows, and if you're using a Mac,
it'll be your Energy Saver settings.
Thanks for watching. If you found the video tutorial useful, give us a like. Otherwise,
you can check out our other video tutorials at tutorials.datasciencedojo.com
cronR – R script scheduler
taskscheduleR – Schedule R scripts with the Windows Task Scheduler
cronR installation
Recommended for medium level R users. See our Introduction to R to get up-to-speed with basic R commands:
The R full script for this video tutorial can be accessed here
More Data Science Material:
[Video Series] Automated Web Scraping in R
[Video] Web Scraping in R Part 1: Writing your Script in rvest
[Blog] Automated Web Scraping Reddit