R Tutorial: Automated Web Scraping in R
How to scrape the web automatically on a schedule so you can analyze timely, frequently updated data.
There are many blogs and tutorials that teach you how to scrape data from a bunch of web pages once, and then you’re done. But one-off web scraping falls short for applications that need sentiment analysis on recent content, capture of changing events and commentary, or trend analysis in real time. As fun as it is to scrape historical data once as an academic exercise, that approach does not help when you need timely or frequently updated data.
Scenario: You would like to tap into news sources to analyze the political events that are changing by the hour and people’s comments on these events. These events could be analyzed to summarize the key discussions and debates in the comments, rate the overall sentiment of the comments, find the key themes in the headlines, see how events and commentary change over time, and more. You need a collection of recent political events or news scraped every hour so that you can analyze these events.
What we’ll do:
We’ll go through the process of writing standard web scraping commands in R, filtering timely data, analyzing or summarizing key information in the text, and sending an email alert of the results of your analysis. We’ll set up our script to run every hour so that text is scraped and analyzed periodically to capture changing events and commentary or analyze trends in real time. Feel free to bring your laptop and follow along!
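To give a feel for the steps described above, here is a minimal sketch in R of what a single hourly run might look like. It is only an illustration under stated assumptions, not the script from the session: it assumes the rvest package is installed, and the HTML snippet, the CSS classes, the timestamps, and the fixed "current time" are all invented so the example runs without a network connection.

```r
# Minimal sketch of one scheduled run: scrape, keep recent items, summarize.
# Assumption: rvest (>= 1.0) is installed. The HTML, class names, and
# timestamps below are made up for illustration.
library(rvest)

# Stand-in for a live page; a real script would call read_html() on a URL.
page <- read_html('
  <html><body>
    <div class="story"><h2>Budget vote delayed</h2>
      <time datetime="2017-03-01 09:40:00"></time></div>
    <div class="story"><h2>Budget debate heats up</h2>
      <time datetime="2017-03-01 09:10:00"></time></div>
    <div class="story"><h2>Minister resigns</h2>
      <time datetime="2017-03-01 06:00:00"></time></div>
  </body></html>')

# Standard scraping step: pull one headline and one timestamp per story.
stories <- html_elements(page, "div.story")
headlines <- data.frame(
  title  = html_text2(html_element(stories, "h2")),
  posted = as.POSIXct(html_attr(html_element(stories, "time"), "datetime"),
                      tz = "UTC"),
  stringsAsFactors = FALSE
)

# Filtering step: keep only items posted within the last 60 minutes.
# `now` is fixed so the example is reproducible; a scheduled script
# would use Sys.time() instead.
now <- as.POSIXct("2017-03-01 10:00:00", tz = "UTC")
age_mins <- as.numeric(difftime(now, headlines$posted, units = "mins"))
recent <- headlines[age_mins <= 60, ]

# Crude summary step: most frequent longer words in the recent headlines.
words <- tolower(unlist(strsplit(recent$title, "[^A-Za-z]+")))
top_words <- sort(table(words[nchar(words) > 3]), decreasing = TRUE)

print(recent$title)
print(head(top_words, 3))
```

On Linux or macOS, one common way to repeat a script like this every hour is a cron entry such as `0 * * * * Rscript scrape.R`, where `scrape.R` is a hypothetical file name. The email alert step is left out of this sketch.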
Let’s go fetch your data!
About the Presenter:
Rebecca holds a bachelor’s degree of information and media from the University of Technology Sydney and a post-graduate diploma in mathematics and statistics from the University of Southern Queensland. She has a background in technical writing for games dev and has written for tech publications.
Rebecca recently moved here from Sydney, Australia, and joined Data Science Dojo after attending a bootcamp in 2015!