Python Tutorial: Web Scraping using Beautiful Soup

Web scraping is a powerful tool for any data professional to learn. With web scraping, the entire internet becomes your database. In this Python tutorial, we introduce the fundamentals of web scraping using the Python library Beautiful Soup: we show you how to parse a web page into a data file (CSV).
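
Before diving in, here is a minimal sketch of the core idea, using a made-up HTML snippet in place of a downloaded page:

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a real downloaded page
html = '<div class="item"><a href="/gpu">GTX 1080</a></div>'
page = BeautifulSoup(html, "html.parser")

# Find the first div with class "item", then read the text of its link
item = page.find("div", {"class": "item"})
print(item.a.text)  # prints: GTX 1080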

Many services augment their business data, or even build their entire business, with web scraping. For example, one website tracks and ranks Steam game sales, updated hourly. Companies can also scrape product reviews from places like Amazon to stay up to date with what customers are saying about their products. Web scraping is only a small part of the larger world of data science; if you are interested in learning more, check out our data science bootcamp.

Repository:
Python code, scripts, and supplemental items

Sublime:
https://www.sublimetext.com/3

Anaconda:
https://www.anaconda.com/distribution/#download-section

JavaScript beautifier:
https://beautifier.io/

If you are not seeing the command line, follow this tutorial:
https://www.tenforums.com/tutorials/72024-open-command-window-here-add-windows-10-a.html

More Data Science Material:
[Video] Learn how to web scrape in R
[Video] Setup Python and R for Data Science
[Video] Time Series in Python Part 1

The Code

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

# URL to web scrape from.
# In this example we web scrape graphics cards from Newegg.com.
page_url = "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

# opens the connection and downloads html page from url
uClient = uReq(page_url)

# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# finds each product from the store page
containers = page_soup.findAll("div", {"class": "item-container"})
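
# Sanity check: if the request was blocked or the page layout changed,
# containers will be empty and the loop below will write nothing.
print(str(len(containers)) + " product containers found")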

# name the output file to write to local disk
out_filename = "graphics_cards.csv"
# header of csv file to be written
headers = "brand,product_name,shipping\n"

# opens file, and writes headers
f = open(out_filename, "w")
f.write(headers)

# loops over each product and grabs attributes about
# each product
for container in containers:
    # Finds all link tags "a" from within the first div.
    make_rating_sp = container.div.select("a")

    # Grabs the title from the image title attribute
    # Then does proper casing using .title()
    brand = make_rating_sp[0].img["title"].title()

    # Grabs the text of the third "a" tag from within
    # the list of links found above.
    product_name = make_rating_sp[2].text

    # Grabs the product shipping information by searching
    # all list items with the class "price-ship".
    # Then trims whitespace with strip() and removes
    # "$" and " Shipping" if present, leaving just the number.
    shipping = container.findAll("li", {"class": "price-ship"})[0].text.strip().replace("$", "").replace(" Shipping", "")

    # prints the dataset to console
    print("brand: " + brand + "\n")
    print("product_name: " + product_name + "\n")
    print("shipping: " + shipping + "\n")

    # writes the dataset to file
    f.write(brand + ", " + product_name.replace(",", "|") + ", " + shipping + "\n")

f.close()  # Close the file
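
A note on running this today: retail sites change their markup often, and Newegg may reject requests that do not look like they come from a browser. If uReq(page_url) returns an error page or the containers list comes back empty, one workaround is to send a browser-like User-Agent header yourself. This is a sketch using only the standard library and reusing page_url from the script above; the User-Agent string is just an example, and there is no guarantee the site will accept it:

from urllib.request import Request, urlopen

# Send a browser-like User-Agent; many sites reject Python's default one.
req = Request(page_url, headers={"User-Agent": "Mozilla/5.0"})
uClient = urlopen(req)
page_html = uClient.read()
uClient.close()

Also, the script above escapes commas in product names by hand. Python's built-in csv module quotes fields for you, so a more robust version of the file-writing portion (again reusing the containers list from above) might look like this:

import csv

# csv.writer quotes fields that contain commas, so product names
# no longer need the manual comma-to-pipe replacement.
with open("graphics_cards.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(["brand", "product_name", "shipping"])
    for container in containers:
        links = container.div.select("a")
        brand = links[0].img["title"].title()
        product_name = links[2].text
        shipping = container.findAll("li", {"class": "price-ship"})[0].text.strip().replace("$", "").replace(" Shipping", "")
        writer.writerow([brand, product_name, shipping])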


Phuc H Duong
About The Author
Phuc holds a Bachelor's degree in Business with a focus on Information Systems and Accounting from the University of Washington.

1 Comment

  • Houston Muzamhindo

    It’s giving errors. I have copied everything but it’s not working. I dug into the code and could make it return some values but the shipping part is not working. This is my updated code:

    # -*- coding: utf-8 -*-
    """
    Created on Thu Sep 5 10:25:58 2019

    @author: houston.muzamhindo
    """

    from bs4 import BeautifulSoup as soup  # HTML data structure
    from urllib.request import urlopen as uReq  # Web client

    # URL to web scrape from
    page_url = "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

    # open the connection and download the html page from url
    uClient = uReq(page_url)

    # parse the HTML into a soup data structure to traverse html as if it were a json data type
    page_soup = soup(uClient.read(), "html.parser")
    uClient.close()

    # find each product from the store page
    containers = page_soup.findAll("div", {"class": "item-container"})

    # name the output file to write to local disk
    out_filename = "graphics_cards.csv"

    # header of the csv file to be written
    headers = "brand,product_name,shipping\n"

    # open file and write headers
    f = open(out_filename, "w")
    f.write(headers)

    # loop through each product and grab attributes about each product
    for i in range(len(containers)):

        i = 1

        # find all link tags "a" from within the first div
        make_rating_sp = containers[i].select("img")

        # grab the title from the image title attribute
        # then do proper casing using .title()
        brand = make_rating_sp[0]["title"].replace("\n", "").replace("\r", "").replace(" ", "")

        # grab the text within the second "a" tag from within the list of queries
        product_name = containers[i].div.select("a")[1].text

        # grab the product shipping information by searching
        # all lists with the class "price-ship"
        # then clean the text of white space with strip()
        # clean the strip of "Shipping $" if it exists to just get a number
        shipping = containers[i].findAll("li", {"class": "price-ship"}).text.strip().replace("$", "").replace(" Shipping", "")

        # print the dataset to console
        print("brand:" + brand + "\n")
        print("product_name:" + product_name + "\n")
        print("shipping:" + shipping + "\n")

        # write the dataset to file
        f.write(brand + ", " + product_name.replace(",", "|") + ", " + shipping + "\n")

    # close the file
    f.close()
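
A likely cause of the shipping error in the code above: findAll() returns a list of matching tags, so .text has to be read from an element of that list rather than from the list itself, e.g.:

shipping = containers[i].findAll("li", {"class": "price-ship"})[0].text.strip().replace("$", "").replace(" Shipping", "")

Note also that setting i = 1 inside the loop overwrites the loop variable, so every iteration processes the same product.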

