Web Scraping using Python and Beautiful Soup

Web scraping is a powerful tool for any data professional to learn. With web scraping, the entire internet becomes your database. In this Python tutorial, we introduce the fundamentals of web scraping and show you how to parse a web page into a data file (CSV) using a Python package called Beautiful Soup.

There are many services out there that augment their business data or even build out their entire business by using web scraping. Companies can also scrape product reviews from places like Amazon to stay up-to-date with what customers are saying about their products. Web scraping is only a small part of the big world of data science; if you are interested in learning more, check out our data science bootcamp.

Hello, ladies and gentlemen of the Internet. My name is Phuc Duong, Senior Data Engineer
for Data Science Dojo. And I’m here to teach you how to web scrape with Python.
So in front of you, you see, is actually a website that employs web scraping. So this
website actually scrapes the storefront of a website called Steam. So Steam sells video games.
And the cool thing about Steam is that they do flash sales every day. So the user has
to come back every day and study this page. What is a good deal? What is not a good deal?
And it’s a lot of information. This is how they’ve gamified shopping online.
Now there’s a website that actually scrapes Steam’s front page in real time and shows
you the best deals, and ranks them. OK. So a lot of people ask me, how do I get all of
my data? And actually, in the absence of APIs, web scraping is actually
a very important tool for a data scientist or a data engineer to know, because the entire
internet becomes your database.
So– I can scrape any storefront– Nordstrom, Macy’s. Study the sales. Web scrape reviews.
I can web scrape baseball stats, baseball players in real time. Wikipedia is also a
good place to web scrape. For example, you can see that this frame over here of this
Harry Potter character, Ron Weasley, it’s very standardized. I could write a web scrape
script and then loop over every single Harry Potter character, very quickly, and create
a data set.
All right. Today we’re going to learn how to do that. So today I’m on Windows, so you
can normally install Python if you’re on Linux, but if you’re on Windows, I highly
recommend installing Anaconda instead. So if you go to Google, and just type in Anaconda,
it should be a Continuum site, and you should just download based upon your operating system.
OK.
Next thing I’ll be using, I’ll be using a text editor called Sublime Text. So you can
just go ahead and go to Google and type in Sublime Text and then install that. I like
using Sublime Text 3. OK. All right. That’s where you get those things.
All right. So once you’ve installed this, this is actually– if you’re using Anaconda,
it’s actually a pretty big file. It’s like 500 megabytes, OK? So be warned of that.
All right. So what I’m going to do is I’m going to go ahead and open up my command line.
And for those of you who don’t know, if you go to a folder, any folder, and then just
hold down the Shift button and right click and say Open Command Window Here, this opens
up the command line for you. And this is where you can work with Python. So if you type in
python right here, right, and if you’ve installed either Python or Anaconda, well, this will show
up, right? So notice that I’m using Python 3.5 with Anaconda. And if I just do a very
quick two plus two, it should equal 4. That’s how I know I’m inside of my console.
All right, next thing is yes, now that I know that if I push down control and hit C, Control
plus C, basically if I do a copy on Windows, it will exit this console. OK and I get back
to, basically, the Windows command line.
So what I’m going to do now is, I’m going to go ahead and install a package called Beautiful
Soup. That’s the package that we’re going to use to web scrape, actually. It’s a very
powerful package. I encourage those of you who want to go further beyond this introduction
to go ahead and learn this package. So all you’ve got to do is do a pip install bs4.
OK, bs4 stands for Beautiful Soup 4.
So here we are. So Beautiful Soup has been installed. And how do I know if it’s been
installed? Well, if I type in python, and I type in import bs4, right? It should just
not err. OK. Awesome. So that’s how I know that the package is online and ready to go.
Next thing I want is, I need a web client. So Beautiful Soup is a good way to parse HTML
text. That’s all it is. It’s a good way to traverse HTML text within Python. Now I actually need
a web client to grab something from the Internet. And how you do that in Python, is, actually,
you would use a package called urllib. And inside of urllib, there is a module called
request, and inside of that module is a function called urlopen. OK? I know it’s a lot to
take in. But settle down, we’re going to go step by step.
I’m going to do a really quick import all-in-one line kind of step. All right. So I can do
from urllib dot request. So I’m calling a package called urllib. If you’re on Python
2, this is a different package. It’s called urllib2. So I’m calling a module within
that. So notice, I’m importing only what I need. I don’t need all of urllib. I just
need the request module. And I’m going to import out of that.
OK, urlopen, the one, basically, function that I need. And it’s going to import all
the basic dependencies, as well. And I’m going to give it a name because I don’t want to
type in urlopen every time. I want to say U request, uReq for short. That’s how I tend
to do things. And also, I can also modularize the import of Beautiful Soup, as well. So
I can do from bs4 import. And this is important, capital B Beautiful, and then capital S for
Soup. And then I’m going to just call it as soup. So I don’t have to call out Beautiful
Soup again every time I want to use this package.
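
Written out, the two imports described above would look roughly like this. The aliases uReq and soup are the speaker's own shorthand, not anything the libraries require.

```python
# Web client: urlopen lives in the request module of the urllib package (Python 3).
from urllib.request import urlopen as uReq

# HTML parser: note the capital B and capital S in BeautifulSoup.
from bs4 import BeautifulSoup as soup
```

If the second line raises ModuleNotFoundError, the bs4 package is not installed in the Python you are running.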
And this is me working in the console. This is me playing around. So if you want to, you
can actually start typing it into a script. So in this case, I have Sublime open. And
I’m going to do a Control Shift P to open up the command console. And then I’m going
to say set syntax is equal to Python. OK. Beautiful.
So now I can do the same commands in here. So if I just select this into the command
line, hit the Enter button, that will copy it. So that way I can paste it into my script
here. OK? So there you have it, the first two lines of this.
So now I’m ready to go. So Beautiful Soup is going to parse the HTML text, and then
urllib is actually going to grab the page itself.
But what do we want to web scrape? Well I like graphics cards. I’m going to web scrape
graphics cards off newegg.com. So some of you might know it. It’s basically Amazon but
for, basically, hardware electronics.
So I’m going to type in, for example, graphics cards. So these are a bunch of graphics cards
that have shown up in my search bar. And it would be nice to basically tabularize and
turn it into a data set. And notice that, if a new graphics card
is introduced tomorrow, or if ratings change tomorrow, or prices change tomorrow, I run
the script again and it updates it into, basically, whatever it is that I loaded into. I can load
into a database, a CSV file, an Excel file, it doesn’t matter.
So in this case, I’m going to grab this URL. OK. That’s all I’m going to do. So basically
I’m going to copy this URL, and I’ll paste it into my script. So, in this case, I can do
my URL is equal to– so that is the URL I want to use for this.
And in this case, I will actually run it in my console. So when I’m web scraping, I like
to also prototype it into the command line, as well, so I know that the script is going
to work. And then once I know that it works, I will go ahead and paste that back into my
Sublime. OK so this is my URL. So I’ve gone ahead and called a variable and placed a string
of the URL into it. Now this is going to be good. So now I will actually open up my web
client. So in this case, I would do U request, right?
So notice I’m calling urllib, and I’m calling it from the shorthand variable that
I called it earlier. So notice I called from urllib dot request import urlopen as U request.
So I’m actually calling the function called urlopen right now, inside of a module called
request, inside of a package called urllib.
So the next thing is, I’m going to throw my URL into this thing. So what this is going
to do, it’s going to open up, basically, a connection, it’s going to open up this connection,
grab the web page, and basically just download it. So it’s a client. So I’m going to call
it a U client is equal to U request of my URL. It’s going to take a while depending
on your Internet connection because it’s actually downloading the web page. I noticed that.
OK it’s done. So the minute I want it, I can do a read, a U client dot read. If I do read,
it’s going to dump everything out of this right away. I can’t reuse it. So before it
gets dumped, I want to store it into something, a variable. So I’m going to call, I guess,
page underscore– since this is the raw HTML, I’m just going to call it HTML– page HTML
is equal to U client dot read.
I can go ahead and show you this thing, but it might– depending on how big the HTML file
is– I can actually crash the console. So I’m going to show it to you once it’s inside
of Beautiful Soup. Bear with me here. And any web client, since this is an open Internet
connection, I want to actually close it when I’m done with it. So U client dot close is
what I’m going to do.
And knowing that all of these lines of code have worked so far, I can just go ahead and
copy them into my script. So my URL is that. And U client is– and just add some documentation,
opening up connection, grabbing the page. OK. And then what this does is, it offloads
the content into a variable. And then what this is going to do, it’s going to close the
client.
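
The open, read, close sequence above can be sketched as one small helper. The function name fetch_html is made up for this illustration, and the try/finally is an addition so the connection is closed even if the read fails; the transcript simply calls the three steps in order.

```python
from urllib.request import urlopen as uReq


def fetch_html(url):
    """Open a connection, read the raw page once, and close the connection."""
    uClient = uReq(url)  # opens the connection and grabs the page
    try:
        # read() drains the response, so the result has to be stored right away
        page_html = uClient.read()
    finally:
        uClient.close()  # close the client when done with it
    return page_html


# Usage (needs network access; my_url is the search URL you copied):
# page_html = fetch_html(my_url)
```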
Then the next thing I need to do is I need to parse the HTML, because right now the HTML
is a big jumble of text. So what I need to do right now is I need to call the soup function
that I made earlier. So notice I called from bs4 import BeautifulSoup as soup. So if
I call soup as a function, it’s going to call the BeautifulSoup function within the
bs4 package.
So in this case, I will do soup of, basically, my page HTML. And then if I do a comma here,
I will have to tell it how to parse it, because it could be an XML file, or, in this case,
I will tell it to parse it as an HTML parse file. And I need to store it into a variable
or else it’s going to get lost. So in this case, I’ll call it a page soup.
I know it’s kind of weird that they call it a soup, but it’s standard notation. Now, when
you say soup, people understand that this is the data type of it. It’s derived from
the Beautiful Soup package.
All right. So in this case this does my HTML parsing.
OK. So now, if I go to the page soup, and I just try to look at the H1 tag, page soup
dot H1, I should see the header of the page. So this does say video cards and video devices.
So I should see that somewhere. So notice that they grab this header right here.
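
A minimal sketch of the parsing step, using a small hand-written HTML string in place of the real download so it runs offline. The markup is an assumption made for illustration, not Newegg's actual page.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical stand-in for the bytes returned by uClient.read().
page_html = """
<html><body>
  <h1>Video Cards and Video Devices</h1>
  <p>A great place to buy computers</p>
</body></html>
"""

# The second argument tells Beautiful Soup which parser to use.
page_soup = soup(page_html, "html.parser")

print(page_soup.h1)  # the first <h1> tag in the document
print(page_soup.p)   # the first <p> tag in the document
```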
And just, for good measure, let’s just see what else is in there. So page soup
dot, maybe there’s a P tag in there I can look at. So newegg.com, a great place to buy
computers. So I think that might be at the very bottom. Great place to– actually, no,
it might be something that’s hidden. It might be just in a tagline.
All right. But I am on this page. So now what we need to do is traverse the HTML. So basically
what I’m going to do is, I’m going to convert every graphics card that I see into a line
item, into a CSV file. To do that, to traverse– now that I have a Beautiful Soup data type,
I can actually traverse, basically, the DOM elements of this HTML page.
So let me show you how to do that real quickly. So if I inspect the element of this page,
so if I go find the body tag, for example. I think the body type– it starts off as a
body. So if I do a body, page soup dot body, and then I can keep going. I can keep going
dot within the– so notice that this body tag can go even further into an A tag or span
tag. So if I type in the span tag, I should find this span tag. Or body dot span. See that?
Span class no CSS skip to. See that? No CSS skip to. That’s awesome.
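
The dotted traversal shown above can be tried on a tiny document. Each dotted name returns the first matching tag anywhere beneath the current one. The markup and the class name skip-to are assumptions for this sketch.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical markup, only loosely modelled on the page in the video.
html = (
    "<html><body><div>"
    '<a href="#">Home</a>'
    '<span class="skip-to">Skip to</span>'
    "</div></body></html>"
)
page_soup = soup(html, "html.parser")

body = page_soup.body               # the <body> tag
first_link = page_soup.body.a       # first <a> anywhere under <body>
first_span = page_soup.body.span    # first <span> anywhere under <body>

print(first_span)
```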
So the next thing I’m going to do, let me just make this HTML a little bit bigger so
you guys can see it even further. All right. So what I want is if I’m in Chrome, you can
also use the Firefox Firebug to inspect the HTML elements of a page. So I’m going to just
select this, the name of this graphics card right here, and try to inspect that element.
It jumps me directly to this A tag. It jumps me directly into this A tag. And I want to
grab the entire container that the graphics card is in, because I know that graphics card
container contains other goodies, such as the original price, its sale price, its make,
its review type, and the card image itself.
So I go out. So since HTML is an embedded kind of tagging language, I can go out until
I find what it is that is containing all of this. So notice that this div right here with
the class of item dash container, contains and houses all of the items inside of this
thing. So basically I would need to set a loop. I would write my script first on how
to parse one graphics card, and then once I’m done with that, I can loop through all
of the class containers, and go ahead and parse out every single graphics card into
my data file.
So in this class, I need this class. I want to grab everything that has this class. So
I want to go ahead and do that right now.
So I want to go to– my page soup — There is a function called find all. And it’s capital
A with find all. And I want to find, what do I want to find? I want to find all divs
that have the class item dash container. So I would go back, and I would say, find me
all divs comma, and then I would feed it an object. And the object says what is the name
of the attribute that you’re looking for? So it’s a class. If it was an ID, I would put ID here.
And then I would go ahead and paste in the value, item dash container is what it’s called.
So in this case, I will feed this into a variable called, I guess, containers. We’ll call it
by what the class is. I’m going to copy this, as well, and paste it into my script. Hopefully
it works. So this grabs each product. So notice that even though I’m
writing this for graphics cards, I’m betting that Newegg has actually standardized its
HTML enough so that I can actually parse any page, any product, on Newegg, if I just run
the script over.
So if I call this containers, so let’s check the length of the containers to see how many
things did it find. So it found 12 objects. So it found one, two, three, four, five,
it found 12 graphics cards, basically, is what that did. And look, there’s six of them.
Yes, that is true.
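
Here is a small offline sketch of the find all call. The class name item-container comes from the video; the surrounding markup is invented for illustration. findAll is the older camel-case spelling used in the video, and find_all is the spelling the current Beautiful Soup documentation uses.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical page with two products and one unrelated div.
html = """
<div class="item-container"><a class="item-title">Card A</a></div>
<div class="item-container"><a class="item-title">Card B</a></div>
<div class="sidebar">not a product</div>
"""
page_soup = soup(html, "html.parser")

# Tag name first, then a dictionary of attribute name -> value to match.
containers = page_soup.find_all("div", {"class": "item-container"})

# Equivalent keyword form; class_ has a trailing underscore because
# class is a reserved word in Python.
same_containers = page_soup.find_all("div", class_="item-container")

print(len(containers))
```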
OK so let’s look at the first one. So if I go to containers of the zero index, I should
see HTML for this thing. So I am actually just going to copy this out into my text file,
and I’m going to read it in there, because sometimes when you load a page, there is
some post-loading done via JavaScript. And some things will show up, some things
won’t show up.
So just to be sure, I’m just going to paste it into my Sublime. And from my Sublime, I
can go ahead and figure out what is actually in there. So I’m going to go Control new and
Sublime, paste it in. But notice, it’s not very pretty. So we’ll deal with that in a
minute. I’m going to set my syntax to become HTML. OK it’s in HTML now. But that’s not
pretty. I want to use an external service called JS Beautifier. So it’s going to do
all the spacing when there needs to be spacing. So JS Beautifier, you basically just copy
in ugly code, and it turns it pretty. See that? Everything is all now nicely spaced
and delimited.
Here we are. Now let’s read what’s actually in this thing. So if I open this up now, I
know it’s going to be a little bit hard to read. What kind of things do we want out of
this thing? If we go through, we can see that there’s some pretty useful things. We can
see that the items have ratings.
It has a product name. We want to grab the product name for sure. Let’s see, there is
its brand. I can grab its brand. So notice that they call the image the name of the brand,
which is useful. So if I grab the title of this image– Notice that the image itself,
it says it says EVGA, but that’s an image, I can grab the image. I can grab the image,
I just can’t parse what it says unless I use image recognition. But notice that the title
encodes what type of brand it is for us. So that’s very convenient. So this is something
that we want to grab.
And also I want to be sure I grab things that are true of everything. So if
not, I’m going to run into some corner case if-else statements. So notice that this
guy right here is special. He doesn’t have any egg reviews. So if I wrote something to
parse reviews, I’m going to need to write an if-else statement, or I’ll
have to do a try and except that catches an index out of range error. OK. And then notice that it
doesn’t even have what this number is. I think it’s the number of reviews here.
So I’ll let you guys go ahead and handle the scraping of that, but I’m going to scrape
things that are present in all of them. Notice that I’m going to scrape the names. All of
them seem to have the names of the brand or the names of the product. And then I’m going
to go ahead and scrape the product itself. And not all of them have a price. You see
that? I have to add it to the cart to see the price.
And let’s see what else is good. And they all seem to have shipping. So I’m going to
grab shipping to see how much they all cost. So once you learn how to scrape one, it’s
the same really for all of it. Now if you want to loop through all of it, you have to
do those if else statements to catch all the loose cases that aren’t there.
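
One way to handle the loose cases mentioned above is to catch the IndexError that an empty find all result raises when you ask for element zero. The class name item-rating and the fallback text are assumptions for this sketch; check the real page for the actual names.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical markup: the second product has no reviews at all.
html = """
<div class="item-container"><a class="item-rating" title="Rating + 4">(12)</a></div>
<div class="item-container"></div>
"""
containers = soup(html, "html.parser").find_all("div", {"class": "item-container"})

ratings = []
for container in containers:
    rating_tags = container.find_all("a", {"class": "item-rating"})
    try:
        rating = rating_tags[0]["title"]
    except IndexError:
        # find_all returned an empty list, so this product has no rating tag
        rating = "no reviews"
    ratings.append(rating)

print(ratings)
```

An if-else on len(rating_tags) would work just as well; the try and except version simply mirrors the option named in the video.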
So notice that if I do a container right now, containers of zero– I’m
going to throw containers of zero into just a variable called container. Later I’m going to do a
for loop that says for every container in containers. Right so right now I’m prototyping
the loop before I want to build the loop. So I want to make sure it works once before
I even build the loop.
So this container contains a single graphics card in it. I will call it container instead
of contain. So container dot, dot what? Let’s see what is in here. Notice that container
dot A will bring me this thing back. So if I do container dot A, this brings me back
exactly what I thought it would. It would bring me the item image. So the item image,
not that useful to us.
Let’s see if there’s anything that we can redeem in here. The title, we might be able
to redeem the title, but it seems that we can also grab that down here which I think
this might be the more efficient way to grab it. So let’s get it from there instead, because
that’s what the customer sees. That’s what you will see when you go and visit the space.
So we will go instead of doing dot A, we will do dot div. We’ll go jump from this A, directly
into this div.
So I’ll go ahead and push up, and say container dot div. So that will jump me into this div
right here, and everything inside of it. OK. Boom.
OK. So if I go into that container dot div, I will just probably assume this is the right
one. I know web scraping HTML tends to be hard because it hurts your eyes, unless you
know how to read HTML very well. But it’s something just to get used to.
So I know that I’m in this div and I want to go into another div called item branding.
So div dot div. And inside of that div there is, I think, an A tag. This A tag actually
contains some things that we want, which is this guy right here. What is the make of this
graphics card? Dot div dot A. And there we have it.
So here’s the H ref of the link. So what I’m grabbing is this guy right here, this EVGA
thing that I’m grabbing. Notice I hover. It’s a clickable link. That link is this guy right
here. But what I really want is this title, the title of this link.
So what do I want? I want to do container dot a dot image. So I want to grab this image
tag now. So notice I’m just using these handles. I’m just referencing as if it was a JSON file.
And notice that I’m inside of the image now. So the image is here. Now I need to grab this
title. So this is an attribute inside of the image tag. So how do you grab an attribute?
Well you would reference it as if it was an index, or I mean, a dictionary key. So I would say title, and
this is equal to EVGA.
So now that I have prototyped it, I can go ahead and add that to my script. So I can
go ahead and copy this right here, and paste that into my script.
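
The chain of dotted lookups plus the attribute lookup can be reproduced on a cut-down container. The nesting below is an assumption based on what the video describes (an image link first, then a div holding a branding div), not a copy of Newegg's markup.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical, simplified product container.
html = """
<div class="item-container">
  <a class="item-img"><img title="product photo" src="card.jpg"></a>
  <div class="item-info">
    <div class="item-branding">
      <a class="item-brand" href="/evga"><img title="EVGA" src="evga.gif"></a>
    </div>
  </div>
</div>
"""
container = soup(html, "html.parser").div  # the outer item-container div

# div -> div -> a -> img, then read the title attribute like a dictionary key.
brand = container.div.div.a.img["title"]
print(brand)
```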
Inside of my script, this is where I actually can do that preemptive loop now. I can write
that loop now. So for container in containers. It’s going to go loop through, and it’s going
to grab container dot div dot div dot A, the image inside of that, and the title of that image, which is going to equal
the brand or the make. So that’s the first thing I grabbed.
So who makes this graphics card? That’s the first thing it’s going to do. So what else
do I want to grab while I’m inside of this thing? So let’s grab two more things. All
right. Just grab two more things just to have a really good file, because a CSV file with
one column seems a little tiny bit pointless.
All right the next thing I want to do is, I want to go ahead and grab the name of this
graphics card, which is right here. Notice that it’s embedded within this A tag, and
this A tag is embedded within this div tag. And this div tag is embedded within this div
tag. In theory, if we do a container, dot div dot div dot A, it actually brings out,
it seems like, the item brand instead. So the item brand is actually this
A tag, which is not what we wanted. We wanted this A tag.
So notice that it’s having trouble finding this particular A tag. So what I want to do,
actually, is I want to do– I can do a find all, and find just the direct class that I
want. So in this case, I can do a find me all the A tags that have item dash title. So
in this case, I can do container dot find all, I want to see the A tag,
comma, and then I want to throw it into an object. And the object is, I’m going to say,
look for all classes that match item dash title.
So this will give me a data structure back that has everything that it found. So hopefully
there should only be one thing so that we don’t have to loop over it. So in this case, I store
that in a variable, which would be title underscore container. If I look at the title underscore
container, I should have what I’m looking for. Beautiful.
So the name of the graphics card is somewhere in this thing. I’m going to put this and I’m
going to throw it into my script so I can run it later. So going back– So the title
container, notice this isn’t the actual title yet. I still have to extract the title out
of this thing. So in my title container — notice that it’s inside of the bracket bracket, which
means it’s inside of an array, or in this case it’s a list if you’re in Python.
So in this case, if I go to zero, I want to grab the first object. And inside of that
first object, I want to grab, nope it’s not inside of the I tag, it’s actually a text
inside of the A tag. So if I do dot text, this should get me what I want. Yes. So I
do title container of zero dot text, and that gives me exactly what I want.
So I’m going to place that in there, and I want to call this the title, so the product
name. So product name is equal to title container of zero dot text. So that is that.
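
A short sketch of why the dotted path lands on the brand link while a class-based find all reaches the product name. The class names item-brand and item-title follow the video; the rest of the markup is assumed.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical, simplified product container.
html = """
<div class="item-container">
  <div class="item-info">
    <div class="item-branding"><a class="item-brand" href="/evga"><img title="EVGA"></a></div>
    <a class="item-title" href="/p/1">EVGA GeForce GTX 1080, 8GB GDDR5X</a>
  </div>
</div>
"""
container = soup(html, "html.parser").div

# The dotted path stops at the FIRST <a> it meets, which is the brand link.
first_link = container.div.div.a

# Searching by class goes straight to the link we actually want.
title_container = container.find_all("a", {"class": "item-title"})

# find_all returns a list, so take element zero, then its text.
product_name = title_container[0].text
print(product_name)
```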
So I’ve got the brand, the make of the graphics card, and the name of the graphics card again.
And now we can go ahead and grab shipping, because shipping seems like something else
that they might all have.
So what we’re going to do is figure out where this shipping tag is inside of all of it.
How much does it cost for shipping, because I think some of them cost differently for
shipping. Yes, this is $4.99 shipping. So in this case, I need to find all LI tags–
basically, LI stands for a list item– with the class price dash ship. So I want to go
ahead and do that.
I’m going to copy this class. And I want to do container dot find all of LI comma, class
is equal to price dash ship. And this will give me, hopefully, a shipping container. Shipping
underscore container and, hopefully, there should only be one tag in this thing that
has shipping in it. And I need to close that function. So my shipping underscore container,
if I can just copy this, shipping container.
You will see that it gives me back an array of things that qualify. So in this case, only
one thing came back. So I can do that same thing I did earlier where I reference the
first element, and then I think it’s also in the text again, right? So I can do dot
text again. And this brings me back. It looks like there’s a lot of open space.
Notice there’s a return, and then there’s a new line. There’s a return, and then there’s
a new line. So in this case, I want to clean it up a little bit because I just want the
text. So in this case I will say strip. So strip removes whitespace before and after,
new lines, all that good stuff. So it just says free shipping now. So I can go ahead
and grab this, and throw it into my script, as well.
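
The shipping lookup and the strip call, sketched on an assumed snippet whose text is padded with new lines and spaces the way the video describes.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical markup; the padding around the text is deliberate.
html = """
<div class="item-container">
  <li class="price-ship">
        Free Shipping
  </li>
</div>
"""
container = soup(html, "html.parser").div

shipping_container = container.find_all("li", {"class": "price-ship"})
raw_text = shipping_container[0].text

# strip() with no arguments removes leading and trailing whitespace,
# including new lines; whitespace inside the text is left alone.
shipping = raw_text.strip()
print(repr(raw_text))
print(repr(shipping))
```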
So now I’ve grabbed three things. So in this case, I also need the find all that I did
earlier. So if I go up a few times, I can find it. So the shipping container itself
will be placed in here. And then if I close, actually, the find all function, and there
we go.
So now there are the three things that I want. So the product name, the brand, and the shipping
container will be actually shipping.
OK. So cool. So now this is ready to be looped through. But before that, I want to print
it out. So I want to show you why Sublime is my favorite editor. It does multi-line
editing. So in this case, I’m going to go ahead and enter three blank lines. I’m going to
copy my three variables. OK, copy, copy, copy. I’m going to paste them in here. I’m just going to
go ahead and make it nice and formatted.
So I will print all of these things out into the console, just so I can see. So in this
case I will copy this, as well. So that way, I can go ahead and just say quote, and then
paste that. So I can see what it is when it actually does print out. And then I can do
a plus for a string concatenation. It’s going to print each of these three things
out for me, so the brand, and the product name, and the shipping.
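
Putting the three lookups into the loop and printing them looks roughly like this. The two products below are made-up sample data so the loop can run without a network connection.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical page with two simplified product containers.
html = """
<div class="item-container">
  <div class="item-info">
    <div class="item-branding"><a class="item-brand"><img title="EVGA"></a></div>
    <a class="item-title">EVGA GeForce GTX 1080, 8GB</a>
    <li class="price-ship"> Free Shipping </li>
  </div>
</div>
<div class="item-container">
  <div class="item-info">
    <div class="item-branding"><a class="item-brand"><img title="MSI"></a></div>
    <a class="item-title">MSI GeForce GTX 1070</a>
    <li class="price-ship"> $4.99 Shipping </li>
  </div>
</div>
"""
containers = soup(html, "html.parser").find_all("div", {"class": "item-container"})

for container in containers:
    brand = container.div.div.a.img["title"]

    title_container = container.find_all("a", {"class": "item-title"})
    product_name = title_container[0].text

    shipping_container = container.find_all("li", {"class": "price-ship"})
    shipping = shipping_container[0].text.strip()

    # String concatenation with +, one labelled line per field.
    print("brand: " + brand)
    print("product_name: " + product_name)
    print("shipping: " + shipping)
```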
And basically, before I throw this into a CSV file, I want to just make sure that this
loop works. So I want to save this web scrape thing, too. I want to call this my first
web scrape dot py. OK. So if I open this, there should be a file here. If I right click
and open up another console, so notice I have a console from before. But this one is running Python.
I want to open up this one.
And I want to tell it. So notice that I’m inside of this file path now. So this file
path is a file path that contains this script already in it. So what I need to do is just
do Python. So I want to tell it to run Python. And I want to tell it, OK now that I’m in Python,
execute this script. So my first web scrape dot py. Hit Enter. And then, hopefully, look
at that. It went through. It did that loop. And it grabbed every other graphics card for
me.
So all I have to do now is throw this into a CSV file. And I can then open it in Excel.
So let’s go ahead and do that real quick. Just finish up our code. And I don’t really
need the prototype for this, because I know that the script works now.
To open up a file, you would do just the simple Open. And then, in this case, I need a file
name. So the file name is equal to, I guess, products dot CSV. OK so I want to open up
a file name. I need to specify a mode. So in this case W for write. So I want to
open up a new file and write to it. So this would be called F. So the normal convention
for a file writer is F.
And I want to write the headers to this thing. So, in this case, F dot write of,
now I need to pass some headers, since a CSV file usually has headers. In this case, headers
will equal to, I think I’ll make it, brand, then name, let’s call it product name, because
if you load this into a SQL database later, name is a key word in SQL. So product name,
and then I’ll call this shipping. OK. And then I also need to add a new line because
CSV rows are delimited by new lines.
So I’m going to tell it to write the first line to be a header. And then the next thing
is, I want to tell it, every time it loops through, to write a line to the file. So instead
of printing it to the console, which I’ll let it do actually, I’m going to do F dot
write. So F dot write is going to write these three things. So brand, product name,
shipping. I paste that in there. That’s going to paste all three of them for me.
But what I need to do is actually concatenate them together. And I need to concatenate them
with a comma in the middle. So comma. And let me just double check something real quick.
See if my strings are clean. And no it is not. So notice that the product names have
commas inside of them. So what that’s going to do is it’s going to create extra columns
inside of my CSV file.
So before I print the product names out, I actually need to do a string replace. So I
need to call a replace function, as in every time you see a comma, let’s replace it with something
else. And I like to do a pipe, but you can delimit it with anything you want. This is
programming. You can do whatever you want as long as it doesn’t err. In this case, I
would go ahead and do that. And also, don’t forget this, it needs to be delimited by
a new line.
So every time it’s going to loop through, it’s going to grab and parse all of the data points.
And then it’s going to write it to a file as a line in the file. And what I need to
do is, once it’s done looping, I will have to close the file. Because if you don’t close
the file, you can’t open the file. Only one thing can open the file at a time.
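
The file-writing steps above, sketched with two made-up rows standing in for the scraped values. The file name products.csv and the header names follow the video; the row data is invented.

```python
# Hypothetical rows standing in for (brand, product_name, shipping) from the loop.
rows = [
    ("EVGA", "EVGA GeForce GTX 1080, 8GB", "Free Shipping"),
    ("MSI", "MSI GeForce GTX 1070", "$4.99 Shipping"),
]

out_filename = "products.csv"
headers = "brand,product_name,shipping\n"

f = open(out_filename, "w")  # "w" creates the file, or overwrites an existing one
f.write(headers)

for brand, product_name, shipping in rows:
    # Commas inside a product name would create extra columns, so swap them for pipes.
    f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "\n")

f.close()  # close the file once the loop is done
```

As an alternative that is not what the video does, Python's built-in csv module can write the rows and will quote any field that contains a comma, so the replace step is not needed.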
All right. So I will run the script again. So notice if I just push up, it runs the script.
So you have to save the script first. I’m going to do Control S to quickly save it.
When you do control– syntax error! I forgot to add a concatenation with the plus N. So
I need to do a plus N to tell it to concatenate that. So I go Python my first web scrape.
It went through.
So after running that script, it’s gone ahead and scraped everything and printed everything
to the console. But more importantly, it rewrote everything to this file. I told it to write
everything to the CSV file. So if I open it up right now, you can see that it has gone
ahead and scraped the entire page and thrown every data point as a row, every product as
a row, into this CSV file.
So you can go ahead and scrape the other details, like whether or not it is a sales price or
not, what the image tag might be. And then there’s multiple pages. So if you go to Amazon,
for example, there’s probably multiple pages of products. So you can start looping through.
So usually up here, there’s a page equal something. So you can just do a loop and just say, in
this case, do page two instead of page one.
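
One possible way to loop over pages is to rebuild the URL with a different page number each time. The helper name page_urls is made up, and the assumption that the site uses a query parameter named Page should be checked against the real URL in your browser; example.com below is a placeholder.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(base_url, last_page, param="Page"):
    """Return one URL per page by rewriting a single query parameter."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    urls = []
    for page in range(1, last_page + 1):
        query[param] = str(page)  # overwrite (or add) the page number
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


# Placeholder URL; each result could then be fetched and parsed in turn.
for url in page_urls("http://example.com/list?Description=GTX&Page=1&PageSize=36", 3):
    print(url)
```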
And that concludes today’s lesson on how to web scrape with Python. And I hope you guys
learned a lot and had fun doing it.
Now I want to really know from you guys, did you guys enjoy this kind of video? Do you
guys want more coding videos? More data science videos? And if there’s a better way to code
something, also let me know. I’m always happy to hear from you guys. What do you guys enjoy?
I want to make this content for you guys. All right. Now I’ll see you guys later, and
happy coding.

Repository:
R code, scripts, and supplemental items

Sublime:
https://www.sublimetext.com/3

Anaconda:
https://www.anaconda.com/distribution/#download-section

JavaScript beautifier:
https://beautifier.io/

If you are not seeing the command line, follow this tutorial:
https://www.tenforums.com/tutorials/72024-open-command-window-here-add-windows-10-a.html

More Data Science Material:
[Video] Learn how to web scrap in R
[Video] Setup Python and R for Data Science
[Video] Time Series in Python Part 1

The Code

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

# URL to web scrape from.
# in this example we web scrape graphics cards from Newegg.com
page_url = "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

# opens the connection and downloads html page from url
uClient = uReq(page_url)

# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# finds each product from the store page
containers = page_soup.findAll("div", {"class": "item-container"})

# name the output file to write to local disk
out_filename = "graphics_cards.csv"
# header of csv file to be written
headers = "brand,product_name,shipping\n"

# opens file, and writes headers
f = open(out_filename, "w")
f.write(headers)

# loops over each product and grabs attributes about
# each product
for container in containers:
    # Finds all link tags "a" from within the first div.
    make_rating_sp = container.div.select("a")

    # Grabs the title from the image title attribute
    # Then does proper casing using .title()
    brand = make_rating_sp[0].img["title"].title()

    # Grabs the text of the "a" tag at index 2 (the third link)
    # from within the list of queries.
    product_name = container.div.select("a")[2].text

    # Grabs the product shipping information by searching
    # all list items with the class "price-ship".
    # Then cleans the text of white space with strip()
    # Removes "$" and " Shipping" if they exist to leave just the number
    shipping = container.findAll("li", {"class": "price-ship"})[0].text.strip().replace("$", "").replace(" Shipping", "")

    # prints the dataset to console
    print("brand: " + brand + "\n")
    print("product_name: " + product_name + "\n")
    print("shipping: " + shipping + "\n")

    # writes the dataset to file (no spaces after commas, matching the header row)
    f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "\n")

f.close()  # Close the file
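
The shipping cleanup in the loop above is just a chain of string methods, so it can be tried on its own without downloading a page. Here is a small sketch, using a hypothetical `clean_shipping` helper and made-up sample strings rather than real Newegg text:

```python
def clean_shipping(raw_text):
    """Strip whitespace, then drop the dollar sign and the word Shipping."""
    return raw_text.strip().replace("$", "").replace(" Shipping", "")


# Made-up examples of what the "price-ship" text might look like.
samples = ["\n  $4.99 Shipping \n", "Free Shipping", ""]
cleaned = [clean_shipping(s) for s in samples]
print(cleaned)  # ['4.99', 'Free', '']
```

Testing the cleanup separately like this makes it easier to tell whether a problem is in the string handling or in finding the element on the page.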


Phuc H Duong
About The Author
- Phuc holds a Bachelor's degree in Business with a focus on Information Systems and Accounting from the University of Washington.

1 Comment

  • Houston Muzamhindo

    It’s giving errors. I have copied everything but it’s not working. I dug into the code and could make it return some values but the shipping part is not working. This is my updated code:

    # -*- coding: utf-8 -*-
    """
    Created on Thu Sep 5 10:25:58 2019

    @author: houston.muzamhindo
    """

    from bs4 import BeautifulSoup as soup #HTML data structure
    from urllib.request import urlopen as uReq #Web Client

    #URL to web scrap from
    page_url = "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

    #open the connect and download the html page from url
    uClient = uReq(page_url)

    #parse the HTML into a soup data structure to traverse html as if it were a json data type
    page_soup = soup(uClient.read(), "html.parser")
    uClient.close()

    #find each product from the store page
    containers = page_soup.findAll("div", {"class" : "item-container"})

    #name the output file to write to local disk
    out_filename = "graphics_cards.csv"

    #header of the csv file to be written
    headers = "brand,product_name,shipping\n"

    #open file and write headers
    f = open(out_filename, "w")
    f.write(headers)

    #loop through each product and grab attributes about each product
    for i in range(len(containers)):

        i = 1

        #find all link tags "a" form within the first div
        make_rating_sp = containers[i].select("img")

        #grab the title from the image title attribute
        #then do proper casing using .title()
        brand = make_rating_sp[0]["title"].replace("\n", "").replace("\r", "").replace(" ", "")

        #grab the text within the second "a" tag from within the list of queries
        product_name = containers[i].div.select("a")[1].text

        #grab the product shipping information by searching
        #all lists with the class "price-ship"
        #then clean the text of white space with strip()
        #clean the strip of "Shipping $" if it exists to just get a number
        shipping = containers[i].findAll("li", {"class" : "price-ship"}).text.strip().replace("$", "").replace(" Shipping", "")

        #print the dataset to console
        print("brand:" + brand + "\n")
        print("product_name:" + product_name + "\n")
        print("shipping:" + shipping + "\n")

        #write the dataset to file
        f.write(brand + ", " + product_name.replace(",", "|") + ", " + shipping + "\n")

    #close the file
    f.close()
