The Art of Web Scraping

Master the practice of extracting data from a website as efficiently as possible.

Sam Berry
Better Programming



Web scraping is essentially automated copy/pasting, and it turns out to have some genuinely useful applications. The web is packed with valuable information and resources, but it takes a human time and effort to find and process them, which is why web scraping is such a popular practice.

Web scraping was at its most popular in June 2020, and it’s not over yet

Let’s look at an example.

Let’s say you’re a consumer looking to buy a new graphics card from eBay. You could check the prices of similar items that come up from a search every day and write down the average price in Excel so that you know the best time to buy — but that’s labor-intensive. And time-consuming. And boring.

What you should do is write a program to do it for you. But how would you do that?

Firstly, you’d have to learn a programming language if you haven’t already. For the purposes of this article, we’ll say Python. You’d also need to know how websites work, which means some basic HTML and CSS.

Next, you’ll need to decide how you’re going to be using this language to extract information from this website. You’ll need to inspect the website to get an idea of what you’ll be scraping. Will you have to interact with the page to extract data, or can you simply download the page data and process that text?

We’re searching for the GTX 1660 graphics card, so let’s have a look at eBay now:

Immediately, you’ll see that the search takes you to the URL “https://www.ebay.co.uk/sch/i.html?_nkw=gtx+1660”, and the results seem to load instantly. Now is a good time to inspect the source of the page: right-click on a product and press “Inspect.”

Upon inspecting the product listings, we see that each listing is an item in an unordered list. The list has some unique classes you can use to single it out from the rest of the page, srp-results and srp-list, and each list item has the class s-item.

Within each product listing, the price sits in a span with the class s-item__price.

Because this task only requires reading static page data, we can use Beautiful Soup, a Python library that parses markup documents so they’re easy to navigate and extract data from. If you had to interact with the page (e.g., click buttons or provide input), it’d be best to use a library like Selenium or Mechanize, which automates an entire browser, but it’s best to avoid that where possible because driving a full browser is slower and heavier than making a plain HTTP request.

To install Beautiful Soup:

pip3 install beautifulsoup4

When using the requests library (a third-party package, not part of Python’s standard library) to get the URL’s source, I got back status code 200. You can read about the different response codes here, but 200 means the request was okay, and we can continue. Sometimes, though, you’ll have to send a user-agent header so that the site thinks the request is coming from a browser rather than a scraper.
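Here’s a minimal sketch of that request. The plain call is often enough; the User-Agent string below is only an illustrative example, not necessarily the one used in the original article:

import requests

url = "https://www.ebay.co.uk/sch/i.html?_nkw=gtx+1660"

# a plain request is often all you need
r = requests.get(url)
print(r.status_code)  # 200 means the request was accepted

# some sites only respond properly to browser-like clients,
# so you can send a User-Agent header (example string only)
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
r = requests.get(url, headers=headers)
print(r.status_code)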

“The User-Agent string is one of the criteria by which Web crawlers may be excluded from accessing certain parts of a website using the Robots Exclusion Standard (robots.txt file).” — Wikipedia

In summary, sometimes a user-agent must contain a certain string for the request to be accepted, or must not contain a certain string, or must simply contain something rather than nothing.

The HTTP request returns the source of the page — it’s the same way a browser fetches the HTML it renders and displays when you’re browsing the web.

We can use Beautiful Soup to parse the HTML and process the product listings by using the classes we identified earlier. Here’s the code:

from bs4 import BeautifulSoup
import requests

# get source
r = requests.get("https://www.ebay.co.uk/sch/i.html?_nkw=gtx+1660")

# parse source
soup = BeautifulSoup(r.text, 'html.parser')

# find all list items from the search results
results = soup.find("ul", {"class": "srp-results"}).find_all("li", {"class": "s-item"})

We can iterate over the results to find the text in the span with class s-item__price. Here’s the code to do that:

for result in results:
    priceSpan = result.find("span", {"class": "s-item__price"})
    print(priceSpan.text)

Running this prints the price text of each listing. To convert it to a number, you should first remove the first character, which is always “£”; then you could parse the rest of the string with Python’s float() — if it weren’t for some exceptions.

Sometimes a listing shows a range of prices (a string containing “to”), and other times the price contains commas separating the thousands.

To deal with this, you could ignore the listings that contain the string ‘to’, and use the Python replace method to remove any commas, and then convert to float. The code is as follows:

for result in results:
    priceText = result.find("span", {"class": "s-item__price"}).text
    if "to" in priceText:
        continue
    price = float(priceText[1:].replace(",", ""))

Every eBay product listing also contains a span with the class s-item__shipping. We can use this to find the shipping cost and work out the total price of each item, as follows:

for result in results:
    shippingSpan = result.find("span", {"class": "s-item__shipping"})
    print(shippingSpan.text)

The shipping text takes the form “+ £(PRICE) postage”, or “Free postage” if the shipping is free. For the first case, you can split() the string, take the second token, and do the same thing as above: parse the string without its first character as a float. For the second case, check whether the text contains “Free” and set the shipping cost to 0 if so.

prices = []

for result in results:
    priceText = result.find("span", {"class": "s-item__price"}).text
    if "to" in priceText:
        continue
    price = float(priceText[1:].replace(",", ""))

    shippingText = result.find("span", {"class": "s-item__shipping"}).text
    if "Free" in shippingText:
        shipping = 0
    else:  # is not free
        shipping = float(shippingText.split()[1][1:])

    prices.append(price + shipping)

Above, I’ve implemented the method just described and appended the combined item and postage cost to a list so that it can be processed later on.

Example of an anomalous listing: a 1660 Ti priced at 9,999 euros

There are some anomalous listings — possibly made to trick auto-purchasing bots into buying hugely overpriced items in times of supply chain shortages.

Before you take an average, you should exclude these outliers from your data. A simple approach is to assume the prices are roughly normally distributed and discard anything more than a couple of standard deviations from the mean. If this is new to you, I recommend reading the Wikipedia page and this article about excluding anomalies in Python.

“Our approach was to remove the outlier points by eliminating any points that were above (Mean + 2*SD) and any points below (Mean — 2*SD) before plotting the frequencies.” — Punit Jajodia, Chief Data Scientist

The above can be implemented using NumPy. First, make sure you have NumPy installed. Here’s the command:

pip3 install numpy

Add this function to your code:

import numpy as np

def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]

This will remove outliers from a NumPy array. Now you can use it on your prices list and take an average, as follows:

prices = reject_outliers(np.array(prices))
avgPrice = np.mean(prices)

To be able to analyse this data in Excel, the mean should be written to a CSV file. I used the date as the first column and the price as the second.

import csv
from datetime import date

fields = [date.today().strftime("%b-%d-%Y"), np.around(avgPrice, 2)]
with open('prices.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(fields)

After running my script once a day for a few days, I used Excel to plot a graph of the collected averages.
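The daily script is essentially the pieces above glued together. Here’s a sketch of one way to assemble them (the original article’s complete code may differ in its details; the check for missing spans is an extra safeguard I’ve added for listings without a price or postage line):

import csv
from datetime import date

import numpy as np
import requests
from bs4 import BeautifulSoup


def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]


# fetch and parse the search results page
r = requests.get("https://www.ebay.co.uk/sch/i.html?_nkw=gtx+1660")
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find("ul", {"class": "srp-results"}).find_all("li", {"class": "s-item"})

prices = []
for result in results:
    priceSpan = result.find("span", {"class": "s-item__price"})
    shippingSpan = result.find("span", {"class": "s-item__shipping"})
    if priceSpan is None or shippingSpan is None:
        continue  # skip listings missing a price or postage span

    priceText = priceSpan.text
    if "to" in priceText:
        continue  # skip price ranges
    price = float(priceText[1:].replace(",", ""))

    if "Free" in shippingSpan.text:
        shipping = 0
    else:
        shipping = float(shippingSpan.text.split()[1][1:])

    prices.append(price + shipping)

# drop outliers and average what's left
avgPrice = np.mean(reject_outliers(np.array(prices)))

# append today's average to the CSV
fields = [date.today().strftime("%b-%d-%Y"), np.around(avgPrice, 2)]
with open('prices.csv', 'a', newline='') as f:
    csv.writer(f).writerow(fields)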

Web scraping is a powerful tool, and it’s great fun to learn and practice. I recommend running programs like this on a Raspberry Pi, or something similar, since it can run 24/7 while consuming very little power. You can also schedule the script to run automatically every day instead of launching it yourself, as shown below.
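On a Raspberry Pi or any other Linux machine, one common way to do that scheduling is with cron. This is just a sketch; the time and the script path are placeholders:

# run the scraper every day at 09:00 (add this line via `crontab -e`)
0 9 * * * python3 /home/pi/gpu-prices/scraper.py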

I urge you to learn more about this topic; it makes for an engaging introduction to data science. I hope this article was helpful.
