Turn Website Data Into Data Sets: A Beginner’s Guide to Python Web Scraping

Extract information from websites in no time, in a highly automated manner

Christopher Kindl
Better Programming


list of real estate listing on the left connected by an arrow to a table of data on the right
Illustration by author (Unsplash images used for fictive listing images)

Overview

What the article covers

  • Technical and legal considerations of web scraping
  • Example for scraping a search-based platform using the HTML-based method with Python’s Beautiful Soup library
  • Common techniques to tackle anomalies and inconsistencies in data when scraping
  • An outlook of how the discussed example can be transformed into a data pipeline using Amazon’s cloud-computing platform, AWS, and Apache Airflow to regularly collect data

What is web scraping?

Web scrapers access the underlying code of a website and gather large amounts of data, which are later saved to a local file or database. Web scraping has become an established discipline in data science and in business: Companies collect competitor trends, pursue market studies, and perform in-depth analyses, all on data that can be accessed publicly.

diagram showing the three layers of a website
3 layers of a website (Illustration by author)

How do web scrapers work in general?

There are different approaches to web scraping. This article focuses on HTML scrapers and covers a brief overview of other methods in the last section. As the name already reveals, these types of scrapers use the underlying HTML code of a website to retrieve the desired information. This method accesses the website informally, which is why customized code logic is required to decipher the desired data.

Difference compared to API calls

An alternative method of fetching website data makes use of processes called API calls. An API (application programming interface) is officially provided by the website owner and allows particular information to be requested directly from the database (or plain files). This usually requires permission from the website owner and is highly secured (e.g., API keys and tokens). However, APIs are not always available, which is why scraping is highly appealing, but it also raises the question of legality.

Legal considerations

Web scraping might violate copyright norms or terms of service of a website owner, especially when it is used for competitive advantage, a financial windfall, or abusive purposes in general (e.g., sending requests within very short time intervals). However, scraping data that is publicly available and is used for 1) private consumption, 2) academic purposes, or 3) other cases without commercial intention can generally be considered legally harmless.

If data 1) is protected behind a login or paywall, 2) is explicitly prohibited from being scraped by the website owner, 3) contains confidential information, or 4) compromises the privacy of the individual, any kind of scraping activity must be avoided¹. Therefore, bear in mind to always act responsibly and follow compliance first.

Beautiful Soup Setup in Python

Beautiful Soup is a Python library for collecting data from websites by accessing the underlying HTML code.

Install the latest version of Beautiful Soup in your terminal:

$ pip install beautifulsoup4

Install requests to be able to call websites (the library sends HTTP requests):

$ pip install requests

Within a Python or Jupyter notebook file, import these libraries:

from bs4 import BeautifulSoup
import requests

And some standard libraries for data processing and transformation steps:

import re
from re import sub
from decimal import Decimal
import io
from datetime import datetime
import pandas as pd

0. Introduction

Imagine we want to scrape a platform that contains publicly available ads of properties. We want to obtain information such as the 1) price of the property, 2) its address, and the 3) distance, 4) station name, and 5) transport type to the nearest public transport stations to find out how property prices are distributed across public transport stations in a particular city.

For example, what is the average housing price for a public transport station XY if we consider the 50 closest properties to this station?

diagram showing a section of the London Metro with the average real estate price at each station
Analysis idea (Illustration by author)

Important note: We explicitly avoid using a specific website as we do not want to encourage a rush on that website. This is why the code in this article is generalized and a fictive website with fictive HTML code is used. Nevertheless, the example code closely corresponds to real-life websites.

Assume that a search request for properties will lead to a results page that looks like this:

list of fictional property listings with photos and key data
Fictive results page (Illustration by author using Unsplash images for fictive listings)

Once we know what layout and structure the ads are shown in, we need to come up with a scraping logic that allows us to fetch all desired features (e.g., price, address, public transport information features) from every ad available on the results page.

Whenever I am confronted with a scraping task of this nature, I approach it by the following steps:

  1. How to get one data point for one feature?
    (E.g., get the price tag from the first ad.)
  2. How to get all data points for one feature from the entire page?
    (E.g., get price tags of all ads on the page.)
  3. How to get all data points for one feature available across all results pages?
    (E.g., get price tags of every ad shown for a particular search request.)
  4. How to tackle inconsistency if the data point of interest is not always applicable in an ad?
    (E.g., there are some ads in which the price field says “Price on application.” We would end up having a column consisting of numeric and string values, which does not allow a ready-to-go analysis in our case. Of course, we could simply exclude string values when doing the analysis. This step is just to demonstrate how to anticipate a cleaner data set from the beginning, which might be even more valuable in other cases.)
  5. How to better extract complex information?
    (E.g., assume every ad contains public transportation information, such as “0.5 miles to subway station XY.” What does the logic need to look like so that we can store this mix of information directly in the right format: distance = [0.5], transport_type = [“underground”], station = [“name XY”])
diagram of a web-scraping framework for a typical search platform
Web scraping framework (Illustration by author)

1. Logic to get one data point

Important note: All code snippets discussed below can also be found in a complete Jupyter Notebook file in my repository on GitHub.

Call website

First, we replicate in the Python script the search request we made in the browser:

# search area of interest
url = 'https://www.website.com/london/page_size=25&q=london&pn=1'

# make request with url provided and get html text
html_text = requests.get(url).text

# employ lxml as a parser to extract html code
soup = BeautifulSoup(html_text, 'lxml')

The variable soup now contains the entire HTML source code of the results page.

Search feature-specific HTML tags

The trick is now to find distinguishable HTML tags, either classes or ids, that refer to a particular information point of interest (e.g., price, see illustration below).

For this step, we need the help of the browser. Some of the popular browsers offer a convenient way to get the HTML information of a particular element directly. In Google Chrome, you 1) mark the particular feature field and 2) right-click to get the option to inspect the element (or simply use the keyboard shortcut Cmd + Shift + C). The source code then opens next to the browser view, and you can directly see the HTML information.

In the example of the price, using the HTML class css-aaabbbccc will return the information £550,000 as shown in the browser view.

diagram showing process by which data is extracted from property listings
Get data points based on HTML class (fictive code) (Illustration by author using Unsplash images)

Understanding HTML class and id attributes

HTML classes and ids are used by CSS and sometimes JavaScript to perform certain tasks². These attributes are mostly used to refer to a class in a CSS style sheet so that data can be displayed in a consistent way (e.g., display price tag on this position in this format).

The only difference between class and id is that ids are unique in a page and can only apply to at most one element, while classes can apply to multiple elements³.

In the example above, the HTML class used for retrieving the price information from one ad also applies for retrieving prices from other ads (which is in line with the main purpose of a class). Note that an HTML class could also refer to price tags outside of the ads section (e.g., special deals that are not related to the search request but are shown on the results page anyway). However, for the purpose of this article, we are only focusing on the prices within the property ads.

This is why we target an ad first and search for the HTML class only within the ad-specific source code:
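
Since the website is fictive, the class names below are placeholders: a minimal sketch of this step, assuming css-ad-wrapper-123456 marks an ad container and css-aaabbbccc marks its price field (the full snippets live in the notebook mentioned above), could look like this:

# pick the first ad container on the results page (fictive class name)
ad = soup.find('p', class_ = 'css-ad-wrapper-123456')

# find the price within the ad-specific source code only
price = ad.find('p', class_ = 'css-aaabbbccc').text

# show price
price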

the price “500,000 pounds”
Output for the variable price (Image by author)

Using .text at the end of the method find() allows us to return only the plain text as shown in the browser. Without .text, it would return the entire source code of the HTML line that the class refers to:

source code of HTML line
Output for the variable price without using .text (Image by author)

Important note: We always need to provide the HTML element, which is p in this case. Also, pay attention not to forget the underscore at the end of the attribute name class_.

Get other features with the same logic

  1. Inspect feature of interest
  2. Identify corresponding HTML class or id
  3. Decode source code using find(...).text

For example, to get the address, the HTML class and the corresponding code might look like this:

# find address in ad
address = ad.find('p', class_ = 'css-address-123456').text

# show address
address
the address “Holland Road London SW17”
Output for variable address (image by author)

2. Logic to get all data points from one single page

To retrieve the price tags for all the ads, we first catch every ad on the page by applying the method find_all() instead of find():

# get all ads within one page
ads = soup.find_all('p', class_ = 'css-ad-wrapper-123456')

The variable ads now contains the HTML code of every applicable ad on the first results page as a list. This storage format is very helpful as it allows us to access ad-specific source code by index:

# identify how many ads we have fetched
len(ads)
# show source code of first ad
print(ads[0])

For the final code, to get all price tags, we use a dictionary to collect the data and iterate over the ads by applying a for-loop:

Important note: Incorporating an id allows us to identify individual ads in the dictionary:
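
That snippet lives in the notebook; a minimal sketch of the loop, assuming a simple running counter as the ad id and reusing the fictive class names from above, could look like this:

# get all ads within one page
ads = soup.find_all('p', class_ = 'css-ad-wrapper-123456')

# dictionary to collect the data (map is kept as the variable name to match the outputs below)
map = {}

for i, ad in enumerate(ads, start = 1):
    # the running counter i serves as a simple ad id
    price = ad.find('p', class_ = 'css-aaabbbccc').text
    address = ad.find('p', class_ = 'css-address-123456').text
    map[i] = {'price': price, 'address': address}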

# show first ad
map[1]
code output of showing price and address
Output of map[1] (Image by author)

3. Get data points from all available results pages

A typical search-based platform has either pagination (click the button to hop to the next results page) or infinity scroll (scroll to the bottom to load new results) to navigate through all available results pages.

Case 1. Website has pagination

URLs that result from a search request usually contain information about the current page number.

URLs with the elements that indicate page number highlighted
Hopping to the next page by URL (Illustration by author)

As can be seen in the illustration above, the ending of the URL refers to the results page number.

Important note: The page number in the URL usually becomes visible from the second page. Using a base URL with the additional snippet &pn=1 to call the first page will still work (in most cases).

Applying another for-loop on top of the previous one allows us to iterate over the results pages:
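
A minimal sketch of the nested loops, assuming for illustration a fixed number of 10 results pages and the fictive URL pattern from above (how to detect the last page properly is discussed next), could look like this:

map = {}
ad_id = 0

# outer loop: iterate over the results pages via the pn parameter in the URL
for page in range(1, 11):
    url = 'https://www.website.com/london/page_size=25&q=london&pn=' + str(page)
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, 'lxml')

    # inner loop: iterate over the ads of the current page
    for ad in soup.find_all('p', class_ = 'css-ad-wrapper-123456'):
        ad_id += 1
        map[ad_id] = {'price': ad.find('p', class_ = 'css-aaabbbccc').text,
                      'address': ad.find('p', class_ = 'css-address-123456').text}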

Identify the last results page

You may wonder how to identify the last results page. In most cases, once the final page is reached, any request with a page number larger than the actual last page leads back to the first page. Consequently, simply using a very large page number and waiting until the script has finished does not work; it will start to collect duplicate values after a while.

To tackle this problem, we can incorporate logic that checks whether the link in the pagination button is still applicable:

diagram that illustrates the logic for continuing to the next page
Stop scraping by logic (Illustration by author)
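
A minimal sketch of that check, assuming a fictive class name for the link inside the “next page” pagination button, could look like this:

page = 1
while True:
    url = 'https://www.website.com/london/page_size=25&q=london&pn=' + str(page)
    soup = BeautifulSoup(requests.get(url).text, 'lxml')

    # ... collect the ads of the current page as shown above ...

    # fictive class name for the link inside the "next page" pagination button
    next_link = soup.find('a', class_ = 'css-pagination-next-123456')
    if next_link is None or not next_link.get('href'):
        break  # no applicable link -> last results page reached
    page += 1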

Case 2. Website has infinity scroll

If the website follows an infinity scroll approach, an HTML scraper might not be helpful, as new results only load once the bottom of the page is reached in the browser. This cannot be simulated by an HTML scraper and demands a more sophisticated approach (e.g., Selenium — see the alternative scraping methods further down in the article).

4. Tackling information inconsistency

Data is never in the right shape, especially when it is freshly collected. As mentioned at the beginning of the article, a possible scenario might be that we see price tags that do not represent numeric values.

two property listings, one showing a price and the other the letters “POA” in the price field
Tackling information inconsistency (Illustration by author using Unsplash images)

If we want to avoid noise data from the beginning, we can apply the following workaround:

  1. Define a function to detect anomalies (see the sketch after this list)
  2. Apply the function in the data collection for-loop (shown after step 3)
  3. Optional: On-the-fly data cleaning
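
A minimal sketch of step 1, assuming the anomaly we want to catch is the “Price on application” label (the function name is hypothetical), could look like this:

def is_price_anomaly(price_text):
    # treat non-numeric price labels such as "POA" / "Price on application" as anomalies
    anomalies = ['poa', 'price on application']
    return any(term in price_text.lower() for term in anomalies)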

You may have already noticed that the price format £ XX,XXX,XXX still represents a string value due to the presence of the currency sign and comma delimiters. Act efficiently and do the cleaning while scraping:

Define function:
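
A minimal sketch of the cleaning function, reusing the sub and Decimal imports from the setup section (the function name is again hypothetical):

def clean_price(price_text):
    # strip the currency sign and comma delimiters, e.g. "£550,000" -> Decimal('550000')
    return Decimal(sub(r'[^\d.]', '', price_text))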

Incorporate function into data collection for-loop:
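
Combining both helper functions inside the inner for-loop might then look like this:

for ad in soup.find_all('p', class_ = 'css-ad-wrapper-123456'):
    price_text = ad.find('p', class_ = 'css-aaabbbccc').text

    # skip ads whose price field does not hold a numeric value
    if is_price_anomaly(price_text):
        continue

    ad_id += 1
    map[ad_id] = {'price': clean_price(price_text),
                  'address': ad.find('p', class_ = 'css-address-123456').text}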

5. Extracting nested information

The features of the public transport section are mixed together. For the analysis, it would be best to store the values for distance, station name, and transport type separately.

diagram of a property listing linked by an arrow to code, to show how nested information is extracted
Illustration by author (Unsplash images used for fictive listing images)

1. Distill information by rules

As illustrated above, every piece of information about public transport is represented in this format: “[numeric value] miles [station name].” Note how “miles” acts as a delimiter here; we can use it to split the string.

Always try to identify a rule to distill information instead of deciding by best guess. We can define the logic as follows:
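
A minimal sketch of that logic, assuming a fictive class name (css-transport-123456) for the public transport lines of an ad, could look like this:

# get the public transport entries of an ad (fictive class name)
transport = ad.find_all('div', class_ = 'css-transport-123456')

distances, stations = [], []
for i in range(len(transport)):
    # split each string, e.g. "0.3 miles Sloane Square", by the delimiter "miles"
    distance, station = transport[i].text.split(' miles ')
    distances.append(float(distance))
    stations.append(station)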

Initially, the variable transport stores two lists in a list, as there are two public transport information strings (e.g., “0.3 miles Sloane Square,” “0.5 miles South Kensington”). We iterate over these lists using the len of transport as index values and split each string into the two variables distance and station.

2. Search for additional HTML attributes to decode visual information

If we dive deeply into the HTML code, we find an HTML attribute, testid, that reveals the name of the icon that is used to display the transport type (e.g., “underground_station” — see illustration above). This information serves as metadata and is not shown in the browser view. We use the corresponding HTML class css-StyledIcon to get the entire source code of this section and add testid to carve out the information:
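
A sketch of that step, assuming testid is a plain HTML attribute on the icon elements, could look like this:

# get the icon elements of the public transport section via their HTML class
icons = ad.find_all('div', class_ = 'css-StyledIcon')

# carve out the transport type from the testid attribute (metadata, not visible in the browser)
transport_type = [icon.get('testid') for icon in icons]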

This example shows how useful it can be to dig deeper into the source code and watch out for metadata that can reveal meaningful information for optimizing the web scraper.

6. Transform to data frame and export as CSV

When the scraping task is done, all fetched data is accessible in a dictionary of dictionaries.

Let’s consider only one ad first to better demonstrate the final transformation steps.

Output of the first ad in the dictionary:

# show data of first ad
map[1]
output showing price, location, and other data extracted from a property listing
Output of map[1] (Image by author)

1. Transform dictionary into a list of lists to get rid of nested information

We transform the dictionary into a list of lists so that each category only holds one value.
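
Assuming the dictionary entry of the first ad also stores the transport features from section 5 as lists (distance, station, transport_type), a sketch of this step could look like this:

result = []
ad_data = map[1]

# create one row per public transport entry so that each column holds exactly one value
for i in range(len(ad_data['distance'])):
    result.append([1,
                   ad_data['price'],
                   ad_data['address'],
                   ad_data['distance'][i],
                   ad_data['station'][i],
                   ad_data['transport_type'][i]])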

output in the form of a list of prices and locations for different property listings
List of lists as pre-processing step to create the data frame (Image by author)

See how we got rid of the multiple values in the public transport section and created an additional entry for the second public transport data point. Both entries are still identifiable by the id 1 we have inherited.

2. Create data frame using the list of lists as input

df = pd.DataFrame(result, columns = ["ad_id", "price", "address",
                                     "distance", "station", "transport_type"])
final table showing price, address, distance, and station
Final format for using one ad (Image by author)

The data frame can be exported as a CSV as follows:

# incorporate timestamp into filename and export as csv
filename = 'test_' + datetime.now().strftime('%Y-%m-%d_%H-%M-%S') + '.csv'
df.to_csv(filename)

Transformation to transfer all ads into the data frame:
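
Generalizing the previous step to every entry in the dictionary only requires one more loop; a sketch could look like this:

result = []
for ad_id, ad_data in map.items():
    # again, one row per public transport entry and per ad
    for i in range(len(ad_data['distance'])):
        result.append([ad_id,
                       ad_data['price'],
                       ad_data['address'],
                       ad_data['distance'][i],
                       ad_data['station'][i],
                       ad_data['transport_type'][i]])

df = pd.DataFrame(result, columns = ["ad_id", "price", "address",
                                     "distance", "station", "transport_type"])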

We made it! That was the final step. You have built your first scraper that is ready to be tested!

7. Limitations of HTML scraping and alternative methods

This example has shown how straightforward HTML scraping can be for standard cases. Extensive research of library documentation is not really necessary. It demands, rather, creative thinking as opposed to complex web development experience.

However, HTML scrapers also have downsides⁴:

  • Can only access information within the HTML code that is loaded directly when the URL is called. Websites that require JavaScript and Ajax to load the content will not work.
  • HTML classes or ids may change due to website updates (e.g., new feature, website redesign).
  • Cannot transmit user content to the website, such as search terms or login information (except search requests that can be incorporated into the URL, as seen in the example).
  • Can be detected easily if the requests appear anomalous to the website (e.g., a very high number of requests within a short time interval). Websites prevent this by defining rate limits (e.g., only allowing a limited number of actions within a certain time) or by checking other indicators, such as screen size or browser type, to identify real users.
table describing and comparing alternative methods of web scraping
Overview of alternative scraping methods⁴ (Illustration by author)

Learn how to transform a simple web-scraping script into a cloud-based data pipeline

As a next step, we could have turned this script into a data pipeline that automatically triggers scraping tasks and transfers results to a database — everything in a cloud-based way.

A first step would be to create or identify an id that is truly unique for one ad. This would allow fetched ads to be re-collected and matched, for instance to run historical price analyses.

How this can be achieved and what technical steps are required to deploy a data pipeline in a cloud environment will be covered in the next article.

diagram showing ongoing process of data collection and presentation
Building data pipelines using AWS and Airflow (Illustration by author using Unsplash images)

References

[1]: Tony Paul. (2020). Is web scraping legal? A guide for 2021 https://www.linkedin.com/pulse/web-scraping-legal-guide-2021-tony-paul/?trk=read_related_article-card_title. Retrieved 10 May 2021.

[2]: W3schools.com. (2021). HTML class Attribute https://www.w3schools.com/html/html_classes.asp. Retrieved 10 May 2021.

[3]: GeeksforGeeks.org. (2021). Difference between an id and class in HTML? https://www.geeksforgeeks.org/difference-between-an-id-and-class-in-html/#:~:text=Difference%20between%20id%20and%20class,can%20apply%20to%20multiple%20elements. Retrieved 10 May 2021.

[4]: JonasCz. (2021). How to prevent web scraping https://www.w3schools.com/html/html_classes.asp. Retrieved 10 May 2021.

[5]: Edward Roberts. (2018). Is Web Scraping Illegal? Depends on What the Meaning of the Word Is https://www.w3schools.com/html/html_classes.asp. Retrieved 10 May 2021.
