7 Easy Steps for Creating Your Own Web Scraper Using Python

Extract web content efficiently

Dan Suciu
Better Programming


Manually extracting a large amount of data from a website can take a lot of time and effort. And as you know, time is money.

That’s where web scraping comes in handy by making the job simpler and faster. Making a basic scraper isn’t difficult, either.

So if you want to know more about web scraping and how to create your own version in Python, buckle up!

What Is Web Scraping?

Web scraping is an automated data extraction method used to collect unstructured information from websites and format it in the desired layout so the user can easily read it. There are different ways to do that. You can either use online services, APIs, or just do it yourself.
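
To make the idea concrete, here is a minimal sketch of the whole process using the requests library (pip install requests) alongside BeautifulSoup, which we’ll install properly later. It fetches a page and pulls one piece of structured data out of the unstructured HTML:

import requests
from bs4 import BeautifulSoup

# Fetch the raw, unstructured HTML of a page...
html = requests.get('https://example.com').text
# ...and parse it into a structure we can query.
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.get_text())  # extracts "Example Domain"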

Diagram showing how web scraping works (image by the author).

Before we get into the step-by-step guide to creating your own web scraper in Python, let’s take a look at how you can use it.

Why Is Web Scraping Useful?

As we mentioned above, web scraping is used to collect a large amount of information fast. How could it be useful? Well, a lot of businesses take advantage of this tool for the following reasons:

  • Gathering email addresses: Companies that use newsletters and email marketing to promote themselves need as many addresses as possible to reach their target audience. You can use a web scraper to download useful contact info from websites within your domain of interest. Hunter.io is a handy tool that does just that.
  • Pricing optimization: You can scrape to see how much your competitors charge for a product or service and easily keep tabs on how the market changes. Even if you’re just looking to buy something, data extraction tools help you find the best offer.
  • Research: Collecting data reports and statistics is crucial to completing high-quality research projects. With a web scraper, you waste less time manually copying large amounts of data on your own.
  • Social media: Scraping social media websites can help you determine what’s trending at present and see what methods may help you and your business stand out. It’s also a great way of monitoring what people are thinking and saying about your brand.
  • Testing: You can’t know for sure what your own website can handle or how it interacts with users without testing. By using a web scraping tool, you can send a large volume of requests to see if the site can handle it or use a proxy from a different location to check the response time (see the sketch after this list).
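
As a taste of that last use case, here’s a hypothetical load-check sketch using the requests library; the URL and request count are placeholders, and you should only point it at a site you own:

import time
import requests

URL = 'https://your-own-site.example'  # placeholder: only test sites you own
N_REQUESTS = 50                        # placeholder volume

for i in range(N_REQUESTS):
    start = time.time()
    response = requests.get(URL)
    print(f'Request {i + 1}: status {response.status_code} '
          f'in {time.time() - start:.3f}s')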

Why Use Python?

Python is a popular programming language because it’s easy to use and learn and good practice for beginners. Here are just some of the advantages that make Python an excellent option:

  • Easy-to-read syntax: Python has a clean syntax that’s often called “executable pseudocode.” It’s especially readable thanks to the indentations used to indicate blocks.
  • Easy to use: There are no semicolons (;) needed to end statements and no curly braces ({}) to delimit blocks. Again, indentation does the job, keeping the code less messy and more readable.
  • Community: The Python community is one big family, and it’s growing every day. If you get stuck with your code, you can always ask for help. You probably won’t be the first programmer to encounter the issue in question.
  • Rich library collections: Python has many useful libraries such as Selenium, BeautifulSoup, and pandas, which we will use later on for web scraping.
  • Dynamically typed: A variable’s type is determined only at runtime, so you don’t have to declare types up front, which saves us some precious time.
  • Less writing: Plenty of code doesn’t necessarily mean good code. In Python, small code fragments can do quite a lot of work, so you save time even while writing code (see the short example after this list).
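
To illustrate those last two points, the same name can be rebound to a different type, and a one-liner can do real work (the price strings below are made up for the example):

# Dynamic typing: the same variable can hold different types over time.
x = 42
x = 'forty-two'

# Conciseness: clean and convert a list of price strings in one line.
raw_prices = ['$120', '$95', '$140']
numeric = [float(p.strip('$')) for p in raw_prices]
print(numeric)  # [120.0, 95.0, 140.0]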

Create Your Own Web Scraper

Now you know why web scrapers and Python are cool. Next, we’ll go through the steps of creating our own web scraper.

1. Choose the page you want to scrape

In this example, we will scrape Footshop for some nice sneaker models and their prices. Then, we’ll store the data in CSV format for further use. We want to know details about Nike sneaker models on this website, so the URL we’ll be using for our scraper is https://www.footshop.eu/en/2311-nike-men-s-shoes.

2. Inspect the website code

Data is found in nested tags, so we’ll need to inspect the page and see which tag holds the information we need. To inspect a page, right-click on the element and select “Inspect.”

The browser’s developer tools panel will pop up, showing the page’s HTML.

Now, I know that it might seem a bit intimidating at first, but don’t worry. Navigating through a website’s code is a lot simpler than it seems — and it only gets easier with experience.

3. Find the data you want to extract

The data we wish to extract is nested in the highlighted <div> tag. We need each product’s name and price. When you expand the <div> tag, many more nested tags appear on the screen.

Notice how every tag has a “class.” In our case, to get the name of each product, we need to extract the information located in the <h4> tag with the class Product_name_3eWGG.

Depending on what you are looking for, the tag and class name may differ. You may search for links to different websites or even images.

4. Prepare the workspace

First, you need to download and install Python.

You can use whatever IDE suits you, but I recommend using PyCharm because it works like a charm!

After you’ve created a new project, you will need the following libraries:

  • Selenium: Used for web testing and automating browser activities. Here, it loads the page in a real browser, so any content rendered by JavaScript is available to us.
  • BeautifulSoup: Used for parsing HTML and XML documents.
  • pandas: Used for data manipulation and analysis. You can extract and store data in the format you desire.

You can install them by opening a terminal in your project and running this command:

python -m pip install selenium pandas beautifulsoup4

5. Write the code

Let’s import the libraries we installed a minute ago:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import pandas as pd

Now we need to configure the webdriver to use the Chrome browser by setting the path to chromedriver. It doesn’t matter where chromedriver is located as long as the path is correct. Don’t forget to add the name of the executable at the end, not just its location! (In Selenium 4, the path is wrapped in a Service object; since version 4.6, you can even call webdriver.Chrome() with no arguments and let Selenium Manager download the driver for you.)

driver = webdriver.Chrome(service=Service("/your/path/here/chromedriver"))

Declare the lists that will hold our data and point the driver at the URL of the website you wish to scrape:

models = []
prices = []
driver.get('https://www.footshop.eu/en/2311-nike-men-s-shoes')

Almost done!

We need to extract the information from the nested <div> tags. Load the rendered page into BeautifulSoup, find the tags with the class names we noted earlier, and append the results to the lists declared above. The sketch below uses the product-name class we found in step 3; the wrapper and price class names are assumptions, so verify them against the live page, as Footshop’s generated class suffixes change over time:
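
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')

# Each product card sits in a wrapper <div>. The wrapper and price class
# names below are assumptions -- inspect the live page to confirm them.
for product in soup.find_all('div', class_='Product_product_3GvFq'):
    name = product.find('h4', class_='Product_name_3eWGG')     # from step 3
    price = product.find('div', class_='Product_price_2Yz6c')  # assumed class
    if name and price:
        models.append(name.get_text(strip=True))
        prices.append(price.get_text(strip=True))

driver.quit()  # close the browser once the data is collected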

6. Run the code

To run the code, use this command (you’re basically telling Python to run the .py file where you wrote the code):

python main.py

7. Store your extracted data

You extracted the data, but what are you going to do with it? Storing it in a convenient format for further analysis is one solution. In this example, we’ll store it in CSV (comma-separated values) format, as it can be imported easily into spreadsheets and other tools:

df = pd.DataFrame({'Product Name': models, 'Price': prices})
df.to_csv('sneakers.csv', index=False, encoding='utf-8')

If we rerun the code, a file named sneakers.csv will be created next to the script. If you get a “Failed to read descriptor from node connection” error in the console, there is no need to panic. It is just a harmless warning from chromedriver.
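
Once the file exists, you can sanity-check it by loading it back with pandas:

df = pd.read_csv('sneakers.csv')
print(df.head())  # first few rows of Product Name and Price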

The extracted sneaker models and prices, viewed as a table.

And… we’re done!

I hope this article has helped you understand the basics of web scraping using Python.

Note that this method is handy and easy to use, but it’s not the most efficient: you can only scrape one webpage at a time, and you have to identify the nested tags manually.

But it’s a lot faster than copying everything by hand, especially if you want to scrape several similar pages. For example, if we want to check out Adidas shoes next, we only need to change a few lines of code, as in the sketch below.
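
A minimal sketch of that idea; the Adidas category path is a placeholder, so check the site’s navigation for the real URL:

# Hypothetical category URLs: verify these paths on the live site.
category_urls = [
    'https://www.footshop.eu/en/2311-nike-men-s-shoes',
    'https://www.footshop.eu/en/1234-adidas-men-s-shoes',  # placeholder path
]

for url in category_urls:
    driver.get(url)
    # ...run the same BeautifulSoup extraction as above for each page...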

If you wish to scrape en masse, you can find more advanced tools. For starters, check out what web scraping APIs can do. WebScraping API wrote a guide on choosing an API that also includes some recommendations.

Happy coding and scraping!
