Better Programming

Advice for programmers.

Web Scraping With Python — Get the Top 100 Most-Streamed Songs on Spotify

Vitor Xavier
Published in Better Programming
6 min read · Sep 13, 2022

Hello! First I’d like to suggest you put on your headphones and listen to your favorite album while we go through this project, because music makes everything better.

Today we are going to scrape a dataset from Wikipedia: the Top 100 most-streamed songs on Spotify in 2022, along with each song's ranking, number of streams (in billions), artist, and publish date.

Let’s import the libraries we’re going to work with:

from bs4 import BeautifulSoup
import requests

And now, connect to the website and pull in data:

URL = 'https://en.wikipedia.org/wiki/List_of_most-streamed_songs_on_Spotify'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}
page = requests.get(URL, headers=headers)

You can look up your own browser’s User-Agent string online (just search for “what is my user agent”).

Now that we’ve connected to the website, let’s use Beautiful Soup to parse the page content, and also re-indent the HTML into a ‘pretty’, readable layout using .prettify():

soup1 = BeautifulSoup(page.content, 'html.parser')
soup2 = BeautifulSoup(soup1.prettify(), 'html.parser')

We can see the difference between soup1 and soup2:

soup2 is structured in a much more readable way than soup1, because the .prettify() method re-indents the markup to make it easier to analyze.
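A minimal sketch of what .prettify() does, using an invented snippet rather than the Wikipedia page:

```python
from bs4 import BeautifulSoup

# Invented markup, not the Wikipedia page: prettify() re-indents the
# document so each tag sits on its own line.
html = '<table><tr><th>Rank</th><td>Song</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
```

Each nesting level is indented one space, which is why the scraped text later contains extra newlines and spaces that need cleaning.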

Let’s take a look at the data on Wikipedia:

So, the first thing we want is the Rank values. To find them, I pressed F12 to open the browser DevTools, looked at the page’s HTML structure, and selected the inspect tool:

And then inspect the ranking element:

Looking closely at the HTML, it’s possible to see that each ranking value sits inside a th element with the attribute style="text-align:center;".
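A quick sketch (with invented markup) of how .find_all() matches on both the tag name and the attribute value, which is how the ranking cells are selected below:

```python
from bs4 import BeautifulSoup

# Invented markup: only the th with the matching style attribute is returned.
html = '<th style="text-align:center;">1</th><th>Other heading</th>'
soup = BeautifulSoup(html, 'html.parser')
cells = soup.find_all('th', attrs={'style': 'text-align:center;'})
print([c.text for c in cells])  # → ['1']
```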

So, let’s build a list of all 100 ranking values, using .find_all() to search for the tag and attribute mentioned above:

ranking = []
for i in range(0, 100):
    rank = soup2.find_all('th', attrs={'style': 'text-align:center;'})[i].text
    rank = rank.replace('\n', '')
    rank = rank.strip()
    ranking.append(rank)

Notice that I also did a little cleaning: removing the ‘\n’ characters with .replace() and stripping whitespace from the beginning and end of each string with .strip(). If this were not done, the result would look like this:

So, after the cleaning we have this ranking list:
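Those two cleaning calls can be sketched on an invented sample string (not actual page output):

```python
# Invented example of what a prettified table cell's text can look like:
# a newline-and-space-padded value.
raw = '\n   1\n  '
cleaned = raw.replace('\n', '').strip()  # drop newlines, then trim the ends
print(repr(cleaned))  # → '1'
```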

Nice! Now that the ranking list is done, we’re going to look at the next variable: songs.

I tried the same approach I used for the rankings on the songs, but it didn’t work. The method I found most effective was to check the interval between the songs’ entries in the table:

So, after seeing that the song entries appear every 5 td elements, I created the song list:

song_list = []
for i in range(0, 500, 5):
    song = soup2.find_all('td')[i].text
    song = song.replace('\n', '')
    song = song.replace('"', '')
    song = song.strip()
    song = ' '.join(song.split())
    song_list.append(song)

Notice that if we did not use this combination of split() and join(), the song ‘I Took a Pill in Ibiza (Seeb Remix)’ would contain lots of duplicated internal spaces, like this:

So, after the cleaning, we have this song_list:
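A quick sketch of why the split()/join() combination collapses the duplicated spaces (the messy spacing below is invented, not the actual page output):

```python
# split() with no arguments breaks on any run of whitespace (spaces,
# newlines, tabs); ' '.join() then reassembles the pieces with single spaces.
messy = 'I Took a Pill in   Ibiza\n  (Seeb   Remix)'
tidy = ' '.join(messy.split())
print(tidy)  # → I Took a Pill in Ibiza (Seeb Remix)
```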

Now let’s get the number of streams, in billions:

streams_list = []
for i in range(1, 500, 5):
    streams = soup2.find_all('td')[i].text
    streams = streams.replace('\n', '')
    streams = streams.strip()
    streams_list.append(streams)

Notice that the stride pattern continues (every 5 td elements, now starting at offset 1). For streams_list we have:

After streams, we need to get the artists:

artist_list = []
for i in range(2, 500, 5):
    artist = soup2.find_all('td')[i].text
    artist = artist.replace('\n', '')
    artist = artist.strip()
    artist = ' '.join(artist.split())
    artist_list.append(artist)

Here we also had to use the split()/join() combination; otherwise, artist_list would look like this:

And, after all the cleaning, the outcome for artist_list is:

And, to finish, let’s get the publish date of the songs:

date_list = []
for i in range(3, 500, 5):
    date = soup2.find_all('td')[i].text
    date = date.replace('\n', '')
    date = date.strip()
    date_list.append(date)

The result for date_list is:

That’s good! All we have to do now is put these lists into a dictionary so we can turn the data into a dataframe:

data = {
    'Rank': ranking,
    'Song': song_list,
    'Streams': streams_list,
    'Artist': artist_list,
    'Date': date_list
}

Now, let’s import pandas and turn the dictionary into a dataframe:

import pandas as pd
df = pd.DataFrame(data)

That’s nice! Now we have our dataset:

We can also export this dataframe to a CSV file, and wrap the scraping in a function, so that we can refresh the data (CSV) from the website every time we run it.
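A sketch of what that function could look like. I’ve separated the parsing from the network fetch so it can be tested without hitting Wikipedia; the name parse_top_streamed is my own, and for brevity this version skips the th-based Rank column from the steps above:

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_top_streamed(html):
    """Turn the page's td cells into a dataframe using the stride-of-5
    pattern from the loops above. Parsing only; fetch the HTML separately
    with requests.get(URL, headers=headers)."""
    cells = BeautifulSoup(html, 'html.parser').find_all('td')

    def clean(cell):
        # Collapse newlines and repeated spaces, as in the loops above.
        return ' '.join(cell.text.split())

    return pd.DataFrame({
        'Song':    [clean(cells[i]) for i in range(0, len(cells), 5)],
        'Streams': [clean(cells[i]) for i in range(1, len(cells), 5)],
        'Artist':  [clean(cells[i]) for i in range(2, len(cells), 5)],
        'Date':    [clean(cells[i]) for i in range(3, len(cells), 5)],
    })
```

To refresh the CSV, you would fetch the page with requests, call parse_top_streamed(page.text), and then export with to_csv as shown below.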

To export to CSV, all we have to do is:

df.to_csv('Top100-MostStreamedSongs-Spotify.csv')

And now we are ready to analyze the dataset and take some insights from it.

But let’s save the analysis for another time, since the main goal of this project is to show how to do web scraping.

It’s important to know that if the data you want from a website isn’t displayed on just one page, you can also loop over the URLs to collect all of it (always look for the pattern in the page URLs).
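A hypothetical sketch of that looping pattern; example.com and the page query parameter are invented placeholders, not a real endpoint:

```python
# Build one URL per page by following the site's URL pattern, then scrape
# each page in turn. The domain and 'page' parameter are invented examples.
base = 'https://example.com/charts?page={}'
urls = [base.format(n) for n in range(1, 4)]
# for url in urls:
#     page = requests.get(url, headers=headers)
#     ...parse each page with BeautifulSoup as shown earlier...
print(urls)
```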

If you want to access the full code that we used here, you can check it at my GitHub.

That’s all folks!

Want to connect? You can also find me on LinkedIn and, of course, on Spotify :).
