Scraping Memes From Reddit With the Python Reddit API

Fun weekend project — writing code to scrape memes from r/ProgrammerHumor

asjad anis
Better Programming

Photo by Kon Karampelas on Unsplash

For a long time, I thought scraping data from the internet was too boring. Inspecting the webpage in DevTools and finding the DOM nodes you care about seemed like too much of a hassle.

Then, one day, I tried it with Beautiful Soup and was inspired by how easy it is to work with the parsed DOM and gather the data you want.

Since then, I have been exploring the world of scraping and recently came across PRAW, the Python Reddit API Wrapper, which makes it very easy to access Reddit data.

After exploring the package for a while, I wanted to do a fun little weekend project, and what’s better than writing code to scrape memes from r/ProgrammerHumor?

For this tutorial, we will need:

  • Python.
  • A Reddit account.
  • A client ID and client secret to access the Reddit API.
  • A user agent.
  • urllib (part of the Python standard library).
  • pandas.

Now that we know the requirements, let’s first create a Reddit app and get our client ID, client secret, and user agent.

Go to your Reddit app preferences and click “create app” or “create another app”, which takes you to the screen below. For the redirect URL, enter http://localhost:8080, as described in the documentation.

create-reddit-app

Once you have filled in the details, click “create app” and you will be taken to the screen below. Note down your client ID and client secret, as we will need them later for authentication.

Client-id, secret, and user agent

Now that we have created an app, let’s install the Python dependencies.

Open up a terminal/cmd and run:

pip install praw pandas

Now that we have our client ID, client secret, and user agent, we can get on with the code and start using the Reddit API.
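Here is a minimal sketch of how the initialization might look. It assumes the credentials are stored in environment variables named REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, and REDDIT_USER_AGENT. Those names, the fallback user agent string, and the helper functions load_credentials and make_reddit are my own choices for illustration, not something PRAW requires.

```python
import os


def load_credentials(env=None):
    """Collect the three values PRAW needs from environment variables.

    The variable names are an assumption made for this sketch.
    """
    if env is None:
        env = os.environ
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Placeholder user agent; replace it with a string describing your app.
        "user_agent": env.get(
            "REDDIT_USER_AGENT",
            "python:meme-scraper:v0.1 (by u/your_username)",
        ),
    }


def make_reddit(credentials):
    """Build a read-only praw.Reddit instance from the credentials dict."""
    # Imported inside the function so the credential helper above can be
    # used and checked even on a machine where PRAW is not installed yet.
    import praw

    return praw.Reddit(**credentials)
```

With the variables set, calling make_reddit(load_credentials()) should return a read-only Reddit instance. Keeping the secret out of the source file also makes it harder to commit it to a public repo by accident.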

PRAW initialization

The code above gives us a praw.Reddit instance, through which we can access subreddits, fetch posts from a specific subreddit, read comments, and so on.

Next, we will get posts from /r/ProgrammerHumor and loop through the posts, collecting the data we want and saving the image file.
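Fetching the posts could be sketched roughly as follows. The function name fetch_hot_posts and its default arguments are hypothetical; the reddit argument is expected to be a praw.Reddit instance, and subreddit() and hot() are the PRAW calls doing the actual work.

```python
def fetch_hot_posts(reddit, subreddit_name="ProgrammerHumor", limit=10):
    """Return up to `limit` hot submissions from the given subreddit."""
    subreddit = reddit.subreddit(subreddit_name)
    # hot() yields submissions lazily, so wrap it in list() to fetch
    # everything now and allow looping over the result more than once.
    return list(subreddit.hot(limit=limit))
```

Materializing the result as a list is a design choice for this sketch: it costs nothing noticeable for 10 posts and avoids surprises when the same posts are reused in later steps.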

The code above gets the subreddit r/ProgrammerHumor and fetches its hot posts, limited to 10.

Next, you can iterate through the posts and save the appropriate data. Look into PRAW’s documentation for a deeper look into all the available methods.
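The iteration might look something like this. Which fields to keep is up to you; I picked id, title, score, and url as an assumption for this sketch, and collect_post_data is a hypothetical helper name. All four are attributes PRAW exposes on a submission.

```python
def collect_post_data(posts):
    """Turn submissions into plain dictionaries, one per post."""
    rows = []
    for post in posts:
        rows.append(
            {
                "id": post.id,
                "title": post.title,
                "score": post.score,
                # For image posts, url usually points at the image itself.
                "url": post.url,
            }
        )
    return rows
```

Plain dictionaries are convenient here because they can be handed straight to pandas later on.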

Now that we have access to the URLs, we will download the memes to the local filesystem using urllib, Python’s standard-library package for fetching data across the internet.
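A minimal sketch of the download step, under a few assumptions of mine: the list of allowed extensions, the memes target folder, and the helper names has_allowed_extension and download_images are illustrative. The only library call that matters is urllib.request.urlretrieve.

```python
import os
import urllib.request
from urllib.parse import urlparse

# Assumed set of image extensions; adjust it to the formats you want.
ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def has_allowed_extension(url):
    """Check the path part of the URL, ignoring any query string."""
    path = urlparse(url).path
    return path.lower().endswith(ALLOWED_EXTENSIONS)


def download_images(urls, target_dir="memes"):
    """Download every URL with an allowed extension into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in urls:
        if not has_allowed_extension(url):
            continue  # skip links that are not direct image files
        filename = os.path.basename(urlparse(url).path)
        destination = os.path.join(target_dir, filename)
        urllib.request.urlretrieve(url, destination)
        saved.append(destination)
    return saved
```

Checking the parsed path rather than the raw string means a URL such as https://i.redd.it/abc.jpg?width=640 still counts as an image. Note that this sketch does no error handling, so a failed download will raise an exception and stop the loop.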

Downloading to fs

We iterate through the URLs and check whether each one ends with one of the allowed extensions. If it does, we download it to the filesystem using urllib.request.urlretrieve (in Python 3, urlretrieve lives in the urllib.request module). Finally, we save the data to a CSV file using pandas.
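The CSV export could be sketched like this. It assumes the post data is a list of dictionaries with the keys id, title, score, and url; those column names, the memes.csv filename, and the export_to_csv helper are assumptions for illustration.

```python
import pandas as pd


def export_to_csv(rows, csv_path="memes.csv"):
    """Write the collected post data to a CSV file and return the DataFrame."""
    frame = pd.DataFrame(rows, columns=["id", "title", "score", "url"])
    # index=False leaves out pandas' row numbers, which we do not need.
    frame.to_csv(csv_path, index=False)
    return frame
```

Passing the columns explicitly keeps the column order stable and still produces a header row when the list of posts is empty.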

Exporting to CSV

Here’s the final script in action:

The full code is available in my GitHub repo.
