Scraping Reddit Search Results with Python
Today I’m going to walk you through the process of scraping search results from Reddit using Python. We’re going to write a simple program that performs a keyword search and extracts useful information from the search results. Then we’re going to improve our program’s performance by taking advantage of parallel processing.
Tools
We’ll be using the following Python 3 libraries to make our job easier:
- Beautiful Soup 4 to extract data from the HTML,
- Requests to access the HTML content,
- LXML as the HTML parser,
- and Multiprocessing to speed things up.
multiprocessing comes with Python 3 by default, but you may need to install the others manually using a package manager such as pip:
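```
# multiprocessing is part of the standard library, so only these three need installing
pip install beautifulsoup4 requests lxml
```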
Old Reddit
Before we begin, I want to point out that we’ll be scraping the old Reddit, not the new one. That’s because the new site loads more posts automatically as you scroll down.
The problem is that it’s not possible to simulate this scroll-down action using a simple tool like Requests; we’d need something like Selenium for that. As a workaround, we’re going to use the old site, which is easier to crawl using the links located on the navigation panel.
Scraper v1 - Program Arguments
Let’s start by making our program accept some arguments that will allow us to customize our search. Here are some useful parameters:
- keyword to search
- subreddit restriction (optional)
- date restriction (optional)
Let’s say we want to search for the keyword “web scraping”. In this case, the URL we want to visit is:
https://old.reddit.com/search?q=%22web+scraping%22
If we want to limit our search to a particular subreddit such as “r/Python”, then our URL becomes:
https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on
Finally, if we also want to restrict the search to posts submitted in the last year, the URL will look like one of the following:
https://old.reddit.com/search?q=%22web+scraping%22&t=year
https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on&t=year
The following is the initial version of our program that builds and prints the appropriate URL according to the program arguments:
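A minimal sketch of such a script might look like this, assuming argparse for the arguments (the option names and the build_url helper are illustrative choices, not necessarily the original code):

```python
import argparse
from urllib.parse import quote_plus

def build_url(keyword, subreddit=None, timespan=None):
    """Build the old-Reddit search URL from the program arguments."""
    base = "https://old.reddit.com"
    if subreddit:
        base += "/r/" + subreddit
    url = base + "/search?q=%22" + quote_plus(keyword) + "%22"
    if subreddit:
        url += "&restrict_sr=on"
    if timespan:
        url += "&t=" + timespan
    return url

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Search old Reddit for a keyword")
    parser.add_argument("keyword", help="keyword to search for")
    parser.add_argument("-s", "--subreddit", help="restrict the search to this subreddit")
    parser.add_argument("-t", "--time", choices=["hour", "day", "week", "month", "year", "all"],
                        help="restrict the search to this time period")
    args = parser.parse_args()
    print(build_url(args.keyword, args.subreddit, args.time))
```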
Now we can run our program as follows:
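For example, assuming the sketch above is saved as scraper.py (a hypothetical file name):

```
$ python scraper.py "web scraping" -s Python -t year
https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on&t=year
```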
Scraper v2 - Collecting Search Results
If you take a look at the page source, you’ll notice that all the post results are stored in <div>s with a search-result-link class. Also note that unless it’s the last page, there will be an <a> tag with a rel attribute equal to nofollow next. That’s how we’ll know when to stop advancing to the next page.
Therefore, using the URL we built from the program arguments, we can collect the post sections from all pages with a simple function that we’ll call getSearchResults. Here’s the second version of our program:
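A sketch of the getSearchResults part, under the assumptions above (Requests plus Beautiful Soup with the lxml parser; the custom User-Agent header is an extra precaution, not something from the original listing):

```python
import requests
from bs4 import BeautifulSoup

# Old Reddit tends to rate-limit the default Requests User-Agent,
# so we identify ourselves with a custom one.
HEADERS = {"User-Agent": "reddit-search-scraper (tutorial example)"}

def getSearchResults(url):
    """Collect the post <div>s from every page of the search results."""
    posts = []
    while url:
        soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "lxml")
        # Every post on the page is a <div> with the search-result-link class.
        posts.extend(soup.find_all("div", class_="search-result-link"))
        # The "next" button is an <a> tag with rel="nofollow next";
        # it disappears on the last page, which ends the loop.
        next_link = soup.find("a", rel="nofollow next")
        url = next_link["href"] if next_link else None
    return posts
```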
Scraper v3 - Parsing Post Data
Now that we have a bunch of posts in the form of a list of bs4.element.Tag objects, we can extract useful information by parsing each element of this list further. We can extract information such as:
Information | Source |
---|---|
date | datetime attribute of the <time> tag |
title | <a> tag with search-title class |
score | <span> tag with search-score class |
author | <a> tag with author class |
subreddit | <a> tag with search-subreddit-link class |
URL | href attribute of the <a> tag with search-comments class |
# of comments | text field of the <a> tag with search-comments class |
We’re also going to create a container object to store the extracted data and save it as a JSON file (product.json). We’ll load this file at the beginning of our program, since it may already contain data from other keyword searches. When we’re done scraping the current keyword, we’ll append the new content to the existing data. Here’s the third version of our program:
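A sketch of the new parts, where the dictionary layout of product.json and the helper names loadProduct/saveProduct are illustrative choices:

```python
import json
import os

def parsePosts(posts, product):
    """Extract the fields listed in the table above from each post <div>."""
    for post in posts:
        comments_link = post.find("a", class_="search-comments")
        post_data = {
            "date": post.find("time")["datetime"],
            "title": post.find("a", class_="search-title").text,
            "score": post.find("span", class_="search-score").text,
            "author": post.find("a", class_="author").text,
            "subreddit": post.find("a", class_="search-subreddit-link").text,
            "url": comments_link["href"],
            "num_comments": comments_link.text,
        }  # deleted or unusual posts may lack some tags; add None checks if needed
        product["posts"].append(post_data)

def loadProduct(path="product.json"):
    """Load previously scraped data if the file exists, otherwise start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"posts": []}

def saveProduct(product, path="product.json"):
    """Write the accumulated data back to product.json."""
    with open(path, "w") as f:
        json.dump(product, f, indent=2)

# Usage, reusing build_url and getSearchResults from the earlier sketches:
#   product = loadProduct()
#   parsePosts(getSearchResults(url), product)
#   saveProduct(product)
```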
Now we can search for different keywords by running our program multiple times. The extracted data will be appended to the product.json file after each execution.
Scraper v4 - Scraping Comments
So far we’ve been able to scrape information from the post results easily, since this information is available directly in the results pages. But we might also want to scrape comment information, which cannot be accessed from the results page. We must instead parse the comment page of each individual post, using the URL that we previously extracted in our parsePosts function.
If you take a close look at the HTML source of a comment page such as this one, you’ll see that the comments are located inside a <div> with a sitetable nestedlisting class. Each comment inside this <div> is stored in another <div> with a data-type attribute equal to comment. From there, we can obtain some useful information such as:
Information | Source |
---|---|
# of replies | data-replies attribute |
author | <a> tag with author class inside the <p> tag with tagline class |
date | datetime attribute in the <time> tag inside the <p> tag with tagline class |
comment ID | name attribute in the <a> tag inside the <p> tag with parent class |
parent ID | <a> tag with the data-event-action attribute equal to parent |
text | text field of the <div> tag with md class |
score | text field of the <span> tag with score unvoted class |
Let’s create a new function called parseComments and call it from our parsePosts function so that we can get the comment data along with the post data:
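A sketch of parseComments following the table above (the None checks and field names are illustrative):

```python
def parseComments(url, post):
    """Fetch a post's comment page and attach the comment data to the post."""
    soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "lxml")
    listing = soup.find("div", class_="sitetable nestedlisting")
    post["comments"] = []
    if listing is None:
        return
    for comment in listing.find_all("div", attrs={"data-type": "comment"}):
        tagline = comment.find("p", class_="tagline")
        author = tagline.find("a", class_="author") if tagline else None
        date = tagline.find("time") if tagline else None
        id_anchor = comment.find("p", class_="parent")
        id_anchor = id_anchor.find("a") if id_anchor else None
        parent_link = comment.find("a", attrs={"data-event-action": "parent"})
        score = comment.find("span", class_="score unvoted")
        body = comment.find("div", class_="md")
        post["comments"].append({
            "replies": comment.get("data-replies"),
            "author": author.text if author else None,
            "date": date["datetime"] if date else None,
            "id": id_anchor["name"] if id_anchor else None,
            "parent_id": parent_link["href"] if parent_link else None,  # usually an anchor like "#t1_..."
            "text": body.text.strip() if body else "",
            "score": score.text if score else None,  # hidden or deleted comments may have no score
        })

# In parsePosts, the call goes right after each post dictionary is built:
#   parseComments(post_data["url"], post_data)
```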
Scraper v5 - Multiprocessing
Our program is functionally complete at this point. However, it runs a bit slowly because all the work is done serially by a single process. We can improve the performance by distributing the posts across multiple processes using the Process and Manager objects from the multiprocessing library.
The first thing we need to do is rename the parsePosts function and make it handle only a single post. To do that, we simply remove the for statement. We also need to change the function parameters a little: instead of passing our original product object, we’ll pass in a list object to which the current process appends its results.
results is actually a multiprocessing.managers.ListProxy object that we can use to accumulate the output generated by all the processes. We’ll later convert it to a regular list and save it in our product. Our main script will now look as follows:
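A sketch of how it might be wired together, with one deviation made here for portability: each post’s <div> is passed to the worker as an HTML string and re-parsed, so the arguments stay picklable regardless of the process start method:

```python
from multiprocessing import Process, Manager

def parsePost(post_html, results):
    """Handle a single post (its <div> passed as an HTML string) and append the data to results."""
    post = BeautifulSoup(post_html, "lxml")
    comments_link = post.find("a", class_="search-comments")
    post_data = {
        "date": post.find("time")["datetime"],
        "title": post.find("a", class_="search-title").text,
        "score": post.find("span", class_="search-score").text,
        "author": post.find("a", class_="author").text,
        "subreddit": post.find("a", class_="search-subreddit-link").text,
        "url": comments_link["href"],
        "num_comments": comments_link.text,
    }
    parseComments(post_data["url"], post_data)
    results.append(post_data)

if __name__ == "__main__":
    product = loadProduct()
    url = build_url("web scraping", "Python", "year")  # or built from the program arguments as before
    posts = getSearchResults(url)

    manager = Manager()
    results = manager.list()  # shared ListProxy that accumulates every process's output

    # One process per post; for very large searches you may want to batch
    # these or use a multiprocessing.Pool instead.
    processes = [Process(target=parsePost, args=(str(post), results)) for post in posts]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    product["posts"].extend(list(results))  # convert the proxy back to a regular list
    saveProduct(product)
```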
This simple technique alone greatly speeds up the program. For instance, when I perform a search involving 163 posts on my machine, the serial version takes 150 seconds to execute, corresponding to approximately 1 post per second. The parallel version, on the other hand, takes only 15 seconds (~10 posts per second), which is 10x faster.
You can check out the complete source code on Github. Also, make sure to subscribe to get updates on my future articles.