How to Scrape Wikipedia Articles with Python
Web scraping is the process of programmatically retrieving and parsing data from websites. It's a powerful technique that enables you to extract information from online sources and use it for a wide variety of applications, such as data mining, natural language processing, price monitoring, and more.
In this guide, we'll walk through how to scrape Wikipedia pages using Python. By the end, you'll be able to retrieve the title and body content of articles, find and follow links to related pages, and crawl Wikipedia to discover new topics and connections.
Is it legal to scrape Wikipedia?
Before we dive into the technical details, let's address an important question: is web scraping allowed?
The answer depends on the website and how you use the scraped data. Some sites explicitly prohibit scraping in their terms of service, while others may allow it with certain restrictions.
As a general rule, you should respect the website's policies and avoid scraping any content behind login walls or that isn't publicly accessible. When in doubt, consult the site's robots.txt file, which specifies rules for web crawlers, or contact the website owner directly to ask for permission.
Fortunately, Wikipedia is licensed under Creative Commons and allows scraping and reusing its content for free, as long as you properly attribute the source. The Wikimedia Foundation even offers a web API and downloadable database dumps to make it easier to access their data.
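For instance, the Foundation's MediaWiki API can return a plain-text introduction for an article without any HTML parsing at all. Here is a minimal sketch using the Requests library that we install later in this guide; the parameters come from the API's TextExtracts module:
import requests

# Ask the MediaWiki API for the plain-text intro of an article.
api_url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "exintro": 1,      # only the section before the first heading
    "explaintext": 1,  # plain text instead of HTML
    "titles": "Web scraping",
}
data = requests.get(api_url, params=params).json()
page = next(iter(data["query"]["pages"].values()))
print(page["extract"])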
However, they do request that you avoid overly aggressive crawling that could strain Wikipedia's servers. You should limit your request rate, cache pages you've already scraped, and avoid hammering the site with too many concurrent connections.
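A simple way to be polite is to send a descriptive User-Agent header that identifies your scraper and to pause between requests. Here is a sketch; the scraper name and contact address are placeholders you should replace with your own:
import time
import requests

# Reuse one session and identify the scraper (placeholder name and contact).
session = requests.Session()
session.headers["User-Agent"] = "MyWikiScraper/0.1 (contact: you@example.com)"

urls = [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://en.wikipedia.org/wiki/Data_mining",
]
for url in urls:
    response = session.get(url)
    print(url, response.status_code)
    time.sleep(1)  # pause between requests instead of hammering the servers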
Overview of the web scraping process
At a high level, web scraping in Python involves the following steps:
- Send an HTTP request to the URL of the web page you want to scrape. This retrieves the HTML content of the page.
- Parse the HTML to extract the data you're interested in. This typically uses a library like Beautiful Soup to navigate and search the HTML tree.
- Store the extracted data, whether that's writing it to a file, saving it to a database, or using it for further processing.
- Optionally, find links to other pages and repeat the process to scrape those as well. This allows you to traverse multiple pages and websites.
Let's see how to implement these steps in Python to scrape Wikipedia.
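As a quick preview, here is a bare-bones sketch that maps directly onto the four steps above; the output file name is just an illustration, and each step is expanded in the sections that follow:
import requests
from bs4 import BeautifulSoup

# 1. Send an HTTP request for the page.
response = requests.get("https://en.wikipedia.org/wiki/Web_scraping")

# 2. Parse the HTML.
soup = BeautifulSoup(response.text, "html.parser")

# 3. Store something we extracted (here, the article title).
with open("titles.txt", "a", encoding="utf-8") as f:
    f.write(soup.find(id="firstHeading").text + "\n")

# 4. Collect links to follow later.
links = [a["href"] for a in soup.find_all("a", href=True)]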
Scraping a Wikipedia article
To retrieve the contents of a web page in Python, we can use the popular Requests library. First install it via pip:
pip install requests
Then import the library and make a GET request to the Wikipedia article you want to scrape:
import requests
url = "https://en.wikipedia.org/wiki/Web_scraping"
response = requests.get(url)
print(response.status_code)
This sends an HTTP request to the specified URL and retrieves the HTML content of the page. The status_code attribute of the response tells us whether the request was successful. A status code of 200 indicates success, while 4xx or 5xx codes indicate client or server errors.
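If you would rather get an exception than check the code by hand, Requests can raise one for you; a small sketch:
response = requests.get(url)
response.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
html = response.text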
Next, we need to parse the HTML to extract the data we want. For this, we'll use the Beautiful Soup library, which provides an intuitive interface for navigating and searching the HTML tree. Install it with:
pip install beautifulsoup4
Then pass the HTML text to the BeautifulSoup constructor to parse it:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
The second argument specifies which underlying parsing library to use. The default is Python's built-in HTML parser, which works fine for most cases.
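If you need faster parsing or more forgiving handling of malformed markup, the third-party lxml parser is a drop-in alternative. Install it with:
pip install lxml
and pass "lxml" as the second argument instead:
soup = BeautifulSoup(response.text, "lxml")  # same interface, different underlying parser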
Now we have a BeautifulSoup object that represents the parsed HTML. We can use its methods to locate and extract specific elements on the page.
For example, to scrape the title of the article, we can search for the h1 tag with id "firstHeading":
title = soup.find(id="firstHeading").text
print(title)
This finds the first element matching that criterion and extracts its text content. Running this code should print out "Web scraping", the title of our example article.
We can extract other information, like the page summary, using similar techniques. The summary is contained in a div with class "mw-parser-output":
summary = soup.find("div", class_="mw-parser-output").find("p").text
print(summary)
This locates the first p tag inside the div and extracts its text. The result is a brief introduction to the topic of web scraping.
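One caveat: depending on the article, the first p inside that div may be empty or contain only non-prose markup. A slightly more defensive sketch walks the paragraphs until it finds one with actual text:
content = soup.find("div", class_="mw-parser-output")
summary = ""
for paragraph in content.find_all("p"):
    text = paragraph.get_text(strip=True)
    if text:  # skip empty or whitespace-only paragraphs
        summary = text
        break
print(summary)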
Following links to other articles
One of the most powerful aspects of web scraping is the ability to automatically discover and follow links to new pages. This allows you to traverse a website and find information that may not be immediately apparent.
On Wikipedia, each article contains many links to related topics and concepts. By extracting these links and recursively scraping the linked pages, we can explore the vast web of knowledge on Wikipedia in a programmatic way.
To find all the links in the article, we can use Beautiful Soup‘s find_all()
method with the a
tag:
links = soup.find("div", id="mw-content-text").find_all("a")
This locates all the a elements within the main content div. However, not all of these are links to other Wikipedia articles. Some may be links to external websites, citations, or other types of pages.
To filter for internal Wikipedia links, we can check if the href attribute starts with "/wiki/":
wiki_links = [link for link in links if link.get("href", "").startswith("/wiki/")]
This uses a list comprehension to extract only the links whose URL path begins with "/wiki/", indicating an internal link to another article. Using get() instead of indexing means anchor tags without an href attribute are simply skipped rather than raising a KeyError.
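Note also that some /wiki/ paths point to non-article namespaces such as File: or Category: pages. A stricter filter (a sketch; the colon check is a rough heuristic that also drops the few article titles containing a colon) guards against those:
wiki_links = [
    link for link in links
    if link.get("href", "").startswith("/wiki/")
    and ":" not in link["href"]  # rough filter for File:, Category:, Help:, etc.
]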
We can then choose a random link to follow and repeat the scraping process on that page:
import random

if wiki_links:
    random_link = random.choice(wiki_links)
    next_url = "https://en.wikipedia.org" + random_link["href"]
    print("Scraping", next_url)
    scrape_article(next_url)
else:
    print("No more links found")
Here we randomly select a link from the list, construct the full URL by appending the relative path to the Wikipedia base URL, and then call a scrape_article() function to perform the same scraping steps on the new page.
By recursively following links in this manner, we can traverse a large portion of Wikipedia and discover new and unexpected connections between topics. However, be aware that Wikipedia is a vast site with millions of pages, so a naive recursive approach may quickly overwhelm your computer's memory or exceed Wikipedia's request limits.
In practice, you'll want to add some constraints to your scraper, such as a maximum depth limit, a visited page cache to avoid redundant requests, and a rate limiter to throttle your requests and be a good citizen.
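As a sketch of those constraints, a small helper can combine the visited cache and the rate limit (the one-second delay and the helper name polite_get are illustrative choices, not values Wikipedia prescribes):
import time
import requests

visited = set()  # URLs we have already fetched

def polite_get(url, delay=1.0):
    # Fetch each URL at most once, pausing between requests to limit load.
    if url in visited:
        return None
    visited.add(url)
    time.sleep(delay)
    return requests.get(url)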
Putting it all together
Here's the complete code for our Wikipedia scraper, with some additional improvements:
import requests
from bs4 import BeautifulSoup
import random

def scrape_article(url, depth=0, max_depth=2):
    # Stop recursing once we've gone max_depth links deep.
    if depth > max_depth:
        return

    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    title = soup.find(id="firstHeading").text
    print("Title:", title)

    summary = soup.find("div", class_="mw-parser-output").find("p").text
    print("Summary:", summary)

    content_div = soup.find("div", id="mw-content-text")
    links = content_div.find_all("a")
    # Keep only internal article links; get() skips <a> tags without an href.
    wiki_links = [link for link in links if link.get("href", "").startswith("/wiki/")]

    if wiki_links:
        random_link = random.choice(wiki_links)
        next_url = "https://en.wikipedia.org" + random_link["href"]
        print("Scraping", next_url)
        scrape_article(next_url, depth + 1, max_depth)
    else:
        print("No more links found")

scrape_article("https://en.wikipedia.org/wiki/Web_scraping", max_depth=2)
This code defines a scrape_article() function that retrieves the HTML of a Wikipedia page, extracts the title, summary, and internal links, prints them out, and then recursively calls itself on a randomly chosen link.
The depth and max_depth parameters keep track of how many levels deep the scraper has gone and limit it to a maximum depth to avoid runaway recursion.
When run, this script will start at the "Web scraping" article, print its title and summary, and then follow a random link to another article. This process repeats until reaching the maximum depth or encountering a page with no more internal links.
Of course, this is just a simple example to illustrate the basic concepts. In a real-world scraping project, you'd likely want to do more sophisticated processing of the extracted data, such as storing it in a structured format like JSON or a database, or using it for analysis or visualization.
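For example, rather than printing each article, the scraper could collect a dictionary per page and write the whole list out as JSON at the end; a sketch with a placeholder record and an arbitrary file name:
import json

# One record per scraped article (the values here are placeholders).
articles = [
    {"url": "https://en.wikipedia.org/wiki/Web_scraping",
     "title": "Web scraping",
     "summary": "First paragraph of the article goes here."},
]

with open("articles.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, ensure_ascii=False, indent=2)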
You may also need to handle more complex cases, like pages that load content dynamically with JavaScript, or websites that attempt to block scraping by detecting and banning suspicious traffic. In these cases, you might need to use a headless browser like Selenium or Puppeteer to simulate real user interactions, or distribute your scraping requests across multiple IP addresses to avoid detection.
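As a minimal sketch of the headless-browser approach with Selenium (this assumes Selenium 4, whose recent releases can locate a locally installed Chrome and its driver automatically):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

driver.get("https://en.wikipedia.org/wiki/Web_scraping")
html = driver.page_source  # HTML after any JavaScript has run
driver.quit()

# The rendered HTML can then be parsed with Beautiful Soup as before.
soup = BeautifulSoup(html, "html.parser")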
Conclusion
Web scraping is a powerful technique for extracting data from websites, and Python provides a rich ecosystem of libraries for making HTTP requests, parsing HTML and XML, and manipulating data.
In this guide, we've seen how to use the Requests and Beautiful Soup libraries to scrape Wikipedia articles, extract key information like titles and summaries, and recursively follow links to explore the site's content.
Of course, this just scratches the surface of what's possible with web scraping. With a bit more code, you can build sophisticated scrapers that can navigate complex websites, handle authentication and sessions, and store scraped data in databases or files for further analysis.
Some key takeaways and best practices to remember:
- Always respect the website's terms of service and robots.txt file, and don't scrape any content behind login walls or that isn't publicly accessible.
- Be gentle with your request rate and avoid hammering the site with too many concurrent connections. Use caching and rate limiting to minimize your impact.
- Structure your code modularly and use functions to encapsulate different scraping tasks. This will make your scraper more maintainable and extensible.
- Handle errors and edge cases gracefully, and log any issues for debugging. Web scraping can be brittle, so it's important to anticipate and handle failures.
- Consider using a headless browser or a third-party scraping service for more complex websites that heavily use JavaScript or CAPTCHAs.
With these tools and techniques, you're well on your way to becoming a web scraping pro! Happy scraping!