- Why Extract Data from Wikipedia?
- Prerequisites
- Method 1: Using the wikipedia Library for Basic Text Content
- Method 2: Extracting Structured Data with BeautifulSoup
- Automating Wikipedia Data Extraction for Multiple Entries
- Applications of Structured Wikipedia Data
- Best Practices for Wikipedia Scraping
- Alternatives: Wikipedia APIs and Wikidata
- Conclusion
- Frequently Asked Questions (FAQs)
Wikipedia holds a wealth of information. Extracting properly organized data from it can simplify and speed up your research, NLP training, or content scraping workflows. However, the site’s fast-changing HTML can be hard to work with directly.
If you’re looking to build real-world skills in data extraction and automation, start by mastering Python itself. Enroll in this Python language course from Great Learning to strengthen your foundation in Python programming, data structures, and practical projects.
In this guide, you’ll learn how to get structured data from Wikipedia in Python with the help of the wikipedia, BeautifulSoup, and pandas libraries. We’ll walk through practical examples, from fetching article content to parsing infoboxes and tables.
Why Extract Data from Wikipedia?
Wikipedia offers:
- Rich encyclopedic content across domains
- Structured metadata via infoboxes
- Up-to-date information maintained by a global community
- Open access under a Creative Commons license
These features make it an ideal data source for NLP, machine learning (ML), and data visualization projects.
Prerequisites
To follow along, ensure you have:
- Python 3.x installed
- Basic familiarity with web scraping and data structures
- The following libraries installed:
pip install wikipedia beautifulsoup4 requests pandas
If you’re new to scraping or want a refresher on the basics, this Python web scraping guide covers how to work with HTML, requests, and extraction tools in a beginner-friendly way.
Method 1: Using the wikipedia Library for Basic Text Content
The wikipedia library provides a simple API for fetching article summaries and page content.
Example: Fetching a Page Summary
import wikipedia
# Set language (optional)
wikipedia.set_lang("en")
# Fetch summary
summary = wikipedia.summary("Machine learning")
print(summary)
Fetch Full Page Content
page = wikipedia.page("Machine learning")
print(page.content)
This gives you raw text, but not structured data like infoboxes or tables.
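Before moving on to HTML parsing, note that the page object does expose some structured metadata as well. A quick sketch using attributes provided by the wikipedia library, assuming the page object from above:
print(page.title)           # canonical article title
print(page.url)             # full article URL
print(page.links[:10])      # titles of articles linked from this page
print(page.categories[:5])  # categories the article belongs to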
Method 2: Extracting Structured Data with BeautifulSoup
To parse HTML for structured elements like infoboxes, tables, or categories, we use requests and BeautifulSoup.
Step 1: Fetch HTML of a Wikipedia Page
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Machine_learning"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
Extracting Infobox Data
Infoboxes are structured as <table class="infobox"> elements. Here’s how to extract key-value pairs:
infobox = soup.find("table", {"class": "infobox"})
data = {}
for row in infobox.find_all("tr"):
    header = row.find("th")
    value = row.find("td")
    if header and value:
        data[header.text.strip()] = value.text.strip()
print(data)
Now you have a Python dictionary of clean infobox fields.
Extracting Tabular Data with pandas
Many Wikipedia pages include HTML tables that can be parsed directly using pandas.
Example: Extracting All Tables
import pandas as pd
tables = pd.read_html(url)
print(f"Found {len(tables)} tables")
# Display the first table
print(tables[0].head())
This method is ideal for statistical data, comparison tables, and historical records.
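As a quick follow-up, you can inspect a table’s dimensions and save it for later analysis (the output file name below is just an example):
first_table = tables[0]
print(first_table.shape)  # (rows, columns)
first_table.to_csv("wikipedia_table.csv", index=False)  # example output file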
Cleaning the Extracted Data
Wikipedia content often includes citations (e.g., [1]) or nested tags. Here’s how to clean them:
from bs4 import NavigableString
def clean_text(tag):
    # Keep only the tag's direct text nodes, dropping nested tags such as citation superscripts
    return ''.join([str(t) for t in tag if isinstance(t, NavigableString)]).strip()
cleaned_data = {}
for row in infobox.find_all("tr"):
    header = row.find("th")
    value = row.find("td")
    if header and value:
        cleaned_data[header.text.strip()] = clean_text(value)
print(cleaned_data)
Handling Redirects and Disambiguation
Wikipedia pages can redirect or lead to disambiguation pages. The wikipedia library handles this:
try:
    page = wikipedia.page("Mercury")
except wikipedia.DisambiguationError as e:
    print("Disambiguation required. Options:", e.options)
You can then choose the specific page from the list.
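A minimal sketch of one way to resolve it, retrying with the first suggested title (the option you pick may itself need further handling):
try:
    page = wikipedia.page("Mercury")
except wikipedia.DisambiguationError as e:
    # Illustrative: pick one of the suggested titles and retry
    page = wikipedia.page(e.options[0], auto_suggest=False)
    print(page.title)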
Automating Wikipedia Data Extraction for Multiple Entries
Here’s how you can loop over multiple topics:
topics = ["Machine learning", "Artificial intelligence", "Data science"]
for topic in topics:
    summary = wikipedia.summary(topic)
    print(f"\n--- {topic} ---\n{summary}")
For bulk table or infobox scraping, combine this loop with the BeautifulSoup and pandas workflows from the earlier sections, as sketched below.
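A minimal sketch of such a batch workflow, reusing the infobox-parsing approach from Method 2 (the helper name, one-second delay, and topic list are illustrative choices):
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

def fetch_infobox(title):
    # Illustrative helper: return an article's infobox fields as a dict (possibly empty)
    url = "https://en.wikipedia.org/wiki/" + title.replace(" ", "_")
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    infobox = soup.find("table", {"class": "infobox"})
    fields = {}
    if infobox:
        for row in infobox.find_all("tr"):
            header, value = row.find("th"), row.find("td")
            if header and value:
                fields[header.text.strip()] = value.text.strip()
    return fields

topics = ["Machine learning", "Artificial intelligence", "Data science"]
records = []
for topic in topics:
    records.append({"topic": topic, **fetch_infobox(topic)})
    time.sleep(1)  # polite delay between requests
print(pd.DataFrame(records))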
Applications of Structured Wikipedia Data
- Knowledge graphs: Extract entity relationships from infoboxes
- NLP training datasets: Use raw text or metadata for supervised learning
- Trend analysis: Scrape historical or statistical tables
- Data journalism: Fetch and visualize open-access data
Best Practices for Wikipedia Scraping
| Tip | Description |
|---|---|
| Use respectful rate limits | Avoid hammering Wikipedia’s servers |
| Cache results | Reuse fetched data for repeated runs |
| Check for updates | Wikipedia pages evolve frequently |
| Handle exceptions | Always check for redirects or page errors |
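The sketch below combines the rate-limit, caching, and exception-handling tips with a simple in-memory cache (the delay and cache size are arbitrary choices):
import time
from functools import lru_cache
import wikipedia

@lru_cache(maxsize=256)
def cached_summary(title):
    # Only uncached titles trigger a live request (and the polite delay)
    time.sleep(1)
    return wikipedia.summary(title)

try:
    print(cached_summary("Data science"))
    print(cached_summary("Data science"))  # second call is served from the cache
except (wikipedia.DisambiguationError, wikipedia.PageError) as err:
    # Raised by the wikipedia library for ambiguous or missing titles
    print("Could not fetch page:", err)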
Alternatives: Wikipedia APIs and Wikidata
For more structured queries:
- MediaWiki API: For querying page content, links, and categories as JSON
- Wikidata SPARQL queries: For semantic-level data extraction
These tools allow deeper integrations if you’re building advanced pipelines or tools.
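For instance, here is a minimal sketch of a MediaWiki API request that returns a plain-text introduction as JSON (the parameters use the API’s query action and TextExtracts module):
import requests

params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "exintro": True,       # only the lead section
    "explaintext": True,   # plain text instead of HTML
    "titles": "Machine learning",
}
response = requests.get("https://en.wikipedia.org/w/api.php", params=params)
pages = response.json()["query"]["pages"]
for page in pages.values():
    print(page["extract"][:300])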
Conclusion
Extracting structured data from Wikipedia with Python opens up powerful opportunities, from building datasets to automating knowledge retrieval. With just a few libraries, wikipedia, BeautifulSoup, and pandas, you can transform unstructured encyclopedia content into usable data.
To master more web scraping and data handling techniques, check out the Web Scraping with Python course by Great Learning. Learn how to build robust data pipelines with real-world projects.
Frequently Asked Questions (FAQs)
1. How can I extract internal links from a Wikipedia page?
You can use BeautifulSoup to find all <a> tags with href attributes starting with /wiki/, then filter out administrative or special pages (like those containing a colon). This is useful for building knowledge graphs or crawling linked topics.
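For example, reusing the soup object from Method 2 (a sketch; the colon check is a simple heuristic for skipping special pages):
internal_links = set()
for a in soup.find_all("a", href=True):
    href = a["href"]
    if href.startswith("/wiki/") and ":" not in href:  # skip File:, Category:, etc.
        internal_links.add("https://en.wikipedia.org" + href)
print(len(internal_links), "internal article links found")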
2. Can I extract images or media files from a Wikipedia page?
Yes. Images are embedded in <img> tags. You can extract the src attribute and prepend https: to form a complete URL. Keep in mind that many image URLs point to Wikimedia Commons.
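A small sketch, again reusing the soup object built earlier (Wikipedia typically serves protocol-relative image URLs starting with //):
image_urls = []
for img in soup.find_all("img"):
    src = img.get("src", "")
    if src.startswith("//"):  # protocol-relative URL
        image_urls.append("https:" + src)
print(image_urls[:5])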
3. Is it better to use the Wikipedia API instead of scraping HTML?
Yes, for many structured data needs (like page content, categories, links), using the official MediaWiki API is more stable and ethical than scraping raw HTML, especially for large-scale or automated tasks.
4. How do I get the categories a Wikipedia article belongs to?
Categories are typically located at the bottom of the HTML page under the class “mw-normal-catlinks”. With BeautifulSoup, you can extract these links to classify or cluster articles.
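A minimal sketch based on that structure, using the soup object from Method 2 (the class name follows the description above):
catlinks = soup.find("div", {"class": "mw-normal-catlinks"})
categories = []
if catlinks and catlinks.find("ul"):
    # The <ul> inside the catlinks block holds one link per category
    categories = [a.text for a in catlinks.find("ul").find_all("a")]
print(categories)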
5. Can I extract data in multiple languages using Python?
Yes. The wikipedia Python library supports language switching using wikipedia.set_lang('xx'), where ‘xx’ is the language code (e.g., ‘fr’ for French, ‘es’ for Spanish). This allows multilingual scraping and comparative analysis.
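A short sketch switching to French (the article title shown is just an example of the French page for machine learning):
import wikipedia

wikipedia.set_lang("fr")
print(wikipedia.summary("Apprentissage automatique", sentences=2))
wikipedia.set_lang("en")  # switch back for later calls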