How to Extract and Clean Data from Wikipedia Using Python

Want to turn Wikipedia’s raw content into clean, structured datasets? This guide walks you through Python-based methods to extract text, tables, infoboxes, and more using wikipedia, BeautifulSoup, and pandas.

Extract Data from Wikipedia Using Python

There is a wealth of information on Wikipedia, and extracting it in a properly organized form can simplify and speed up your research, NLP training, or content scraping workflows. However, the site's fast-changing pages and raw HTML can be hard to work with directly.

If you’re looking to build real-world skills in data extraction and automation, start by mastering Python itself. Enroll in this Python language course from Great Learning to strengthen your foundation in Python programming, data structures, and practical projects.

In this guide, you’ll learn how to get structured data from Wikipedia in Python with the help of the wikipedia, BeautifulSoup, and pandas libraries. We’ll walk through practical examples, from fetching article content to parsing infoboxes and tables.

Why Extract Data from Wikipedia?

Wikipedia offers:

  • Rich encyclopedic content across domains
  • Structured metadata via infoboxes
  • Up-to-date information maintained by a global community
  • Open access under a Creative Commons license

These features make it an ideal data source for NLP, machine learning (ML), and data visualization projects.

Prerequisites

To follow along, ensure you have:

  • Python 3.x installed
  • Basic familiarity with web scraping and data structures
  • The required libraries installed (lxml is used by pandas.read_html in the table-parsing step):
pip install wikipedia beautifulsoup4 requests pandas lxml

If you’re new to scraping or want a refresher on the basics, this Python web scraping guide covers how to work with HTML, requests, and extraction tools in a beginner-friendly way.

Method 1: Using the wikipedia Library for Basic Text Content

The wikipedia library provides a simple API for fetching article summaries and page content.

Example: Fetching a Page Summary

import wikipedia

# Set language (optional)
wikipedia.set_lang("en")

# Fetch summary
summary = wikipedia.summary("Machine learning")
print(summary)

Fetch Full Page Content

page = wikipedia.page("Machine learning")
print(page.content)

This gives you raw text, but not structured data like infoboxes or tables.
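
The same page object also exposes metadata such as the canonical URL, the article’s outgoing links, and its categories, which is handy to inspect before moving on to full HTML parsing. A minimal sketch using attributes provided by the wikipedia library:

print(page.title)            # canonical article title
print(page.url)              # full article URL
print(page.links[:10])       # titles of the first few linked articles
print(page.categories[:10])  # first few category names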

Method 2: Extracting Structured Data with BeautifulSoup

To parse HTML for structured elements like infoboxes, tables, or categories, we use requests and BeautifulSoup.

Step 1: Fetch HTML of a Wikipedia Page

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Machine_learning"
response = requests.get(url)
response.raise_for_status()  # fail fast if the request did not succeed
soup = BeautifulSoup(response.text, 'html.parser')

Extracting Infobox Data

Infoboxes are rendered as <table class="infobox"> elements. Here’s how to extract key-value pairs (not every article has an infobox, so check before parsing):

infobox = soup.find("table", {"class": "infobox"})
data = {}

if infobox:  # some articles have no infobox at all
    for row in infobox.find_all("tr"):
        header = row.find("th")   # field name cell
        value = row.find("td")    # field value cell
        if header and value:
            data[header.text.strip()] = value.text.strip()

print(data)

Now you have a Python dictionary of infobox field names and values (we’ll tidy them up further in a moment).
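
If you want to reuse the extracted fields later, you can persist the dictionary straight away, for example as JSON. A minimal sketch (the infobox.json filename is just an illustration):

import json

with open("infobox.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)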

Extracting Tabular Data with pandas

Many Wikipedia pages include HTML tables that can be parsed directly using pandas.

Example: Extracting All Tables

import pandas as pd

tables = pd.read_html(url)
print(f"Found {len(tables)} tables")

# Display the first table
print(tables[0].head())

This method is ideal for statistical data, comparison tables, and historical records.
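
Each entry returned by pd.read_html is an ordinary DataFrame, so you can tidy and export it like any other. A minimal sketch continuing from the block above (the output filename is just an example):

# Work with the first table as a normal DataFrame
df = tables[0]
df.columns = [str(c).strip() for c in df.columns]  # tidy header labels

# Save the table for later analysis
df.to_csv("wikipedia_table.csv", index=False)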

Cleaning the Extracted Data

Wikipedia content often includes citation markers (e.g., [1]) wrapped in <sup> tags, plus nested links and spans. Here’s how to strip the citations while keeping the visible text:

import re

def clean_text(tag):
    # Remove citation superscripts such as [1] before reading the text
    for sup in tag.find_all("sup"):
        sup.decompose()
    # get_text keeps the text of nested tags (links, spans) intact
    return re.sub(r"\s+", " ", tag.get_text(" ", strip=True))

cleaned_data = {}
if infobox:
    for row in infobox.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header and value:
            cleaned_data[header.text.strip()] = clean_text(value)

print(cleaned_data)

Handling Redirects and Disambiguation

Wikipedia queries can hit redirects or land on disambiguation pages. The wikipedia library raises a dedicated exception for the latter, which you can catch:

try:
    page = wikipedia.page("Mercury")
except wikipedia.DisambiguationError as e:
    print("Disambiguation required. Options:", e.options)

You can then choose the specific page from the list.
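
A slightly fuller sketch that retries with one of the suggested options and also guards against missing pages (picking the first option is only an illustration, not a rule):

import wikipedia

def safe_page(title):
    """Fetch a page, falling back to the first disambiguation option if needed."""
    try:
        return wikipedia.page(title, auto_suggest=False)
    except wikipedia.DisambiguationError as e:
        # Simple fallback: take the first suggested title (illustrative only)
        return wikipedia.page(e.options[0], auto_suggest=False)
    except wikipedia.PageError:
        return None

page = safe_page("Mercury")
if page:
    print(page.title, "-", page.url)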

Automating Wikipedia Data Extraction for Multiple Entries

Here’s how you can loop over multiple topics:

topics = ["Machine learning", "Artificial intelligence", "Data science"]

for topic in topics:
    summary = wikipedia.summary(topic)
    print(f"\n--- {topic} ---\n{summary}")

For bulk table or infobox scraping, combine this loop with the BeautifulSoup and pandas workflows above, as in the sketch below.
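
Here is a hedged sketch of batch infobox scraping that reuses the earlier parsing logic across several articles (the topic list and the one-second delay are only illustrative):

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

topics = ["Machine learning", "Artificial intelligence", "Data science"]
records = []

for topic in topics:
    page_url = "https://en.wikipedia.org/wiki/" + topic.replace(" ", "_")
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    infobox = soup.find("table", {"class": "infobox"})

    row = {"topic": topic}
    if infobox:
        for tr in infobox.find_all("tr"):
            th, td = tr.find("th"), tr.find("td")
            if th and td:
                row[th.text.strip()] = td.text.strip()
    records.append(row)
    time.sleep(1)  # be polite between requests

df = pd.DataFrame(records)
print(df.head())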

Applications of Structured Wikipedia Data

  • Knowledge graphs: Extract entity relationships from infoboxes
  • NLP training datasets: Use raw text or metadata for supervised learning
  • Trend analysis: Scrape historical or statistical tables
  • Data journalism: Fetch and visualize open-access data

Best Practices for Wikipedia Scraping

  • Use respectful rate limits: avoid hammering Wikipedia’s servers
  • Cache results: reuse fetched data for repeated runs (see the sketch below)
  • Check for updates: Wikipedia pages evolve frequently
  • Handle exceptions: always check for redirects, disambiguation, and page errors
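
Here is a minimal sketch of the caching idea: a local dictionary keyed by URL, so repeated lookups in the same run skip the network (the one-second pause is an arbitrary example; adjust it to your workload):

import time
import requests

_cache = {}

def fetch(url):
    """Return page HTML, reusing a cached copy when available."""
    if url not in _cache:
        _cache[url] = requests.get(url).text
        time.sleep(1)  # respectful pause after each live request
    return _cache[url]

html = fetch("https://en.wikipedia.org/wiki/Machine_learning")
print(len(html), "characters fetched")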

Alternatives: Wikipedia APIs and Wikidata

For more structured queries, consider:

  • The official MediaWiki API (https://en.wikipedia.org/w/api.php) for page content, categories, links, and search
  • Wikidata and the Wikidata Query Service for entity-level data queryable with SPARQL

These tools allow deeper integration if you’re building advanced pipelines or tools.
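
As one example, here is a hedged sketch of a MediaWiki API call that returns a plain-text introduction for an article (the extracts parameters come from the TextExtracts extension, which English Wikipedia supports):

import requests

params = {
    "action": "query",
    "format": "json",
    "titles": "Machine learning",
    "prop": "extracts",
    "exintro": 1,      # only the lead section
    "explaintext": 1,  # plain text instead of HTML
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
pages = resp.json()["query"]["pages"]

for page_id, page in pages.items():
    print(page["extract"][:500])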

Conclusion

Extracting structured data from Wikipedia with Python opens up powerful opportunities, from building datasets to automating knowledge retrieval. With just a few libraries (wikipedia, BeautifulSoup, and pandas), you can transform unstructured encyclopedia content into usable data.

To master more web scraping and data handling techniques, check out the Web Scraping with Python course by Great Learning. Learn how to build robust data pipelines with real-world projects.

Frequently Asked Questions (FAQs)

1. How do I extract all the links from a Wikipedia article?

You can use BeautifulSoup to find all <a> tags with href attributes starting with /wiki/, then filter out administrative or special pages (those whose titles contain a colon). This is useful for building knowledge graphs or crawling linked topics.
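
A minimal sketch of that approach, reusing the soup object from earlier (the colon filter is a simple heuristic for skipping namespace pages such as File: or Help:):

internal_links = set()
for a in soup.find_all("a", href=True):
    href = a["href"]
    # Keep only article links and skip Special:, File:, Help: and similar pages
    if href.startswith("/wiki/") and ":" not in href.split("/wiki/", 1)[1]:
        internal_links.add("https://en.wikipedia.org" + href)

print(len(internal_links), "article links found")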

2. Can I extract images or media files from a Wikipedia page?

Yes. Images are embedded in <img> tags. You can extract the src attribute and prepend https: to form a complete URL. Keep in mind that many image URLs point to Wikimedia Commons.
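
A short sketch along those lines (protocol-relative src values such as //upload.wikimedia.org/... get an https: prefix; anything else is kept as-is):

image_urls = []
for img in soup.find_all("img"):
    src = img.get("src", "")
    if src.startswith("//"):
        src = "https:" + src  # protocol-relative URL from Wikimedia
    if src:
        image_urls.append(src)

print(image_urls[:5])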

3. Is it better to use the Wikipedia API instead of scraping HTML?

Yes, for many structured data needs (like page content, categories, links), using the official MediaWiki API is more stable and ethical than scraping raw HTML, especially for large-scale or automated tasks.

4. How do I get the categories a Wikipedia article belongs to?

Categories are typically located at the bottom of the HTML page under the class “mw-normal-catlinks”. With BeautifulSoup, you can extract these links to classify or cluster articles.
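
A small sketch of that idea (the mw-normal-catlinks container holds the visible category links at the foot of the article; the first link inside it is the "Categories" label itself):

categories = []
catlinks = soup.find("div", {"id": "mw-normal-catlinks"})
if catlinks:
    for a in catlinks.find_all("a")[1:]:  # skip the leading "Categories" label link
        categories.append(a.text.strip())

print(categories)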

5. Can I extract data in multiple languages using Python?

Yes. The wikipedia Python library supports language switching using wikipedia.set_lang('xx'), where ‘xx’ is the language code (e.g., ‘fr’ for French, ‘es’ for Spanish). This allows multilingual scraping and comparative analysis.
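
For instance, a quick sketch comparing introductions across languages (the French article title is given only as an example):

import wikipedia

wikipedia.set_lang("fr")
print(wikipedia.summary("Apprentissage automatique", sentences=2, auto_suggest=False))

wikipedia.set_lang("en")
print(wikipedia.summary("Machine learning", sentences=2, auto_suggest=False))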
