How to Extract and Clean Data from Wikipedia Using Python

Want to turn Wikipedia’s raw content into clean, structured datasets? This guide walks you through Python-based methods to extract text, tables, infoboxes, and more using wikipedia, BeautifulSoup, and pandas.

Extract Data from Wikipedia Using Python

There is a wealth of information on Wikipedia, and extracting it in a properly organized form can simplify and speed up your research, NLP training, or content scraping workflows. However, the site's fast-changing pages and raw HTML can be hard to work with directly.

If you’re looking to build real-world skills in data extraction and automation, start by mastering Python itself. Enroll in this Python language course from Great Learning to strengthen your foundation in Python programming, data structures, and practical projects.

In this guide, you’ll learn how to get structured data from Wikipedia in Python with the help of the wikipedia, BeautifulSoup, and pandas libraries. We’ll walk through practical examples, from fetching article content to parsing infoboxes and tables.

Why Extract Data from Wikipedia?

Wikipedia offers:

  • Rich encyclopedic content across domains
  • Structured metadata via infoboxes
  • Up-to-date information maintained by a global community
  • Open access under a Creative Commons license

These features make it an ideal data source for NLP, machine learning (ML), and data visualization projects.

Prerequisites

To follow along, ensure you have:

  • Python 3.x installed
  • Basic familiarity with web scraping and data structures
  • The required libraries installed (lxml is used by pandas.read_html in the table-parsing step):
pip install wikipedia beautifulsoup4 requests pandas lxml

If you’re new to scraping or want a refresher on the basics, this Python web scraping guide covers how to work with HTML, requests, and extraction tools in a beginner-friendly way.

Method 1: Using the wikipedia Library for Basic Text Content

The wikipedia library provides a simple API for fetching article summaries and page content.

Example: Fetching a Page Summary

import wikipedia

# Set language (optional)
wikipedia.set_lang("en")

# Fetch summary
summary = wikipedia.summary("Machine learning")
print(summary)

Fetch Full Page Content

page = wikipedia.page("Machine learning")
print(page.content)

This gives you raw text, but not structured data like infoboxes or tables.
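
The same page object also exposes metadata such as the canonical URL, the article’s outgoing links, and its categories, which is handy to inspect before moving on to full HTML parsing. A minimal sketch using attributes provided by the wikipedia library:

print(page.title)            # canonical article title
print(page.url)              # full article URL
print(page.links[:10])       # titles of the first few linked articles
print(page.categories[:10])  # first few category names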

Method 2: Extracting Structured Data with BeautifulSoup

To parse HTML for structured elements like infoboxes, tables, or categories, we use requests and BeautifulSoup.

Step 1: Fetch HTML of a Wikipedia Page

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Machine_learning"
response = requests.get(url)
response.raise_for_status()  # fail fast if the request did not succeed
soup = BeautifulSoup(response.text, 'html.parser')

Extracting Infobox Data

Infoboxes are rendered as <table class="infobox"> elements. Here’s how to extract key-value pairs (not every article has an infobox, so check before parsing):

infobox = soup.find("table", {"class": "infobox"})
data = {}

if infobox:  # some articles have no infobox at all
    for row in infobox.find_all("tr"):
        header = row.find("th")   # field name cell
        value = row.find("td")    # field value cell
        if header and value:
            data[header.text.strip()] = value.text.strip()

print(data)

Now you have a Python dictionary of infobox field names and values (we’ll tidy them up further in a moment).
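
If you want to reuse the extracted fields later, you can persist the dictionary straight away, for example as JSON. A minimal sketch (the infobox.json filename is just an illustration):

import json

with open("infobox.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)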

Extracting Tabular Data with pandas

Many Wikipedia pages include HTML tables that can be parsed directly using pandas.

Example: Extracting All Tables

import pandas as pd

tables = pd.read_html(url)
print(f"Found {len(tables)} tables")

# Display the first table
print(tables[0].head())

This method is ideal for statistical data, comparison tables, and historical records.
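
Each entry returned by pd.read_html is an ordinary DataFrame, so you can tidy and export it like any other. A minimal sketch continuing from the block above (the output filename is just an example):

# Work with the first table as a normal DataFrame
df = tables[0]
df.columns = [str(c).strip() for c in df.columns]  # tidy header labels

# Save the table for later analysis
df.to_csv("wikipedia_table.csv", index=False)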

Cleaning the Extracted Data

Wikipedia content often includes citation markers (e.g., [1]) wrapped in <sup> tags, plus nested links and spans. Here’s how to strip the citations while keeping the visible text:

import re

def clean_text(tag):
    # Remove citation superscripts such as [1] before reading the text
    for sup in tag.find_all("sup"):
        sup.decompose()
    # get_text keeps the text of nested tags (links, spans) intact
    return re.sub(r"\s+", " ", tag.get_text(" ", strip=True))

cleaned_data = {}
if infobox:
    for row in infobox.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header and value:
            cleaned_data[header.text.strip()] = clean_text(value)

print(cleaned_data)

Handling Redirects and Disambiguation

Wikipedia queries can hit redirects or land on disambiguation pages. The wikipedia library raises a dedicated exception for the latter, which you can catch:

try:
    page = wikipedia.page("Mercury")
except wikipedia.DisambiguationError as e:
    print("Disambiguation required. Options:", e.options)

You can then choose the specific page from the list.
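
A slightly fuller sketch that retries with one of the suggested options and also guards against missing pages (picking the first option is only an illustration, not a rule):

import wikipedia

def safe_page(title):
    """Fetch a page, falling back to the first disambiguation option if needed."""
    try:
        return wikipedia.page(title, auto_suggest=False)
    except wikipedia.DisambiguationError as e:
        # Simple fallback: take the first suggested title (illustrative only)
        return wikipedia.page(e.options[0], auto_suggest=False)
    except wikipedia.PageError:
        return None

page = safe_page("Mercury")
if page:
    print(page.title, "-", page.url)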

Automating Wikipedia Data Extraction for Multiple Entries

Here’s how you can loop over multiple topics:

topics = ["Machine learning", "Artificial intelligence", "Data science"]

for topic in topics:
    summary = wikipedia.summary(topic)
    print(f"\n--- {topic} ---\n{summary}")

For bulk table or infobox scraping, combine this loop with the BeautifulSoup and pandas workflows above, as in the sketch below.
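
Here is a hedged sketch of batch infobox scraping that reuses the earlier parsing logic across several articles (the topic list and the one-second delay are only illustrative):

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

topics = ["Machine learning", "Artificial intelligence", "Data science"]
records = []

for topic in topics:
    page_url = "https://en.wikipedia.org/wiki/" + topic.replace(" ", "_")
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    infobox = soup.find("table", {"class": "infobox"})

    row = {"topic": topic}
    if infobox:
        for tr in infobox.find_all("tr"):
            th, td = tr.find("th"), tr.find("td")
            if th and td:
                row[th.text.strip()] = td.text.strip()
    records.append(row)
    time.sleep(1)  # be polite between requests

df = pd.DataFrame(records)
print(df.head())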

Applications of Structured Wikipedia Data

  • Knowledge graphs: Extract entity relationships from infoboxes
  • NLP training datasets: Use raw text or metadata for supervised learning
  • Trend analysis: Scrape historical or statistical tables
  • Data journalism: Fetch and visualize open-access data

Best Practices for Wikipedia Scraping

  • Use respectful rate limits: avoid hammering Wikipedia’s servers
  • Cache results: reuse fetched data for repeated runs (see the sketch below)
  • Check for updates: Wikipedia pages evolve frequently
  • Handle exceptions: always check for redirects, disambiguation, and page errors
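
Here is a minimal sketch of the caching idea: a local dictionary keyed by URL, so repeated lookups in the same run skip the network (the one-second pause is an arbitrary example; adjust it to your workload):

import time
import requests

_cache = {}

def fetch(url):
    """Return page HTML, reusing a cached copy when available."""
    if url not in _cache:
        _cache[url] = requests.get(url).text
        time.sleep(1)  # respectful pause after each live request
    return _cache[url]

html = fetch("https://en.wikipedia.org/wiki/Machine_learning")
print(len(html), "characters fetched")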

Alternatives: Wikipedia APIs and Wikidata

For more structured queries, consider:

  • The official MediaWiki API (https://en.wikipedia.org/w/api.php) for page content, categories, links, and search
  • Wikidata and the Wikidata Query Service for entity-level data queryable with SPARQL

These tools allow deeper integration if you’re building advanced pipelines or tools.
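
As one example, here is a hedged sketch of a MediaWiki API call that returns a plain-text introduction for an article (the extracts parameters come from the TextExtracts extension, which English Wikipedia supports):

import requests

params = {
    "action": "query",
    "format": "json",
    "titles": "Machine learning",
    "prop": "extracts",
    "exintro": 1,      # only the lead section
    "explaintext": 1,  # plain text instead of HTML
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
pages = resp.json()["query"]["pages"]

for page_id, page in pages.items():
    print(page["extract"][:500])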

Conclusion

Extracting structured data from Wikipedia with Python opens up powerful opportunities, from building datasets to automating knowledge retrieval. With just a few libraries (wikipedia, BeautifulSoup, and pandas), you can transform unstructured encyclopedia content into usable data.

To master more web scraping and data handling techniques, check out the Web Scraping with Python course by Great Learning. Learn how to build robust data pipelines with real-world projects.

Frequently Asked Questions (FAQs)

1. How do I extract all the links from a Wikipedia article?

You can use BeautifulSoup to find all <a> tags with href attributes starting with /wiki/, then filter out administrative or special pages (those whose titles contain a colon). This is useful for building knowledge graphs or crawling linked topics.
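
A minimal sketch of that approach, reusing the soup object from earlier (the colon filter is a simple heuristic for skipping namespace pages such as File: or Help:):

internal_links = set()
for a in soup.find_all("a", href=True):
    href = a["href"]
    # Keep only article links and skip Special:, File:, Help: and similar pages
    if href.startswith("/wiki/") and ":" not in href.split("/wiki/", 1)[1]:
        internal_links.add("https://en.wikipedia.org" + href)

print(len(internal_links), "article links found")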

2. Can I extract images or media files from a Wikipedia page?

Yes. Images are embedded in <img> tags. You can extract the src attribute and prepend https: to form a complete URL. Keep in mind that many image URLs point to Wikimedia Commons.
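
A short sketch along those lines (protocol-relative src values such as //upload.wikimedia.org/... get an https: prefix; anything else is kept as-is):

image_urls = []
for img in soup.find_all("img"):
    src = img.get("src", "")
    if src.startswith("//"):
        src = "https:" + src  # protocol-relative URL from Wikimedia
    if src:
        image_urls.append(src)

print(image_urls[:5])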

3. Is it better to use the Wikipedia API instead of scraping HTML?

Yes, for many structured data needs (like page content, categories, links), using the official MediaWiki API is more stable and ethical than scraping raw HTML, especially for large-scale or automated tasks.

4. How do I get the categories a Wikipedia article belongs to?

Categories are typically located at the bottom of the HTML page under the class “mw-normal-catlinks”. With BeautifulSoup, you can extract these links to classify or cluster articles.
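
A small sketch of that idea (the mw-normal-catlinks container holds the visible category links at the foot of the article; the first link inside it is the "Categories" label itself):

categories = []
catlinks = soup.find("div", {"id": "mw-normal-catlinks"})
if catlinks:
    for a in catlinks.find_all("a")[1:]:  # skip the leading "Categories" label link
        categories.append(a.text.strip())

print(categories)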

5. Can I extract data in multiple languages using Python?

Yes. The wikipedia Python library supports language switching using wikipedia.set_lang('xx'), where ‘xx’ is the language code (e.g., ‘fr’ for French, ‘es’ for Spanish). This allows multilingual scraping and comparative analysis.
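
For instance, a quick sketch comparing introductions across languages (the French article title is given only as an example):

import wikipedia

wikipedia.set_lang("fr")
print(wikipedia.summary("Apprentissage automatique", sentences=2, auto_suggest=False))

wikipedia.set_lang("en")
print(wikipedia.summary("Machine learning", sentences=2, auto_suggest=False))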
