
How to Scrape Wikipedia: Complete Beginner's Tutorial

Justinas Tamasevicius

Dec 16, 2025

23 min read

Wikipedia has over 60 million articles, making it a valuable resource for machine learning training data, research datasets, and competitive intelligence. This tutorial guides you through extracting your first article to building crawlers that navigate Wikipedia's knowledge graph. You'll learn to extract titles, infoboxes, tables, and image references, then scale up to crawling entire topic clusters.

Why scrape Wikipedia?

Wikipedia serves as the foundational data layer for these five business and technical workflows:

  1. Market intelligence & data enrichment. Data teams use Wikipedia to validate and enrich internal databases. By extracting structured infobox metadata, such as revenue figures, headquarters, or C-suite executives, you can normalize entity records for competitive analysis at scale.
  2. Specialized research datasets. Wikipedia's official database dumps are massive (20GB+) and require complex XML parsing. Scraping allows the targeted extraction of specific tables, such as the "List of S&P 500 companies", directly into clean CSVs for immediate analysis.
  3. Fueling agentic workflows. Autonomous agents need reliable ground truth data to verify facts before taking action. Wikipedia serves as the primary reference layer for entity resolution, allowing agents to confirm that a company, person, or event exists and is correctly identified before executing code.
  4. Synthetic data for SLMs. Small language models (SLMs) running locally require high-quality text to learn reasoning. Wikipedia provides structured content for generating the AI training data needed to fine-tune these models for instruction following.
  5. GraphRAG and reasoning engines. For complex queries, standard AI search is evolving into GraphRAG. This uses structured data (like the infoboxes you'll scrape) to map relationships, allowing AI to understand connections across different articles rather than just retrieving isolated keywords.

Understanding Wikipedia's structure

Wikipedia's consistency makes extraction predictable. Every article follows the same HTML patterns once you know the right CSS selectors.

Wikipedia article structure

Right-click any Wikipedia page and select Inspect Element (or press F12 on Windows / Cmd+Option+I on Mac). You'll see these key structural elements:

1. The title #firstHeading – every article uses this unique ID for the main title.

2. The content container .mw-parser-output – the actual article text is wrapped in this class. We target this to avoid scraping the sidebar menu or footer.

3. The infobox table.infobox – this table on the right side contains structured summary data (like founders, industry, or headquarters).

Not every article has an infobox, and table structures vary. Your scraper needs to handle missing elements without crashing.
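
To see these selectors in action, here's a minimal sketch (assuming the requests, beautifulsoup4, and lxml libraries installed in the setup section below) that fetches an article and checks each element before using it, so a missing infobox never crashes the script:

import requests
from bs4 import BeautifulSoup

# Minimal sketch: fetch one article and probe the three key selectors
headers = {"User-Agent": "Mozilla/5.0"}  # simple browser-like header
html = requests.get("https://en.wikipedia.org/wiki/Google", headers=headers, timeout=15).text
soup = BeautifulSoup(html, "lxml")

title = soup.select_one("#firstHeading")        # article title
content = soup.select_one(".mw-parser-output")  # main article body
infobox = soup.select_one("table.infobox")      # may be None on some pages

print(title.get_text(strip=True) if title else "No title found")
print("Infobox present:", infobox is not None)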

Skip the code, get the data

Decodo's Web Scraping API extracts Wikipedia content as clean Markdown with automatic retry logic, proxy rotation, and zero maintenance.

Setting up your scraping environment

Before building the scraper, set up an isolated Python environment to avoid dependency conflicts.

Prerequisites

Make sure you have:

  • Python 3.9+ installed
  • Basic terminal/command line knowledge
  • A text editor (VS Code, PyCharm, etc.)

Creating a virtual environment

Create and activate a virtual environment.

# Create virtual environment
python -m venv wikipedia-env
# Activate it
# macOS/Linux:
source wikipedia-env/bin/activate
# Windows Command Prompt:
wikipedia-env\Scripts\activate.bat
# Windows PowerShell:
wikipedia-env\Scripts\Activate.ps1

Install required libraries

Install the necessary libraries:

pip install requests beautifulsoup4 lxml html2text pandas

Library breakdown:

  • requests – sends HTTP requests and supports configurable connection retries
  • beautifulsoup4 – parses HTML and navigates the document tree
  • lxml – a high-performance XML and HTML parser that speeds up Beautiful Soup
  • pandas – a data analysis library for extracting tables to CSV
  • html2text – converts HTML to Markdown format

New to these libraries? See our guides on Beautiful Soup web scraping, lxml parsing, table scraping with Pandas, and converting HTML to Markdown.

Freezing dependencies

Save your library versions to make the scraper shareable:

pip freeze > requirements.txt

Building the Wikipedia scraper

Let's build the Wikipedia scraper in steps. Open your code editor and create a file named wiki_scraper.py.

Step 1: Import libraries and configure retries

A good scraper needs to handle network errors. Start by importing libraries and setting up a session with retry logic.

Copy this into wiki_scraper.py:

import requests
from bs4 import BeautifulSoup
import html2text
import pandas as pd
import io, os, re, json, random
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Rotate user agents to mimic different browsers
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
]

def get_session():
    """Create a requests session with automatic retry on server errors"""
    session = requests.Session()
    # Retry on server errors (5xx) and rate limit errors (429)
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    return session

SESSION = get_session()

The key components:

  • User agents. Websites block the default Python requests user agent (python-requests/X.X.X). These headers identify requests as coming from Chrome browsers, preventing automated blocks.
  • Session object. Reuses TCP connections from a pool instead of creating new connections for each request, significantly improving speed.
  • Retry logic. Automatically retries up to 3 times with exponential backoff on server errors (500, 502, 503, 504) and rate limits (429).
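
As a quick sanity check (a throwaway snippet, not part of the final script), you can reuse the session with a rotated user agent and confirm the retry-enabled setup works before adding any parsing logic:

# Throwaway sanity check: the pooled session retries 429/5xx responses automatically
test_url = "https://en.wikipedia.org/wiki/Web_scraping"
response = SESSION.get(
    test_url,
    headers={"User-Agent": random.choice(USER_AGENTS)},
    timeout=15,
)
print(response.status_code)                   # 200 on success
print(len(response.content), "bytes downloaded")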

Step 2: Extract infoboxes and tables

Create helper functions for specific extraction tasks.

Add the extract_infobox function:

def extract_infobox(soup):
    """Extract structured data from Wikipedia infobox"""
    box = soup.select_one("table.infobox")
    if not box:
        return None
    data = {}
    # Extract title
    title = box.select_one(".infobox-above, .fn")
    if title:
        data["title"] = title.get_text(strip=True)
    # Extract key-value pairs
    for row in box.find_all("tr"):
        label = row.select_one("th.infobox-label, th.infobox-header")
        value = row.select_one("td.infobox-data, td")
        if label and value:
            # Regex cleaning: Remove special chars to make a valid JSON key
            key = re.sub(r"[^\w\s-]", "", label.get_text(strip=True)).strip()
            if key:
                data[key] = value.get_text(separator=" ", strip=True)
    return data

soup.select_one("table.infobox") finds the first table with class infobox and returns None if not found (the function continues without crashing). The re.sub(...) regex removes special characters from keys – labels like "Born:" or "Height?" can't be used with dot notation in Python (data.Born: is invalid syntax). We use data cleaning to convert them into valid identifiers.

Add the extract_tables function:

def extract_tables(soup, folder):
    """Extract Wikipedia tables and save as CSV files"""
    tables = soup.select("table.wikitable")
    if not tables:
        return 0
    os.makedirs(f"{folder}/tables", exist_ok=True)
    table_count = 0
    for table in tables:
        try:
            dfs = pd.read_html(io.StringIO(str(table)))
            if dfs:
                table_count += 1
                dfs[0].to_csv(f"{folder}/tables/table_{table_count}.csv", index=False)
        except Exception:
            pass
    return table_count

Wikipedia uses table.wikitable as the standard class for data tables. The pd.read_html() function requires HTML strings wrapped in io.StringIO in newer pandas versions (older versions accepted raw strings but now show deprecation warnings). The function converts HTML tables to pandas DataFrames, which are then saved as CSV files. The try-except block catches all pandas parsing errors – tables that fail conversion are silently skipped.
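
To see why the io.StringIO wrapper matters, here's a self-contained sketch with a made-up HTML fragment (not real Wikipedia markup) showing how pandas turns a table string into a DataFrame and then a CSV:

import io
import pandas as pd

# A made-up wikitable fragment, just to illustrate pd.read_html with io.StringIO
html_table = """
<table class="wikitable">
  <tr><th>City</th><th>State</th></tr>
  <tr><td>Ann Arbor</td><td>Michigan</td></tr>
  <tr><td>Austin</td><td>Texas</td></tr>
</table>
"""
dfs = pd.read_html(io.StringIO(html_table))  # returns a list of DataFrames
dfs[0].to_csv("example_table.csv", index=False)
print(dfs[0])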

Step 3: Build the scrape_page function

Combine everything into one pipeline: fetch → parse → clean → save.

Add the scrape_page function:

def scrape_page(url):
    # 1. Fetch
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    try:
        response = SESSION.get(url, headers=headers, timeout=15)
    except Exception as e:
        print(f"Error: {e}")
        return None
    # 2. Parse
    soup = BeautifulSoup(response.content, "lxml")
    # 3. Setup output folder
    title_elem = soup.find("h1", id="firstHeading")
    title = title_elem.get_text(strip=True) if title_elem else "Unknown"
    safe_title = re.sub(r"[^\w\-_]", "_", title)
    output_folder = f"Output_{safe_title}"
    os.makedirs(output_folder, exist_ok=True)
    # 4. Extract structured data (before cleaning!)
    infobox = extract_infobox(soup)
    if infobox:
        with open(f"{output_folder}/infobox.json", "w", encoding="utf-8") as f:
            json.dump(infobox, f, indent=2, ensure_ascii=False)
    extract_tables(soup, output_folder)
    # 5. Clean and convert to Markdown
    content = soup.select_one("#mw-content-text .mw-parser-output")
    # Remove noise elements so they don't appear in the text
    junk = [".navbox", ".reflist", ".reference", ".hatnote", ".ambox"]
    for selector in junk:
        for el in content.select(selector):
            el.decompose()
    h = html2text.HTML2Text()
    h.body_width = 0  # No wrapping
    markdown_content = h.handle(str(content))
    with open(f"{output_folder}/content.md", "w", encoding="utf-8") as f:
        f.write(f"# {title}\n\n{markdown_content}")
    print(f"Scraped: {title}")
    return {"soup": soup, "title": title}

if __name__ == "__main__":
    scrape_page("https://en.wikipedia.org/wiki/Google")

The function rotates through user agents using random.choice() to distribute requests across different browser identities, making traffic patterns less detectable. The 15-second timeout prevents the scraper from hanging indefinitely on slow connections – timeouts raise exceptions that are caught and printed. The function returns None when scraping fails, allowing the crawler to handle errors gracefully.

We use response.content (raw bytes) instead of response.text because the lxml parser handles encoding detection more reliably with binary input. The get_text(strip=True) method removes leading and trailing whitespace from the title, which is essential for creating clean folder names.

The safe_title regex replaces any character that isn't a word character, hyphen, or underscore with underscores – this prevents filesystem errors from titles containing characters like slashes, colons, or asterisks that are invalid in folder names across operating systems.

When saving the infobox JSON, indent=2 creates readable, pretty-printed output, and ensure_ascii=False preserves Unicode characters, which is necessary for non-English names and special characters in the data.

The scraping order is critical: extract structured data (infoboxes, tables) first, then remove Wikipedia's navigation and metadata elements. The junk selectors target navigation boxes, reference lists, citation superscripts, disambiguation notices, and article maintenance warnings. We use decompose() to remove these elements completely from the tree before converting to Markdown, ensuring clean output without navigational clutter.

The h.body_width = 0 setting disables html2text's default 78-character line wrapping, preserving the original structure of Wikipedia's content, which is better for downstream processing and AI training data.

The function returns a dictionary containing the BeautifulSoup object and title – we'll need the soup object for link extraction when we add crawling functionality.

Testing the scraper

Run the script:

python wiki_scraper.py

You'll see a new folder named Output_Google containing:

  1. infobox.json – structured metadata from the Wikipedia infobox.
  2. tables/ – all Wikipedia tables extracted as CSV files.
  3. content.md – the full article in clean Markdown format.

The infobox JSON structure:

{
  "title": "Google LLC",
  "Formerly": "Google Inc. (1998-2017)",
  "Company type": "Subsidiary",
  "Traded as": "Nasdaq : GOOGL Nasdaq : GOOG",
  "Industry": "Internet Cloud computing Computer software Computer hardware Artificial intelligence Advertising",
  "Founded": "September 4, 1998 ; 27 years ago ( 1998-09-04 ) [ a ] in Menlo Park , California , United States",
  "Founders": "Larry Page Sergey Brin",
  "Headquarters": "Googleplex , Mountain View, California , U.S.",
  "Area served": "Worldwide",
  "Key people": "John L. Hennessy ( Chairman ) Sundar Pichai ( CEO ) Ruth Porat ( President and CIO ) Anat Ashkenazi ( CFO )",
  "Products": "Google Search Android Nest Pixel Workspace Fitbit Waze YouTube Gemini Full list",
  "Number of employees": "187,000 (2022)",
  "Parent": "Alphabet Inc.",
  "Subsidiaries": "Adscape Cameyo Charleston Road Registry Endoxon FeedBurner ImageAmerica Kaltix Nest Labs reCAPTCHA X Development YouTube ZipDash",
  "ASN": "15169",
  "Website": "about .google"
}

Example table data (CSV format):

SN,City,Country or U.S. state
1.0,Ann Arbor,Michigan
2.0,Atlanta,Georgia
3.0,Austin,Texas
4.0,Boulder,Colorado
5.0,Boulder - Pearl Place,Colorado
6.0,Boulder - Walnut,Colorado
7.0,Cambridge,Massachusetts
8.0,Chapel Hill,North Carolina
9.0,Chicago - Carpenter,Illinois
10.0,Chicago - Fulton Market,Illinois

The Markdown output (content.md):

# Google
**Google LLC** ([/ˈɡuː.ɡəl/](/wiki/Help:IPA/English "Help:IPA/English") [](//upload.wikimedia.org/wikipedia/commons/transcoded/3/3d/En-us-googol.ogg/En-us-googol.ogg.mp3 "Play audio")[](/wiki/File:En-us-googol.ogg "File:En-us-googol.ogg"), [_GOO -gəl_](/wiki/Help:Pronunciation_respelling_key "Help:Pronunciation respelling key")) is an American multinational technology corporation focused on...
Google was founded on September 4, 1998, by American computer scientists [Larry Page](/wiki/Larry_Page "Larry Page") and [Sergey Brin](/wiki/Sergey_Brin "Sergey Brin"). Together, they own about 14% of its publicly listed shares and control 56% of its stockholder voting power through [super-voting stock](/wiki/Super-voting_stock "Super-voting stock"). The company went [public](/wiki/Public_company "Public company") via an [initial public offering](/wiki/Initial_public_offering "Initial public offering") (IPO) in 2004...
## History
### Early years
Google began in January 1996 as a research project by [Larry Page](/wiki/Larry_Page "Larry Page") and [Sergey Brin](/wiki/Sergey_Brin "Sergey Brin") while they were both [PhD](/wiki/PhD "PhD") students at [Stanford University](/wiki/Stanford_University "Stanford University") in [California](/wiki/California "California"), United States...

Building the Wikipedia crawler

Now that we have a working scraper that extracts data from individual Wikipedia pages, let's extend it to automatically discover and scrape related topics. This transforms our single-page scraper into a crawler that can map out connections between articles, creating a dataset of related concepts.

A basic crawler follows every link on a page. On Wikipedia, that's problematic – if you start at "Python" and follow every link, you'll end up scraping "1991 in Science" and "Netherlands" within seconds. The Python article alone contains over 1,000 links, and following all of them would quickly spiral out of control.

To collect related topics efficiently, we need a selective crawler that focuses on conceptually relevant links. We'll build this in three parts: URL validation, intelligent link extraction, and the crawling loop.

Step 4: Crawler setup and validation

Update your imports at the top of wiki_scraper.py:

import requests
from bs4 import BeautifulSoup
import html2text
import pandas as pd
import io, os, re, json, random, time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib.parse import urlparse, urljoin
from collections import deque
import argparse

Add these validation functions below your extract_tables function:

def normalize_url(url):
    """Standardizes URLs (removes fragments like #history)"""
    if not url:
        return None
    # Handle protocol-relative URLs (common on Wikipedia)
    if url.startswith("//"):
        url = "https:" + url
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc.lower()}{parsed.path}"

def is_valid_wikipedia_link(url):
    """Filters out special pages (Files, Talk, Help)"""
    if not url:
        return False
    parsed = urlparse(url)
    if "wikipedia.org" not in parsed.netloc:
        return False
    # We only want articles, not maintenance pages
    skip = [
        "/wiki/Special:",
        "/wiki/File:",
        "/wiki/Help:",
        "/wiki/User:",
        "/wiki/Talk:",
    ]
    return not any(parsed.path.startswith(p) for p in skip)

The normalize_url function strips URL fragments (like #History), so we treat each page as one unique URL. Wikipedia often uses protocol-relative URLs like //upload.wikimedia.org, which this function converts to proper HTTPS URLs. The is_valid_wikipedia_link function filters out maintenance pages, special pages, user pages, and talk pages, keeping only article content.
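
A few illustrative spot checks (hypothetical, run after the two functions above are defined) show what each helper does:

# Hypothetical spot checks for the URL helpers
print(normalize_url("//en.wikipedia.org/wiki/Google#History"))
# -> https://en.wikipedia.org/wiki/Google  (scheme added, fragment dropped)

print(is_valid_wikipedia_link("https://en.wikipedia.org/wiki/Larry_Page"))     # True
print(is_valid_wikipedia_link("https://en.wikipedia.org/wiki/File:Logo.svg"))  # False (special page)
print(is_valid_wikipedia_link("https://example.com/wiki/Google"))              # False (not Wikipedia)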

Step 5: Extract relevant links

Add the link extraction function, which focuses on the first few paragraphs and the "See also" section:

def extract_links(soup, base_url):
    links = set()
    # Target the main article body
    content = soup.select_one("#mw-content-text .mw-parser-output")
    if not content:
        return links
    # 1. Early paragraphs: Scan only first 3 paragraphs for high-level concepts
    for p in content.find_all("p", recursive=False, limit=3):
        for link in p.find_all("a", href=True):
            url = urljoin(base_url, link["href"])
            if is_valid_wikipedia_link(url):
                links.add(normalize_url(url))
    # 2. "See Also": Find the header and grab all links in that section
    for heading in soup.find_all(["h2", "h3"]):
        if "see also" in heading.get_text().lower():
            # Get all elements after the heading until the next heading
            current = heading.find_next_sibling()
            while current and current.name not in ["h2", "h3"]:
                for link in current.find_all("a", href=True):
                    url = urljoin(base_url, link["href"])
                    if is_valid_wikipedia_link(url):
                        links.add(normalize_url(url))
                current = current.find_next_sibling()
            break  # Stop once we've processed the section
    return links

The recursive=False, limit=3 parameters tell Beautiful Soup to only examine the top-level paragraphs and stop after the third one. This typically captures key concepts linked in the article's opening. We focus on these sections because:

  • The first 3 paragraphs usually contain the most important related concepts (e.g., the "Google" article mentions Alphabet Inc., Larry Page, search engines).
  • In the "See also" section, Wikipedia editors manually curate related topics here, providing high-quality connections.

This strategy avoids noise from footnote links, navigation elements, and tangentially related articles mentioned deep in the content.
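
To preview which links the crawler would actually follow from a starting page, you can feed the result of scrape_page into extract_links (a hypothetical check – the exact links depend on the article's current revision):

# Hypothetical preview of the links the crawler would enqueue
start_url = "https://en.wikipedia.org/wiki/Google"
result = scrape_page(start_url)
if result:
    related = extract_links(result["soup"], start_url)
    print(f"Found {len(related)} candidate links, for example:")
    for link in sorted(related)[:5]:
        print(" ", link)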

Step 6: Create the crawler class

Add the crawler class that uses breadth-first search (BFS) to explore related pages (read more about crawling vs. scraping):

class WikipediaCrawler:
    def __init__(self, start_url, max_pages=5, max_depth=2):
        # The queue stores tuples: (URL, Depth)
        self.queue = deque([(normalize_url(start_url), 0)])
        self.visited = set()
        self.max_pages = max_pages
        self.max_depth = max_depth

    def crawl(self):
        count = 0
        while self.queue and count < self.max_pages:
            # Get the next URL from the front of the queue
            url, depth = self.queue.popleft()
            # Skip if we've already scraped this
            if url in self.visited:
                continue
            # Skip if we've exceeded max depth
            if depth > self.max_depth:
                continue
            print(f"[{count+1}/{self.max_pages}] [Depth {depth}] Crawling: {url}")
            # 1. Scrape the page
            data = scrape_page(url)
            self.visited.add(url)
            count += 1
            # 2. Find new links (if the scrape was successful)
            if data and data.get("soup"):
                new_links = extract_links(data["soup"], url)
                for link in new_links:
                    if link not in self.visited:
                        self.queue.append((link, depth + 1))
            # 3. Rate limiting
            time.sleep(1.5)

The deque (double-ended queue) allows efficient removal of URLs from the front using popleft(), implementing breadth-first search (FIFO – first in, first out). This means the crawler explores pages level by level rather than diving deep into one branch. Breadth-first search ensures you get a diverse set of related topics at similar conceptual distances from your starting point, rather than following a single chain of links very deep into one specific subtopic.

The crawler tracks depth to prevent going too deep into tangential topics:

  • Depth 0 – starting page (e.g., "Google")
  • Depth 1 – pages directly linked from start (e.g., "Alphabet Inc.", "Larry Page", "Android")
  • Depth 2 – pages linked from depth 1 (e.g., "Stanford University", "Java", "Chromium")

The visited set prevents duplicate scraping – if "Python" links to "C++" and "C++" links back to "Python", we won't scrape "Python" twice. The time.sleep(1.5) pause prevents overwhelming Wikipedia's servers with rapid requests. The if data and data.get('soup') check handles cases where scraping fails (network errors, 404 pages, etc.) – the crawler continues with other pages instead of crashing.

Note that the queue can grow significantly even with max_depth limiting. At depth 1, the queue might contain 20-50 URLs. At depth 2, it could contain 200-500+ URLs. The visited set prevents re-scraping, but all unique URLs get added to the queue until max_pages is reached or the queue is exhausted.

Step 7: Set up the command-line interface

Replace the existing if __name__ == "__main__": block at the bottom of wiki_scraper.py with this CLI implementation using argparse (learn more about running Python in terminal):

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("url", help="Wikipedia article URL")
parser.add_argument("--crawl", action="store_true", help="Enable crawler mode")
parser.add_argument(
"--max-pages", type=int, default=5, help="Maximum pages to scrape"
)
parser.add_argument("--max-depth", type=int, default=2, help="Maximum crawl depth")
args = parser.parse_args()
if args.crawl:
WikipediaCrawler(
args.url, max_pages=args.max_pages, max_depth=args.max_depth
).crawl()
else:
scrape_page(args.url)

Test the scraper and crawler

Open your terminal in the folder containing wiki_scraper.py.

Single page mode – scrape one article:

python wiki_scraper.py "https://en.wikipedia.org/wiki/Google"

This creates an Output_Google folder containing content.md, infobox.json, and a tables/ directory with CSV files.

Crawler mode – collect related topics:

python wiki_scraper.py "https://en.wikipedia.org/wiki/Google" --crawl --max-pages 10 --max-depth 2
The script will print its progress with depth indicators as it discovers and scrapes related pages. Here's what a typical crawl looks like:
[1/10] [Depth 0] Crawling: https://en.wikipedia.org/wiki/Google
Scraped: Google
[2/10] [Depth 1] Crawling: https://en.wikipedia.org/wiki/Initial_public_offering
Scraped: Initial public offering
[3/10] [Depth 1] Crawling: https://en.wikipedia.org/wiki/E-commerce
Scraped: E-commerce
[4/10] [Depth 1] Crawling: https://en.wikipedia.org/wiki/Search_engine
Scraped: Search engine
[5/10] [Depth 1] Crawling: https://en.wikipedia.org/wiki/BBC
Scraped: BBC
[6/10] [Depth 1] Crawling: https://en.wikipedia.org/wiki/Public_company
Scraped: Public company
[7/10] [Depth 1] Crawling: https://en.wikipedia.org/wiki/Super-voting_stock
Scraped: Super-voting stock
[8/10] [Depth 1] Crawling: https://en.wikipedia.org/wiki/Larry_Page
Scraped: Larry Page
[9/10] [Depth 1] Crawling: https://en.wikipedia.org/wiki/Sundar_Pichai
Scraped: Sundar Pichai
[10/10] [Depth 1] Crawling: https://en.wikipedia.org/wiki/Sergey_Brin
Scraped: Sergey Brin

When finished, you'll have up to 10 separate Output_* folders, each containing the full extracted data for that topic. The crawler might scrape fewer than max_pages if it reaches max_depth and runs out of links to explore.
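
Here's roughly what the output folder structure looks like after a crawl like the one above (a sketch – the exact folders depend on which pages were visited, and articles without an infobox or tables will simply lack those files):

Output_Google/
├── content.md
├── infobox.json
└── tables/
    ├── table_1.csv
    └── table_2.csv
Output_Larry_Page/
├── content.md
├── infobox.json
└── tables/
Output_Sergey_Brin/
├── content.md
├── infobox.json
└── tables/
...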

Complete source code

Here’s the full script for reference. You can copy-paste this directly into wiki_scraper.py.

import requests
from bs4 import BeautifulSoup
import html2text
import pandas as pd
import io, os, re, json, random, time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib.parse import urlparse, urljoin
from collections import deque
import argparse

# Rotate user agents to mimic different browsers
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
]

def get_session():
    """Create a requests session with automatic retry on server errors"""
    session = requests.Session()
    # Retry on server errors (5xx) and rate limit errors (429)
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    return session

SESSION = get_session()

def extract_infobox(soup):
    """Extract structured data from Wikipedia infobox"""
    box = soup.select_one("table.infobox")
    if not box:
        return None
    data = {}
    # Extract title
    title = box.select_one(".infobox-above, .fn")
    if title:
        data["title"] = title.get_text(strip=True)
    # Extract key-value pairs
    for row in box.find_all("tr"):
        label = row.select_one("th.infobox-label, th.infobox-header")
        value = row.select_one("td.infobox-data, td")
        if label and value:
            # Regex cleaning: Remove special chars to make a valid JSON key
            key = re.sub(r"[^\w\s-]", "", label.get_text(strip=True)).strip()
            if key:
                data[key] = value.get_text(separator=" ", strip=True)
    return data

def extract_tables(soup, folder):
    """Extract Wikipedia tables and save as CSV files"""
    tables = soup.select("table.wikitable")
    if not tables:
        return 0
    os.makedirs(f"{folder}/tables", exist_ok=True)
    table_count = 0
    for table in tables:
        try:
            dfs = pd.read_html(io.StringIO(str(table)))
            if dfs:
                table_count += 1
                dfs[0].to_csv(f"{folder}/tables/table_{table_count}.csv", index=False)
        except Exception:
            pass
    return table_count

def normalize_url(url):
    """Standardizes URLs (removes fragments like #history)"""
    if not url:
        return None
    # Handle protocol-relative URLs (common on Wikipedia)
    if url.startswith("//"):
        url = "https:" + url
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc.lower()}{parsed.path}"

def is_valid_wikipedia_link(url):
    """Filters out special pages (Files, Talk, Help)"""
    if not url:
        return False
    parsed = urlparse(url)
    if "wikipedia.org" not in parsed.netloc:
        return False
    # We only want articles, not maintenance pages
    skip = [
        "/wiki/Special:",
        "/wiki/File:",
        "/wiki/Help:",
        "/wiki/User:",
        "/wiki/Talk:",
    ]
    return not any(parsed.path.startswith(p) for p in skip)

def extract_links(soup, base_url):
    links = set()
    # Target the main article body
    content = soup.select_one("#mw-content-text .mw-parser-output")
    if not content:
        return links
    # 1. Early paragraphs: Scan only first 3 paragraphs for high-level concepts
    for p in content.find_all("p", recursive=False, limit=3):
        for link in p.find_all("a", href=True):
            url = urljoin(base_url, link["href"])
            if is_valid_wikipedia_link(url):
                links.add(normalize_url(url))
    # 2. "See Also": Find the header and grab all links in that section
    for heading in soup.find_all(["h2", "h3"]):
        if "see also" in heading.get_text().lower():
            # Get all elements after the heading until the next heading
            current = heading.find_next_sibling()
            while current and current.name not in ["h2", "h3"]:
                for link in current.find_all("a", href=True):
                    url = urljoin(base_url, link["href"])
                    if is_valid_wikipedia_link(url):
                        links.add(normalize_url(url))
                current = current.find_next_sibling()
            break  # Stop once we've processed the section
    return links

class WikipediaCrawler:
    def __init__(self, start_url, max_pages=5, max_depth=2):
        # The queue stores tuples: (URL, Depth)
        self.queue = deque([(normalize_url(start_url), 0)])
        self.visited = set()
        self.max_pages = max_pages
        self.max_depth = max_depth

    def crawl(self):
        count = 0
        while self.queue and count < self.max_pages:
            # Get the next URL from the front of the queue
            url, depth = self.queue.popleft()
            # Skip if we've already scraped this
            if url in self.visited:
                continue
            # Skip if we've exceeded max depth
            if depth > self.max_depth:
                continue
            print(f"[{count+1}/{self.max_pages}] [Depth {depth}] Crawling: {url}")
            # 1. Scrape the page
            data = scrape_page(url)
            self.visited.add(url)
            count += 1
            # 2. Find new links (if the scrape was successful)
            if data and data.get("soup"):
                new_links = extract_links(data["soup"], url)
                for link in new_links:
                    if link not in self.visited:
                        self.queue.append((link, depth + 1))
            # 3. Rate limiting
            time.sleep(1.5)

def scrape_page(url):
    # 1. Fetch
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    try:
        response = SESSION.get(url, headers=headers, timeout=15)
    except Exception as e:
        print(f"Error: {e}")
        return None
    # 2. Parse
    soup = BeautifulSoup(response.content, "lxml")
    # 3. Setup output folder
    title_elem = soup.find("h1", id="firstHeading")
    title = title_elem.get_text(strip=True) if title_elem else "Unknown"
    safe_title = re.sub(r"[^\w\-_]", "_", title)
    output_folder = f"Output_{safe_title}"
    os.makedirs(output_folder, exist_ok=True)
    # 4. Extract structured data (before cleaning!)
    infobox = extract_infobox(soup)
    if infobox:
        with open(f"{output_folder}/infobox.json", "w", encoding="utf-8") as f:
            json.dump(infobox, f, indent=2, ensure_ascii=False)
    extract_tables(soup, output_folder)
    # 5. Clean and convert to Markdown
    content = soup.select_one("#mw-content-text .mw-parser-output")
    # Remove noise elements so they don't appear in the text
    junk = [".navbox", ".reflist", ".reference", ".hatnote", ".ambox"]
    for selector in junk:
        for el in content.select(selector):
            el.decompose()
    h = html2text.HTML2Text()
    h.body_width = 0  # No wrapping
    markdown_content = h.handle(str(content))
    with open(f"{output_folder}/content.md", "w", encoding="utf-8") as f:
        f.write(f"# {title}\n\n{markdown_content}")
    print(f"Scraped: {title}")
    return {"soup": soup, "title": title}

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("url", help="Wikipedia article URL")
    parser.add_argument("--crawl", action="store_true", help="Enable crawler mode")
    parser.add_argument(
        "--max-pages", type=int, default=5, help="Maximum pages to scrape"
    )
    parser.add_argument("--max-depth", type=int, default=2, help="Maximum crawl depth")
    args = parser.parse_args()
    if args.crawl:
        WikipediaCrawler(
            args.url, max_pages=args.max_pages, max_depth=args.max_depth
        ).crawl()
    else:
        scrape_page(args.url)

Troubleshooting common issues

Wikipedia constantly updates its layout, and network issues occur. Here are the most common errors and their fixes.

1. AttributeError: 'NoneType' object has no attribute 'text'

The cause: Your script tried to find an element (like the infobox), but it didn't exist on that page. 

The fix: Our code handles this with if not box: return None. Always check if an element exists before accessing its .text property (read more about handling Python errors).

2. HTTP Error 429: Too Many Requests

The cause: You're hitting Wikipedia too fast with requests.

The fix: Increase your delay. Change time.sleep(1.5) to time.sleep(3) in your loop. If the error persists, you'll need proxy rotation to distribute requests across multiple IP addresses (which requires additional infrastructure or a proxy service).
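
If you do add proxies, the requests library accepts them per request through the proxies parameter. Here's a minimal sketch – the endpoint and credentials below are placeholders, not a real proxy service:

# Minimal sketch: route a request through a proxy (placeholder credentials)
proxies = {
    "http": "http://username:password@proxy.example.com:8000",
    "https": "http://username:password@proxy.example.com:8000",
}
response = SESSION.get(
    "https://en.wikipedia.org/wiki/Google",
    headers={"User-Agent": random.choice(USER_AGENTS)},
    proxies=proxies,
    timeout=15,
)
print(response.status_code)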

3. Empty CSVs or JSON files

The cause: Wikipedia likely changed a CSS class name (e.g., infobox became information-box).

The fix: Open the page in your browser, press F12, and re-inspect the element to see the new class name. Update your selector in wiki_scraper.py.

Limitations of DIY scraping

Your Python script is powerful, but running it from your local machine has constraints. As you scale from scraping 10 pages to 10,000, you'll face these challenges:

  1. IP blocks. Wikipedia monitors traffic volume. Sending too many requests from a single IP risks getting blocked entirely.
  2. Maintenance overhead. Wikipedia updates its HTML structure occasionally. When they do, your selectors will break, requiring code updates.
  3. Speed vs. detection. Scraping faster requires parallel requests (threading), but parallel requests increase the chance of being flagged by anti-bot systems. A minimal threading sketch follows this list.
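
For reference, parallelizing the existing scrape_page function takes only a few lines with Python's standard library – a sketch that assumes the functions from wiki_scraper.py are importable and deliberately keeps the worker count low to stay polite:

from concurrent.futures import ThreadPoolExecutor

# Sketch: scrape a handful of known URLs in parallel (reuses scrape_page from wiki_scraper.py)
urls = [
    "https://en.wikipedia.org/wiki/Google",
    "https://en.wikipedia.org/wiki/Larry_Page",
    "https://en.wikipedia.org/wiki/Sergey_Brin",
]
# A small pool keeps the request rate modest; more workers means faster scraping but easier detection
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(scrape_page, urls))
print(f"{sum(r is not None for r in results)} of {len(urls)} pages scraped")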

Tools like Claude or ChatGPT can help you write and debug scrapers faster through AI-assisted coding, but they don't solve infrastructure challenges like IP rotation or scaling. This is where developers often switch to managed solutions.

Scraping Wikipedia with third-party tools

For enterprise-scale data collection, developers often switch to web scraping APIs.

The Decodo solution

The Decodo Web Scraping API handles the complexity we just built. Instead of managing sessions, retries, and parsers yourself, you send a request to the API, and it handles the infrastructure.

Key features:

  • Structured data is returned automatically (you can easily convert extracted HTML to Markdown).
  • Automatic rotation through residential proxies to bypass blocks.
  • Maintenance handled by Decodo when HTML changes.
  • Handles proxy management and CAPTCHAs.
  • Scale to millions of pages without local bandwidth constraints.
  • Direct Markdown output without writing converters.

Implementation example

The Decodo dashboard generates code instantly in cURL, Node.js, or Python.

You can check the Markdown box and enable JS Rendering if the page is dynamic. You can also configure advanced parameters (like proxy location, device type, and more).

For Python, click the Python tab in the dashboard to generate the exact code. Here's the implementation:

import requests

url = "https://scraper-api.decodo.com/v2/scrape"
# Request the page in Markdown format directly
payload = {"url": "https://en.wikipedia.org/wiki/Google", "markdown": True}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": "Basic YOUR_AUTH_TOKEN",
}
# Send the request
response = requests.post(url, json=payload, headers=headers)
# Print the clean Markdown content
print(response.text)

The response returns clean Markdown text directly, ready to be saved or fed into an LLM:

Google was founded on September 4, 1998, by American computer scientists [Larry
Page](/wiki/Larry_Page "Larry Page") and [Sergey Brin](/wiki/Sergey_Brin "Sergey
Brin"). Together, they own about 14% of its publicly listed shares and control
56% of its stockholder voting power through [super-voting
stock](/wiki/Super-voting_stock "Super-voting stock"). The company went
[public](/wiki/Public_company "Public company") via an [initial public
offering](/wiki/Initial_public_offering "Initial public offering") (IPO) in
2004. In 2015, Google was reorganized as a wholly owned subsidiary of Alphabet
Inc. Google is Alphabet's largest subsidiary and is a [holding
company](/wiki/Holding_company "Holding company") for Alphabet's internet
properties and interests. [Sundar Pichai](/wiki/Sundar_Pichai "Sundar Pichai")
was appointed CEO of Google on October 24, 2015, replacing Larry Page, who
became the CEO of Alphabet. On December 3, 2019, Pichai also became the CEO of
Alphabet...
[Response truncated for brevity]

Watch the 2-minute video setup guide here

Comparison: Custom script vs. Decodo API

Here's how your DIY Python script compares to a managed API solution:

  • Setup time – Python script: hours (coding, debugging, testing). Decodo Web Scraping API: minutes (the dashboard generates ready-to-use code).
  • Maintenance – Python script: high (breaks when HTML changes). Decodo Web Scraping API: minimal (infrastructure managed by Decodo).
  • Reliability – Python script: depends on your local IP reputation. Decodo Web Scraping API: enterprise-grade infrastructure.
  • Scalability – Python script: limited by your CPU/bandwidth. Decodo Web Scraping API: high concurrent request capacity.

What to do with your scraped data

You now have a structured dataset of related Wikipedia topics. Here are some ways to use it:

  • AI training data. The Markdown files and infoboxes provide clean text for fine-tuning language models.
  • Knowledge graphs. Parse the infoboxes to build entity relationship databases.
  • Research datasets. Analyze table data across multiple articles for comparative studies.
  • Content analysis. Study how topics connect and what patterns emerge in Wikipedia's knowledge structure.

For projects that grow beyond individual files, consider structured storage solutions like databases or data warehouses.

Best practices

You now have a functional Wikipedia scraper. To keep it running reliably, follow these web scraping best practices:

  1. Check robots.txt. Verify if the website allows scraping.
  2. Rate limiting. Keep your delay enabled. We included time.sleep() for a reason.
  3. Identify yourself. Use a custom User-Agent that includes your contact info.

Conclusion

You have moved from a simple HTML parser to a crawler that explores Wikipedia's knowledge graph, and the next step is clear. To scale past experiments, shift from files to a database, tailor the crawl logic to your goals, add parallel processing for thousands of pages, and use tools like Decodo to handle the infrastructure pain points that come with real-world scale.


Scrape Wikipedia without blocks

Decodo's residential proxies distribute your requests across 115M+ IPs, letting you extract data at scale without hitting rate limits.

About the author

Justinas Tamasevicius

Head of Engineering

Justinas Tamaševičius is Head of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.


Connect with Justinas via LinkedIn.

All information on Decodo Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.


Frequently asked questions

Is scraping Wikipedia possible?

Yes, Wikipedia is one of the easiest sites to scrape due to its consistent HTML structure and predictable CSS selectors. Every article follows the same pattern with standardized elements like #firstHeading for titles and table.infobox for structured data. The main challenge is scaling beyond single articles—you'll need proper rate limiting and potentially proxy rotation to avoid blocks when scraping hundreds or thousands of pages.

Is scraping Wikipedia legal?

Generally, yes, for public non-commercial use, provided you respect their terms. However, mass scraping can be blocked. See: Is web scraping legal?.

Is it possible to download the entire Wikipedia?

Wikipedia publishes regular database dumps (twice per month, starting on the 1st and 20th). While useful for offline archives, they're large (20-24GB compressed, 80-100GB uncompressed for current articles), difficult to parse (MediaWiki XML format), and always outdated by a few weeks. Scraping is better when you need real-time data or only a specific subset of topics (e.g., "just the tech companies").

What's the best way to scrape Wikipedia?

For small projects (under 100 pages), use Python with Beautiful Soup and Requests — it's straightforward and requires minimal setup. For larger datasets (100-10,000 pages), add crawler logic with proper rate limiting and error handling as shown in this tutorial. For enterprise-scale needs (10,000+ pages), consider managed solutions like Decodo's Web Scraping API that handle proxy rotation, structure changes, and infrastructure automatically without the maintenance overhead.
