
How to Web Scrape a Table with Python: a Complete Guide

HTML tables are one of the most common ways websites organize data – financial reports, product listings, sports scores, population statistics. But this data is locked in the webpage's layout. To use it, you need to extract it. This guide will show you how to do it using Python, starting with simple static tables and working up to complex dynamic ones.

Justinas Tamasevicius

Nov 10, 2025

9 min read

Understanding HTML tables

Before you can scrape a table, you need to understand its structure. HTML uses tags to organize content. To see this, visit any webpage with a table, right-click it, and select Inspect (check out our guide on how to inspect elements if you need help).

You only need to know 4 tags:

1. <table> – the entire table container (think: the whole spreadsheet file)
2. <tr> – "table row" – a single row (like one row in a spreadsheet)
3. <th> – "table header" – a header cell (column titles like "Name" or "Price")
4. <td> – "table data" – a data cell (individual values like "$19.99")

Tables are nested: a <table> contains <tr> (rows), which contain <th> (headers) and <td> (data). Understanding this structure is important for choosing the right CSS or XPath selectors later.
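
Put together, a minimal table looks like the markup below. It reuses the sample values from the list above ("Widget" is just a placeholder product name):

<table>
  <tr>
    <th>Name</th>
    <th>Price</th>
  </tr>
  <tr>
    <td>Widget</td>
    <td>$19.99</td>
  </tr>
</table>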

Prerequisites

You'll need Python installed and a few libraries:

  • Requests – send HTTP requests to download web pages
  • BeautifulSoup4 – parse HTML into searchable objects
  • Pandas – organize scraped data into tables and export to CSV/Excel
  • lxml – a fast parser that pandas.read_html needs to read the HTML
  • Selenium – automate browsers for JavaScript-heavy sites

To install all of them, open your terminal (or command prompt) and run the following command:

pip install requests beautifulsoup4 pandas selenium lxml

We'll use requests and BeautifulSoup for static sites, and Selenium when you need to scrape dynamic content.

How to scrape static HTML tables

A static table has all its data in the initial HTML – no JavaScript loading required. There are two ways to scrape them.

Method 1 – the easiest way (pandas.read_html)

For simple static tables, pandas has a function that does the heavy lifting: read_html().

This function scans HTML, finds all <table> tags, and converts them into a list of pandas DataFrames.

import pandas as pd
import requests
from io import StringIO  # Needed to wrap the HTML string
# The URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
# Fetch the HTML content
response = requests.get(url, headers=headers)
html_content = response.text
# Parse all tables from the HTML content
tables = pd.read_html(StringIO(html_content))
print(f"Total tables found: {len(tables)}")
# In this case, the table we want is the 1st one
df = tables[0]
# Display the first 5 rows
print(df.head())
# Save the data to a CSV file
df.to_csv("population_data.csv", index=False)
print("Data saved to population_data.csv")

When you run the script, the console prints the number of tables found, the first five rows of the DataFrame, and a confirmation that the data was saved to population_data.csv.

Method 2 – the manual way (BeautifulSoup)

Sometimes pandas.read_html fails, or you need more control. Here's the manual approach using Beautiful Soup.

We'll scrape the same Wikipedia page in 5 clear steps.

Step 1: Fetch the web page

Use requests to download the page's HTML.

import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    print("Successfully fetched the page")
    html_content = response.text
else:
    print(f"Error fetching page: Status code {response.status_code}")
    exit()

Step 2: Parse HTML with BeautifulSoup

Pass the raw HTML to BeautifulSoup to create a searchable object.

soup = BeautifulSoup(html_content, 'html.parser')

Step 3: Locate and extract the table

Right-click the table in your browser and select Inspect. You'll see it has a class attribute of "wikitable".

Use this class to find the table, then extract all headers (<th>) and rows (<tr>).

table = soup.find('table', attrs={'class': 'wikitable'})
if table is None:
    print("Could not find the table. Check your selectors.")
    exit()
# --- Get Headers ---
headers_html = table.find_all('th')
headers = [th.text.strip() for th in headers_html]
print(f"Headers found: {headers}")
# --- Get All Rows ---
rows_html = table.find('tbody').find_all('tr')
all_rows = []
for row in rows_html:
    cells_html = row.find_all('td')
    row_data = [cell.text.strip() for cell in cells_html]
    if row_data:
        all_rows.append(row_data)

Step 4: Convert data to a pandas DataFrame

Convert your list of lists into a structured DataFrame.

A pandas DataFrame is a powerful, in-memory table, similar to a spreadsheet. While our all_rows variable is just a Python list of lists, converting it to a DataFrame allows us to easily label the data with our headers, clean it, analyze it, and, most importantly, export it to formats like CSV or Excel with a single command.

df = pd.DataFrame(all_rows, columns=headers)
print(df.head())

Step 5: Save the data

Export your DataFrame to CSV.

df.to_csv('scraped_data_manual.csv', index=False)

The result is a CSV file with the same headers and rows as the original table.

This method exercises the fundamentals of Python web scraping and gives you complete control. You can further modify the code to extract only specific rows or columns, or to filter for relevant data.

Quick tip – always try method 1 first. It's fast and simple. If it fails or you need more control, use method 2.

Handling complex table structures (colspan and rowspan)

Some tables have merged cells using colspan (spanning multiple columns) or rowspan (spanning multiple rows).

The issue – a simple loop will put data in the wrong columns.

The fix – try pandas.read_html first. It's often smart enough to correctly parse tables with colspan and rowspan automatically.

Let's take the comparison of web browsers page as an example. The row for the "Arora" browser uses rowspan to span 6 rows, and the "Latest release" header uses colspan to span two columns ("Version" and "Date").

Let's see if pandas can handle it.

import pandas as pd
import requests
from io import StringIO
# Configuration
WIKI_URL = 'https://en.wikipedia.org/wiki/Comparison_of_web_browsers'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36'
OUTPUT_FILE = 'browser_comparison.csv'
# Fetch page
response = requests.get(WIKI_URL, headers={'User-Agent': USER_AGENT})
html_content = response.text
# Extract all tables
all_tables = pd.read_html(StringIO(html_content))
print(f"Found {len(all_tables)} tables")
# Find the correct table by checking its columns
for index, table in enumerate(all_tables):
    if 'Browser' in str(table.columns):
        print(f"\nBrowser table found at index {index}")
        
        # Clean the multi-level headers from pandas
        table.columns = ['_'.join(col).strip() for col in table.columns.values]
        print(table.head(10))
        table.to_csv(OUTPUT_FILE, index=False)
        print(f"\nSaved to {OUTPUT_FILE}")
        break

You'll then get a CSV with the complex headers and rows successfully flattened.

When pandas.read_html (Method 1) fails on a complex table, you can use a specialized library like html-table-extractor. Its primary purpose is to correctly parse complex tables, automatically expanding merged cells to build a clean, rectangular grid of data.

First, you'll need to install it:

pip install html-table-extractor

Here is the code combining BeautifulSoup to find the table and html-table-extractor to parse it:

import requests
import csv
from bs4 import BeautifulSoup
from html_table_extractor.extractor import Extractor
# Configuration
WIKI_URL = "https://en.wikipedia.org/wiki/Comparison_of_web_browsers"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
OUTPUT_FILE = "browser_comparison_extractor.csv"
print(f"Fetching page: {WIKI_URL}")
response = requests.get(WIKI_URL, headers={"User-Agent": USER_AGENT})
html_content = response.text
soup = BeautifulSoup(html_content, "html.parser")
all_tables = soup.find_all("table")
print(f"Found {len(all_tables)} tables. Searching for the correct one...")
for index, table in enumerate(all_tables):
    try:
        # We convert the 'table' object back to a string
        extractor = Extractor(str(table))
        extractor.parse()
        table_data = extractor.return_list()
        if table_data:
            # Clean header values (e.g., "Browser\n" becomes "Browser")
            headers = [str(header).strip() for header in table_data[0]]
            if "Browser" in headers:
                print(f"\nBrowser table found at index {index}")
                with open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as f:
                    writer = csv.writer(f)
                    writer.writerows(table_data)
                print(f"Successfully saved to {OUTPUT_FILE}")
                # Break the loop since we found the first one
                break
    except Exception as e:
        print(f"Could not parse table at index {index}{e}")

Cleaning and saving your data

Raw scraped data is usually messy. For example, in the DataFrame we created in Method 2 (BeautifulSoup), the Population column contains strings like "1,417,492,000". Because that's a string, not a number, you can't perform calculations with it. Let's clean it.

You can add the following lines to your Method 2 script to clean the data. This code continues directly from the df variable you created earlier:

# This code extends the Method 2 (BeautifulSoup) example.
# It assumes 'df' is the DataFrame we already created.
# The 'Population' column is a string: '1,417,492,000'
# Remove commas
df['Population_Clean'] = df['Population'].str.replace(',', '', regex=False)
# Convert to numeric
df['Population_Clean'] = pd.to_numeric(df['Population_Clean'])
# Now you can perform calculations
print(f"Total population of top 5: {df.head(5)['Population_Clean'].sum()}")

Once clean, save your data. For more formats (JSON, databases, etc.), check out our guide on how to save your scraped data.

# Save to CSV
df.to_csv('population_data_clean.csv', index=False)
# Save to JSON
df.to_json('population_data_clean.json', orient='records')

How to scrape dynamic (JavaScript) tables

You run your scraper, but the table data is empty. You inspect the HTML and see "Loading..." or an empty <div>. This happens because the website loads data dynamically with JavaScript. requests.get() only retrieves the initial HTML – it doesn't execute JavaScript. If the table data loads after the page renders, your scraper won't see it.

Solution #1: Find the hidden API

Many sites load data through background API requests. You can intercept this API and request the data directly:

  • Open Developer Tools (F12) in your browser
  • Go to the Network tab – filter by Fetch/XHR
  • Reload the page
  • Look for requests returning JSON data (that's your table content)

Financial sites like Yahoo Finance and Google Finance commonly use this pattern – with the Fetch/XHR filter on, you'll spot requests that return clean JSON data.

Once you find the right request, right-click it and select Copy as cURL (or note the URL, headers, and parameters). You can then recreate that request in Python.

A pro tip for converting that "Copy as cURL" command into Python is to use an online tool like the curl converter. You can paste the entire cURL command, and it will automatically generate the equivalent code using the requests library.
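
As a rough sketch, the recreated request might look like the code below. The URL here is only a placeholder – use the exact endpoint, query parameters, and headers you copied from the Network tab:

import requests
import json

# Placeholder endpoint – replace with the exact URL copied from the Network tab
API_URL = "https://example.com/api/quote?symbol=NVDA"

# Reuse the headers from the copied request; many APIs reject requests
# that are missing a browser-like User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
}

response = requests.get(API_URL, headers=headers)
response.raise_for_status()  # Fail loudly if the request is blocked

# The endpoint returns JSON, so there's no HTML to parse at all
print(json.dumps(response.json(), indent=4))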

Running a request like this prints the clean JSON data, which will look something like this:

{
    "marketCap": {
        "raw": 4.580887901397705E12,
        "fmt": "4.581T",
        "longFmt": "4,580,887,901,397"
    },
    "ticker": "NVDA",
    "avgDailyVol3m": {
        "raw": 1.818218125E8,
        "fmt": "181.822M",
        "longFmt": "181,821,812"
    },
...
}

Important note: If your request fails or you get no data, the website is likely blocking your script. To get around this, you need to make your request look more like a real user. We cover the techniques for this in the "Best practices and ethical considerations" section later in this guide.

Solution #2: Use a headless browser

If there's no API to target, you'll need to automate a real browser to render the JavaScript. When the browser runs without a visible window, it's called a headless browser. Tools like Selenium will:

  • Open a real browser (like Chrome)
  • Wait for the JavaScript to execute and the table to load
  • Give you the final rendered HTML to parse

This is also the only way to handle dynamic pagination – when you need to click Next buttons that trigger JavaScript. We'll cover that in the next section.

This approach is slower and more resource-intensive than direct API requests, so always check for hidden APIs first.
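
Here's a minimal sketch of that flow, assuming Chrome is installed and using the same Datatables example page that appears in the pagination section below – load the page, give the JavaScript a moment to run, then hand the rendered table to pandas:

import pandas as pd
from io import StringIO
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from time import sleep

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

# Load the page and give the JavaScript time to render the table
driver.get("https://datatables.net/examples/data_sources/ajax.html")
sleep(2)

# Grab the rendered table's HTML and hand it to pandas
table_html = driver.find_element(By.ID, "example").get_attribute("outerHTML")
driver.quit()

df = pd.read_html(StringIO(table_html))[0]
print(df.head())

The next section builds on this pattern to click through multiple pages.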

How to scrape paginated tables

Tables are often split across multiple pages ("page 1", "page 2", etc.). The key is figuring out how the "Next" button works.

Case #1: Static pagination

What to look for – click "Next" and check if the URL changes (e.g., to /?page=2). If it does, you're dealing with a static site.

Solution – use requests in a loop. Take CoinMarketCap as an example. It uses URLs like /?page=2, /?page=3. If you inspect the "Next" button, you'll see it's just a simple link (<a> tag) with a static href attribute.

Because the URL is simple and predictable, you can scrape it with a requests and pandas loop:

import pandas as pd
import requests
from io import StringIO
from time import sleep
all_data = []
page = 1  # Start from the first page
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"
}
while True:
    print(f"Page {page}...", end=" ")
    response = requests.get(f"https://coinmarketcap.com/?page={page}", headers=headers)
    if response.status_code != 200:
        break
    try:
        all_data.append(pd.read_html(StringIO(response.text))[0])
    except ValueError:  # pandas found no table on this page
        break
    if (
        'aria-disabled="true"' in response.text
        or "?page=" + str(page + 1) not in response.text
    ):
        print("Last page!")
        break
    page += 1
    sleep(2)
pd.concat(all_data, ignore_index=True).to_csv("coinmarketcap_data.csv", index=False)
print(f"\nDone! Saved {len(all_data)} pages")

Case #2: Dynamic pagination

What to look for – click "Next" and check if the URL stays the same, but the table content changes. This means the site uses dynamic pagination.

Solution – use a headless browser like Selenium to simulate clicking the "Next" button.

Let's scrape Datatables as an example. This page loads its data and pagination using JavaScript.

import pandas as pd
from io import StringIO
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from time import sleep
url = "https://datatables.net/examples/data_sources/ajax.html"
all_data = []
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get(url)
sleep(2)  # Wait for the page and initial JavaScript to load
page = 1  # Page counter, used only for logging
while True:
    print(f"Scraping page {page}...", end=" ")
    table_html = driver.find_element(By.ID, "example").get_attribute("outerHTML")
    if not table_html:
        break
    df = pd.read_html(StringIO(table_html))[0]
    all_data.append(df)
    print("Done.")
    # Check if the "Next" button is disabled
    try:
        next_btn = driver.find_element(By.CSS_SELECTOR, "button.dt-paging-button.next")
        btn_class = next_btn.get_attribute("class") or ""
        if "disabled" in btn_class:
            print("Last page reached!")
            break
        # If not disabled, click it and wait
        next_btn.click()
        sleep(1)  # Wait for the table to reload
        page += 1
    except Exception:
        print("Could not find 'Next' button or it broke.")
        break
driver.quit()
final_df = pd.concat(all_data, ignore_index=True)
final_df.to_csv("datatables_data.csv", index=False)
print(f"\nDone! Saved {len(all_data)} pages to datatables_data.csv")

This same logic can be applied using other modern tools like Playwright for web scraping.

The modern method: Can't AI scrape tables?

This is the most common question today, and the short answer is "yes, but with caveats".

AI as a co-pilot

AI chatbots are great for generating code. Copy a table's HTML, paste it into ChatGPT or Claude, and ask it to write a BeautifulSoup script. You'll get working code in seconds instead of writing it manually. You can learn more about using ChatGPT for web scraping or Claude for web scraping in our other guides.

But chatbots have limits. They write code – they don't run it for you. They can't render JavaScript, manage proxies, or handle getting blocked. And when a website changes its HTML (which happens constantly), the code breaks.

AI as the scraper itself

A real AI scraping solution doesn't just generate code – it understands the page structure and adapts to changes. Instead of writing brittle CSS selectors like soup.find(class_="wikitable"), you tell it what data you want in plain English.

That's what Decodo's AI Parser does. It's one of the best AI data collection tools because it stays resilient when websites change. For more on this approach, see how Decodo handles AI data collection.

Best practices and ethical considerations

Scraping is powerful, but you need to be respectful. Hammering a server with hundreds of requests per second will get you blocked fast. Here's how to scrape responsibly (and avoid getting banned):

  • Check robots.txt first. Most websites have a file at website.com/robots.txt that states rules for bots. We have a full guide on how to check if a website allows scraping.
  • Respect terms of service (ToS). The ToS is a legal document. Review it to ensure you aren't violating policies, especially for commercial use. This helps answer the question, "is web scraping legal?".
  • Use a proxy network. Your home IP address is a dead giveaway to a website. After 10 or 100 requests, you will get blocked. A proxy, especially from a residential proxy network, routes your request through a different, real device, making you look like a regular, unique user every time. This is the key to reliable scraping at scale.
  • Use a User-Agent. As shown in our examples, a User-Agent string makes your request look like it's from a browser, not a script. This bypasses many basic anti-bot systems.
  • Add delays between requests. Add time.sleep(2) inside your loops. This is polite to the server and makes you look less like a bot. (You can even build a Python requests retry system to handle this automatically – see the sketch after this list.)
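
As an illustration, here's a minimal sketch that combines the last two points – a browser-like User-Agent, a pause between requests, and a simple retry with backoff. The URL is just a placeholder; plug in whatever site you're scraping:

import time
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
}

def polite_get(url, retries=3, delay=2):
    """Fetch a URL with a browser-like User-Agent, retrying on rate limits and server errors."""
    for attempt in range(1, retries + 1):
        response = requests.get(url, headers=HEADERS, timeout=10)
        if response.status_code in (429, 500, 502, 503):
            # Back off a little longer on each retry
            time.sleep(delay * attempt)
            continue
        response.raise_for_status()  # Raise on other errors (403, 404, ...)
        return response
    raise RuntimeError(f"Giving up on {url} after {retries} attempts")

# Example usage – the URL is a placeholder
for page in range(1, 4):
    response = polite_get(f"https://example.com/?page={page}")
    print(f"Page {page}: {len(response.text)} bytes")
    time.sleep(2)  # Pause between pages to avoid hammering the server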

Troubleshooting common issues

The most common error is AttributeError: 'NoneType' object has no attribute 'find'. It means you tried to call a method on None, which happens when your selector doesn't find anything.

# Your selector has a typo or wrong class name
table = soup.find('table', attrs={'class': 'my-typo-class'})  # Returns None
# This next line is where it crashes
rows = table.find_all('tr')  # Can't call .find_all() on None!

The fix – check the line before the crash. Your soup.find() returned None because it didn't match anything. Double-check your selector for typos or inspect the actual HTML to see what the class name really is.

Other common issues

Here are some common scraping errors and fixes:

  • 404 Not Found. Your URL is wrong, or the page doesn't exist. Check for typos.
  • 403 Forbidden. The server is blocking you. Add a User-Agent header to look like a browser. If that doesn't work, you need a proxy to hide your identity.
  • 429 Too Many Requests. You're scraping too fast. Add time.sleep(2) between requests to slow down. If you still get blocked, use a proxy network to rotate your IP address.
  • Empty data / missing content. The data is probably loaded with JavaScript. Use your browser's Network tab to find the hidden API endpoint, or switch to headless browser automation tools to wait for the page to fully load.

Bottom line

You've learned how to scrape tables in Python – from simple pandas one-liners to complex Selenium setups. These tools work great for small projects and learning.

But scrapers are fragile, and websites change their HTML constantly. Scaling to thousands of pages means dealing with proxies, headless browsers, and CAPTCHAs. If that sounds tedious, there's another option.

Instead of building and maintaining scrapers yourself, you can use the Decodo Web Scraping API. Send a URL, the API handles the messy stuff (proxies, JavaScript rendering, anti-bot measures), and you get clean JSON back.

Plus, you can integrate with LangChain and n8n if you're building AI agents or automation workflows.

Try Web Scraping API for free

Collect real-time data without a single line of code – start your 7-day free trial today.

About the author

Justinas Tamasevicius

Head of Engineering

Justinas Tamaševičius is Head of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.


Connect with Justinas via LinkedIn.

All information on Decodo Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Which Python libraries are best for scraping HTML tables?

For simple static tables, pandas.read_html() is the fastest option. For more control, use requests to fetch the page and BeautifulSoup to parse it. For JavaScript-heavy sites, use Selenium or Playwright.

How can I scrape tables from websites that use JavaScript to load data?

Two approaches:

  • Find the hidden API – use your browser's Network tab (filter by Fetch/XHR) to find the API request returning JSON data. This is faster and more reliable than browser automation.
  • Use a headless browser – if there's no API, use Selenium or Playwright to load the page, execute the JavaScript, and parse the final HTML.

What should I do if the table structure is complex (e.g., merged cells, multi-level headers)?

Try pandas.read_html() first – it handles colspan and rowspan automatically in most cases. If it fails, use a library like html-table-extractor (covered earlier in this guide) or write a custom BeautifulSoup parser to handle the cell structure manually.

How do I export scraped table data to a .csv or Excel file?

Once you have a pandas DataFrame (df), it's a one-line job:

  • CSV – df.to_csv('filename.csv', index=False)
  • Excel – df.to_excel('filename.xlsx', index=False)
