Date: 2025-02-20

Understanding HTML tables

Before you can scrape a table, you need to understand its structure. HTML uses tags to organize content. To see this, visit any webpage with a table, right-click it, and select Inspect (check out our guide on how to inspect elements if you need help).

You only need to know four tags:

1. <table> – the entire table container (think: the whole spreadsheet file)

2. <tr> – "table row" – a single row (like one row in a spreadsheet)

3. <th> – "table header" – a header cell (column titles like "Name" or "Price")

4. <td> – "table data" – a data cell (individual values like "$19.99")

Tables are nested: a <table> contains <tr> (rows), which contain <th> (headers) and <td> (data). Understanding this structure is important for choosing the right CSS or XPath selectors later.
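To make the nesting concrete, here is a minimal sketch that walks a small table and records which cells sit inside which row. It uses only Python's standard-library html.parser, so it runs before any third-party library is installed. The sample markup and the names SAMPLE_HTML, TableOutline, and outline_table are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>$19.99</td></tr>
  <tr><td>Gadget</td><td>$5.00</td></tr>
</table>
"""


class TableOutline(HTMLParser):
    """Collect each <tr> as a list of (cell_tag, text) pairs."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list per <tr>
        self._cell_tag = None   # "th" or "td" while inside a cell
        self._text = []         # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            if not self.rows:   # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._cell_tag = tag
            self._text = []

    def handle_data(self, data):
        if self._cell_tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and tag == self._cell_tag:
            self.rows[-1].append((tag, "".join(self._text).strip()))
            self._cell_tag = None


def outline_table(html):
    """Return the rows of the table(s) in html as lists of (tag, text)."""
    parser = TableOutline()
    parser.feed(html)
    return parser.rows


rows = outline_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

The first row contains only th cells (the column titles) and the remaining rows contain only td cells (the values). That same distinction is what a scraper relies on when it separates headers from data. This sketch ignores nested tables and attributes such as colspan.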

Prerequisites

You'll need Python installed and a few libraries:

Requests – send HTTP requests to download web pages

BeautifulSoup4 – parse HTML into searchable objects

Pandas – organize scraped data into tables and export to CSV/Excel

lxml – a fast parser that pandas.read_html needs to read the HTML

Selenium – automate browsers for JavaScript-heavy sites
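If you are unsure which of the libraries listed above are already available, a short standard-library check like the sketch below can tell you. The helper name check_installed and the IMPORT_NAMES mapping are assumptions for illustration. The import names themselves are the ones these packages use, and BeautifulSoup4 is imported as bs4.

```python
from importlib.util import find_spec

# Display label -> module name used in an import statement.
# BeautifulSoup4 is imported as "bs4", which differs from its package name.
IMPORT_NAMES = {
    "Requests": "requests",
    "BeautifulSoup4": "bs4",
    "Pandas": "pandas",
    "lxml": "lxml",
    "Selenium": "selenium",
}


def check_installed(import_names=IMPORT_NAMES):
    """Map each label to True if its module can be found, else False."""
    return {
        label: find_spec(module) is not None
        for label, module in import_names.items()
    }


status = check_installed()
for label, ok in status.items():
    print(f"{label}: {'installed' if ok else 'missing'}")
```

find_spec only locates a module without importing it, so the check stays fast and has no side effects. It confirms that a module is present, not that its version is recent enough.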

To install all of them, open your terminal (or command prompt) and run the following command: