Python Extract Text From HTML: A Step-by-Step Guide With Code Examples
Extracting text from HTML in Python is one of the most common tasks in web scraping, NLP pipelines, search indexing, and data preparation. The goal is to keep the visible content from a webpage while removing all the HTML markup, scripts, and styles that surround it. This guide walks you through the popular Python libraries for HTML text extraction and a full step-by-step workflow to go from raw HTML to clean, production-ready text.
Lukas Mikelionis
Last updated: May 28, 2026
14 min read

TL;DR
- Python workflows for extracting text from HTML use libraries like Beautiful Soup, lxml, selectolax, html2text, Inscriptis, or Trafilatura, each suited to a different use case.
- Strip <script> and <style> tags before calling get_text(), or JavaScript ends up mixed into your output.
- Use Beautiful Soup for one-off scripts, selectolax for large-scale extraction, Inscriptis when layout quality matters, and Trafilatura when you only want the article body.
- Clean extracted text after parsing by normalizing whitespace, Unicode, HTML entities, and invisible characters.
- Choose your parser based on speed, formatting quality, and page complexity.
Popular Python libraries for HTML text extraction
Python offers several reliable libraries for extracting visible text from HTML. Some libraries focus on simplicity, while others prioritize speed, formatting quality, or boilerplate removal.
Choosing the right tool depends on your workload, the quality of the HTML, and the type of output you need.
Here’s a quick overview of the popular Python libraries for HTML text extraction:
Beautiful Soup
Beautiful Soup is one of the most widely used Python scraping libraries and the default starting point for most developers. It handles broken HTML well, works with multiple parser backends, and provides the familiar get_text() method for fast extraction.
It’s also beginner-friendly and well-documented. The main limitation is scale: Beautiful Soup’s parsing overhead adds up when processing thousands of documents.
lxml and lxml.html
The lxml library is faster and stricter than Beautiful Soup and supports full XPath queries, which lets you write precise, readable selectors that CSS can’t express. It works especially well for structured XML and HTML text extraction, precise element targeting, and large-scale parsing.
Meanwhile, lxml.html is the specialized submodule for working with HTML documents. Developers often combine it with HTML-to-text utilities for cleaner plain-text conversion without walking the DOM manually.
selectolax
This popular Python library is built as a Cython wrapper that uses the Lexbor C engine underneath to parse HTML and query it with CSS selectors. This makes it extremely fast for high-volume extraction, where Beautiful Soup’s speed overhead becomes a real cost.
It’s a strong choice when you need to get text from HTML across thousands of pages daily. The API is smaller than Beautiful Soup’s, but parsing speed often compensates for the learning curve.
html2text
The html2text library converts HTML into Markdown-style text instead of plain strings. This format works well for LLM prompts, documentation generation, and note-taking systems.
The output preserves link text, headings, and basic list structure, which makes it useful when you care about the document’s shape, not just its words.
Inscriptis
This library is built for conversion quality. It handles lists and table layouts noticeably better than Beautiful Soup or html2text, because it applies CSS-style rendering rules during conversion. If your job is to produce text that reads the way a human would read the page, with preserved indentation and column separation, then Inscriptis is worth the extra dependency.
Trafilatura, readability-lxml, and justext
These tools specialize in boilerplate removal. Instead of returning all the text on a page, they return only the main content, automatically stripping navigation menus, footers, ads, cookie banners, and sidebars. Trafilatura is the most actively maintained and generally the most reliable. Use any of them when you scrape article-style pages and want the main body text without writing selector rules.
Standard library options (html.parser and html.unescape)
Python’s built-in html.parser and html.unescape() can handle small scripts without external dependencies, but they are not ideal for large or real-world scraping.
Comparing extraction methods
Different Python HTML-to-text libraries solve different problems, and choosing without a clear reference can force you to switch libraries later. Here is a side-by-side comparison to help you pick the right tool before building your extraction workflow.
| Library | Best for | Speed | Conversion quality | Learning curve | Install size |
| --- | --- | --- | --- | --- | --- |
| Beautiful Soup | General-purpose extraction, one-off scripts | Slow to medium | Basic | Easy | Small (~500 KB) |
| lxml.html | Structured parsing and XPath | Fast | Good | Moderate | Medium (~3 MB) |
| selectolax | High-volume extraction | Very fast | Basic | Moderate | Small (~200 KB) |
| html2text | Markdown output for LLMs/docs | Medium | Markdown only | Easy | Small (~100 KB) |
| Inscriptis | Readable plain-text formatting | Medium | Excellent (tables/lists) | Medium | Tiny (~50 KB) |
| Trafilatura | Article body extraction | Fast | Excellent (article only) | Easy | Medium (~5 MB) |
Performance benchmarks vary depending on parser configuration and HTML complexity. Published parser comparisons, including benchmarks from Rushter (the author of selectolax), often show selectolax performing roughly 5x to 30x faster than Beautiful Soup on large workloads.
Here is a quick decision guide: Use Beautiful Soup for one-off scripts and beginner projects. Use the selectolax library when speed matters most. Choose Inscriptis if readable formatting is your top priority. Use html2text when you need Markdown output. Choose Trafilatura when you only want article content.
For quick reference, use this decision tree:
A side note you need to know:
- The nltk.clean_html function has been deprecated since 2014 and now raises NotImplementedError. Some outdated Stack Overflow answers still reference it, so avoid copying those snippets into production workflows.
- Regex-only HTML stripping recipes also appear frequently in tutorials on removing HTML tags in Python. Regex can help clean already-extracted plain text, but it breaks on nested or malformed HTML. Parse the HTML first, then clean the extracted text afterward.
Step-by-step: a minimum viable extraction
The fastest way to extract text from HTML in Python is a simple workflow: load your HTML, parse it, strip noisy elements, extract visible text, and clean the output. Beautiful Soup is the right default here. It is readable, forgiving, and well-documented.
Step 1: Install the libraries
Install Beautiful Soup and lxml first (the PyPI package names are beautifulsoup4 and lxml).
Installing lxml alongside Beautiful Soup matters. Beautiful Soup is a parsing interface, not a parser, so it needs a backend. Python’s built-in html.parser works out of the box, but lxml is the fastest option and the right default.
Step 2: Load your HTML
You can load HTML from a local file, a string, or a requests.get() response. For this example, we’ll use a simple HTML string.
For live URLs, use html = requests.get(url).text. We’ll cover encoding pitfalls with live pages in the real-world section below.
Step 3: Parse with Beautiful Soup
The second argument to BeautifulSoup() tells it which parser backend to use, and "lxml" is usually the best default. Passing it explicitly avoids a warning and keeps behavior consistent across environments.
If you hit parser issues, explore our guide on what to do when getting parsing errors in Python.
Step 4: Strip script and style tags
Next, remove noisy elements:
The decompose() method removes the tag and its contents from the tree and destroys them. The extract() method also removes the tag from the tree, but returns it, so the element stays available in memory. Use extract() if you need the removed element for something else.
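To make the difference concrete, here is a small sketch. The HTML snippet and variable names are made up for illustration; only decompose(), extract(), and get_text() come from Beautiful Soup itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from a real page.
html = "<div><p>Keep me</p><script>track()</script><aside>Side note</aside></div>"
soup = BeautifulSoup(html, "lxml")

# decompose(): the tag is removed from the tree and destroyed.
soup.script.decompose()

# extract(): the tag is removed from the tree but returned for reuse.
removed = soup.aside.extract()

remaining_text = soup.get_text(" ", strip=True)
removed_text = removed.get_text(strip=True)

print(remaining_text)  # Keep me
print(removed_text)    # Side note
```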
Step 5: Extract text
The separator argument inserts a string between each piece of text. A single space works for flat output; use "\n\n" if you need paragraph breaks. The strip=True argument strips leading and trailing whitespace from each text piece before joining, but it doesn’t collapse whitespace in the middle of strings.
When you run the script above, you’ll get the following output:
The script and style contents are gone. What remains is the text contained in the page markup.
Here are the raw HTML and the cleaned text output side by side, so you can see what changed.
Raw HTML:
Full script:
Output:
Notice that the hidden div still appears. HTML parsers don’t evaluate CSS visibility rules automatically.
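Steps 2 through 5 can be sketched end to end as follows. The sample HTML is an assumption made up for this illustration, so treat it as a stand-in for your own page rather than the exact input used above.

```python
from bs4 import BeautifulSoup

# Step 2: load HTML (a hypothetical sample string).
html = """
<html>
  <head>
    <title>Sample page</title>
    <style>body { font-family: sans-serif; }</style>
    <script>console.log("tracking");</script>
  </head>
  <body>
    <h1>Hello, world</h1>
    <p>This is a <b>sample</b> paragraph.</p>
    <div style="display:none">Hidden promo text</div>
  </body>
</html>
"""

# Step 3: parse with the lxml backend.
soup = BeautifulSoup(html, "lxml")

# Step 4: remove script and style so their contents cannot reach the output.
for tag in soup(["script", "style"]):
    tag.decompose()

# Step 5: join the remaining text pieces with a single space.
text = soup.get_text(separator=" ", strip=True)
print(text)
# Sample page Hello, world This is a sample paragraph. Hidden promo text
```

The hidden div's text is still in the result because the parser never evaluates the inline display:none rule.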
Extracting specific content or data structures
Most real tasks don’t need all the text from a page. They need a specific element, a set of matching tags, or structured data embedded in the HTML.
Python libraries make this easy once you understand selectors and structured extraction patterns. We’ll cover the patterns for targeting what you actually want and leaving the rest behind.
Targeting tags by name and attribute
Beautiful Soup provides find() and find_all() methods for targeted extraction.
The following script fetches book titles from books.toscrape.com using find_all() with a tag name, then filters by attribute:
Create a file called extract_titles.py, then add the script:
Run it with:
Output:
The class_ spelling exists because class is a reserved word in Python. Beautiful Soup handles it cleanly with the underscore.
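As an offline sketch of the same pattern, the snippet below runs find_all() against an inline sample instead of the live site. The markup only imitates the product-card structure of books.toscrape.com and is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled loosely on a product listing page.
html = """
<article class="product_pod">
  <h3><a href="book-1/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
<article class="product_pod">
  <h3><a href="book-2/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
</article>
<article class="promo">
  <h3><a href="#" title="Not a book">Promo</a></h3>
</article>
"""

soup = BeautifulSoup(html, "lxml")

titles = []
# class_ filters on the class attribute; title=True keeps only links that have a title.
for card in soup.find_all("article", class_="product_pod"):
    link = card.find("a", title=True)
    if link is not None:
        titles.append(link["title"])

print(titles)  # ['A Light in the Attic', 'Tipping the Velvet']
```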
CSS selectors with select() and select_one()
CSS selectors are easier to read than nested find() calls when targeting nested elements or using attribute patterns:
Output:
Nested selectors are especially useful when you parse HTML with repeated structures.
The selector article.product_pod h3 a targets links nested inside each product card.
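A minimal sketch of select() and select_one() with that selector, run against a made-up inline sample (the markup and prices are assumptions, not live data):

```python
from bs4 import BeautifulSoup

# Hypothetical product cards for illustration.
html = """
<article class="product_pod">
  <h3><a title="Book One" href="one.html">Book One</a></h3>
  <p class="price_color">10.00</p>
</article>
<article class="product_pod">
  <h3><a title="Book Two" href="two.html">Book Two</a></h3>
  <p class="price_color">12.50</p>
</article>
"""

soup = BeautifulSoup(html, "lxml")

# select() returns every match; select_one() returns the first match or None.
titles = [a["title"] for a in soup.select("article.product_pod h3 a")]
first_price = soup.select_one("article.product_pod p.price_color")
price_text = first_price.get_text(strip=True) if first_price is not None else None

print(titles)      # ['Book One', 'Book Two']
print(price_text)  # 10.00
```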
XPath with lxml.html
XPath handles advanced conditions and complex relationships that CSS selectors can’t express as cleanly.
Create a file called xpath_example.py and add the following script:
Run:
Output:
XPath is especially useful for sibling relationships, ancestor checks, and conditional matching.
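Here is a small sketch of a sibling-based condition with lxml.html. The list markup and class names are hypothetical; the query selects titles whose sibling stock label reads "In stock".

```python
from lxml import html as lxml_html

# Hypothetical markup: each item has a title span and a stock span as siblings.
page = """
<ul>
  <li class="book"><span class="title">Book A</span><span class="stock">In stock</span></li>
  <li class="book"><span class="title">Book B</span><span class="stock">Out of stock</span></li>
  <li class="book"><span class="title">Book C</span><span class="stock">In stock</span></li>
</ul>
"""

tree = lxml_html.fromstring(page)

# Start from the matching stock span, then step back to its title sibling.
in_stock_titles = tree.xpath(
    '//span[@class="stock"][normalize-space()="In stock"]'
    '/preceding-sibling::span[@class="title"]/text()'
)

print(in_stock_titles)  # ['Book A', 'Book C']
```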
Pulling text from tables
For structured table data, pandas.read_html() is often the fastest and easiest option.
This requires lxml or Beautiful Soup as the underlying parser. It works well on well-formed tables, but can misbehave silently on rowspan and colspan oddities. For irregular tables, fall back to Beautiful Soup’s find_all("tr") and walk rows manually.
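The manual fallback can look roughly like the sketch below. The table content is a made-up sample, and this simple version assumes one header row and no rowspan or colspan handling.

```python
from bs4 import BeautifulSoup

# Hypothetical table for illustration.
html = """
<table>
  <tr><th>Library</th><th>Speed</th></tr>
  <tr><td>Beautiful Soup</td><td>Slow to medium</td></tr>
  <tr><td>selectolax</td><td>Very fast</td></tr>
</table>
"""

soup = BeautifulSoup(html, "lxml")

rows = []
for tr in soup.find_all("tr"):
    # Collect header and data cells in document order.
    cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# Treat the first row as the header and map each later row onto it.
header, *body = rows
records = [dict(zip(header, row)) for row in body]

print(records)
```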
Structured data inside HTML (JSON-LD)
Many modern websites store structured metadata inside JSON-LD scripts <script type="application/ld+json">.
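One way to read those blocks is to find each matching script tag and pass its contents to json.loads(). The page and its metadata values below are invented for the example.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical page with a single JSON-LD block.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article",
 "headline": "Sample headline", "author": {"@type": "Person", "name": "Jane Doe"}}
</script>
</head><body><p>Body text</p></body></html>
"""

soup = BeautifulSoup(html, "lxml")

items = []
for script in soup.find_all("script", type="application/ld+json"):
    try:
        items.append(json.loads(script.string or ""))
    except json.JSONDecodeError:
        # Skip empty or malformed blocks instead of failing the whole page.
        continue

headline = items[0]["headline"]
author = items[0]["author"]["name"]
print(headline, "by", author)  # Sample headline by Jane Doe
```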
For Open Graph tags and Microdata at scale, the extruct library handles all structured data formats from a single call. For JSON handling on the JavaScript side, see JSON.parse() in JavaScript: a complete guide.
Handling complex and real-world HTML
Real-world HTML rarely looks like tutorial examples. Pages often contain ads, navigation menus, cookie banners, malformed markup, hidden elements, and JavaScript-rendered content that never appears in the downloaded HTML at all. Production extraction pipelines need to handle all of these consistently.
Removing boilerplate (navigation, footers, sidebars, ads)
Before you call get_text(), remove navigation, ads, and layout elements.
- Manual removal. Target by tag (header, footer, nav, aside) and by common class fragments (cookie-banner, ad, sidebar) before calling get_text.
- Automated boilerplate removal. For article-style pages, let a dedicated library handle it:
Trafilatura removes menus, ads, and unrelated layout sections automatically.
The readability-lxml and justext libraries are also solid alternatives to Trafilatura.
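The manual removal approach described above can be sketched as follows. The sample page and the class-name pattern are assumptions; real sites need their own list of fragments.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page layout for illustration.
html = """
<body>
  <header>Site logo</header>
  <nav>Home | About</nav>
  <div class="cookie-banner">We use cookies</div>
  <main><h1>Article title</h1><p>Main body text.</p></main>
  <aside class="sidebar">Related links</aside>
  <footer>Copyright</footer>
</body>
"""

soup = BeautifulSoup(html, "lxml")

# Remove layout elements by tag name.
for tag in soup.find_all(["header", "footer", "nav", "aside"]):
    tag.decompose()

# Remove elements whose class matches a common noise fragment.
NOISE_CLASS = re.compile(r"cookie-banner|sidebar|\bad\b")
for tag in soup.find_all(class_=NOISE_CLASS):
    tag.decompose()

text = soup.get_text(" ", strip=True)
print(text)  # Article title Main body text.
```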
Hidden content
HTML parsers don’t evaluate CSS visibility rules, so display:none, aria-hidden="true", and class names like "hidden" or "visually-hidden" do not affect what get_text() returns. Tracking pixels, cookie notices, and hidden terms paragraphs all appear in extracted text unless you remove them manually first:
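A rough sketch of that manual removal is shown below. It only catches inline styles, the hidden attribute, aria-hidden, and the class names listed here; rules defined in external stylesheets are not evaluated. The sample markup is made up.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup covering a few common hiding conventions.
html = """
<body>
  <p>Visible paragraph.</p>
  <div style="display: none">Hidden by inline style</div>
  <span aria-hidden="true">Decorative icon text</span>
  <p class="visually-hidden">Screen-reader only</p>
  <p hidden>Hidden attribute</p>
</body>
"""

soup = BeautifulSoup(html, "lxml")

# Attribute- and class-based hiding.
for tag in soup.select('[aria-hidden="true"], [hidden], .hidden, .visually-hidden'):
    tag.decompose()

# Inline-style hiding.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)
for tag in soup.find_all(style=HIDDEN_STYLE):
    tag.decompose()

text = soup.get_text(" ", strip=True)
print(text)  # Visible paragraph.
```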
Malformed HTML
The lxml parser is the right default, as it fixes most real-world issues while staying fast. It also handles malformed HTML better than Python’s built-in parser. Switch to html5lib only for the worst markup; it’s the most forgiving but significantly slower.
Signs of a parse failure include empty results where you expect content, missing tags that should be there, or output that looks correct but doesn’t match the visible page.
JavaScript-rendered content
Sometimes requests.get() returns HTML without the visible text you see in the browser. That usually means JavaScript injected the content after page load. No HTML parser can solve this, as the data doesn’t exist in the downloaded HTML.
You need to use a headless browser like Playwright or Selenium. Read how to scrape websites with dynamic content using Python and Playwright web scraping for deeper guidance, or check if the site has an API or JSON endpoint that serves the same data without rendering.
Anti-bot defenses on the fetching step
The parsing step often isn’t the real bottleneck. Fetching pages reliably becomes harder at scale because some sites use IP blocks, rate limits, browser fingerprinting, and CAPTCHAs.
Decodo residential proxies route requests through real user devices, so each request looks like it’s coming from a genuine browser on a home connection rather than a datacenter IP.
These proxies make traffic look closer to normal user activity, which reduces blocks during large-scale scraping workflows and improves extraction consistency on heavily protected sites.
Cleaning and normalizing extracted text
Extracting text is only the first part of the process. Real-world output still contains broken spacing, invisible characters, inconsistent Unicode, and encoding issues. Cleaning those problems improves search indexing, NLP processing, and downstream analysis.
The steps below separate usable output from output that causes silent bugs in downstream pipelines.
Whitespace and line breaks
The strip=True argument in get_text() strips leading and trailing whitespace from each text node. It doesn’t collapse multiple spaces or newlines in the middle of the joined output.
Create a file called cleanup_whitespace.py and add the script below, then collapse repeated whitespace:
Run the script:
Output:
Use regex only after parsing. Do not use regex to parse raw HTML. If you need paragraph preservation:
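Both variants can be sketched with the standard re module. The input string is a made-up sample of already-extracted text.

```python
import re

# Hypothetical extracted text with uneven spacing and blank lines.
raw = "  Title \n\n\n  First   paragraph\twith  gaps. \n\n Second paragraph.  "

# Flat output: collapse every run of whitespace into one space.
flat = re.sub(r"\s+", " ", raw).strip()

# Paragraph-preserving output: split on blank lines, then collapse inside each part.
parts = [re.sub(r"\s+", " ", p).strip() for p in re.split(r"\n\s*\n", raw)]
preserved = "\n\n".join(p for p in parts if p)

print(flat)
# Title First paragraph with gaps. Second paragraph.
print(preserved)
```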
HTML entities
Most parsers decode &amp; and &nbsp; entities automatically, but not always. The common failure case is text pulled from attribute values: title and alt attributes can still contain raw entities like this:
Output:
A few things worth noting:
- Browsers and parsers like Beautiful Soup usually decode entities in text nodes automatically.
- Raw attribute values (title, alt, data-*, etc.) may still contain entity references, depending on the parser and extraction method.
- If you extract from HTML with regex or string operations, entities remain encoded until you call html.unescape().
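When a string still carries entities, html.unescape() from the standard library decodes them. The strings below are invented examples, including a double-encoded value that needs a second pass.

```python
import html

# Hypothetical attribute value that still contains entity references.
title_attr = "Tom &amp; Jerry&#39;s &quot;Best of&quot;&nbsp;collection"
decoded = html.unescape(title_attr)
print(repr(decoded))  # 'Tom & Jerry\'s "Best of"\xa0collection'

# Double-encoded input only loses one layer per call.
double_encoded = "Fish &amp;amp; Chips"
once = html.unescape(double_encoded)
twice = html.unescape(once)
print(once)   # Fish &amp; Chips
print(twice)  # Fish & Chips
```

Note that &nbsp; decodes to the non-breaking space \xa0, not a regular space, so it still needs whitespace cleanup afterward.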
Unicode normalization
The unicodedata.normalize("NFKC", text) call fixes ligatures, full-width digits, and compatibility characters for downstream NLP:
This fixes compatibility characters, ligatures, and full-width forms that can otherwise break matching, indexing, tokenization, and regex processing.
So the practical rule is:
- NFKC is great for search, NLP, indexing, and user input normalization
- Avoid blindly applying it to cryptographic data, passwords, source code identifiers, or text where exact Unicode distinctions matter.
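A short sketch of what NFKC changes, using escape sequences so the special characters are unambiguous. The sample strings are assumptions chosen for illustration.

```python
import unicodedata

# "\ufb01" is the "fi" ligature; "\uff11\uff12\uff13" and "\uff21\uff22\uff23"
# are full-width digits and letters.
sample = "\ufb01le \uff11\uff12\uff13 \uff21\uff22\uff23"
normalized = unicodedata.normalize("NFKC", sample)
print(normalized)  # file 123 ABC

# The same call also flattens distinctions you may want to keep:
# a superscript two becomes a plain digit.
squared = unicodedata.normalize("NFKC", "x\u00b2")
print(squared)  # x2
```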
Non-breaking spaces and other invisibles
Invisible characters show up constantly in scraped text and look like regular spaces until something breaks:
- \u00a0: non-breaking space, comes from &nbsp; entities in the source
- \u200b: zero-width space, used by some sites to defeat scrapers
- \ufeff: byte order mark (BOM), appears at the start of files
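One simple way to handle the three characters above is a translation table. The mapping choices here (space for the non-breaking space, removal for the other two) are an assumption that suits most plain-text pipelines.

```python
# Map each invisible character to its replacement.
INVISIBLES = str.maketrans({
    "\u00a0": " ",  # non-breaking space becomes a regular space
    "\u200b": "",   # zero-width space is dropped
    "\ufeff": "",   # byte order mark is dropped
})

# Hypothetical scraped string containing all three characters.
text = "\ufeffPrice:\u00a0$10\u200b.00"
cleaned = text.translate(INVISIBLES)
print(cleaned)  # Price: $10.00
```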
Encoding detection at the source
When Requests guesses the wrong encoding, text comes out garbled, with sequences like â€™ instead of an apostrophe. Set the encoding from the <meta charset> tag or use apparent_encoding as a fallback:
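The fallback order can be sketched as below. To keep the example runnable offline, it uses a stand-in object with the same content and apparent_encoding attributes as a requests.Response; decode_html and the regex are hypothetical helpers, not part of Requests. With a real response you could instead assign the chosen name to response.encoding and read response.text.

```python
import re
from types import SimpleNamespace

# Looks for charset=... inside a <meta> tag near the top of the document.
META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)

def decode_html(response):
    """Decode bytes using <meta charset>, then apparent_encoding, then UTF-8."""
    raw = response.content
    candidates = []
    match = META_CHARSET.search(raw[:4096])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    if getattr(response, "apparent_encoding", None):
        candidates.append(response.apparent_encoding)
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown or wrong encoding, try the next candidate
    return raw.decode("utf-8", errors="replace")

# Stand-in for a response whose header-based guess was wrong.
page = '<html><head><meta charset="utf-8"></head><body>It\u2019s fine</body></html>'
fake_response = SimpleNamespace(content=page.encode("utf-8"), apparent_encoding="utf-8")

garbled = fake_response.content.decode("cp1252")  # what a wrong guess produces
fixed = decode_html(fake_response)
print("\u00e2\u20ac\u2122" in garbled)  # True
print("It\u2019s fine" in fixed)        # True
```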
Putting it together: a reusable clean() function
The function below takes extracted text and returns it ready for downstream use. It’s self-contained and copyable into any project that does HTML text extraction:
Create a file called clean_text.py and add the following script:
This helper handles many common cleanup problems in HTML-to-plain-text conversion in one place.
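A minimal sketch of such a helper, combining the cleanup steps from this section, might look like this. The step order and the choice to preserve paragraph breaks are assumptions; adjust them to your pipeline.

```python
import html
import re
import unicodedata

# Characters removed outright: zero-width space and byte order mark.
_INVISIBLE = str.maketrans({"\u200b": "", "\ufeff": ""})

def clean(text: str) -> str:
    """Return extracted text with entities, Unicode, and whitespace normalized."""
    text = html.unescape(text)                  # decode leftover entities
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    text = text.translate(_INVISIBLE)           # drop invisible characters
    # Split on blank lines, then collapse whitespace inside each paragraph.
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)

sample = "\ufeff  Caf\u00e9 &amp; bar\u00a0menu \n\n\n \uff11\uff12 items\u200b  "
print(repr(clean(sample)))  # 'Café & bar menu\n\n12 items'
```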
Full working scripts
Script 1: Minimum viable extraction with Beautiful Soup
Install dependencies, then run the script to extract and clean text from any URL:
Run:
Output:
Script 2: High-volume extraction with selectolax
Use this when processing large numbers of documents where Beautiful Soup’s overhead adds up:
Run:
Output:
Script 3: Article body extraction with Trafilatura
The Trafilatura library strips boilerplate automatically and returns only the main article content:
Run:
Output:
Script 4: Targeted extraction with lxml XPath
Use lxml’s XPath when you need precise control over which elements to target:
Run:
Output:
Console-based and alternative extraction tools
Python isn’t the only way to extract readable text from HTML. Several other console tools and browser-based methods work well for quick validation, debugging, and lightweight extraction workflows.
Here are a few Python alternatives and console-based text extraction tools you need to know:
Lynx and w3m
Terminal browsers like lynx and w3m can render messy HTML into readable plain text. You can run lynx -dump or w3m -dump, directly or through Python’s subprocess module, to get a fast sanity check when your parser output looks suspicious.
These tools work especially well for:
- Quick one-off extraction and sanity checks
- Debugging formatting problems
- Small scripts
- Comparing parser output
Overall, the main use case is debugging. Comparing lynx output against your parser’s output tells you quickly if a discrepancy is a parser problem or a selector problem.
Pandoc
For article-shaped HTML and documentation pipelines, the command pandoc -f html -t plain file.html can convert HTML into plain text in one line. It’s not scriptable in the same way Python is, but it’s a useful sanity check.
Browser-side extraction as a reality check
Sometimes the fastest debugging approach is checking the browser directly. Running document.body.innerText in the browser console returns what the browser’s rendering engine considers the visible text, with CSS evaluated, hidden elements excluded, and JavaScript-rendered content included. This is the gold standard to compare your extracted text against.
LLM-based extraction
LLMs can technically extract text from HTML, but they’re usually unnecessary for plain conversion tasks. They’re slower, more expensive, and less deterministic than parser libraries.
LLMs make more sense for semantic extraction tasks such as:
- Product names and prices
- Event detection
- Entity extraction
- Review summarization
For standard HTML text extraction workflows in Python, parser libraries remain faster and more reliable.
Best practices and common pitfalls
Text extraction problems usually come from a small set of recurring mistakes that are easy to handle once you’ve seen them, and expensive to debug after a downstream consumer breaks.
Here are the common text extraction best practices:
| Practice | Why it matters |
| --- | --- |
| Always strip script and style tags first | The #1 cause of JavaScript code ending up in extracted text. Call decompose() before get_text(). |
| Choose your parser deliberately | Use lxml as the Beautiful Soup backend by default; html.parser if you cannot install C extensions; html5lib for severely broken markup. |
| Never regex full HTML | Regex works for the cleanup of already-extracted plain text only. It breaks silently on real-world HTML. |
| Guard against None on find() | The find() method returns None when no match exists. Always check before calling .get_text() to avoid AttributeError. |
| Cache responses during development | Hitting a live site repeatedly while tweaking selectors gets your IP blocked fast. |
| Validate output continuously | Spot-check dozens of pages. Selector drift is easier to catch early than after a downstream consumer breaks. |
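The None guard from the practices above can be wrapped in a tiny helper. The text_or_default function is a hypothetical name used only for this sketch.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><h1>Title</h1></div>", "lxml")

def text_or_default(node, default=""):
    """Return the node's text, or a default when find() matched nothing."""
    return node.get_text(" ", strip=True) if node is not None else default

heading = text_or_default(soup.find("h1"))
subtitle = text_or_default(soup.find("h2"), default="(no subtitle)")

print(heading)   # Title
print(subtitle)  # (no subtitle)
```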
JavaScript-rendered content remains one of the most common confusion points for beginners. If your selectors suddenly return empty results, this guide outlines why Beautiful Soup fails and how to fix it.
Final thoughts
Extracting text from HTML in Python comes down to three key decisions: which library to use, how to strip noise before you extract, and how to clean what comes out. Beautiful Soup is the right default; selectolax handles large-scale extraction efficiently, Inscriptis improves formatting quality, and Trafilatura removes article boilerplate automatically.
Cleanup matters just as much as extraction. Normalizing whitespace, Unicode, entities, and invisible characters often determines whether extracted text is merely readable or truly production-ready.
Your next step? Pick a library from the decision matrix above, run the minimum viable workflow in the step-by-step section, and save the clean() function for when you need it.


