Python Extract Text From HTML: A Step-by-Step Guide With Code Examples

Extracting text from HTML in Python is one of the most common tasks in web scraping, NLP pipelines, search indexing, and data preparation. The goal is to keep the visible content from a webpage while removing all the HTML markup, scripts, and styles that surround it. This guide walks you through the popular Python libraries for HTML text extraction and a full step-by-step workflow to go from raw HTML to clean, production-ready text.

TL;DR

  • To extract text from HTML in Python, use a library such as Beautiful Soup, lxml, selectolax, html2text, Inscriptis, or Trafilatura; each suits a different use case.
  • Strip <script> and <style> tags before calling get_text(), or JavaScript ends up mixed into your output.
  • Use Beautiful Soup for one-off scripts, selectolax for large-scale extraction, Inscriptis when layout quality matters, and Trafilatura when you only want the article body.
  • Clean extracted text after parsing by normalizing whitespace, Unicode, HTML entities, and invisible characters.
  • Choose your parser based on speed, formatting quality, and page complexity.

Python offers several reliable libraries for extracting visible text from HTML. Some libraries focus on simplicity, while others prioritize speed, formatting quality, or boilerplate removal.

Choosing the right tool depends on your workload, the quality of the HTML, and the type of output you need.

Here’s a quick overview of the popular Python libraries for HTML text extraction:

Beautiful Soup

Beautiful Soup is one of the top Python scraping libraries and the default starting point for most developers. It handles broken HTML well, supports multiple parser backends, and provides the familiar get_text() method for fast extraction.

It’s also beginner-friendly and well-documented. The main limitation is scale: Beautiful Soup adds parsing overhead that becomes noticeable when processing thousands of documents.

lxml and lxml.html

The lxml library is faster and stricter than Beautiful Soup and supports full XPath queries, which lets you write precise, readable selectors that CSS can’t express. It works especially well for structured XML and HTML text extraction, precise element targeting, and large-scale parsing.

Meanwhile, lxml.html is the specialized submodule for working with HTML documents. Developers often combine it with HTML-to-text utilities for cleaner Python HTML to plain text conversion without walking the DOM manually. 

selectolax

selectolax is a Cython wrapper around the Modest and Lexbor C engines that parses HTML and queries it with CSS selectors. This makes it extremely fast for high-volume extraction, where Beautiful Soup’s overhead becomes a real cost.

It’s a strong choice when you need to get text from HTML across thousands of pages daily. The API is smaller than Beautiful Soup’s, but parsing speed often compensates for the learning curve.

html2text

html2text converts HTML into Markdown-style text instead of plain strings. This format works well for LLM prompts, documentation generation, and note-taking systems.

The output preserves link text, headings, and basic list structure, which makes it useful when you care about the document’s shape, not just its words.

Inscriptis

This library is built for conversion quality. It handles lists and table layouts noticeably better than Beautiful Soup or html2text, because it applies CSS-style rendering rules during conversion. If your job is to produce text that reads the way a human would read the page, with preserved indentation and column separation, then Inscriptis is worth the extra dependency.

Trafilatura, readability-lxml, and justext

These tools specialize in boilerplate removal. They don’t give you all the text on a page, but the main content, stripping navigation menus, footers, ads, cookie banners, and sidebars automatically. Trafilatura is the most actively maintained and generally the most reliable. Use any of these when scraping article-style pages, and you want the main body text without writing selector rules. 

Standard library options (html.parser and html.unescape)

Python’s built-in html.parser and html.unescape() can handle small scripts without external dependencies, but they are not ideal for large or real-world scraping.
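As a rough illustration of the standard-library route, here is a minimal sketch built on html.parser.HTMLParser. The class name VisibleTextParser, the helper html_to_text, and the sample markup are made up for this example, and the skip list is an assumption about which tags count as noise.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of non-visible tags."""

    SKIP_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = "<h1>Title</h1><script>var x = 1;</script><p>Tom &amp; Jerry</p>"
print(html_to_text(sample))  # Title Tom & Jerry
```

This works for small, reasonably clean inputs. It does no error recovery beyond what html.parser provides, which is why a dedicated library is still the safer choice for real pages.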

Comparing extraction methods

Different HTML-to-text Python extraction libraries solve different problems. Choosing without a clear reference could lead you to switch libraries later. Here is a side-by-side comparison to help you pick the right tool before building your extraction workflow.

| Library | Best for | Speed | Conversion quality | Learning curve | Install size |
| --- | --- | --- | --- | --- | --- |
| Beautiful Soup | General-purpose extraction, one-off scripts | Slow to medium | Basic | Easy | Small (~500 KB) |
| lxml.html | Structured parsing and XPath | Fast | Good | Moderate | Medium (~3 MB) |
| selectolax | High-volume extraction | Very fast | Basic | Moderate | Small (~200 KB) |
| html2text | Markdown output for LLMs/docs | Medium | Markdown only | Easy | Small (~100 KB) |
| Inscriptis | Readable plain-text formatting | Medium | Excellent (tables/lists) | Medium | Tiny (~50 KB) |
| Trafilatura | Article body extraction | Fast | Excellent (article only) | Easy | Medium (~5 MB) |

Performance benchmarks vary depending on parser configuration and HTML complexity. Independent parser comparisons, including tests from Rushter, often show selectolax performing roughly 5x to 30x faster than Beautiful Soup on large workloads.

Here is a quick decision guide: Use Beautiful Soup for one-off scripts and beginner projects. Use the selectolax library when speed matters most. Choose Inscriptis if readable formatting is your top priority. Use html2text when you need Markdown output. Choose Trafilatura when you only want article content.

For quick reference, use this decision tree:

Need article-only content?
├── Yes → Trafilatura
└── No → Need Markdown output?
    ├── Yes → html2text
    └── No → Need maximum speed?
        ├── Yes → selectolax
        └── No → Need best formatting?
            ├── Yes → Inscriptis
            └── No → Beautiful Soup
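The decision tree above can also be written as a small function, which is handy if a pipeline picks its extractor from configuration. This is only a sketch: the function name choose_library and its flag names are hypothetical, and the checks simply follow the order of the tree.

```python
def choose_library(
    article_only: bool = False,
    markdown: bool = False,
    max_speed: bool = False,
    best_formatting: bool = False,
) -> str:
    """Return the library suggested by the decision tree, checked top to bottom."""
    if article_only:
        return "Trafilatura"
    if markdown:
        return "html2text"
    if max_speed:
        return "selectolax"
    if best_formatting:
        return "Inscriptis"
    return "Beautiful Soup"


print(choose_library(max_speed=True))  # selectolax
print(choose_library())                # Beautiful Soup
```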

Two side notes worth knowing:

  • The nltk.clean_html function has been deprecated since 2014 and now raises NotImplementedError. Some outdated Stack Overflow answers still reference it, so avoid copying those snippets into production workflows.
  • Regex-only recipes for stripping HTML also appear frequently in tutorials on removing HTML tags in Python. Regex can help clean already-extracted plain text, but it breaks on nested or malformed HTML. Parse the HTML first, then clean the extracted text afterward.
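To make the regex warning concrete, here is a small sketch that compares a typical tag-stripping regex with the standard-library parser on one tricky input. The helper names and the sample string are invented for illustration; the regex is a common recipe, not one taken from any specific tutorial.

```python
import re
from html.parser import HTMLParser

# A typical "strip everything between < and >" recipe
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_with_regex(html: str) -> str:
    return TAG_RE.sub("", html)


class _TextCollector(HTMLParser):
    """Collect text nodes only."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_with_parser(html: str) -> str:
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


# A ">" inside a quoted attribute value is valid HTML
tricky = '<p title="a > b">Hello</p>'

print(repr(strip_tags_with_regex(tricky)))   # ' b">Hello'
print(repr(strip_tags_with_parser(tricky)))  # 'Hello'
```

The regex stops at the first ">" it sees, so part of the attribute leaks into the output, while the parser understands the quoting and returns only the text.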

Step-by-step: a minimum viable extraction

The fastest way to extract text from HTML in Python is a simple workflow: load your HTML, parse it, strip noisy elements, extract visible text, and clean the output. Beautiful Soup is the right default here. It is readable, forgiving, and well-documented.

Step 1: Install the libraries

Install Beautiful Soup and lxml first:

pip install beautifulsoup4 lxml

Installing lxml alongside Beautiful Soup matters. Beautiful Soup is a parsing interface, not a parser, so it needs a backend. lxml is the fastest of the supported backends and the right default.

Step 2: Load your HTML

You can load HTML from a local file, a string, or a requests.get() response. For this example, we’ll use a simple HTML string.

html = """
<html>
<head>
<style>body { font-size: 16px; }</style>
</head>
<body>
<h1>Book review: The Pragmatic Programmer</h1>
<p>One of the most influential software books ever written.</p>
<p>It covers career advice, coding habits, and project thinking.</p>
<div class="hidden" style="display:none">Tracking pixel loaded</div>
<script>window.analytics = { user: "anon" };</script>
</body>
</html>
"""

For live URLs, use html = requests.get(url).text. We’ll cover encoding pitfalls with live pages in the real-world section below.

Step 3: Parse with Beautiful Soup

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")

The "lxml" parser is usually the best default. It tells Beautiful Soup which parser backend to use. Passing it explicitly avoids a warning and maintains a consistent behavior across environments.

If you hit parser issues, explore our guide on what to do when getting parsing errors in Python.

Step 4: Strip script and style tags

Next, remove noisy elements:

for tag in soup(["script", "style", "noscript", "template"]):
    tag.decompose()

The decompose() method removes the tag and its contents from the tree and destroys them. The extract() method also removes the tag from the tree, but returns it, so the element stays available. Use extract() if you need the removed element for something else.

Step 5: Extract text

text = soup.get_text(separator=" ", strip=True)
print(text)

The separator argument inserts a string between each piece of text. A single space works for flat output; use "\n\n" if you need paragraph breaks. The strip=True argument strips leading and trailing whitespace from each text piece before joining, but it doesn’t collapse whitespace in the middle of strings.

When you run the script above, you’ll get the following output:

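Book review: The Pragmatic Programmer One of the most influential software books ever written. It covers career advice, coding habits, and project thinking. Tracking pixel loaded

The script and style contents are gone. The text of the hidden div ("Tracking pixel loaded") is still there, because the parser does not evaluate the inline display:none rule; hidden elements have to be removed explicitly.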

Here is a second example, with the raw HTML and the cleaned text output shown one after the other, so you can see what changed.

Raw HTML:

<html>
<head>
<style>
.hidden {display:none;}
</style>
<script>
console.log("analytics");
</script>
</head>
<body>
<h1>Python scraping basics</h1>
<p>Extract only visible content.</p>
<p>Ignore scripts and styles during parsing.</p>
<div class="hidden">Hidden debug content</div>
</body>
</html>

Full script:

from bs4 import BeautifulSoup
html = """
<html>
<head>
<style>
.hidden {display:none;}
</style>
<script>
console.log("analytics");
</script>
</head>
<body>
<h1>Python scraping basics</h1>
<p>Extract only visible content.</p>
<p>Ignore scripts and styles during parsing.</p>
<div class="hidden">Hidden debug content</div>
</body>
</html>
"""
soup = BeautifulSoup(html, "lxml")
for tag in soup(["script", "style", "noscript", "template"]):
    tag.decompose()
text = soup.get_text(separator=" ", strip=True)
print(text)

Output:

Python scraping basics Extract only visible content. Ignore scripts and
styles during parsing. Hidden debug content

Notice that the hidden div still appears. HTML parsers don’t evaluate CSS visibility rules automatically.

Extracting specific content or data structures

Most real tasks don’t need all the text from a page. They need a specific element, a set of matching tags, or structured data embedded in the HTML. 

Python libraries make this easy once you understand selectors and structured extraction patterns. We’ll cover the patterns for targeting what you actually want and leaving the rest behind. 

Targeting tags by name and attribute

Beautiful Soup provides find() and find_all() methods for targeted extraction.

The following script fetches book titles from books.toscrape.com using find_all() with a tag name, then filters by attribute. Create a file called extract_titles.py and add the script:

from bs4 import BeautifulSoup
import requests

response = requests.get("https://books.toscrape.com")
soup = BeautifulSoup(response.text, "lxml")

# All book titles using find_all with tag name
titles = soup.find_all("h3")
for t in titles[:5]:
    # The visible link text is truncated on this site, so read the title attribute
    print(t.a["title"])

# Find by attribute -- trailing underscore avoids Python keyword clash
thumbnails = soup.find_all("img", class_="thumbnail")

Run it with:

python extract_titles.py

Output:

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind

The class_ spelling exists because class is a reserved word in Python. Beautiful Soup handles it cleanly with the underscore.

CSS selectors with select() and select_one()

CSS selectors are easier to read than nested find() calls when targeting nested elements or using attribute patterns:

# Nested selector: anchor inside h3 inside article
book_links = soup.select("article.product_pod h3 a")
for book in book_links[:3]:
    print(book.get("title"))

# First match only
first_price = soup.select_one("p.price_color")
print(first_price.get_text(strip=True))

Output:

A Light in the Attic
Tipping the Velvet
Soumission
£51.77

Nested selectors are especially useful when you parse pages with repeated structures. Here, the selector article.product_pod h3 a targets the link nested inside each product card.

XPath with lxml.html

XPath handles advanced conditions and complex relationships that CSS selectors can’t express as cleanly.

Create a file called xpath_example.py and add the following script:

from lxml import html
import requests
tree = html.fromstring(requests.get("https://books.toscrape.com").content)
# XPath: get book titles via title attribute on anchor tags
titles = tree.xpath("//article[@class='product_pod']//h3/a/@title")
print(titles[:3])

Run:

python xpath_example.py

Output:

['A Light in the Attic', 'Tipping the Velvet', 'Soumission']

XPath is especially useful for sibling relationships, ancestor checks, and conditional matching.

Pulling text from tables

For structured table data, pandas.read_html() is often the fastest and easiest option.

import pandas as pd
# pandas.read_html returns a list of DataFrames
tables = pd.read_html("https://en.wikipedia.org/wiki/Python_(programming_language)")
print(tables[0].head())

This requires lxml or Beautiful Soup as the underlying parser. It works well on well-formed tables, but can misbehave silently on rowspan and colspan oddities. For irregular tables, fall back to Beautiful Soup’s find_all("tr") and walk rows manually.

Structured data inside HTML (JSON-LD)

Many modern websites store structured metadata inside JSON-LD scripts <script type="application/ld+json">.

import json
import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# find() returns None when the page has no JSON-LD block
script = soup.find("script", type="application/ld+json")
if script and script.string:
    data = json.loads(script.string)
    print(data)

For Open Graph tags and Microdata at scale, the extruct library handles all structured data formats from a single call. For JSON handling on the JavaScript side, see JSON.parse() in JavaScript: a complete guide.
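Pages can carry several JSON-LD blocks, and some of them contain invalid JSON. The sketch below shows one way to collect every valid block, using only the standard library so it runs without a network call. The class name JsonLdCollector, the helper extract_json_ld, and the sample markup are all made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of every JSON-LD script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_json_ld(html: str) -> list:
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


sample = """
<html><head>
<script type="application/ld+json">{"@type": "Book", "name": "A Light in the Attic"}</script>
<script type="application/ld+json">{not valid json}</script>
<script>console.log("not JSON-LD");</script>
</head><body></body></html>
"""
print(extract_json_ld(sample))
```

Only the first block is returned here: the second fails to parse and the third is ordinary JavaScript.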

Handling complex and real-world HTML

Real-world HTML rarely looks like tutorial examples. Pages often contain ads, navigation menus, cookie banners, malformed markup, hidden elements, and JavaScript-rendered content that never appears in the downloaded HTML at all. Production extraction pipelines need to handle all of these consistently.

Removing boilerplate (navigation, footers, sidebars, ads)

Before you call get_text(), remove navigation, ads, and layout elements.

  • Manual removal. Target by tag (header, footer, nav, aside) and by common class names (cookie-banner, ad, sidebar) before calling get_text():

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

# Remove common layout elements
for tag in soup(["header", "footer", "nav", "aside"]):
    tag.decompose()

# Remove common sidebar and banner classes
for element in soup.select(".cookie-banner, .sidebar, .ad"):
    element.decompose()

  • Automated boilerplate removal. For article-style pages, let a dedicated library handle it:

import trafilatura

text = trafilatura.extract(html)
print(text)

Trafilatura removes menus, ads, and unrelated layout sections automatically.

The readability-lxml and justext libraries are also solid alternatives to Trafilatura.

Hidden content

HTML parsers don’t evaluate CSS visibility rules, so display:none, aria-hidden="true", and class names like "hidden" or "visually-hidden" do not affect what get_text() returns. Tracking pixels, cookie notices, and hidden terms paragraphs all appear in extracted text unless you remove them manually first:

# Remove elements with display:none or common hidden class names
for tag in soup.find_all(style=lambda s: s and "display:none" in s.replace(" ", "")):
    tag.decompose()
for tag in soup.find_all(class_=lambda c: c and "hidden" in c):
    tag.decompose()

Malformed HTML

The lxml parser is the right default, as it fixes most real-world issues while staying fast. It also handles malformed HTML better than Python’s built-in parser. Switch to html5lib only for the worst markup; it’s the most forgiving but significantly slower.

Signs of a parse failure include empty results where you expect content, missing tags that should be there, or output that looks correct but doesn’t match the visible page.

JavaScript-rendered content

Sometimes requests.get() returns HTML without the visible text you see in the browser. That usually means JavaScript injected the content after page load. No HTML parser can solve this, as the data doesn’t exist in the downloaded HTML.

You need to use a headless browser like Playwright or Selenium. Read how to scrape websites with dynamic content using Python and Playwright web scraping for deeper guidance, or check if the site has an API or JSON endpoint that serves the same data without rendering. 

Anti-bot defenses on the fetching step

The parsing step often isn’t the real bottleneck. Fetching pages reliably becomes harder at scale because some sites use IP blocks, rate limits, browser fingerprinting, and CAPTCHAs.

Decodo residential proxies route requests through real user devices, so each request looks like it’s coming from a genuine browser on a home connection rather than a datacenter IP. 

These proxies make traffic look closer to normal user activity, which reduces blocks during large-scale scraping workflows and improves extraction consistency on heavily protected sites. 

Cleaning and normalizing extracted text

Extracting text is only the first part of the process. Real-world output still contains broken spacing, invisible characters, inconsistent Unicode, and encoding issues. Cleaning those problems improves search indexing, NLP processing, and downstream analysis.

The steps below separate usable output from output that causes silent bugs in downstream pipelines.

Whitespace and line breaks

The strip=True argument in get_text() strips leading and trailing whitespace from each text node. It doesn’t collapse multiple spaces or newlines in the middle of the joined output. 

Create a file called cleanup_whitespace.py and add the script below, which collapses repeated whitespace:

import re
from bs4 import BeautifulSoup
html = """
<p>Hello</p>
<p>World</p>
"""
soup = BeautifulSoup(html, "lxml")
text = soup.get_text(separator=" ", strip=True)
# Collapse repeated whitespace
text = re.sub(r"\s+", " ", text)
print(text)

Run the script:

python cleanup_whitespace.py

Output:

Hello World

Use regex only after parsing. Do not use regex to parse raw HTML. If you need paragraph preservation:

text = soup.get_text(separator="\n\n")
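If you keep paragraph breaks, a blanket `\s+` replacement would flatten them again. One possible approach, sketched below, is to split on blank lines first and collapse whitespace only inside each paragraph. The function name normalize_paragraphs and the sample string are illustrative.

```python
import re


def normalize_paragraphs(text: str) -> str:
    """Collapse whitespace inside paragraphs while keeping one blank line between them."""
    # Treat any run of blank lines as a paragraph boundary
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop paragraphs that end up empty
    return "\n\n".join(p for p in cleaned if p)


raw = "Hello   world\n\n\n\n  Second\tparagraph \n here \n\n"
print(normalize_paragraphs(raw))
```

For the sample input, this prints "Hello world", a blank line, and then "Second paragraph here".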

HTML entities

Most parsers decode &amp; and &nbsp; entities automatically, but not always. The common failure case is text pulled from attribute values, such as title and alt attributes, or from HTML fragments you extracted by hand, which can still contain raw entities like this:

from html import unescape

raw_text = "Tom &amp; Jerry"
clean_text = unescape(raw_text)
print(clean_text)

Output:

Tom & Jerry

A few things worth noting:

  • Browsers and parsers like Beautiful Soup usually decode text nodes automatically.
  • Raw attribute values (title, alt, data-*, etc.) may still contain entity references, depending on the parser and extraction method.
  • If you're processing HTML manually (regex or string operations), entities usually remain encoded until you call html.unescape().

Unicode normalization

The unicodedata.normalize("NFKC", text) call fixes ligatures, full-width digits, and compatibility characters for downstream NLP:

import unicodedata
text = unicodedata.normalize("NFKC", text)
# Why this matters:
# "\uFB01le" (fi ligature) won't match a search for "file"
# "\uFF15" (full-width 5) won't match a regex looking for "5"

So the practical rule is:

  • NFKC is great for search, NLP, indexing, and user input normalization.
  • Avoid blindly applying it to cryptographic data, passwords, source code identifiers, or text where exact Unicode distinctions matter.
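The short sketch below shows both sides of that rule: NFKC makes lookalike characters match, but it is lossy. The helper name nfkc and the sample strings are chosen only for illustration.

```python
import unicodedata


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


# Helpful: ligatures and full-width digits become their plain equivalents
print(nfkc("\uFB01le") == "file")   # True
print(nfkc("\uFF15") == "5")        # True

# Lossy: a superscript two becomes an ordinary 2, so "x squared" reads as "x2"
print(nfkc("x\u00B2"))              # x2
```

Because the original characters cannot be recovered afterward, it is usually safer to keep the raw text and store the normalized form next to it when exact content matters.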

Non-breaking spaces and other invisibles

Invisible characters show up constantly in scraped text and look like regular spaces until something breaks:

  • \u00a0: non-breaking space, which is what a decoded &nbsp; turns into
  • \u200b: zero-width space, used by some sites to defeat scrapers
  • \ufeff: byte order mark (BOM), appears at the start of files

# Strip the common invisibles safely
text = text.replace("\u00a0", " ")  # NBSP to regular space
text = text.replace("\u200b", "")   # zero-width space to nothing
text = text.replace("\ufeff", "")   # BOM to nothing
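The three replacements above cover the most common cases. A broader variant, sketched here as an assumption rather than a recommendation, uses Unicode categories instead of a fixed list: space separators (category Zs) become a regular space and format characters (category Cf) are dropped. The function name strip_invisibles is hypothetical.

```python
import unicodedata


def strip_invisibles(text: str) -> str:
    """Map Unicode space separators to ' ' and drop invisible format characters."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Zs":      # NBSP, thin space, ideographic space, ...
            out.append(" ")
        elif category == "Cf":    # zero-width space, BOM, soft hyphen, joiners, ...
            continue
        else:
            out.append(ch)
    return "".join(out)


dirty = "\ufeffprice:\u00a0\u200b10\u2009USD\u200d"
print(strip_invisibles(dirty))  # price: 10 USD
```

Note the trade-off: category Cf also includes the zero-width joiner, which emoji sequences and some scripts rely on, so this variant fits search and indexing better than text that will be displayed again.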

Encoding detection at the source

When Requests guesses the wrong encoding, text comes out garbled, with sequences like â€™ in place of an apostrophe. Set the encoding from the <meta charset> tag when you know it, or use apparent_encoding, which guesses the encoding from the response bytes, as a fallback:

import requests

response = requests.get(url)
# Detect the encoding from the response bytes instead of trusting the HTTP headers
response.encoding = response.apparent_encoding
html = response.text

# When reading from a file, always be explicit
with open("page.html", encoding="utf-8") as f:
    html = f.read()
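For the meta-tag route, a minimal sketch could look like the following. It works on raw bytes, so with Requests you would pass response.content. The helper names sniff_meta_charset and decode_html, the 2048-byte scan window, and the regex are assumptions made for this example, not part of any library.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the older http-equiv form
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE
)


def sniff_meta_charset(raw: bytes, scan_bytes: int = 2048) -> Optional[str]:
    """Return the charset declared in a <meta> tag near the top of the page, if any."""
    match = META_CHARSET_RE.search(raw[:scan_bytes])
    return match.group(1).decode("ascii") if match else None


def decode_html(raw: bytes, fallback: str = "utf-8") -> str:
    encoding = sniff_meta_charset(raw) or fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: decode with the fallback and keep going
        return raw.decode(fallback, errors="replace")


page = (
    '<html><head><meta charset="windows-1252"></head>'
    "<body>It’s here</body></html>"
).encode("cp1252")

print(sniff_meta_charset(page))  # windows-1252
print(decode_html(page))

# What the wrong guess looks like: UTF-8 bytes read as cp1252
print("It’s".encode("utf-8").decode("cp1252"))  # Itâ€™s
```

With Requests, the same idea would be response.encoding = sniff_meta_charset(response.content) or response.apparent_encoding before reading response.text.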

Putting it together: a reusable clean() function

The function below takes extracted text and returns it ready for downstream use. It’s self-contained and copyable into any project that does HTML text extraction:

Create a file called clean_text.py and add the following script:

import re
import unicodedata
from html import unescape

def clean(text: str) -> str:
    """Normalize extracted HTML text for downstream use."""
    # Decode any remaining HTML entities
    text = unescape(text)
    # Strip BOM and zero-width space
    text = text.replace("\ufeff", "").replace("\u200b", "")
    # Convert non-breaking spaces to regular spaces
    text = text.replace("\u00a0", " ")
    # NFKC normalization -- fixes ligatures and full-width chars
    text = unicodedata.normalize("NFKC", text)
    # Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text

This helper handles many common cleanup problems in HTML-to-plain-text conversion in one place.

Full working scripts

Script 1: Minimum viable extraction with BeautifulSoup

Install dependencies, then run the script to extract and clean text from any URL:

pip install beautifulsoup4 lxml requests

# bs4_extract.py
from bs4 import BeautifulSoup
import requests, re, html, unicodedata

def clean(text: str) -> str:
    text = html.unescape(text)
    text = text.replace("\ufeff", "").replace("\u200b", "")
    text = text.replace("\u00a0", " ")
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def extract_text(url: str) -> str:
    response = requests.get(url)
    response.encoding = response.apparent_encoding
    soup = BeautifulSoup(response.text, "lxml")
    for tag in soup(["script", "style", "noscript", "template"]):
        tag.decompose()
    return clean(soup.get_text(separator=" ", strip=True))

if __name__ == "__main__":
    print(extract_text("https://books.toscrape.com")[:500])

Run:

python bs4_extract.py

Output:

All products | Books to Scrape - Sandbox Books to Scrape We love being scraped!
Warning! This is a demo website for web scraping purposes... A Light in the Attic
£51.77 In stock Add to basket Tipping the Velvet £53.74 In stock...

Script 2: High-volume extraction with selectolax

Use this when processing large numbers of documents where Beautiful Soup’s overhead adds up:

pip install selectolax requests

# selectolax_extract.py
from selectolax.parser import HTMLParser
import requests, re, html, unicodedata

def clean(text: str) -> str:
    text = html.unescape(text)
    text = text.replace("\ufeff", "").replace("\u200b", "")
    text = text.replace("\u00a0", " ")
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def extract_text(url: str) -> str:
    response = requests.get(url)
    tree = HTMLParser(response.text)
    for tag in tree.css("script, style, noscript, template"):
        tag.decompose()
    return clean(tree.body.text(separator=" "))

if __name__ == "__main__":
    print(extract_text("https://books.toscrape.com")[:500])

Run:

python selectolax_extract.py

Output:

Books to Scrape We love being scraped!
A Light in the Attic £51.77 In stock Add to basket Tipping the Velvet £53.74...

Script 3: Article body extraction with Trafilatura

The Trafilatura library strips boilerplate automatically and returns only the main article content:

pip install trafilatura

# trafilatura_extract.py
import trafilatura

def extract_article(url: str) -> str:
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:  # download failed
        return ""
    return trafilatura.extract(downloaded) or ""

if __name__ == "__main__":
    print(extract_article("https://en.wikipedia.org/wiki/Web_scraping")[:500])

Run:

python trafilatura_extract.py

Output:

Web scraping, web harvesting, or web data extraction is data scraping used for
extracting data from websites. Web scraping software may directly access the World
Wide Web using the Hypertext Transfer Protocol or a web browser...

Script 4: Targeted extraction with lxml XPath

Use lxml’s XPath when you need precise control over which elements to target:

pip install lxml requests

# lxml_xpath_extract.py
from lxml import html
import requests

def extract_titles(url: str) -> list:
    tree = html.fromstring(requests.get(url).content)
    return tree.xpath("//article[@class='product_pod']//h3/a/@title")

if __name__ == "__main__":
    for title in extract_titles("https://books.toscrape.com")[:10]:
        print(title)

Run:

python lxml_xpath_extract.py

Output:

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Remains of the Day
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free

Console-based and alternative extraction tools

Python isn’t the only way to extract readable text from HTML. Several other console tools and browser-based methods work well for quick validation, debugging, and lightweight extraction workflows. 

Here are a few alternatives to Python libraries, plus console-based text extraction tools, that are worth knowing:

Lynx and w3m

Terminal browsers like lynx and w3m can render messy HTML into readable plain text. You can run lynx -dump or w3m -dump directly, or through Python’s subprocess module, to get a fast sanity check when your parser output looks suspicious.

These tools work especially well for:

  • Quick one-off extraction and sanity checks
  • Debugging formatting problems
  • Small scripts
  • Comparing parser output

Overall, the main use case is debugging. Comparing lynx output against your parser’s output tells you quickly if a discrepancy is a parser problem or a selector problem.
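A subprocess call for that comparison might look like the sketch below. It assumes lynx is installed and on the PATH, and it returns None when it is not. The function name render_with_lynx and the timeout value are invented for this example.

```python
import shutil
import subprocess
from typing import Optional


def render_with_lynx(html: str, timeout: float = 10.0) -> Optional[str]:
    """Render an HTML string to plain text with lynx, or return None if unavailable."""
    if shutil.which("lynx") is None:
        return None  # lynx is not installed on this machine
    result = subprocess.run(
        # -stdin reads the document from standard input, -nolist omits the link list
        ["lynx", "-dump", "-stdin", "-force_html", "-nolist"],
        input=html,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.stdout if result.returncode == 0 else None


rendered = render_with_lynx("<h1>Title</h1><p>Hello</p>")
print(rendered if rendered is not None else "lynx not available")
```

Diffing this output against your parser’s output on the same HTML is usually enough to tell which side dropped or added text.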

Pandoc

For article-shaped HTML and documentation pipelines, the command pandoc -f html -t plain file.html can convert HTML into plain text in one line. It’s not scriptable in the same way Python is, but it’s a useful sanity check.

Browser-side extraction as a reality check

Sometimes the fastest debugging approach is checking the browser directly. Running document.body.innerText in the browser console returns what a human-rendering engine considers the visible text with CSS evaluated, hidden elements excluded, and JavaScript-rendered content included. This is the gold standard to compare your extracted text against.

LLM-based extraction

LLMs can technically extract text from HTML, but they’re usually unnecessary for plain conversion tasks: they’re slower, more expensive, and less deterministic than parser libraries.

LLMs make more sense for semantic extraction tasks such as:

  • Product names and prices 
  • Event detection 
  • Entity extraction 
  • Review summarization 

For standard HTML text extraction workflows in Python, parser libraries remain faster and more reliable.

Best practices and common pitfalls

Text extraction problems usually come from a small set of recurring mistakes that are easy to handle once you’ve seen them, and expensive to debug after a downstream consumer breaks.

Here are the common text extraction best practices:

| Practice | Why it matters |
| --- | --- |
| Always strip script and style tags first | The #1 cause of JavaScript code ending up in extracted text. Call decompose() before get_text(). |
| Choose your parser deliberately | Use lxml as the Beautiful Soup backend by default; html.parser if you cannot install C extensions; html5lib for severely broken markup. |
| Never regex full HTML | Regex works for the cleanup of already-extracted plain text only. It breaks silently on real-world HTML. |
| Guard against None on find() | The find() method returns None when no match exists. Always check before calling .get_text() to avoid AttributeError. |
| Cache responses during development | Hitting a live site repeatedly while tweaking selectors gets your IP blocked fast. |
| Validate output continuously | Spot-check dozens of pages. Selector drift is easier to catch early than after a downstream consumer breaks. |
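For the caching practice, a minimal on-disk cache can be sketched with the standard library alone. The function names cached_fetch and _download, the html_cache directory, and the injectable fetch parameter are assumptions made for this example; in a real project you would likely pass a function that wraps requests.get().

```python
import hashlib
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def _download(url: str) -> bytes:
    """Default fetcher: a plain HTTP GET with the standard library."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def cached_fetch(
    url: str,
    cache_dir: str = "html_cache",
    fetch: Callable[[str], bytes] = _download,
) -> bytes:
    """Return the page body, downloading it only the first time it is requested."""
    # Hash the URL so it is always a safe file name
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_path = Path(cache_dir) / name
    if cache_path.exists():
        return cache_path.read_bytes()
    body = fetch(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(body)
    return body
```

While you tune selectors, every run after the first reads from disk, so the live site sees one request per URL. Delete the cache directory when you want fresh copies.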

JavaScript-rendered content remains one of the most common confusion points for beginners. If your selectors suddenly return empty results, this guide outlines why Beautiful Soup fails and how to fix it. 

Final thoughts

Extracting text from HTML in Python comes down to three key decisions: which library to use, how to strip noise before you extract, and how to clean what comes out. Beautiful Soup is the right default; selectolax handles large-scale extraction efficiently, Inscriptis improves formatting quality, and Trafilatura removes article boilerplate automatically.

Cleanup matters just as much as extraction. Normalizing whitespace, Unicode, entities, and invisible characters often determines whether extracted text is merely readable or truly production-ready.

Your next step? Pick a library from the decision matrix above, run the minimum viable workflow in the step-by-step section, and save the clean() function for when you need it.

About the author

Lukas Mikelionis

Senior Account Manager

Lukas is a seasoned enterprise sales professional with extensive experience in the SaaS industry. Throughout his career, he has built strong relationships with Fortune 500 technology companies, developing a deep understanding of complex enterprise needs and strategic account management.

Connect with Lukas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

What is the best Python library to extract text from HTML?

It depends on the job. Beautiful Soup is usually the best default for most text extraction tasks because it balances readability, flexibility, and parser support.

It works well for beginners and production scripts alike. If speed matters most, selectolax is typically faster. If article extraction matters more than raw parsing, Trafilatura often produces cleaner output.

How do I remove HTML tags from a string in Python?

The safest default is BeautifulSoup(html, "lxml").get_text(). This properly parses nested HTML before removing tags. Regex-only approaches can break on malformed markup, embedded scripts, or nested elements. Regex is fine for the cleanup of already-extracted plain text, not for stripping markup from HTML strings.

Why does BeautifulSoup's get_text() include script and style content?

Beautiful Soup is an HTML parser, not a browser. It doesn’t evaluate CSS or run JavaScript, so it has no concept of "visible" content. Every text node in the tree comes out, including the content of <script> and <style> tags. The fix is to call tag.decompose() on all script, style, noscript, and template tags before calling get_text().

How do I extract text from a URL in Python?

Fetch the page first with requests.get(), then parse the HTML with Beautiful Soup or another parser. Encoding mismatches can produce broken characters, so check response.apparent_encoding when the text looks corrupted. If visible content never appears in the HTML response, the site likely renders content with JavaScript.

Which Python library is faster: Beautiful Soup, lxml, or selectolax?

The selectolax library is usually the fastest option, often outperforming Beautiful Soup significantly on large workloads. lxml sits in the middle and provides excellent XPath support. Beautiful Soup is slower but easier to read and debug. Actual speed depends heavily on parser configuration and page complexity.

Can I extract text from HTML without any external library?

Yes. Python’s built-in html.parser module and html.unescape() can handle tiny scripts without additional installs. That said, real-world HTML quickly becomes messy, malformed, and inconsistent. Beautiful Soup with the lxml backend remains the safer choice for anything beyond very small parsing tasks.

© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved