Back to blog

How to Parse HTML With Regex: A Practical Guide

Share article:

Yes, you can parse HTML with regex – but only for specific tasks. Regex works well on flat targets like meta tags, sitemap URLs, or inline JSON-LD. But on nested or JavaScript-rendered markup, it fails silently, and you often don’t notice until the data is already wrong. This guide explains when regex works on HTML and when it breaks, includes working Python for the common extraction tasks (meta tags, JSON-LD, bulk extraction), and covers when to switch to a parser or get past a page that blocks you.

Browser window with code symbol inside a rounded square.

TL;DR

  • A regular expression (regex) is a pattern-matching mechanism that compiles into a state machine to scan linear text. It can't parse infinitely nested structures like HTML, but it's fast and reliable for extracting flat, predictable substrings from a document.
  • Before writing complex patterns against an HTML body, check for inline JSON-LD or framework hydration scripts like __NEXT_DATA__. A single regex match on these blocks can pull the entire page data model as a stable JSON payload that survives future layout changes.
  • Use regex only on flat, predictable targets within markup you control or monitor, such as SEO meta tags, sitemap URLs, or script source attributes.
  • Switch to a proper HTML parser like BeautifulSoup, lxml, or selectolax the moment you hit nested elements of the same tag, varying attribute orders, or frequently changing layouts.
  • For stability at scale: compile patterns once at the module level, use named capture groups, favor strict character classes over the lazy dot wildcard (.*?), decode HTML entities after matching, and add canary patterns to catch silent layout changes early.
  • For JavaScript-heavy sites or targets behind anti-bot walls, route requests through a managed scraping API to get raw rendered HTML. For millions of cached pages, use vectorized column operations via DuckDB or Polars instead of Python loops.

Parse HTML with regex: what regex and HTML actually are

People often use both regex and HTML loosely. Before we decide where regex fits, it helps to be clear about what each one really is.

The regex pieces that decide whether HTML extraction works

A regular expression compiles to a state machine that scans text for matches. Most of the syntax is well-covered in MDN's reference. But when you apply regex to HTML extraction specifically, 2 pieces matter more than the rest. 

The first is the non-greedy quantifier. The pattern .*? matches as few characters as possible. Without it, <div>.*</div> matches everything from the first <div> to the last <div> on the page, which is usually not what you want. Inside attribute values, the character class form [^"]+ is safer, because it physically can’t go past the closing quote.

The second is the dotall flag (sometimes called single-line). Without it, . doesn’t match newlines. This matters because real HTML often wraps attributes across several lines, especially in pretty-printed pages and JSON-LD blocks. You turn it on with re.DOTALL (Python), the s flag (most other engines), or [\s\S] as the explicit "any character including newline" regular expression.

Named capture groups and possessive quantifiers also matter. Lookarounds are important too, but mostly for the engines that support them, and that is in the comparison table below.

The fundamental mismatch

Regex describes regular languages, while HTML is a context-free language with arbitrary nesting. Real-world HTML is also frequently malformed in ways browsers tolerate, but no spec endorses, for example, missing close tags, attributes without quotes, and scripts containing < characters. A regex can describe slices of that mess reliably, but not the structure as a whole.

When regex and HTML still meet productively

When your target is flat and predictable, the mismatch doesn't matter. A meta tag usually sits on one line, a sitemap <loc> element has no inner tags, and a <script src="..."> attribute is a fixed wrapper around the value you want. These all sit inside regex's comfort zone.

For tree-shaped data, switch to a proper parser. Data parsing the document into a tree and querying it with selectors stays correct as the markup changes, but regex breaks the first time someone reorders attributes. The same applies for any web scraping project where the markup itself is the data, not only the wrapper around it.

Step-by-step process to parse HTML with regex

Each step below is a decision point. So, the choices you make at, say, steps 3 and 5 cause more regex bugs in practice than any syntax mistake.

Step 1 – set up your environment

Start off by picking a language whose regex engine you trust, and whose escape rules you can remember. The defaults are all production-grade:

  • Python's re module (and the richer third-party regex package, which has supported Unicode 17.0.0 since its October 2025 release). Current stable Python is 3.14+.
  • JavaScript's built-in RegExp with the v flag for Unicode set notation. The flag is available since Node.js 20.12 and stable in the current 22 LTS and 24 LTS lines.
  • PHP's PCRE2. PHP 8.4 raised the named-capture-group label limit from 32 to 128 characters by bundling PCRE2 10.44; PHP 8.5 and later inherit that ceiling.
  • Go's regexp package, which uses RE2 and runs in linear time but doesn't support lookarounds or backreferences.

Engine

Lookarounds

Backreferences

Possessive quantifiers / atomic groups

Linear-time guarantee on hostile input

Python re (3.11+)

Yes

Yes

Yes

No

Python regex (third-party)

Yes

Yes

Yes

No

JavaScript RegExp

Yes

Yes

No

No

PHP PCRE2

Yes

Yes

Yes

No

Go regexp (RE2)

No

No

N/A (linear by design)

Yes

The trade-off matters mostly at scale. RE2-style engines (Go and Python through the re2 binding) refuse some patterns, but they guarantee the engine won't run forever on hostile input. PCRE-style engines (Python re, JavaScript RegExp, and PHP PCRE2) support more features, but they can hit catastrophic backtracking on adversarial input.

Also decide how you handle encoding, and do it early. HTML pages typically arrive as UTF-8, but Latin-1 and Windows-1252 still show up on older sites. If you get the encoding wrong, your pattern matches against garbled text (mojibake), and it silently corrupts any string with diacritics or currency symbols.

Step 2 – fetch the HTML

Use a real HTTP client. In Python, that is httpx or requests. httpx is recommended for new code, while requests, while widely used, works best in maintenance mode. In Node.js, it is fetch or undici, in PHP, it is Guzzle, and in Go, it is net/http. Don't read from raw sockets.

Set a realistic User-Agent and Accept-Language header. Stripped-down headers are one of the fastest ways to get rate-limited, served a mobile-only variant, or routed to a stripped HTML cache that is missing the tags you came for.

Wrap the fetch in retry-with-exponential-backoff, so transient timeouts and 503s retry instead of silently poisoning your data. A decorator fromtenacity or stamina (Python) handles this in a few lines.

Capture the raw response body before any parsing, and never send a partially decoded string to your regex. The encoding mismatch produces empty matches that look like a pattern bugwhen they aren’tt.

When the response itself comes back as a 403, a 429, or a CAPTCHA page, the bug is in the fetch layer, not the pattern.Dedicated site unblockers help mitigate these by handling rotation, fingerprinting, JavaScript rendering, and common CAPTCHA challenges. With an unblocker, you get back rendered HTML that your regex can read instead of a block page. 

Step 3 – inspect the markup before writing the pattern

Open the page in DevTools (via the Inspect element menu) and copy the actual HTML around your target node. Then click View Page Source and copy the same area from the raw response. These 2 views differ on any site where JavaScript builds or changes the page: DevTools shows the rendered DOM, while View Page Source shows what your HTTP client actually receives.

The page a human sees and the HTML a scraper receives are different documents. The rendered DOM is full of results; the raw response is a near-empty shell. That is why a regex against it returns zero matches, and the signal to reach for a renderer or hydration data, not a smarter pattern.

Rendered DOM vs raw HTML

Note the boundary of your target. Is the value always on one line, always lowercase, always quoted, and always with the same attribute order? Every "always" is a constraint you can put into the pattern, and that is what makes a regex narrow enough to survive.

Step 4 – craft the pattern

Start narrow, then widen. Lock the attributes you know are stable and leave only the variable slice as a capture group. A pattern like <meta property="og:title" content="([^"]+)"> extracts the title from a single meta tag in one line because it locks 5 tokens and only varies 1 value.

Inside attribute values, prefer a negated character class over the dot wildcard. The class [^"]+ is safer than .+? against unexpected quote nesting, because it can’t drift past the closing quote. Use the dotall flag if your target spans several lines.

Test against 3 samples: a happy-path page, a page with a known edge case (entity-encoded characters, or a different attribute order), and a deliberately broken snippet. If the third sample matches when it shouldn't, your pattern is too loose.

Testing the pattern on regex101 before it ships with: the won't og: tags – they should match and expose the name and content groups, while the unrelated description tag should beis correctly skipped.

Testing the pattern on regex101

Step 5 – extract and normalize the data

First, get rid of HTML entities like &&#x27; and  yourself, because the match is the literal source text, and the engine doesn’t decode entities for you.

Also, cast types where it matters. For example, convert prices to numbers, dates to ISO 8601 strings, URLs to absolute form using the page's base URL. If you send a URL with a leading slash to a downstream system, nothing in that URL says which host it came from.

Step 6 – validate, store, and monitor

It’s important to validate every extracted record against a schema before you store it. An empty match is one of the most common failure modes, and if it fails, it should do it loudly.

Depending on your volume, save your results to CSV, JSON Lines, or a database. Also make sure to record the source URL and the timestamp alongside each row so you can reproduce the extraction when the site changes.

Finally, set an alert for when the daily record count drops by more than a small threshold. This catches markup changes before they turn into a silent data outage.

3 habits that decide whether regex extraction works

These 3 habits decide whether the use cases below work on the first try. The question is not which regex you write, but where you point it, and what you check around it.

Find the page's hydration data before writing any body pattern

Modern JavaScript frameworks ship a copy of the page's state as JSON inside the HTML, so the React or Vue runtime can hydrate the client without fetching again. Most engineers who write regex against the rendered body never check for this. These are the framework-specific markers worth looking for first:

  • <script id="__NEXT_DATA__" type="application/json"> – Next.js Pages Router. Still used by many production sites (verified live on Notion at the time of writing).
  • self.__next_f.push([1, "..."]) – Next.js App Router. Streaming React Server Components payload chunks. Used by Next.js's own homepage, Vercel, Hashnode, Product Hunt.
  • Nuxt's serialized payload script (the exact variable name has changed across major versions; search the page source for __NUXT__ or nuxt-payload).
  • window.__remixContext= – Remix (an inline-script global, not an id-tagged block; React Router v7 moved it, so also scan for __staticRouterHydrationData).
  • window.__INITIAL_STATE__= – generic SSR/SPA convention, common with Vue, Redux SSR, and custom setups.

The cleanest case is the Next.js Pages Router pattern. One regex pulls a single JSON blob that usually holds most of the page model, including the props the page was rendered with, the metadata,and even data fetched on the server that has not rendered yet.

import json
import re
from pprint import pp
import httpx
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; hydration-extract/1.0)"}
html = httpx.get(
"https://www.notion.so/",
headers=HEADERS,
timeout=15,
follow_redirects=True,
).text
NEXT_DATA = re.compile(
r'<script id="__NEXT_DATA__"[^>]*>(?P<json>.*?)</script>',
re.IGNORECASE | re.DOTALL,
)
m = NEXT_DATA.search(html)
if m:
state = json.loads(m.group("json"))
pp(list(state["props"]["pageProps"].keys()))

Real output against notion.so at the time of writing:

['metadata',
'pageContent',
'locale',
'platform',
'experimentProps',
'variant',
'canonicalUrl',
'preview',
'geo',
'__dev__',
'_sentryTraceData',
'_sentryBaggage']

That single regex returned 12 fields in one match, many of which you would otherwise chase with separate body patterns. On a Next.js commerce product page, the same block usually carries most of the page model. This includes elements like product fields, related products, pagination state, and data the framework fetched on the server but has not rendered yet.

The Next.js App Router pattern needs more work. The __next_f chunks are React Server Components payload fragments, not raw JSON, so you join them together and walk the result with a small parser (or use a maintained library like nextjs-hydration-parser). The extra work is worth it, because hydration data is more stable across redesigns than the rendered HTML.

A Next.js App Router page ships its content as JSON inside streaming script chunks. One regex on that block often returns most of the page model, and it survives redesigns better than the rendered HTML.

A Next.js App Router page

The remaining markers work the same way. You slice the serialized state out of its <script> block, or for the window.__X__= globals like Remix and __INITIAL_STATE__ you capture the text after the =. Then you parse it with json.loads for the JSON ones, and a format-specific decoder for the rest.

So, before you write any body pattern, search the page source for these markers. If one matches, your body regex only needs to cover the fields that the JSON doesn’t carry.

Carve away the noise, but not the signal

The default mental model is to write a pattern that captures what you want. The inverse is often more stable: write patterns that remove what you don't want, and then treat whatever is left as the signal.

import re
REMOVE = [
re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL),
re.compile(r"<style\b[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL),
re.compile(r"<nav\b[^>]*>.*?</nav>", re.IGNORECASE | re.DOTALL),
re.compile(r"<footer\b[^>]*>.*?</footer>", re.IGNORECASE | re.DOTALL),
re.compile(r"<!--.*?-->", re.DOTALL),
]
def carve_away(html: str) -> str:
for pattern in REMOVE:
html = pattern.sub("", html)
return html

This works when the boilerplate is more stable than the content. Headers, footers, navigation menus, and analytics scripts rarely change between releases, while the main content layout is reworked often. So, a carve-away pipeline often survives redesigns that would break a carve-out pattern. This is also the cleanup step most articles skip when they prepare HTML for an LLM ingestion pipeline. They write a fragile selector for the main content area, when removing the predictable noise would do the same job with a quarter of the maintenance.

For production-grade article-only extraction (boilerplate removal, language detection, metadata), trafilatura and readability-lxml are the mature Python libraries that ship this carve-away approach plus more.

Add a canary pattern that catches silent page changes

The most dangerous regex failure in production is not a thrown exception. It is the page changing just enough that a pattern returns wrong data instead of no data. A canary pattern catches this.

Pick something that should always be present, and always look the same on a healthy fetch: the site's copyright line, the logo image filename, the meta charset declaration, a known JSON-LD context. Compile it as a separate pattern. Run it on every fetch BEFORE the real patterns. If the canary fails, the page did not load correctly, or the structure changed – so flag and skip, instead of shipping corrupted records.

import re
CANARY = re.compile(r'<meta\s+charset="utf-8"', re.IGNORECASE)
def fetch_validated(url: str, fetch) -> str:
html = fetch(url)
if not CANARY.search(html):
raise ValueError(
f"Canary failed for {url}; page structure may have changed"
)
return html

Pair the canary with the daily record-count alert. The canary catches fast structural breaks right away, while the count alert catches slow content degradation over time. But neither one catches plausible-but-poisoned values. So, when the data feeds something that matters, add a value-level sanity check too, such as flagging a field that never varies across a page, or a price outside a sane range.

Common use cases when you parse HTML with regex

Most regex on HTML patterns in production target one of 4 shapes. Each one has a reason why regex works there and a known point where it breaks.

Use case 1 – pulling Open Graph and meta tags for SEO audits

Marketing pages render og:titleog:imageog:descriptiontwitter:card, and the basic meta name="description" tag as flat <meta> elements inside <head>. Each tag fits on one line, the attribute order is usually stable across crawls, and there is no nesting. So, regex web scraping is at its best on shapes like this.

Here’s a working pattern against a live Wikipedia page:

import re
import httpx
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; SEO-audit/1.0)"}
html = httpx.get(
"https://en.wikipedia.org/wiki/Web_scraping",
headers=HEADERS,
timeout=10,
follow_redirects=True,
).text
OG_TAG = re.compile(
r'<meta\s+property="(?P<name>og:[^"]+)"\s+content="(?P<content>[^"]*)"',
re.IGNORECASE,
)
for m in OG_TAG.finditer(html):
print(f"{m.group('name')}: {m.group('content')}")

And here’s real output at the time of writing:

og:title: Web scraping - Wikipedia
og:type: website

The reason why regex fits here is that the target is a single attribute value inside a wrapper that rarely changes between page loads. But there are also instance of it breaking when the site injects meta tags through JavaScript after page load and the static HTML comes back blank. Regex can’t tell the difference between a tag that is not there and a tag that has not been rendered yet.

Use case 2 – extracting URLs from sitemaps and RSS feeds

Sitemaps wrap URLs in <loc> tags with no inline attributes, and RSS items wrap them in <link> tags. A regex matching <loc>(.*?)</loc> handles thousands of entries in one pass and is a common first step for crawl budget audits and content-discovery pipelines.

The following is a working extraction against the live Python documentation sitemap:

import re
import httpx
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; site-discovery/1.0)"}
xml = httpx.get(
"https://docs.python.org/sitemap.xml",
headers=HEADERS,
timeout=10,
follow_redirects=True,
).text
LOC = re.compile(r"<loc>(.*?)</loc>", re.DOTALL)
urls = LOC.findall(xml)
print(f"{len(urls)} URLs found")
for u in urls[:3]:
print(u)

And here’s the output:

8 URLs found
https://docs.python.org/3.10/
https://docs.python.org/3.11/
https://docs.python.org/3.12/

Regex works well here because the wrapper is a fixed 2-token boundary with little variability.  But keep an eye out for when feeds nest CDATA blocks (often inside RSS <description>) or wrap URLs in extra XML namespaces, as it will start breaking your scrape. To fix this, move to an XML parser. The line between crawling and scraping workflows is also where you decide if regex is the right layer at all.

Use case 3 – harvesting tracking pixels, analytics scripts, and inline JSON-LD

Security and privacy auditors often scan pages for third-party tracker URLs in script src and img src attributes. The patterns are tightly scoped, and the matches are easy to validate. The same approach extracts JSON-LD blocks (JSON for Linked Data, the schema.org structured-data format embedded in many modern pages for SEO and AI-search discoverability). You capture the contents of <script type="application/ld+json"> with regex, and then hand the JSON string to a real JSON parser.

Here’s a snippet for JSON-LD extraction against the live python.org homepage:

import json
import re
import httpx
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; ld-extract/1.0)"}
html = httpx.get(
"https://www.python.org/",
headers=HEADERS,
timeout=10,
follow_redirects=True,
).text
LD_BLOCK = re.compile(
r'<script[^>]*type="application/ld\+json"[^>]*>(?P<json>.*?)</script>',
re.IGNORECASE | re.DOTALL,
)
for m in LD_BLOCK.finditer(html):
data = json.loads(m.group("json"))
print(json.dumps(data, indent=2))

And the parsed JSON:

{
"@context": "https://schema.org",
"@type": "WebSite",
"url": "https://www.python.org/",
"potentialAction": {
"@type": "SearchAction",
"target": "https://www.python.org/search/?q={search_term_string}",
"query-input": "required name=search_term_string"
}
}

The shape here is the same as extracting email addresses or other contact patterns from a page – a known prefix, a variable middle, and a known suffix.

Using regex, the wrapper stays rigid even when the inner content varies. Notice that the script-block JSON itself goes to json.loads – regex only slices it out, and then a real JSON parser handles the structure. For teams that would rather not maintain the JSON-LD extraction pattern at all, Decodo's AI Parser packages the same extract-then-validate workflow as a managed service.

However, when the tracker tag is built by JavaScript at runtime, the static HTML is empty, and the regex returns nothing, this way breaking your workflow. For JSON-LD specifically, the break moves to the parse step with a literal </script> inside a JSON string cutting the lazy match short, and a malformed JSON making json.loads throw. So, wrap the parse in error handling, instead of trusting every block.

Use case 4 – cleaning HTML before sending it to an LLM or text pipeline

Stripping <script> and <style> blocks, HTML comments, and inline style attributes is a one-liner with regex. It cuts the token cost a lot when you are converting pages to markdown for an LLM pipeline. Use the dotall flag so that multi-line script and style blocks disappear in one pass.

import re
raw_html = """
<html>
<head><style>body{color:red}</style></head>
<body>
<script>fetch('https://tracker.example.com/pixel.gif')</script>
<h1>Article title</h1>
<p>The body text we want to send to an LLM.</p>
</body>
</html>
"""
CLEAN = re.compile(r"<(script|style)\b[^>]*>.*?</\1>", re.IGNORECASE | re.DOTALL)
cleaned = CLEAN.sub("", raw_html)
print(cleaned.strip())

After cleanup:

<html>
<head></head>
<body>
<h1>Article title</h1>
<p>The body text we want to send to an LLM.</p>
</body>
</html>

As a result, the tracking script and the style block are gone in one pass. Comments and inline style="..." attributes need their own patterns, but the principle is the same.

The upside of using regex here is that the cleanup target is anything inside a known tag pair, not the semantic structure of what is inside it. But issues appear when you need to keep the surrounding semantic structure (headings, lists, code blocks). The fix is to parse the document into a tree and run that instead.

An API built for scale

Simplify how you collect data by connecting our Web Scraping API into your workflows. Backed by 125M+ IPs across 195+ locations and 99.99% success rates.

Best practices and limitations of using regex to parse HTML

One of the most-cited Stack Overflow answers warns developers not to parse HTML with regex. That warning is correct at scale, and wrong for narrow tasks. Instead, here is a practical guide with clear rules on when regex does work, paired with the specific bugs and limitations that caused the warning in the first place.

Best practices

These rules apply whether you work in Python, PHP, Go, or JavaScript. The pattern of mistakes is the same across all engines.

  1. Compile patterns once and reuse them. Every regex engine has compile overhead. Re-compiling inside a loop wastes CPU power at scale. In Python, that means you should assign re.compile(...) to a module-level variable. In Go it, it’s regexp.MustCompile. In JavaScript, you hoist the RegExp literal out of the loop.
  2. Use named capture groups. For easeir tracking and troubleshooting, use meaningful names in your code. Named groups are (?P<url>...) in Python, and (?<url>...) in JavaScript and PHP.
  3. Prefer character classes over the dot wildcard. The class [^"]+ inside an attribute value is safer than .+?, because it can't drift past the closing quote.
  4. Anchor patterns to the surrounding markup. A pattern like <a\s+href= that ties href to its opening tag beats a free-floating href= that could match inside a comment or a JavaScript string literal.
  5. Decode HTML entities after extraction, not before pattern matching. This keeps the pattern predictable as decoding entities inside the regex mixes escaping rules with parsing rules, and makes debugging painful.
  6. Pin a sample-based regression test to every production pattern. When the source HTML changes, the test fails before your data does. A 30-line pytest file that runs the pattern against 3 pinned HTML samples is an effective habit for catching silent failures.
  7. Use possessive quantifiers or atomic groups where the engine supports them. The forms a++ (possessive) and (?>a+) (atomic) stop the engine from backtracking into the part it already matched, which removes a common source of catastrophic backtracking on adversarial input. They are available in PCRE, in Python 3.11+'s re module, in PHP's PCRE2, and in Java; but not in Go's RE2 (which is linear-time anyway) or in standard JavaScript engines.
  8. Compile patterns with the Unicode flag on multilingual pages. The class \w defaults to ASCII in some engines (Go's RE2, Python re on bytestrings), so the German ß or French é won't match without it.
  9. Save every public regex with a Regex101 or RegExr permalink in its code comment. The next maintainer (often your future self) can step through the pattern against test strings without leaving the diff.

Limitations and failure modes

Most regex extractors in production hit at least one of these. When you know the failure mode by name, the debugging goes much faster.

  • Nested elements of the same tag. A <div> inside a <div> defeats <div>(.*?)</div>, because regex can’t count nesting depth.
  • Attribute order or quoting style varies. ages might use single quotes or unquoted attributes, while others might reorder attributes between page loads (server-side templating sometimes does this).
  • HTML entities inside attribute values. A class like [^"]+ captures an encoded entity such as " or & without trouble, because the entity holds no literal quote. The value content="A "quoted" thing" comes back whole, just still encoded, so decode it after extraction. The real trap is decoding the page before matching. This turns " into a literal " that truncates the next [^"]+ mid-value.
  • Inline JavaScript and CSS contain text that looks like HTML. A naive <a href= pattern matches inside a string literal in a script tag, and returns useless data.
  • Catastrophic backtracking. A poorly written pattern with overlapping quantifiers can run for too long (or effectively forever) on a single page. This is also a denial-of-service vector when the input is hostile, often called ReDoS (Regular expression Denial of Service). Recent ReDoS CVEs against minimatchpicomatch, and Symfony YAML are all variants of the same bug class, and new ones appear regularly.
  • Encoding mismatches. Matching a UTF-8 byte stream with a pattern compiled for a Unicode string gives you silent mismatches on non-ASCII characters. Decode the response to text, and then compile the pattern as a string, not bytes.
  • JavaScript-rendered DOM. The HTML you fetch is not always the HTML the user sees. As a result, a regex against the raw response returns empty matches with no warning.
  • HTTP 200 is not a success signal. Anti-bot block pages, CAPTCHA pages, and JavaScript challenge landing pages frequently return 200 with HTML that contains zero of the data you asked for. Even managed unblocker APIs can return 200 for what is actually a CAPTCHA response. A regex pattern matched against that block page returns garbage – or empty matches that look like a legitimate "no results" page. Production scrapers verify they got the right kind of page: a quick check on script-to-text ratio (block pages are mostly script and almost no body text), a canary substring in the body, or a fingerprint of the expected response shape catches this before extraction runs.

Seeing the failures in code

2 of the above failure modes might seem abstract in the bullet list, but become more obvious code.

Nested elements break greedy and non-greedy alike:

import re
html = "<div>outer<div>inner</div>more outer</div>"
GREEDY = re.compile(r"<div>(.*)</div>")
NONGREEDY = re.compile(r"<div>(.*?)</div>")
print(f"greedy: {GREEDY.search(html).group(1)!r}")
print(f"non-greedy: {NONGREEDY.search(html).group(1)!r}")

Both matches are wrong in different ways:

greedy: 'outer<div>inner</div>more outer'
non-greedy: 'outer<div>inner'

In this particular case, neither match is what’s actually needed. The greedy version swallows the inner closing tag and the non-greedy version captures the inner opening tag as content. Regex can't see that the second <div> opens a new scope, so any pattern that matches <div> to <div> lies about the structure of the page.

Catastrophic backtracking looks innocent and isn't:

import re, time
EVIL = re.compile(r"^(a+)+$")
for n in (10, 18, 22, 26):
text = "a" * n + "b" # the trailing 'b' forces every path to exhaust
start = time.perf_counter()
EVIL.match(text)
print(f"{n:3} a's + b: {time.perf_counter() - start:8.4f} s")

Real output on a typical computer will look something like this:

10 a's + b: 0.0000 s
18 a's + b: 0.0074 s
22 a's + b: 0.1206 s
26 a's + b: 1.9172 s

The pattern ^(a+)+$ looks harmless, but the runtime multiplies by roughly 16 every 4 characters – exponential growth from a single overlapping quantifier. If the input is user-controlled, this is a denial-of-service weapon. When reviewing your code, flag any pattern with nested quantifiers like (...+)+ or (...*)+, and go with the possessive quantifiers or atomic groups from best practice #7 when the engine supports them.

The decision rule, in one line

If the target is one flat substring inside well-formed markup that you control or monitor, regex is fine. But for everything else, use an HTML parser instead.

When you switch, the right parser depends on your stack: 

  • In Python, Beautiful Soup handles forgiving HTML well
  • lxml is faster for large documents and supports XPath
  • selectolax (a Rust-backed binding to the Lexbor engine) is even faster  when you need to parse millions of pages and don’t need XPath
  • In Node.js, Cheerio gives you jQuery-style selectors over a parsed tree

From there, pick the right selector strategy for your target's structure.

Beyond the basics: production-grade techniques

The decision rule above sends tree-shaped data to a parser. When regex is the right call, these 6 shifts decide whether the basics hold up once the page, the volume, or the pipeline gets messier.

Treat JSON-LD as the default extraction target

You can pull a JSON-LD block by matching the <script type="application/ld+json"> wrapper with regex and handing its contents to a JSON parser. This particular approach should be your first attempt on commerce, news, and content pages, not the fallback. AI search engines reward structured data, so publishers ship more of it and more reliably than they used to. 

The schema types worth scanning for on first inspection are Product, Offer, AggregateRating, Article, NewsArticle, BreadcrumbList, FAQPage, Recipe, Event, Organization, Person. When one of these is present and matches your target, the JSON-LD block usually carries most of what you would otherwise build several body patterns to extract. Body patterns are only for the fields that the JSON-LD doesn’t cover.

Generate patterns with a language model, then validate against pinned samples

Writing a tricky regex by hand still works, but handing a representative HTML snippet to a strong AI model is more efficient. Give the LLM your snippet,, describe what to extract, get a pattern back, validate it against the same pinned samples (a happy-path page, a known edge case, and a deliberately broken snippet), and ship if it passes.

A prompt shape that works reliably:

Write a Python re pattern that captures the price from this HTML, allowing for variations in currency symbol position (before or after the number) and either dot or comma as the decimal separator. Use a named capture group called price. Return the pattern only.

Sample HTML:
 <span class="product-price">$19.99</span>

<span class="product-price">19,99 €</span>

<span class="product-price">£1,299</span>

A useful habit is to store the original prompt next to the regex in code comments. When the markup changes, the next maintainer can regenerate the pattern with the same context, instead of reverse-engineering what the previous author was trying to capture.The same validate-against-pinned-samples discipline applies whether the pattern came from a prompt you wrote, or from a managed service running on your behalf.

You can automate that loop by setting up that a failed canary or regression test triggers regeneration from the stored prompt, then re-validation before the new pattern ships. That is the regex version of what the industry now calls a self-healing scraper.

Wrap matches in Pydantic models so silent failures fail loudly

Regex returns strings, while Pydantic v2 validates structure, types, and constraints. Putting them together is the anti-corruption pattern for scraped data whereby the regex extracts, and the model decides whether what came out is actually usable.

import re
from typing import Annotated
from pydantic import BaseModel, StringConstraints, ValidationError
META_TAG = re.compile(
r'<meta\s+name="(?P<name>[^"]+)"\s+content="(?P<content>[^"]*)"',
re.IGNORECASE | re.DOTALL,
)
NonEmpty = Annotated[str, StringConstraints(strip_whitespace=True, min_length=1)]
class MetaTag(BaseModel):
name: NonEmpty
content: NonEmpty
samples = [
'<meta name="description" content="A real page">',
'<meta name="description" content="">',
'<meta name="robots" content="noindex">',
]
for s in samples:
for m in META_TAG.finditer(s):
try:
tag = MetaTag(name=m.group("name"), content=m.group("content"))
print(f"OK: {tag.name} -> {tag.content!r}")
except ValidationError as e:
print(f"FAIL: {e.errors()[0]['type']} on input {s!r}")

What the script prints:

OK: description -> 'A real page'
FAIL: string_too_short on input '<meta name="description" content="">'
OK: robots -> 'noindex'

The empty-content tag matches the regex, but fails validation. The right production behavior is to log the validation failure as a metric, and alert when the rate crosses a threshold, instead of letting silent empty strings flow into the database.

One more thing worth knowing is that Pydantic v2's own pattern validators use Rust's regex engine by default (non-backtracking, ReDoS-resistant). Switch the regex engine to python-re if you need lookarounds or backreferences in a Pydantic field.

Detect catastrophic backtracking before it ships, not after

The ReDoS demo earlier was harmless at 26 characters, but on user-supplied input of any length, the same pattern becomes a denial-of-service weapon. Several static analyzers catch vulnerable patterns at code review time.

  • recheck (makenowjust-labs) – a well-known ReDoS detector, written in Scala with a JavaScript/TypeScript surface and an ESLint plugin. Available as a long-running RPC server for CI integration.
  • redos-detector (tjenkinson) – CLI and library that scores how vulnerable a regex pattern is to ReDoS and runs in Node, the browser, and Deno. Ships its own ESLint plugin.
  • GitHub CodeQL ships ReDoS queries for JavaScript (js/redos) and Python that flag patterns in code review.
  • For Python projects specifically, bandit catches a basic set, and redos-analyzer walks Python's own sre_parse abstract syntax tree to detect vulnerable patterns and propose auto-fixes using atomic groups.

Add the check as a CI gate. Any new regex pattern that flags as vulnerable blocks the pull request. For patterns that come from an external source (like config files, user input, or language-model output), validate the pattern before re.compile() ever runs on it.

For mission-critical patterns, the safe migration is the same shadow-deploy idea used for service rewrites. You run the old pattern and the new pattern in parallel for a few days, log the difference between their outputs, and only promote the new pattern after a quiet period where the difference stays empty. While it is overkill for a one-off meta-tag fix, it is the right call when the output reaches a downstream system that you can’t easily reverse.

Extract in bulk with DuckDB or Polars instead of Python loops

When you have thousands or millions of HTML pages already on disk, the Python for loop you would write to apply a regex to each one is the slow path. The fast path is column-oriented bulk extraction with DuckDB or Polars. DuckDB is an in-process analytical database (think SQLite, but tuned for analytical queries); Polars is a Rust-based DataFrame library (think Pandas, but parallelized and faster). Both expose regex as a vectorized operation over a dataframe column.

DuckDB ships regexp_extract and regexp_extract_all as SQL functions backed by RE2, so you can run regex across a column of HTML strings in parallel without writing a loop at all:

import duckdb
con = duckdb.connect()
con.execute(r"""
CREATE TABLE pages AS
SELECT * FROM (VALUES
('<meta property="og:title" content="Page 1"><a href="/a">x</a><a href="/b">y</a>'),
('<meta property="og:title" content="Page 2"><a href="/c">z</a>'),
('<meta property="og:title" content="Page 3"><a href="/d">w</a><a href="/e">v</a><a href="/f">u</a>')
) AS t(html);
""")
result = con.execute(r"""
SELECT
regexp_extract(html, '<meta\s+property="og:title"\s+content="([^"]+)"', 1) AS og_title,
regexp_extract_all(html, '<a\s+href="([^"]+)"', 1) AS hrefs,
len(regexp_extract_all(html, '<a\s+href="([^"]+)"', 1)) AS link_count
FROM pages;
""").fetchall()
for row in result:
print(row)

Here’s the result:

('Page 1', ['/a', '/b'], 2)
('Page 2', ['/c'], 1)
('Page 3', ['/d', '/e', '/f'], 3)

Polars exposes the same shape through pl.col("html").str.extract(...) and str.extract_all(...) over a column. Both can run much faster than a Python iteration on corpora in the millions-of-pages range because the regex engine runs at native speed across the whole column.

For responses too large to load whole (in cases like Common Crawl shards, very large sitemaps, LLM-training data sources at 100+ MB per file), stream the response and run pattern.finditer() against a sliding buffer instead. httpx.stream("GET", url) with iter_text() is the Python idiom, but equivalents exist in every language. The principle is to not load the whole response into Python’s memory when the regex engine can scan it as it arrives.

In a typical pipeline, the scraper stores raw (or compressed) HTML as a column in Parquet, and then DuckDB or Polars runs extraction queries against that column whenever a new field is needed. The analytical layer replaces the scraper as the place where you change patterns. This separates the extraction logic from the crawl schedule, which matters when the markup changes and you need to rerun against historical data without crawling again.

Scale to thousands of patterns with Vectorscan

For large-scale pattern matching, including security scanning, content classification, and large-corpus filtering for LLM training data, Vectorscan is the right tool for the job. Vectorscan is a BSD-licensed community fork of Intel's Hyperscan that exists because Intel moved Hyperscan to a proprietary license in May 2024. The fork carries the last open-source version (5.4) forward and adds non-Intel platform support.

Vectorscan compiles patterns into a single automaton, and then matches them in parallel using SIMD instructions. It supports x86, ARM NEON, and Power VSX as production platforms, with ARM SVE2 support in active development. For Python, there are bindings, but for production-volume work, most teams use it from C++ or Rust.

When regex isn't enough: scaling with a Web Scraping API

There are 3 main problems that appear when data parsing scrales upwards: JavaScript-rendered content, anti-bot defenses, and markup that changes often. The fix is not a smarter pattern. It is a different request path, one designed to return rendered HTML your existing regex can still read.

3 signals it's time to escalate

Each signal points to a different infrastructure gap and none of them is a pattern-writing problem. First, rule out the simple cases like an increased amount of empty or blocked-looking responses that are just malformed URLs or 404s. Confirm the status code and the URL before you escalate. Escalate if:

  1. Your regex returns zero matches on a page that clearly contains the data when you open it in a browser. This is almost always a JavaScript-rendering issue,which you can fix in-house with a headless browser like Playwright, or escalate to a rendering API.
  2. You start seeing 403, 429, or CAPTCHA HTML in your responses. These are anti-bot tripwires triggered by your traffic pattern. Modern anti-bot stacks (Cloudflare, DataDome, and others) fingerprint the TLS handshake itself (the JA3/JA4 fingerprint), so rotating User-Agent strings or adding headers won't fix this. The main ways through are a TLS-impersonating HTTP client (such as curl_cffi or curl-impersonate, which copy a real browser's handshake), a real browser session, or a proxy endpoint designed to handle the fingerprinting.
  3. Your patterns need a rewrite every week because the source markup changes often. That is a sign that you should call a maintained extractor instead of bare HTML.

What you escalate to

3 deployment shapes cover almost every escalation path, and the choice between them depends mostly on how much of the fetching layer you want to own yourself.

  1. A managed scraping API. Stop running a fetch+render layer entirely. An API, like Decodo’s Web Scraping API, returns fully-rendered HTML in most cases, so your existing regex usually keeps working against the response body. The trade-off is a per-request cost, and a third-party dependency in your fetch path. 
  2. An in-house scraper based on a residential proxy pool. Unlike the managed API, you keep running your own scraper. The code, the patterns, the rendering, and the retry logic all stay in your stack, and the only thing that changes is the egress IP your requests use. Decodo's residential proxies work with existing Python, Node, and Go HTTP clients over HTTP basic authentication. The trade-off is that JavaScript rendering, CAPTCHA handling, and retry logic stay your responsibility.
  3. LLM-based extraction. This one is work great with pages where the data is semantic rather than structural. Think of a product description with the shipping time buried in prose, or a review where the rating is implied rather than tagged. Language models handle these in a way regex can't, at the cost of per-page latency and per-token spend. The common pattern is to combine the 2: regex (or a parser) for the structured fields, and the LLM for the unstructured remainder. That way the structured fields stay deterministic and cheap, and the token spend goes only to the part of the page that actually needs reasoning.

Full working script

The script below extracts the <title> and every <meta name="..."> tag from a book product page on books.toscrape.com. It shows the 6 steps from this guide end to end: module-level compile, named capture groups, character classes inside attribute values, entity decoding after extraction, schema validation, and a pinned regression test. The proxy block is commented out, so the script runs without credentials; uncomment it to route through Decodo's gateway.

"""
parse_html_with_regex.py
Extract <title> and <meta> tags from a stable sandbox target.
Run:
pip install httpx
python parse_html_with_regex.py
"""
import html
import json
import re
import sys
from datetime import datetime, timezone
import httpx
TARGET_URL = (
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
)
# Proxy integration - uncomment and fill in credentials to route through Decodo.
# proxy_username = "YOUR_PROXY_USERNAME"
# proxy_password = "YOUR_PROXY_PASSWORD"
# PROXY = f"http://{proxy_username}:{proxy_password}@gate.decodo.com:7000"
PROXY = None
# Production-shaped template, so this sends a current browser User-Agent;
# the earlier teaching snippets identify honestly as bots (their targets are
# scraping-friendly). Keep the version current (a stale UA is itself a bot
# signal), or use curl_cffi / a proxy to manage the full browser fingerprint.
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/148.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
}
# Compile patterns once at module level.
# Note: META_NAME assumes attribute order name -> content, which is stable
# on this sandbox. On a site with variable order, widen the pattern or
# switch to a parser (see "The decision rule, in one line").
META_NAME = re.compile(
r'<meta\s+name="(?P<name>[^"]+)"\s+content="(?P<content>[^"]*)"',
re.IGNORECASE | re.DOTALL,
)
META_PROPERTY = re.compile(
r'<meta\s+property="(?P<name>(?:og|twitter):[^"]+)"\s+content="(?P<content>[^"]*)"',
re.IGNORECASE | re.DOTALL,
)
TITLE_TAG = re.compile(
r"<title[^>]*>(?P<title>.*?)</title>",
re.IGNORECASE | re.DOTALL,
)
def fetch_html(url: str) -> str:
with httpx.Client(
proxy=PROXY,
headers=HEADERS,
timeout=15,
follow_redirects=True,
) as client:
response = client.get(url)
response.raise_for_status()
return response.text
def clean_value(raw: str) -> str:
# Decode entities and trim whitespace at extraction time, not in the pattern.
return html.unescape(raw).strip()
def extract_meta(raw_html: str, source_url: str) -> dict:
record = {
"source_url": source_url,
"fetched_at": datetime.now(timezone.utc).isoformat(),
"title": None,
"meta": {},
}
title_match = TITLE_TAG.search(raw_html)
if title_match:
record["title"] = clean_value(title_match.group("title"))
for pattern in (META_NAME, META_PROPERTY):
for match in pattern.finditer(raw_html):
name = match.group("name")
record["meta"][name] = clean_value(match.group("content"))
return record
def validate(record: dict) -> None:
# An empty title is the most common silent-failure mode - fail loudly.
if not record["title"]:
raise ValueError(
"Empty title - check that the target page returned full HTML."
)
if not record["meta"]:
raise ValueError(
"No meta tags found - the markup may have changed."
)
def regression_test() -> None:
# Pinned sample. If the production pattern changes, this fails
# before the live data does.
sample = (
"<html><head>"
"<title>Test Page</title>"
'<meta name="description" content="A & B sells C's books" />'
'<meta property="og:title" content="OG Title" />'
"</head><body></body></html>"
)
record = extract_meta(sample, "test://sample")
assert record["title"] == "Test Page", record["title"]
assert record["meta"]["description"] == "A & B sells C's books"
assert record["meta"]["og:title"] == "OG Title"
def main() -> int:
regression_test()
try:
raw_html = fetch_html(TARGET_URL)
except httpx.HTTPError as error:
print(f"Fetch failed: {error}", file=sys.stderr)
return 1
record = extract_meta(raw_html, TARGET_URL)
validate(record)
print(json.dumps(record, indent=2, ensure_ascii=False))
return 0
if __name__ == "__main__":
sys.exit(main())

Run the script with:

pip install httpx
python parse_html_with_regex.py

Sample output (truncated):

{
"source_url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"fetched_at": "2026-05-25T12:34:56.789012+00:00",
"title": "A Light in the Attic | Books to Scrape - Sandbox",
"meta": {
"created": "24th Jun 2016 09:29",
"description": "It's hard to imagine a world without A Light in the Attic...",
"viewport": "width=device-width",
"robots": "NOARCHIVE,NOCACHE"
}
}

Notice that the description value is decoded. That decoding happens after the match, not inside the pattern, and this is what keeps the pattern predictable across pages with different entities.

Bottom line

Regex is the right tool on flat, predictable targets, such as a meta tag, a sitemap URL, a script src attribute, or an inline JSON-LD block, all inside markup you control or monitor. The moment that markup nests, changes often, or only appears after JavaScript runs, stop tuning the pattern and switch to a parser. Decide which side your task is on before you write a single pattern.

On the regex side, writing the pattern is the easy part. The real work is around it. You fetch with a proper HTTP client such as httpx, match with Python's re, and back each pattern with a regression test, a Pydantic check, and a ReDoS scan, so nothing ships silently broken. Across millions of stored pages, DuckDB or Polars run the same patterns over a whole column at once. When the task belongs to a parser instead, Beautiful Soup, lxml, and selectolax are the equivalents.

A pattern like this costs little to run, and it fails loudly instead of writing silent empty strings. If you run it on the JSON-LD or hydration data a page already ships, it survives redesigns better than a hand-written body pattern. When a page renders with JavaScript or starts blocking you, it still doesn’t change. It runs against any response that comes back as rendered HTML, whether that response is from your own headless browser or a managed service like Decodo's Web Scraping API.

Get Web Scraping API

Plug our Web Scraping API straight into your pipelines to scale your projects instantly. Gain access to 125M+ IPs across 195+ locations with a 99.99% success rate.

Share article:

About the author

Justinas Tamasevicius

Director of Engineering

Justinas Tamaševičius is Director of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.

Connect with Justinas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may belinked therein.

Frequently asked questions

Can you parse HTML with regex?

Yes, for flat and predictable patterns; no, for arbitrary nested markup. Regex handles meta tags, sitemap URLs, script src attributes, and inline JSON-LD reliably, because each one sits inside a known, fixed wrapper. For tree-shaped data with variable nesting, use an HTML parser instead.

Why do developers say you should never use regex on HTML?

Because HTML is a context-free language, and regex can’t count nesting depth. The widely shared warning is about correctness at scale, not about every use case. A regex that pulls a single attribute from a meta tag is correct and fast; the same regex pointed at arbitrary nested <div> trees returns silent garbage. The advice is really shorthand for "don't reach for regex first".

What is the fastest way to extract a single attribute from HTML?

A compiled, anchored regex with a non-greedy capture group is usually the fastest option for one-off extractions. The pattern <meta property="og:title" content="([^"]+)">, compiled once and run across thousands of pages, is faster than building a DOM tree to find the same attribute. But the speed advantage disappears as soon as you need a second attribute on the same page; at that point, parse the document once and query it.

When should I switch from regex to a parser?

Watch for 3 concrete signals: the same pattern needs a rewrite every week, an inner <div> breaks an outer <div> pattern, or the regex returns zero matches on a page that clearly shows the data in your browser. Each one is a structural problem that a parser solves and that tuning the pattern can’t.

How do I parse HTML from sites that block scrapers?

Route requests through a managed scraping API or a residential proxy pool, and then run your regex against the returned HTML. A managed API handles JavaScript rendering, proxy rotation, and common CAPTCHA challenges, so the response body usually matches what your regex was already designed to read; Decodo's Web Scraping API is one option that works this way. For lighter targets, residential proxy rotation alone is often enough to keep request success rates stable without an API in front.

Python logo rising from a bowl beside a spoon in neon magenta on dark background

Beautiful Soup Web Scraping: How to Parse Scraped HTML with Python

Web scraping with Python is a powerful technique for extracting valuable data from the web, enabling automation, analysis, and integration across various domains. Using libraries like Beautiful Soup and Requests, developers can efficiently parse HTML and XML documents, transforming unstructured web data into structured formats for further use. This guide explores essential tools and techniques to navigate the vast web and extract meaningful insights effortlessly.

Dark humanoid with red headset and cape sits at desk, monitor showing code: printf("Hello, World!");

lxml Tutorial: Parsing HTML and XML Documents

Keepin’ it short and sweet: data parsing is a process of computer software converting unstructured and often unreadable data into structured and readable format. Parsing offers a lot of benefits, some of which include work optimization, saving time, reducing costs, and many more; in addition, you can use parsed data in plenty of different situations.

Even tho that sounds epic, parsing itself can be quite complicated. But hold on, buddy, and get ready to explore a step-by-step process on how to parse HTML and XML documents using lxml.

Document labeled 'XML' showing code lines on a dark background with neon magenta outline

Parsing XML in Python – The Ultimate Guide

Standards are a means to clear and define communication between people and things in the world. For example, the human language, USB sockets on computers, or the fact that you must add cereal before pouring milk. When it comes to computer applications and systems, one standard stands out above the rest as the most popular choice for developers – XML (eXtensible Markup Language). In this article, we’ll explore how you can parse data from XML files using Python’s built-in libraries, see the best methods to do so, and understand the importance of effectively reading information.

© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved