Request orchestrator

The orchestrator is your control plane. It manages the URL queue, assigns tasks to workers, tracks job state, and enforces scheduling logic: deciding what gets scraped, when, and how aggressively.

In modern stacks, this runs on distributed task queues like Celery, Bull, or cloud-native alternatives like AWS Step Functions. It needs to support priority-based scheduling (high-value URLs first), deduplication (don't re-scrape unchanged pages), and dynamic concurrency adjustment based on target-site response patterns.
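As a rough illustration of the first two requirements, the sketch below pairs a priority queue with two kinds of deduplication: refusing to queue a URL twice, and detecting an unchanged page by hashing its body. The class and method names (CrawlFrontier, push, pop, is_changed) are hypothetical, and a production system would keep this state in the task queue's backing store rather than in process memory.

```python
import hashlib
import heapq
import itertools


class CrawlFrontier:
    """In-memory URL queue with priorities and dedup (hypothetical helper)."""

    def __init__(self):
        self._heap = []                    # entries: (priority, sequence, url)
        self._counter = itertools.count()  # tie-breaker: FIFO order within one priority
        self._queued = set()               # URLs currently waiting in the queue
        self._content_hashes = {}          # url -> hash of the last body seen

    def push(self, url, priority):
        """Queue a URL; lower numbers are served first. Returns False for duplicates."""
        if url in self._queued:
            return False
        self._queued.add(url)
        heapq.heappush(self._heap, (priority, next(self._counter), url))
        return True

    def pop(self):
        """Return the most urgent URL, or None when the queue is empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        self._queued.discard(url)
        return url

    def is_changed(self, url, body):
        """Record the body hash and report whether it differs from the last fetch."""
        digest = hashlib.sha256(body).hexdigest()
        changed = self._content_hashes.get(url) != digest
        self._content_hashes[url] = digest
        return changed
```

Hashing the raw body is the bluntest possible change check: pages with rotating ads or timestamps would hash differently on every fetch, so real pipelines usually hash the extracted fields instead.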

More advanced systems now use ML models to optimize recrawl frequency per URL. A static blog post doesn't need daily refreshes, but an eCommerce product page with volatile pricing might need hourly updates.
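A much simpler stand-in for such models is a multiplicative rule: shorten the recrawl interval when a page changed since the last visit, lengthen it when it did not. The sketch below is that heuristic, not an ML model, and the function name, the halving and doubling factors, and the one-hour and 30-day bounds are all assumptions chosen for illustration.

```python
def next_recrawl_interval(current_hours, changed, min_hours=1.0, max_hours=24.0 * 30):
    """Heuristic recrawl scheduling: halve the interval after a change, double it otherwise.

    The factors and the bounds are illustrative assumptions, not tuned values.
    """
    factor = 0.5 if changed else 2.0
    proposed = current_hours * factor
    # Clamp so volatile pages never go below hourly and static pages still get revisited.
    return max(min_hours, min(max_hours, proposed))
```

Under this rule a product page whose price keeps changing converges toward hourly crawls, while a static blog post drifts toward the monthly ceiling.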

Proxy layer

The proxy layer sits between your orchestrator and the open web. It handles IP rotation, session management, and geographic targeting. At scale, this is typically the largest variable cost, and the layer where misconfiguration has the highest impact.

Your orchestrator should never care which specific IP a request uses. It requests a connection from the proxy layer that matches the target's requirements.
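One way to picture that boundary: the orchestrator describes what the target needs, and the proxy layer maps the description to a gateway. The sketch below assumes a hypothetical in-house ProxyLayer class and placeholder gateway URLs; real providers expose this through their own gateway hostnames and parameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TargetRequirements:
    """What the orchestrator knows about a target (hypothetical type)."""
    proxy_type: str                # e.g. "datacenter", "residential", "mobile"
    country: Optional[str] = None  # ISO country code, or None for any location


class ProxyLayer:
    """Maps requirements to a gateway so callers never handle individual IPs."""

    def __init__(self, gateways):
        # gateways: dict keyed by (proxy_type, country) -> gateway URL.
        self._gateways = dict(gateways)

    def connection_for(self, requirements):
        key = (requirements.proxy_type, requirements.country)
        try:
            return self._gateways[key]
        except KeyError:
            # Fail loudly instead of silently downgrading geo or proxy type.
            raise LookupError(f"no gateway configured for {key}") from None
```

Raising on a missing combination is a deliberate choice here: quietly falling back to a different country or proxy type is exactly the kind of misconfiguration that is expensive to notice later.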

Rendering layer

Not every page needs a browser, and spinning one up when you don't have to adds unnecessary compute costs. Static HTML pages can be fetched with lightweight HTTP clients at a fraction of the cost.

But JavaScript-heavy sites require a headless browser. Production teams typically run rendering clusters with Playwright under Kubernetes, keeping memory predictable by recycling browser contexts between jobs. When a target exposes JSON endpoints or serves partial HTML, the system should skip rendering entirely.

Parser

The parser handles actual data extraction, turning raw HTML or JSON into structured records. At scale, parsers need to be versioned, testable, and resilient to minor DOM changes.

AI-driven extraction now supplements traditional CSS/XPath selectors. Across 3,000 pages, a McGill University study found AI extraction hit 98.4-100% accuracy, with semantic approaches maintaining performance even when page structures changed, compared to rigid selectors that broke on every redesign.

But AI extraction has a critical limit: numerical precision. In testing, LLM-based scrapers regularly confuse VAT-inclusive and VAT-exclusive pricing, misread currency formats, and mix up unit and total prices, with error rates high enough to make them unusable for mission-critical data.
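The currency-format failure in particular is one that deterministic code can rule out by refusing to guess. The sketch below assumes you know each target site's decimal separator in advance and pass it in explicitly; parse_price is a hypothetical helper, and it does not attempt VAT or unit-versus-total logic.

```python
import re
from decimal import Decimal


def parse_price(text, decimal_sep):
    """Parse a price string using the site's known decimal separator ('.' or ',')."""
    if decimal_sep not in (".", ","):
        raise ValueError("decimal_sep must be '.' or ','")
    thousands_sep = "," if decimal_sep == "." else "."

    # Grab the first run of digits, separators, and whitespace.
    match = re.search(r"\d[\d.,\s]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")

    # Drop spaces (including non-breaking ones) and trailing punctuation.
    digits = re.sub(r"\s", "", match.group()).rstrip(".,")
    digits = digits.replace(thousands_sep, "")
    if digits.count(decimal_sep) > 1:
        raise ValueError(f"ambiguous number in {text!r}")

    # Decimal keeps exact cents; float would introduce rounding error.
    return Decimal(digits.replace(decimal_sep, "."))
```

Passing the wrong separator still produces a wrong number, so the separator belongs in per-site configuration that is reviewed and tested, not inferred per page.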

The cost gap is wide: leading LLM extraction tools cost $666-$3,025 per 1M pages, compared to $1-5 per 1M for deterministic scrapers (ScrapeOps, 2025). Most teams use a hybrid: deterministic code for high-stakes numerical fields, AI for layout resilience and unstructured content.
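To see why the hybrid wins on cost, it helps to run the arithmetic. The sketch below uses the per-million figures quoted above and makes one simplifying assumption that is not in the source: that the split can be modeled as a share of pages routed through AI extraction, with the rest handled deterministically.

```python
def extraction_cost_usd(pages, ai_share, ai_per_million, deterministic_per_million):
    """Blended extraction cost when a fraction of pages goes through AI extraction.

    ai_share is the fraction of pages (0.0 to 1.0) sent to the AI tier; modeling
    the hybrid as a per-page split is a simplifying assumption.
    """
    if not 0.0 <= ai_share <= 1.0:
        raise ValueError("ai_share must be between 0 and 1")
    ai_pages = pages * ai_share
    deterministic_pages = pages - ai_pages
    total = ai_pages * ai_per_million + deterministic_pages * deterministic_per_million
    return total / 1_000_000
```

For 10M pages at the low end of the AI range ($666 per 1M) and the high end of the deterministic range ($5 per 1M), routing 10% of pages through AI costs about $711, against $6,660 for all-AI and $50 for all-deterministic.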

For the AI tier specifically (layout-resilient extraction of text, categories, descriptions, and other non-numerical fields), tools like Decodo's AI Parser generate structured JSON from any HTML using natural-language prompts, without writing selectors. Teams building agentic workflows can also plug Decodo into their LLM stack via the LangChain integration, so agents can trigger scrapes on their own. For more on using LLMs for extraction, see our guide to ChatGPT-based web scraping.

Data pipeline

The gap between "scraper returned HTML" and "data is in the warehouse, ready to use" is bigger than most teams expect. Closing that gap is what the pipeline does: deduplication, schema enforcement, format standardization, and storage. Stream-oriented architectures (Kafka, Kinesis) are replacing batch processing for time-sensitive use cases.

The pipeline also needs lineage tracking. When something breaks, you need to trace the failure back to its origin and trigger automatic backfills.
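In its simplest form, lineage is metadata stored next to every record, so that a bad parser release or a failed job can be turned into a list of URLs to re-fetch. The field names and the urls_to_backfill helper below are hypothetical, and real pipelines usually keep this in the warehouse or a metadata service rather than in Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lineage:
    """Where a record came from (illustrative field set)."""
    source_url: str
    job_id: str
    parser_version: str
    fetched_at: str  # ISO 8601 timestamp


def urls_to_backfill(records, bad_parser_version):
    """Return the distinct source URLs produced by a known-bad parser version.

    records is an iterable of (data, Lineage) pairs.
    """
    return sorted(
        {lineage.source_url for _, lineage in records
         if lineage.parser_version == bad_parser_version}
    )
```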

Monitoring

Without observability, scraper issues can go undetected for days, gradually degrading data quality. Your monitoring layer should track success rates, response times, block rates, and data completeness, all in real time.

Monitor shifts in median or p95 latency, not just hard failures.
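A minimal sketch of that check, assuming latencies are collected in milliseconds per target and compared against a baseline window. It uses the nearest-rank percentile method, and the 1.5x tolerance is an arbitrary illustrative threshold; both function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_drifted(baseline, recent, pct=95, tolerance=1.5):
    """True when the recent percentile exceeds the baseline percentile by the tolerance factor."""
    return percentile(recent, pct) > tolerance * percentile(baseline, pct)
```

If 10% of recent requests slow from 100 ms to 400 ms, the median stays at 100 ms while p95 quadruples, which is why tracking only the median can miss the early stages of throttling.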

Handling dynamic and JavaScript-heavy websites

Most websites now render content client-side. For these sites, a plain HTTP GET returns a near-empty shell.

Your scraper receives no useful data unless it executes JavaScript. The signals that a page requires rendering: empty or minimal HTML responses, AJAX/XHR calls that load data after page load, content that appears only after scrolling (infinite scroll), and pages protected by JavaScript challenges.
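The first of those signals, an empty or minimal HTML response, can be checked cheaply before any browser is involved. The sketch below counts visible text after discarding script and style content, using only the standard library; the function name and the 200-character threshold are assumptions that would need tuning per target.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that is not inside script, style, noscript, or template tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_like_empty_shell(html, min_chars=200):
    """Heuristic: True when the response carries almost no visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars
```

A positive result only suggests that rendering may be needed. The data might still be reachable through an XHR endpoint or an embedded JSON block, both of which are cheaper than a browser.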

Headless browsers and when to skip them

As of early 2026, Playwright is the most popular choice for scraping dynamic content, having overtaken Puppeteer and Selenium in scraping-specific workflows. Its multi-browser support (Chromium, Firefox, WebKit), built-in auto-wait mechanisms, and network interception API make it well-suited for scraping.

Selective rendering cuts the most cost: avoid headless browser overhead on pages that don't require it. Before spinning up a headless instance, check whether the data you need is available in an API response, an embedded JSON-LD block, or server-rendered HTML. Many sites include structured data in <script type="application/ld+json"> tags that's machine-readable and far cheaper to extract.
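Pulling those JSON-LD blocks out of server-rendered HTML needs nothing beyond the standard library. The following is a minimal sketch: extract_json_ld is a hypothetical helper name, and blocks that fail to parse as JSON are skipped silently, which a production parser would more likely log.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Buffers the text content of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def extract_json_ld(html):
    """Return the parsed JSON-LD objects found in an HTML document."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    records = []
    for raw in collector.blocks:
        try:
            records.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # malformed block: skip it rather than fail the whole page
    return records
```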

For pages that genuinely require rendering, manage your browser fleet like any other compute resource: set memory limits, recycle contexts after each job, and scale rendering pods independently of your crawl workers.

Production teams now use a hybrid scraping pattern: launch a headless browser only to pass the initial JavaScript challenge, extract the session cookie it generates, then immediately kill the browser and hand that cookie to a lightweight HTTP client (like rnet or curl_cffi) for all subsequent requests. This cuts RAM usage and response times while still passing challenge-based defenses.
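The control flow of that pattern is sketched below with the browser and the HTTP client injected as plain callables, so that nothing here depends on a specific library's API. Both callables are hypothetical: solve_challenge stands for "launch a headless browser, pass the challenge, read the cookies, close the browser", and http_get stands for a call into a lightweight client. Treating a 403 as the sign of an expired cookie is also an assumption; targets signal this in different ways.

```python
class HybridSession:
    """Reuse a browser-issued cookie across cheap HTTP requests (illustrative only)."""

    def __init__(self, solve_challenge, http_get, challenge_status=403):
        # solve_challenge(url) -> dict of cookies (hypothetical browser step).
        # http_get(url, cookies) -> (status_code, body) (hypothetical HTTP client step).
        self._solve = solve_challenge
        self._http_get = http_get
        self._challenge_status = challenge_status
        self._cookies = None
        self.browser_launches = 0

    def _refresh(self, url):
        self._cookies = self._solve(url)
        self.browser_launches += 1

    def get(self, url):
        if self._cookies is None:
            self._refresh(url)  # the first request pays for one browser launch
        status, body = self._http_get(url, self._cookies)
        if status == self._challenge_status:
            # Cookie probably expired or was revoked: solve again and retry once.
            self._refresh(url)
            status, body = self._http_get(url, self._cookies)
        return status, body
```

Counting browser launches makes the saving measurable: the ratio of launches to total requests is the number to watch when tuning how long a cookie can be reused.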

We cover the rendering decision in more detail in our guides to scraping dynamic content, JavaScript web scraping, and Playwright-based web scraping.

Anti-bot systems and why scale triggers them

Scale is one of the strongest signals to anti-bot systems. A human browsing a site makes maybe 50-100 requests per session. A production scraper makes 50K. Even with perfect headers and rotating proxies, the statistical footprint of automated traffic is different from organic behavior.

Automated traffic now accounts for 51% of all web traffic, surpassing human activity (Imperva). AI crawlers alone generate more than 50B requests to Cloudflare's network every day. And AI crawlers are now among the user agents most often fully blocked in robots.txt files (Known Agents, formerly Dark Visitors).

Defenders now respond with full automation. Two days of unblocking efforts used to give two weeks of access... Now, it's become the other way around.

Anti-bot systems now reconfigure their detection mechanisms continuously. One vendor deployed more than 25 version changes over a 10-month period, often releasing updates multiple times per week, with ML models that adapt in as little as a few minutes. You can't keep up manually at that pace.

Rate-based detection

This is the simplest detection layer. If you're sending requests to a domain faster than any human could browse it, you're flagged. Modern systems like Cloudflare's Bot Management don't just count requests per IP; as of 2025-2026, they've introduced per-customer defense systems that use ML models automatically tuned to each website's specific traffic patterns.

The rate limit threshold isn't static. It adapts based on the site's normal traffic baseline.
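Because the threshold is unknown and moving, the scraper has to discover it from responses rather than hard-code it. One common approach, borrowed from TCP congestion control and not prescribed by the source, is additive increase and multiplicative decrease: raise concurrency by one on success and halve it on a throttling response. The class name, the status codes treated as throttling, and the limits below are illustrative assumptions.

```python
class AdaptiveConcurrency:
    """Additive-increase, multiplicative-decrease concurrency limit per target domain."""

    THROTTLE_STATUSES = (429, 503)  # assumption: these indicate rate limiting

    def __init__(self, start=4, minimum=1, maximum=64):
        self.limit = start
        self._minimum = minimum
        self._maximum = maximum

    def record(self, status_code):
        """Update the limit from one response and return the new value."""
        if status_code in self.THROTTLE_STATUSES:
            # Back off sharply: halve the limit, but never go below the floor.
            self.limit = max(self._minimum, self.limit // 2)
        elif 200 <= status_code < 300:
            # Probe upward slowly while the target keeps answering normally.
            self.limit = min(self._maximum, self.limit + 1)
        return self.limit
```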

IP reputation

Major anti-bot providers keep a reputation score for every IP, tiered by network type: datacenter IPs get the lowest trust, residential IPs get high trust, and mobile IPs get the highest, because Carrier-Grade NAT shares each address across many real users.

IP reputation alone isn't enough. No single detection signal works in isolation; modern anti-bot systems layer fingerprinting, behavioral analysis, and IP reputation together, and proxies alone don't determine the outcome.

But IP reputation still matters. A datacenter IP can trigger a block before other detection layers even evaluate the request. Clean residential IPs are the prerequisite: the proxy investment serves one purpose, clearing the first filter so that your fingerprint and behavioral signals get evaluated at all.

Browser fingerprinting

Anti-bot systems analyze hundreds of browser characteristics, from canvas fingerprints and WebGL renderers to the order of HTTP/2 header frames, to build a unique device signature.

At the network layer, TLS fingerprinting via JA3 hashes has been a reliable detection signal for years. JA3's successor, JA4, is now a top concern for its ability to fingerprint clients with even greater precision across TLS, HTTP, and other protocols.

Standard automation tools like Selenium or Puppeteer carry distinct fingerprints that differ from real browsers. Tools like Camoufox, SeleniumBase UC Mode, and Playwright with stealth plugins patch many of these detectable signs, but stealth tools and detection systems keep evolving against each other.

Evasion now extends beyond patching JavaScript APIs (the earlier Puppeteer-extra-stealth approach) to addressing automation protocols at different levels. The Chrome DevTools Protocol (CDP) is a key detection vector. Tools like Patchright and rebrowser-patches fix specific CDP leaks (such as the detectable Runtime.enable command) while still using CDP as their core communication layer. Pydoll and Selenium Driverless eliminate the WebDriver layer entirely and connect to Chrome via CDP directly, removing driver-related detection signals. Nodriver takes this further by minimizing CDP usage itself, avoiding the protocol patterns that anti-bot systems flag. BotBrowser ships a custom browser built with modified internals.

At the HTTP level, curl_cffi addresses TLS fingerprinting without a browser. It impersonates real browser TLS/JA3 and HTTP/2 fingerprints at the network layer, so automated HTTP requests are much harder to tell apart from browser traffic. This is useful for static pages where a headless browser is unnecessary overhead.

Commercial antidetect browsers have also added automation hooks, making advanced evasion more accessible. Decodo's X Browser is a free option included with any proxy subscription, useful for quick manual testing and multi-profile validation before committing to a full antidetect platform.

Behavioral analysis

Of all detection layers, behavior-based systems are the hardest to beat. Simple scraping setups fail here. Systems like DataDome and HUMAN Security analyze interaction patterns throughout the session. Real humans move mice in erratic, non-linear paths; bots produce straight lines or no movement at all. Humans scroll in bursts with pauses; bots scroll at constant rates.

Human inter-click intervals are irregular; bots cluster at uniform intervals. Humans pause on content; bots don't pause. These signals combine with fingerprinting to build a real-time trust score that adjusts continuously. A session that starts trusted can be flagged mid-way if behavioral patterns degrade.

Even fingerprint-perfect browsers with no behavioral simulation are still detected. Production scrapers targeting behavioral-analysis-protected sites need to inject realistic interaction noise: randomized mouse paths, variable scroll timing, simulated reading pauses. Several commercial antidetect browsers now bundle behavioral simulation, but detection evasion rates vary widely across vendors.

For high-value targets, teams build custom behavioral profiles matched to each target site's interaction patterns. This requires per-target calibration and ongoing maintenance as sites update their models.

Some APIs handle behavioral evasion for you. Decodo's Site Unblocker, for example, integrates as a proxy endpoint that manages fingerprint rotation, JavaScript rendering, and anti-bot challenges per request. From the caller's side, it's a normal HTTP call that returns rendered HTML.

Machine identity

Beyond fingerprinting, bot mitigation systems now distinguish between verified bots (search engine crawlers), AI bots (training crawlers, search agents, user-action agents), and unverified scrapers. Whether your scraper can present a verifiable machine identity now determines whether you get through on major platforms.

"Know Your Agent" initiatives are spreading. In 2026, unsigned or unverifiable agents already receive heightened scrutiny from major anti-bot providers. Verified or attested bots get preferential routing; unverified agents get more friction. For scraping engineers, crawler design is shifting. Instead of just hiding, scrapers increasingly need to identify themselves and state what they're doing.

Teams building AI agent workflows can connect through infrastructure like Decodo's MCP Server, which gives AI agents and LLMs web access through managed proxies. It covers anti-bot evasion, rendering, and structured output. For more on the ecosystem, see top MCP servers for AI workflows.

Common defenses

Once anti-bot systems flag your traffic, they need to act on it. These are the most common responses you will run into at scale:

CAPTCHAs. Cloudflare Turnstile is replacing traditional image CAPTCHAs in 2025-2026. Unlike reCAPTCHA, Turnstile often runs invisibly in the background using fingerprinting and cryptographic proof-of-work.

JavaScript challenges. Inline scripts that test for browser capabilities, execute timing-based checks, or inject hidden elements that only bots would interact with.

Data cloaking. Instead of blocking scrapers outright, systems like Cloudflare's AI Labyrinth (launched March 2025, available even on free-tier plans) return AI-generated fake pages with plausible-looking content, introducing synthetic data into your results rather than denying access.

Your scraper still receives a 200 OK response, valid HTML, and realistic-looking data, but the content is entirely synthetic. This changes what "success" means. Instead of just checking for block pages, extraction pipelines now need to verify that extracted data is real. We cover detection and mitigation strategies in our guide to bypassing AI Labyrinth.
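One possible verification step, offered here as an assumption rather than as the method from the linked guide, is a canary check: keep a small set of URLs whose true values you know from a trusted source, scrape them alongside the real workload, and alert when the extracted fields disagree. The canary_failures helper below is hypothetical.

```python
def canary_failures(extracted, known_good):
    """Return canary URLs whose extracted fields disagree with trusted values.

    extracted:  dict of url -> dict of extracted fields from the current run
    known_good: dict of url -> dict of fields verified through a trusted channel
    """
    failures = []
    for url, expected in known_good.items():
        actual = extracted.get(url)
        # A missing canary or any mismatched field counts as a failure.
        if actual is None or any(actual.get(field) != value
                                 for field, value in expected.items()):
            failures.append(url)
    return failures
```

Canaries only work on fields that are stable between verifications, such as a product name or SKU, so volatile fields like price are poor candidates.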

Variable protection within a single site

Anti-bot protection now varies within the same website. High-demand pages that scrapers target most get aggressive protection, while lower-value listing pages stay lightly protected. Indeed.com, for example, increased protection on individual job posts while leaving job listing pages at lower security.

Sites now restrict internal API endpoints (particularly GraphQL) separately. Non-standard URL formats (e.g., /product/123456/reviews vs. the full SEO-friendly URL) trigger higher security settings. Sites don't need to stop you. They only need to make scraping expensive enough to be uneconomical.

You may need different proxy strategies, rendering approaches, and cost profiles for different page types on the same domain, and your orchestrator needs to route based on page type.

Most teams start with URL-pattern-based classification (listing pages get datacenter proxies, detail pages get residential) and then refine based on observed response behavior. If a URL pattern starts returning elevated CAPTCHA rates, the system automatically reclassifies it to a higher proxy tier.

This works as a manually curated config at dozens of domains; at hundreds, it needs to be automated, with the orchestrator learning protection levels from its own success/failure metrics per URL pattern.
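The classify-then-escalate loop described above can be sketched in a few lines. Everything named here is a hypothetical illustration: the TierRouter class, the three-tier ordering, the 20% CAPTCHA threshold, and the minimum sample size are assumptions, and a real orchestrator would also want de-escalation and persistence.

```python
import re

TIERS = ["datacenter", "residential", "mobile"]  # cheapest first (assumed ordering)


class TierRouter:
    """Routes URLs to a proxy tier by pattern and escalates on high CAPTCHA rates."""

    def __init__(self, rules, captcha_threshold=0.2, min_samples=20):
        # rules: ordered (regex, starting tier) pairs; the first matching rule wins.
        self._rules = [[re.compile(pattern), tier] for pattern, tier in rules]
        self._stats = [[0, 0] for _ in rules]  # per rule: [requests, captchas]
        self._threshold = captcha_threshold
        self._min_samples = min_samples

    def _match(self, url):
        for index, (regex, _) in enumerate(self._rules):
            if regex.search(url):
                return index
        return None

    def tier_for(self, url):
        index = self._match(url)
        return TIERS[0] if index is None else self._rules[index][1]

    def record(self, url, got_captcha):
        """Feed back one outcome; bump the pattern's tier when CAPTCHAs pile up."""
        index = self._match(url)
        if index is None:
            return
        stats = self._stats[index]
        stats[0] += 1
        stats[1] += int(got_captcha)
        requests, captchas = stats
        if requests >= self._min_samples and captchas / requests > self._threshold:
            current = TIERS.index(self._rules[index][1])
            self._rules[index][1] = TIERS[min(current + 1, len(TIERS) - 1)]
            self._stats[index] = [0, 0]  # start a fresh window on the new tier
```

Resetting the counters after an escalation keeps failures observed on the old tier from immediately pushing the pattern up another level.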

For a deeper look at how these detection layers interact, see our anti-bot systems guide and CAPTCHA bypass guide.

Proxy management at scale

Proxy management is often the most expensive layer, and the one with the most impact in web scraping at scale. Selecting a provider and configuring credentials is the easy part. The hard part is building a routing layer that selects the right proxy type for each request, monitors IP health in real time, and adapts to target-site defenses dynamically.

A large, diverse proxy pool is the foundation – thin pools mean repeated IPs, detectable rotation patterns, and faster bans. But at production scale, how the pool is used matters as much as its size. The best results come when your provider's infrastructure combines deep IP coverage with intelligent routing: selecting the right proxy type, location, and rotation strategy per request, based on each target's defense profile. You can build this routing layer yourself on top of raw proxies, but most teams at scale let the provider handle it.

Rotation vs. sticky sessions

Rotating proxies assign a new IP for every request, ideal for stateless scraping like product catalog harvesting or SERP collection. Sticky sessions maintain the same IP for a defined duration (typically 10 minutes to 24 hours), essential for multi-step workflows: login flows, cart interactions, paginated results where the server validates session cookies. For the implementation details, see how to get a new IP for every connection.
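The difference between the two modes comes down to whether the proxy choice is keyed by a session. The sketch below is a simplified in-process model with a hypothetical ProxySelector class and placeholder proxy names; in practice most providers implement stickiness on their side through a session parameter on the gateway.

```python
import itertools
import time


class ProxySelector:
    """Round-robin rotation plus time-limited sticky sessions (illustrative model)."""

    def __init__(self, proxies, sticky_ttl_seconds=600, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._ttl = sticky_ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._sticky = {}    # session_id -> (proxy, expires_at)

    def rotating(self):
        """Stateless work: hand out the next proxy on every call."""
        return next(self._cycle)

    def sticky(self, session_id):
        """Multi-step work: keep one proxy per session until the TTL runs out."""
        now = self._clock()
        entry = self._sticky.get(session_id)
        if entry is None or entry[1] <= now:
            entry = (next(self._cycle), now + self._ttl)
            self._sticky[session_id] = entry
        return entry[0]
```

A login flow would call sticky with the same session ID for every step, while catalog harvesting would call rotating for each request.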