Web Scraping Without Getting Blocked: A Practical Guide for 2026

Web scraping without getting blocked is one of the hardest challenges you might face. Whether you’re a business conducting market research or a solopreneur working on your next big thing, most scrapers fail not because the code is wrong, but because websites now run layered detection that flags bots before a single byte of HTML is returned. This guide breaks down all the detection layers, including network, TLS, browser, and behavioral, and delivers the best techniques on how to overcome each.

Benediktas Kazlauskas

Last updated: Apr 23, 2026

12 min read

Code panel showing HTML request beside 'Proxies enabled' and 'Your data is ready!' cards on dark gradient background

TL;DR

Modern anti-bot systems are multi-layered and continuously updated. Here are the key takeaways:

Modern websites block scrapers across 4 layers, network, TLS, browser, and behavioral
Residential proxies defeat IP-level blocks_,_ datacenter IPs are flagged by default on serious targets
Header consistency and TLS fingerprint matching address the majority of detections before any browser emulation is needed
Headless browsers should be a last resort, not a starting point – they're slow and resource-intensive
ReCAPTCHA v3 is a risk scoring system, and good session hygiene can keep scores high without CAPTCHA solvers
Always check robots.txt and Terms of Service before scraping, and avoid collecting personal data without a legal basis

How websites detect scrapers: the four-layer model

Before covering any technique, it's worth building a clear mental map of how detection actually works. Most competitor articles skip straight to tips. Understanding the detection architecture first helps explain why each countermeasure is necessary – and which layer to debug when your scraper gets blocked.

Layer #1: Network

The first check happens at the IP level, before the server reads a single header. Anti-bot systems query IP reputation databases to score incoming requests. Autonomous System Numbers (ASNs) belonging to cloud providers – AWS, GCP, Azure – are blocklisted by default on high-security sites because legitimate end users almost never browse from datacenter ranges. Request volume thresholds also trigger blocks at this layer, regardless of how well-crafted the request looks.

Layer #2: TLS

The TLS fingerprint is emitted during the TLS handshake, before any HTTP header is transmitted. The ClientHello message – used to negotiate the encryption session – contains a unique fingerprint known as a JA3 or JA4 hash. Python's requests library, curl, and other non-browser HTTP clients each produce recognizable JA3 values that anti-bot systems identify in milliseconds. A scraper with a perfect Chrome User-Agent can still be blocked at this layer because its TLS handshake signature doesn't match Chrome's.

Layer #3: Browser

Once a request passes the network and TLS checks, JavaScript challenges probe the browser environment. Key signals include the exposure of the navigator.webdriver (set to true in automation contexts), canvas and WebGL rendering signatures that identify GPU configurations, AudioContext behavior, font enumeration, plugin lists, and screen resolution anomalies. Real browsers leave consistent, device-specific fingerprints. Headless browsers don't – unless specifically patched.

Layer #4: Behavioral

Behavioral analysis operates at the session level. Systems like PerimeterX/HUMAN track inter-request timing patterns, scroll depth, mouse trajectory, navigation linearity, and session duration across an entire browsing session. A scraper that requests pages at perfectly uniform 2-second intervals may pass all other checks and still be flagged here.

The key implication – a scraper blocked despite good proxies and realistic headers is likely failing at the TLS or browser layer, not the network layer. That difference shapes which to apply first.

IP rotation and proxy usage

The network layer is the first line of defense. Getting past it requires understanding not just proxy types, but how IP reputation scoring works and how rotation strategy affects detection risk.

Why a single IP fails

A single IP accumulates rate-limit counters and reputation scores with every request it makes to a target domain. Datacenter IP ranges from AWS, GCP, and Azure are on prebuilt blocklists used by anti-bot vendors. A request from an AWS IP to a high-security target is often blocked before the server reads any content. Even residential IPs can accumulate a negative reputation from previous scraping activity by other users sharing the same address, so you need to choose a reputable provider that carefully checks the IPs’ quality.

Proxy type trade-offs

Not all proxies carry the same trust score with anti-bot systems. The main types:

Datacenter proxies are fast and cheap, but ASN ranges are trivially identifiable. Use only for low-security targets
Residential proxies come from real devices connected to local networks, so the ASN is indistinguishable from organic traffic, making the detection risk significantly lower
Static residential (ISP) proxies originate from ISPs but have high stability. They’re useful for account-based scraping that requires a consistent identity across sessions
Mobile proxies operate through carrier NAT (CGNAT), where many real users share a single public IP. These carry the highest trust scores but come at a premium

There are also solutions like Site Unblocker that combine multiple proxy types with built-in fingerprint management, automatic retry logic, and CAPTCHA solving – waving away the complexity so you don't have to manage rotation manually.

A failure mode competitors rarely discuss – shared residential proxy pools can contain previously flagged IPs. Pool size and IP freshness matter as much as proxy type. Smaller, exhausted pools increase the chance of cycling onto a compromised address.

Implementing rotation in Python

Rotation through a residential proxy endpoint handles most scenarios. The Decodo gateway rotates the exit IP automatically on each request, so no client-side cycle pool is needed.

Install requests first with pip install requests. Explore our Python Requests guide for a step-by-step guide for this part. When a target requires a consistent identity across paginated requests, append a session ID to the username to hold the same IP across that flow.

import requests
 
proxy_username = "YOUR_PROXY_USERNAME"
proxy_password = "YOUR_PROXY_PASSWORD"
 
proxies = {
    "http": f"http://{proxy_username}:{proxy_password}@gate.decodo.com:7000",
    "https": f"http://{proxy_username}:{proxy_password}@gate.decodo.com:7000",
}
 
response = requests.get("https://ip.decodo.com/json", proxies=proxies)
print(response.json())

For sticky sessions, append a session identifier to the username. The same session ID returns the same exit IP for up to 10 minutes:

sticky_user = f"{proxy_username}-session-abc123"
proxies = {
    "http": f"http://{sticky_user}:{proxy_password}@gate.decodo.com:7000",
    "https": f"http://{sticky_user}:{proxy_password}@gate.decodo.com:7000",
}

Geographic context matching is another detection vector that's easy to overlook. Routing through US residential IPs while sending a mismatched Accept-Language: fr-FR header is a detectable inconsistency. Always match the language and locale headers to the proxy country.

At the point where plain datacenter proxies are blocked on targets like Product Hunt or similar high-security sites, Decodo’s residential proxies provide geo-targeted rotation across a large ethically-sourced pool, significantly reducing the chance of cycling onto flagged IPs.

User-agent and request header management

After the IP check, the next line of inspection is header consistency. Default HTTP client headers are immediately identifiable, and a mismatched header set can expose a scraper even if the IP looks clean.

What default headers reveal

A default requests call sends a User-Agent header of python-requests/2.x.x, which is flagged on sight. Even after replacing the User-Agent with a Chrome string, most scrapers omit the full header set that Chrome sends, and the absence of those headers is itself a fingerprint.

The full Chrome header set

A realistic Chrome request includes these headers, in the correct order (header order is itself part of the fingerprint):

User-Agent: a current Chrome version string, matched to the correct OS
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: matched to the proxy country, e.g., en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Sec-Fetch-* headers: Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site, Sec-Fetch-User
Upgrade-Insecure-Requests: 1
Referer: a plausible upstream URL, such as a Google search results page for entry to deep URLs

A Chrome UA paired with a macOS Sec-CH-UA value is internally consistent. A Chrome UA paired with a Windows Sec-CH-UA is a mismatch. Internal header consistency matters as much as the individual header values.

Practical code: Building a realistic header dict

Use httpbin.org/headers to verify exactly what your scraper is sending. The endpoint echoes the request headers back as JSON, which makes it easy to confirm the set is complete before targeting a real site.

import requests
 
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "Referer": "https://www.google.com/",
}
 
response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.json())

Run the script with python check_headers.py. The output shows what the target server receives:

{
    "headers": {
        "Accept": "text/html,application/xhtml+xml,...",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)..."
    }
}

Rotate User-Agents across realistic OS/browser combinations. A Chrome 2021 UA string in 2026 is a red flag in itself. Keep the pool current by checking the latest Chrome release versions.

Request timing and behavior randomization

Timing is a behavioral signal. A scraper that fires requests at exactly uniform intervals is statistically distinguishable from human browsing, even if every other signal looks clean.

Why fixed delays fail

Statistical analysis of inter-request timing over a full session exposes mechanical patterns. A human browsing a site will spend varying amounts of time reading each page. Delays cluster around a realistic mean but occasionally run long when the content is dense. A uniform delay of exactly 2 seconds on every request produces a variance of zero, which is statistically impossible for organic traffic.

Gaussian delays vs. uniform random

A Gaussian (normal) distribution centered around a realistic mean, say, 3 to 5 seconds, produces human-like variance where most delays cluster near the mean but occasionally run longer. Uniform random delays still produce a recognizable distribution. Gaussian is closer to actual human reading patterns.

import numpy as np
import time
 
def human_delay(mean=4.0, std=1.5, minimum=1.0):
    delay = max(np.random.normal(loc=mean, scale=std), minimum)
    time.sleep(delay)
 
human_delay()

Intra-page behavior with Playwright

When using a headless browser, simulate realistic intra-page behavior before extracting data. Scroll the page incrementally, move the mouse to interactive elements before clicking, and vary the entry point of each crawl rather than navigating directly to the target URL every time. Install Playwright with pip install playwright && playwright install chromium – the Playwright web scraping guide covers setup in detail.

from playwright.sync_api import sync_playwright
import random
import time
 
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/")
 
    for y in range(0, 3000, random.randint(200, 400)):
        page.evaluate(f"window.scrollTo(0, {y})")
        time.sleep(random.uniform(0.2, 0.7))
 
    quotes = page.locator(".quote .text").all_text_contents()
    print(quotes[:3])
    browser.close()

Run the script with python scroll_scrape.py. Sample output shows the first 3 extracted quotes from the page:

['“The world as we have created it is a process of our thinking...”',
 '“It is our choices, Harry, that show what we truly are...”',
 '“There are only two ways to live your life...”']

Off-peak crawling is a supplementary strategy: for non-time-sensitive data, scheduling crawls during midnight local time to the server reduces competition for shared resources and often means lower detection thresholds.

Headless browsers and browser fingerprinting

When a target runs JavaScript challenges, plain HTTP clients fail. A real browser engine is required. However, default headless Chrome leaks identifying signals that anti-bot systems detect within the first script execution cycle.

What default headless Chrome leaks

Out of the box, headless Chrome exposes:

navigator.webdriver = true: readable _b_y any JavaScript running on the page
HeadlessChrome in the User-Agent string
Missing chrome.runtime object, which real Chrome always exposes
Abnormal screen dimensions (0x0 in some configurations)
Empty plugin list, while real browsers expose several entries

playwright-stealth: what it patches and what it misses

The playwright-stealth library patches the most commonly probed signals: navigator.webdriver, the UA string, chrome.runtime, and plugin lists. This defeats most automated detection checks. Advanced fingerprinting systems that compare canvas rendering output, the WebGL GL_RENDERER string, and AudioContext oscillator signatures against device baselines can still identify patched headless browsers. These are signals that playwright-stealth doesn't address by default.

The full fingerprint surface

Signals that advanced systems probe beyond the basics:

Canvas rendering output_:_ different GPU hardware renders identical canvases with subtle pixel-level differences
WebGL GL_RENDERER string_:_ identifies the specific GPU, which in headless environments often reads as SwiftShader or llvmpipe (software rendering)
AudioContext oscillator behavior: hardware-dependent acoustic fingerprinting
Font enumeration: real devices have a specific set of installed fonts that varies by OS and configuration
Screen color depth and pixel ratio: headless environments default to values that don't match common monitor configurations

Camoufox as an alternative

Camoufox is a Firefox-based headless browser tool that patches fingerprinting signals at a lower level than Playwright stealth plugins. It's compatible with the Playwright API, which means existing Playwright web scraping code can use it with minimal changes. Its key advantage is targeting CreepJS-class detection – a class of browser fingerprinting that Chromium-based tools struggle to defeat because Chrome's headless mode shares fingerprinting characteristics with its non-headless counterpart at the engine level.

When to use headless browsers and when not to

Headless browsers are slower and memory-intensive. Running more than 20 concurrent instances is impractical on most infrastructure. Plain HTTP clients with a clean TLS stack, using tools like curl_cffi or httpx with Chrome-impersonation mode, solve more detection cases than developers assume, at a fraction of the resource cost. Reach for a headless browser only when JavaScript rendering is genuinely required to load the target data.

CAPTCHA and anti-bot evasion

When collecting publicly available data, always remember – CAPTCHAs are a checkpoint, not a wall. Understanding the architecture of each CAPTCHA type determines the most efficient resolution path, and in many cases, good session hygiene avoids the CAPTCHA entirely.

CAPTCHA taxonomy

The 4 main CAPTCHA architectures work differently and require different approaches:

reCAPTCHA v3: a silent risk scoring system with no visual challenge. A score between 0.0 (bot) and 1.0 (human) is calculated from session signals. Good headers, realistic timing, and a residential IP keep the score above the threshold without any solver intervention
reCAPTCHA v2: the familiar image grid challenge, triggered when v3 scores are low or as a standalone check. Requires a CAPTCHA solver service
hCaptcha: similar to reCAPTCHA v2 in architecture. Image selection challenges are compatible with most solver services
Cloudflare Turnstile: a JavaScript probe that requires full browser execution. Plain HTTP clients cannot pass it. Switching to a headless browser or a managed solution is the only viable path

The reCAPTCHA v3 insight most articles miss

ReCAPTCHA v3 presents no visual challenge to solve. The score is a function of how human the entire session looks – headers, timing, IP reputation, navigation patterns. A scraper with residential proxies, realistic headers, and Gaussian timing can often maintain a score above 0.7 and avoid triggering any challenge at all. This is the cheapest and most scalable CAPTCHA solution available.

Automated CAPTCHA solver integration

When a visual CAPTCHA is unavoidable, third-party services like 2Captcha receive the CAPTCHA token, return a solved response, and charge per solution (typically $1 to $3 per 1,000 solves). The returned token is injected into the form before submission. Latency typically runs 10 to 30 seconds per solve – acceptable for low-volume scraping, expensive for high-volume pipelines.

import requests
import time
 
API_KEY = "YOUR_2CAPTCHA_API_KEY"
SITE_KEY = "6LeIxAcTAAAAAJcZVRqyHh71UMIEGNQ_MXjiZKhI"
PAGE_URL = "https://www.google.com/recaptcha/api2/demo"
 
task = requests.post("http://2captcha.com/in.php", data={
    "key": API_KEY,
    "method": "userrecaptcha",
    "googlekey": SITE_KEY,
    "pageurl": PAGE_URL,
    "json": 1,
}).json()

token = None
max_retries = 12
 
for _ in range(max_retries):
    time.sleep(5)
    result = requests.get(
        f"http://2captcha.com/res.php?key={API_KEY}"
        f"&action=get&id={task['request']}&json=1"
    ).json()
    if result["status"] == 1:
        token = result["request"]
        break
 
print(token)

The returned token is the value injected into the g-recaptcha-response form field before submission. The Google demo URL above is a live endpoint for testing solver integration end-to-end.

When CAPTCHA volume exceeds the economics of per-solve pricing, Decodo’s Site Unblocker abstracts CAPTCHA solving, TLS fingerprint matching, and anti-bot evasion into a single endpoint, removing per-solve overhead entirely.

Honeypot and trap avoidance

Honeypot links are traps designed to catch automated crawlers that follow every link without checking visibility. Following a honeypot immediately triggers a block or flags the IP. A comprehensive honeypot taxonomy goes well beyond the CSS-hidden links most articles describe.

Types of honeypot traps

The main honeypot patterns in production use:

CSS-hidden links: display: none, visibility: hidden, opacity: 0, or positioned off-screen. Checking computed styles via Playwright's page.is_visible() is more reliable than checking raw attributes
Color-camouflaged links: link text set to the page background color (white-on-white). Present in the DOM and followable, but invisible to human users
Hidden form fields: an input field that a human never sees or fills in, but an automated form submitter will. Touching it triggers an immediate block. Any input with display: none or a name like website, url, or phone in a non-contact-form context should be skipped
Canary tokens and tarpit pages: links that resolve to monitored endpoints, or pages that serve artificially slow responses to drain bot resources. Compare link href patterns to the site's known URL structure before following

Practical filter: is_safe_link()

A reusable function that checks computed visibility, color contrast, and href relevance to the crawl scope before following any link:

from urllib.parse import urlparse
 
def is_safe_link(page, element, base_domain):
    if not element.is_visible():
        return False
 
    href = element.get_attribute("href") or ""
    if not href or href.startswith("#"):
        return False
 
    parsed = urlparse(href)
    if parsed.netloc and parsed.netloc != base_domain:
        return False
 
    color = page.evaluate("el => getComputedStyle(el).color", element)
    bg = page.evaluate("el => getComputedStyle(el).backgroundColor", element)
 
    return color != bg

Pair this filter with a check against robots.txt disallow patterns to skip paths the site has marked off-limits.

Identifying and respecting rate limits

Rate limiting is distinct from being blocked. It's a server-side signal sent before a harder block occurs. Most guides say to add delays without explaining how to read the rate-limit signals the server is broadcasting in its response headers.

Reading rate-limit headers

The HTTP standard provides several rate-limit response headers that tell a scraper exactly how to behave:

Retry-After: the number of seconds to wait before retrying. Respect this value before attempting exponential backoff
X-RateLimit-Limit: the total number of requests allowed in the current window
X-RateLimit-Remaining: requests left in the current window
X-RateLimit-Reset: Unix timestamp when the window resets
RateLimit-* headers (RFC 6585 draft standard): a newer standardized format increasingly adopted by major APIs

Soft blocks disguised as 200 OK

Some sites return an HTTP 200 with a CAPTCHA page or an empty body instead of a 429. Checking the status code alone isn't enough. Always validate the response content by checking for expected selectors or a minimum content length before treating a response as successful.

Exponential backoff with jitter

import requests
import time
import random
 
def fetch_with_backoff(url, headers, proxies, max_retries=5):
    wait = 2
    for _ in range(max_retries):
        response = requests.get(url, headers=headers, proxies=proxies, timeout=15)
 
        if response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", wait))
            time.sleep(retry_after + random.uniform(0, 1))
            wait *= 2
            continue
 
        if response.status_code == 200 and len(response.text) > 500:
            return response
 
    return None

Polite scraping is also a practical operational stance, not just an ethical one. Staying well within observed rate limits and citing the Crawl-Delay value in robots.txt directly reduces block rates, which lowers maintenance overhead.

Reverse engineering and anti-bot system analysis

This is the most advanced diagnostic tool in a scraper developer's kit. Reverse engineering anti-bot JavaScript isn't about circumventing every site – it's about diagnosing why a specific scraper is being detected. The methodology is systematic and analytical.

Identifying which vendor is protecting the target

Response headers and cookies reveal the anti-bot vendor:

cf-ray header: Cloudflare
x-datadome-requestid header: DataDome
_abck cookie: Akamai Bot Manager
_px cookie: PerimeterX/HUMAN

JavaScript injection patterns in the page source also indicate the vendor. Each vendor's script has a recognizable structure under browser DevTools.

Reading obfuscated anti-bot JavaScript

Use browser DevTools to set breakpoints on anti-bot script execution. The sensor data collection function, the one that assembles the fingerprint payload, is identifiable by the volume of signals it reads: canvas, fonts, timing arrays, and screen dimensions. Observing what it collects doesn't require defeating it. Understanding which signals matter informs which patches to apply.

The Akamai _abck cookie is generated client-side from sensor data and validated server-side. The server rejects a cookie that was generated by a different fingerprint than the current request. Simply copying the cookie from a valid browser session doesn't work because it's bound to the session's specific sensor fingerprint. Understanding this flow explains why the fix requires spoofing the sensor data collection, not the cookie value itself.

Network tab analysis for hidden API endpoints

Watching XHR and fetch calls in the Network tab during page interaction often reveals JSON endpoints the site uses internally – paths like /api/v2/products or /graphql. These endpoints often carry lower anti-bot scrutiny than the rendered HTML page and return structured data that's easier to parse.

When reverse engineering is the wrong approach

Some anti-bot systems update continuously. Cloudflare Turnstile, for example, receives frequent updates that can break reverse-engineered evasion logic overnight. Recognizing when the ongoing maintenance cost exceeds the cost of switching to a managed solution is a practical engineering judgment, not a failure.

Leveraging APIs and alternative data sources

Before building a complex scraper against a heavily protected target, check whether the target exposes an unofficial JSON API that its frontend already consumes. This is often the path of least resistance: no HTML parsing, lower anti-bot scrutiny, and structured data.

The discovery workflow

Open DevTools Network tab, interact with the page (scroll, trigger filters, click load-more buttons), and filter XHR/fetch calls. Most modern single-page applications load data from dedicated API paths. A GraphQL endpoint is particularly valuable – a single query can return deeply structured data that would require multiple scraped pages to reconstruct.

Extracting authentication tokens

Many internal APIs require a Bearer token or a CSRF token obtained from the initial page load. Playwright can capture this token before switching to direct API calls:

from playwright.sync_api import sync_playwright
import requests
 
captured = {}
 
def handle_request(request):
    auth = request.headers.get("authorization")
    if auth:
        captured["token"] = auth
 
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.on("request", handle_request)
    page.goto("https://quotes.toscrape.com/")
    browser.close()
 
if "token" in captured:
    api_headers = {"Authorization": captured["token"]}
    data = requests.get("https://api.example.com/v2/items", headers=api_headers).json()
    print(data)

Official APIs and the hybrid approach

Before scraping at all, verify whether the target offers an official API. Twitter/X, Reddit, LinkedIn, and most major eCommerce platforms provide API access. Pre-scraped datasets cover many common use cases without any scraping overhead.

The hybrid approach works well when an official or unofficial API covers most fields but leaves gaps. Use the API for structured fields and fall back to HTML parsing only for data that the API doesn't expose.

For targets where direct scraping is impractical at scale, Decodo’s Web Scraping API is one of the best choices. It handles the entire anti-bot stack for you, proxy rotation, fingerprint management, JavaScript rendering, and CAPTCHA solving, so you only deal with clean, structured output. The free plan gives you access to over 100 ready-made scraping templates for popular targets like Google, Amazon, and real estate listings, plus a pool of 125M+ ethically sourced IPs across 195+ locations to keep block rates low.

Collect data at scale with Web Scraping API

Activate your free plan and get results in HTML, JSON, CSV, XHR, PNG, or Markdown.

Get started

Legal and ethical considerations

Web scraping isn't illegal by default, but there are no scraping-specific laws either. Instead, it falls under a patchwork of existing regulations, like copyright, data privacy, computer access, and contract law, that vary by jurisdiction. What matters is the context – what data you're collecting, how you're accessing it, and what you do with it afterward.

Public data doesn't mean free data

Just because information is publicly visible doesn't mean you can use it however you want. In many jurisdictions, personal data like names, emails, or behavioral patterns is still protected by privacy regulations regardless of visibility. The EU's GDPR, for example, treats any data that can identify an individual as personal data, no matter where it was found. U.S. state privacy laws like the CCPA take a different approach, generally excluding publicly available information from the personal data definition.

Terms of Service

Many websites restrict automated access through their ToS. The enforceability depends on the type of agreement. Browsewrap agreements (terms posted in a footer) are often unenforceable because users don't explicitly consent. Clickwrap agreements, where you actively click "I agree," are far more likely to hold up. Simply visiting a page doesn't create a binding contract, but logging in or clicking through a consent screen can.

Copyright and database protection

Raw facts aren't copyrightable, so scraping individual data points like prices, dates, or statistics is generally fine. However, creative expression of those facts, articles, images, and curated collections is protected. The EU's Database Directive adds another layer, protecting databases that require significant investment to compile. To stay safe, focus on extracting factual data rather than copying how someone presented it, and transform scraped data into something new rather than republishing it.

Practical guidelines

Before running a production scraper, it's worth following a few baseline practices:

Read the target's ToS. Understand any restrictions on automated access before you start collecting data.
Use official public APIs when available. They offer cleaner data, clearer usage terms, and more stable long-term access.
Space out your requests. Avoid overwhelming the server, which also reduces your chance of triggering anti-bot measures.
Identify your scraper. Set a proper User-Agent string with contact information so site owners can reach you if needed.
Be cautious with personal data. Collect only what you need, and make sure you're complying with applicable privacy regulations.
Consult a legal professional when in doubt. Especially for commercial projects, cross-border scraping, or work in regulated industries.

Industry-leading residential proxies

Access 115M+ residential IPs with fast response times and high success rates.

Start free trial

Bottom line

Most scraper failures are fixable if you know which layer is rejecting the request. The progression is logical – start with residential proxies and a realistic Chrome header set to address the network layer. Add TLS fingerprint matching, using curl_cffi or a similar tool for targets that inspect the ClientHello before reading headers. Reach for headless browsers only when JavaScript rendering is genuinely required, and patch the fingerprint surface thoroughly before using them in production.

Treat CAPTCHA solving and reverse engineering as specialist tools for specific situations, not default starting points. ReCAPTCHA v3 rarely requires a solver if session hygiene is maintained. Reserve reverse engineering for cases where all standard countermeasures have failed, and the target is worth the maintenance overhead.

Anti-bot systems update continuously. Session monitoring, selector validation, and API endpoint checks are ongoing maintenance tasks, not one-time setups. The scraper that works today may need adjustment next month – build monitoring in from the start.

About the author

Benediktas Kazlauskas

Content & PR Team Lead

Benediktas is a content professional with over 8 years of experience in B2C, B2B, and SaaS industries. He has worked with startups, marketing agencies, and fast-growing companies, helping brands turn complex topics into clear, useful content.

Connect with Benediktas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may belinked therein.

In this article

Collect data faster with Web Scraping API

Bypass CAPTCHAs with 125M+ IPs, JavaScript rendering, and automated anti-bot bypassing.

Activate free plan

Frequently asked questions

Why does my scraper get blocked even with a VPN?

VPN IPs originate from datacenter ASN ranges. Anti-bot systems maintain blocklists of known datacenter and VPN ASNs, so the IP gets flagged regardless of the request content. Residential proxies use ISP-assigned IP ranges that are indistinguishable from legitimate end-user traffic, which is why they're the standard choice for scraping high-security targets.

What is a JA3 fingerprint and how does it affect scraping?

A JA3 fingerprint is a hash of specific fields in the TLS ClientHello message – cipher suites, extensions, elliptic curves, and elliptic curve point formats. Each HTTP client (Python requests, curl, Chrome, Firefox) produces a distinct JA3 hash. Anti-bot systems compare the JA3 hash of an incoming connection against known browser fingerprints. A request with a Chrome User-Agent but a python-requests JA3 hash is flagged immediately at the TLS layer.

Can I scrape a website that has Cloudflare protection?

Cloudflare's detection systems operate across multiple layers. Basic Cloudflare protection, HTTP-level rules, and IP reputation checks are passable with residential proxies and correct headers. Cloudflare Turnstile, used on higher-security sites, requires full browser execution because it runs a JavaScript probe that plain HTTP clients can replicate. For Turnstile-protected sites, a headless browser with stealth patching or a managed solution like Decodo’s Site Unblocker is required.

How many requests per second can I send without getting blocked?

There's no universal answer. Rate limits vary by site, endpoint, and time of day. The practical approach is to read X-RateLimit-Remaining headers in real time, back off on 429 responses, and start conservatively – 1 request every 3 to 5 seconds per IP is a safe baseline for most targets. Increase gradually while monitoring for soft blocks (200 responses with unexpected content). Off-peak crawling during midnight local-to-server time often allows higher throughput.

Is rotating proxies enough to avoid getting blocked?

Proxy rotation addresses the network layer, but modern anti-bot systems operate across 4 layers. IP rotation alone doesn't help against TLS fingerprinting (Layer 2), browser signal detection (Layer 3), or behavioral analysis (Layer 4). Residential proxy rotation combined with correct headers, TLS fingerprint matching, and human-like timing addresses all 4 layers simultaneously.

What's the difference between reCAPTCHA v2 and v3?

reCAPTCHA v2 presents a visual challenge (image grid selection) that requires a human or a solver service to complete. reCAPTCHA v3 operates silently in the background, scoring sessions on a scale from 0.0 (bot) to 1.0 (human) based on behavioral signals. There's no visual challenge to solve in v3 – the score determines whether the site serves content normally, adds friction, or blocks the request. Good session hygiene (residential IP, realistic headers, Gaussian timing) can keep the v3 score high enough to avoid any challenge.

Unlocked padlock being cut by dashed diagonal line over a browser search-results window on dark blue background

DATA COLLECTION

BUSINESS AUTOMATION

UNBLOCK

How to Scrape Google Without Getting Blocked

Nowadays, web scraping is essential for any business interested in gaining a competitive edge. It allows quick and efficient data extraction from a variety of sources and acts as an integral step toward advanced business and marketing strategies.

If done responsibly, web scraping rarely leads to any issues. But if you don’t follow data scraping best practices, you become more likely to get blocked. Thus, we’re here to share with you practical ways to avoid blocks while scraping Google.

James Keenan

Last updated: Feb 20, 2023

8 min read

Magnifying glass highlighting a robot face icon over a browser window on a dark background

UNBLOCK

DATA COLLECTION

Navigating Anti-Bot Systems: Pro Tips For 2026

With the rapid improvements in artificial intelligence technologies, it seems that 2026 will present some new challenges for web scraping enthusiasts and professionals. Over the years, anti-bot systems have become increasingly sophisticated, which makes extracting valuable data from websites a true challenge. As businesses intensify their efforts to protect against automated bots, traditional web scraping methods are being put to the test. The surge in anti-bot measures is not only due to heightened cybersecurity awareness but also signifies a shift in the digital ecosystem and growing competition. As a result, those who want to leverage publicly available data need to recalibrate their strategies to navigate and circumvent anti-bot systems.

If CAPTCHAs and IP bans were not on your bingo card for 2026, our comprehensive guide is a must-read. We’ve sat down with our scraping gurus and discussed the best practices, gathered all the pro tips, and summarized what’s coming next for anti-bot systems and scrapers. As 2026 approaches, it demands a proactive approach to understanding, outsmarting, and ultimately thriving in the face of escalating anti-bot measures, so grab a cup of coffee and dive into our guide.

If you can't access the whole article, make sure you have disabled your ad blocker

Vilius Sakutis

Last updated: Jan 05, 2026

11 min read

DATA COLLECTION

Playwright Web Scraping: A Practical Tutorial

Web scraping can feel like directing a play without a script – unpredictable and chaotic. That’s where Playwright steps in: a powerful, headless browser automation tool that makes scraping modern, dynamic websites smoother than ever. In this practical tutorial, you’ll learn how to use Playwright to reliably extract data from any web page.

Zilvinas Tamulis

Last updated: Jan 13, 2025

8 min read

Web Scraping Without Getting Blocked: A Practical Guide for 2026

TL;DR

How websites detect scrapers: the four-layer model

Layer #1: Network

Layer #2: TLS

Layer #3: Browser

Layer #4: Behavioral

IP rotation and proxy usage

Why a single IP fails

Proxy type trade-offs

Implementing rotation in Python

User-agent and request header management

What default headers reveal

The full Chrome header set

Practical code: Building a realistic header dict

Request timing and behavior randomization

Why fixed delays fail

Gaussian delays vs. uniform random

Intra-page behavior with Playwright

Headless browsers and browser fingerprinting

What default headless Chrome leaks

playwright-stealth: what it patches and what it misses

The full fingerprint surface

Camoufox as an alternative

When to use headless browsers and when not to

CAPTCHA and anti-bot evasion

CAPTCHA taxonomy

The reCAPTCHA v3 insight most articles miss

Automated CAPTCHA solver integration

Honeypot and trap avoidance

Types of honeypot traps

Practical filter: is_safe_link()

Identifying and respecting rate limits

Reading rate-limit headers

Soft blocks disguised as 200 OK

Exponential backoff with jitter

Reverse engineering and anti-bot system analysis

Identifying which vendor is protecting the target

Reading obfuscated anti-bot JavaScript

Tracing the Akamai _abck cookie flow

Network tab analysis for hidden API endpoints

When reverse engineering is the wrong approach

Leveraging APIs and alternative data sources

The discovery workflow

Extracting authentication tokens

Official APIs and the hybrid approach

Legal and ethical considerations

Public data doesn't mean free data

Terms of Service

Copyright and database protection

Practical guidelines

Bottom line

Frequently asked questions

Why does my scraper get blocked even with a VPN?

What is a JA3 fingerprint and how does it affect scraping?

Can I scrape a website that has Cloudflare protection?

How many requests per second can I send without getting blocked?

Is rotating proxies enough to avoid getting blocked?

What's the difference between reCAPTCHA v2 and v3?

Related articles