Asynchronous Web Scraping in Python: Build Faster Scrapers With asyncio and aiohttp

A scraper that fetches pages one at a time spends most of its time waiting on the network. Asynchronous web scraping in Python (built on asyncio and aiohttp) fixes that by handling many requests at once on a single event loop. This guide walks through building a working async scraper, then covers the proxy, retry, and anti-bot escalation patterns you'll need at scale.

Justinas Tamasevicius

Last updated: May 18, 2026

25 min read

TL;DR

Async scraping in Python uses asyncio and aiohttp to send many requests at once on a single thread. A multi-URL scrape finishes in seconds instead of minutes.
The baseline pattern combines ClientSession for connection pooling, asyncio.Semaphore for concurrency, split 5xx/429 backoff for retries, and JSON/CSV for export.
At scale, add rotating residential proxies – and escalate to TLS impersonation (curl_cffi) or headless browsers (Patchright) when targets start blocking.
For the hardest targets, a managed service like Decodo's Site Unblocker or Web Scraping API removes the anti-bot maintenance entirely.

What is asyncio and how does it work?

Python's asyncio is a standard-library module for writing concurrent code that runs on a single thread. Instead of creating operating-system threads, it uses an event loop that schedules coroutines, pausing one whenever it reaches an I/O wait and passing the CPU to another. For network-bound work like web scraping, that pause-and-switch pattern is what makes async faster.

Three concepts matter:

Event loop – the scheduler. It watches which coroutines are ready to run and which are waiting on I/O, and interleaves them.
Coroutine – a function defined with async def. Calling it returns a coroutine object; it doesn't run until awaited.
Task – a coroutine scheduled on the event loop. asyncio.create_task() schedules one; asyncio.gather() schedules many and waits for all.
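To make the coroutine-versus-task distinction concrete, here is a small sketch with no network calls. The names work and log are illustrative only and aren't part of asyncio.

```python
import asyncio

log: list[str] = []  # records the order in which the function bodies start

async def work(name: str) -> str:
    log.append(f"run {name}")
    await asyncio.sleep(0.01)
    return name

async def main() -> list[str]:
    coro = work("A")                       # coroutine object only; the body has not started
    task = asyncio.create_task(work("B"))  # scheduled on the running event loop
    await asyncio.sleep(0)                 # yield once so the loop can start the task
    started_before_await = list(log)       # B has started by now, A has not
    first = await coro                     # awaiting is what finally runs A
    second = await task
    return [first, second, *started_before_await]

results = asyncio.run(main())
print(results)  # ['A', 'B', 'run B']
print(log)      # ['run B', 'run A']
```

B starts before A even though A was created first, because only B was handed to the event loop as a task.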

Here's a minimal example with no network calls, so you can see the interleaving:

import asyncio
import time

async def fetch_mock(name: str, delay: float) -> str:
    print(f"[{time.strftime('%X')}] start {name}")
    await asyncio.sleep(delay)  # simulates waiting on a response
    print(f"[{time.strftime('%X')}] done  {name}")
    return f"response from {name}"

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(
        fetch_mock("A", 2.0),
        fetch_mock("B", 3.0),
        fetch_mock("C", 1.0),
    )
    elapsed = time.perf_counter() - start
    print(f"finished in {elapsed:.2f}s")
    print(results)

asyncio.run(main())

Running it produces interleaved timestamps that show the overlap directly:

[12:00:00] start A
[12:00:00] start B
[12:00:00] start C
[12:00:01] done  C
[12:00:02] done  A
[12:00:03] done  B
finished in 3.00s

All 3 coroutines start in the same second, each pauses during its asyncio.sleep(), and the event loop resumes each when its timer expires – C (1s) finishes first, then A (2s), then B (3s). Total time is the slowest task, not the sum. Real HTTP requests behave the same way: the slow part is waiting for bytes, which asyncio overlaps across many tasks.

asyncio gives you 3 ways to wait on a batch:

asyncio.gather(*tasks) – waits for every task and returns results in the same order as the inputs. The * in *tasks is Python's argument-unpacking operator; it expands a list into separate positional arguments, so gather(*[a, b, c]) is the same as gather(a, b, c). Use gather when you need the full batch before continuing.
asyncio.as_completed(tasks) – yields tasks in the order they finish. Use it when you want to process each result as soon as it arrives, for example to write to disk without waiting for the slowest response.
asyncio.TaskGroup() (Python 3.11+) – an async context manager. You create tasks with tg.create_task() inside the async with block, and the group applies structured concurrency: if any task raises an exception, every other task in the group is canceled, the block waits for cleanup to finish, then raises an ExceptionGroup that bundles the failures together. (ExceptionGroup is a Python 3.11 type for grouping multiple exceptions into one; you catch it with the new except* syntax – note the asterisk). The gather() function is still common for simple batch waits, but TaskGroup is the safer default when failures matter, because it doesn't leave tasks running after another task in the group has already crashed.
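As a rough illustration of the second option, the sketch below reuses the mock-fetch idea with shorter delays and collects results in completion order. Each item yielded by as_completed() is awaited to get its result; fetch_mock is the same stand-in as before, not a real HTTP call.

```python
import asyncio

async def fetch_mock(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # simulates waiting on a response
    return name

async def main() -> list[str]:
    tasks = [
        asyncio.create_task(fetch_mock(name, delay))
        for name, delay in [("A", 0.2), ("B", 0.3), ("C", 0.1)]
    ]
    finished = []
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)  # handle each result as soon as it arrives
    return finished

order = asyncio.run(main())
print(order)  # ['C', 'A', 'B']
```

With gather() the same inputs would come back as A, B, C, because gather preserves input order rather than completion order.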

The rest of this guide uses gather() because it keeps the examples short and matches the 3.10+ baseline. If you're on 3.11 or later and want stricter failure semantics, replace any await asyncio.gather(*tasks) with a TaskGroup block and read each result from task.result() once the block exits; the rest of the code doesn't change.

asyncio isn't the same as threading or multiprocessing. Threading runs multiple OS threads inside one process and works well for I/O-bound code that can't be rewritten in async form. Multiprocessing runs multiple Python processes and is typically the right choice for CPU-bound work like image processing. asyncio is the lightest of the three: one thread, one process, thousands of coroutines.

For scraping, where nearly all the time is network I/O, asyncio handles more concurrent requests with the same amount of memory – a coroutine costs a few KB, while an OS thread reserves around 8 MB of stack on typical Linux defaults. The trade-offs between these models are covered in the guide on concurrency vs. parallelism.

Synchronous vs. asynchronous web scraping: performance compared

Synchronous web scraping sends one request at a time and waits for each response before starting the next. Asynchronous web scraping sends many requests concurrently on a single thread, overlapping the network waits – which is why an async scraper finishes a 50-URL batch in seconds where a sync one takes minutes. In sync scraping, every network wait runs one after another, adding up in real time. Async overlaps them inside the time of the slowest one:

Synchronous vs. asynchronous performance comparison

To verify the speedup, build the same scraper twice – once with Requests, once with aiohttp – and run both against the same target. The example below fetches the Hacker News front page, extracts story links, then fetches each story page and parses out title, score, author, and comment count. Different HTTP clients suit different jobs; the guide on HTTPX vs. Requests vs. AIOHTTP covers when to pick which.

The synchronous baseline looks like this:

# sync_scraper.py
import time
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/142.0.0.0 Safari/537.36"
    )
}
BASE = "https://news.ycombinator.com/"

def fetch(url: str) -> str:
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text

def extract_story_urls(html: str) -> list[str]:
    """Pull HN discussion URLs from the front page using each row's id attribute."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for row in soup.select("tr.athing"):
        story_id = row.get("id")
        if story_id:
            urls.append(f"{BASE}item?id={story_id}")
    return urls

def parse_story(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.select_one("span.titleline > a")
    score_tag = soup.select_one("span.score")
    author_tag = soup.select_one("a.hnuser")
    subtext = soup.select_one("td.subtext")

    comments = 0
    if subtext:
        for anchor in subtext.select("a"):
            text = anchor.get_text(strip=True)
            if "comment" in text:
                parts = text.split()
                if parts and parts[0].isdigit():
                    comments = int(parts[0])

    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "url": title_tag.get("href") if title_tag else None,
        "score": int(score_tag.get_text().split()[0]) if score_tag else 0,
        "author": author_tag.get_text(strip=True) if author_tag else None,
        "comments": comments,
    }

def run():
    front_html = fetch(BASE)
    urls = extract_story_urls(front_html)

    start = time.perf_counter()
    results = [parse_story(fetch(url)) for url in urls]
    elapsed = time.perf_counter() - start

    print(f"sync: {len(results)} stories in {elapsed:.2f}s")
    return results

if __name__ == "__main__":
    run()

If you used the href on span.titleline > a, you would scrape the external article. The HN discussion URL has to be built from the row's id attribute (the story ID), which is why extract_story_urls reads row.get("id") and constructs item?id=… directly. To avoid this on your own targets, inspect the element in browser devtools and check the actual href / id before trusting a selector.

The async version reuses the same extract_story_urls and parse_story functions and only changes the fetch strategy. Save the file as async_scraper.py next to sync_scraper.py in the same directory so the import resolves:

# async_scraper.py
import asyncio
import time
import aiohttp
from sync_scraper import extract_story_urls, parse_story, HEADERS, BASE

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.text()

async def run():
    async with aiohttp.ClientSession(headers=HEADERS) as session:
        front_html = await fetch(session, BASE)
        urls = extract_story_urls(front_html)

        start = time.perf_counter()
        tasks = [fetch(session, url) for url in urls]
        pages = await asyncio.gather(*tasks)
        results = [parse_story(p) for p in pages]
        elapsed = time.perf_counter() - start

    print(f"async: {len(results)} stories in {elapsed:.2f}s")
    return results

if __name__ == "__main__":
    asyncio.run(run())

To measure the speedup without sending heavy load to a real target (or being rate-limited during the benchmark), the cleanest demo is to point both scrapers at httpbin.org/delay/1, an endpoint that waits one second before responding. Each request has roughly one second of network wait, so the sync scraper pays it N times and the async scraper overlaps them.

# benchmark.py
import asyncio
import time
import requests
import aiohttp

URL = "https://httpbin.org/delay/1"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/142.0.0.0 Safari/537.36"
    )
}

def run_sync(n: int) -> float:
    start = time.perf_counter()
    for _ in range(n):
        requests.get(URL, headers=HEADERS, timeout=30)
    return time.perf_counter() - start

async def run_async(n: int, concurrency: int) -> float:
    sem = asyncio.Semaphore(concurrency)
    async def bound(s):
        async with sem:
            async with s.get(URL) as r:
                return r.status
    timeout = aiohttp.ClientTimeout(total=30, connect=5, sock_read=10)
    async with aiohttp.ClientSession(headers=HEADERS, timeout=timeout) as s:
        start = time.perf_counter()
        await asyncio.gather(*[bound(s) for _ in range(n)])
        return time.perf_counter() - start

if __name__ == "__main__":
    print(f"{'count':<8}{'sync (s)':<14}{'async (s)':<14}{'speedup':<10}")
    print("-" * 50)
    for n in [10, 25, 50]:
        sync_t = run_sync(n)
        async_t = asyncio.run(run_async(n, concurrency=20))
        print(f"{n:<8}{sync_t:<14.2f}{async_t:<14.2f}{sync_t/async_t:.1f}x")

Output from a real run on a residential connection (your numbers will vary with network round-trip time and httpbin's load, but the ratios stay in the same range):

Requests    Sync (s)    Async (s)    Speedup
10          32.67       2.63         12.4x
25          97.36       6.54         14.9x
50          169.85      7.07         24.0x

Each request to httpbin.org/delay/1 costs 1 second of server delay plus 2-3 seconds of network round-trip and httpbin's variable load, so the per-request wall time on a residential link is closer to 3-4 seconds than to 1. The async time stays close to flat as N grows until the concurrency cap is fully used, which is why the speedup widens with batch size.
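One way to reason about the effect of the concurrency cap is an idealized model: if every request takes the same time, a batch of N requests under a cap of C finishes in roughly ceil(N / C) waves. The sketch below checks that model with asyncio.sleep() standing in for the network; DELAY and timed_batch are assumptions for illustration, and real responses won't arrive in lockstep like this.

```python
import asyncio
import math
import time

DELAY = 0.1        # pretend every request takes 100 ms
CONCURRENCY = 20   # same cap as benchmark.py

async def timed_batch(n: int, concurrency: int, delay: float) -> float:
    sem = asyncio.Semaphore(concurrency)

    async def one_request() -> None:
        async with sem:                 # at most `concurrency` sleeps overlap
            await asyncio.sleep(delay)

    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(n)))
    return time.perf_counter() - start

rows = []
for n in (10, 25, 50):
    waves = math.ceil(n / CONCURRENCY)
    elapsed = asyncio.run(timed_batch(n, CONCURRENCY, DELAY))
    rows.append((n, waves, elapsed))
    print(f"n={n:<3} waves={waves} predicted={waves * DELAY:.2f}s measured={elapsed:.2f}s")
```

Under this model 10, 25, and 50 requests need 1, 2, and 3 waves, so async time grows in steps while sync time grows with every extra request.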

The same pattern holds against real-world targets (HN, news listings, product pages), but absolute numbers depend on the target's response time, your network latency, and how strictly the target rate-limits a single IP. A high-volume scraper usually hits the target's rate limit before it hits its own throughput ceiling, which is why proxies (covered later in this guide) matter.

Sync scraping isn't always worse. It's the better choice when:

You need to scrape a single page, where the setup cost of asyncio doesn't pay off.
You're debugging selectors and want a fast print-then-inspect cycle.
The target requires sequential interaction (login, then form, then result) where request order matters and there's no parallelism to exploit.

For anything past a few dozen pages against a tolerant target, async is the right default.

Setting up your async web scraping project

A clean project layout and a proxy slot set up from the start will save an hour of refactoring later. Proxies matter because a high request rate from a single IP often gets blocked quickly on strict targets, and adding proxy auth to a scraper afterwards usually means modifying every fetch call; the article on residential proxies explains why rotating residential IPs help at scale.

The stack is Python 3.10 or newer, aiohttp on the 3.13.x line, beautifulsoup4 4.14 or later, and python-dotenv for credentials. Why Python 3.10 as the minimum: the str | None union syntax used in the type hints below requires it, and the optional escalation tools (curl_cffi 0.14+ and Scrapling) both require 3.10+. Python 3.11 or later also enables asyncio.TaskGroup and the asyncio.timeout() context manager; 3.14 adds the python -m asyncio ps introspection CLI covered later.

If you have uv installed – the Rust-based package manager from Astral, widely adopted in 2025-2026 and advertised as 10-100x faster than pip – setup takes a handful of commands in a fresh project directory:

mkdir async-scraper && cd async-scraper
uv venv
source .venv/bin/activate          # macOS / Linux
# .venv\Scripts\activate           # Windows

uv pip install "aiohttp>=3.13" "beautifulsoup4>=4.14" python-dotenv

If you use pip, the equivalent works the same, just slower:

python -m venv .venv
source .venv/bin/activate

pip install "aiohttp>=3.13" "beautifulsoup4>=4.14" python-dotenv

For one-off scripts where a full project is more than you need, uv also supports PEP 723 inline script metadata: declare the dependencies in a comment block at the top of a single .py file, and uv run builds a temporary environment for it automatically.

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "aiohttp>=3.13",
#     "beautifulsoup4>=4.14",
# ]
# ///
import asyncio, aiohttp
# ... your scraper here ...

uv run scraper.py     # builds env from the inline block, runs the script

This is useful for sharing a working scraper as one file: the recipient runs uv run scraper.py and the dependencies install automatically. The full multi-file project layout below is still the right structure for anything you'll maintain long-term.

aiohttp's stable releases don't support HTTP/2 (support is tracked in an open upstream issue but hasn't shipped). If a target serves different content to HTTP/2 clients or prioritizes them, HTTPX is the common alternative: it has an async API and optional HTTP/2 support, but lower raw throughput. For most scraping work, where the target serves plain HTTP/1.1 HTML, aiohttp's throughput advantage makes it the better choice.

Lay out the project with one file per responsibility. Small files are easier to test and change than a single scraper script that does everything:

async-scraper/
├── .env                 # proxy credentials (gitignored)
├── config.py            # credentials, headers, concurrency settings
├── parser.py            # Beautiful Soup extraction logic
├── resilient_fetch.py   # retry / backoff wrapper
├── export.py            # JSON and CSV writers
└── scraper.py           # the orchestrator that ties them together

Store your Decodo residential proxy credentials in a .env file at the project root. Keeping them out of code means the repo stays safe to commit, and rotating credentials is a one-line change. The username and password come from your Decodo dashboard once your residential plan is active; the same gateway host and port work for rotating, sticky, and geo-targeted requests, with the differences encoded in the username string.

If you don't have proxy credentials yet, leave DECODO_USER and DECODO_PASS unset – PROXY_URL will default to None in config.py and the rest of this guide will run against a tolerant target without a proxy. Add real credentials before increasing volume against any site that enforces rate limits.

# .env
DECODO_USER=your_username
DECODO_PASS=your_password
DECODO_HOST=gate.decodo.com
DECODO_PORT=7000

Load those values in config.py so every module reads them from a single place:

# config.py
import os
from dotenv import load_dotenv

load_dotenv()

DECODO_USER = os.getenv("DECODO_USER")
DECODO_PASS = os.getenv("DECODO_PASS")
DECODO_HOST = os.getenv("DECODO_HOST", "gate.decodo.com")
DECODO_PORT = os.getenv("DECODO_PORT", "7000")

PROXY_URL = (
    f"http://{DECODO_USER}:{DECODO_PASS}@{DECODO_HOST}:{DECODO_PORT}"
    if DECODO_USER and DECODO_PASS
    else None
)

DEFAULT_HEADERS = {
    # Realistic Chrome header set - the practical default for general scraping
    # in 2026. Update the Chrome version periodically (check chromium.org/release):
    # anti-bot vendors flag UAs more than 6-9 months stale.
    #
    # For cooperative targets where you want to identify your scraper - sites
    # you own, public-data sources, partners with explicit allowance - swap
    # the User-Agent for a polite form like:
    #   "Mozilla/5.0 (compatible; YourProject/1.0; +https://your-contact-url)"
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/142.0.0.0 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/avif,image/webp,image/apng,*/*;q=0.8,"
        "application/signed-exchange;v=b3;q=0.7"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    # "br" (brotli) needs brotli or brotlicffi installed to decode.
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Ch-Ua": '"Google Chrome";v="142", "Not-A.Brand";v="99"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"macOS"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
}

CONCURRENCY = 10
REQUEST_TIMEOUT = 30

That's the practical default for general scraping in 2026. The Sec-Ch-Ua and Sec-Fetch-* headers are sent by every modern Chrome and their absence is a strong "this is automation" signal, so they belong in the default set even when a target doesn't strictly check them today.

The polite alternative in the comment is the right choice for cooperative targets where the relationship matters more than looking like a browser. Pair this header set with curl_cffi's TLS impersonation later in this guide for a stack that matches Chrome at both the TLS layer and the HTTP-header layer.

Before you write any extraction logic, confirm aiohttp is installed correctly and the network is working. Save the script below as verify.py in the project root, then run python verify.py (with the virtual environment activated). Expect a 200 status and a JSON body echoing your User-Agent. If you see ModuleNotFoundError: No module named 'aiohttp', your virtual environment isn't active – re-run source .venv/bin/activate.

# verify.py
import asyncio
import aiohttp
from config import DEFAULT_HEADERS

async def main():
    async with aiohttp.ClientSession(headers=DEFAULT_HEADERS) as session:
        async with session.get("https://httpbin.org/get") as resp:
            print(resp.status, await resp.json())

asyncio.run(main())

If this fails, fix networking or dependency issues before continuing; debugging is much harder once retries and proxies are added on top.

Is uvloop still faster than asyncio?

uvloop is a drop-in asyncio event loop for Linux and macOS, built on libuv, the same C library that Node.js uses for its event loop. It used to deliver a 2x to 4x throughput improvement on network-heavy workloads. On Python 3.13 and 3.14 the gap has narrowed for HTTP scraping: CPython's native asyncio loop has improved enough that the result varies by workload and can go either way. Treat uvloop as a measure-then-decide optimization, not an automatic default.

Where uvloop still helps: code that does lots of small awaits in tight loops, timer-driven workloads, and WebSocket servers that broadcast to many clients at once.

uv pip install uvloop

The current API is uvloop.run(), a drop-in replacement for asyncio.run() (the older uvloop.install() relies on asyncio.set_event_loop_policy(), deprecated in Python 3.14 and slated for removal in 3.16). On Windows, use asyncio.run() instead since uvloop doesn't run there.
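A cross-platform entry point could be sketched as follows. It assumes a uvloop release recent enough to provide uvloop.run(); run_with_best_loop is a hypothetical helper name, and main() is a stand-in for the scraper's real entry coroutine.

```python
import asyncio
import sys

async def main() -> str:
    await asyncio.sleep(0)  # stand-in for the scraper's real entry point
    return "done"

def run_with_best_loop(coro):
    """Run coro on uvloop when it is importable, else on the stock asyncio loop."""
    if sys.platform != "win32":
        try:
            import uvloop  # optional dependency; not available on Windows
        except ImportError:
            pass
        else:
            return uvloop.run(coro)
    return asyncio.run(coro)

result = run_with_best_loop(main())
print(result)  # done
```

Keeping the choice in one helper makes it easy to time the same scrape with and without uvloop before committing to either.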

Sending async HTTP requests with aiohttp

The async benchmark above used aiohttp.ClientSession in its simplest form, but the most common performance bug in async scrapers comes from misunderstanding what a session is. ClientSession is a connection pool, not a convenience wrapper around session.get(). It holds open TCP and TLS connections to each host so the next request can reuse them – what HTTP calls keep-alive. (Each new HTTPS connection normally costs a round-trip for the TCP handshake plus one or two more for the TLS handshake; keep-alive lets the next request skip both.) Readers used to sync requests can compare the patterns in the guide on Python Requests.

Creating 1 session per request discards the pool, forces a new TCP connection and TLS handshake every time, and slows the scraper down to something close to sync speed. The correct pattern is 1 session for the whole run:

# session_example.py
import asyncio
import aiohttp
from config import DEFAULT_HEADERS, REQUEST_TIMEOUT

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # The session is created with raise_for_status=True, so 4xx/5xx responses
    # already raise ClientResponseError here; no explicit check is needed.
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = [
        "https://news.ycombinator.com/",
        "https://news.ycombinator.com/newest",
        "https://news.ycombinator.com/ask",
    ]
    # REQUEST_TIMEOUT is 30 in config.py; connect and sock_read cap each phase.
    timeout = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT, connect=5, sock_read=10)
    async with aiohttp.ClientSession(
        headers=DEFAULT_HEADERS,
        timeout=timeout,
        raise_for_status=True,
    ) as session:
        tasks = [fetch(session, u) for u in urls]
        pages = await asyncio.gather(*tasks)
        for url, page in zip(urls, pages):
            print(url, len(page), "bytes")

asyncio.run(main())

A few details in this snippet matter for production scrapers:

headers=DEFAULT_HEADERS on the session applies to every request inside the async with block. Set the User-Agent once there rather than on every session.get() call.
timeout=ClientTimeout(total=30, connect=5, sock_read=10) on the session sets a default timeout for every request, so you don't have to pass one to each session.get(). aiohttp's default is total=300 (recent releases also cap sock_connect at 30 seconds), which is too generous for most scrapers. Splitting the budget across connect (max time to obtain a connection, including any wait for a free slot in the pool) and sock_read (max time between 2 chunks of bytes) lets a stuck connect or a stalled read fail in seconds rather than waiting for the full total.
raise_for_status=True on the session calls response.raise_for_status() automatically on every response, so 4xx and 5xx codes turn into ClientResponseError exceptions you can catch centrally.

One security note on raise_for_status: by default the exception's message attribute can include the original Authorization header value if the server echoes it back. If you're using basic auth on the proxy or target, either set raise_for_status=False and check status manually, or strip the header from logs before recording the exception.
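One way to apply the "strip the header from logs" advice is a small redaction helper that runs before anything is written to a log. This is only a sketch: redact_headers and SENSITIVE_HEADERS are hypothetical names, and the set of headers to mask is an assumption you should extend for your own setup.

```python
from collections.abc import Mapping

# Header names compared in lowercase; extend this set to fit your targets.
SENSITIVE_HEADERS = {"authorization", "proxy-authorization"}

def redact_headers(headers: Mapping[str, str]) -> dict[str, str]:
    """Return a copy that is safe to log: credential values are masked."""
    return {
        name: "<redacted>" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }

sample = {"Authorization": "Basic dXNlcjpwYXNz", "Accept": "text/html"}
safe = redact_headers(sample)
print(safe)  # {'Authorization': '<redacted>', 'Accept': 'text/html'}
```

In an except aiohttp.ClientResponseError as exc handler, one option is to log exc.status together with redact_headers(exc.request_info.headers) instead of logging the exception object itself.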

If the target requires cookies, set them on the session so every request sends them automatically:

cookies = {"session_id": "abc123", "consent": "accepted"}
async with aiohttp.ClientSession(headers=DEFAULT_HEADERS, cookies=cookies) as session:
    ...

Custom headers per request (for example, a referer that changes per URL) go on the individual session.get() call and override the session defaults for that request only.

Parsing HTML and extracting data with Beautiful Soup

aiohttp returns HTML as a string; this guide parses it with Beautiful Soup for the same reasons most scraping stacks default to it: it is forgiving with malformed HTML, has a clean CSS-selector API, and needs no extra binary dependencies. Keep parsing in its own function, separate from fetching, so you can save a few HTML files and iterate on selectors without making any network requests. More patterns for common HTML structures are in the guide on Beautiful Soup web scraping, and the trade-offs between selector styles are covered in XPath vs. CSS selectors.

The Hacker News story page has a predictable structure: one tr.athing row per story on the front page, and each story page has a span.titleline, a span.score, an a.hnuser, and a comment count inside td.subtext. The parser below handles all fields and returns None or 0 when a field is missing rather than crashing:

# parser.py
from bs4 import BeautifulSoup

def parse_story(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    title_tag = soup.select_one("span.titleline > a")
    score_tag = soup.select_one("span.score")
    author_tag = soup.select_one("a.hnuser")
    subtext = soup.select_one("td.subtext")

    comments = 0
    if subtext:
        for anchor in subtext.select("a"):
            text = anchor.get_text(strip=True)
            if "comment" in text:
                parts = text.split()
                if parts and parts[0].isdigit():
                    comments = int(parts[0])

    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "url": title_tag.get("href") if title_tag else None,
        "score": int(score_tag.get_text().split()[0]) if score_tag else 0,
        "author": author_tag.get_text(strip=True) if author_tag else None,
        "comments": comments,
    }

Returning a dictionary (rather than a tuple or a custom class) keeps the parser easy to use with JSON and CSV exporters. Defensive checks on every select_one result matter because HN occasionally serves a deleted or flagged story with fewer fields; crashing on None.get_text() in the middle of a 500-URL run wastes every successful response that came before it.
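Because parsing lives in its own function, selectors can be tuned against saved pages instead of live requests. The sketch below shows one possible way to do that with the standard library only. The directory name fixtures and the helper names save_fixture and replay_fixtures are hypothetical, and parse stands for any function with the same shape as parse_story.

```python
from collections.abc import Callable, Iterator
from pathlib import Path

FIXTURE_DIR = Path("fixtures")  # hypothetical location for saved pages


def save_fixture(name: str, html: str, directory: Path = FIXTURE_DIR) -> Path:
    """Write one fetched page to disk so it can be re-parsed offline."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.html"
    path.write_text(html, encoding="utf-8")
    return path


def replay_fixtures(
    parse: Callable[[str], dict],
    directory: Path = FIXTURE_DIR,
) -> Iterator[tuple[str, dict]]:
    """Run `parse` over every saved page, in filename order, with no network."""
    for path in sorted(directory.glob("*.html")):
        yield path.stem, parse(path.read_text(encoding="utf-8"))
```

A typical loop is: fetch a handful of pages once, call save_fixture on each, then rerun replay_fixtures(parse_story) after every selector change.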

One note on the html.parser backend: it's part of the Python standard library, so you don't need lxml or html5lib installed. It's slower than lxml (about 1.7x per parse in the benchmark in the next section) and more forgiving of malformed HTML. For large pages or limited CPU, install lxml and pass "lxml" as the second argument to BeautifulSoup(); nothing else in the parser changes.

selectolax vs. Beautiful Soup: when parsing becomes the bottleneck

At a small scale, Beautiful Soup parsing time is small compared to network time. At a large scale, parsing can take longer than fetching. selectolax is a Python wrapper around the Lexbor HTML parser (written in C) and is significantly faster than both Beautiful Soup backends. Running 500 parses of the same ~700 KB HN story page on Python 3.14:

| Parser | Time per parse | Speedup vs. baseline |
| --- | --- | --- |
| Beautiful Soup (html.parser) | 176.56 ms | 1.0x (baseline) |
| Beautiful Soup (lxml) | 101.32 ms | 1.7x |
| selectolax (lexbor) | 3.21 ms | 54.9x |

selectolax is also 31.5x faster than Beautiful Soup with lxml, which is the fastest Beautiful Soup configuration. On a 500-URL scrape, that's the difference between roughly 50 seconds of parsing (Beautiful Soup with lxml) and 1.6 seconds (selectolax). v0.4.7 is the current release, and the underlying lexbor backend is actively developed and more spec-compliant than the older, now-deprecated modest backend.
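The benchmark script itself isn't shown here, so the following is only a sketch of how per-parse timings like these could be measured on your own pages. The function name time_parser is hypothetical, and parse can be any parser function that takes an HTML string.

```python
import time
from collections.abc import Callable


def time_parser(parse: Callable[[str], object], html: str, runs: int = 500) -> float:
    """Return the mean time per parse in milliseconds over `runs` calls."""
    parse(html)  # warm-up call so one-time setup cost is not measured
    start = time.perf_counter()
    for _ in range(runs):
        parse(html)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / runs
```

Run it with the same saved page for each parser so the comparison isolates parsing cost from network time. Absolute numbers will differ by machine and page size.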

uv pip install selectolax

Here’s the code:

# parser_fast.py
from selectolax.lexbor import LexborHTMLParser

def parse_story_fast(html: str) -> dict:
    tree = LexborHTMLParser(html)
    title_tag = tree.css_first("span.titleline > a")
    score_tag = tree.css_first("span.score")
    author_tag = tree.css_first("a.hnuser")
    subtext = tree.css_first("td.subtext")

    comments = 0
    if subtext:
        for anchor in subtext.css("a"):
            text = anchor.text(strip=True)
            if "comment" in text:
                parts = text.split()
                if parts and parts[0].isdigit():
                    comments = int(parts[0])

    return {
        "title": title_tag.text(strip=True) if title_tag else None,
        "url": title_tag.attributes.get("href") if title_tag else None,
        "score": int(score_tag.text().split()[0]) if score_tag else 0,
        "author": author_tag.text(strip=True) if author_tag else None,
        "comments": comments,
    }

The API is close enough to Beautiful Soup that porting a parser is straightforward: select_one becomes css_first, select becomes css, get_text becomes text, and attribute access uses .attributes. The trade-off is that selectolax is stricter about malformed HTML; pages with deeply broken markup still parse better with Beautiful Soup. A common pattern: try selectolax first, use Beautiful Soup as a backup if a required selector returns None.
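The fallback pattern described above can be sketched without tying it to either library. In this minimal example, parse_with_fallback is a hypothetical helper, fast and slow stand for functions such as parse_story_fast and parse_story, and the choice of "title" as the required field is an assumption you would adjust per target.

```python
from collections.abc import Callable, Iterable

Parser = Callable[[str], dict]


def parse_with_fallback(
    html: str,
    fast: Parser,
    slow: Parser,
    required: Iterable[str] = ("title",),
) -> dict:
    """Use the fast parser; re-parse with the slow one if a required field is None."""
    record = fast(html)
    if all(record.get(key) is not None for key in required):
        return record
    # A required selector came back empty, so pay for the more forgiving parser.
    return slow(html)
```

The slow parser only runs on pages where the fast one misses a required field, so the average cost stays close to the fast path.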

Controlling concurrency with asyncio.Semaphore

To limit concurrent requests in asyncio, wrap each fetch task in an asyncio.Semaphore(N) – only N tasks can hold the semaphore at once, the rest wait. This is the standard pattern for capping concurrency in aiohttp scrapers, and it solves the hidden problem in asyncio.gather(): gather does not limit how many tasks run at once. Pass 500 tasks and aiohttp tries to open 500 connections, which can overwhelm the target, trigger rate limiting or IP bans, and sometimes reach your local OS file-descriptor limit (default 1024 on Linux) before the requests even reach the server. Broader anti-detection techniques are covered in the guide on anti-scraping techniques.

Wrap the fetch inside an async with on the semaphore and you get a configurable concurrency cap with about 3 extra lines of code:

# throttled_fetch.py
import asyncio
import random
import aiohttp
from config import DEFAULT_HEADERS, REQUEST_TIMEOUT, CONCURRENCY

async def throttled_fetch(
    semaphore: asyncio.Semaphore,
    session: aiohttp.ClientSession,
    url: str,
) -> str:
    async with semaphore:
        async with session.get(url) as response:
            response.raise_for_status()
            text = await response.text()
        # small randomized jitter after each request (slot still held) blurs fixed-interval timing
        await asyncio.sleep(random.uniform(0.1, 0.5))
        return text

async def scrape(urls: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(CONCURRENCY)
    timeout = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT, connect=5, sock_read=10)
    async with aiohttp.ClientSession(headers=DEFAULT_HEADERS, timeout=timeout) as session:
        tasks = [throttled_fetch(semaphore, session, u) for u in urls]
        return await asyncio.gather(*tasks)

The right value for CONCURRENCY depends on three things: how tolerant the target is, how large your proxy pool is, and how fast each request responds. A safe starting point for most public targets is 10 to 20; increase only if the target accepts it and responses remain successful. A pool of rotating residential IPs lets you run higher concurrency because each request appears to come from a different client. (For cooperative targets where IP reputation isn't a factor, datacenter or ISP proxies are cheaper per GB; residential is the safer default when targets actively block).
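To see what a given cap actually does before pointing it at a real target, a network-free sketch can help. Here asyncio.sleep stands in for the request, and run_capped is a hypothetical helper that reports the highest number of tasks that were in flight at the same moment.

```python
import asyncio


async def run_capped(n_tasks: int, limit: int, delay: float = 0.01) -> int:
    """Run n_tasks fake fetches under a semaphore and return peak concurrency."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_fetch() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(delay)  # stands in for the network wait
            in_flight -= 1

    # gather() schedules everything at once; the semaphore is what caps it.
    await asyncio.gather(*(fake_fetch() for _ in range(n_tasks)))
    return peak
```

Calling asyncio.run(run_capped(50, 10)) schedules 50 tasks but never lets more than 10 hold the semaphore at once.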

Two throttles exist at different layers and they work together cleanly. asyncio.Semaphore caps coroutine-level concurrency: how many fetch tasks are allowed to run at once. aiohttp's TCPConnector caps connection-level concurrency: how many TCP sockets the pool will open. The connector defaults to limit=100 total and limit_per_host=0 (unlimited per host), which is too permissive for most scraping. Passing an explicit connector narrows both:

import aiohttp

connector = aiohttp.TCPConnector(limit=50, limit_per_host=10)
async with aiohttp.ClientSession(connector=connector, headers=DEFAULT_HEADERS) as session:
    ...

Using both a semaphore and a configured connector is the layered approach: the connector limits the network layer and the semaphore limits the task queue. They cover different failure modes.

The await asyncio.sleep(random.uniform(0.1, 0.5)) inside the semaphore adds a small randomized jitter after each request. It is not a replacement for proxies, but it widens the time between requests, which weakens the fixed-interval pattern that simple bot detectors look for.

Integrating proxies into async scraping with aiohttp

Async increases the need for proxies. Sync scraping sends 1 request at a time from your IP, so even a slow rate-limiter might not detect it. Async can send dozens at once from the same IP, which triggers blocks far faster. Rotating proxies change the source IP per request; the post on rotating proxies covers the models in detail.

aiohttp accepts a proxy URL as a keyword argument on session.get(). With Decodo residential proxies the URL format is http://YOUR_PROXY_USERNAME:YOUR_PROXY_PASSWORD@host:port, which is exactly the PROXY_URL value that the config.py file from earlier builds:

# proxied_fetch.py
import asyncio
import aiohttp
from config import DEFAULT_HEADERS, PROXY_URL, REQUEST_TIMEOUT

async def fetch_with_proxy(session: aiohttp.ClientSession, url: str) -> str:
    timeout = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT)
    async with session.get(url, proxy=PROXY_URL, timeout=timeout) as response:
        response.raise_for_status()
        return await response.text()

async def main():
    urls = [
        "https://ip.decodo.com/json",
        "https://ip.decodo.com/json",
        "https://ip.decodo.com/json",
    ]
    async with aiohttp.ClientSession(headers=DEFAULT_HEADERS) as session:
        tasks = [fetch_with_proxy(session, u) for u in urls]
        results = await asyncio.gather(*tasks)
        for r in results:
            print(r)

asyncio.run(main())

Calling an IP-echo endpoint 3 times concurrently through a rotating gateway returns 3 different IPs, which confirms rotation is working.

3 patterns cover most production needs, and on Decodo they all work through the same gateway host and port; what changes is the username string. Decodo's username format is user-<your_username> with optional parameters appended as -key-value pairs:

Per-request rotation. The bare user-<your_username> form gets a new IP on every connection. This is the right choice for high-volume scraping where each URL is independent.
Sticky sessions. Appending -session-<id> keeps the same IP across requests carrying that identifier (for example, user-jane-session-abc123). Use it when a login flow has to be followed by data fetches on the same identity. An optional -sessionduration-<minutes> sets how long the IP is held.
Geo-targeted requests. Appending -country-<code> (for example, user-jane-country-de) routes the request through an IP in that country. City and US state are also available (-city-new_york with underscores for multi-word cities, -state-us_california with the us_ prefix), as is -zip- for US ZIP codes.

Combining the semaphore from the previous section with the proxy parameter shown above gives you controlled concurrency plus IP rotation in one pattern:

async def throttled_proxied_fetch(
    semaphore: asyncio.Semaphore,
    session: aiohttp.ClientSession,
    url: str,
) -> str:
    async with semaphore:
        async with session.get(url, proxy=PROXY_URL) as response:
            response.raise_for_status()
            return await response.text()

Same structure as throttled_fetch, with 1 extra keyword argument. The semaphore controls how many requests run at once; the proxy URL rotates the source IP per request. Together they make a scraper that's more likely to stay inside the target's tolerance and look like traffic from many clients rather than one. With Decodo's rotating proxies, rotation happens server-side inside the gateway, so the snippet above is the full integration.

One implementation detail worth knowing: rotation happens per TCP connection, not per request. In the async pattern above, every concurrent task opens its own TCP connection to the gateway, so each concurrent request gets a different IP and rotation appears per-request to the target. But if you reuse a single ClientSession to make sequential requests one after another, aiohttp's keep-alive will reuse the same TCP connection and the gateway will return the same IP across those sequential calls.

Verified live on Decodo's residential gateway: 20 concurrent requests through a shared session returned 20 distinct IPs, while 3 sequential requests through the same session returned the same IP 3 times. For the async-scraping pattern this guide builds, this is invisible – concurrent tasks open separate connections, so rotation behaves as expected. It's the kind of detail that causes problems for anyone adding the same proxy URL to a serial script later. (With HTTP/2, a single TCP connection carries many requests, so rotation behavior depends on the gateway's policy. aiohttp doesn't support HTTP/2, but if you switch to HTTPX or curl_cffi, test rotation behavior before assuming it matches the per-TCP model).

Async error handling, retries, and resilience

Every production scraper hits the same failures: timeouts, connection resets, 429 rate limits, 403 blocks, and the occasional 500 from an overloaded upstream. The goal is to let one failure fail one URL, not the whole run. Retry patterns in Python are covered more broadly in retry failed Python requests.

Wrap each fetch in try/except, catch the exceptions that async HTTP calls raise, and retry with an increasing delay before stopping. The delays follow an exponential pattern (1, 2, 4 seconds) plus a small random jitter, which spreads the retries in time so many failures don't retry at the same moment and overwhelm the target again.

# resilient_fetch.py
import asyncio
import logging
import random
import aiohttp

logger = logging.getLogger(__name__)

# Different statuses get different backoff curves. A 5xx is usually a
# transient server hiccup that clears in seconds; a 429 is the target
# explicitly asking you to slow down on a minutes timescale.
RETRYABLE_5XX = {500, 502, 503, 504}
RATE_LIMITED = {429}

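# Backoff schedules in seconds, indexed by attempt. No sleep follows the
# final attempt, so the default retries=3 uses only the first 2 entries;
# pass retries=4 to reach the last one.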
BACKOFF_5XX = [1, 2, 4]
BACKOFF_429 = [60, 300, 900]   # 1 min, 5 min, 15 min

def parse_retry_after(value: str | None) -> float | None:
    """RFC 7231: Retry-After is either delta-seconds or an HTTP-date.
    We honor the seconds form; HTTP-date form is rare and fine to skip."""
    if value and value.isdigit():
        return float(value)
    return None

async def fetch_with_retry(
    semaphore: asyncio.Semaphore,
    session: aiohttp.ClientSession,
    url: str,
    proxy: str | None = None,
    retries: int = 3,
) -> str | None:
    # split timeout: 30s total, 5s to connect, 10s between reads
    timeout = aiohttp.ClientTimeout(total=30, connect=5, sock_read=10)

    for attempt in range(retries):
        try:
            async with semaphore:
                async with session.get(url, proxy=proxy, timeout=timeout) as response:
                    if response.status == 200:
                        return await response.text()

                    if response.status in RETRYABLE_5XX:
                        if attempt < retries - 1:
                            delay = BACKOFF_5XX[min(attempt, len(BACKOFF_5XX) - 1)]
                            delay += random.uniform(0, 0.5)
                            logger.info("5xx %s for %s, retry in %.1fs",
                                        response.status, url, delay)
                            await asyncio.sleep(delay)
                            continue
                        logger.error("exhausted 5xx retries for %s", url)
                        return None

                    if response.status in RATE_LIMITED:
                        if attempt < retries - 1:
                            # Honor Retry-After if set, else use long backoff
                            retry_after = parse_retry_after(response.headers.get("Retry-After"))
                            delay = (
                                retry_after if retry_after is not None
                                else BACKOFF_429[min(attempt, len(BACKOFF_429) - 1)]
                            )
                            delay += random.uniform(0, 5)
                            logger.info("429 for %s, retry in %.0fs (server-hint=%s)",
                                        url, delay, retry_after is not None)
                            await asyncio.sleep(delay)
                            continue
                        logger.error("exhausted 429 retries for %s", url)
                        return None

                    # Genuinely non-retryable status (403, 404, etc.)
                    logger.warning("non-retryable %s for %s", response.status, url)
                    return None
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            if attempt == retries - 1:
                logger.error("exhausted retries on %s: %s", url, exc)
                return None
            delay = BACKOFF_5XX[min(attempt, len(BACKOFF_5XX) - 1)]
            delay += random.uniform(0, 0.5)
            logger.info("network error on %s, retry in %.1fs (%s)",
                        url, delay, type(exc).__name__)
            await asyncio.sleep(delay)

    return None

A few design choices in that function are deliberate:

Returning None on final failure rather than re-raising lets asyncio.gather() collect results without crashing the whole batch. The caller filters None out before exporting.
The semaphore wraps the request, not the whole retry loop. Network-error retries release the slot during the backoff sleep (the sleep is in the except clause, outside async with semaphore). HTTP-status retries hold it, since the 5xx/429 sleeps run inside the async with block – move them outside the semaphore if you need tighter concurrency control.
Only retryable statuses trigger retries. A 403 usually means the request was identified and banned; retrying the same URL with the same fingerprint doesn't change the outcome. A 429 or a 5xx is worth retrying because the condition is often temporary.
Timeouts count as errors, thanks to asyncio.TimeoutError being in the except tuple.

The 2 backoff schedules above are deliberately asymmetric (the code comment explains why). On the 429 path, parse_retry_after uses the server's Retry-After hint when set, since that tells you exactly how long to wait. In testing for this post, 5 concurrent requests to one target all returned 429; the rate limit cleared only after roughly 15 minutes. A retry window under a minute would not have been long enough, which is why the 2 schedules are split. Note that with the default retries=3 only the 60- and 300-second waits are used (about 6 minutes in total); pass retries=4 to include the 15-minute step.

On Python 3.11 and later, asyncio.timeout() is a context-manager alternative to aiohttp.ClientTimeout that cancels everything inside the async with block when the deadline is reached, not just the HTTP call. Use it when a single logical operation spans several awaits (fetch, parse a response header, then fetch a follow-up URL) and you want one deadline for all of them. For a plain single session.get(), aiohttp.ClientTimeout is equivalent and keeps the timeout next to the request it applies to.

aiohttp client middleware: retry and logging examples

The retry function above mixes 3 concerns in one place: HTTP, retry policy, and logging. aiohttp 3.12 added client middleware, which lets you split each concern into its own async function and apply it across every request on the session. For scrapers, it removes the per-call wrapper pattern.

A middleware is an async function that takes a ClientRequest and a handler, calls the handler to get a ClientResponse, and returns it. You register middleware on the session via the middlewares parameter:

# middleware_retry.py
import asyncio
import logging
import random
from collections.abc import Awaitable, Callable
import aiohttp
from aiohttp import ClientRequest, ClientResponse

# aiohttp 3.12 exports ClientMiddlewareType as the middleware function
# signature; the handler argument is the simpler type below.
Handler = Callable[[ClientRequest], Awaitable[ClientResponse]]

logger = logging.getLogger(__name__)
RETRYABLE = {429, 500, 502, 503, 504}

async def retry_middleware(
    req: ClientRequest, handler: Handler
) -> ClientResponse:
    for attempt in range(3):
        try:
            response = await handler(req)
            if response.status in RETRYABLE and attempt < 2:
                delay = 2 ** attempt + random.uniform(0, 0.5)
                logger.info("retry %d for %s in %.2fs", attempt + 1, req.url, delay)
                await asyncio.sleep(delay)
                continue
            return response
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            if attempt == 2:
                raise
            delay = 2 ** attempt + random.uniform(0, 0.5)
            logger.info("retry %d for %s in %.2fs (%s)", attempt + 1, req.url, delay, exc)
            await asyncio.sleep(delay)
    return response

async def logging_middleware(
    req: ClientRequest, handler: Handler
) -> ClientResponse:
    logger.info("-> %s %s", req.method, req.url)
    response = await handler(req)
    logger.info("<- %s %s", response.status, req.url)
    return response

async def main():
    async with aiohttp.ClientSession(
        middlewares=(logging_middleware, retry_middleware),
    ) as session:
        async with session.get("https://news.ycombinator.com/") as resp:
            print(resp.status, len(await resp.text()))

asyncio.run(main())

Two details matter. First, order matters – middleware runs in the order listed, so logging_middleware processes the request before retry_middleware tries to retry it, and the response goes back through the chain in reverse order. Second, if a middleware calls session.get() itself, it must pass middlewares=() on the inner call, or the middleware chain re-enters itself and creates infinite recursion.

The retry function from earlier in this section is still useful when you want per-call control or need to support pre-3.12 aiohttp – which is why the production scraper.py below uses it. Middleware is the cleaner choice for new code on aiohttp 3.12+ when the same retry/logging policy applies to every request in the session.

For failures that still fail after 3 retries, a common pattern is a second pass. Collect the URLs that returned None, wait a minute, and run them again with lower concurrency. Most transient issues resolve on the second pass; the ones that still fail probably never will. Python's standard logging module is enough to track this; configure it once at the top of your main script:

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

How to debug a stuck asyncio scraper (python -m asyncio ps in 3.14)

Sometimes a scraper doesn't crash with errors; it just stops making progress. Half the tasks are stuck in some unknown await, the rest are waiting on a semaphore, and the logs stop showing new entries. Python 3.14 added a built-in CLI for this case that attaches to a running Python process and prints every asyncio task, what coroutine it's currently in, and which task is waiting on which.

# flat table of all running asyncio tasks in process 12345
python -m asyncio ps 12345

# tree view showing await relationships and any cycles
python -m asyncio pstree 12345

For programmatic use, asyncio.capture_call_graph() and asyncio.print_call_graph() give the same data inside the running process. A common pattern on Linux and macOS is to connect them to a SIGUSR1 signal handler so you can dump the task tree on demand without stopping the scraper. The introspection CLI is Python 3.14 only, so it doesn't help on older versions; for a new scraper, choosing 3.14 includes this debugger by default.
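A minimal sketch of the signal-handler pattern, under a few assumptions: dump_tasks and install_dump_handler are hypothetical helper names, the output goes to stderr, and on Python versions before 3.14 the sketch falls back to the older Task.print_stack() so the same handler still prints something useful.

```python
import asyncio
import signal
import sys

def dump_tasks(file=None) -> None:
    """Write the await stack of every task on the running loop to `file`."""
    file = file or sys.stderr
    # print_call_graph exists on Python 3.14+; older versions use Task.print_stack().
    print_graph = getattr(asyncio, "print_call_graph", None)
    for task in asyncio.all_tasks():
        print(f"--- {task.get_name()} ---", file=file)
        if print_graph is not None:
            print_graph(task, file=file)
        else:
            task.print_stack(file=file)

def install_dump_handler() -> None:
    """Call from inside a running loop; `kill -USR1 <pid>` then dumps the tasks."""
    loop = asyncio.get_running_loop()
    # add_signal_handler is Unix-only, which matches the Linux/macOS note above.
    loop.add_signal_handler(signal.SIGUSR1, dump_tasks)
```

Call install_dump_handler() once at the top of your main coroutine. Naming tasks with asyncio.create_task(coro, name=...) makes the dump much easier to read.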

Exporting scraped data to JSON and CSV

Once the scraper returns a list of dictionaries, 2 exports cover almost every downstream consumer: JSON for pipelines and APIs, CSV for spreadsheets and quick inspection. Python's standard library does both without extra dependencies. More storage options (databases, cloud buckets, parquet) are covered in how to save your scraped data.

Include enough metadata in the output to answer basic questions months later without re-running the scraper: When did this run? How many URLs did it attempt? How many succeeded? The export functions below put the metadata in the JSON wrapper and add a timestamp to every filename so runs don't overwrite each other:

# export.py
import csv
import json
from datetime import datetime, timezone
from pathlib import Path

OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)

def export_json(results: list[dict], total_attempted: int, prefix: str = "hn") -> Path:
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = OUTPUT_DIR / f"{prefix}_{timestamp}.json"
    payload = {
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "total_attempted": total_attempted,
        "total_succeeded": len(results),
        "total_failed": total_attempted - len(results),
        "results": results,
    }
    path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return path

def export_csv(results: list[dict], prefix: str = "hn") -> Path:
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = OUTPUT_DIR / f"{prefix}_{timestamp}.csv"
    if not results:
        path.write_text("", encoding="utf-8")
        return path
    fieldnames = list(results[0].keys())
    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)
    return path

Three details in these exporters matter in production. ensure_ascii=False on json.dumps keeps non-ASCII characters (accents, CJK, emoji) as readable text instead of \uXXXX escape sequences, which is easier to read and produces smaller files. newline="" on the CSV open call is required on Windows to prevent blank rows between records. Timestamps use UTC via datetime.now(timezone.utc) so runs from different machines or time zones stay comparable.
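To see what the ensure_ascii setting changes, here is a small standalone comparison. The sample record is made up for illustration.

```python
import json

record = {"title": "Café 東京"}  # illustrative sample, not scraped data

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)

print(escaped)   # {"title": "Caf\u00e9 \u6771\u4eac"}
print(readable)  # {"title": "Café 東京"}

# Both forms decode to the same data; only the bytes on disk differ.
assert json.loads(escaped) == json.loads(readable) == record
print(len(escaped.encode("utf-8")), len(readable.encode("utf-8")))
```

The readable form is only safe to write if the file is opened as UTF-8, which is why the exporters pass encoding="utf-8" explicitly.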

The full scraper combines the exporters with the parser, retry function, and proxy config from earlier sections:

# scraper.py
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from config import DEFAULT_HEADERS, PROXY_URL, CONCURRENCY
from parser import parse_story
from resilient_fetch import fetch_with_retry
from export import export_json, export_csv

BASE = "https://news.ycombinator.com/"

async def collect_story_urls(session: aiohttp.ClientSession) -> list[str]:
    async with session.get(BASE) as response:
        response.raise_for_status()
        html = await response.text()
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for row in soup.select("tr.athing"):
        story_id = row.get("id")
        if story_id:
            urls.append(f"{BASE}item?id={story_id}")
    return urls

async def run():
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession(headers=DEFAULT_HEADERS) as session:
        story_urls = await collect_story_urls(session)
        tasks = [
            fetch_with_retry(semaphore, session, url, proxy=PROXY_URL)
            for url in story_urls
        ]
        pages = await asyncio.gather(*tasks)

    results = [parse_story(p) for p in pages if p is not None]
    json_path = export_json(results, total_attempted=len(story_urls))
    csv_path = export_csv(results)
    print(f"json: {json_path}")
    print(f"csv:  {csv_path}")

if __name__ == "__main__":
    asyncio.run(run())

Common asyncio and aiohttp scraping mistakes (and how to fix them)

Most async scraping problems don't crash – they degrade. The scraper still runs and produces output. The bug only appears when you check the data and find it wrong. The list below covers the most common failure modes once your scraper grows beyond a small example.

Forgetting await. A bare fetch(session, url) call returns a coroutine object, not a result. The bug appears as a list of strings like <coroutine object fetch at 0x…> instead of HTML, or as a RuntimeWarning: coroutine was never awaited in the logs. Type checkers in strict mode catch most cases; enabling strict mode for your scraper module is worth the one-time annotation cost.
Mixing Requests into async code. A single sync requests.get() inside an async function blocks the event loop for the entire duration of that call, blocking every other coroutine in the program. If you're inside async def, all HTTP must go through aiohttp (or any other async client). The same warning applies to file I/O, database drivers, and any third-party library that doesn't expose an async API.
One ClientSession per request. Wrapping every fetch in its own async with aiohttp.ClientSession() destroys the connection pool. Every request opens a fresh TCP connection and TLS handshake, so you lose most of the keep-alive savings and performance drops closer to sync speed. Use one session per scraper run.
Creating 100,000 coroutines at once. tasks = [fetch(s, u) for u in urls] creates every coroutine in memory before gather schedules a single one. For URL sets in the hundreds of thousands, this can exhaust memory before any work starts. Chunk the work into batches of a few hundred to a few thousand URLs:


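async def scrape_chunked(session, sem, urls, chunk_size=500):
    # session and sem are the shared ClientSession and Semaphore from the main scraper.
    results = []
    for i in range(0, len(urls), chunk_size):
        chunk = urls[i:i + chunk_size]
        # Only chunk_size coroutines exist at a time instead of len(urls).
        tasks = [fetch_with_retry(sem, session, u) for u in chunk]
        results.extend(await asyncio.gather(*tasks))
    return results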

For unbounded URL sets (recursive crawls, streaming URL discovery, scrapers that run continuously), the standard async pattern is asyncio.Queue with a worker pool, covered in its own section below.

One semaphore across multiple domains. Semaphore(20) applied to a mixed-domain crawl lets all 20 slots target one slow site, even when the other domains have spare capacity. For multi-domain scrapes, give each domain its own semaphore (a dict[str, asyncio.Semaphore] keyed by hostname) so a single slow target can't take all the slots and block the others.
No checkpoint on long runs. A scraper that fails at URL 7,432 of 10,000 should resume from where it failed, not restart from the beginning. The simplest pattern: append each completed URL to a flat file as it finishes, and on startup load that file into a set and skip anything already present. SQLite or Redis work too; what matters is keeping the state outside the process.
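Truncated responses raised as aiohttp.ClientPayloadError. This is raised when a connection drops mid-response, which is common over unreliable proxies. It is a subclass of aiohttp.ClientError, so a retry block that catches aiohttp.ClientError and asyncio.TimeoutError already covers it, provided the body read (await response.text()) sits inside the same try block as the request. Don't let it stop the batch.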
DNS cache surprises. aiohttp caches DNS resolutions for 10 seconds by default via the TCPConnector. For very long runs against rotating-IP infrastructure, raise it with TCPConnector(ttl_dns_cache=300) so you're not re-resolving every 10 seconds for the duration of the run.

Scaling beyond a fixed URL list: asyncio.Queue and worker pools

Everything so far assumes the URL list is known before the run starts. Most production scrapers don't work that way: a crawl discovers new URLs during the run (pagination, internal links, recursive comment threads), and the list grows while workers are still consuming from it. asyncio.gather doesn't fit this pattern because it expects all tasks to be defined before it starts. The async-native pattern for an unbounded or growing URL set is a queue and a pool of worker tasks:

# crawler_queue.py
import asyncio
import aiohttp
from bs4 import BeautifulSoup

CONCURRENCY = 10

async def worker(name: str, queue: asyncio.Queue, session: aiohttp.ClientSession,
                 results: list, seen: set):
    # Bounded memory: only CONCURRENCY workers exist regardless of URL count.
    while True:
        url = await queue.get()
        try:
            # Dedupe at dequeue time. The check-and-add below is safe only
            # because asyncio is single-threaded and no `await` runs between
            # the `in` check and `seen.add()` - don't insert one. For multi-
            # process scaling, replace `seen` with a Redis SET or similar.
            if url in seen:
                continue
            seen.add(url)
            async with session.get(url) as response:
                if response.status != 200:
                    continue
                html = await response.text()
                results.append({"url": url, "size": len(html)})
                # Discover and enqueue new URLs from this page (pagination, etc.)
                for next_url in extract_links(html):
                    if next_url not in seen:
                        await queue.put(next_url)
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            # Production: route to a dead-letter list instead of swallowing.
            # E.g., maintain a `failed: list[tuple[str, Exception]]` and append
            # `(url, exc)` here for retry on a second pass.
            pass
        finally:
            # task_done() is required for queue.join() to know we're complete.
            queue.task_done()

def extract_links(html: str) -> list[str]:
    """Discover the next-page link on Hacker News (the 'More' link at the bottom)."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        f"https://news.ycombinator.com/{a['href']}"
        for a in soup.select("a.morelink")
        if a.get("href")
    ]

async def crawl(seed_urls: list[str]) -> list[dict]:
    queue: asyncio.Queue = asyncio.Queue()
    seen: set[str] = set()
    results: list = []

    for u in seed_urls:
        await queue.put(u)

    async with aiohttp.ClientSession() as session:
        workers = [
            asyncio.create_task(worker(f"w{i}", queue, session, results, seen))
            for i in range(CONCURRENCY)
        ]
        await queue.join()              # wait until queue is fully drained
        for w in workers:
            w.cancel()                  # workers loop forever; cancel on done
        await asyncio.gather(*workers, return_exceptions=True)

    return results

The pattern has three useful properties for crawls:

Bounded memory. The queue holds pending URLs, the seen set holds discovered URLs, and only CONCURRENCY coroutines exist at any moment. A 1-million-URL crawl uses roughly the same memory as a 1,000-URL crawl, plus the size of the seen set.
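Backpressure available. The example above uses an unbounded queue; creating it as asyncio.Queue(maxsize=N) makes put() wait when the queue is full, which prevents URL discovery from going faster than fetching. One caveat: in this design the workers are also the producers, so with a small maxsize every worker can end up waiting in put() with nobody left to call get(). Size the queue generously, or move discovery into a separate producer task.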
Graceful shutdown. queue.join() waits until every queued item has been marked complete with task_done(), which is how you distinguish "queue is empty right now" (more might arrive) from "queue is empty and every prior item has been processed" (truly done).
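The backpressure and shutdown behavior can be observed without any network calls. The sketch below is a toy setup with a separate producer and one slow consumer; the names, the maxsize of 3, and the 10 ms sleep are arbitrary choices for illustration.

```python
import asyncio

async def producer(queue: asyncio.Queue, n_items: int, depths: list[int]) -> None:
    for i in range(n_items):
        await queue.put(i)            # waits here whenever the queue is full
        depths.append(queue.qsize())  # record how deep the queue got

async def consumer(queue: asyncio.Queue, processed: list[int]) -> None:
    while True:
        item = await queue.get()
        try:
            await asyncio.sleep(0.01)  # stand-in for a slow fetch
            processed.append(item)
        finally:
            queue.task_done()

async def demo(n_items: int = 20, maxsize: int = 3) -> tuple[list[int], int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    depths: list[int] = []
    processed: list[int] = []
    worker = asyncio.create_task(consumer(queue, processed))
    await producer(queue, n_items, depths)
    await queue.join()   # returns only after every item was marked task_done()
    worker.cancel()      # the consumer loops forever, so cancel it explicitly
    await asyncio.gather(worker, return_exceptions=True)
    return processed, max(depths)
```

Running asyncio.run(demo()) returns every item in order, and the recorded queue depth never exceeds maxsize, because the producer is paused in put() while the consumer catches up.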

The retry, proxy, and middleware patterns from earlier sections work with this pattern unchanged: replace session.get(url) with fetch_with_retry(sem, session, url, proxy=PROXY_URL), add the same raise_for_status=True on the session, register the same middleware. The queue pattern doesn't replace any building blocks; it replaces gather as the orchestrator when the URL list is not known before the run starts.

For the simpler case of a known URL list, gather with chunking (shown in the common mistakes section above) is enough. The queue pattern is only worth its overhead when your work is unbounded or streaming. When you grow beyond a single process – multi-million URLs/day across multiple machines – replace the in-memory pieces with durable ones:

asyncio.Queue → Redis or SQS.
The seen set → a Redis SET, or a Bloom filter (a probabilistic membership structure that uses a fraction of the memory at the cost of rare false positives).
Single-process retry → an idempotent dead-letter queue (a separate queue holding failed messages for inspection or out-of-band retry; idempotent so the same task can run twice without corrupting state).

asyncio still drives each worker process. Only the queue and the seen-set become external.
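To make the Bloom filter trade-off concrete, here is a minimal pure-Python sketch. It is an illustration of the data structure only: the class name, the bit-array size, and the double-hashing scheme based on SHA-256 are assumptions chosen for brevity, and a production crawl would more likely use a Redis-backed or library implementation.

```python
import hashlib

class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, rare false positives."""
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )

seen = BloomFilter()
seen.add("https://news.ycombinator.com/item?id=1")
print("https://news.ycombinator.com/item?id=1" in seen)  # True
```

The memory use is fixed at about 128 KB here no matter how many URLs are added. For a crawler, a false positive means a URL that was never fetched gets skipped, so size the filter for the false-positive rate you can tolerate.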

Streaming large response bodies with iter_chunked

When a response body is large (a multi-megabyte JSON dump, a sitemap, a file download, a server-sent-event stream), reading it with await response.text() loads the entire body into memory before you can process it. aiohttp's response.content is a StreamReader, and its iter_chunked() method yields chunks as they arrive, so you can process or write each chunk without holding the full body:

async with session.get(big_url) as response:
    response.raise_for_status()
    with open("large_payload.bin", "wb") as f:
        async for chunk in response.content.iter_chunked(64 * 1024):
            f.write(chunk)

For large downloads, this turns a 500 MB memory peak per task into a steady 64 KB working set. It works with everything else (semaphore, proxies, retries) because the streaming happens inside the same async with session.get(…) block.

When aiohttp gets blocked: TLS fingerprint impersonation with curl_cffi

Before using a headless browser, try TLS impersonation first. A growing share of blocks in 2026 is not about IP reputation or rate limits – it's TLS fingerprinting.

Every HTTPS connection starts with a TLS handshake. The first message your client sends is called the ClientHello. It lists which encryption ciphers the client supports, which TLS extensions it understands, and which order it prefers. Each Python HTTP library produces a different ClientHello from a real Chrome or Firefox browser (different extension order, different cipher list), and those differences are stable enough to fingerprint. JA3 and the newer JA4 are short hashes computed from the ClientHello details. Cloudflare, Akamai, DataDome, and others use them to detect and block traffic from known automation tools. aiohttp, Requests, and httpx all produce JA3/JA4 hashes that anti-bot systems flag.

The same fingerprinting also extends to HTTP/2. Anti-bot vendors fingerprint the HTTP/2 SETTINGS frame and the order of HEADERS / WINDOW_UPDATE frames – called the Akamai HTTP/2 fingerprint. A client that sends a real Chrome's TLS ClientHello but Python's HTTP/2 frame order still gets blocked. curl_cffi's impersonation profiles match both the TLS and HTTP/2 layers, which is why the profile needs to track real browser builds rather than a static cipher list.

The fix is a client whose ClientHello matches a real browser's exactly. curl_cffi is Python bindings for curl-impersonate, a fork of curl patched to send the exact ClientHello and HTTP/2 settings of a real Chrome, Firefox, Safari, or Edge build. To the target's TLS and HTTP/2 fingerprint checks, the connection matches a browser. Behavioral signals, JavaScript challenges, and IP reputation are separate detection layers and can still flag you even with a perfect fingerprint match. curl_cffi has async support, proxy support, and an API close to Requests, so porting is straightforward.

uv pip install curl_cffi

Here’s the code:

# tls_impersonation.py
import asyncio
from curl_cffi.requests import AsyncSession

PROXY = "http://YOUR_PROXY_USERNAME:YOUR_PROXY_PASSWORD@gate.decodo.com:7000"

async def fetch(session: AsyncSession, url: str) -> str:
    response = await session.get(url, proxy=PROXY, timeout=30)
    response.raise_for_status()
    return response.text

async def main():
    urls = ["https://target-with-cloudflare.example/listing?page=1"]
    async with AsyncSession(impersonate="chrome") as session:
        tasks = [fetch(session, u) for u in urls]
        pages = await asyncio.gather(*tasks)
        for p in pages:
            print(len(p), "bytes")

asyncio.run(main())

The generic impersonate="chrome" profile auto-tracks the latest Chrome fingerprint when curl_cffi updates. This is usually what you want for scraper code that should keep working over time. The catch: auto-tracking only happens when you update curl_cffi, so a pinned lockfile keeps the fingerprint at the install-time version and goes out of date in a few months.

If you need a pinned profile, recent curl_cffi releases include HTTP/3 fingerprints for newer Chrome and Firefox builds – check the changelog for available profile names. The rest of your pipeline (semaphore, retries, parsing, export) stays the same. curl_cffi replaces aiohttp only for requests that need the impersonation.

A routing pattern that works in production: start with aiohttp for speed, catch 403 or Cloudflare challenge pages, retry those URLs through curl_cffi, and only use a headless browser if both fail. This three-tier order uses heavier tools only when cheaper ones are blocked.
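The three-tier order can be sketched as a small router. Everything here is an illustrative assumption: fetch_with_escalation and looks_blocked are hypothetical names, each tier is a caller-supplied coroutine that returns (status, body), and the challenge markers are examples to replace with whatever your targets actually return.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Each tier is (name, fetch). The real aiohttp, curl_cffi, and browser calls are
# assumed to live behind these callables; fetch(url) returns (status, body).
Fetch = Callable[[str], Awaitable[tuple[int, str]]]

# Illustrative markers only; tune them to the challenge pages you actually see.
CHALLENGE_MARKERS = ("Just a moment...",)

def looks_blocked(status: int, body: str) -> bool:
    return status == 403 or any(marker in body for marker in CHALLENGE_MARKERS)

async def fetch_with_escalation(
    url: str, tiers: list[tuple[str, Fetch]]
) -> tuple[Optional[str], Optional[str]]:
    """Try each tier in order; return (tier_name, body) for the first unblocked response."""
    for name, fetch in tiers:
        try:
            status, body = await fetch(url)
        except Exception:
            continue  # transport failure: treat it like a block and escalate
        if not looks_blocked(status, body):
            return name, body
    return None, None  # every tier was blocked or failed
```

Pass the tiers cheapest first, for example [("aiohttp", ...), ("curl_cffi", ...), ("browser", ...)]. Logging which tier succeeded for each URL shows how much of the crawl really needs the heavier tools.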

Decodo's products follow the same order: datacenter for cooperative targets, residential when IP reputation matters, Site Unblocker when fingerprint and CAPTCHA become the main blocker, Web Scraping API when you'd rather not maintain the stack yourself. Anti-detection beyond curl_cffi (header randomization, fingerprint rotation, behavioral patterns) is covered in the anti-scraping techniques guide.

Scrapling: a higher-level framework

If assembling aiohttp + curl_cffi + retry middleware yourself is more work than it's worth, Scrapling is an actively maintained Python framework that bundles all of these under one async API. It provides AsyncStealthySession for browser-based stealth, AsyncDynamicSession for full browser automation, and FetcherSession for HTTP requests with browser impersonation.

Scrapling also includes adaptive element tracking, which can keep a scraper running through small selector changes (renamed classes, reordered siblings) by re-locating elements through similarity heuristics rather than failing silently. Major redesigns still break it, and a near-match re-location can return wrong data quietly – check the output after each Scrapling release.

Install Scrapling with the fetchers extra to include the HTTP and browser fetchers:

uv pip install "scrapling[fetchers]"
scrapling install            # downloads browser binaries used by the stealth fetchers

Here’s the code:

import asyncio
from scrapling.fetchers import FetcherSession

async def scrape():
    async with FetcherSession(impersonate="chrome") as session:
        page = await session.get("https://news.ycombinator.com/")
        titles = page.css("span.titleline > a::text")
        return titles

async def main():
    titles = await scrape()
    for t in titles[:5]:
        print(t)

asyncio.run(main())

Scrapling requires Python 3.10 or later. The fetchers extra is heavier than aiohttp + curl_cffi alone – even if you only use FetcherSession, it installs playwright, browserforge (a header-and-fingerprint generator), and camoufox (a stealth-patched Firefox build) for the browser-based fetchers. The trade-off: you have less control over the underlying HTTP behavior, but you get smaller code and the adaptive selector layer.

Two situations make Scrapling the better choice over an aiohttp stack you build yourself: a one-off scrape where you don't need to maintain the code, and a target whose markup changes often enough that brittle selectors are a real maintenance cost. For everything else, a hand-built aiohttp stack is the better default because you control every layer and can replace parts (selectolax for parsing, curl_cffi for impersonation, middleware for retries) without rewriting the orchestration.

When to go beyond HTTP: headless browsers and scraping APIs

aiohttp and curl_cffi cover static HTML – the long tail of listings, articles, and archival content that makes up most of the web. They don't cover pages that only finish rendering after JavaScript runs. The guide on headless browsers explains how browser-based scraping differs from HTTP-only scraping, and the Playwright tutorial goes through the browser approach end to end.

Headless browsers: Playwright and Patchright

To check if a target needs a browser, disable JavaScript in your browser and load the page. If the content you need is missing, aiohttp will get the same empty page. Single-page applications, infinite-scroll pages, and most modern dashboards work this way. The usual tool for this is Playwright, a browser-automation library maintained by Microsoft with a native async API, so it runs on the same event loop and uses the same concurrency patterns as aiohttp. Browsers are heavy, though: memory per task grows from kilobytes to hundreds of megabytes, so practical concurrency drops from dozens of requests to a few pages, and throughput drops with it.

For a crawl that's mostly static with a few dynamic pages, a hybrid pipeline works well: aiohttp handles the URLs that return usable HTML, and only the JavaScript-rendered ones go through Playwright. The async API on both sides makes that routing easy to write.

Playwright has one detail worth knowing in 2026. Anti-bot vendors (Cloudflare, DataDome, and others) detect automated browsers by checking for a specific Chrome DevTools Protocol command called Runtime.Enable, which Playwright and Puppeteer send by default on every frame. (CDP is the wire protocol Playwright uses to drive Chromium-based browsers.) The command itself is harmless, but its presence almost always means the browser is automated, and detection vendors use it as a signal.

Two community projects fix this. Patchright is a drop-in replacement for playwright-python that skips the Runtime.Enable call. It evaluates Playwright's own scripts inside isolated execution contexts (the page's scripts and Playwright's scripts run separately, so neither can read the other) and patches the Console API. The Runtime.Enable signal stops firing. rebrowser-patches is a separate set of source patches for both Playwright and Puppeteer with the same goal.

If you use stock Playwright against a target with strict anti-bot defenses, replacing it with Patchright is usually the smallest change that improves success rates. It doesn't work on every target, but it removes the most common detection signal.

Managed scraping APIs

When maintaining anti-bot logic, rendering infrastructure, and proxy rotation becomes more work than the scraper itself, a managed service is the practical option. Decodo's Web Scraping API handles proxy rotation, browser rendering, and common CAPTCHA challenges behind a single endpoint. It also includes optional structured-data parsing and Markdown output for LLM pipelines. Success rates depend on the target and the CAPTCHA type, so test on a small sample before scaling.

The request is a POST to https://scraper-api.decodo.com/v2/scrape with a JSON body like {"url": "…", "headless": "html"} (use headless="html" when the page needs JavaScript to render). The response is {"results": [{"content": "…", "status_code": 200, "headers": {…}, "cookies": […]}]}.
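As a rough sketch of the request and response shapes just described, the helpers below build the JSON body and unwrap the first result. The function names are hypothetical, and authentication is deliberately left out because this section doesn't describe it; check the API documentation for the credentials the POST needs.

```python
import json

API_URL = "https://scraper-api.decodo.com/v2/scrape"  # endpoint quoted above

def build_body(url: str, render_js: bool = False) -> str:
    """JSON body for the POST; adds "headless": "html" only when JS rendering is needed."""
    body = {"url": url}
    if render_js:
        body["headless"] = "html"
    return json.dumps(body)

def first_content(response_json: dict) -> str:
    """Pull the page body out of the {"results": [{...}]} wrapper."""
    results = response_json.get("results") or []
    if not results:
        raise ValueError("response contained no results")
    result = results[0]
    if result.get("status_code") != 200:
        raise ValueError(f"target returned status {result.get('status_code')}")
    return result["content"]
```

With aiohttp, the body would go out through session.post(API_URL, data=build_body(...)) with a JSON content-type header, and first_content(await response.json()) would return the HTML.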

Site Unblocker works differently: it's a proxy endpoint, so your aiohttp code only changes the proxy URL. On the backend it runs a real browser to render JavaScript, sends a Chrome-class TLS/HTTP fingerprint, and routes common CAPTCHA challenges through a solver pool. Use it for targets where stealth alone isn't enough. Success rates still depend on the target, and rendered or solved requests cost more than plain proxy traffic. To use it, change only the proxy URL:

PROXY = "http://YOUR_PROXY_USERNAME:YOUR_PROXY_PASSWORD@unblock.decodo.com:60000"  # was gate.decodo.com:7000
async with aiohttp.ClientSession() as session:
    async with session.get(url, proxy=PROXY) as response:
        html = await response.text()

The aiohttp code is the same as elsewhere in this guide. The unblocker runs the JS, sends the fingerprint, and rotates the IP behind the endpoint.

These aren't replacements for aiohttp. They're the next option when most of your time goes to anti-bot work instead of scraping.

Bottom line

For most async scrapers, the stack here works well: aiohttp with Decodo's rotating residential proxies handles most of the work. When TLS or CAPTCHA blocks appear, switch to curl_cffi or use Decodo's Site Unblocker, which handles JavaScript rendering and CAPTCHAs through a single proxy URL change. For larger projects where maintenance becomes more work than the data is worth, Decodo's Web Scraping API removes the proxy and anti-bot setup entirely. Pick the option that fits the target and the time you can spend on infrastructure.

About the author

Justinas Tamasevicius

Director of Engineering

Justinas Tamaševičius is Director of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Is aiohttp faster than requests for web scraping?

Yes for batches of URLs, no for a single page – aiohttp keeps dozens of requests in flight on one thread while Requests processes them one at a time, so a multi-URL scrape finishes much faster. For one HTTP call, the per-request time is roughly the same either way.

What's the difference between asyncio and threading for web scraping?

Both handle I/O-bound work like web scraping, but asyncio runs thousands of coroutines on a single thread by pausing each one when it hits an I/O wait, while threading spawns multiple OS threads. For scraping, asyncio is lighter: less memory per concurrent request, no GIL contention on Python-side work, and explicit await points instead of preemptive context switches. Threading still wins when you can't rewrite a sync library into async form. Note that free-threaded CPython (the 3.13t and 3.14t builds from PEP 703) removes the GIL contention argument for threading; for I/O-bound scraping asyncio still wins on memory per task, but the gap narrows.
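When a sync-only library can't be rewritten, the two models can be combined rather than chosen between. The sketch below assumes a hypothetical blocking function, blocking_fetch, standing in for such a library, and uses asyncio.to_thread (Python 3.9+) to run it in the default thread pool while the event loop stays free:

```python
import asyncio
import time


def blocking_fetch(url: str) -> str:
    """Hypothetical stand-in for a call into a sync-only library."""
    time.sleep(0.1)  # simulates blocking network I/O
    return f"body of {url}"


async def fetch_all(urls: list[str]) -> list[str]:
    # Each blocking call runs in a worker thread; gather preserves input order.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_fetch, url) for url in urls)
    )


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    start = time.perf_counter()
    bodies = asyncio.run(fetch_all(urls))
    print(len(bodies), f"{time.perf_counter() - start:.2f}s")
```

The calls overlap because each one waits in its own thread, so the batch takes roughly as long as one call rather than five, as long as the thread pool has enough workers.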

Should I use asyncio.gather or TaskGroup for async scraping?

Use asyncio.TaskGroup() on Python 3.11+ as the safer default; asyncio.gather() still fits simple batch waits where partial failures are OK or when you're stuck on 3.10. TaskGroup gives you structured concurrency – if any task raises, every sibling is canceled and the async with block raises an ExceptionGroup you catch with except*. Most scrapers don't strictly need that, but it's the right default for new code.

Can I use asyncio with Scrapy or Playwright instead of aiohttp?

Yes for Playwright (native async API that pairs cleanly with aiohttp for hybrid pipelines); partially for Scrapy (its engine still runs on Twisted, but recent versions ship a reactor bridge – AsyncioSelectorReactor – that lets you await coroutines inside Scrapy callbacks). You can't simply swap aiohttp into Scrapy's engine. Scrapy's HttpProxyMiddleware reads meta['proxy'] per request; that's the integration point if you want proxies in a Scrapy pipeline.

Do I need proxies for async web scraping?

Almost always, if you're scraping at any real volume. Async scraping sends many requests in a short window from the same IP, which trips rate limits and IP bans far faster than sequential scraping does. Rotating residential proxies spread traffic across many IPs so no single IP exceeds the target's per-IP rate limit. Targets can still detect aggregate scraping via fingerprint correlation, ASN clustering, and behavioral patterns – proxies are necessary, not sufficient.

How many concurrent requests should I use for async scraping?

Start at 10 to 20 and tune from there based on how the target responds. The ceiling is set by the target's tolerance, your proxy pool size, and per-request latency. Public targets without a strict rate limiter usually accept 20 to 50 from a single IP, and a rotating residential proxy pool lets you push much higher because each request appears to come from a different client. Watch for 429 responses and rising error rates as the signal that you've gone too high.
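A quick way to confirm a concurrency cap is actually enforced is to count in-flight requests. The sketch below replaces the HTTP call with asyncio.sleep, and the stats dictionary and MAX_CONCURRENCY value are illustrative assumptions rather than part of any library:

```python
import asyncio

MAX_CONCURRENCY = 10  # starting point; tune based on how the target responds


async def fetch(url, sem, stats):
    async with sem:  # at most MAX_CONCURRENCY coroutines are inside this block
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        stats["in_flight"] -= 1
        return url


async def main(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(*(fetch(u, sem, stats) for u in urls))
    return results, stats["peak"]


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(50)]
    results, peak = asyncio.run(main(urls))
    print(f"{len(results)} results, peak concurrency {peak}")
```

Raising MAX_CONCURRENCY in small steps while watching the 429 count and error rate gives you the tuning loop described above.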

Why is my async scraper getting blocked even with proxies?

Beyond per-IP rate limiting, three detection layers usually cause it:

TLS fingerprinting – anti-bot vendors recognize the ClientHello that aiohttp, Requests, and httpx send and block on its JA3/JA4 hash. curl_cffi addresses this when the block is genuinely at the TLS layer; it is not a fix for JS-challenge or behavioral blocks.
Header fingerprinting – missing Sec-Ch-Ua and Sec-Fetch-* headers signal automation. Send a realistic Chrome header set.
Behavioral signals – fixed-interval batches and identical request shapes look like bots. Randomize jitter and vary request shapes (header order, occasional referer changes).
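For the behavioral part, randomized jitter can be as small as the sketch below. The helper names jittered_delay and polite_fetch, and the default base and spread values, are assumptions for illustration; the return value stands in for a real aiohttp call:

```python
import asyncio
import random


def jittered_delay(base: float = 1.0, spread: float = 0.5) -> float:
    """Pick a delay uniformly from [base - spread, base + spread], never below zero."""
    return max(0.0, random.uniform(base - spread, base + spread))


async def polite_fetch(url: str, base: float = 1.0, spread: float = 0.5) -> str:
    # A different pause before each request breaks up fixed-interval batches.
    await asyncio.sleep(jittered_delay(base, spread))
    return f"fetched {url}"  # stand-in for the real HTTP request
```

Jitter only addresses timing patterns. It does nothing for TLS or header fingerprints, so treat it as one layer among the three.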

How do I handle 429 rate limits in aiohttp?

Catch the 429 status, honor the Retry-After response header if the server sets one, and fall back to a long backoff (60s/300s/900s) if not. Don't reuse the 1s/2s/4s exponential pattern that suits 5xx errors – 429s usually take minutes to clear, not seconds. A split-backoff retry function with separate BACKOFF_5XX and BACKOFF_429 schedules implements this cleanly.
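The delay decision can be isolated in a small pure function, which keeps it easy to test. This is a sketch under stated assumptions: retry_delay is a hypothetical helper name, the schedules mirror the values mentioned above, and only the numeric (seconds) form of Retry-After is handled, so an HTTP-date value falls back to the schedule:

```python
BACKOFF_5XX = [1, 2, 4]        # seconds; short schedule for server errors
BACKOFF_429 = [60, 300, 900]   # seconds; long schedule for rate limits


def retry_delay(status, attempt, retry_after=None):
    """Return seconds to wait before retry `attempt` (0-based), or None to give up."""
    if status == 429:
        schedule = BACKOFF_429
    elif 500 <= status < 600:
        schedule = BACKOFF_5XX
    else:
        return None  # not a retryable status
    if attempt >= len(schedule):
        return None  # retries exhausted
    # Honor Retry-After on 429 when it is a plain number of seconds.
    if status == 429 and retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    return float(schedule[attempt])
```

In an aiohttp retry loop, you would pass response.status and response.headers.get("Retry-After") to this function, then await asyncio.sleep(delay) when it returns a number and stop retrying when it returns None.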
