Mastering Scrapy for Scalable Python Web Scraping: A Practical Guide

Scrapy is a powerful web scraping framework for Python. Its asynchronous architecture makes it faster than sequential scrapers built with Requests or Beautiful Soup, and it includes everything needed for production-ready scraping: spiders, items, pipelines, throttling, retries, data export, and middleware. In this guide, you'll learn how to set up Scrapy, build and customize spiders, handle pagination, structure and store data, extend Scrapy with middlewares and proxies, and apply best practices for scraping at scale.

Installing Scrapy and setting up your first project

Prerequisites

Before installing Scrapy, make sure you have Python 3.7 or higher on your computer. You can check your current version by running the following command in the terminal:

python --version

If you need to install or upgrade Python, get the latest version from the official Python website. And if you’re new to running Python code from the terminal, our guide explains the basics.

Creating a virtual environment

Isolate your Scrapy project in a virtual environment to keep dependencies tidy and avoid conflicts with other Python projects:

python -m venv scrapy-env
# Activate it on macOS/Linux:
source scrapy-env/bin/activate
# Activate it on Windows:
scrapy-env\Scripts\activate

Once activated, your terminal prompt will change to show the environment name. From here, any package you install stays contained inside it.

Installing Scrapy

With the environment active, install Scrapy via pip:

pip install scrapy

To confirm the installation worked, check with:

scrapy version

Creating your first project

Navigate to the folder where you want your project to live, then run:

scrapy startproject bookstore
cd bookstore

This generates the following structure:

bookstore/
├── scrapy.cfg
└── bookstore/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py

Here's what each file does:

  • spiders/. Where your spider classes live. Each spider defines what to scrape and how.
  • items.py. Defines structured data containers for your scraped fields.
  • pipelines.py. Processes items after they're scraped – validation, cleaning, storage.
  • middlewares.py. Hooks into the request/response cycle for custom behavior. Useful for rotating user agents, handling retries, or adding proxy logic.
  • settings.py. Controls everything from concurrency to user agents to export formats.
  • scrapy.cfg. A deployment configuration file. You'll rarely need to touch this during development.

Using Scrapy Shell for interactive data extraction

Before writing a full spider, Scrapy Shell lets you test selectors interactively against a live page. This saves a lot of trial and error.

Launching the shell

If you have IPython installed (pip install ipython), Scrapy will use it automatically, providing syntax highlighting and tab completion in the interactive shell.

To launch the shell, run the following command. Scrapy will fetch the page and drop you into an interactive Python session with the response already loaded:

scrapy shell "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

Exploring the response object

Once the shell loads, you have access to a response object:

response.status        # 200
response.url           # The URL you fetched
response.headers       # Response headers

To open the page visually in your browser, you can enter:

view(response)

Testing XPath and CSS selectors

You can test CSS selectors directly in the shell to extract specific elements from the page. Here's how to extract product data from a books.toscrape.com product page:

# Product name
response.css("h1::text").get()
# → 'A Light in the Attic'
# Price
response.css("p.price_color::text").get()
# → '£51.77'
# Availability (raw)
response.css("p.availability::text").getall()
# → ['\n    ', '\n    \n        In stock (22 available)\n    \n']
# Availability (cleaned)
" ".join(response.css("p.availability::text").getall()).strip()
# → 'In stock (22 available)'
# Rating (stored as a word in the class attribute)
response.css("p.star-rating::attr(class)").get()
# → 'star-rating Three'

You can run the same extractions using XPath selectors. These queries target some of the same elements as the CSS examples above, but use XPath syntax instead:

response.xpath("//h1/text()").get()
response.xpath("//p[@class='price_color']/text()").get()

Pro tips for working in the shell

  • Test selectors in the shell before adding them to spider code. It’s much faster to iterate and debug here.
  • Use your browser's DevTools (right-click → Inspect) to identify element paths before switching to the shell.
  • Expect variations in page structure. Use .get() (returns None on failure) instead of .getall()[0] to avoid errors when elements are missing.
  • Exit the shell with Ctrl+D, or by typing exit() or quit() to return to your terminal.

For a deeper look at how XPath and CSS selectors compare, check out our guide on choosing the right selector for web scraping.

Creating and customizing Scrapy spiders

Spider basics

A spider is a Python class that tells Scrapy what to crawl and how to extract data from responses.

Each spider is defined in its own Python file inside the project’s spiders/ directory (for example, bookstore/spiders/book_spider.py).

The snippets in this section are illustrative. They show different ways to structure a spider as you add features. In a real project, you would typically create a single spider file inside the spiders/ directory and extend it progressively, rather than creating a new file for every example shown here.

Every spider follows the same core anatomy:

import scrapy
class BookSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/category/books_1/index.html"]
    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
                'rating': book.css('p.star-rating::attr(class)').get().split()[-1],
                'availability': book.css('p.availability::text').getall()[1].strip(),
            }

Breaking down the key parts:

  • name. A unique identifier for the spider. This is what you use to run it (scrapy crawl books). No two spiders in the same project can share a name.
  • allowed_domains. Scrapy won't follow links outside these domains.
  • start_urls. The URLs Scrapy fetches first. Each one triggers a request that gets passed to the parse method.
  • parse method. The default callback that handles responses. It receives a Response object and can yield items (extracted data) or new Request objects to follow.

The spider can either yield items (data) or yield new scrapy.Request objects to follow links, and you can mix both in the same parse method. If you're new to the distinction between scraping (extracting data) and crawling (following links to discover pages), check out our overview for a breakdown.

Spider types

Scrapy ships with several spider classes beyond the base one:

  • scrapy.Spider. The default. You control all request logic manually.
  • CrawlSpider. Uses Rule objects with link extractors to follow links automatically. Good for crawling an entire site.
  • SitemapSpider. Reads an XML sitemap to discover URLs. Efficient when the site provides one.
  • CSVFeedSpider / XMLFeedSpider. Parse structured feeds rather than HTML. Useful for data imports.

Customizing request behavior

You can customize request headers either per spider or globally. Per-spider overrides are useful when a specific crawler needs different headers than the rest of the project. For a global default, set USER_AGENT in settings.py. The example below shows how to define custom headers by overriding start_requests.

def start_requests(self):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    for url in self.start_urls:
        yield scrapy.Request(url, headers=headers, callback=self.parse)

Passing data between callbacks with request.meta

When scraping detail pages, you often need to carry data from a listing page into the detail page's callback. Scrapy’s meta dictionary makes this straightforward:

def parse(self, response):
    for book in response.css('article.product_pod'):
        detail_url = book.css('h3 a::attr(href)').get()
        yield response.follow(
            detail_url,
            callback=self.parse_book,
            meta={'rating': book.css('p.star-rating::attr(class)').get().split()[-1]}
        )
def parse_book(self, response):
    yield {
        'title': response.css('h1::text').get(),
        'price': response.css('p.price_color::text').get(),
        'description': response.css('#product_description ~ p::text').get(),
        'rating': response.request.meta['rating'],
    }

Error handling with errback

Network errors and 4xx/5xx responses don't automatically stop a crawl, but you can handle them cleanly using errback:

import logging
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError
def parse(self, response):
    for url in response.css('article.product_pod h3 a::attr(href)').getall():
        yield scrapy.Request(
            response.urljoin(url),
            callback=self.parse_book,
            errback=self.handle_error
        )
def handle_error(self, failure):
    if failure.check(HttpError):
        response = failure.value.response
        logging.error(f"HTTP error {response.status} on {response.url}")
    elif failure.check(DNSLookupError):
        logging.error(f"DNS lookup failed: {failure.request.url}")
    elif failure.check(TimeoutError):
        logging.error(f"Request timed out: {failure.request.url}")

Putting it together: a complete example spider

Here's a spider for crawling books.toscrape.com that combines the three core building blocks you'll use in most real Scrapy projects:

  • Parse a listing page and extract links to detail pages.
  • Follow each detail page link and scrape additional fields there.
  • Pass data from the listing page to the detail page callback using meta.

To use it, create a new file in your project’s spiders/ directory (for example, bookstore/spiders/book_details.py) and paste the code below into it. Scrapy automatically discovers spiders placed in this folder, as long as the class inherits from scrapy.Spider and has a unique name.

import scrapy
import logging
from scrapy.spidermiddlewares.httperror import HttpError
class BookDetailSpider(scrapy.Spider):
    name = "book_details"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/category/books_1/index.html"]
    def parse(self, response):
        for book in response.css('article.product_pod'):
            detail_url = book.css('h3 a::attr(href)').get()
            rating = book.css('p.star-rating::attr(class)').get().split()[-1]
            yield response.follow(
                detail_url,
                callback=self.parse_book,
                errback=self.handle_error,
                meta={'rating': rating}
            )
    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
            'availability': response.css('p.availability::text').getall()[1].strip(),
            'description': response.css('#product_description ~ p::text').get(),
            'rating': response.request.meta['rating'],
            'upc': response.css('table tr:first-child td::text').get(),
        }
    def handle_error(self, failure):
        if failure.check(HttpError):
            logging.error(f"HTTP {failure.value.response.status}: {failure.request.url}")
        else:
            logging.error(repr(failure))

Run the spider from the project root (the folder with scrapy.cfg) using the value defined in the spider’s name attribute. The filename doesn’t matter as long as the spider is placed in the spiders/ directory:

scrapy crawl book_details

Scraping multiple pages and handling pagination

Real-world scraping rarely stops at a single page. Most sites spread their content across multiple pages, and Scrapy gives you several ways to navigate them.

Pagination patterns you'll encounter

  • "Next" button pagination. A "Next" link appears at the bottom of each page. You follow it until it disappears.
  • Numbered page links. The site shows page numbers (1, 2, 3 …) as individual links. You can follow them or generate the URLs directly.
  • Infinite scroll. The page loads more content as the user scrolls down. This is driven by JavaScript and XHR requests, so standard Scrapy can't handle it without additional tooling (Splash or Scrapy-Playwright). You'd need to identify and hit the underlying API endpoint instead.
  • Load more buttons. Similar to infinite scroll – clicking a button fires an XHR request. Inspect the network tab to find the API call and replicate it directly.
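For infinite scroll and "Load more" pages, the callback usually parses JSON from the underlying endpoint rather than HTML. A minimal sketch, assuming a hypothetical endpoint that returns a list of results plus a next-page URL (the field names here are illustrative, not from any real API):

```python
import json

def parse_api(self, response):
    # Hypothetical JSON endpoint behind an infinite-scroll page
    data = json.loads(response.text)
    for product in data.get('results', []):
        yield {'title': product.get('name')}
    next_url = data.get('next')
    if next_url:
        # Schedule the next page of the API with the same callback
        yield response.follow(next_url, callback=self.parse_api)
```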

Following the "Next" link

The "Next" button is the most common pagination pattern: check whether a next-page link exists and follow it if present. Scrapy’s response.follow() automatically resolves relative URLs, so you don’t need to construct absolute URLs manually:

def parse(self, response):
    for book in response.css('article.product_pod'):
        yield {
            'title': book.css('h3 a::attr(title)').get(),
            'price': book.css('p.price_color::text').get(),
        }
    next_page = response.css('li.next a::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

Building page URLs programmatically

When the URL pattern is predictable (for example, ?page=1, ?page=2), you can generate page URLs upfront instead of following links dynamically. This approach works well when you know the total number of pages in advance:

def start_requests(self):
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    for page in range(1, 51):  # Pages 1-50
        yield scrapy.Request(base_url.format(page), callback=self.parse)

Using CrawlSpider rules

CrawlSpider lets you define link-following behavior declaratively using rules, instead of writing pagination logic by hand. It’s well-suited for crawling entire site sections where pagination and detail links follow consistent patterns. Rules are matched in order – each extracted link is handled by the first rule that matches it – so here pagination links are simply followed (a rule without a callback follows links by default), while book detail pages are routed to the parse_book callback:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class BookCrawlSpider(CrawlSpider):
    name = "book_crawl"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]
    rules = (
        # Follow pagination links
        Rule(LinkExtractor(restrict_css='li.next a')),
        # Parse each book's detail page
        Rule(LinkExtractor(restrict_css='article.product_pod h3 a'), callback='parse_book'),
    )
    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
            'availability': response.css('p.availability::text').getall()[1].strip(),
            'description': response.css('#product_description ~ p::text').get(),
        }

Using SitemapSpider

If the target site has an XML sitemap, SitemapSpider is the cleanest approach. It reads the sitemap, filters URLs by pattern, and calls the appropriate callback (no pagination logic needed – the sitemap handles URL discovery entirely):

from scrapy.spiders import SitemapSpider
class BookSitemapSpider(SitemapSpider):
    name = "book_sitemap"
    sitemap_urls = ["https://books.toscrape.com/sitemap.xml"]
    sitemap_rules = [
        ('/catalogue/', 'parse_book'),
    ]
    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
        }

Saving and processing scraped data

Extracting the data from the page is only half the job. Scrapy's Items, Item Loaders, and Pipelines give you a structured way to clean, validate, and store it.

Scrapy Items

An Item is a schema for your scraped data. Items catch typos in field names early (a raw dict would silently accept any key), make it easier to pass consistent data through pipelines, and improve readability across a larger project. Rather than yielding raw dictionaries from your spider, you yield Item objects that enforce structure.

Define your item schema inside the items.py file located in your project’s root module directory:

import scrapy
class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    availability = scrapy.Field()
    description = scrapy.Field()
    rating = scrapy.Field()
    upc = scrapy.Field()

Item Loaders

Item Loaders handle the messy work of populating Items (stripping whitespace, cleaning strings, and dealing with missing fields), so your spider code stays clean. Use Item Loaders inside your spider file in the spiders/ directory:

from scrapy.loader import ItemLoader
from bookstore.items import BookItem
def parse_book(self, response):
    loader = ItemLoader(item=BookItem(), response=response)
    loader.add_css('title', 'h1::text')
    loader.add_css('price', 'p.price_color::text')
    loader.add_css('availability', 'p.availability::text')
    loader.add_css('description', '#product_description ~ p::text')
    return loader.load_item()

By default, each field collects a list of values. Input processors transform values as they're added; output processors transform the final list when load_item() is called.

Scrapy's built-in processors cover most common needs:

  • TakeFirst. Returns the first non-null value from the list. Good for most single-value fields.
  • MapCompose. Applies a chain of functions to each value before storing it. Perfect for stripping whitespace or reformatting strings.
  • Join. Joins a list of strings into one. Useful for multi-line descriptions.

Define processors inside your project’s items.py file alongside your Item class:

import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Join
import re
def clean_price(value):
    return re.sub(r'[^\d.]', '', value)
def normalize_availability(value):
    return value.strip().lower()
class BookItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst()
    )
    price = scrapy.Field(
        input_processor=MapCompose(str.strip, clean_price),
        output_processor=TakeFirst()
    )
    availability = scrapy.Field(
        input_processor=MapCompose(normalize_availability),
        output_processor=TakeFirst()
    )
    description = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=Join(' ')
    )
    rating = scrapy.Field(
        output_processor=TakeFirst()
    )

Pipelines

Pipelines receive each item after the spider yields it. Chain multiple pipelines with specific responsibilities and control execution order via ITEM_PIPELINES in settings.py.

Validation pipeline drops items that are missing critical fields. Define pipelines inside your project’s pipelines.py file:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem
class ValidationPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        required = ['title', 'price']
        for field in required:
            if not adapter.get(field):
                raise DropItem(f"Missing {field} in {item}")
        return item

Cleaning pipeline normalizes data after extraction:

class CleaningPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get('price'):
            adapter['price'] = float(adapter['price'])
        if adapter.get('availability'):
            adapter['availability'] = 'in_stock' if 'in stock' in adapter['availability'] else 'out_of_stock'
        return item

Database pipeline saves items to SQLite:

import sqlite3
class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect('books.db')
        self.cursor = self.conn.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS books (
                title TEXT,
                price REAL,
                availability TEXT,
                description TEXT,
                rating TEXT
            )
        ''')
    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.cursor.execute(
            'INSERT INTO books VALUES (?, ?, ?, ?, ?)',
            (adapter.get('title'), adapter.get('price'),
            adapter.get('availability'), adapter.get('description'),
            adapter.get('rating'))
        )
        return item

Enable and order your pipelines in settings.py. Lower numbers run first, so keep validation at the top so cleaning and storage don't run on invalid items:

ITEM_PIPELINES = {
    'bookstore.pipelines.ValidationPipeline': 100,
    'bookstore.pipelines.CleaningPipeline': 200,
    'bookstore.pipelines.SQLitePipeline': 300,
}

Export formats

For quick exports without a custom pipeline, you can use Scrapy’s FEEDS setting in settings.py. The example below shows how to export the same crawl output into multiple formats at once:

FEEDS = {
    'output/books.json': {'format': 'json', 'encoding': 'utf8', 'indent': 2},
    'output/books.jl': {'format': 'jsonlines'},
    'output/books.csv': {'format': 'csv'},
    'output/books.xml': {'format': 'xml'},
}

If you plan to process large datasets or stream results incrementally, JSON Lines (.jl) is often the most practical format, since each line is a standalone JSON object.

To export directly to cloud storage, set the feed URI to a remote destination. Exporting to S3 with an s3:// URI requires boto3 and configured AWS credentials. Scrapy also supports Google Cloud Storage (gs://) and FTP destinations using the same mechanism. The example below writes JSON Lines output to an S3 bucket:

FEEDS = {
    's3://your-bucket/books.jl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
    }
}

Alternatively, if you only need a one-off export and don’t want to modify settings.py, you can specify the output file when running the spider. The -o flag appends to an existing file, while -O overwrites it:

scrapy crawl book_details -o output/books.csv

Extending Scrapy with middlewares and custom settings

What downloader middlewares do

Downloader middlewares sit between Scrapy's Engine and the Downloader, intercepting every request before it goes out and every response before it reaches your spider. They're your main tool for controlling how requests are made and responses are handled.

When downloader middlewares run

A downloader middleware can hook into three points:

  • process_request(request, spider) runs before each request is sent. You can modify headers, change the request URL, or even return a fake response to bypass the actual download.
  • process_response(request, response, spider) runs after a response arrives. You can validate it, modify it, or return a different response entirely.
  • process_exception(request, exception, spider) handles errors during the download. You can retry failed requests or log them for later inspection.
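The three hooks look like this in practice. The skeleton below only logs traffic; the class name is made up, and a downloader middleware is a plain class that doesn't need to inherit from anything:

```python
class TrafficLogMiddleware:
    def process_request(self, request, spider):
        # Returning None lets request processing continue normally
        spider.logger.info(f"Outgoing: {request.url}")
        return None

    def process_response(self, request, response, spider):
        # Must return a Response (or a new Request to re-schedule)
        spider.logger.info(f"Incoming: {response.status} {response.url}")
        return response

    def process_exception(self, request, exception, spider):
        # Returning None passes the exception on to other middlewares
        spider.logger.warning(f"Failed: {request.url} ({exception!r})")
        return None
```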

Middleware common use cases

  • Proxy rotation. When scraping at scale, rotating proxies prevents IP bans. A middleware can assign a different proxy to each request from a pool, handling failures and retries automatically.
  • User agent rotation. Rotating user agents makes your traffic look more organic, reducing the chance of detection. You'd maintain a list of real browser user agent strings and cycle through them per request.
  • Custom retry logic. With backoff delays and maximum attempt counts, you can retry specific errors like network timeouts, rate limits, or transient server issues.

Built-in middlewares and their default priorities

Scrapy ships with several middlewares active by default. They run in priority order (lower numbers run first for requests, higher numbers run first for responses). Here are some key ones:

  • UserAgentMiddleware (500) sets the User-Agent header
  • RetryMiddleware (550) retries failed requests
  • RedirectMiddleware (600) follows HTTP redirects
  • CookiesMiddleware (700) manages cookies
  • HttpProxyMiddleware (750) handles proxy settings from request meta or settings

You can see the full list and their priorities in Scrapy's documentation.
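You can also disable a built-in middleware, or replace it with your own, by mapping its class path to None in DOWNLOADER_MIDDLEWARES in settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    # Turn off the default User-Agent handling, e.g. when a custom
    # middleware sets the header itself
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```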

Writing a custom middleware

Custom downloader middlewares let you intercept requests before they are sent and react to failures when something goes wrong. A common use case is proxy rotation, where each request is routed through a different proxy to reduce blocks and rate limits.

The example below shows a simple proxy rotation middleware. It does three things:

  • Loads a list of proxies from project settings when Scrapy starts.
  • Assigns a random proxy to each outgoing request.
  • Retries failed requests with a different proxy.

Save this code in your project’s middlewares.py file:

import random
from scrapy.exceptions import NotConfigured
class ProxyRotationMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        if not proxy_list:
            raise NotConfigured('PROXY_LIST setting is required')
        return cls(proxy_list)
    def process_request(self, request, spider):
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
        spider.logger.info(f"Using proxy: {proxy}")
    def process_exception(self, request, exception, spider):
        # Retry with a different proxy, bypassing the duplicate filter
        retry_request = request.replace(dont_filter=True)
        retry_request.meta['proxy'] = random.choice(self.proxy_list)
        spider.logger.warning(f"Request failed, retrying with: {retry_request.meta['proxy']}")
        return retry_request

To activate the middleware, you need to define a proxy list and register the middleware in settings.py. The snippet below shows the minimum configuration required to enable it:

PROXY_LIST = [
    'http://proxy1.example.com:7000',
    'http://proxy2.example.com:7000',
    'http://proxy3.example.com:7000',
]
DOWNLOADER_MIDDLEWARES = {
    'bookstore.middlewares.ProxyRotationMiddleware': 350,
}

For production scraping with anti-bot protection, you'll want residential proxies rather than datacenter ones. Decodo's residential proxies handle rotation, authentication, and geographic targeting automatically, which saves you from building all this logic yourself.

If you're new to working with proxies in Python, check out this guide to mastering Python requests with proxies for the foundational concepts.


Spider middlewares

Spider middlewares are less commonly used than downloader middlewares, but they serve a specific purpose: processing the input and output of your spider's callbacks.

Difference between downloader and spider middlewares

Downloader middlewares work with raw HTTP requests and responses. Spider middlewares work with the items and requests that your spider yields. They run after the response reaches the spider but before items enter the pipeline.

Processing spider input/output

Spider middlewares operate on the data flowing into and out of your spider’s callbacks. They are useful when you need visibility or control over what your spider receives and what it yields.

Spider middlewares can:

  • Filter or modify responses before they reach the spider's parse method.
  • Process items before they go to pipelines.
  • Catch exceptions raised during parsing.

The example below shows a simple spider middleware that counts how many items your spider has yielded so far. This is useful for debugging, progress tracking, or sanity checks during long crawls. Save it in your project’s middlewares.py file:

from scrapy import Request
class ItemCounterMiddleware:
    def __init__(self):
        self.item_count = 0
    def process_spider_output(self, response, result, spider):
        for item in result:
            if not isinstance(item, Request):
                self.item_count += 1
                spider.logger.info(f"Items scraped so far: {self.item_count}")
            yield item

To enable the middleware, register it in settings.py using the SPIDER_MIDDLEWARES setting. The priority value controls execution order, with lower numbers running earlier:

SPIDER_MIDDLEWARES = {
    'bookstore.middlewares.ItemCounterMiddleware': 543,
}

Essential settings

Scrapy's default settings work for small-scale scraping, but adjusting a few key lines in settings.py makes your crawler more respectful, efficient, and maintainable.

CONCURRENT_REQUESTS and DOWNLOAD_DELAY

CONCURRENT_REQUESTS controls how many requests Scrapy sends in parallel. The default is 16, which is fine for resilient sites. If you're scraping a small site or want to be polite, lower it:

CONCURRENT_REQUESTS = 8

DOWNLOAD_DELAY adds a delay (in seconds) between requests to the same domain. This is the primary "politeness" setting:

DOWNLOAD_DELAY = 2  # Wait 2 seconds between requests

You can also combine a global delay with a cap on per-domain concurrency:

DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 4
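Instead of tuning a fixed delay by hand, you can let Scrapy's AutoThrottle extension adjust the delay dynamically based on server response times. A typical settings.py configuration:

```python
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1            # Initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10             # Ceiling for the delay under heavy latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # Average parallel requests per remote site
AUTOTHROTTLE_DEBUG = False              # Set True to log every throttling decision
```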

USER_AGENT configuration

Set a descriptive user agent so site owners can identify your crawler:

USER_AGENT = 'MyBookScraper (+https://example.com/about)'

You can also rotate through a list of real browser user agents using custom middleware.
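A minimal rotation middleware might look like the sketch below (the user agent strings are placeholders; in practice, maintain a list of real, current browser UAs). Register it in DOWNLOADER_MIDDLEWARES like any other middleware:

```python
import random

USER_AGENTS = [
    # Placeholder strings; swap in real browser user agents
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Overwrite the User-Agent header on every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
```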

ROBOTSTXT_OBEY

This setting tells Scrapy to follow rules defined in a site's robots.txt file. It is enabled by default:

ROBOTSTXT_OBEY = True

If a site explicitly disallows crawlers in robots.txt, Scrapy will refuse to scrape it when this is enabled. For legitimate data collection where you have permission or the site is public, you might set it to False, but generally you should leave it on.

LOG_LEVEL and logging configuration

Control how verbose Scrapy's output is:

LOG_LEVEL = 'INFO'  # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL

INFO is a good middle ground for development. Use DEBUG when troubleshooting selectors or middlewares, and WARNING for production to keep logs clean.

You can also log to a file:

LOG_FILE = 'scrapy_output.log'

HTTPCACHE for development efficiency

The HTTP cache saves responses to disk so you don't re-fetch pages during development. This is a huge time-saver when you're tweaking selectors or pipelines:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400  # 24 hours
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]

With caching on, the first run downloads all pages. Subsequent runs use the cached copies until they expire. Just remember to disable it (or clear the cache) when you're ready for production scraping.
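By default, the cache directory lives inside the project's .scrapy/ data folder, so clearing it is a one-liner from the project root:

```shell
# Delete cached responses; the next crawl re-downloads everything
rm -rf .scrapy/httpcache
```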

Integrating Selenium/Playwright

Standard Scrapy works by downloading raw HTML and parsing it. This is fast and efficient, but it breaks down on JavaScript-heavy sites where content loads dynamically after the initial page render. For those cases, you need a real browser.

When you need browser rendering

Use Selenium or Playwright when:

  • Content loads via JavaScript after the page renders (infinite scroll, lazy-loaded images, dynamic tables)
  • User interactions trigger data to appear (clicking "Load More," expanding sections, filling forms)
  • The site heavily relies on client-side rendering frameworks like React or Vue
  • You need to bypass bot detection systems that check for browser fingerprints

If the data you need is in the initial HTML source, stick with standard Scrapy. Browser rendering is 10–50x slower than plain HTTP requests.
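A quick way to decide: fetch the page without a browser and check whether the content you want is already in the raw HTML. The sketch below uses only the standard library; the URL and marker string in the commented example are hypothetical:

```python
import urllib.request

def fetch_raw_html(url: str, timeout: int = 10) -> str:
    """Download the page exactly as Scrapy would see it: no JavaScript runs."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

def data_in_raw_html(html: str, marker: str) -> bool:
    """True if the content you're after is already in the static HTML."""
    return marker in html

# Example: if a known product name appears in the raw HTML, plain Scrapy
# will work and browser rendering is unnecessary.
# html = fetch_raw_html("https://books.toscrape.com")
# print(data_in_raw_html(html, "A Light in the Attic"))
```

If the marker only shows up in the browser's rendered DOM but not here, the site is populating it with JavaScript and you'll need scrapy-selenium or scrapy-playwright.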

scrapy-selenium

scrapy-selenium lets Scrapy render pages through a real browser using Selenium WebDriver. This is useful when the data you need appears only after JavaScript runs, for example infinite scroll, "Load more" buttons, or client-side rendered pages.

Install the integration with pip:

pip install scrapy-selenium

To connect Selenium to Scrapy, add the Selenium middleware and driver settings in settings.py. The snippet below shows a minimal Chrome setup using chromedriver:

from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless', '--no-sandbox']
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}

Once configured, you can choose which requests should be rendered through the browser by using SeleniumRequest in your spider. The following code should be placed inside one of your spider files in the spiders/ directory, for example bookstore/bookstore/spiders/dynamic_spider.py:

import scrapy
from scrapy_selenium import SeleniumRequest

class DynamicPageSpider(scrapy.Spider):
    name = 'dynamic_page'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://example.com/dynamic-page',
            callback=self.parse,
            wait_time=3  # Wait up to 3 seconds for JavaScript to load
        )

    def parse(self, response):
        # response.selector works as usual, but content is fully rendered
        for item in response.css('div.dynamic-item'):
            yield {'title': item.css('h2::text').get()}

For a deeper dive into Selenium-based scraping, check out this complete guide to web scraping with Selenium and Python.

scrapy-playwright

scrapy-playwright uses Playwright instead of Selenium to render JavaScript-heavy pages. It tends to be faster and more reliable on modern websites, and it integrates well with Scrapy’s async architecture.

Install the integration and download Playwright’s browser binaries:

pip install scrapy-playwright
playwright install

To enable Playwright in Scrapy, configure the download handlers and reactor in settings.py. The snippet below shows the core settings required for Playwright-powered requests.

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

In your spider, you enable browser rendering per request using meta. The example below waits for a selector to appear, performs a click, then extracts updated content using Scrapy selectors:

import scrapy
from scrapy_playwright.page import PageMethod

class DynamicSpider(scrapy.Spider):
    name = "dynamic"

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com/js-heavy',
            callback=self.parse,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'div.content'),
                ]
            )
        )
    async def parse(self, response):
        page = response.meta['playwright_page']
        # Perform browser actions
        await page.click('button.load-more')
        await page.wait_for_timeout(2000)
        
        # Extract data from the updated page
        content = await page.content()
        # Parse content with Scrapy selectors
        from scrapy.selector import Selector
        selector = Selector(text=content)
        
        for item in selector.css('div.item'):
            yield {'title': item.css('h2::text').get()}
        
        await page.close()

Performance considerations

Browser rendering is resource-intensive. A typical Scrapy setup can make hundreds of requests per minute. With Selenium or Playwright, you're limited to maybe 5-10 concurrent browser instances before CPU and memory become bottlenecks.

To minimize performance impact:

  • Only use browser rendering when absolutely necessary. If an API endpoint exists that returns the same data (check your browser's Network tab), hit that directly instead.
  • Cache rendered pages during development using HTTPCACHE_ENABLED.
  • Run browsers in headless mode to reduce overhead.
  • Scale horizontally by running multiple Scrapy instances on different machines rather than trying to run 50 browser tabs on one server.
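If you're using scrapy-playwright, a few settings cap how many browser contexts and pages run in parallel, which keeps memory under control. The values below are illustrative starting points, not recommendations for every site:

```python
# settings.py — keep browser concurrency low (values are illustrative)
CONCURRENT_REQUESTS = 8                 # global cap on in-flight requests
PLAYWRIGHT_MAX_CONTEXTS = 4             # parallel browser contexts
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 2    # pages open per context
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
```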

For sites with serious anti-bot protection, you might need both browser rendering and residential proxies. Combining scrapy-playwright with a service like Decodo gives you the fingerprint of a real browser backed by residential IPs, which handles most modern bot detection.

Common limitations for Scrapy web scraping

JavaScript rendering not built-in

Scrapy downloads and parses raw HTML. If a site relies on JavaScript to populate content after the initial page load, Scrapy won't see it. You'll need to integrate Selenium or Playwright, which adds complexity and slows down scraping significantly. For simple projects where you just need a few data points from a JS-heavy site, Beautiful Soup combined with Selenium might be a simpler starting point.

Memory usage at extreme scale

Scrapy keeps request queues, response objects, and some state in memory. At tens of thousands of concurrent requests, memory consumption can grow quickly. For most projects this isn't an issue, but if you're crawling millions of URLs in a single session, you'll need to monitor memory usage and potentially tune garbage collection or split the work across multiple runs.

Single-machine limitations

Scrapy runs on one machine by default. Even with optimal settings, a single instance can only push so many requests per second before hitting CPU, memory, or network limits. If you need to scrape millions of pages daily, you'll eventually need distributed scraping with tools like scrapy-redis or a managed scraping service.

Learning curve for complex projects

Scrapy's architecture is powerful, but it takes time to internalize. Middlewares, pipelines, item loaders, and settings all interact in ways that aren't obvious at first. For a quick one-off scraping task, Beautiful Soup with Requests might get you there faster. Scrapy pays off when you're building something that needs to scale or run repeatedly.

Rate limiting and IP blocks

Most sites track request frequency by IP address. Send too many requests too fast, and you'll get temporarily or permanently blocked. The standard solution is rotating residential proxies. Datacenter proxies work for some sites, but many modern anti-bot systems flag entire datacenter IP ranges by default.

Decodo's residential proxies rotate IPs automatically and route traffic through real residential connections, which makes your requests indistinguishable from regular users.

CAPTCHAs and how to handle them

If you hit a CAPTCHA, you've been detected. There are CAPTCHA-solving services that use human labor or machine learning to solve them, but they're slow and expensive. The better approach is avoiding CAPTCHAs in the first place by:

  • Rotating proxies aggressively
  • Keeping your request rate low and randomized
  • Using realistic browser headers and fingerprints
  • Solving the CAPTCHA manually once and reusing the session cookies (works for some sites)

For sites with aggressive CAPTCHA protection, a managed scraping API that handles this for you is often more cost-effective than building your own solution.

Fingerprinting and detection

Modern anti-bot systems don't just look at IP addresses. They fingerprint your browser based on HTTP headers, TLS configuration, canvas rendering, WebGL, fonts installed, screen resolution, and dozens of other signals. A headless browser running with default settings has a distinct fingerprint that's easy to detect.

If you're using Playwright or Selenium, tools like playwright-stealth or undetected-chromedriver can help mask the automation signals. But even then, combining browser automation with residential proxies is usually necessary for well-protected sites.

Why residential proxies matter

Residential proxies route your traffic through real home internet connections, making each request appear to come from a different user in a different location. This bypasses IP-based rate limits and avoids the automatic blocking that datacenter IPs often face. For large-scale scraping or scraping sites with anti-bot protection, residential proxies aren't optional – they're the baseline requirement.


Try residential proxies

Forget about CAPTCHAs, IP blocks, and other obstacles with a global pool of 115M+ residential IPs.

Best practices for Scrapy web scraping

Respecting robots.txt and site rules

Keep ROBOTSTXT_OBEY = True in your settings. If a site explicitly disallows crawling, honor it. Even if you have a legitimate use case, ignoring robots.txt puts you on shaky ground and can lead to IP bans or even legal issues in some jurisdictions.

Read the site's Terms of Service too. Some sites explicitly prohibit automated access. Others allow it with restrictions (rate limits, attribution, non-commercial use only). Knowing the rules helps you stay on the right side of them.

Implementing polite delays

Set DOWNLOAD_DELAY to at least 1–2 seconds for small sites, more for larger ones. Scrapy is fast enough that even with a 2-second delay, you can still scrape thousands of pages per hour. The delay prevents you from accidentally overwhelming a server, especially if it's a smaller site without enterprise infrastructure.

You can also randomize delays to make traffic patterns look more human:

DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True  # Adds ±50% random variation
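Scrapy's built-in AutoThrottle extension goes a step further and adapts the delay based on observed server latency. A typical configuration looks like this (the numbers are reasonable defaults, not rules):

```python
# settings.py — let Scrapy adapt the delay to server response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2            # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30             # ceiling when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per domain
```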

Using a descriptive user agent

Use a descriptive user agent that includes contact information:

USER_AGENT = 'MyCrawler/1.0 (+https://example.com/crawler-info; [email protected])'

This makes it easy for site owners to reach you if there's an issue. Many sites are fine with scraping as long as you're not causing problems, and a clear user agent shows you're operating in good faith.

Caching during development

Enable HTTPCACHE_ENABLED = True while building and testing your spider. This saves every response to disk so subsequent runs don't hit the live site. It's faster, avoids unnecessary server load, and prevents you from getting blocked while debugging.

Just remember to disable it or clear the cache before running production scrapes, otherwise you'll be working with stale data.

Monitoring and logging

Set LOG_LEVEL = 'INFO' in production and save logs to a file with LOG_FILE = 'scraper.log'. Monitor for patterns like:

  • High error rates (connection timeouts, HTTP 429/503 responses)
  • Decreasing success rates over time (might indicate a block)
  • Unexpected item counts (could mean selectors broke due to a site redesign)

Scrapy's stats collection tracks these automatically and prints a summary at the end of each run. Review it regularly.

Handling failures gracefully with retries

Network hiccups, temporary server issues, and transient errors are normal when scraping at scale. Scrapy's built-in retry middleware handles most of this, but you can tune it:

RETRY_TIMES = 3  # Retry failed requests up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

For retry strategy fundamentals and how to implement exponential backoff, see this guide to Python requests retry logic.

You can also implement custom retry logic in a downloader middleware if you need more control (e.g., backing off longer after a 429 rate limit response).

Scaling considerations

Scrapy runs on a single machine by default, which is enough for most projects. When you need to crawl large sites or distribute work across multiple machines, you can move request scheduling out of memory and into a shared backend.

scrapy-redis enables distributed crawling by replacing Scrapy’s in-memory request queue with a Redis-backed queue. Multiple Scrapy instances can then pull URLs from the same queue and push results to the same destination, sharing the workload automatically.

Install scrapy-redis into your virtual environment:

pip install scrapy-redis

To enable Redis-based scheduling, add the following to your project settings file (meaning the one inside your project package, for example bookstore/bookstore/settings.py). The configuration below switches Scrapy to the Redis scheduler, enables request deduplication across all workers, and allows crawls to be paused and resumed:

# Use scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep Redis queues between runs (pause / resume)
SCHEDULER_PERSIST = True
# Redis connection
REDIS_URL = 'redis://localhost:6379'

Next, create a new spider file inside your project’s spiders/ folder, for example bookstore/bookstore/spiders/distributed_books.py. This spider should inherit from RedisSpider. Rather than defining start_urls, it reads starting URLs from a Redis key:

from scrapy_redis.spiders import RedisSpider
class DistributedBookSpider(RedisSpider):
    name = "distributed_books"
    redis_key = "book_urls"  # Push start URLs to this Redis key
    
    def parse(self, response):
        # Same parsing logic as before
        pass

To start the crawl, push one or more URLs into the Redis list named in redis_key. Any running worker connected to the same Redis instance can pick those URLs up:

redis-cli lpush book_urls "https://books.toscrape.com/catalogue/page-1.html"

Once URLs are in the queue, start one or more workers from the project root, meaning the folder containing scrapy.cfg. Each worker runs the same spider name and pulls work from Redis:

scrapy crawl distributed_books

When to consider a scraping API instead

Building and maintaining a scraping infrastructure takes time. You need proxies, retry logic, CAPTCHA handling, monitoring, and scaling infrastructure. For many projects, a managed scraping API handles all of this for you.

Decodo's Web Scraping API provides:

  • Automatic proxy rotation with 115M+ residential IPs
  • JavaScript rendering when needed
  • Built-in retry and error handling
  • No infrastructure to manage

If you're scraping as a core part of your business and need full control, build with Scrapy. If scraping is a means to an end and you want to focus on using the data rather than collecting it, an API makes more sense.

Try Web Scraping API for free

Activate your 7-day free trial with 1K requests and scrape structured public data at scale.

Final thoughts

Scrapy is one of the most powerful web scraping frameworks available. It scales from small personal projects to large data collection workflows, and its architecture makes it easy to add custom behavior where it matters.

The learning curve is steeper than simpler tools like Beautiful Soup, but it pays off once you’re working with multiple pages, structured data, or non-trivial request logic.

Start simple. Build a basic spider, test selectors in the shell, and use Items and Pipelines to keep your data clean and predictable. From there, add middlewares, tuning, and browser rendering only when the target site requires it.

About the author

Dominykas Niaura

Technical Copywriter

Dominykas brings a unique blend of philosophical insight and technical expertise to his writing. Starting his career as a film critic and music industry copywriter, he's now an expert in making complex proxy and web scraping concepts accessible to everyone.


Connect with Dominykas via LinkedIn

All information on Decodo Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Is Scrapy better than Beautiful Soup?

Scrapy and Beautiful Soup solve different problems. Beautiful Soup is a parsing library that extracts data from HTML you've already downloaded, while Scrapy is a complete framework handling downloading, parsing, link following, concurrency, and data storage.

For small one-off tasks (single page or handful of URLs), Beautiful Soup with Requests is simpler and faster to set up. For crawling multiple pages, handling pagination, or running repeatedly on a schedule, Scrapy will save you time in the long run.

Can I use Scrapy with Selenium?

Yes, use the scrapy-selenium library to integrate Selenium WebDriver into Scrapy's download pipeline, letting you render JavaScript-heavy pages with a real browser while still using Scrapy's infrastructure. The main tradeoff is performance: browser rendering is 10-50x slower than standard HTTP requests. Only use it for pages where JavaScript rendering is absolutely necessary.

To learn more, see our guide on scraping with Selenium and Python.

Is Scrapy faster than other Python scraping tools?

Scrapy is generally the fastest option for crawling multiple pages because it's built on Twisted, an asynchronous networking framework that can make hundreds of concurrent requests without blocking.

For a single page, the speed difference is negligible, but for crawling thousands of pages, Scrapy is significantly faster than synchronous alternatives. The exception is browser-based scraping with Selenium or Playwright, where the browser becomes the bottleneck regardless of framework.

Why am I getting 403/429 errors when using Scrapy?

A 403 Forbidden usually means the site detected automation and blocked your request. Common causes:

  • Missing or suspicious user agent header
  • Requests coming from a datacenter IP that's flagged as a bot
  • Missing required headers (Referer, Accept-Language, etc.)
  • Too many requests from the same IP in a short time

A 429 Too Many Requests is an explicit rate limit. The site is telling you to slow down. Solutions:

  • Add a realistic user agent: USER_AGENT = 'Mozilla/5.0 ...'
  • Rotate residential proxies to distribute requests across many IPs
  • Increase DOWNLOAD_DELAY to 2-5 seconds
  • Respect Retry-After headers when present
  • Use browser rendering with Playwright/Selenium for heavily protected sites

For sites with serious anti-bot measures, combining Scrapy with residential proxies and browser rendering is usually necessary.
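Putting those fixes together, a starting point for settings.py might look like the snippet below. The header values are illustrative and should match what the target site expects from a real browser:

```python
# settings.py — illustrative browser-like identity plus polite pacing
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
```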


© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved