Mastering Scrapy for Scalable Python Web Scraping: A Practical Guide
Scrapy is a powerful web scraping framework available in Python. Its asynchronous architecture makes it faster than sequential scrapers built with Requests or Beautiful Soup, and it includes everything needed for production-ready scraping: spiders, items, pipelines, throttling, retries, data export, and middleware. In this guide, you'll learn how to set up Scrapy, build and customize spiders, handle pagination, structure and store data, extend Scrapy with middlewares and proxies, and apply best practices for scraping at scale.
Dominykas Niaura
Last updated: Mar 02, 2026
10 min read
Installing Scrapy and setting up your first project
Prerequisites
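Before installing Scrapy, make sure you have a supported Python version on your computer. Recent Scrapy releases require Python 3.9 or higher (older releases accepted 3.7), so check the Scrapy installation docs for the exact minimum of the version you install. You can check your current version by running the following command in the terminal: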
python --version
If you need to install or upgrade Python, get the latest version from the official Python website. And if you’re new to running Python code from the terminal, our guide explains the basics.
Creating a virtual environment
Isolate your Scrapy project in a virtual environment to keep dependencies tidy and avoid conflicts with other Python projects:
python -m venv scrapy-env

# Activate it on macOS/Linux:
source scrapy-env/bin/activate

# Activate it on Windows:
scrapy-env\Scripts\activate
Once activated, your terminal prompt will change to show the environment name. From here, any package you install stays contained inside it.
Installing Scrapy
With the environment active, install Scrapy via pip:
pip install scrapy
To confirm the installation worked, check with:
scrapy version
Creating your first project
Navigate to the folder where you want your project to live, then run:
scrapy startproject bookstore
cd bookstore
This generates the following structure:
bookstore/
├── scrapy.cfg
└── bookstore/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
Here's what each file does:
- spiders/. Where your spider classes live. Each spider defines what to scrape and how.
- items.py. Defines structured data containers for your scraped fields.
- pipelines.py. Processes items after they're scraped – validation, cleaning, storage.
- middlewares.py. Hooks into the request/response cycle for custom behavior. Useful for rotating user agents, handling retries, or adding proxy logic.
- settings.py. Controls everything from concurrency to user agents to export formats.
- scrapy.cfg. A deployment configuration file. You'll rarely need to touch this during development.
Using Scrapy Shell for interactive data extraction
Before writing a full spider, Scrapy Shell lets you test selectors interactively against a live page. This saves a lot of trial and error.
Launching the shell
If you have IPython installed (pip install ipython), Scrapy will use it automatically, providing syntax highlighting and tab completion in the interactive shell.
To launch the shell, run the following command. Scrapy will fetch the page and drop you into an interactive Python session with the response already loaded:
scrapy shell "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
Exploring the response object
Once the shell loads, you have access to a response object:
response.status   # 200
response.url      # The URL you fetched
response.headers  # Response headers
To open the page visually in your browser, you can enter:
view(response)
Testing XPath and CSS selectors
You can test CSS selectors directly in the shell to extract specific elements from the page. Here's how to extract product data from a books.toscrape.com product page:
# Product name
response.css("h1::text").get()
# → 'A Light in the Attic'

# Price
response.css("p.price_color::text").get()
# → '£51.77'

# Availability (raw)
response.css("p.availability::text").getall()
# → ['\n ', '\n \n In stock (22 available)\n \n']

# Availability (cleaned)
" ".join(response.css("p.availability::text").getall()).strip()
# → 'In stock (22 available)'

# Rating (stored as a word in the class attribute)
response.css("p.star-rating::attr(class)").get()
# → 'star-rating Three'
You can run the same extractions using XPath selectors. These queries target some of the same elements as the CSS examples above, but use XPath syntax instead:
response.xpath("//h1/text()").get()
response.xpath("//p[@class='price_color']/text()").get()
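The selectors above return the rating as a class string and the stock level as free text. The sketch below shows one way to turn both into numbers using only the standard library. The helper names parse_rating and parse_stock_count are hypothetical, and the sketch assumes the strings look like the shell output shown above.

```python
import re

# Map the rating word found in the class attribute to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr):
    """Return the numeric rating from a class string such as 'star-rating Three'."""
    if not class_attr:
        return None
    for word in class_attr.split():
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    return None


def parse_stock_count(availability):
    """Return the integer inside '(22 available)', or None if it is absent."""
    match = re.search(r"\((\d+) available\)", availability or "")
    return int(match.group(1)) if match else None


print(parse_rating("star-rating Three"))             # → 3
print(parse_stock_count("In stock (22 available)"))  # → 22
```

Both helpers return None instead of raising when the input is missing, which matches the advice to expect variations in page structure.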
Pro tips for working in the shell
- Test selectors in the shell before adding them to spider code. It’s much faster to iterate and debug here.
- Use your browser's DevTools (right-click → Inspect) to identify element paths before switching to the shell.
- Expect variations in page structure. Use .get() (returns None on failure) instead of .getall()[0] to avoid errors when elements are missing.
- Exit the shell with Ctrl+D, or by typing exit() or quit() to return to your terminal.
For a deeper look at how XPath and CSS selectors compare, check out our guide on choosing the right selector for web scraping.
Creating and customizing Scrapy spiders
Spider basics
A spider is a Python class that tells Scrapy what to crawl and how to extract data from responses.
Each spider is defined in its own Python file inside the project’s spiders/ directory (for example, bookstore/spiders/book_spider.py).
The snippets in this section are illustrative. They show different ways to structure a spider as you add features. In a real project, you would typically create a single spider file inside the spiders/ directory and extend it progressively, rather than creating a new file for every example shown here.
Every spider follows the same core anatomy:
import scrapy


class BookSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/category/books_1/index.html"]

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
                'rating': book.css('p.star-rating::attr(class)').get().split()[-1],
                'availability': book.css('p.availability::text').getall()[1].strip(),
            }
Breaking down the key parts:
- name. A unique identifier for the spider. This is what you use to run it (scrapy crawl books). No two spiders in the same project can share a name.
- allowed_domains. Scrapy won't follow links outside these domains.
- start_urls. The URLs Scrapy fetches first. Each one triggers a request that gets passed to the parse method.
- parse method. The default callback that handles responses. It receives a Response object and can yield items (extracted data) or new Request objects to follow.
A single parse method can mix both kinds of output: items (data) and new scrapy.Request objects (links to follow). Yielding items is scraping, that is, extracting data from a page. Yielding requests is crawling, that is, following links to discover more pages. If you're new to this distinction, check out our overview for a breakdown.
Spider types
Scrapy ships with several spider classes beyond the base one:
- scrapy.Spider. The default. You control all request logic manually.
- CrawlSpider. Uses Rule objects with link extractors to follow links automatically. Good for crawling an entire site.
- SitemapSpider. Reads an XML sitemap to discover URLs. Efficient when the site provides one.
- CSVFeedSpider & XMLFeedSpider. Parse structured feeds rather than HTML. Useful for data imports.
Customizing request behavior
You can customize request headers either per spider or globally. Per-spider overrides are useful when a specific crawler needs different headers than the rest of the project. For a global default, set USER_AGENT in settings.py. The example below shows how to define custom headers by overriding start_requests.
def start_requests(self):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    for url in self.start_urls:
        yield scrapy.Request(url, headers=headers, callback=self.parse)
Passing data between callbacks with request.meta
When scraping detail pages, you often need to carry data from a listing page into the detail page's callback. Scrapy’s meta dictionary makes this straightforward:
def parse(self, response):
    for book in response.css('article.product_pod'):
        detail_url = book.css('h3 a::attr(href)').get()
        yield response.follow(
            detail_url,
            callback=self.parse_book,
            meta={'rating': book.css('p.star-rating::attr(class)').get().split()[-1]},
        )

def parse_book(self, response):
    yield {
        'title': response.css('h1::text').get(),
        'price': response.css('p.price_color::text').get(),
        'description': response.css('#product_description ~ p::text').get(),
        'rating': response.request.meta['rating'],
    }
Error handling with errback
Network errors and 4xx/5xx responses don't automatically stop a crawl, but you can handle them cleanly using errback:
import logging

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError

def parse(self, response):
    # Attach the errback to every detail-page request
    for href in response.css('article.product_pod h3 a::attr(href)').getall():
        yield response.follow(
            href,
            callback=self.parse_book,
            errback=self.handle_error,
        )

def handle_error(self, failure):
    if failure.check(HttpError):
        # Non-2xx response rejected by HttpErrorMiddleware
        response = failure.value.response
        logging.error(f"HTTP error {response.status} on {response.url}")
    elif failure.check(DNSLookupError):
        logging.error(f"DNS lookup failed: {failure.request.url}")
    elif failure.check(TimeoutError):
        logging.error(f"Request timed out: {failure.request.url}")
Putting it together: a complete example spider
Here's a spider for crawling books.toscrape.com that combines the three core building blocks you'll use in most real Scrapy projects:
- Parse a listing page and extract links to detail pages.
- Follow each detail page link and scrape additional fields there.
- Pass data from the listing page to the detail page callback using meta.
To use it, create a new file in your project’s spiders/ directory (for example, bookstore/spiders/book_details.py) and paste the code below into it. Scrapy automatically discovers spiders placed in this folder, as long as the class inherits from scrapy.Spider and has a unique name.
import logging

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError


class BookDetailSpider(scrapy.Spider):
    name = "book_details"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/category/books_1/index.html"]

    def parse(self, response):
        for book in response.css('article.product_pod'):
            detail_url = book.css('h3 a::attr(href)').get()
            rating = book.css('p.star-rating::attr(class)').get().split()[-1]
            yield response.follow(
                detail_url,
                callback=self.parse_book,
                errback=self.handle_error,
                meta={'rating': rating},
            )

    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
            'availability': response.css('p.availability::text').getall()[1].strip(),
            'description': response.css('#product_description ~ p::text').get(),
            'rating': response.request.meta['rating'],
            'upc': response.css('table tr:first-child td::text').get(),
        }

    def handle_error(self, failure):
        if failure.check(HttpError):
            logging.error(f"HTTP {failure.value.response.status}: {failure.request.url}")
        else:
            logging.error(repr(failure))
Run the spider from the project root (the folder with scrapy.cfg) using the value defined in the spider’s name attribute. The filename doesn’t matter as long as the spider is placed in the spiders/ directory:
scrapy crawl book_details
Scraping multiple pages and handling pagination
Real-world scraping rarely stops at a single page. Most sites spread their content across multiple pages, and Scrapy gives you several ways to navigate them.
Pagination patterns you'll encounter
- "Next" button pagination. A "Next" link appears at the bottom of each page. You follow it until it disappears.
- Numbered page links. The site shows page numbers (1, 2, 3 …) as individual links. You can follow them or generate the URLs directly.
- Infinite scroll. The page loads more content as the user scrolls down. This is driven by JavaScript and XHR requests, so standard Scrapy can't trigger it on its own. Either identify the underlying API endpoint and request it directly, or add a rendering tool such as Splash or scrapy-playwright.
- Load more buttons. Similar to infinite scroll – clicking a button fires an XHR request. Inspect the network tab to find the API call and replicate it directly.
Following "next" links
This is the most common pagination pattern. Check if a next-page link exists and follow it if present. Scrapy’s response.follow() automatically resolves relative URLs, so you don’t need to manually construct absolute URLs:
def parse(self, response):
    for book in response.css('article.product_pod'):
        yield {
            'title': book.css('h3 a::attr(title)').get(),
            'price': book.css('p.price_color::text').get(),
        }

    next_page = response.css('li.next a::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
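To see what resolving a relative URL means in practice, the sketch below uses the standard library's urljoin, which performs roughly the same resolution against the current page URL. The href values are assumptions about what the links on the page look like, not captured output.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed
page_url = "https://books.toscrape.com/catalogue/category/books_1/index.html"

# Relative hrefs as they might appear in the page's HTML (assumed values)
next_href = "page-2.html"
detail_href = "../../a-light-in-the-attic_1000/index.html"

next_url = urljoin(page_url, next_href)
detail_url = urljoin(page_url, detail_href)

print(next_url)
# → https://books.toscrape.com/catalogue/category/books_1/page-2.html
print(detail_url)
# → https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```

Passing the raw href to response.follow() saves you from writing this step yourself in every callback.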
Building page URLs programmatically
When the URL pattern is predictable (for example, ?page=1, ?page=2), you can generate page URLs upfront instead of following links dynamically. This approach works well when you know the total number of pages in advance:
def start_requests(self):
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    for page in range(1, 51):  # Pages 1-50
        yield scrapy.Request(base_url.format(page), callback=self.parse)
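If the page count isn't hard-coded, it can often be derived from a total result count and the page size. The sketch below shows only that arithmetic. The function page_urls and the figures 1000 and 20 are illustrative assumptions, and in a real spider you would read both numbers from the first response.

```python
import math


def page_urls(total_items, per_page, template):
    """Build one URL per page, rounding the page count up for a partial last page."""
    pages = math.ceil(total_items / per_page)
    return [template.format(n) for n in range(1, pages + 1)]


urls = page_urls(1000, 20, "https://books.toscrape.com/catalogue/page-{}.html")
print(len(urls))  # → 50
print(urls[-1])   # → https://books.toscrape.com/catalogue/page-50.html
```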
Using CrawlSpider rules
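CrawlSpider lets you define link-following behavior declaratively using rules, instead of writing pagination logic by hand. It’s well-suited for crawling entire site sections where pagination and detail links follow consistent patterns. Each rule's link extractor is applied to every crawled page, and if a link matches more than one rule, the first matching rule in the list is used. In the example below, the first rule has no callback, so its pagination links are simply followed, while the second rule routes book pages to a parsing callback: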
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookCrawlSpider(CrawlSpider):
    name = "book_crawl"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    rules = (
        # Follow pagination links
        Rule(LinkExtractor(restrict_css='li.next a')),
        # Parse each book's detail page
        Rule(LinkExtractor(restrict_css='article.product_pod h3 a'), callback='parse_book'),
    )

    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
            'availability': response.css('p.availability::text').getall()[1].strip(),
            'description': response.css('#product_description ~ p::text').get(),
        }
Using SitemapSpider
If the target site has an XML sitemap, SitemapSpider is the cleanest approach. It reads the sitemap, filters URLs by pattern, and calls the appropriate callback (no pagination logic needed – the sitemap handles URL discovery entirely):
from scrapy.spiders import SitemapSpider


class BookSitemapSpider(SitemapSpider):
    name = "book_sitemap"
    sitemap_urls = ["https://books.toscrape.com/sitemap.xml"]
    sitemap_rules = [
        ('/catalogue/', 'parse_book'),
    ]

    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
        }
Saving and processing scraped data
Extracting the data from the page is only half the job. Scrapy's Items, Item Loaders, and Pipelines give you a structured way to clean, validate, and store it.
Scrapy Items
An Item is a schema for your scraped data. Items catch typos in field names early (a raw dict would silently accept any key), make it easier to pass consistent data through pipelines, and improve readability across a larger project. Rather than yielding raw dictionaries from your spider, you yield Item objects that enforce structure.
Define your item schema inside the items.py file located in your project’s root module directory:
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    availability = scrapy.Field()
    description = scrapy.Field()
    rating = scrapy.Field()
    upc = scrapy.Field()
Item Loaders
Item Loaders handle the messy work of populating Items (stripping whitespace, cleaning strings, and dealing with missing fields), so your spider code stays clean. Use Item Loaders inside your spider file in the spiders/ directory:
from scrapy.loader import ItemLoader
from bookstore.items import BookItem

def parse_book(self, response):
    loader = ItemLoader(item=BookItem(), response=response)
    loader.add_css('title', 'h1::text')
    loader.add_css('price', 'p.price_color::text')
    loader.add_css('availability', 'p.availability::text')
    loader.add_css('description', '#product_description ~ p::text')
    return loader.load_item()
By default, each field collects a list of values. Input processors transform values as they're added; output processors transform the final list when load_item() is called.
Scrapy's built-in processors cover most common needs:
- TakeFirst. Returns the first non-null value from the list. Good for most single-value fields.
- MapCompose. Applies a chain of functions to each value before storing it. Perfect for stripping whitespace or reformatting strings.
- Join. Joins a list of strings into one. Useful for multi-line descriptions.
Define processors inside your project’s items.py file alongside your Item class:
import re

import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Join


def clean_price(value):
    return re.sub(r'[^\d.]', '', value)


def normalize_availability(value):
    return value.strip().lower()


class BookItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(str.strip, clean_price),
        output_processor=TakeFirst(),
    )
    availability = scrapy.Field(
        input_processor=MapCompose(normalize_availability),
        output_processor=TakeFirst(),
    )
    description = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=Join(' '),
    )
    rating = scrapy.Field(
        output_processor=TakeFirst(),
    )
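To make the input/output processor flow concrete, here is a plain-Python approximation of what the price and description fields go through. The functions map_compose and take_first are simplified stand-ins written for this illustration, not Scrapy's implementations (the real MapCompose also flattens iterables returned by a function, for example).

```python
import re


def clean_price(value):
    """Keep only digits and the decimal point."""
    return re.sub(r"[^\d.]", "", value)


def map_compose(*functions):
    """Simplified stand-in for MapCompose: run each function over every value."""
    def process(values):
        for func in functions:
            values = [func(v) for v in values]
            values = [v for v in values if v is not None]  # None drops the value
        return values
    return process


def take_first(values):
    """Simplified stand-in for TakeFirst: first value that is not None or ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Input processor runs as values are added; output processor runs at load time.
price_values = map_compose(str.strip, clean_price)(["  £51.77 "])
print(take_first(price_values))  # → 51.77

description_values = map_compose(str.strip)(["First line. ", " Second line."])
print(" ".join(description_values))  # → First line. Second line.
```

Note that the price is still a string at this point. Converting it to a number is left to a later step, such as a pipeline.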
Pipelines
Pipelines receive each item after the spider yields it. Chain multiple pipelines with specific responsibilities and control execution order via ITEM_PIPELINES in settings.py.
The validation pipeline drops items that are missing critical fields. Define pipelines inside your project’s pipelines.py file:
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class ValidationPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        required = ['title', 'price']
        for field in required:
            # Treat missing and empty values the same way
            if not adapter.get(field):
                raise DropItem(f"Missing {field} in {item}")
        return item
The cleaning pipeline normalizes data after extraction:
class CleaningPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get('price'):
            adapter['price'] = float(adapter['price'])
        if adapter.get('availability'):
            adapter['availability'] = (
                'in_stock' if 'in stock' in adapter['availability'] else 'out_of_stock'
            )
        return item
The database pipeline saves items to SQLite:
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect('books.db')
        self.cursor = self.conn.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS books (
                title TEXT,
                price REAL,
                availability TEXT,
                description TEXT,
                rating TEXT
            )
        ''')

    def close_spider(self, spider):
        # Rows are committed once, when the spider finishes
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.cursor.execute(
            'INSERT INTO books VALUES (?, ?, ?, ?, ?)',
            (
                adapter.get('title'), adapter.get('price'),
                adapter.get('availability'), adapter.get('description'),
                adapter.get('rating'),
            ),
        )
        return item
Enable and order your pipelines in settings.py. Lower numbers run first. Keep validation at the top so cleaning and storage don't run on invalid items:
ITEM_PIPELINES = {
    'bookstore.pipelines.ValidationPipeline': 100,
    'bookstore.pipelines.CleaningPipeline': 200,
    'bookstore.pipelines.SQLitePipeline': 300,
}
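The effect of those priority numbers can be simulated without Scrapy. The sketch below is a toy model, not Scrapy's scheduler: DropItem is a local stand-in for scrapy.exceptions.DropItem, and validate, clean, and run_pipelines are hypothetical names. It shows why validation should carry the lowest number: the invalid item is rejected before the cleaning stage tries to convert a missing price.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


def validate(item):
    if not item.get("title") or not item.get("price"):
        raise DropItem("missing required field")
    return item


def clean(item):
    item["price"] = float(item["price"])  # would fail on a missing price
    return item


# Deliberately listed out of order; the numbers decide the execution order.
PIPELINES = {clean: 200, validate: 100}


def run_pipelines(item):
    """Pass the item through each stage, lowest priority number first."""
    for stage in sorted(PIPELINES, key=PIPELINES.get):
        item = stage(item)
    return item


print(run_pipelines({"title": "A Light in the Attic", "price": "51.77"}))
# → {'title': 'A Light in the Attic', 'price': 51.77}

try:
    run_pipelines({"title": "", "price": None})
except DropItem as exc:
    print(f"dropped: {exc}")  # → dropped: missing required field
```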
Export formats
For quick exports without a custom pipeline, you can use Scrapy’s FEEDS setting in settings.py. The example below shows how to export the same crawl output into multiple formats at once:
FEEDS = {
    'output/books.json': {'format': 'json', 'encoding': 'utf8', 'indent': 2},
    'output/books.jl': {'format': 'jsonlines'},
    'output/books.csv': {'format': 'csv'},
    'output/books.xml': {'format': 'xml'},
}
If you plan to process large datasets or stream results incrementally, JSON Lines (.jl) is often the most practical format, since each line is a standalone JSON object.
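As a rough illustration of why line-delimited output suits streaming, the sketch below reads items one line at a time with the standard json module. The io.StringIO object stands in for an opened output/books.jl file, and the two records are sample data made up for this example.

```python
import io
import json

# Stand-in for an opened 'output/books.jl' file (sample data)
jl_file = io.StringIO(
    '{"title": "A Light in the Attic", "price": 51.77}\n'
    '{"title": "Tipping the Velvet", "price": 53.74}\n'
)


def iter_items(lines):
    """Yield one parsed item per non-empty line, without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


total = sum(item["price"] for item in iter_items(jl_file))
print(round(total, 2))  # → 105.51
```

A regular JSON array, by contrast, usually has to be parsed in full before the first item is available.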
To export directly to cloud storage, set the feed URI to a remote destination. Exporting to S3 using an s3:// URI requires boto3 and configured AWS credentials. Scrapy also supports Google Cloud Storage (gs://) and FTP destinations using the same mechanism. The example below writes JSON Lines output to an S3 bucket:
FEEDS = {
    's3://your-bucket/books.jl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
    }
}
Alternatively, if you only need a one-off export and don’t want to modify settings.py, you can specify the output file when running the spider. The -o flag appends to an existing file, while -O overwrites it:
scrapy crawl book_details -o output/books.csv
Extending Scrapy with middlewares and custom settings
What downloader middlewares do
Downloader middlewares sit between Scrapy's Engine and the Downloader, intercepting every request before it goes out and every response before it reaches your spider. They're your main tool for controlling how requests are made and responses are handled.
When downloader middlewares run
A downloader middleware can hook into three points:
- process_request(request, spider) runs before each request is sent. You can modify headers, change the request URL, or even return a fake response to bypass the actual download.
- process_response(request, response, spider) runs after a response arrives. You can validate it, modify it, or return a different response entirely.
- process_exception(request, exception, spider) handles errors during the download. You can retry failed requests or log them for later inspection.
Middleware common use cases
- Proxy rotation. When scraping at scale, rotating proxies prevents IP bans. A middleware can assign a different proxy to each request from a pool, handling failures and retries automatically.
- User agent rotation. Rotating user agents makes your traffic look more organic, reducing the chance of detection. You'd maintain a list of real browser user agent strings and cycle through them per request.
- Custom retry logic. With backoff delays and maximum attempt counts, you can retry specific errors like network timeouts, rate limits, or transient server issues.
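The backoff part of custom retry logic comes down to a small calculation. The sketch below shows only that arithmetic, under assumed defaults. The names backoff_delay and should_retry, the status list, and the limits are all illustrative choices rather than Scrapy settings.

```python
def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (1-based), doubling each time."""
    return min(cap, base * 2 ** (attempt - 1))


def should_retry(status, attempt, max_attempts=5,
                 retry_statuses=(429, 500, 502, 503, 504)):
    """Retry only transient statuses, and only while attempts remain."""
    return status in retry_statuses and attempt < max_attempts


print([backoff_delay(n) for n in range(1, 8)])
# → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
print(should_retry(503, attempt=2))  # → True
print(should_retry(404, attempt=1))  # → False
```

Capping the delay keeps a long outage from pushing waits into minutes, and the attempt limit ensures a permanently failing URL is eventually given up on.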
Built-in middlewares and their default priorities
Scrapy ships with several middlewares active by default. They run in priority order (lower numbers run first for requests, higher numbers run first for responses). Here are some key ones:
- HttpProxyMiddleware (750) handles proxy settings from request meta or settings
- UserAgentMiddleware (500) sets the User-Agent header
- RetryMiddleware (550) retries failed requests
- RedirectMiddleware (600) follows HTTP redirects
- CookiesMiddleware (700) manages cookies
You can see the full list and their priorities in Scrapy's documentation.
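The ordering rule is easier to see with the numbers sorted. The sketch below is plain Python using the priorities listed above, plus a hypothetical custom middleware at 350. It models only the ordering, not Scrapy's middleware manager.

```python
# Priorities of the built-in middlewares listed above
BUILT_IN = {
    "UserAgentMiddleware": 500,
    "RetryMiddleware": 550,
    "RedirectMiddleware": 600,
    "CookiesMiddleware": 700,
    "HttpProxyMiddleware": 750,
}

# Hypothetical custom middleware registered with priority 350
custom = {"ProxyRotationMiddleware": 350}

merged = {**BUILT_IN, **custom}

# Requests pass through in ascending order, responses in descending order.
request_order = sorted(merged, key=merged.get)
response_order = list(reversed(request_order))

print(request_order[0])   # → ProxyRotationMiddleware
print(response_order[0])  # → HttpProxyMiddleware
```

A custom middleware that sets a proxy on the request therefore needs a number below 750 if the built-in proxy handling should see its value.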
Writing a custom middleware
Custom downloader middlewares let you intercept requests before they are sent and react to failures when something goes wrong. A common use case is proxy rotation, where each request is routed through a different proxy to reduce blocks and rate limits.
The example below shows a simple proxy rotation middleware. It does three things:
- Loads a list of proxies from project settings when Scrapy starts.
- Assigns a random proxy to each outgoing request.
- Retries failed requests with a different proxy.
Save this code in your project’s middlewares.py file:
import random

from scrapy.exceptions import NotConfigured


class ProxyRotationMiddleware:
    MAX_PROXY_RETRIES = 3  # Stop swapping proxies after this many attempts

    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        if not proxy_list:
            raise NotConfigured('PROXY_LIST setting is required')
        return cls(proxy_list)

    def process_request(self, request, spider):
        # Runs for first attempts and retries alike, so each gets a random proxy
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
        spider.logger.info(f"Using proxy: {proxy}")

    def process_exception(self, request, exception, spider):
        retries = request.meta.get('proxy_retries', 0)
        if retries >= self.MAX_PROXY_RETRIES:
            return None  # Give up; let other middlewares or the errback handle it
        # dont_filter=True keeps the duplicate filter from discarding the retry
        retry = request.replace(dont_filter=True)
        retry.meta['proxy_retries'] = retries + 1
        spider.logger.warning(f"Request failed, retrying: {request.url}")
        return retry
To activate the middleware, you need to define a proxy list and register the middleware in settings.py. The snippet below shows the minimum configuration required to enable it:
PROXY_LIST = [
    'http://proxy1.example.com:7000',
    'http://proxy2.example.com:7000',
    'http://proxy3.example.com:7000',
]

DOWNLOADER_MIDDLEWARES = {
    'bookstore.middlewares.ProxyRotationMiddleware': 350,
}
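One limitation of picking proxies purely at random is that a retry can land on the proxy that just failed. A possible refinement, sketched here in plain Python, excludes the failed proxy when choosing the next one. The helper pick_other_proxy is hypothetical and is not part of the middleware above.

```python
import random


def pick_other_proxy(proxies, failed, rng=random):
    """Choose a proxy other than `failed`; fall back to the full list if none remain."""
    candidates = [p for p in proxies if p != failed] or list(proxies)
    return rng.choice(candidates)


proxies = [
    "http://proxy1.example.com:7000",
    "http://proxy2.example.com:7000",
    "http://proxy3.example.com:7000",
]

replacement = pick_other_proxy(proxies, failed="http://proxy1.example.com:7000")
print(replacement != "http://proxy1.example.com:7000")  # → True
```

The fallback matters when the pool has a single entry: retrying the same proxy is then the only option left.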
For production scraping with anti-bot protection, you'll want residential proxies rather than datacenter ones. Decodo's residential proxies handle rotation, authentication, and geographic targeting automatically, which saves you from building all this logic yourself.
If you're new to working with proxies in Python, check out this guide to mastering Python requests with proxies for the foundational concepts.
Spider middlewares are less commonly used than downloader middlewares, but they serve a specific purpose: processing the input and output of your spider's callbacks.
Difference between downloader and spider middlewares
Downloader middlewares work with raw HTTP requests and responses. Spider middlewares work with the items and requests that your spider yields. They run after the response reaches the spider but before items enter the pipeline.
Processing spider input/output
Spider middlewares operate on the data flowing into and out of your spider’s callbacks. They are useful when you need visibility or control over what your spider receives and what it yields.
Spider middlewares can:
- Filter or modify responses before they reach the spider's parse method.
- Process items before they go to pipelines.
- Catch exceptions raised during parsing.
The example below shows a simple spider middleware that counts how many items your spider has yielded so far. This is useful for debugging, progress tracking, or sanity checks during long crawls. Save it in your project’s middlewares.py file:
from scrapy import Request


class ItemCounterMiddleware:
    def __init__(self):
        self.item_count = 0

    def process_spider_output(self, response, result, spider):
        for item in result:
            # Requests also pass through here; count only scraped items
            if not isinstance(item, Request):
                self.item_count += 1
                spider.logger.info(f"Items scraped so far: {self.item_count}")
            yield item
To enable the middleware, register it in settings.py using the SPIDER_MIDDLEWARES setting. The priority value controls execution order, with lower numbers running earlier:
SPIDER_MIDDLEWARES = {
    'bookstore.middlewares.ItemCounterMiddleware': 543,
}
Essential settings
Scrapy's default settings work for small-scale scraping, but adjusting a few key lines in settings.py makes your crawler more respectful, efficient, and maintainable.
CONCURRENT_REQUESTS and DOWNLOAD_DELAY
CONCURRENT_REQUESTS controls how many requests Scrapy sends in parallel. The default is 16, which is fine for resilient sites. If you're scraping a small site or want to be polite, lower it:
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY adds a delay (in seconds) between requests to the same domain. This is the primary "politeness" setting:
DOWNLOAD_DELAY = 2 # Wait 2 seconds between requests
The delay applies per domain, and you can cap parallel requests per domain separately with CONCURRENT_REQUESTS_PER_DOMAIN:
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 4
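By default Scrapy does not wait exactly DOWNLOAD_DELAY seconds: with RANDOMIZE_DOWNLOAD_DELAY left enabled, each wait is a random value between 0.5 and 1.5 times the configured delay. The sketch below reproduces that range in plain Python to show what a given setting means in practice. The function names are hypothetical and the throughput figure is only a rough estimate.

```python
import random


def randomized_delay(download_delay, rng=random):
    """Pick a wait between 0.5x and 1.5x the configured delay."""
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)


def approx_requests_per_minute(download_delay):
    """Rough per-domain throughput when requests are spaced by the delay."""
    return 60 / download_delay


delay = randomized_delay(2)
print(1.0 <= delay <= 3.0)            # → True
print(approx_requests_per_minute(2))  # → 30.0
```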
USER_AGENT configuration
Set a descriptive user agent so site owners can identify your crawler:
USER_AGENT = 'MyBookScraper (+https://example.com/about)'
You can also rotate through a list of real browser user agents using custom middleware.
ROBOTSTXT_OBEY
This setting tells Scrapy to follow rules defined in a site's robots.txt file. Scrapy's built-in default is False, but the settings.py generated by scrapy startproject enables it:
ROBOTSTXT_OBEY = True
If a site explicitly disallows crawlers in robots.txt, Scrapy will refuse to scrape it when this is enabled. For legitimate data collection where you have permission or the site is public, you might set it to False, but generally you should leave it on.
LOG_LEVEL and logging configuration
Control how verbose Scrapy's output is:
LOG_LEVEL = 'INFO' # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
INFO is a good middle ground for development. Use DEBUG when troubleshooting selectors or middlewares, and WARNING for production to keep logs clean.
You can also log to a file:
LOG_FILE = 'scrapy_output.log'
HTTPCACHE for development efficiency
The HTTP cache saves responses to disk so you don't re-fetch pages during development. This is a huge time-saver when you're tweaking selectors or pipelines:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400  # 24 hours
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]
With caching on, the first run downloads all pages. Subsequent runs use the cached copies until they expire. Just remember to disable it (or clear the cache) when you're ready for production scraping.
Integrating Selenium/Playwright
Standard Scrapy works by downloading raw HTML and parsing it. This is fast and efficient, but it breaks down on JavaScript-heavy sites where content loads dynamically after the initial page render. For those cases, you need a real browser.
When you need browser rendering
Use Selenium or Playwright when:
- Content loads via JavaScript after the page renders (infinite scroll, lazy-loaded images, dynamic tables)
- User interactions trigger data to appear (clicking "Load More," expanding sections, filling forms)
- The site heavily relies on client-side rendering frameworks like React or Vue
- You need to bypass bot detection systems that check for browser fingerprints
If the data you need is in the initial HTML source, stick with standard Scrapy. Rendering browsers is 10–50x slower.
scrapy-selenium
scrapy-selenium lets Scrapy render pages through a real browser using Selenium WebDriver. This is useful when the data you need appears only after JavaScript runs, for example infinite scroll, "Load more" buttons, or client-side rendered pages.
Install the integration with pip:
pip install scrapy-selenium
To connect Selenium to Scrapy, add the Selenium middleware and driver settings in settings.py. The snippet below shows a minimal Chrome setup using chromedriver:
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless', '--no-sandbox']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
Once configured, you can choose which requests should be rendered through the browser by using SeleniumRequest in your spider. The following code should be placed inside one of your spider files in the spiders/ directory, for example bookstore/spiders/dynamic_spider.py:
from scrapy_selenium import SeleniumRequest

def start_requests(self):
    yield SeleniumRequest(
        url='https://example.com/dynamic-page',
        callback=self.parse,
        wait_time=3,  # Wait 3 seconds for JavaScript to load
    )

def parse(self, response):
    # response.selector works as usual, but content is fully rendered
    for item in response.css('div.dynamic-item'):
        yield {'title': item.css('h2::text').get()}
For a deeper dive into Selenium-based scraping, check out this complete guide to web scraping with Selenium and Python.
scrapy-playwright
scrapy-playwright uses Playwright instead of Selenium to render JavaScript-heavy pages. It tends to be faster and more reliable on modern websites, and it integrates well with Scrapy’s async architecture.
Install the integration and download Playwright’s browser binaries:
pip install scrapy-playwright
playwright install
To enable Playwright in Scrapy, configure the download handlers and reactor in settings.py. The snippet below shows the core settings required for Playwright-powered requests.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
In your spider, you enable browser rendering per request using meta. The example below waits for a selector to appear, performs a click, then extracts updated content using Scrapy selectors:
import scrapy
from scrapy.selector import Selector
from scrapy_playwright.page import PageMethod


class DynamicSpider(scrapy.Spider):
    name = "dynamic"

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com/js-heavy',
            callback=self.parse,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'div.content'),
                ],
            ),
        )

    async def parse(self, response):
        page = response.meta['playwright_page']
        try:
            # Perform browser actions
            await page.click('button.load-more')
            await page.wait_for_timeout(2000)
            # Extract data from the updated page
            content = await page.content()
        finally:
            # Always close the page so browser resources are released
            await page.close()

        # Parse content with Scrapy selectors
        selector = Selector(text=content)
        for item in selector.css('div.item'):
            yield {'title': item.css('h2::text').get()}
Performance considerations
Browser rendering is resource-intensive. A typical Scrapy setup can make hundreds of requests per minute. With Selenium or Playwright, you're limited to maybe 5-10 concurrent browser instances before CPU and memory become bottlenecks.
To minimize performance impact:
- Only use browser rendering when absolutely necessary. If an API endpoint exists that returns the same data (check your browser's Network tab), hit that directly instead.
- Cache rendered pages during development using HTTPCACHE_ENABLED.
- Run browsers in headless mode to reduce overhead.
- Scale horizontally by running multiple Scrapy instances on different machines rather than trying to run 50 browser tabs on one server.
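As a rough sketch, the headless and concurrency advice above might translate into settings like the following. This assumes the scrapy-playwright setup shown earlier; the numbers are illustrative placeholders, not values from the original article, and should be tuned to your hardware.

```python
# settings.py (illustrative values, tune for your own machine)

# Launch the Playwright browser without a visible window.
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

# Cap how many pages scrapy-playwright keeps open per browser context.
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4

# Keep overall concurrency modest while requests go through a browser.
CONCURRENT_REQUESTS = 8
```

Starting low and raising the limits while watching CPU and memory is usually safer than starting high.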
For sites with serious anti-bot protection, you might need both browser rendering and residential proxies. Combining scrapy-playwright with a service like Decodo gives you the fingerprint of a real browser backed by residential IPs, which handles most modern bot detection.
Common limitations for Scrapy web scraping
JavaScript rendering not built-in
Scrapy downloads and parses raw HTML. If a site relies on JavaScript to populate content after the initial page load, Scrapy won't see it. You'll need to integrate Selenium or Playwright, which adds complexity and slows down scraping significantly. For simple projects where you just need a few data points from a JS-heavy site, Beautiful Soup combined with Selenium might be a simpler starting point.
Memory usage at extreme scale
Scrapy keeps request queues, response objects, and some state in memory. At tens of thousands of concurrent requests, memory consumption can grow quickly. For most projects this isn't an issue, but if you're crawling millions of URLs in a single session, you'll need to monitor memory usage and potentially tune garbage collection or split the work across multiple runs.
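One way to act on this is a sketch like the one below, which relies on Scrapy's memory usage extension and on-disk job state. The thresholds and the directory name are assumptions chosen for illustration, and the memory extension is not available on every platform.

```python
# settings.py (thresholds and directory name are illustrative)

# Log a warning, then stop the crawl, when memory crosses these limits.
MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 1536
MEMUSAGE_LIMIT_MB = 2048

# Persist the request queue and dedupe state on disk instead of in memory,
# which also lets a crawl be stopped and resumed across multiple runs.
JOBDIR = "crawls/bookstore-run-1"
```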
Single-machine limitations
Scrapy runs on one machine by default. Even with optimal settings, a single instance can only push so many requests per second before hitting CPU, memory, or network limits. If you need to scrape millions of pages daily, you'll eventually need distributed scraping with tools like scrapy-redis or a managed scraping service.
Learning curve for complex projects
Scrapy's architecture is powerful, but it takes time to internalize. Middlewares, pipelines, item loaders, and settings all interact in ways that aren't obvious at first. For a quick one-off scraping task, Beautiful Soup with Requests might get you there faster. Scrapy pays off when you're building something that needs to scale or run repeatedly.
Rate limiting and IP blocks
Most sites track request frequency by IP address. Send too many requests too fast, and you'll get temporarily or permanently blocked. The standard solution is rotating residential proxies. Datacenter proxies work for some sites, but many modern anti-bot systems flag entire datacenter IP ranges by default.
Decodo's residential proxies rotate IPs automatically and route traffic through real residential connections, which makes your requests much harder to distinguish from regular user traffic.
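If you manage a pool of proxy endpoints yourself rather than relying on provider-side rotation, a downloader middleware can assign them round-robin. The sketch below is a minimal illustration: the class name and proxy URLs are hypothetical, and it assumes the middleware is registered in DOWNLOADER_MIDDLEWARES so that it runs before Scrapy's built-in HttpProxyMiddleware.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace with the ones your provider gives you.
PROXY_POOL = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
]


class RotatingProxyMiddleware:
    """Downloader middleware sketch that assigns proxies round-robin."""

    def __init__(self, proxies=PROXY_POOL):
        self._proxies = cycle(proxies)

    def process_request(self, request, spider):
        # Scrapy reads the proxy for a request from request.meta["proxy"].
        request.meta["proxy"] = next(self._proxies)
        return None  # Let the request continue through the middleware chain
```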
CAPTCHAs and how to handle them
If you hit a CAPTCHA, you've been detected. There are CAPTCHA-solving services that use human labor or machine learning to solve them, but they're slow and expensive. The better approach is avoiding CAPTCHAs in the first place by:
- Rotating proxies aggressively
- Keeping your request rate low and randomized
- Using realistic browser headers and fingerprints
- Solving the CAPTCHA manually once and reusing the session cookies (works for some sites)
For sites with aggressive CAPTCHA protection, a managed scraping API that handles this for you is often more cost-effective than building your own solution.
Fingerprinting and detection
Modern anti-bot systems don't just look at IP addresses. They fingerprint your browser based on HTTP headers, TLS configuration, canvas rendering, WebGL, fonts installed, screen resolution, and dozens of other signals. A headless browser running with default settings has a distinct fingerprint that's easy to detect.
If you're using Playwright or Selenium, tools like playwright-stealth or undetected-chromedriver can help mask the automation signals. But even then, combining browser automation with residential proxies is usually necessary for well-protected sites.
Why residential proxies matter
Residential proxies route your traffic through real home internet connections, making each request appear to come from a different user in a different location. This bypasses IP-based rate limits and avoids the automatic blocking that datacenter IPs often face. For large-scale scraping or scraping sites with anti-bot protection, residential proxies aren't optional – they're the baseline requirement.
Best practices for Scrapy web scraping
Respecting robots.txt and site rules
Keep ROBOTSTXT_OBEY = True in your settings. If a site explicitly disallows crawling, honor it. Even if you have a legitimate use case, ignoring robots.txt puts you on shaky ground and can lead to IP bans or even legal issues in some jurisdictions.
Read the site's Terms of Service too. Some sites explicitly prohibit automated access. Others allow it with restrictions (rate limits, attribution, non-commercial use only). Knowing the rules helps you stay on the right side of them.
Implementing polite delays
Set DOWNLOAD_DELAY to at least 1–2 seconds, and go higher for small sites that lack enterprise infrastructure, since the delay is what keeps you from accidentally overwhelming a server. Scrapy applies the delay per domain, so a 2-second delay still allows up to roughly 1,800 pages per hour from a single site.
You can also randomize delays to make traffic patterns look more human:
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True  # Adds ±50% random variation
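To make the numbers concrete, the sketch below mimics the documented randomization range (0.5x to 1.5x of DOWNLOAD_DELAY) and estimates a per-domain throughput ceiling. The helper names are hypothetical and this is not Scrapy's internal code, only an illustration of the arithmetic.

```python
import random

DOWNLOAD_DELAY = 2  # seconds, matching the setting above


def randomized_delay(base_delay):
    """Pick a delay between 0.5x and 1.5x the base delay."""
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)


def pages_per_hour(base_delay):
    """Rough per-domain ceiling, ignoring the time each download takes."""
    return int(3600 / base_delay)


print(pages_per_hour(DOWNLOAD_DELAY))  # 1800
```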
Using a descriptive user agent
Set USER_AGENT in settings.py to a descriptive value that includes contact information.
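A minimal sketch, assuming the value goes in settings.py; the bot name, info URL, and email address are placeholders rather than values from the original article:

```python
# settings.py
# Hypothetical bot name, info page, and contact address; substitute your own.
USER_AGENT = "bookstore-crawler/1.0 (+https://example.com/bot-info; contact@example.com)"
```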
This makes it easy for site owners to reach you if there's an issue. Many sites are fine with scraping as long as you're not causing problems, and a clear user agent shows you're operating in good faith.
Caching during development
Enable HTTPCACHE_ENABLED = True while building and testing your spider. This saves every response to disk so subsequent runs don't hit the live site. It's faster, avoids unnecessary server load, and prevents you from getting blocked while debugging.
Just remember to disable it or clear the cache before running production scrapes, otherwise you'll be working with stale data.
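A development-only cache configuration might look like the sketch below. The expiration window and the list of ignored status codes are assumptions chosen for illustration.

```python
# settings.py (development only)

HTTPCACHE_ENABLED = True

# Relative paths are stored under the project's .scrapy/ data folder.
HTTPCACHE_DIR = "httpcache"

# Treat cached responses older than one day as stale and refetch them.
HTTPCACHE_EXPIRATION_SECS = 86400

# Avoid caching error responses, so a temporary block is not replayed forever.
HTTPCACHE_IGNORE_HTTP_CODES = [429, 500, 502, 503, 504]
```

Setting an expiration is one way to limit the stale-data risk if the cache is accidentally left on.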
Monitoring and logging
Set LOG_LEVEL = 'INFO' in production and save logs to a file with LOG_FILE = 'scraper.log'. Monitor for patterns like:
- High error rates (connection timeouts, HTTP 429/503 responses)
- Decreasing success rates over time (might indicate a block)
- Unexpected item counts (could mean selectors broke due to a site redesign)
Scrapy's stats collection tracks these automatically and prints a summary at the end of each run. Review it regularly.
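The checks listed above can be automated against the stats dictionary. The sketch below uses stat keys that Scrapy's downloader and item counters populate; the function name and the 5% threshold are hypothetical choices for this example.

```python
def summarize_crawl(stats):
    """Derive a few health indicators from a Scrapy stats dictionary."""
    responses = stats.get("downloader/response_count", 0)
    throttled = stats.get("downloader/response_status_count/429", 0)
    unavailable = stats.get("downloader/response_status_count/503", 0)
    items = stats.get("item_scraped_count", 0)

    # Share of responses that look like rate limiting or server overload.
    error_rate = (throttled + unavailable) / responses if responses else 0.0

    return {
        "responses": responses,
        "items": items,
        "error_rate": error_rate,
        "possibly_blocked": error_rate > 0.05,  # Arbitrary threshold for this sketch
    }
```

Inside a spider, one place to call this is the closed() method, passing self.crawler.stats.get_stats().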
Handling failures gracefully with retries
Network hiccups, temporary server issues, and transient errors are normal when scraping at scale. Scrapy's built-in retry middleware handles most of this, but you can tune it:
RETRY_TIMES = 3  # Retry failed requests up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
For retry strategy fundamentals and how to implement exponential backoff, see this guide to Python requests retry logic.
You can also implement custom retry logic in a downloader middleware if you need more control (e.g., backing off longer after a 429 rate limit response).
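The core of such custom logic is deciding how long to wait. The sketch below covers only that calculation: it honors a numeric Retry-After value when one is present and otherwise falls back to capped exponential backoff. The function name and defaults are hypothetical, and wiring it into a middleware without blocking Scrapy's event loop is left out.

```python
def backoff_delay(retry_count, retry_after=None, base=2.0, cap=60.0):
    """Return how many seconds to wait before retry number `retry_count`."""
    if retry_after is not None:
        try:
            # Retry-After given in seconds, e.g. "5"
            return min(float(retry_after), cap)
        except (TypeError, ValueError):
            # Retry-After may also be an HTTP date; fall through to backoff
            pass
    # Exponential backoff: base, 2*base, 4*base, ... capped at `cap`
    return min(base * (2 ** retry_count), cap)
```

With the defaults, successive retries wait 2, 4, and 8 seconds, and no wait exceeds 60 seconds.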
Scaling considerations
Scrapy runs on a single machine by default, which is enough for most projects. When you need to crawl large sites or distribute work across multiple machines, you can move request scheduling out of memory and into a shared backend.
scrapy-redis enables distributed crawling by replacing Scrapy’s in-memory request queue with a Redis-backed queue. Multiple Scrapy instances can then pull URLs from the same queue and push results to the same destination, sharing the workload automatically.
Install scrapy-redis into your virtual environment:
pip install scrapy-redis
To enable Redis-based scheduling, add the following to your project settings file (meaning the one inside your project package, for example bookstore/bookstore/settings.py). The configuration below switches Scrapy to the Redis scheduler, enables request deduplication across all workers, and allows crawls to be paused and resumed:
# Use scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep Redis queues between runs (pause / resume)
SCHEDULER_PERSIST = True

# Redis connection
REDIS_URL = 'redis://localhost:6379'
Next, create a new spider file inside your project’s spiders/ folder, for example bookstore/bookstore/spiders/distributed_books.py. This spider should inherit from RedisSpider. Rather than defining start_urls, it reads starting URLs from a Redis key:
from scrapy_redis.spiders import RedisSpider


class DistributedBookSpider(RedisSpider):
    name = "distributed_books"
    redis_key = "book_urls"  # Push start URLs to this Redis key

    def parse(self, response):
        # Same parsing logic as before
        pass
To start the crawl, push one or more URLs into the Redis list named in redis_key. Any running worker connected to the same Redis instance can pick those URLs up:
redis-cli lpush book_urls "https://books.toscrape.com/catalogue/page-1.html"
Once URLs are in the queue, start one or more workers from the project root, meaning the folder containing scrapy.cfg. Each worker runs the same spider name and pulls work from Redis:
scrapy crawl distributed_books
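Seeding the queue can also be scripted instead of typed into redis-cli. The sketch below is one possible approach: the helper names are hypothetical, the client is any object with a redis-py style lpush method, and the page range is an arbitrary example.

```python
def catalogue_pages(first, last):
    """Build page URLs following the pattern used in the redis-cli example."""
    template = "https://books.toscrape.com/catalogue/page-{}.html"
    return [template.format(n) for n in range(first, last + 1)]


def seed_start_urls(client, key, urls):
    """Push start URLs onto the Redis list that the spider reads from."""
    urls = list(urls)
    if urls:
        client.lpush(key, *urls)
    return len(urls)


# Usage with redis-py (assumed to be installed separately):
#   import redis
#   client = redis.Redis.from_url("redis://localhost:6379")
#   seed_start_urls(client, "book_urls", catalogue_pages(1, 5))
```

The key passed to seed_start_urls must match the spider's redis_key, otherwise the workers will wait on an empty list.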
When to consider a scraping API instead
Building and maintaining a scraping infrastructure takes time. You need proxies, retry logic, CAPTCHA handling, monitoring, and scaling infrastructure. For many projects, a managed scraping API handles all of this for you.
Decodo's Web Scraping API provides:
- Automatic proxy rotation with 115M+ residential IPs
- JavaScript rendering when needed
- Built-in retry and error handling
- No infrastructure to manage
If you're scraping as a core part of your business and need full control, build with Scrapy. If scraping is a means to an end and you want to focus on using the data rather than collecting it, an API makes more sense.
Final thoughts
Scrapy is one of the most powerful web scraping frameworks available. It scales from small personal projects to large data collection workflows, and its architecture makes it easy to add custom behavior where it matters.
The learning curve is steeper than simpler tools like Beautiful Soup, but it pays off once you’re working with multiple pages, structured data, or non-trivial request logic.
Start simple. Build a basic spider, test selectors in the shell, and use Items and Pipelines to keep your data clean and predictable. From there, add middlewares, tuning, and browser rendering only when the target site requires it.
