Date: 2021-05-27

Installing Scrapy and setting up your first project

Prerequisites

Before installing Scrapy, make sure you have Python 3.7 or higher on your computer. You can check your current version by running the following command in the terminal:

python --version

If you need to install or upgrade Python, get the latest version from the official Python website. If you’re new to running Python code from the terminal, our guide explains the basics.
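The same check can be made from inside Python, which is handy in a setup script. This is a minimal sketch that assumes the 3.7 minimum stated above; the helper name meets_minimum is hypothetical:

```python
import sys

# Minimum version taken from the prerequisite stated in this section.
MIN_VERSION = (3, 7)


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the given interpreter version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old for this guide"
    print(f"Python {current}: {status}")
```

Comparing tuples avoids the string-comparison trap where "3.10" sorts before "3.7".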

Creating a virtual environment

Isolate your Scrapy project in a virtual environment to keep dependencies tidy and avoid conflicts with other Python projects:

python -m venv scrapy-env

# macOS/Linux
source scrapy-env/bin/activate

# Windows
scrapy-env\Scripts\activate

Once activated, your terminal prompt will change to show the environment name. From here, any package you install stays contained inside it.
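If the prompt doesn't make it obvious, you can confirm from Python whether an environment is active. A small sketch using only the standard library; the function name in_virtualenv is made up for illustration:

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("packages install under:", sys.prefix)
```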

Installing Scrapy

With the environment active, install Scrapy via pip:

pip install scrapy

To confirm the installation worked, check with:

scrapy version

Creating your first project

Navigate to the folder where you want your project to live, then run:

scrapy startproject bookstore
cd bookstore

This generates the following structure:

bookstore/
├── scrapy.cfg
└── bookstore/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py

Here's what each file does:

spiders/. Where your spider classes live. Each spider defines what to scrape and how.

items.py. Defines structured data containers for your scraped fields.

pipelines.py. Processes items after they're scraped – validation, cleaning, storage.

middlewares.py. Hooks into the request/response cycle for custom behavior. Useful for rotating user agents, handling retries, or adding proxy logic.

settings.py. Controls everything from concurrency to user agents to export formats.

scrapy.cfg. A deployment configuration file. You'll rarely need to touch this during development.

Using Scrapy Shell for interactive data extraction

Before writing a full spider, Scrapy Shell lets you test selectors interactively against a live page. This saves a lot of trial and error.

Launching the shell

If you have IPython installed (pip install ipython), Scrapy will use it automatically, providing syntax highlighting and tab completion in the interactive shell.

To launch the shell, run the following command. Scrapy will fetch the page and drop you into an interactive Python session with the response already loaded:

scrapy shell "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

Exploring the response object

Once the shell loads, you have access to a response object:

response.status
response.url
response.headers

To open the page visually in your browser, you can enter:

view(response)

Testing XPath and CSS selectors

You can test CSS selectors directly in the shell to extract specific elements from the page. Here's how to extract product data from a books.toscrape.com product page:

response.css("h1::text").get()
response.css("p.price_color::text").get()
response.css("p.availability::text").getall()
" ".join(response.css("p.availability::text").getall()).strip()
response.css("p.star-rating::attr(class)").get()

You can run the same extractions using XPath selectors. These queries target some of the same elements as the CSS examples above, but use XPath syntax instead:

response.xpath("//h1/text()").get()
response.xpath("//p[@class='price_color']/text()").get()

Pro tips for working in the shell

Test selectors in the shell before adding them to spider code. It’s much faster to iterate and debug here.

Use your browser's DevTools (right-click → Inspect) to identify element paths before switching to the shell.

Expect variations in page structure. Use .get() (returns None when nothing matches) instead of .getall()[0] to avoid errors when elements are missing.

Exit the shell with Ctrl+D, or by typing exit() or quit(), to return to your terminal.
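Because .get() can return None, anything chained onto it (such as .split()) needs a guard. The sketch below shows one way to post-process the rating class and the availability text nodes extracted earlier. The helper names, the word-to-number mapping, and the sample strings are illustrative assumptions, not part of Scrapy:

```python
# Word-to-number mapping for class values of the form "star-rating <Word>".
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr):
    """Turn a class value such as 'star-rating Three' into 3, or None."""
    if not class_attr:  # .get() returned None because nothing matched
        return None
    return RATING_WORDS.get(class_attr.split()[-1])


def clean_availability(text_nodes):
    """Join the text nodes from .getall() and collapse the whitespace."""
    return " ".join(" ".join(text_nodes).split())


print(parse_rating("star-rating Three"))  # 3
print(parse_rating(None))                 # None
print(clean_availability(["\n    ", "\n  In stock (22 available)\n"]))
```

Keeping this logic in small functions makes it easy to try in the shell first and reuse in a spider later.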

For a deeper look at how XPath and CSS selectors compare, check out our guide on choosing the right selector for web scraping.

Creating and customizing Scrapy spiders

Spider basics

A spider is a Python class that tells Scrapy what to crawl and how to extract data from responses.

Each spider is defined in its own Python file inside the project’s spiders/ directory (for example, bookstore/spiders/book_spider.py).

The snippets in this section are illustrative. They show different ways to structure a spider as you add features. In a real project, you would typically create a single spider file inside the spiders/ directory and extend it progressively, rather than creating a new file for every example shown here.

Every spider follows the same core anatomy:

import scrapy


class BookSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/category/books_1/index.html"]

    def parse(self, response):
        # Each book on the listing page sits in its own <article> element.
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
                'rating': book.css('p.star-rating::attr(class)').get().split()[-1],
                'availability': book.css('p.availability::text').getall()[1].strip(),
            }

Breaking down the key parts:

name. A unique identifier for the spider. This is what you use to run it (scrapy crawl books). No two spiders in the same project can share a name.

allowed_domains. Scrapy won't follow links outside these domains.

start_urls. The URLs Scrapy fetches first. Each one triggers a request whose response is passed to the parse method.

parse method. The default callback that handles responses. It receives a Response object and can yield items (extracted data) or new Request objects to follow.

The spider can either yield items (data) or yield new scrapy.Request objects to follow links. You can mix both in the same parse method. If the distinction between scraping (extracting data) and crawling (following links to discover pages) is new to you, check out our overview for a breakdown.

Spider types

Scrapy ships with several spider classes beyond the base one:

scrapy.Spider. The default. You control all request logic manually.

CrawlSpider. Uses Rule objects with link extractors to follow links automatically. Good for crawling an entire site.

SitemapSpider. Reads an XML sitemap to discover URLs. Efficient when the site provides one.

CSVFeedSpider & XMLFeedSpider. Parse structured feeds rather than HTML. Useful for data imports.

Customizing request behavior

You can customize request headers either per spider or globally. Per-spider overrides are useful when a specific crawler needs different headers than the rest of the project. For a global default, set USER_AGENT in settings.py. The example below shows how to define custom headers by overriding start_requests.

def start_requests(self):
    # Headers sent with every initial request from this spider.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    for url in self.start_urls:
        yield scrapy.Request(url, headers=headers, callback=self.parse)

When scraping detail pages, you often need to carry data from a listing page into the detail page's callback. Scrapy’s meta dictionary makes this straightforward:

def parse(self, response):
    for book in response.css('article.product_pod'):
        detail_url = book.css('h3 a::attr(href)').get()
        # The rating is only taken from the listing page here, so pass it along in meta.
        yield response.follow(
            detail_url,
            callback=self.parse_book,
            meta={'rating': book.css('p.star-rating::attr(class)').get().split()[-1]},
        )


def parse_book(self, response):
    yield {
        'title': response.css('h1::text').get(),
        'price': response.css('p.price_color::text').get(),
        'description': response.css('#product_description ~ p::text').get(),
        'rating': response.request.meta['rating'],
    }

Error handling with errback

Network errors and 4xx/5xx responses don't automatically stop a crawl, but you can handle them cleanly using errback:

import logging

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError


# Both functions are methods of your spider class.
def parse(self, response):
    # Attach the errback to every detail-page request.
    for href in response.css('article.product_pod h3 a::attr(href)').getall():
        yield response.follow(href, callback=self.parse_book, errback=self.handle_error)


def handle_error(self, failure):
    if failure.check(HttpError):
        # Non-2xx response: the response object is attached to the exception.
        response = failure.value.response
        logging.error(f"HTTP error {response.status} on {response.url}")
    elif failure.check(DNSLookupError):
        logging.error(f"DNS lookup failed: {failure.request.url}")
    elif failure.check(TimeoutError):
        logging.error(f"Request timed out: {failure.request.url}")

Putting it together: a complete example spider

Here's a spider for crawling books.toscrape.com that combines the three core building blocks you'll use in most real Scrapy projects:

Parse a listing page and extract links to detail pages.

Follow each detail page link and scrape additional fields there.

Pass data from the listing page to the detail page callback using meta.

To use it, create a new file in your project’s spiders/ directory (for example, bookstore/spiders/book_details.py) and paste the code below into it. Scrapy automatically discovers spiders placed in this folder, as long as the class inherits from scrapy.Spider and has a unique name.

import logging

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError


class BookDetailSpider(scrapy.Spider):
    name = "book_details"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/category/books_1/index.html"]

    def parse(self, response):
        # Listing page: collect each book's detail link and star rating.
        for book in response.css('article.product_pod'):
            detail_url = book.css('h3 a::attr(href)').get()
            rating = book.css('p.star-rating::attr(class)').get().split()[-1]
            yield response.follow(
                detail_url,
                callback=self.parse_book,
                errback=self.handle_error,
                meta={'rating': rating},
            )

    def parse_book(self, response):
        # Detail page: extract the remaining fields and add the carried-over rating.
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
            'availability': response.css('p.availability::text').getall()[1].strip(),
            'description': response.css('#product_description ~ p::text').get(),
            'rating': response.request.meta['rating'],
            'upc': response.css('table tr:first-child td::text').get(),
        }

    def handle_error(self, failure):
        if failure.check(HttpError):
            logging.error(f"HTTP {failure.value.response.status}: {failure.request.url}")
        else:
            logging.error(repr(failure))

Run the spider from the project root (the folder with scrapy.cfg) using the value defined in the spider’s name attribute. The filename doesn’t matter as long as the spider is placed in the spiders/ directory:

scrapy crawl book_details

Scraping multiple pages and handling pagination

Real-world scraping rarely stops at a single page. Most sites spread their content across multiple pages, and Scrapy gives you several ways to navigate them.

Pagination patterns you'll encounter

"Next" button pagination. A "Next" link appears at the bottom of each page. You follow it until it disappears.

Numbered page links. The site shows page numbers (1, 2, 3 …) as individual links. You can follow them or generate the URLs directly.

Infinite scroll. The page loads more content as the user scrolls down. This is driven by JavaScript and XHR requests, so standard Scrapy can't handle it without additional tooling (Splash or Scrapy-Playwright). Alternatively, identify the underlying API endpoint and request it directly.

Load more buttons. Similar to infinite scroll – clicking a button fires an XHR request. Inspect the network tab to find the API call and replicate it directly.

Following a "Next" link is the most common pagination pattern. Check if a next-page link exists and follow it if present. Scrapy’s response.follow() automatically resolves relative URLs, so you don’t need to manually construct absolute URLs:

def parse(self, response):
    for book in response.css('article.product_pod'):
        yield {
            'title': book.css('h3 a::attr(title)').get(),
            'price': book.css('p.price_color::text').get(),
        }

    # Follow the "next" link until the last page, where it no longer exists.
    next_page = response.css('li.next a::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
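To see what a relative href expands to before running a crawl, the standard library's urljoin performs a comparable resolution against the current page URL. This is only a stand-in for illustration, and the href value below is an assumed example:

```python
from urllib.parse import urljoin

# URL of the page being parsed and the href found in its "next" link.
page_url = "https://books.toscrape.com/catalogue/category/books_1/index.html"
next_href = "page-2.html"

# A relative href is resolved against the URL of the current page.
next_url = urljoin(page_url, next_href)
print(next_url)
# https://books.toscrape.com/catalogue/category/books_1/page-2.html
```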

Building page URLs programmatically

When the URL pattern is predictable (for example, ?page=1, ?page=2), you can generate page URLs upfront instead of following links dynamically. This approach works well when you know the total number of pages in advance:

def start_requests(self):
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    # range(1, 51) covers pages 1 through 50.
    for page in range(1, 51):
        yield scrapy.Request(base_url.format(page), callback=self.parse)
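It can help to generate the URL list on its own first and check its bounds, since range() excludes its upper limit. A plain-Python sketch using the same URL pattern; the helper name page_urls is hypothetical:

```python
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"


def page_urls(first=1, last=50):
    """Return the listing URLs for pages first..last, inclusive."""
    return [BASE_URL.format(page) for page in range(first, last + 1)]


urls = page_urls()
print(len(urls))  # 50
print(urls[0])    # https://books.toscrape.com/catalogue/page-1.html
print(urls[-1])   # https://books.toscrape.com/catalogue/page-50.html
```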

Using CrawlSpider rules

CrawlSpider lets you define link-following behavior declaratively using rules, instead of writing pagination logic by hand. It’s well-suited for crawling entire site sections where pagination and detail links follow consistent patterns. Rules are checked in the order they are defined, and if a link matches more than one rule, the first match wins. In the example below, the first rule follows pagination links (a rule without a callback follows links by default), and the second routes item pages to a parsing callback:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BookCrawlSpider(CrawlSpider):
    name = "book_crawl"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    rules = (
        # Pagination links: followed, no callback.
        Rule(LinkExtractor(restrict_css='li.next a')),
        # Product links: sent to parse_book.
        Rule(LinkExtractor(restrict_css='article.product_pod h3 a'), callback='parse_book'),
    )

    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
            'availability': response.css('p.availability::text').getall()[1].strip(),
            'description': response.css('#product_description ~ p::text').get(),
        }

Using SitemapSpider

If the target site has an XML sitemap, SitemapSpider is the cleanest approach. It reads the sitemap, filters URLs by pattern, and calls the appropriate callback (no pagination logic needed – the sitemap handles URL discovery entirely):

from scrapy.spiders import SitemapSpider


class BookSitemapSpider(SitemapSpider):
    name = "book_sitemap"
    sitemap_urls = ["https://books.toscrape.com/sitemap.xml"]
    # Each rule pairs a URL pattern with the name of the callback to use.
    sitemap_rules = [
        ('/catalogue/', 'parse_book'),
    ]

    def parse_book(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
        }
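Each sitemap_rules entry pairs a regular expression with a callback name. When patterns overlap, order matters, because only the first matching rule is used. The following is a plain-Python sketch of that matching logic, not Scrapy's own code; the extra rule, the parse_category name, and the helper pick_callback are hypothetical:

```python
import re

# Same shape as sitemap_rules: (regex, callback name). The first rule is a
# hypothetical addition to show why the more specific pattern goes first.
sitemap_rules = [
    ("/catalogue/category/", "parse_category"),
    ("/catalogue/", "parse_book"),
]


def pick_callback(url, rules):
    """Return the callback name of the first rule whose regex matches the URL."""
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None  # no rule matched, so the URL would be skipped


print(pick_callback(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    sitemap_rules,
))  # parse_book
```

If the two rules were swapped, category pages would also be routed to parse_book, since '/catalogue/' matches them too.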

Saving and processing scraped data

Extracting the data from the page is only half the job. Scrapy's Items, Item Loaders, and Pipelines give you a structured way to clean, validate, and store it.

Scrapy Items

An Item is a schema for your scraped data. Items catch typos in field names early (a raw dict would silently accept any key), make it easier to pass consistent data through pipelines, and improve readability across a larger project. Rather than yielding raw dictionaries from your spider, you yield Item objects that enforce structure.

Define your item schema inside the items.py file located in your project’s root module directory:

import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    availability = scrapy.Field()
    description = scrapy.Field()
    rating = scrapy.Field()
    upc = scrapy.Field()

Item Loaders

Item Loaders handle the messy work of populating Items (stripping whitespace, cleaning strings, and dealing with missing fields), so your spider code stays clean. Use Item Loaders inside your spider file in the spiders/ directory:

from scrapy.loader import ItemLoader

from bookstore.items import BookItem


def parse_book(self, response):
    # The loader collects values per field and builds the item at the end.
    loader = ItemLoader(item=BookItem(), response=response)
    loader.add_css('title', 'h1::text')
    loader.add_css('price', 'p.price_color::text')
    loader.add_css('availability', 'p.availability::text')
    loader.add_css('description', '#product_description ~ p::text')
    return loader.load_item()

By default, each field collects a list of values. Input processors transform values as they're added; output processors transform the final list when load_item() is called.

Scrapy's built-in processors cover most common needs:

TakeFirst. Returns the first non-null value from the list. Good for most single-value fields.

MapCompose. Applies a chain of functions to each value before storing it. Perfect for stripping whitespace or reformatting strings.

Join. Joins a list of strings into one. Useful for multi-line descriptions.

Define processors inside your project’s items.py file alongside your Item class:

import re

import scrapy
from itemloaders.processors import Join, MapCompose, TakeFirst


def clean_price(value):
    # Keep only digits and the decimal point.
    return re.sub(r'[^\d.]', '', value)


def normalize_availability(value):
    return value.strip().lower()


class BookItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(str.strip, clean_price),
        output_processor=TakeFirst(),
    )
    availability = scrapy.Field(
        input_processor=MapCompose(normalize_availability),
        output_processor=TakeFirst(),
    )
    description = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=Join(' '),
    )
    rating = scrapy.Field(output_processor=TakeFirst())
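To see what the price field's processor chain does to a raw value without running a crawl, the chain can be approximated in plain Python. In this sketch, map_compose and take_first are simplified, hypothetical stand-ins for the real MapCompose and TakeFirst, and the raw value is made up:

```python
import re


def clean_price(value):
    # Keep only digits and the decimal point, e.g. "£51.77" -> "51.77".
    return re.sub(r"[^\d.]", "", value)


def map_compose(values, *functions):
    """Rough stand-in for MapCompose: run each function over every value."""
    for function in functions:
        values = [function(value) for value in values]
    return values


def take_first(values):
    """Rough stand-in for TakeFirst: first value that is not None or empty."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


raw_prices = ["  £51.77 "]  # what add_css('price', ...) might collect
print(take_first(map_compose(raw_prices, str.strip, clean_price)))  # 51.77
```

The result is still a string at this point; converting it to a number is a separate step.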

Pipelines

Pipelines receive each item after the spider yields it. Chain multiple pipelines with specific responsibilities and control execution order via ITEM_PIPELINES in settings.py.

A validation pipeline drops items that are missing critical fields. Define pipelines inside your project’s pipelines.py file:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class ValidationPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        required = ['title', 'price']
        for field in required:
            # Raising DropItem stops the item from reaching later pipelines.
            if not adapter.get(field):
                raise DropItem(f"Missing {field} in {item}")
        return item

A cleaning pipeline normalizes data after extraction:

class CleaningPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get('price'):
            # Convert the cleaned price string to a number.
            adapter['price'] = float(adapter['price'])
        if adapter.get('availability'):
            # Reduce the free-text availability to one of two labels.
            adapter['availability'] = (
                'in_stock' if 'in stock' in adapter['availability'] else 'out_of_stock'
            )
        return item

A database pipeline saves items to SQLite:

import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        # Open the database and create the table once, when the spider starts.
        self.conn = sqlite3.connect('books.db')
        self.cursor = self.conn.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS books (
                title TEXT,
                price REAL,
                availability TEXT,
                description TEXT,
                rating TEXT
            )
        ''')

    def close_spider(self, spider):
        # Commit all inserts and close the connection when the spider finishes.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.cursor.execute(
            'INSERT INTO books VALUES (?, ?, ?, ?, ?)',
            (
                adapter.get('title'),
                adapter.get('price'),
                adapter.get('availability'),
                adapter.get('description'),
                adapter.get('rating'),
            ),
        )
        return item
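Once a crawl has finished, the stored rows can be queried with the same sqlite3 module. The sketch below uses an in-memory database and made-up rows in place of books.db, so it runs without a crawl; the table layout follows the pipeline above:

```python
import sqlite3

# An in-memory database stands in for books.db so the sketch leaves no file.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS books (
        title TEXT, price REAL, availability TEXT,
        description TEXT, rating TEXT
    )
    """
)

# Made-up rows in the shape the pipeline would insert.
rows = [
    ("Book A", 51.77, "in_stock", "First sample.", "Three"),
    ("Book B", 12.50, "out_of_stock", "Second sample.", "Five"),
]
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Query the stored items: in-stock titles, cheapest first.
in_stock = conn.execute(
    "SELECT title, price FROM books WHERE availability = ? ORDER BY price",
    ("in_stock",),
).fetchall()
print(in_stock)  # [('Book A', 51.77)]
conn.close()
```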

Enable and order your pipelines in settings.py. Lower numbers run first. Keep validation at the top so cleaning and storage don't run on invalid items:

ITEM_PIPELINES = {
    'bookstore.pipelines.ValidationPipeline': 100,
    'bookstore.pipelines.CleaningPipeline': 200,
    'bookstore.pipelines.SQLitePipeline': 300,
}
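The execution order comes from the numbers, not from the position of each entry in the dictionary. A short plain-Python illustration, with the entries deliberately shuffled:

```python
ITEM_PIPELINES = {
    "bookstore.pipelines.SQLitePipeline": 300,
    "bookstore.pipelines.ValidationPipeline": 100,
    "bookstore.pipelines.CleaningPipeline": 200,
}

# Sort by the priority value, not by position in the dict.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path.rsplit(".", 1)[-1])
# 100 ValidationPipeline
# 200 CleaningPipeline
# 300 SQLitePipeline
```

Leaving gaps between the numbers (100, 200, 300) makes it easy to slot a new pipeline in later.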

Export formats

For quick exports without a custom pipeline, you can use Scrapy’s FEEDS setting in settings.py. The example below shows how to export the same crawl output into multiple formats at once:

FEEDS = {
    'output/books.json': {'format': 'json', 'encoding': 'utf8', 'indent': 2},
    'output/books.jl': {'format': 'jsonlines'},
    'output/books.csv': {'format': 'csv'},
    'output/books.xml': {'format': 'xml'},
}

If you plan to process large datasets or stream results incrementally, JSON Lines (.jl) is often the most practical format, since each line is a standalone JSON object.
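Because every line is a complete JSON document, a .jl file can be consumed one record at a time. A minimal reading sketch, with an in-memory buffer and made-up records standing in for output/books.jl; the helper name iter_items is hypothetical:

```python
import io
import json

# Stand-in for an open output/books.jl file; the two records are made up.
jl_file = io.StringIO(
    '{"title": "Book A", "price": "51.77"}\n'
    '{"title": "Book B", "price": "12.50"}\n'
)


def iter_items(lines):
    """Yield one dict per non-empty line without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


titles = [item["title"] for item in iter_items(jl_file)]
print(titles)  # ['Book A', 'Book B']
```

With a real file, passing the open file object to iter_items works the same way, since file objects also iterate line by line.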

To export directly to cloud storage, set the feed URI to a remote destination. Exporting to S3 using an s3:// URI requires boto3 and configured AWS credentials. Scrapy also supports Google Cloud Storage (gs://) and FTP destinations using the same mechanism. The example below writes JSON Lines output to an S3 bucket:

FEEDS = {
    's3://your-bucket/books.jl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
    }
}

Alternatively, if you only need a one-off export and don’t want to modify settings.py, you can specify the output file when running the spider:

scrapy crawl book_details -o output/books.csv

Extending Scrapy with middlewares and custom settings

What downloader middlewares do

Downloader middlewares sit between Scrapy's Engine and the Downloader, intercepting every request before it goes out and every response before it reaches your spider. They're your main tool for controlling how requests are made and responses are handled.

When downloader middlewares run

A downloader middleware can hook into three points:

process_request(request, spider) runs before each request is sent. You can modify headers, change the request URL, or even return a fake response to bypass the actual download.

process_response(request, response, spider) runs after a response arrives. You can validate it, modify it, or return a different response entirely.

process_exception(request, exception, spider) handles errors during the download. You can retry failed requests or log them for later inspection.
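A downloader middleware is an ordinary Python class that defines any of these three methods. The sketch below touches all of them. The class name, the events list, and the stand-in request, response, and spider objects are hypothetical and exist only so the hooks can be exercised without a crawl:

```python
from types import SimpleNamespace


class HeaderAndLogMiddleware:
    """Hypothetical middleware that uses all three hooks."""

    def __init__(self):
        self.events = []  # simple record of what the hooks observed

    def process_request(self, request, spider):
        # Add a default header; returning None lets the download continue.
        request.headers.setdefault("Accept-Language", "en-US,en;q=0.9")
        return None

    def process_response(self, request, response, spider):
        # Record non-200 responses, then pass the response along unchanged.
        if response.status != 200:
            self.events.append(f"{response.status} {request.url}")
        return response

    def process_exception(self, request, exception, spider):
        # Record the error; returning None leaves handling to other middlewares.
        self.events.append(f"{type(exception).__name__} {request.url}")
        return None


# Stand-in objects so the class can be exercised without running a crawl.
request = SimpleNamespace(url="https://books.toscrape.com/", headers={})
mw = HeaderAndLogMiddleware()
mw.process_request(request, spider=None)
mw.process_response(request, SimpleNamespace(status=503), spider=None)
mw.process_exception(request, TimeoutError("timed out"), spider=None)
print(request.headers)
print(mw.events)
```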

Middleware common use cases

Proxy rotation. When scraping at scale, rotating proxies reduces the risk of IP bans. A middleware can assign a different proxy to each request from a pool, handling failures and retries automatically.

User agent rotation. Rotating user agents makes your traffic look more organic, reducing the chance of detection. You'd maintain a list of real browser user agent strings and cycle through them per request.

Custom retry logic. With backoff delays and maximum attempt counts, you can retry specific errors like network timeouts, rate limits, or transient server issues.
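For the retry case, the backoff arithmetic itself is simple. This sketch shows only the delay calculation and the attempt cap, not how a delay would be wired into Scrapy; the function names and default values are assumptions chosen for illustration:

```python
def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (1-based), capped."""
    return min(cap, base * 2 ** (attempt - 1))


def should_retry(attempt, max_attempts=5):
    """Stop retrying once the maximum attempt count is reached."""
    return attempt < max_attempts


for attempt in range(1, 6):
    print(attempt, backoff_delay(attempt), should_retry(attempt))
```

Doubling the delay on each attempt gives a struggling server progressively more room to recover, and the cap keeps a long run of failures from producing unreasonable waits.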

Built-in middlewares and their default priorities

Scrapy ships with several middlewares active by default. They run in priority order (lower numbers run first for requests, higher numbers run first for responses). Here are some key ones:

HttpProxyMiddleware (750) applies the proxy set in request.meta['proxy'] or in the standard proxy environment variables

UserAgentMiddleware (500) sets the User-Agent header

RetryMiddleware (550) retries failed requests

RedirectMiddleware (600) follows HTTP redirects

CookiesMiddleware (700) manages cookies

You can see the full list and their priorities in Scrapy's documentation.
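To make the ordering rule concrete, the priorities listed above can be sorted both ways in plain Python. This only illustrates the rule; it does not read Scrapy's actual settings:

```python
# Priorities as listed above.
PRIORITIES = {
    "HttpProxyMiddleware": 750,
    "UserAgentMiddleware": 500,
    "RetryMiddleware": 550,
    "RedirectMiddleware": 600,
    "CookiesMiddleware": 700,
}

# process_request is called in ascending priority order...
request_order = sorted(PRIORITIES, key=PRIORITIES.get)
# ...and process_response in the reverse order.
response_order = list(reversed(request_order))

print(request_order)
print(response_order)
```

A custom middleware registered at 350, for example, would see each request before any of the five listed here and each response after all of them.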

Writing a custom middleware

Custom downloader middlewares let you intercept requests before they are sent and react to failures when something goes wrong. A common use case is proxy rotation, where each request is routed through a different proxy to reduce blocks and rate limits.

The example below shows a simple proxy rotation middleware. It does three things:

Loads a list of proxies from project settings when Scrapy starts.

Assigns a random proxy to each outgoing request.

Retries failed requests with a different proxy.

Save this code in your project’s middlewares.py file:

import random

from scrapy.exceptions import NotConfigured


class ProxyRotationMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        # Load the proxy pool from settings; disable the middleware if it is empty.
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        if not proxy_list:
            raise NotConfigured('PROXY_LIST setting is required')
        return cls(proxy_list)

    def process_request(self, request, spider):
        # Assign a random proxy to each outgoing request.
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
        spider.logger.info(f"Using proxy: {proxy}")

    def process_exception(self, request, exception, spider):
        # Reschedule the failed request through another randomly chosen proxy.
        # dont_filter=True keeps the duplicate filter from dropping the retry.
        # Note: this simple example sets no cap on the number of retries.
        proxy = random.choice(self.proxy_list)
        spider.logger.warning(f"Request failed, retrying with: {proxy}")
        retry_request = request.replace(dont_filter=True)
        retry_request.meta['proxy'] = proxy
        return retry_request

To activate the middleware, you need to define a proxy list and register the middleware in settings.py. The snippet below shows the minimum configuration required to enable it:

PROXY_LIST = [
    'http://proxy1.example.com:7000',
    'http://proxy2.example.com:7000',
    'http://proxy3.example.com:7000',
]

DOWNLOADER_MIDDLEWARES = {
    'bookstore.middlewares.ProxyRotationMiddleware': 350,
}
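One detail worth noting: random.choice can return the same proxy that just failed. If a retry should really go through a different proxy, the selection can exclude the failed one. A small sketch, where pick_proxy is a hypothetical helper and not part of the middleware above:

```python
import random

PROXY_LIST = [
    "http://proxy1.example.com:7000",
    "http://proxy2.example.com:7000",
    "http://proxy3.example.com:7000",
]


def pick_proxy(proxy_list, exclude=None):
    """Pick a random proxy, avoiding `exclude` when another one is available."""
    candidates = [proxy for proxy in proxy_list if proxy != exclude]
    # Fall back to the full list if excluding would leave nothing to choose.
    return random.choice(candidates or proxy_list)


failed = PROXY_LIST[0]
replacement = pick_proxy(PROXY_LIST, exclude=failed)
print(replacement != failed)  # True
```

In a middleware, the proxy used by the failed request is available as request.meta.get('proxy') and could be passed as exclude.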

For production scraping with anti-bot protection, you'll want residential proxies rather than datacenter ones. Decodo's residential proxies handle rotation, authentication, and geographic targeting automatically, which saves you from building all this logic yourself.

If you're new to working with proxies in Python, check out this guide to mastering Python requests with proxies for the foundational concepts.