Scrapy With JavaScript: How To Scrape Dynamic Sites Without Losing Your Pipeline

Scrapy is an asynchronous Python framework for crawling and extracting data at scale, but it doesn't execute JavaScript on its own. A spider can get a clean 200 response and still return empty selectors on a modern site. This guide covers the rendering options (Splash, Selenium, Playwright, managed APIs) and the cache and concurrency settings that matter once browser rendering comes into play.

TL;DR

  • Use regular Scrapy first, and confirm the page actually needs JavaScript rendering before adding a browser layer.
  • Start with scrapy-playwright if you're building fresh. It fits Scrapy's workflow cleanly and is built around a download handler instead of a full rewrite.
  • Skip rendering when the site is really just calling a JSON endpoint. Hitting the API directly is cheaper and faster than running a browser.
  • Move the hardest targets to a managed rendering layer when browser maintenance and blocking start eating more time than parsing.

What is Scrapy?

Scrapy is a Python framework built for crawling websites and extracting structured data at scale. It runs asynchronously, which is why it handles large crawls so well. If you're new to web scraping and want the basics first, start with this Scrapy guide.

The useful thing about Scrapy in this context is that it's, first and foremost, a crawling and data pipeline framework. Your spider holds the logic; the downloader fetches responses; pipelines handle post-processing; and middlewares are where rendering tools plug in. That's what makes JavaScript support manageable: you're not throwing Scrapy away; you're swapping in a different way to fetch certain pages. By itself, Scrapy isn't a browser. It won't execute JavaScript, wait for XMLHttpRequest (XHR) requests, or interact with a React page out of the box. And if you're still deciding whether Scrapy is even the right tool for the job, this Scrapy vs. BeautifulSoup comparison is worth a look.

Is Scrapy good for web scraping?

Scrapy makes the most sense on jobs with real volume: thousands of URLs, repeated crawl runs, and a pipeline that does more than just fetch HTML and print it. That's also where it works well with JavaScript. You can keep Scrapy handling request flow, retries, deduplication, throttling, and output, then add a rendering layer only where the target actually needs it. If you want to compare that approach against other Python options, this overview of Python web scraping libraries is a useful reference.

It's a weaker fit for small jobs. If you only need to fetch a handful of pages once, a full Scrapy project is usually more setup than the job needs. It also starts to lose its edge when every page has to go through a browser, or when the target is built around heavy interaction from the start: login flows, CAPTCHA walls, or browser checks on every step. In those cases, the rendering overhead starts to eat into the speed advantage that makes Scrapy useful in the first place.

A job board is a good example. Say the listing pages are paginated and mostly server-rendered, but some extra details load through XHR. That's a good Scrapy case. You let Scrapy handle the crawl and only route the JavaScript-dependent pages through a rendering tool. That's usually cleaner than writing the entire crawl as a single Playwright script.

So the practical answer is this: if you're scraping a few thousand listings, product pages, or articles, even with some JavaScript mixed in, Scrapy is still a good choice. If you're scraping ten pages once, use something lighter. And if you're still comparing approaches at the Python level, this broader guide to Python web scraping is the right next read.

Scraping dynamic websites with Scrapy: Why selectors return nothing

This is the point where many Scrapy projects go sideways. The spider gets a 200 response, the request looks fine, and every selector comes back empty. Sometimes that means the page needs to render JavaScript. Sometimes it means the selector is wrong. Sometimes the site gave you a different page than the one you saw in the browser. It's worth checking which one you're dealing with before you add a browser layer.

The first check is simple: compare what Scrapy got with what the browser rendered. Look at response.text in the spider, or fetch the page from the command line with scrapy fetch --nolog <url>. Then compare that raw HTML with the Elements panel in DevTools. If the data is present in DevTools but missing from Scrapy's response, the browser is doing extra work after the initial request.

The second check is the Network tab. Many "JavaScript-heavy" pages don't build their data in the DOM from scratch. They call a JSON endpoint after load. If you can find that XHR or fetch request, you can often skip rendering entirely and call the API directly from Scrapy. That's usually faster, simpler, and cheaper than running a browser for every page. This is the same basic idea covered in the guide to dynamic content scraping.

The third check is view-source:. If the text you want is missing from the page source but visible in the live page, the content is client-rendered. If the text is already in the source, the problem is probably not JavaScript. It's more likely a bad selector, a timing assumption, or the wrong extraction method. If that part is shaky, it helps to revisit how to choose XPath vs. CSS selectors.

In practice, most targets fall into 3 groups.

  • Class A: API-backed pages. The page shell loads first, then the real data arrives through XHR or fetch. In this case, don't render unless necessary. Call the API directly from Scrapy.
  • Class B: server-rendered pages with hydration. Most of the HTML is already there, and JavaScript only adds interactivity. In this case, regular Scrapy often works fine. Check the raw response before adding anything heavier.
  • Class C: fully client-rendered pages. The raw HTML is mostly a shell, often something like a div with an app root and very little actual content. You need a rendering layer because the data isn't in the initial response.
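The three checks and the three groups can be folded into one small decision helper. This is only a sketch: classify_target and its arguments are hypothetical names, and the json_endpoint_seen flag stands for whatever you observed manually in the Network tab.

```python
def classify_target(raw_html: str, expected_text: str, json_endpoint_seen: bool) -> str:
    """Return "A", "B", or "C" following the three groups described above.

    raw_html           -- the body Scrapy received (for example response.text)
    expected_text      -- a value you can see in the rendered page
    json_endpoint_seen -- True if the Network tab showed an XHR/fetch call
                          that returns the data
    """
    if expected_text in raw_html:
        # Data is already in the initial response: plain Scrapy is enough.
        return "B"
    if json_endpoint_seen:
        # Data arrives later from an API: call that endpoint directly.
        return "A"
    # Data is missing and there is no usable endpoint: render the page.
    return "C"


# Made-up sample pages for illustration.
shell_page = '<html><body><div id="app"></div></body></html>'
server_page = "<html><body><h1>2-bed flat, Leeds</h1></body></html>"

print(classify_target(server_page, "2-bed flat", json_endpoint_seen=False))  # B
print(classify_target(shell_page, "2-bed flat", json_endpoint_seen=True))    # A
print(classify_target(shell_page, "2-bed flat", json_endpoint_seen=False))   # C
```

The order of the checks mirrors the advice in the list: raw HTML first, API second, browser last.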

A realistic example is a real estate listing site. The listing page might be server-rendered, allowing Scrapy to extract titles, locations, and links without a browser. But when you open an individual listing, the price history panel might load from a separate JSON endpoint after page load. On the same site, you can end up treating 2 page types differently:

  • Scrape listing pages with regular Scrapy
  • Call the price-history API directly if you can find it
  • Only fall back to browser rendering if the data is truly client-rendered, and there's no usable endpoint

That's the general rule: don't add JavaScript rendering just because the site uses JavaScript. Add it only when the response Scrapy gets is missing the data you need.
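For the price-history case, calling the endpoint directly mostly comes down to parsing JSON instead of HTML. The sketch below assumes a made-up payload shape (listing_id plus a history list); a real endpoint will use different keys, so treat the field names as placeholders.

```python
import json


def parse_price_history(body: str) -> list[dict]:
    """Turn a JSON response body into flat items.

    The payload shape ({"listing_id": ..., "history": [...]}) is a made-up
    example; adjust the keys to whatever the real endpoint returns.
    """
    payload = json.loads(body)
    return [
        {
            "listing_id": payload["listing_id"],
            "date": entry["date"],
            "price": entry["price"],
        }
        for entry in payload.get("history", [])
    ]


# Stand-in for the body of a response from the JSON endpoint.
sample_body = json.dumps(
    {
        "listing_id": 42,
        "history": [
            {"date": "2024-01-10", "price": 250000},
            {"date": "2024-06-02", "price": 245000},
        ],
    }
)

for item in parse_price_history(sample_body):
    print(item)
```

In a spider, the same function could be fed response.text from a plain scrapy.Request to the endpoint, with no rendering layer involved.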

Scrapy middlewares for headless browsers: How JavaScript rendering plugs in

The easiest way to think about Scrapy with JavaScript is this: you're not changing how the spider thinks; you're changing how certain requests get fetched.

That matters because many people treat Splash, Selenium, and Playwright as 3 distinct approaches to scraping. They're not. In Scrapy, they're different ways of plugging a browser-backed fetch step into the downloader layer. Once you look at it that way, the architecture becomes much easier to follow.

Start with the 2 layers that matter

There are 2 Scrapy extension points behind almost every JavaScript setup:

  • Downloader middleware: This sits around the normal request-and-response flow. It can inspect requests, modify headers, apply retries, attach proxies, or intercept specific request types and return different responses.
  • Download handler: This sits lower in the stack. It replaces the fetch mechanism itself. Instead of letting Scrapy's normal downloader make the request, it forwards it to a different backend.

The difference is important. Middleware changes or intercepts the flow. A download handler changes who is actually loading the page.

If the browser piece still feels abstract, this is a good place to pause and look at what a headless browser actually is.

Where Playwright fits

scrapy-playwright uses a download handler.

That's why it feels cleaner than older integrations. When you mark a request with:

meta={"playwright": True}

Scrapy routes that request through a Playwright-backed fetch path instead of the normal HTTP downloader. The spider callback still receives a Scrapy response. Your items, pipelines, exports, and duplicate filtering still work the same way.

So with Playwright, the change happens at the fetch layer, not in the spider logic.

Where Splash fits

scrapy-splash is built around a downloader middleware plus a custom request type, usually SplashRequest.

The middleware catches that request, rewrites it so it goes to the Splash service, waits for Splash to render the page, and then turns the result back into a normal Scrapy response.

So the spider still sees HTML and keeps parsing as usual. The extra work was done before the response arrived.

Where Selenium fits

scrapy-selenium follows the same general pattern as Splash.

It uses a downloader middleware and a custom request type, such as SeleniumRequest. The middleware intercepts that request, drives a browser through WebDriver, gets the rendered page source, and hands that HTML back to Scrapy.

Again, the spider code doesn't need to be converted to "Selenium code." The browser logic stays in the downloader layer.

You don't need 3 different mental models for Splash, Selenium, and Playwright.

The cleaner way to read the rest of this article is:

  • Keep the spider focused on extraction
  • Decide at the downloader layer which requests need browser rendering
  • Use middleware or a custom download handler to make that happen

That's what keeps Scrapy with JavaScript manageable. Your spider stays focused, the rendering logic stays in one place, and swapping one rendering approach for another doesn't mean rewriting the whole crawl.
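One way to keep that decision in a single place is a small downloader middleware that flags only certain URLs for rendering. This is a sketch under assumptions: the class name and URL patterns are hypothetical, it uses the classic process_request(request, spider) signature, and it relies on scrapy-playwright reading the "playwright" meta key when the request reaches the download handler. Stand-in request objects are used so the example runs without a Scrapy project.

```python
import re
from types import SimpleNamespace


class RenderRoutingMiddleware:
    """Downloader middleware sketch: flag only matching URLs for rendering."""

    # Hypothetical URL patterns for pages that need a browser.
    RENDER_PATTERNS = [re.compile(r"/listing/\d+"), re.compile(r"/js/")]

    def process_request(self, request, spider):
        if any(p.search(request.url) for p in self.RENDER_PATTERNS):
            # Assumption: the download handler checks this meta key later.
            request.meta.setdefault("playwright", True)
        # Returning None lets Scrapy continue processing the request.
        return None


# Stand-in objects that only mimic the two attributes used above.
middleware = RenderRoutingMiddleware()
plain = SimpleNamespace(url="https://example.com/search?page=2", meta={})
dynamic = SimpleNamespace(url="https://example.com/listing/123", meta={})

for req in (plain, dynamic):
    middleware.process_request(req, spider=None)
    print(req.url, req.meta)
```

With this shape, the spider yields ordinary requests and never needs to know which pages are rendered.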

Executing JavaScript in Scrapy with Splash

Splash is a lightweight rendering service that sits outside Scrapy and exposes a browser-like renderer over HTTP. In simple terms, it's a separate tool that loads a page for you and returns the rendered HTML to Scrapy. That becomes important when the raw HTML is missing the data you need.

Splash is still useful when a smaller rendering setup matters or when there's already a working Splash deployment. It's lighter than a full Playwright or Selenium setup, but it's also a narrower fit on newer frontend stacks and harder targets.

To run Splash locally, the easiest option is Docker Desktop. Docker Desktop is an app that lets you run software in isolated packages called containers. A container is a self-contained runtime for an app and its dependencies, so you don't have to install everything manually on your system.

On Windows, Docker Desktop usually depends on either WSL 2 or Hyper-V.

  • WSL 2 stands for Windows Subsystem for Linux. It lets Windows run Linux tools and services in the background.
  • Hyper-V is Microsoft’s built-in virtualization system. Virtualization means your computer can run isolated systems or environments inside the main operating system.

If Docker isn't installed yet, install Docker Desktop first. On Mac, make sure to pick the right installer for your chip, as there are separate versions for Intel and Apple Silicon (M1 and later). 

Once installation is complete, launch Docker Desktop and wait for it to finish starting up before moving on. Once it’s up and running, open a terminal and verify that Docker is working by running docker --version. It also helps to run docker run hello-world once to confirm Docker can actually start containers on your machine.

If both commands succeed, Docker is ready, and you can move on to running Splash locally with:

docker run -p 8050:8050 scrapinghub/splash

Then install scrapy-splash in your Scrapy environment:

pip install scrapy-splash

After that, wire it into settings.py:

SPLASH_URL = "http://localhost:8050"
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"

That configuration is what makes Splash behave like a proper part of the Scrapy request cycle rather than a one-off service call. Without it, installing the package isn't enough.

For simple rendered pages, the spider code stays short. Instead of yielding a normal scrapy.Request, yield SplashRequest, and give the page a short wait:

from scrapy_splash import SplashRequest
# Inside your spider class:
def start_requests(self):
    yield SplashRequest(
        url="https://example.com",
        callback=self.parse,
        args={"wait": 2},  # seconds Splash waits after the page loads
    )

That wait matters. If the page populates content after load, returning too early will give you the same empty-selector problem you were trying to fix.

Where Splash still has a niche is in scripted interaction. If the page needs a short scroll or a delayed render, you can send a Lua script through the execute endpoint:

function main(splash)
  splash:go(splash.args.url)
  splash:wait(2)
  splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
  splash:wait(1)
  return {html = splash:html()}
end

Then call it like this:

from scrapy_splash import SplashRequest
lua_script = """
function main(splash)
  splash:go(splash.args.url)
  splash:wait(2)
  splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
  splash:wait(1)
  return {html = splash:html()}
end
"""
# Inside your spider class:
def start_requests(self):
    yield SplashRequest(
        url="https://example.com",
        callback=self.parse,
        endpoint="execute",
        args={
            "lua_source": lua_script,
            "wait": 2,
        },
    )

That's enough for simpler dynamic pages where one scroll and one delay are all you need. Splash is lighter, but it's also less capable in modern JavaScript and less reliable on harder targets. Use it when you already run it, when you want a smaller container, or when the target is simple enough that Lua and a basic renderer are enough. If you're starting fresh, scrapy-playwright is the stronger default. And if keeping a renderer alive at all feels like unnecessary overhead, Decodo Web Scraping API moves that part out of your Scrapy stack entirely.

Executing JavaScript in Scrapy with Selenium

Selenium is the option most people already know before they get to Scrapy. It works, and it still has a place, but it isn't the recommended default for a new Scrapy project. The main reasons are maintenance and weight. The original scrapy-selenium package is old; newer wrappers such as scrapy-selenium4 exist, but the Selenium path is still heavier and less cleanly integrated with Scrapy than scrapy-playwright.

The basic setup still follows the same pattern. Install scrapy-selenium, add the driver settings in settings.py, and enable the middleware:

pip install scrapy-selenium

# settings.py
from shutil import which
SELENIUM_DRIVER_NAME = "chrome"
SELENIUM_DRIVER_EXECUTABLE_PATH = which("chromedriver")
SELENIUM_DRIVER_ARGUMENTS = ["--headless=new"]
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium.SeleniumMiddleware": 800,
}

That configuration is based on the package's documented settings. The only adjustment for a current Chrome setup is "--headless=new", since Selenium deprecated the older convenience headless path and Chrome's current headless mode uses the newer form. If you're using Firefox instead, the argument is still "-headless".

In the spider, the request shape is straightforward. Instead of yielding a normal scrapy.Request, you yield SeleniumRequest and tell it what to wait for. In the example below, the target is quotes.toscrape.com/js, a practice site that renders its content through JavaScript:

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
class QuotesSpider(scrapy.Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com/js"]
    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(
                url=url,
                callback=self.parse,
                wait_time=5,
                wait_until=EC.presence_of_element_located(
                    (By.CSS_SELECTOR, ".quote")
                ),
            )
    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

That is the basic pattern: pass wait_time and wait_until so Selenium waits for a selector that proves the page is ready instead of returning it too early, then parse the response like normal Scrapy HTML.

Selenium still makes the most sense when the page needs interaction. The scroll version of the same site at quotes.toscrape.com/scroll loads new quotes as you scroll down, which requires driving the browser directly. The callback can access the live driver through response.meta["driver"]. A single scroll loads one batch, so production use would loop this until no new content appears:

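import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
class ScrollQuotesSpider(scrapy.Spider):
    name = "scroll_quotes_spider"
    start_urls = ["https://quotes.toscrape.com/scroll"]
    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(
                url=url,
                callback=self.parse,
                wait_time=5,
                # wait_time is only applied together with wait_until
                wait_until=EC.presence_of_element_located(
                    (By.CSS_SELECTOR, ".quote")
                ),
            )
    def parse(self, response):
        driver = response.meta["driver"]
        before = len(driver.find_elements(By.CSS_SELECTOR, ".quote"))
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the next batch time to load before reading the page source
        WebDriverWait(driver, 5).until(
            lambda d: len(d.find_elements(By.CSS_SELECTOR, ".quote")) > before
        )
        selector = Selector(text=driver.page_source)
        for quote in selector.css(".quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }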
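The "loop until no new content appears" step can be sketched as a small helper that works on any WebDriver-like object. The function name, max_rounds, and pause are hypothetical choices for illustration, and the FakeDriver class only exists so the sketch runs without a browser.

```python
import time


def scroll_until_stable(driver, css=".quote", max_rounds=20, pause=1.0):
    """Scroll to the bottom until the number of matching elements stops growing.

    Returns the final element count. `max_rounds` and `pause` are arbitrary
    safety values for this sketch, not recommendations.
    """
    previous = -1
    for _ in range(max_rounds):
        # "css selector" is the locator strategy string behind By.CSS_SELECTOR.
        count = len(driver.find_elements("css selector", css))
        if count == previous:
            break  # nothing new loaded since the last scroll
        previous = count
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for the next batch to arrive
    return previous


class FakeDriver:
    """Test double that 'loads' 10 more items per scroll, up to 30."""

    def __init__(self):
        self.items = 10

    def find_elements(self, by, value):
        return [object()] * self.items

    def execute_script(self, script):
        self.items = min(self.items + 10, 30)


print(scroll_until_stable(FakeDriver(), pause=0))  # 30
```

In a callback, the helper would be called with response.meta["driver"] before reading driver.page_source. A fixed sleep is the simplest wait; an explicit WebDriverWait condition is usually more robust.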

The package also documents a built-in script argument on SeleniumRequest, so for simple one-shot JavaScript execution, you can keep that logic on the request itself instead of moving everything into the callback. The script runs after the page loads and after any wait_until condition is met. In the original scrapy-selenium middleware, wait_time only takes effect when wait_until is also set, and the page source is captured right after the script runs, so content that loads in response to the script may not be in the response yet:

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
class ScrollSpider(scrapy.Spider):
    name = "scroll_spider"
    start_urls = ["https://quotes.toscrape.com/scroll"]
    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(
                url=url,
                callback=self.parse,
                wait_time=3,
                wait_until=EC.presence_of_element_located(
                    (By.CSS_SELECTOR, ".quote")
                ),
                script="window.scrollTo(0, document.body.scrollHeight);",
            )
    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

The trade-off is cost. Selenium is the heaviest option here: it runs a browser process plus a WebDriver process, and it scales much worse once you raise concurrency. It's still a reasonable choice when you already have Selenium expertise, a working WebDriver setup, or browser extensions you need to keep using. If that is your situation, the Selenium guide, the WebDriver proxy guide, and the Puppeteer vs. Selenium comparison are the most relevant follow-ups.

For most new projects, scrapy-playwright is still the cleaner default. And if the browser layer is becoming the whole job, using the Decodo Web Scraping API can simplify things.

Executing JavaScript in Scrapy with Playwright

For a new Scrapy project, scrapy-playwright is usually the right place to start when a page needs JavaScript. It supports Chromium, Firefox, and WebKit, and it fits into Scrapy much more cleanly than the older browser integrations. Instead of pulling the whole crawl into a separate browser script, it plugs into Scrapy at the downloader layer. That keeps the split clear: normal requests stay normal, and only the requests that actually need rendering go through Playwright.

Start with the install:

pip install scrapy-playwright
playwright install

Then add the Playwright download handler in settings.py:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

The download handler tells Scrapy where to send browser-rendered requests. The reactor matters because Playwright is async-native, and Scrapy needs an asyncio-compatible reactor for that integration to work properly.

Once that is in place, Playwright is enabled per request with meta={"playwright": True}.

Here's a small JavaScript-rendered example using https://quotes.toscrape.com/js/. It works well as a first Playwright example because it makes one thing clear: the quotes appear only after the page has finished rendering.

import scrapy
from scrapy_playwright.page import PageMethod
class QuotesSpider(scrapy.Spider):
    name = "quotes_spider"
    async def start(self):
        yield scrapy.Request(
            "https://quotes.toscrape.com/js/",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", ".quote"),
                ],
            },
            callback=self.parse,
        )
    def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "text": quote.css(".text::text").get(),
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall(),
            }

In Scrapy 2.13 and newer, async def start() is the current way to generate initial requests. If you're working with an older project, start_requests() still works, which is why our earlier Selenium and Splash examples use that style.

The meta block is where the Playwright behavior is defined:

meta={
    "playwright": True,
    "playwright_page_methods": [
        PageMethod("wait_for_selector", ".quote"),
    ],
}

Each part does a separate job:

  • playwright=True tells Scrapy to render this request through Playwright
  • playwright_page_methods runs browser actions before the response is passed to the callback

In this case, Playwright waits until at least one .quote element appears. Once that happens, Scrapy receives the rendered HTML, and the spider can keep using normal CSS selectors.

That is the main reason scrapy-playwright works so well as a default: after rendering, the spider still looks like Scrapy.

Useful PageMethod patterns

The most common browser action is simply waiting for content:

PageMethod("wait_for_selector", ".listing")

Use that when the page shell loads first, and the useful data appears later.

If the page needs more than a simple wait before parsing, extra browser actions can be added through additional PageMethod calls.

If the page needs a scroll before the next batch of content appears, add a scroll action first, then wait for a selector that confirms the new content has loaded. This example assumes 10 items load at a time, so an 11th item means a new batch has arrived:

PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)")
PageMethod("wait_for_selector", "div.quote:nth-child(11)")

It's also possible to execute inline JavaScript directly in the page:

PageMethod("evaluate", "window.INITIAL_STATE")

That's useful when the data is already available in a browser-side JavaScript variable rather than in the DOM. Many sites load a state object first and render the visible interface from that. Reading the variable directly is often simpler and more reliable than scraping the rendered table or list.
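When the state object is assigned in an inline script tag, it is sometimes already present in the raw HTML, in which case it can be read without a browser at all. The sketch below assumes the assignment is plain JSON; extract_state and the sample markup are hypothetical, and a state object built at runtime would still need the evaluate approach above.

```python
import json
import re


def extract_state(html: str, variable: str = "window.INITIAL_STATE"):
    """Return the JSON value assigned to `variable` in an inline script, or None."""
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it,
        # so nested braces and trailing JavaScript are not a problem.
        state, _ = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None  # the assignment was not plain JSON
    return state


# Made-up markup standing in for response.text.
sample_html = (
    "<html><head><script>"
    'window.INITIAL_STATE = {"page": 1, "quotes": '
    '[{"author": "Jane Austen", "tags": ["humor", "books"]}]};'
    "console.log('ready');"
    "</script></head><body><div id='app'></div></body></html>"
)

state = extract_state(sample_html)
print(state["quotes"][0]["author"])  # Jane Austen
```

If this returns None on a real page, the variable is probably populated by JavaScript after load, and rendering is the fallback.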

When the live Playwright page is actually needed

Most JavaScript-rendered pages don't need the live Playwright page object. In many cases, PageMethod is enough: wait for a selector, maybe run one browser action, then let Scrapy parse the rendered HTML as usual.

The live page object is for the cases where rendering alone isn't enough. That usually means a page that needs browser interaction after navigation, such as clicking, typing, multi-step workflows, or infinite scrolling. In those cases, add playwright_include_page=True and switch to an async callback so the spider can work directly with the Playwright page.

Infinite scrolling is a good example because it shows why the live page object exists in the first place. The page doesn't just need time to render. It needs repeated browser actions, checks for newly loaded content, and a clean shutdown when the work is done.

import scrapy
class QuotesScrollSpider(scrapy.Spider):
    name = "quotes_scroll"
    async def start(self):
        yield scrapy.Request(
            "https://quotes.toscrape.com/scroll",
            meta={
                "playwright": True,
                "playwright_include_page": True,
            },
            callback=self.parse,
            errback=self.errback,
        )
    async def parse(self, response):
        page = response.meta["playwright_page"]
        # Wait for the first batch of quotes
        await page.wait_for_selector(".quote")
        previous_count = 0
        # Keep scrolling until the number of quotes stops growing
        while True:
            quotes = page.locator(".quote")
            count = await quotes.count()
            if count == previous_count:
                break
            previous_count = count
            await page.evaluate(
                "window.scrollTo(0, document.body.scrollHeight)"
            )
            await page.wait_for_timeout(1000)
        quotes = page.locator(".quote")
        for i in range(await quotes.count()):
            quote = quotes.nth(i)
            yield {
                "text": await quote.locator(".text").inner_text(),
                "author": await quote.locator(".author").inner_text(),
                "tags": await quote.locator(".tag").all_inner_texts(),
            }
        # Close the page so it stops counting toward the context's page limit
        await page.close()
    async def errback(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page and not page.is_closed():
            await page.close()

A few details matter here.

First, the spider waits for the first .quote element before doing anything else. That confirms the page has rendered the initial batch of content.

Then it keeps scrolling until the number of quotes stops increasing. That's a useful pattern in this example. It doesn't assume a fixed number of scrolls or a fixed number of pages. It keeps going until no new items appear.

It also stays inside Playwright’s own locator API the whole time. That is cleaner than turning the updated page back into another Scrapy response after every scroll, and it makes the point of playwright_include_page=True much clearer: this is for real browser interaction, not just for grabbing one rendered snapshot.

The last important line is await page.close(). Open Playwright pages count toward the page limit inside each browser context. If pages are left open, the crawl can stall once that limit is reached. That is why the same cleanup appears in errback as well: if the request fails, the browser page still needs to be closed.

That is also the rule for using playwright_include_page=True in general: only turn it on when the callback really needs the live page, and close the page when the interaction is finished.
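One way to make the cleanup hard to forget is to wrap the page work in try/finally. The helper below is a sketch, not part of scrapy-playwright: use_page is a hypothetical name, and FakePage only mimics the two page methods used here (is_closed and close) so the example runs without a browser.

```python
import asyncio


async def use_page(page, work):
    """Run `work(page)` and close the page afterwards, even if `work` raises."""
    try:
        return await work(page)
    finally:
        if not page.is_closed():
            await page.close()


class FakePage:
    """Minimal stand-in for a Playwright page, for demonstration only."""

    def __init__(self):
        self._closed = False

    def is_closed(self):
        return self._closed

    async def close(self):
        self._closed = True


async def failing_work(page):
    # Simulates an interaction that blows up halfway through.
    raise RuntimeError("selector never appeared")


async def demo():
    page = FakePage()
    try:
        await use_page(page, failing_work)
    except RuntimeError:
        pass
    return page.is_closed()


print(asyncio.run(demo()))  # True
```

In a spider callback, the work function could collect items into a list, and the callback would yield them after use_page returns.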

Initializing pages and intercepting requests

Some setup belongs before the page even starts loading. That is what playwright_page_init_callback is for.

A common use case is blocking heavy assets:

import scrapy
async def init_page(page, request):
    # Abort requests for images, fonts, and stylesheets before they load
    await page.route(
        "**/*.{png,jpg,jpeg,gif,woff,woff2,css}",
        lambda route: route.abort(),
    )
class LightweightSpider(scrapy.Spider):
    name = "lightweight"
    async def start(self):
        yield scrapy.Request(
            "https://example.com",
            meta={
                "playwright": True,
                "playwright_page_init_callback": init_page,
            },
            callback=self.parse,
        )
    def parse(self, response):
        yield {
            "title": response.css("title::text").get(),
        }

This callback runs after scrapy-playwright creates the page, but before the request is made. That makes it the right place for setup work such as request routing, blocking images, or adding scripts before navigation.
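Blocking by file extension misses assets served from URLs without an extension. An alternative sketch is to decide by Playwright's resource type on each routed request. BLOCKED_TYPES and the function names are illustrative choices, and FakeRoute is only a stand-in so the handler can be exercised without a browser.

```python
import asyncio
from types import SimpleNamespace

# Resource types to drop; an illustrative choice, adjust per target.
BLOCKED_TYPES = {"image", "font", "media", "stylesheet"}


async def block_heavy_assets(route):
    """Route handler: abort heavy asset requests, let everything else through."""
    if route.request.resource_type in BLOCKED_TYPES:
        await route.abort()
    else:
        await route.continue_()


async def init_page(page, request):
    # Intended for use as playwright_page_init_callback.
    await page.route("**/*", block_heavy_assets)


class FakeRoute:
    """Stand-in that records which action the handler took."""

    def __init__(self, resource_type):
        self.request = SimpleNamespace(resource_type=resource_type)
        self.action = None

    async def abort(self):
        self.action = "abort"

    async def continue_(self):
        self.action = "continue"


async def demo():
    routes = [FakeRoute("image"), FakeRoute("xhr"), FakeRoute("document")]
    for route in routes:
        await block_heavy_assets(route)
    return [route.action for route in routes]


print(asyncio.run(demo()))  # ['abort', 'continue', 'continue']
```

Blocking stylesheets can change layout-dependent behavior on some sites, so it is worth confirming the target still renders the data you need.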

The same method can also be used to watch for a specific XHR or API request:

async def init_page(page, request):
    async def handle_route(route):
        url = route.request.url
        if "/api/listings" in url:
            print("Saw listings API request:", url)
        # Let the request proceed unchanged
        await route.continue_()
    await page.route("**/api/listings**", handle_route)

On real sites, such interception often indicates that browser rendering may not be necessary for the entire crawl. If Playwright shows that the page is actually retrieving its data from a JSON API, scraping that API directly with standard Scrapy requests is usually faster and more reliable.

scrapy-playwright is a strong default because it uses modern browser engines, fits Scrapy’s async model, and lets the project keep Scrapy’s normal parsing, item pipelines, middleware, throttling, and feed exports. Selenium still has a place in certain browser automation workflows, and Splash still appears in older deployments. Still, for a new JavaScript-heavy Scrapy project, Playwright is usually the cleaner starting point.

It's still worth keeping the bigger rule in mind: browser rendering is expensive. Use it where it's needed, not everywhere. If the data is available via an API, use it. If the target blocks headless Chromium immediately, pair scrapy-playwright with rotating residential IPs instead of trying to solve every blocking problem inside the browser layer.

Splash vs. Selenium vs. Playwright vs. managed API: Which one for your Scrapy spider?

This is the short version: if you're starting fresh, use scrapy-playwright. It's the newest option in the Scrapy browser-rendering ecosystem, with version 0.0.47 released on PyPI in June 2026, and it supports Chromium, Firefox, and WebKit. scrapy-splash is still maintained and relevant, with a February 2025 PyPI release, but it's a more limited fit for new projects.

The better question isn't "which library is popular." It's which one fits the job.

Option | Setup overhead | Modern JS support | Concurrency ceiling | Anti-bot resilience | Maintenance status | Cost | Best fit
scrapy-splash | Low to medium | Limited | Moderate | Weak | Older | Low | Existing Splash deployments, small rendering jobs
scrapy-selenium | Medium to high | Good | Low | Weak to moderate | Loosely maintained wrapper | Medium | Teams that already depend on WebDriver
scrapy-playwright | Medium | Strong | Good | Better than the older Scrapy browser integrations | Active | Medium | New Scrapy projects that need browser rendering
Managed API (Decodo Web Scraping API) | Low | Strong | High from the Scrapy side | Stronger by design because rendering, proxy rotation, and unblock layers sit outside your spider | Managed | Usage-based | Hard targets or teams that don't want to manage browsers

Use scrapy-playwright when the crawl is small to medium, and the JavaScript is real

This is the default for most new Scrapy spiders. It's actively maintained, uses modern browser engines, and plugs into Scrapy as a download handler instead of forcing a separate browser workflow into the spider. It also supports the current versions of Scrapy and Python directly. If the target needs rendering but isn't a full anti-bot war, this is usually the right answer.

Keep Splash when it's already in production, and the target is simple

Splash still makes sense when the deployment already exists, the targets are relatively light, and the small container footprint matters. It isn't abandoned, but it isn’t the strongest default for a new JavaScript-heavy Scrapy spider. The better way to frame it is as a practical option for simpler rendered pages and existing Lua-based workflows, not as the first choice for modern React-heavy or aggressively protected targets.

Use scrapy-selenium when the team already runs WebDriver

This is mostly a compatibility choice now. If the team already has Selenium expertise, existing browser automation, or a browser extension setup that needs to stay in place, Selenium can still be the shortest path; however, it remains the heaviest option in this group once concurrency increases. 

Use a managed API when the browser layer is becoming the actual project

This is the point where rendering, proxies, CAPTCHA, and unblock logic take longer than parsing. In that case, a managed layer is often the cleaner architecture. Decodo Web Scraping API is the fourth option here for exactly that reason: it handles rendering and delivery outside the spider, allowing Scrapy to stay focused on request routing and extraction.

The short recommendation

  • Small crawl + simple JavaScript: use scrapy-playwright
  • Legacy Splash deployment: keep Splash
  • Existing Selenium estate: use scrapy-selenium
  • Heavy anti-bot targets or no interest in managing browsers: use Decodo Web Scraping API

That's the actual decision path. If the broader Python landscape still matters, comparing Python web scraping libraries is the right place to zoom out.


Using Scrapy cache and concurrency to scrape faster

Once browser rendering enters the pipeline, throughput drops fast. The fix isn't just "raise CONCURRENT_REQUESTS." The useful gains usually come from 3 places: cache rendered responses when you're developing, cap browser-side parallelism to what the machine can actually hold, and keep as many requests as possible out of the browser in the first place. Scrapy's own settings cover the crawl side, and scrapy-playwright adds a second set of limits for browser contexts and pages.

Cache rendered pages during development

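If rendering a request takes a few seconds, re-rendering the same page every time you adjust a selector is a waste of time. Scrapy's HTTP cache can store responses on disk: the built-in FilesystemCacheStorage handles the storage, and the RFC2616 policy decides what to cache based on the response's HTTP caching headers. If a target sends no-store headers, Scrapy's default DummyPolicy, which caches every response, may be the better fit while you're developing.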

A practical starting point in settings.py looks like this:

HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_EXPIRATION_SECS = 3600

That doesn't make a target faster on the first hit, but it does make parser iteration much cheaper on re-runs. For a Playwright-backed spider, that matters because you can keep working on extraction logic without paying the browser cost again for unchanged pages. For a refresher on the framework side, the main Scrapy guide is still the right reference.
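One way to keep the cache a development-only convenience is to gate it behind an environment variable. This is a minimal sketch, not a required pattern: the SCRAPY_DEV_CACHE variable and the cache_settings helper are hypothetical names made up for illustration, while the HTTPCACHE_* keys are the same Scrapy settings shown above.

```python
import os


def cache_settings(dev_mode: bool) -> dict:
    """Return HTTP cache settings; caching is switched on only in development."""
    if not dev_mode:
        return {"HTTPCACHE_ENABLED": False}
    return {
        "HTTPCACHE_ENABLED": True,
        "HTTPCACHE_STORAGE": "scrapy.extensions.httpcache.FilesystemCacheStorage",
        "HTTPCACHE_POLICY": "scrapy.extensions.httpcache.RFC2616Policy",
        "HTTPCACHE_EXPIRATION_SECS": 3600,
    }


# SCRAPY_DEV_CACHE is a hypothetical variable name chosen for this sketch.
DEV_MODE = os.environ.get("SCRAPY_DEV_CACHE", "0") == "1"

# In settings.py, this promotes the chosen values to module-level settings.
globals().update(cache_settings(DEV_MODE))
```

Running the spider with SCRAPY_DEV_CACHE=1 then reuses stored responses, and a production run without the variable always fetches fresh pages.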

Don't tune concurrency as if every request were plain HTTP

CONCURRENT_REQUESTS still controls overall Scrapy parallelism, but with browser-backed requests, the real bottleneck is usually the browser, not the downloader. In scrapy-playwright, PLAYWRIGHT_MAX_CONTEXTS limits concurrent browser contexts, and PLAYWRIGHT_MAX_PAGES_PER_CONTEXT limits open pages per context. By default, the pages-per-context limit inherits Scrapy's CONCURRENT_REQUESTS value, which can be too loose if RAM is tight.

A safer baseline is:

CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4
PLAYWRIGHT_MAX_CONTEXTS = 4
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0

That keeps Scrapy reasonably busy without letting a single target or browser process run away with the machine. Scrapy's docs also note that DOWNLOAD_DELAY can reduce the effective per-domain concurrency below CONCURRENT_REQUESTS_PER_DOMAIN, so raising limits mindlessly doesn't always translate into real throughput. And once browsers are in the loop, AutoThrottle tends to work better than fixed sleeps because it responds to actual latency rather than guessing.
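To sanity-check a baseline like this against the machine, it helps to work out how many pages can be open at once. The sketch below is back-of-the-envelope arithmetic, assuming each in-flight browser request holds one page and pages are closed after use. The 150 MB per page figure is a placeholder assumption, not a measured value, so replace it with what you observe on your own targets.

```python
def max_browser_pages(concurrent_requests: int, max_contexts: int,
                      pages_per_context: int) -> int:
    """Rough upper bound on simultaneously open pages.

    Scrapy never has more than concurrent_requests in flight, and the
    browser never holds more than max_contexts * pages_per_context pages.
    """
    return min(concurrent_requests, max_contexts * pages_per_context)


def estimated_browser_ram_mb(open_pages: int, mb_per_page: int = 150) -> int:
    """Very rough RAM estimate; mb_per_page is an assumed placeholder."""
    return open_pages * mb_per_page


pages = max_browser_pages(concurrent_requests=16, max_contexts=4, pages_per_context=4)
print(pages)                            # 16
print(estimated_browser_ram_mb(pages))  # 2400
```

If the estimate exceeds what the host can spare, lowering PLAYWRIGHT_MAX_PAGES_PER_CONTEXT is usually the first dial to turn.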

Skip rendering whenever possible

This is still the biggest speed lever; if a page is really just calling a JSON endpoint, route that request through normal Scrapy and leave Playwright out of it. That cuts browser time, memory use, and failure surface all at once.

In practice, that often means doing something like this in the spider:

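# JSON endpoint: plain Scrapy request, no browser involved
yield scrapy.Request(api_url, callback=self.parse_api)

# Client-rendered listing: send this one through Playwright
yield scrapy.Request(
    listing_url,
    callback=self.parse_listing,
    meta={"playwright": True},
)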

The same rule applies within a single crawl: not every URL on the site belongs in the browser queue. Keep Class A requests out of Playwright whenever possible.
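For crawls with many URL shapes, that routing decision can live in one small helper instead of being repeated in every callback. This is a sketch under stated assumptions: the /api/ prefix and the needs_browser and request_meta names are hypothetical, and the rule for what counts as an API request depends entirely on the target site.

```python
from urllib.parse import urlparse

# Hypothetical path prefixes for this sketch; adjust them to the target site.
API_PREFIXES = ("/api/",)


def needs_browser(url: str) -> bool:
    """Return False for URLs that look like JSON endpoints, True otherwise."""
    path = urlparse(url).path
    return not path.startswith(API_PREFIXES)


def request_meta(url: str) -> dict:
    """Build the Request meta: enable Playwright only when it is needed."""
    return {"playwright": True} if needs_browser(url) else {}


# Usage inside a spider callback:
#     yield scrapy.Request(url, callback=self.parse, meta=request_meta(url))
```

Keeping the rule in one place also makes it easy to log how many requests actually reach the browser.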

Block heavy assets the current way

The current scrapy-playwright docs are cautious about hand-rolled request interception. The plugin uses Page.route internally and explicitly warns against calling Page.route yourself unless you know exactly what you're doing. If you do add your own route, the safer current hook is playwright_page_init_callback, which runs your initialization logic on each new page before navigation.

A practical example looks like this:

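async def init_page(page, request):
    # Abort image and font requests before the page starts loading
    await page.route(
        "**/*.{png,jpg,jpeg,gif,webp,woff,woff2}",
        lambda route: route.abort(),
    )

# Inside the spider callback that schedules the request:
yield scrapy.Request(
    url,
    meta={
        "playwright": True,
        "playwright_page_init_callback": init_page,
    },
)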

That's a cleaner way to cut render time on pages where the text or table data matters and the media doesn't. If the crawl is large enough that this starts to matter, the next useful read is web scraping without getting blocked.
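scrapy-playwright also documents a settings-level alternative that avoids touching Page.route at all: PLAYWRIGHT_ABORT_REQUEST, a predicate that receives each Playwright request and returns True when it should be aborted. The sketch below assumes that setting behaves as described in the plugin's README, so verify it against the version you run. The BLOCKED_RESOURCE_TYPES set is an illustrative choice, not a recommendation for every site.

```python
# Resource types to skip; Playwright exposes these via request.resource_type.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def should_abort_request(request) -> bool:
    """Return True for requests the browser should not download."""
    return request.resource_type in BLOCKED_RESOURCE_TYPES


# In settings.py:
PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```

Filtering on resource type rather than file extension also catches images served from URLs without an extension.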

The practical rule

When JavaScript rendering slows the crawl down, the first move should be:

  • cache what can be cached
  • lower browser-side parallelism to something the machine can hold
  • keep API-backed requests out of the browser
  • block assets you don't need

That's usually enough to get a Playwright-enabled spider back into a usable range without turning the whole settings file into a guessing game.

Handling blocks when running Scrapy with JavaScript

Adding a browser solves rendering, but it also makes the crawl easier to fingerprint. A plain headless Chromium session is often more obvious than a normal Scrapy request because the browser exposes automation signals like navigator.webdriver, and anti-bot systems now look at more than just the DOM. They look at browser fingerprints, request patterns, and network behavior, too. That's why JavaScript-enabled spiders often get blocked more, not less. If you want the broader background, the best companion reads here are anti-bot systems, web scraping without getting blocked, and anti-scraping techniques.

The first layer is browser hygiene. Rotate a realistic user agent, vary the viewport slightly, and avoid sending the same browser profile across contexts. Playwright supports userAgent and viewport at the browser-context level (user_agent and viewport in the Python API), so that part is straightforward. It's also where community stealth layers come in. Tools like playwright-stealth exist, but even their own package description is careful: it's a proof-of-concept starting point, not a guarantee that modern detection will disappear. Rebrowser-style patches exist as well, but they move quickly and aren't the same thing as official Playwright support. That makes them useful as tactical tools, not something to build a whole crawler strategy around.

In practice, the mitigation stack usually looks like this:

  • rotate user agents per browser context
  • vary viewport sizes instead of using one fixed default
  • slow the crawl down enough to avoid looking machine-perfect
  • use a stealth layer only if the target really needs it, and expect to re-test it regularly

A simple scrapy-playwright context setup can look like this:

meta = {
    "playwright": True,
    # Context name; the kwargs below apply when this context is first created
    "playwright_context": "new",
    "playwright_context_kwargs": {
        "user_agent": ua_string,
        "viewport": {"width": 1366, "height": 768},
    },
}

That isn't "undetectable." It just removes a few cheap tells and makes the browser side less uniform.
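Because scrapy-playwright reuses a context once a given name exists, rotating profiles means rotating context names as well. The following is a minimal sketch of that idea: the user-agent strings are placeholders rather than real browser strings, and PROFILES and next_context_meta are hypothetical names introduced only for illustration.

```python
import itertools

# Placeholder profiles for this sketch; use real, current user-agent strings.
PROFILES = [
    {"user_agent": "Mozilla/5.0 (placeholder UA 1)",
     "viewport": {"width": 1366, "height": 768}},
    {"user_agent": "Mozilla/5.0 (placeholder UA 2)",
     "viewport": {"width": 1440, "height": 900}},
    {"user_agent": "Mozilla/5.0 (placeholder UA 3)",
     "viewport": {"width": 1536, "height": 864}},
]

_profile_cycle = itertools.cycle(enumerate(PROFILES))


def next_context_meta() -> dict:
    """Return Request meta that pins the request to the next profile's context."""
    index, profile = next(_profile_cycle)
    return {
        "playwright": True,
        # One named context per profile, so each set of kwargs is actually used.
        "playwright_context": f"profile-{index}",
        "playwright_context_kwargs": dict(profile),
    }


# Usage inside a spider callback:
#     yield scrapy.Request(url, meta=next_context_meta())
```

Keeping the number of profiles at or below PLAYWRIGHT_MAX_CONTEXTS is a sensible precaution, since named contexts stay open until they are closed.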

The second layer is the proxy layer. scrapy-playwright supports proxies through PLAYWRIGHT_LAUNCH_OPTIONS, which is the cleanest place to route the browser through an external IP. For normal crawling, rotating residential IPs are the safer default because they spread requests out and appear less synthetic than hammering a target from a single address. For login-heavy or multi-step flows, session stickiness matters more than raw rotation. That's where a sticky residential or ISP session is a better fit than changing IPs on every request.

A basic launch configuration looks like this:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://proxy.example:3128",
        "username": "user",
        "password": "pass",
    }
}
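In a real project, the credentials usually shouldn't sit in settings.py. One possible arrangement, sketched here with hypothetical environment variable names (PROXY_SERVER, PROXY_USERNAME, PROXY_PASSWORD) and a hypothetical helper, builds the same launch options from the environment and falls back to no proxy when nothing is configured.

```python
import os


def proxy_launch_options(env=os.environ) -> dict:
    """Build Playwright launch options from environment variables.

    The variable names are assumptions made for this sketch.
    """
    server = env.get("PROXY_SERVER")
    if not server:
        return {}  # no proxy configured: launch the browser directly
    proxy = {"server": server}
    if env.get("PROXY_USERNAME"):
        proxy["username"] = env["PROXY_USERNAME"]
        proxy["password"] = env.get("PROXY_PASSWORD", "")
    return {"proxy": proxy}


# In settings.py:
PLAYWRIGHT_LAUNCH_OPTIONS = proxy_launch_options()
```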

If the target is sensitive enough that the browser layer keeps getting burned, Decodo Residential proxies are a more natural fit than stretching datacenter IPs beyond what the site will tolerate.

The last part is knowing when to stop. Once the crawl is spending more time on stealth patches, browser quirks, and proxy failures than on extraction logic, the architecture has usually tipped too far. A simple rule of thumb is this: if a target is still failing on more than about one in five requests after reasonable browser tuning and a proper proxy layer, it's probably time to stop fighting that target inside the spider and move it to a managed rendering layer instead. That's the point of the Decodo Web Scraping API: keep the Scrapy pipeline, but push rendering, proxy rotation, and CAPTCHA handling out of the spider and into one endpoint.
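The one-in-five rule of thumb is easy to turn into a check against crawl stats. This is only a sketch: the should_hand_off name and the min_samples guard are assumptions added for illustration, and the 0.2 threshold simply mirrors the rough figure above rather than a tuned value.

```python
def should_hand_off(failed: int, total: int,
                    threshold: float = 0.2, min_samples: int = 50) -> bool:
    """Return True when the failure rate suggests moving the target elsewhere.

    min_samples avoids reacting to a handful of early failures.
    """
    if total < min_samples:
        return False
    return failed / total > threshold


print(should_hand_off(failed=30, total=100))  # True
print(should_hand_off(failed=10, total=100))  # False
```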

Final thoughts

Running Scrapy with JavaScript is mostly a routing problem. First, confirm the page actually needs rendering, since many "dynamic" pages just call a JSON endpoint and should stay on plain Scrapy. When a browser is necessary, scrapy-playwright is the best default for new projects; Splash fits when a small container matters or it's already in production, and Selenium when you're reusing an existing WebDriver setup. After that, keep the crawl efficient: cache what you can, let AutoThrottle work, block assets you don't need, and limit browser-backed requests to pages that truly need them. And once blocks, browser upkeep, and proxy issues eat more time than parsing, stop treating it as a spider problem. A managed rendering layer like the Decodo Web Scraping API is built for that handoff.


About the author

Vilius Sakutis

Head of Partnerships

Vilius leads performance marketing initiatives with expertise rooted in affiliates and SaaS marketing strategies. Armed with a Master's in International Marketing and Management, he combines academic insight with hands-on experience to drive measurable results in digital marketing campaigns.

Connect with Vilius via LinkedIn

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Can Scrapy execute JavaScript on its own?

No. Scrapy can download the response and parse it, but it doesn't run browser JavaScript on its own. That's why tools like scrapy-playwright, scrapy-splash, and scrapy-selenium exist.

Is scrapy-playwright better than scrapy-splash in 2026?

For most new projects, yes. It's a better fit for modern JavaScript-heavy sites and a cleaner match for Scrapy's async model. Splash still has a place when the target is simple, the container footprint matters, or there's already a working Splash setup in production.

How do I scrape infinite scroll pages with Scrapy?

Use a browser-backed request, wait for the content you need, then trigger scroll actions before reading the updated page source or DOM. With scrapy-playwright, that usually means PageMethod calls or an async callback using the page object.
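As a rough illustration of the async-callback route, the helper below scrolls until the page height stops growing. It assumes the request was sent with "playwright": True and "playwright_include_page": True so the page object is available in the callback. The scroll_to_bottom name, the round limit, and the pause length are assumptions for this sketch, not values from scrapy-playwright.

```python
async def scroll_to_bottom(page, max_rounds: int = 10, pause_ms: int = 500) -> int:
    """Scroll until the page height stops growing; return the number of scrolls."""
    previous_height = await page.evaluate("document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(pause_ms)  # give new items time to load
        height = await page.evaluate("document.body.scrollHeight")
        if height == previous_height:
            return round_number  # nothing new appeared, so stop
        previous_height = height
    return max_rounds


# Usage inside a spider callback:
#     async def parse(self, response):
#         page = response.meta["playwright_page"]
#         await scroll_to_bottom(page)
#         html = await page.content()
#         await page.close()
```

The round limit matters on feeds that never end; without it, the callback would keep a browser page open indefinitely.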

Why are my Scrapy selectors returning empty lists?

Usually, one of three things is happening: the page is client-rendered, the selector is wrong, or Scrapy is receiving a different response than the browser. Compare response.text with view-source: and DevTools before assuming you need a browser.

Should I use Scrapy or just Playwright for JavaScript-heavy sites?

Use Scrapy when the crawl has real volume and needs structure: retries, deduplication, pipelines, throttling, and storage. Use plain Playwright when the job is small or mostly interactive browser automation. If every page needs a browser and there's barely any crawl logic, Scrapy may not be buying much.


© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved