Scrapy With JavaScript: How To Scrape Dynamic Sites Without Losing Your Pipeline
Scrapy is an asynchronous Python framework for crawling and extracting data at scale, but it doesn't execute JavaScript on its own. A spider can get a clean 200 response and still return empty selectors on a modern site. This guide covers the rendering options (Splash, Selenium, Playwright, managed APIs) and the cache and concurrency settings that matter once browser rendering comes into play.
Vilius Sakutis
Last updated: Jul 01, 2026

TL;DR
- Use regular Scrapy first, and confirm the page actually needs JavaScript rendering before adding a browser layer.
- Start with scrapy-playwright if you're building fresh. It fits Scrapy's workflow cleanly and is built around a download handler instead of a full rewrite.
- Skip rendering when the site is really just calling a JSON endpoint. Hitting the API directly is cheaper and faster than running a browser.
- Move the hardest targets to a managed rendering layer when browser maintenance and blocking start eating more time than parsing.
What is Scrapy?
Scrapy is a Python framework built for crawling websites and extracting structured data at scale. It runs asynchronously, which is why it handles large crawls so well. If you're new to web scraping and want the basics first, start with this Scrapy guide.
The useful thing about Scrapy in this context is that it's, first and foremost, a crawling and data pipeline framework. Your spider holds the logic; the downloader fetches responses; pipelines handle post-processing; and middlewares are where rendering tools plug in. That's what makes JavaScript support manageable: you're not throwing Scrapy away; you're swapping in a different way to fetch certain pages. By itself, Scrapy isn't a browser. It won't execute JavaScript, wait for XMLHttpRequest (XHR) requests, or interact with a React page out of the box. And if you're still deciding whether Scrapy is even the right tool for the job, this Scrapy vs. BeautifulSoup comparison is worth a look.
Is Scrapy good for web scraping?
Scrapy makes the most sense on jobs with real volume: thousands of URLs, repeated crawl runs, and a pipeline that does more than just fetch HTML and print it. That's also where it works well with JavaScript. You can keep Scrapy handling request flow, retries, deduplication, throttling, and output, then add a rendering layer only where the target actually needs it. If you want to compare that approach against other Python options, this overview of Python web scraping libraries is a useful reference.
It's a weaker fit for small jobs. If you only need to fetch a handful of pages once, a full Scrapy project is usually more setup than the job needs. It also starts to lose its edge when every page has to go through a browser, or when the target is built around heavy interaction from the start: login flows, CAPTCHA walls, or browser checks on every step. In those cases, the rendering overhead starts to eat into the speed advantage that makes Scrapy useful in the first place.
A job board is a good example. Say the listing pages are paginated and mostly server-rendered, but some extra details load through XHR. That's a good Scrapy case. You let Scrapy handle the crawl and only route the JavaScript-dependent pages through a rendering tool. That's usually cleaner than writing the entire crawl as a single Playwright script.
So the practical answer is this: if you're scraping a few thousand listings, product pages, or articles, even with some JavaScript mixed in, Scrapy is still a good choice. If you're scraping ten pages once, use something lighter. And if you're still comparing approaches at the Python level, this broader guide to Python web scraping is the right next read.
Scraping dynamic websites with Scrapy: Why selectors return nothing
This is the point where many Scrapy projects go sideways. The spider gets a 200 response, the request looks fine, and every selector comes back empty. Sometimes that means the page needs to render JavaScript. Sometimes it means the selector is wrong. Sometimes the site gave you a different page than the one you saw in the browser. It's worth checking which one you're dealing with before you add a browser layer.
The first check is simple: compare what Scrapy got with what the browser rendered. Look at response.text in the spider, or fetch the page from the command line with scrapy fetch --nolog <url>. Then compare that raw HTML with the Elements panel in DevTools. If the data is present in DevTools but missing from Scrapy's response, the browser is doing extra work after the initial request.
The second check is the Network tab. Many "JavaScript-heavy" pages don't build their data in the DOM from scratch. They call a JSON endpoint after load. If you can find that XHR or fetch request, you can often skip rendering entirely and call the API directly from Scrapy. That's usually faster, simpler, and cheaper than running a browser for every page. This is the same basic idea covered in the guide to dynamic content scraping.
The third check is view-source:. If the text you want is missing from the page source but visible in the live page, the content is client-rendered. If the text is already in the source, the problem is probably not JavaScript. It's more likely a bad selector, a timing assumption, or the wrong extraction method. If that part is shaky, it helps to revisit how to choose XPath vs. CSS selectors.
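The raw-HTML comparison from these checks can be sketched as a tiny helper. This is a minimal illustration, not part of Scrapy: the function name and both sample HTML strings are hypothetical, and in a real spider you would pass response.text instead.

```python
def text_in_raw_html(raw_html: str, expected_text: str) -> bool:
    """Return True if text seen in the browser is already in the raw HTML."""
    return expected_text.casefold() in raw_html.casefold()


# Hypothetical raw responses: a client-rendered shell and a server-rendered page.
shell_html = '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>'
server_html = '<html><body><div class="quote">Be yourself</div></body></html>'

# False means the browser adds the text after load; True points at a selector problem.
print(text_in_raw_html(shell_html, "Be yourself"))
print(text_in_raw_html(server_html, "Be yourself"))
```

If the helper returns False for text you can see in the browser, the Network tab is the next stop before any browser layer gets added.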
In practice, most targets fall into 3 groups.
- Class A: API-backed pages. The page shell loads first, then the real data arrives through XHR or fetch. In this case, don't render unless necessary. Call the API directly from Scrapy.
- Class B: server-rendered pages with hydration. Most of the HTML is already there, and JavaScript only adds interactivity. In this case, regular Scrapy often works fine. Check the raw response before adding anything heavier.
- Class C: fully client-rendered pages. The raw HTML is mostly a shell, often something like a div with an app root and very little actual content. You need a rendering layer because the data isn't in the initial response.
A realistic example is a real estate listing site. The listing page might be server-rendered, allowing Scrapy to extract titles, locations, and links without a browser. But when you open an individual listing, the price history panel might load from a separate JSON endpoint after page load. On the same site, you can end up treating 2 page types differently:
- Scrape listing pages with regular Scrapy
- Call the price-history API directly if you can find it
- Only fall back to browser rendering if the data is truly client-rendered, and there's no usable endpoint
That's the general rule: don't add JavaScript rendering just because the site uses JavaScript. Add it only when the response Scrapy gets is missing the data you need.
Scrapy middlewares for headless browsers: How JavaScript rendering plugs in
The easiest way to think about Scrapy with JavaScript is this: you're not changing how the spider thinks; you're changing how certain requests get fetched.
That matters because many people treat Splash, Selenium, and Playwright as 3 distinct approaches to scraping. They're not. In Scrapy, they are different ways of plugging a browser-backed fetch step into the downloader layer. Once you look at it that way, the architecture becomes much easier to follow.
Start with the 2 layers that matter
There are 2 Scrapy extension points behind almost every JavaScript setup:
- Downloader middleware: This sits around the normal request-and-response flow. It can inspect requests, modify headers, apply retries, attach proxies, or intercept specific request types and return different responses.
- Download handler: This sits lower in the stack. It replaces the fetch mechanism itself. Instead of letting Scrapy's normal downloader make the request, it forwards it to a different backend.
The difference is important. Middleware changes or intercepts the flow. A download handler changes who is actually loading the page.
If the browser piece still feels abstract, this is a good place to pause and look at what a headless browser actually is.
Where Playwright fits
scrapy-playwright uses a download handler.
That's why it feels cleaner than older integrations. When you mark a request with meta={"playwright": True}, Scrapy routes that request through a Playwright-backed fetch path instead of the normal HTTP downloader. The spider callback still receives a Scrapy response. Your items, pipelines, exports, and duplicate filtering still work the same way.
So with Playwright, the change happens at the fetch layer, not in the spider logic.
Where Splash fits
scrapy-splash is built around a downloader middleware plus a custom request type, usually SplashRequest.
The middleware catches that request, rewrites it so it goes to the Splash service, waits for Splash to render the page, and then turns the result back into a normal Scrapy response.
So the spider still sees HTML and keeps parsing as usual. The extra work was done before the response arrived.
Where Selenium fits
scrapy-selenium follows the same general pattern as Splash.
It uses a downloader middleware and a custom request type, such as SeleniumRequest. The middleware intercepts that request, drives a browser through WebDriver, gets the rendered page source, and hands that HTML back to Scrapy.
Again, the spider code doesn't need to be converted to "Selenium code." The browser logic stays in the downloader layer.
You don't need 3 different mental models for Splash, Selenium, and Playwright.
The cleaner way to read the rest of this article is:
- Keep the spider focused on extraction
- Decide at the downloader layer which requests need browser rendering
- Use middleware or a custom download handler to make that happen
That's what keeps Scrapy with JavaScript manageable. Your spider stays focused, the rendering logic stays in one place, and swapping one rendering approach for another doesn't mean rewriting the whole crawl.
Executing JavaScript in Scrapy with Splash
Splash is a lightweight rendering service that sits outside Scrapy and exposes a browser-like renderer over HTTP. In simple terms, it's a separate tool that loads a page for you and returns the rendered HTML to Scrapy. That becomes important when the raw HTML is missing the data you need.
Splash is still useful when a smaller rendering setup matters or when there's already a working Splash deployment. It's lighter than a full Playwright or Selenium setup, but it's also a narrower fit on newer frontend stacks and harder targets.
To run Splash locally, the easiest option is Docker Desktop. Docker Desktop is an app that lets you run software in isolated packages called containers. A container is a self-contained runtime for an app and its dependencies, so you don't have to install everything manually on your system.
On Windows, Docker Desktop usually depends on either WSL 2 or Hyper-V.
- WSL 2 stands for Windows Subsystem for Linux. It lets Windows run Linux tools and services in the background.
- Hyper-V is Microsoft’s built-in virtualization system. Virtualization means your computer can run isolated systems or environments inside the main operating system.
If Docker isn't installed yet, install Docker Desktop first. On Mac, make sure to pick the right installer for your chip, as there are separate versions for Intel and Apple Silicon (M1 and later).
Once installation is complete, launch Docker Desktop and wait for it to finish starting up before moving on. Once it’s up and running, open a terminal and verify that Docker is working by running docker --version. It also helps to run docker run hello-world once to confirm Docker can actually start containers on your machine.
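If both commands succeed, Docker is ready, and you can move on to running Splash locally with docker run -p 8050:8050 scrapinghub/splash, which exposes the renderer on port 8050.
Then install scrapy-splash in your Scrapy environment with pip install scrapy-splash.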
After that, wire it into settings.py:
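A minimal sketch of that wiring, based on the settings the scrapy-splash README documents. It assumes Splash is listening on localhost:8050; the exact set of recommended settings can differ between package versions, so check the README for the release you install.

```python
# settings.py (sketch; assumes a local Splash container on port 8050)
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    # Keeps cookies consistent between Scrapy and Splash.
    "scrapy_splash.SplashCookiesMiddleware": 723,
    # Rewrites SplashRequest objects so they are sent to the Splash service.
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    # Avoids storing duplicate copies of large Splash arguments.
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and the HTTP cache aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```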
That configuration is what makes Splash behave like a proper part of the Scrapy request cycle rather than a one-off service call. Without it, installing the package isn't enough.
For simple rendered pages, the spider code stays short. Instead of yielding a normal scrapy.Request, yield SplashRequest, and give the page a short wait:
That wait matters. If the page populates content after load, returning too early will give you the same empty-selector problem you were trying to fix.
Where Splash still has a niche is in scripted interaction. If the page needs a short scroll or a delayed render, you can send a Lua script through the execute endpoint and pass it to the request as lua_source.
That's enough for simpler dynamic pages where one scroll and one delay are all you need. Splash is lighter, but it's also less capable with modern JavaScript and less reliable on harder targets. Use it when you already run it, when you want a smaller container, or when the target is simple enough that Lua and a basic renderer will do. If you're starting fresh, scrapy-playwright is the stronger default. And if keeping a renderer alive at all feels like unnecessary overhead, Decodo Web Scraping API moves that part out of your Scrapy stack entirely.
Executing JavaScript in Scrapy with Selenium
Selenium is the option most people already know before they get to Scrapy. It works, and it still has a place, but it isn't the recommended default for a new Scrapy project. The main reasons are still maintenance and weight. The original scrapy-selenium package is old; newer wrappers such as scrapy-selenium4 exist, and the Selenium path is still heavier and less cleanly integrated with Scrapy than scrapy-playwright.
The basic setup still follows the same pattern. Install scrapy-selenium, add the driver settings in settings.py, and enable the middleware:
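A sketch of those settings for Chrome, following the setting names in the scrapy-selenium README. It assumes chromedriver is on your PATH; the middleware priority of 800 is the value the README uses.

```python
# settings.py (sketch; assumes chromedriver is installed and on PATH)
from shutil import which

SELENIUM_DRIVER_NAME = "chrome"
SELENIUM_DRIVER_EXECUTABLE_PATH = which("chromedriver")
# Chrome's current headless mode; for Firefox the argument would be "-headless".
SELENIUM_DRIVER_ARGUMENTS = ["--headless=new"]

DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium.SeleniumMiddleware": 800,
}
```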
That configuration is based on the package's documented settings. The only adjustment for a current Chrome setup is "--headless=new", since Selenium deprecated the older convenience headless path and Chrome's current headless mode uses the newer form. If you're using Firefox instead, the argument is still "-headless".
In the spider, the request shape is straightforward. Instead of yielding a normal scrapy.Request, you yield SeleniumRequest and tell it what to wait for. In the example below, the target is quotes.toscrape.com/js, a practice site that renders its content through JavaScript:
That is the basic pattern: pass wait_time and wait_until so Selenium waits for a selector that proves the page is actually ready instead of returning it too early, then parse the response like normal Scrapy HTML.
Selenium still makes the most sense when the page needs interaction. The scroll version of the same site at quotes.toscrape.com/scroll loads new quotes as you scroll down, which requires driving the browser directly. The callback can access the live driver through response.meta["driver"]. A single scroll loads one batch, so production use would loop this until no new content appears:
The package also documents a built-in script argument on SeleniumRequest, so for simple one-shot JavaScript execution, you can keep that logic on the request itself instead of moving everything into the callback. The script runs after the page loads, so pairing wait_time with a wait_until condition is still worth doing if the page needs time to settle before the script fires:
The trade-off is cost; Selenium is the heaviest option: browser process, WebDriver process, and much weaker scaling once you raise concurrency. It's still a reasonable choice when you already have Selenium expertise, a working WebDriver setup, or browser extensions you need to keep using. If that is your situation, the Selenium guide, the WebDriver proxy guide, and the Puppeteer vs. Selenium comparison are the most relevant follow-ups.
For most new projects, scrapy-playwright is still the cleaner default. And if the browser layer is becoming the whole job, using the Decodo Web Scraping API can simplify things.
Executing JavaScript in Scrapy with Playwright
For a new Scrapy project, scrapy-playwright is usually the right place to start when a page needs JavaScript. It supports Chromium, Firefox, and WebKit, and it fits into Scrapy much more cleanly than the older browser integrations. Instead of pulling the whole crawl into a separate browser script, it plugs into Scrapy at the downloader layer. That keeps the split clear: normal requests stay normal, and only the requests that actually need rendering go through Playwright.
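Start with the install: run pip install scrapy-playwright, then playwright install chromium to download the browser binary.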
Then add the Playwright download handler in settings.py:
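A minimal sketch of that configuration, using the handler path and reactor documented by scrapy-playwright. The browser type and launch options shown are illustrative defaults you can adjust.

```python
# settings.py (sketch)
DOWNLOAD_HANDLERS = {
    # Route both schemes through the Playwright-aware download handler.
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Playwright is asyncio-based, so Scrapy needs the asyncio reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
```

Requests without the playwright meta key still go through Scrapy's regular HTTP path, so enabling the handler does not put the whole crawl in a browser.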
The download handler tells Scrapy where to send browser-rendered requests. The reactor matters because Playwright is async-native, and Scrapy needs an asyncio-compatible reactor for that integration to work properly.
Once that is in place, Playwright is enabled per request with meta={"playwright": True}.
Here's a small JavaScript-rendered example using https://quotes.toscrape.com/js/. It works well as a first Playwright example because it makes one thing clear: the quotes appear only after the page has finished rendering.
In Scrapy 2.13 and newer, async def start() is the current way to generate initial requests. If you're working with an older project, start_requests() still works, which is why our earlier Selenium and Splash examples use that style.
The meta block is where the Playwright behavior is defined, through the playwright and playwright_page_methods keys.
Each part does a separate job:
- playwright=True tells Scrapy to render this request through Playwright
- playwright_page_methods runs browser actions before the response is passed to the callback
In this case, Playwright waits until at least one .quote element appears. Once that happens, Scrapy receives the rendered HTML, and the spider can keep using normal CSS selectors.
That is the main reason scrapy-playwright works so well as a default: after rendering, the spider still looks like Scrapy.
Useful PageMethod patterns
The most common browser action is simply waiting for content:
Use that when the page shell loads first, and the useful data appears later.
If the page needs more than a simple wait before parsing, extra browser actions can be added through additional PageMethod calls.
To wait for an element to appear before Scrapy parses the page, use PageMethod("wait_for_selector", ...) with a selector that only matches once the content is ready.
If the page needs a scroll before the next batch of content appears, add a scroll action first, then wait for a selector that confirms the new content has loaded:
It's also possible to execute inline JavaScript directly in the page:
That's useful when the data is already available in a browser-side JavaScript variable rather than in the DOM. Many sites load a state object first and render the visible interface from that. Reading the variable directly is often simpler and more reliable than scraping the rendered table or list.
When the live Playwright page is actually needed
Most JavaScript-rendered pages don't need the live Playwright page object. In many cases, PageMethod is enough: wait for a selector, maybe run one browser action, then let Scrapy parse the rendered HTML as usual.
The live page object is for the cases where rendering alone isn't enough. That usually means a page that needs browser interaction after navigation, such as clicking, typing, multi-step workflows, or infinite scrolling. In those cases, add playwright_include_page=True and switch to an async callback so the spider can work directly with the Playwright page.
Infinite scrolling is a good example because it shows why the live page object exists in the first place. The page doesn't just need time to render. It needs repeated browser actions, checks for newly loaded content, and a clean shutdown when the work is done.
A few details matter here.
First, the spider waits for the first .quote element before doing anything else. That confirms the page has rendered the initial batch of content.
Then it keeps scrolling until the number of quotes stops increasing. That's a useful pattern in this example. It doesn't assume a fixed number of scrolls or a fixed number of pages. It keeps going until no new items appear.
It also stays inside Playwright’s own locator API the whole time. That is cleaner than turning the updated page back into another Scrapy response after every scroll, and it makes the point of playwright_include_page=True much clearer: this is for real browser interaction, not just for grabbing one rendered snapshot.
The last important line is await page.close(). Open Playwright pages count toward the page limit inside each browser context. If pages are left open, the crawl can stall once that limit is reached. That is why the same cleanup appears in errback as well: if the request fails, the browser page still needs to be closed.
That is also the rule for using playwright_include_page=True in general: only turn it on when the callback really needs the live page, and close the page when the interaction is finished.
Initializing pages and intercepting requests
Some setup belongs before the page even starts loading. That is what playwright_page_init_callback is for.
A common use case is blocking heavy assets:
This callback runs after scrapy-playwright creates the page, but before the request is made. That makes it the right place for setup work such as request routing, blocking images, or adding scripts before navigation.
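A minimal sketch of the callback wiring. The timeout value and the injected script are placeholders. Because scrapy-playwright manages Page.route itself and its docs advise against calling it directly, this sketch sticks to setup that does not touch routing.

```python
async def init_page(page, request):
    """Runs once for each new page, before navigation starts."""
    # Fail faster than Playwright's default timeout on slow pages.
    page.set_default_timeout(15_000)
    # Inject a script that runs before any of the page's own scripts.
    await page.add_init_script("window.__scrapy_init = true;")


# Pass this dict as meta= on a scrapy.Request.
REQUEST_META = {
    "playwright": True,
    "playwright_page_init_callback": init_page,
}
```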
The same method can also be used to watch for a specific XHR or API request:
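A sketch of that kind of listener. The "/api/" URL fragment is a hypothetical hint for what an API call looks like on the target; the handler uses Playwright's response event and only logs what it sees.

```python
import logging

logger = logging.getLogger(__name__)

API_HINT = "/api/"  # hypothetical fragment that identifies API calls on the target


def is_api_response(response) -> bool:
    """True for XHR/fetch responses whose URL looks like an API call."""
    return (
        API_HINT in response.url
        and response.request.resource_type in {"xhr", "fetch"}
    )


def log_api_response(response) -> None:
    if is_api_response(response):
        logger.info("API call seen: %s %s", response.status, response.url)


async def init_page(page, request):
    # Attach the listener before navigation so early requests are not missed.
    page.on("response", log_api_response)
```

Pass init_page through the playwright_page_init_callback meta key. Any URL it logs is a candidate for a plain Scrapy request.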
On real sites, such interception often indicates that browser rendering may not be necessary for the entire crawl. If Playwright shows that the page is actually retrieving its data from a JSON API, scraping that API directly with standard Scrapy requests is usually faster and more reliable.
Why this is the recommended default
scrapy-playwright is a strong default because it uses modern browser engines, fits Scrapy’s async model, and lets the project keep Scrapy’s normal parsing, item pipelines, middleware, throttling, and feed exports. Selenium still has a place in certain browser automation workflows, and Splash still appears in older deployments. Still, for a new JavaScript-heavy Scrapy project, Playwright is usually the cleaner starting point.
It's still worth keeping the bigger rule in mind: browser rendering is expensive. Use it where it's needed, not everywhere. If the data is available via an API, use it. If the target blocks headless Chromium immediately, pair scrapy-playwright with rotating residential IPs instead of trying to solve every blocking problem inside the browser layer.
Splash vs. Selenium vs. Playwright vs. managed API: Which one for your Scrapy spider?
This is the short version: if you're starting fresh, use scrapy-playwright. It's the latest option in the Scrapy browser-rendering ecosystem, with version 0.0.47 released on PyPI in June 2026, and it supports Chromium, Firefox, and WebKit. scrapy-splash is still maintained and relevant, with a February 2025 PyPI release, but it's a more limited fit for new projects.
The better question isn't "which library is popular?" It's which one fits the job.
| Option | Setup overhead | Modern JS support | Concurrency ceiling | Anti-bot resilience | Maintenance status | Cost | Best fit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| scrapy-splash | Low to medium | Limited | Moderate | Weak | Older | Low | Existing Splash deployments, small rendering jobs |
| scrapy-selenium | Medium to high | Good | Low | Weak to moderate | Loosely maintained wrapper | Medium | Teams that already depend on WebDriver |
| scrapy-playwright | Medium | Strong | Good | Better than the older Scrapy browser integrations | Active | Medium | New Scrapy projects that need browser rendering |
| Decodo Web Scraping API | Low | Strong | High from the Scrapy side | Stronger by design because rendering, proxy rotation, and unblock layers sit outside your spider | Managed | Usage-based | Hard targets or teams that don't want to manage browsers |
Use scrapy-playwright when the crawl is small to medium, and the JavaScript is real
This is the default for most new Scrapy spiders. It's actively maintained, uses modern browser engines, and plugs into Scrapy as a download handler instead of forcing a separate browser workflow into the spider. It also supports the current versions of Scrapy and Python directly. If the target needs rendering but isn't a full anti-bot war, this is usually the right answer.
Keep Splash when it's already in production, and the target is simple
Splash still makes sense when the deployment already exists, the targets are relatively light, and the small container footprint matters. It isn't abandoned, but it isn’t the strongest default for a new JavaScript-heavy Scrapy spider. The better way to frame it is as a practical option for simpler rendered pages and existing Lua-based workflows, not as the first choice for modern React-heavy or aggressively protected targets.
Use scrapy-selenium when the team already runs WebDriver
This is mostly a compatibility choice now. If the team already has Selenium expertise, existing browser automation, or a browser extension setup that needs to stay in place, Selenium can still be the shortest path; however, it remains the heaviest option in this group once concurrency increases.
Use a managed API when the browser layer is becoming the actual project
This is the point where rendering, proxies, CAPTCHA, and unblock logic take longer than parsing. In that case, a managed layer is often the cleaner architecture. Decodo Web Scraping API is the fourth option here for exactly that reason: it handles rendering and delivery outside the spider, allowing Scrapy to stay focused on request routing and extraction.
The short recommendation
- Small crawl + simple JavaScript: use scrapy-playwright
- Legacy Splash deployment: keep Splash
- Existing Selenium estate: use scrapy-selenium
- Heavy anti-bot targets or no interest in managing browsers: use Decodo Web Scraping API
That's the actual decision path. If the broader Python landscape still matters, comparing Python web scraping libraries is the right place to zoom out.
Using Scrapy cache and concurrency to scrape faster
Once browser rendering enters the pipeline, throughput drops fast. The fix isn't just "raise CONCURRENT_REQUESTS." The useful gains usually come from 3 places: cache rendered responses when you're developing, cap browser-side parallelism to what the machine can actually hold, and keep as many requests as possible out of the browser in the first place. Scrapy's own settings cover the crawl side, and scrapy-playwright adds a second set of limits for browser contexts and pages.
Cache rendered pages during development
If rendering a request takes a few seconds, re-rendering the same page every time you adjust a selector is a waste of time. Scrapy's HTTP cache can store responses on disk through the built-in FilesystemCacheStorage. The default DummyPolicy caches every response regardless of headers, which suits parser iteration; the RFC2616 policy honors HTTP caching headers and is the better fit for repeated production runs.
A practical starting point in settings.py looks like this:
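A sketch of development-time cache settings. The one-day expiry and the directory name are illustrative, and the policy shown is Scrapy's default DummyPolicy, which caches everything; swap in scrapy.extensions.httpcache.RFC2616Policy if you want HTTP caching headers respected.

```python
# settings.py (sketch for local development)
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400  # re-fetch after one day; 0 means never expire
HTTPCACHE_DIR = "httpcache"  # stored under the project's .scrapy directory
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# DummyPolicy caches every response; RFC2616Policy follows HTTP cache headers.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# Avoid caching error pages while selectors are still being tuned.
HTTPCACHE_IGNORE_HTTP_CODES = [403, 429, 500, 502, 503]
```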
That doesn't make a target faster on the first hit, but it does make parser iteration much cheaper on re-runs. For a Playwright-backed spider, that matters because you can keep working on extraction logic without paying the browser cost again for unchanged pages. For a refresher on the framework side, the main Scrapy guide is still the right reference.
Don't tune concurrency as if every request were plain HTTP
CONCURRENT_REQUESTS still controls overall Scrapy parallelism, but with browser-backed requests, the real bottleneck is usually the browser, not the downloader. In scrapy-playwright, PLAYWRIGHT_MAX_CONTEXTS limits concurrent browser contexts, and PLAYWRIGHT_MAX_PAGES_PER_CONTEXT limits open pages per context. By default, the pages-per-context limit inherits Scrapy's CONCURRENT_REQUESTS value, which can be too loose if RAM is tight.
A safer baseline is:
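A sketch of such a baseline. Every number here is an assumption meant as a starting point for a modest machine, not a recommendation from the scrapy-playwright docs; tune them against your own RAM and target.

```python
# settings.py (sketch; all values are illustrative starting points)
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Browser-side limits: 2 contexts x 4 pages caps the crawl at 8 open pages.
PLAYWRIGHT_MAX_CONTEXTS = 2
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4

# Let AutoThrottle adapt delays to observed latency instead of fixed sleeps.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
```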
That keeps Scrapy reasonably busy without letting a single target or browser process run away with the machine. Scrapy's docs also note that DOWNLOAD_DELAY can reduce the effective per-domain concurrency below CONCURRENT_REQUESTS_PER_DOMAIN, so raising limits mindlessly doesn't always translate into real throughput. And once browsers are in the loop, AutoThrottle tends to work better than fixed sleeps because it responds to actual latency rather than guessing.
Skip rendering whenever possible
This is still the biggest speed lever. If a page is really just calling a JSON endpoint, route that request through normal Scrapy and leave Playwright out of it. That cuts browser time, memory use, and failure surface all at once.
In practice, that often means doing something like this in the spider:
The same rule applies within a single crawl: not every URL on the site belongs in the browser queue. Keep Class A requests out of Playwright whenever possible.
Block heavy assets the current way
The current scrapy-playwright docs are cautious about request interception. The plugin uses Page.route internally and explicitly warns against calling Page.route yourself unless you know exactly what you're doing. The safer current hooks are playwright_page_init_callback, where you can add initialization logic before navigation, and the PLAYWRIGHT_ABORT_REQUEST setting, which takes a predicate that decides which requests to drop.
A practical example looks like this:
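One way to do this without touching Page.route is the PLAYWRIGHT_ABORT_REQUEST setting that scrapy-playwright documents, which takes a predicate over each Playwright request. The sketch below uses that setting rather than an init callback; the set of blocked resource types is an assumption you should adapt to the target.

```python
# settings.py (sketch)
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}  # illustrative choice


def should_abort_request(request) -> bool:
    """Return True for Playwright requests that should never be fetched."""
    return request.resource_type in BLOCKED_RESOURCE_TYPES


# scrapy-playwright calls this predicate for every request the page makes.
PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```

Stylesheets are left alone here because some pages only reveal content once their CSS has loaded; block them only after checking that the selectors you wait on still appear.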
That's a cleaner way to cut render time on pages where the text or table data matters and the media doesn't. If the crawl is large enough that this starts to matter, the next useful read is web scraping without getting blocked.
The practical rule
When JavaScript rendering slows the crawl down, the first moves should be:
- cache what can be cached
- lower browser-side parallelism to something the machine can hold
- keep API-backed requests out of the browser
- block assets you don't need
That's usually enough to get a Playwright-enabled spider back into a usable range without turning the whole settings file into a guessing game.
Handling blocks when running Scrapy with JavaScript
Adding a browser solves rendering, but it also makes the crawl easier to fingerprint. A plain headless Chromium session is often more obvious than a normal Scrapy request because the browser exposes automation signals like navigator.webdriver, and anti-bot systems now look at more than just the DOM. They look at browser fingerprints, request patterns, and network behavior, too. That's why JavaScript-enabled spiders often get blocked more, not less. If you want the broader background, the best companion reads here are anti-bot systems, web scraping without getting blocked, and anti-scraping techniques.
The first layer is browser hygiene. Rotate a realistic user agent, vary the viewport slightly, and avoid sending the same browser profile across contexts. Playwright supports user agent and viewport options at the browser-context level (user_agent and viewport in the Python API), so that part is straightforward. It's also where community stealth layers come in. Tools like playwright-stealth exist, but even their own package description is careful: it's a proof-of-concept starting point, not a guarantee that modern detection will disappear. Rebrowser-style patches exist as well, but they move quickly and aren't the same thing as official Playwright support. That makes them useful as tactical tools, not something to build a whole crawler strategy around.
In practice, the mitigation stack usually looks like this:
- rotate user agents per browser context
- vary viewport sizes instead of using one fixed default
- slow the crawl down enough to avoid looking machine-perfect
- use a stealth layer only if the target really needs it, and expect to re-test it regularly
A simple scrapy-playwright context setup can look like this:
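A sketch of two named contexts with different profiles. The context names, user agent strings, and viewport sizes are illustrative examples, and pick_context_meta is a hypothetical helper for spreading requests across them.

```python
import random

# settings.py (sketch): each key is a context name, each value is passed to
# Playwright's browser.new_context() as keyword arguments.
PLAYWRIGHT_CONTEXTS = {
    "desktop_a": {
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),
        "viewport": {"width": 1366, "height": 768},
    },
    "desktop_b": {
        "user_agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),
        "viewport": {"width": 1440, "height": 900},
    },
}


def pick_context_meta() -> dict:
    """Build request meta that sends the request through a random context."""
    return {
        "playwright": True,
        "playwright_context": random.choice(list(PLAYWRIGHT_CONTEXTS)),
    }
```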
That isn't "undetectable." It just removes a few cheap tells and makes the browser side less uniform.
The second layer is the proxy layer. scrapy-playwright supports proxies through PLAYWRIGHT_LAUNCH_OPTIONS, which is the cleanest place to route the browser through an external IP. For normal crawling, rotating residential IPs are the safer default because they spread requests out and appear less synthetic than hammering a target from a single address. For login-heavy or multi-step flows, session stickiness matters more than raw rotation. That's where a sticky residential or ISP session is a better fit than changing IPs on every request.
A basic launch configuration looks like this:
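A sketch of that launch configuration. The proxy host, port, and credentials are placeholders; the proxy dict follows the shape Playwright accepts for browser launch options.

```python
# settings.py (sketch; server and credentials are placeholders)
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    "proxy": {
        "server": "http://proxy.example.com:8000",
        "username": "YOUR_USERNAME",
        "password": "YOUR_PASSWORD",
    },
}
```

Because this applies at browser launch, every context in that browser shares the same proxy endpoint; rotation or session stickiness then depends on how the proxy provider handles that endpoint.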
If the target is sensitive enough that the browser layer keeps getting burned, Decodo Residential proxies are a more natural fit than stretching datacenter IPs beyond what the site will tolerate.
The last part is knowing when to stop. Once the crawl is spending more time on stealth patches, browser quirks, and proxy failures than on extraction logic, the architecture has usually tipped too far. A simple rule of thumb is this: if a target is still failing on more than about one in five requests after reasonable browser tuning and a proper proxy layer, it's probably time to stop fighting that target inside the spider and move it to a managed rendering layer instead. That's the point of the Decodo Web Scraping API: keep the Scrapy pipeline, but push rendering, proxy rotation, and CAPTCHA handling out of the spider and into one endpoint.
Final thoughts
Running Scrapy with JavaScript is mostly a routing problem. First, confirm the page actually needs rendering, since many "dynamic" pages just call a JSON endpoint and should stay on plain Scrapy. When a browser is necessary, scrapy-playwright is the best default for new projects; Splash fits when a small container matters or it's already in production, and Selenium when you're reusing an existing WebDriver setup. After that, keep the crawl efficient: cache what you can, let AutoThrottle work, block assets you don't need, and limit browser-backed requests to pages that truly need them. And once blocks, browser upkeep, and proxy issues eat more time than parsing, stop treating it as a spider problem. A managed rendering layer like the Decodo Web Scraping API is built for that handoff.
About the author

Vilius Sakutis
Head of Partnerships
Vilius leads performance marketing initiatives with expertise rooted in affiliates and SaaS marketing strategies. Armed with a Master's in International Marketing and Management, he combines academic insight with hands-on experience to drive measurable results in digital marketing campaigns.


