Back to blog

Puppeteer Infinite Scroll: A Practical Scraping Guide

Share article:

If you try to run curl on an infinite scroll page and then search (grep) for the content, the result will show zero matches. The required items aren't available in the initial HTML. The content is loaded using fetch requests that the page sends when a scroll event is triggered. This guide will cover: diagnosing the target, three scroll strategies, verification, speed, anti-detection methods, and a real-world walkthrough.

Puppeteer Infinite Scroll

TL;DR

  • The Network tab usually reveals all necessary information before coding begins. If there's a dedicated JSON endpoint for the scroll, a full browser may not be necessary.
  • Most sites implement one of three scroll patterns: a height-check loop, viewport-step with item counting, or using scrollIntoView on a sentinel element. You must diagnose the pattern first before selecting the appropriate strategy.
  • Most scrapers fail silently due to errors in loop verification. To verify, log per-iteration counts and cross-check them against the API's pagination signal when it's available.
  • For optimal speed, use event hooks instead of fixed sleep delays. The page.waitForResponse() method can capture the data response, and MutationObserver detects when new items appear in the DOM.
  • Anti-detection should be implemented in incremental steps. Begin with delay and scroll randomization, incorporate mouse movement, and only transition to residential proxies if IP blocking becomes an issue.

Understanding infinite scrolling

For a scraper, infinite scroll is a data delivery mechanism. The page loads more content as the user scrolls, with no pagination clicks and no full-page reloads.

How infinite scroll loads data

Most sites use a scroll event listener or an IntersectionObserver watching a sentinel element near the bottom of the page. When the trigger fires, the page sends a fetch request to a JSON endpoint, parses the response, and injects new nodes into the DOM. The cycle repeats until the API returns an empty page or the user stops scrolling.

Why standard HTTP scrapers fail

requests.get() or plain fetch() call returns whatever the server sent in the first response. That gives you the page shell, the loader script, and sometimes the first batch of items. Anything that loads through JavaScript afterward is missing. Without a JavaScript runtime, the scraper has no way to fire scroll events, so the rest of the data never loads. That's why infinite-scroll targets need a headless browser.

You can see the gap in one command:

$ curl -s https://quotes.toscrape.com/scroll | grep -c "class="quote""
0
$ node strategy1-scroll-to-bottom.js
Saved 100 quotes

The raw response has an empty container div and a script tag that fetches quotes from a JSON API and injects them at runtime. Until JavaScript runs, the document doesn't contain the quotes the page displays. That's the whole problem in one shell session.

What this means for scraping

You should note 3 properties of infinite scroll when building a scraper:

  • The state doesn’t survive a crash. A scraper that fails mid-scroll has no bookmark to resume from; with cursor-based pagination you can resume from the last ID.
  • There's no stable URL corresponding to an item's position. This makes deduplication across runs more complex than it first appears.
  • Scroll patterns can expose bot signals. Events that arrive too regularly or too fast don't match real visit behavior, so anti-detection becomes a core concern from the start.

Set up Puppeteer

Puppeteer is a Node.js library that drives Chromium over the Chrome DevTools Protocol (CDP). Calling puppeteer.launch() starts a bundled Chromium and returns a Browser object that your script can control. Every method you call (navigation, evaluation, scrolling, input) maps to a CDP command the browser executes.

Install

Install Puppeteer into a Node 22 LTS project. The package bundles a compatible Chromium build automatically:

mkdir scrape-infinite-scroll && cd scrape-infinite-scroll
npm init -y
npm pkg set type=module
npm install puppeteer

Recent Puppeteer releases ship as ESM, so every example below uses import/export syntax. Setting "type": "module" in package.json (the npm pkg set line above) is the simplest setup. If your project still uses CommonJS, pin Puppeteer to v24 or earlier.

Two v22 API changes you should know about

These 2 changes in v22 are the main ones to watch for before you start writing code.

First, there's no built-in delay function. The page.waitFor() and page.waitForTimeout() helpers were removed in v22. For fixed delays, use a one-line helper: const delay = ms => new Promise(r => setTimeout(r, ms)). For condition-based waits, use page.waitForFunction() instead of a fixed sleep.

Second, the headless flag changed behavior in v22. headless: true now selects the modern headless implementation, and headless: 'shell' selects the legacy chrome-headless-shell binary. Both modes produce User-Agent strings that contain HeadlessChrome, which is a detection signal. The anti-detection section covers how to override the User-Agent. The intermediate headless: 'new' string from older 2023 tutorials still launches and behaves the same as headless: true. The examples below use headless: true.

Target site

The target site for every example is quotes.toscrape.com/scroll. It's a public test site built specifically for scraping practice. It has real AJAX pagination, a stable DOM, and a JSON endpoint that anyone can see with DevTools open. The site is intentionally friendly. There's no anti-bot or authentication, the termination is predictable, and the selectors are clean. That makes it a good starting point for learning the mechanics, even though most production targets aren't this clean.

Some test pages with "scroll" in the URL are static demos that don't actually load on scroll. Use the curl-and-grep diagnostic from the next section to confirm the page behavior before committing to a strategy.

Diagnose your target before writing code

The hardest part of an infinite scroll scrape isn't the scroll loop. It's figuring out how the target loads its data before you commit to a strategy that doesn't match. Spending just a few minutes with DevTools on the real site saves a lot of time debugging a loop that terminates on the wrong signal.

If you've already confirmed your target needs JavaScript rendering, you can skip to the 3 scroll strategies below and return here when you hit specific situations (picking the right strategy, handling authentication, identifying the termination signal).

Load the target in Chrome and open DevTools (our inspect element guide covers the basics if you need a refresher), then answer the 5 questions below.

Is there an API endpoint, or is the data rendered server-side?

Switch to the Network tab, filter to Fetch/XHR, and scroll the page. If you see fetch requests firing on each scroll with JSON responses that contain the visible items, intercepting the API directly is the cleanest approach. If you see no new requests on scroll, the page is either re-rendering pre-loaded data client-side or using server-sent events, and you'll need DOM scraping instead.

DevTools Network → Fetch/XHR on quotes.toscrape.com/scroll: each scroll fires a fetch, and the JSON response carries everything the rendered DOM would have.

If you found a JSON endpoint, see the alternative approaches below for the implementation. The scroll-strategy sections that follow are for targets without a clean API.

What is the scroll trigger?

3 patterns are common, and each needs a different strategy:

  • The page uses a near-bottom trigger. It loads more content when the scroll position is within a few hundred pixels of document.body.scrollHeight. Test by scrolling slowly and watching where in the viewport the next batch fires. Strategy 1 (jump to the bottom and wait for height change) is the right choice.
  • The page uses an IntersectionObserver on a sentinel. A specific element (usually invisible, at the bottom of the list) triggers the load when it enters the viewport. Open the Sources tab, search for IntersectionObserver, and look at the root and rootMargin properties. If rootMargin is small or zero, you must scroll the sentinel into actual view, so Strategy 2 (viewport-height increments) is the right choice.
  • The page has an explicit "Load More" button. A real click advances the feed. The locator click pattern handles this, and the one-line answer is page.locator('button.load-more').click().

Does the URL change as you scroll?

If you see page=2 or ?cursor=abc123 appearing in the URL on scroll, the site has addressable pagination. Treat it as pagination instead of infinite scroll. Standard HTTP requests to those URLs usually work without a headless browser.

The same test domain shows both patterns side by side, which makes the choice concrete:

# Infinite scroll variant: raw HTML has nothing useful
$ curl -s https://quotes.toscrape.com/scroll | grep -c 'class="quote"'
0
# Pagination variant on the same site: raw HTML has the items, ready to parse
$ curl -s https://quotes.toscrape.com/page/2/ | grep -c 'class="quote"'
10

2 URL patterns produce 2 distinct scraping strategies. A few seconds of curl before you start using Puppeteer often tells you Puppeteer isn't needed. Some sites are hybrid (infinite scroll on category pages, addressable URLs on the product detail pages they link to), and you should scrape each layer with the approach that matches its architecture. For more on pagination scraping, see our guide on web scraping pagination.

What does authentication require?

Look at the captured request's headers and check for these signals:

  • The request has a Cookie header with session values. You need to load the login page through Puppeteer first so it establishes a session.
  • The request has an Authorization header with a Bearer token. This is a JWT or API token. Find where it comes from (login response, embedded in HTML, or generated client-side) so you can replay it.
  • The request has an X-CSRF-Token or X-Requested-With header. This is CSRF protection. Extract the token from the page before replaying.
  • The URL has a signed query parameter (for example, &sig=…). This is algorithmic signing. Reverse-engineering the algorithm or keeping the browser-driven approach are the 2 practical options.

How does the feed signal that it's finished?

Check what the last response looks like. The end signal might be a flag like has_next: false, a cursor that returns null, an empty items array, or no signal at all. The termination logic in your loop has to match the signal the site sends. A loop watching for has_next: false against a feed that ends with an empty array will run forever.

3 scroll strategies

Pages trigger their content loads in different ways. 3 patterns cover most production targets, and each works on the quotes.toscrape.com/scroll test site. Pick the one that matches how your target listens for scroll events.

Shared helpers and the modern delay function

Every example below shares the same boilerplate. Save this once as a helper file:

// helpers.js
export const delay = ms => new Promise(r => setTimeout(r, ms));
export const randomDelay = (min, max) =>
new Promise(r => setTimeout(r, Math.random() * (max - min) + min));
export async function launch(puppeteer) {
return puppeteer.launch({
headless: true,
defaultViewport: { width: 1920, height: 1080 },
});
}

These 2 small helpers cover what the standard library doesn't. The delay() function handles fixed pauses, and randomDelay() provides the variance used later in the human-like patterns. The launch wrapper centralizes the headless and viewport defaults so every example below stays consistent. The 1920x1080 resolution matches the most common real desktop, which the anti-detection section relies on later.

Strategy 1: Scroll to bottom and wait for height change

Strategy 1 is the scroll-to-bottom pattern. Scroll to document.body.scrollHeight, wait for the height to grow (which means new content was appended), then repeat. Use this on any page that grows the body as new items load.

// strategy1-scroll-to-bottom.js
import puppeteer from 'puppeteer';
import fs from 'fs/promises';
import { delay, launch } from './helpers.js';
async function extractQuotes(page) {
return page.evaluate(() => {
const cards = document.querySelectorAll('.quote');
return Array.from(cards).map(card => ({
text: card.querySelector('.text')?.innerText.trim(),
author: card.querySelector('.author')?.innerText.trim(),
tags: Array.from(card.querySelectorAll('.tag')).map(t => t.innerText.trim()),
}));
});
}
async function scrapeInfiniteScroll() {
const browser = await launch(puppeteer);
const page = await browser.newPage();
try {
await page.goto('https://quotes.toscrape.com/scroll', {
waitUntil: 'domcontentloaded',
});
await page.waitForSelector('.quote');
while (true) {
const previousHeight = await page.evaluate(() => document.body.scrollHeight);
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
try {
await page.waitForFunction(
prev => document.body.scrollHeight > prev,
{ timeout: 5000 },
previousHeight,
);
} catch {
break; // No new content arrived: end of feed.
}
await delay(800);
}
const quotes = await extractQuotes(page);
await fs.writeFile('quotes.json', JSON.stringify(quotes, null, 2));
console.log(`Saved ${quotes.length} quotes`);
return quotes;
} finally {
await browser.close();
}
}
scrapeInfiniteScroll();

The loop snapshots the current scrollHeight, scrolls to the bottom, then uses waitForFunction() to block until the height grows. If 5 seconds pass with no change, the loop assumes the feed is exhausted and breaks. Wrapping the work in try/finally means browser.close() runs on every exit path, including unhandled errors. Skip the try/finally and leaked Chromium processes will silently eat memory on long runs.

A successful run prints a single line and writes quotes.json next to the script:

Saved 100 quotes

The output file contains the full extraction in the format extractQuotes() produced. The first few records look like this:

[
{
"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
"author": "Albert Einstein",
"tags": ["change", "deep-thoughts", "thinking", "world"]
},
{
"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
"author": "J.K. Rowling",
"tags": ["abilities", "choices"]
},
{
"text": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”",
"author": "Albert Einstein",
"tags": ["inspirational", "life", "live", "miracle", "miracles"]
}
]

The site returns 100 quotes across 10 pages. If you run fewer reports, the loop terminates early, usually because the waitForFunction timeout fired before a slow page finished loading. Increase the timeout from 5000 ms to 10000 ms and re-run.

On real targets, the selectors will be more unreliable than .quote. Our guide to choosing the right selector for web scraping covers picking between CSS and XPath, and the verification section below covers how to catch silent selector mismatches before they reach production.

Strategy 2: Scroll by viewport height

Some pages don't respond to a single jump to the bottom. They use IntersectionObserver with a threshold that only fires when the sentinel is close to the viewport edge, not anywhere on the page. For those targets, scroll one viewport at a time and watch the item count instead of the document height:

// strategy2-scroll-by-viewport.js
import puppeteer from "puppeteer";
import { delay, launch } from "./helpers.js";
async function scrapeByViewport() {
const browser = await launch(puppeteer);
const page = await browser.newPage();
try {
await page.goto("https://quotes.toscrape.com/scroll");
await page.waitForSelector(".quote");
let previousCount = 0;
let stableScrolls = 0;
while (stableScrolls < 5) {
await page.evaluate(() => window.scrollBy(0, window.innerHeight));
await delay(700);
const currentCount = await page.$$eval(".quote", (els) => els.length);
if (currentCount > previousCount) {
previousCount = currentCount;
stableScrolls = 0;
} else {
stableScrolls++;
}
}
console.log(`Found ${previousCount} quotes via viewport scrolling`);
} finally {
await browser.close();
}
}
scrapeByViewport();

Strategy 2 makes a couple of changes from Strategy 1. First, window.scrollBy(0, window.innerHeight) replaces the jump to the bottom, so each iteration advances by one screen. Second, the termination check counts .quote elements instead of measuring scrollHeight, which is a more direct signal when IntersectionObserver is the loading trigger. After 5 consecutive scrolls without a new item, the loop assumes the feed is exhausted. This pattern is slower per page than Strategy 1, and more reliable on pages that only load when the scroll position is within a small threshold of the bottom.

Strategy 3: Scroll to a specific element

When you know what the end of the feed looks like (a footer, a "no more results" banner, or a sentinel div the page adds when there's no more data), scroll-to-element is the most explicit option. It also avoids open-ended loops:

// strategy3-scroll-to-element.js
import puppeteer from "puppeteer";
import { delay, launch } from "./helpers.js";
async function scrapeToElement(targetSelector) {
const browser = await launch(puppeteer);
const page = await browser.newPage();
try {
await page.goto("https://quotes.toscrape.com/scroll");
await page.waitForSelector(".quote");
let targetVisible = false;
let iterations = 0;
const maxIterations = 50;
while (!targetVisible && iterations < maxIterations) {
await page.evaluate(() => {
const all = document.querySelectorAll(".quote");
if (all.length) all[all.length - 1].scrollIntoView({ block: "end" });
});
await delay(900);
targetVisible = await page.evaluate((sel) => {
const el = document.querySelector(sel);
if (!el) return false;
const rect = el.getBoundingClientRect();
return rect.top < window.innerHeight && rect.bottom > 0;
}, targetSelector);
iterations++;
}
} finally {
await browser.close();
}
}
scrapeToElement("footer");

This loop scrolls the last .quote card into view, waits, and re-checks whether the footer entered the viewport via getBoundingClientRect(). The exit condition reads from the page's own DOM rather than guessing scroll depth. The iterations cap is a safety net, so a misconfigured selector doesn't trap the script in an open loop.

There's one caveat to running this against quotes.toscrape.com/scroll specifically. The test site's initial 10-quote render is short enough that the page footer is already inside the viewport on the first iteration, so the loop exits with only 10 quotes captured rather than the full feed:

Strategy 3: footer visible after 1 iteration, 10 quotes loaded

That's a real-world Strategy 3 failure mode worth noticing. The technique is correct; the selector is wrong for this target. Strategy 3 works best when the end-of-feed signal isn't visible from the start. Examples include an "End of results" banner injected after the last fetch, or a sentinel div that the page adds when there's no more data. On targets without that kind of explicit end-marker, fall back to Strategy 1 or Strategy 2's growth-based termination.

For element interactions (clicking a Load More button instead of waiting for natural scroll-load), the page.locator() method is cleaner than the manual waitForSelector + click pattern. The locator auto-scrolls the element into view, waits for it to be actionable, retries on stale references, then acts as a single call:

await page.locator('button.load-more').click();

That single line handles visibility waiting, scrolling, and the click, which older recipes wrote in 5 or 6 lines around manual waitForSelector() calls.

The Locator API does more than .click(). Other methods matter most when your script interacts with recycled or repositioned elements. This includes clicking a Load More button that re-renders, hovering to trigger a tooltip on a card the framework may have replaced, or filling a search field that resets between batches. The scroll loops earlier in this section don't strictly need the Locator API because they operate on the document itself (window.scrollTodocument.body.scrollHeight) rather than on individual nodes that could go stale. The methods below are worth using when you move from "scroll the page" to "interact with an element on the page":

  • The .wait() method blocks until the locator resolves to a stable, visible element, without performing any action. It's a stricter replacement for waitForSelector.
  • The .hover().fill(text).map(fn), and .filter(fn) methods complete the action set for non-click interactions.
  • The Locator.race([loadMoreBtn, endOfFeedBanner]) method resolves to whichever element appears first. This is useful when a scroll either reveals more content or reveals an end-of-feed sentinel, and you want to act on whichever resolves first.

Chain configurators before the action to override the defaults:

await page
.locator('button.load-more')
.setTimeout(10_000)
.setEnsureElementIsInTheViewport(true)
.setWaitForStableBoundingBox(true)
.click();

The setWaitForStableBoundingBox(true) configurator waits until the element stops moving across 2 consecutive animation frames, which removes the race between a click and a CSS transition. The setEnsureElementIsInTheViewport(true) setting is the default for click. Set it explicitly when you want the same auto-scroll behavior for non-click actions.

Handling lazy-loaded images

Lazy-loaded image handling is a separate problem from infinite-scroll data loading. Images sit in the DOM with loading="lazy" or behind an IntersectionObserver, and the browser defers fetching them until they enter the viewport. A scroll-to-bottom loop reaches the last item, but the images several screens up may never have been visible long enough to load.

The solution requires a second pass:

// lazy-image-pass.js
import { delay } from './helpers.js';
export async function loadAllImages(page) {
const viewportHeight = await page.evaluate(() => window.innerHeight);
const fullHeight = await page.evaluate(() => document.body.scrollHeight);
for (let y = 0; y <= fullHeight; y += viewportHeight) {
await page.evaluate(scrollY => window.scrollTo(0, scrollY), y);
await delay(400);
}
await page.waitForNetworkIdle({ idleTime: 1000, timeout: 30000 });
}

After your data-collection scroll loop finishes, call loadAllImages(page) before reading src attributes or capturing screenshots. The loop drives the viewport from top to bottom in screen-sized steps, so every image enters the viewport and page.waitForNetworkIdle() blocks until image requests stop arriving. Skip the network-idle wait, and you collect placeholder URLs (or empty src attributes) for any images that haven't finished downloading.

Verify your loop actually worked

With the scroll loop and lazy images handled, the next concern is making sure it worked. The most common production failure in scroll scraping isn't a crash. It's a silent partial extraction: the loop terminates early, returns 47 items out of an expected 5,000, and the downstream pipeline produces wrong results without anyone realizing. Verification has to happen at the scraper, before the data leaves the process.

Log a structured line per iteration

The cheapest signal that something is wrong is the count progression. If you log every iteration, a run that should produce 10 lines but ends after 3 is obvious in the output:

let iteration = 0;
while (true) {
const previousHeight = await page.evaluate(() => document.body.scrollHeight);
const previousCount = await page.$$eval('.quote', els => els.length);
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
try {
await page.waitForFunction(
prev => document.body.scrollHeight > prev,
{ timeout: 5000 },
previousHeight,
);
} catch {
console.log(JSON.stringify({
event: 'loop-exit',
reason: 'no-height-growth',
iteration,
total_items: previousCount,
}));
break;
}
const newCount = await page.$$eval('.quote', els => els.length);
console.log(JSON.stringify({
event: 'iteration',
ts: new Date().toISOString(),
iteration,
added: newCount - previousCount,
total: newCount,
}));
iteration++;
}

A real run against the test site produces a noisier sequence than you might expect. Here's an actual capture:

{"event":"iteration","ts":"2026-05-21T05:56:30.950Z","iteration":0,"added":10,"total":20}
{"event":"iteration","ts":"2026-05-21T05:56:30.977Z","iteration":1,"added":0,"total":30}
{"event":"iteration","ts":"2026-05-21T05:56:30.992Z","iteration":2,"added":0,"total":30}
{"event":"iteration","ts":"2026-05-21T05:56:31.314Z","iteration":3,"added":70,"total":100}
{"event":"iteration","ts":"2026-05-21T05:56:31.353Z","iteration":4,"added":0,"total":100}
{"event":"iteration","ts":"2026-05-21T05:56:31.422Z","iteration":5,"added":0,"total":100}
{"event":"loop-exit","reason":"no-height-growth","iteration":6,"total_items":100}

3 details in that output are worth knowing. First, added and total count items at measurement time, not at scroll-trigger time, and the page can load multiple batches between the scroll firing and the count being captured. That's why iteration 0 already shows 20 items (the initial 10 plus a batch that arrived during the scroll). Second, added: 0 in a healthy run doesn't mean the loop is broken. It just means the height grew, but no new .quote nodes appeared between the previous and current count snapshots. 

This usually happens because the previous iteration already counted the newest batch. Third, large jumps like added: 70 happen when the page loads several batches faster than expected between counts. The signal you watch for isn't per-iteration added. It's whether the total eventually stops increasing at the expected final number, and whether the loop-exit line shows that plateau.

A truly failing run looks different from the noisy-but-healthy one above. An exit line at iteration: 3 with total_items: 30 on a feed you expect to produce 100 items is the clearest sign of early termination. Another failure mode is a sequence where total climbs steadily, then jumps back down (item recycling, virtual-list reset), which means your selector is matching transient nodes the framework swaps out. JSON-line output costs nothing to produce and parses cleanly when you pipe it through jq or load it into a notebook. Structured logs beat free-text logs the moment you have more than one run to compare.

Cross-check against the API's own termination signal

When the target exposes a JSON endpoint, the API tells you whether there's more data. Watch the responses passively and capture the last has_next (or equivalent) value:

let lastApiSignal = { page: null, hasNext: null };
page.on('response', async response => {
if (response.url().includes('/api/quotes')) {
try {
const json = await response.json();
lastApiSignal = { page: json.page, hasNext: json.has_next };
} catch {}
}
});
// ... run the scroll loop ...
if (lastApiSignal.hasNext === true) {
console.warn(JSON.stringify({
event: 'suspicious-exit',
last_api_page: lastApiSignal.page,
note: 'Loop terminated but the API said more data was available.',
}));
}

When the scroll loop's termination logic disagrees with the API's own signal, the API is almost always right. This catches the common case where the wait timeout fired during a slow page load, and the loop exited even though more data was coming. Here's the warning you'll see when it triggers:

{"event":"suspicious-exit","last_api_page":3,"note":"Loop terminated but the API said more data was available."}

That's the signal to either raise the waitForFunction timeout, or switch to count-based termination (Strategy 2's pattern) that doesn't depend on network speed.

Snapshot the page when the loop exits

When the count looks wrong, the failure is usually visible on the page. The page might show a CAPTCHA banner, a 403 overlay, or a stuck loading spinner. A full-page screenshot at exit time tells you which:

const ts = new Date().toISOString().replace(/[:.]/g, '-');
await page.screenshot({ path: `loop-exit-${ts}.png`, fullPage: true });

Run all 3 checks during development. Once you've seen what success looks like and have confidence in the termination signal, you can reduce it to just the count log in production.

Speed up the scroll loop

Once verification is solid, the next concern is speed. The strategies above work, but they download everything the page requests. That includes avatars, fonts, stylesheets, ad pixels, and analytics calls. On quotes.toscrape.com, that's a few kilobytes, but on a real product catalog or social timeline, it's megabytes per scroll. 3 Puppeteer features cut that overhead and remove the polling that makes scroll loops slow.

Block non-essential resources with request interception

Calling page.setRequestInterception(true) routes every outgoing request through a Node-side handler that can continueabort, or respond. To run a text-only scroll loop, abort the image, media, font, and stylesheet requests:

await page.setRequestInterception(true);
page.on('request', req => {
if (req.isInterceptResolutionHandled()) return;
const block = ['image', 'media', 'font', 'stylesheet'];
if (block.includes(req.resourceType())) {
req.abort();
} else {
req.continue();
}
});

The req.isInterceptResolutionHandled() check guards against double-handling when more than one listener is registered on request. The performance gain on real targets is significant. On image-heavy feeds, it routinely reduces total bytes by an order of magnitude, which translates into faster scroll triggers and lower memory pressure across long runs.

There's one caveat. This approach is incompatible with the lazy-image pass from the previous section. Aborting image requests removes the very fetches the second pass relies on. Run a fast text-only scrape or a full-asset scrape, not both in the same browser session.

Block by URL pattern via CDP

For more precise control without using full interception, the Chrome DevTools Protocol exposes a URL-pattern blocklist that runs inside the browser process:

const client = await page.createCDPSession();
await client.send('Network.enable');
await client.send('Network.setBlockedURLs', {
urls: [
'*google-analytics*',
'*doubleclick*',
'*segment.io*',
'*sentry.io*',
'*hotjar*',
],
});

The page.createCDPSession() call exposes raw CDP. The Network.setBlockedURLs method adds near-zero per-request overhead because the browser drops matching requests itself, unlike setRequestInterception, which round-trips every request through your Node process. Use this when the block list is known upfront, and you only want to filter (not modify) traffic.

Push-based detection with MutationObserver and exposeFunction

By default, page.waitForFunction runs your predicate on every animation frame (about every 16ms), which is continuous CPU work even when nothing has changed. A MutationObserver runs your callback only when the DOM actually mutates, so the CPU stays idle between batches.

The page.exposeFunction(name, fn) helper installs a Node-side callback on the page's window object. The page calls it like a normal function. Puppeteer routes the arguments back through CDP. Together with MutationObserver, you get push-based notifications from the page to your script:

import { EventEmitter } from 'events';
const events = new EventEmitter();
await page.exposeFunction('__onNewQuotes', count => events.emit('batch', count));
await page.evaluate(() => {
const container = document.querySelector('.quotes-container') || document.body;
new MutationObserver(mutations => {
const added = mutations
.flatMap(m => Array.from(m.addedNodes))
.filter(n => n.nodeType === 1 && n.matches?.('.quote'));
if (added.length) window.__onNewQuotes(added.length);
}).observe(container, { childList: true, subtree: true });
});
// Now drive the scroll and react to batches as they land:
let total = 0;
events.on('batch', count => {
total += count;
console.log(`Got ${count} new quotes (total ${total})`);
});
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

Against the test site, the batch handler prints a line per page load as the data arrives, with no polling between them:

Got 10 new quotes (total 10)
Got 10 new quotes (total 20)
Got 10 new quotes (total 30)
Got 10 new quotes (total 40)
Got 10 new quotes (total 50)
Got 10 new quotes (total 60)
Got 10 new quotes (total 70)
Got 10 new quotes (total 80)
Got 10 new quotes (total 90)
Got 10 new quotes (total 100)

Each line fires the instant the API response gets parsed, and the new .quote nodes attach, rather than waiting for the next polling interval. On the test site, the difference is small. On a real feed with strict rate limiting (one batch every few seconds), push-based detection changes the page from running the polling predicate on every animation frame to being essentially idle between batches.

Use this when the loading cadence is irregular (some pages send a batch every few seconds, some load several at once), or when you want to react to specific node patterns rather than scrollHeight changes. The trade-off is that the termination signal becomes "no batches for N seconds" instead of "scrollHeight didn't grow", so you still need a timeout on the Node side, just not a per-iteration polling timer.

Human-like scrolling and anti-detection

After the scroll loop is fast and verified, the next concern is whether it can bypass detection. Major platforms, including social feeds, large eCommerce sites, and professional networks, use scroll behavior as a component of their bot risk scoring. The default Puppeteer scrolling generates several signals that detection scripts can flag. For example, the timing between scrolls is constant to the millisecond, and the scroll distance is identical on every iteration.

Furthermore, there's an absence of mouse movement between scrolls, and no micro-pauses, re-reads, or upward scrolls are present. These patterns are rarely seen in a real user visit. The fix is to introduce variance at every level a detection script can measure. Although none of the techniques below guarantee that you won't be flagged (because no client-side method can), they collectively remove the most obvious statistical fingerprints.

Inject variance in timing and distance

A real visitor doesn't scroll at a perfectly even rate and doesn't move the same number of pixels each time. These 2 short helpers replace the constants:

const randomDelay = (min, max) =>
new Promise(r => setTimeout(r, Math.random() * (max - min) + min));
await randomDelay(600, 1800);
await page.evaluate(() => {
const factor = 0.6 + Math.random() * 0.4;
window.scrollBy(0, window.innerHeight * factor);
});

The displayed values are for illustrative purposes only. They must be tuned according to your target application. The core principle to follow is variance, and not the specific numerical values. You should use a different scroll amount and a different pause length during each iteration. The removal of the constant-distance and constant-delay signals represents one of the most accessible and low-cost anti-detection improvements.

Add mouse movement

Real user activity involves mouse movement during reading. The page.mouse.move() function generates CDP input events, making the activity indistinguishable from hardware mouse input in DevTools Protocol traces:

async function humanMouseMove(page) {
const x = 100 + Math.random() * 1000;
const y = 100 + Math.random() * 700;
await page.mouse.move(x, y, { steps: 10 + Math.floor(Math.random() * 10) });
}

The steps parameter breaks the movement into intermediate points. This generates a curved path and avoids the instantaneous "teleport" effect.

Use mouse.wheel() for natural scrolling

The window.scrollTo() call is purely programmatic. In contrast, page.mouse.wheel({ deltaY: n }) dispatches a real wheel event, which more accurately matches a visitor scrolling with a trackpad or mouse wheel:

await page.mouse.move(600, 400);
await page.mouse.wheel({ deltaY: 400 + Math.random() * 200 });

Wheel events follow the same code paths that real input devices trigger. It's a stronger match than when you use the programmatic scrollTo() call.

Insert occasional upward scrolls

In order to better simulate the behavior of actual users who frequently return to previous sections for re-reading, it's recommended to implement an occasional upward scroll of a minor distance after every 5 or 6 iterations before proceeding further down the page.

if (Math.random() < 0.15) {
await page.mouse.wheel({ deltaY: -(150 + Math.random() * 200) });
await randomDelay(400, 900);
}

This breaks the monotonic-downward pattern that simple detection heuristics watch for.

Set a realistic viewport and user agent

The default User-Agent for headless: true and headless: 'shell' contains the HeadlessChrome substring, which acts as a primary detection signal for bot scanners. It's necessary to override the User-Agent to remove the Headless prefix as a minimum anti-detection measure. The Chrome version in the User-Agent string must also remain current, because a version that is older than the bundled Chromium binary appears unusual, and hardcoded versions in documentation quickly become obsolete. You should extract the major version number from browser.version() to ensure the User-Agent string updates automatically with each Puppeteer upgrade.

await page.setViewport({ width: 1920, height: 1080 });
const major = (await browser.version()).match(/\/(\d+)\./)?.[1];
await page.setUserAgent(
`Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ` +
`(KHTML, like Gecko) Chrome/${major}.0.0.0 Safari/537.36`,
);

The User Agent must align with the actual Chromium build to avoid version-mismatch signals across the User-Agent string, navigator.userAgentData, and the TLS ClientHello. If you need a specific historical version (some sites serve a different feed to older Chrome), hardcode that major instead of reading it dynamically.

Collectively, these patterns result in a scroll loop that simulates the behavior of a slightly distracted human user. For a deeper background on what detection scripts measure and why each control matters, see our guide to anti-bot systems. If scroll detection escalates to a CAPTCHA, the Puppeteer CAPTCHA guide is the next step.

The 2026 stealth ecosystem: a stealth plugin alone isn't enough

Many anti-detection guides recommend puppeteer-extra-plugin-stealth. However, this is only a partial solution. The plugin patches the visible Chromium headless signals (navigator.webdriver, missing plugin lists, chrome-object differences in userland). Conversely, modern web application firewalls (WAFs) inspect signals that the plugin doesn’t cover, including traces from the Runtime.Enable CDP command that Puppeteer sends to the browser during startup. Cloudflare, DataDome, and PerimeterX are reported in the scraping community to check for this specific signal, among others.

Some functionality provided by the plugin in userland can also be accessed at the browser level. For example, launching with args: [‘–disable-blink-features=AutomationControlled’] suppresses navigator.webdriver at the Chromium binding layer. This approach is more difficult to detect than a JavaScript patch that overrides the property after the page loads. The effect is observable in one line:

// Default launch (no flag): the property reveals automation.
await page.evaluate(() => navigator.webdriver);
// → true
// With args: ['--disable-blink-features=AutomationControlled']
await page.evaluate(() => navigator.webdriver);
// → false

The primary objective of the AutomationControlled flag is to align the behavior of the automated instance with a standard browser environment. By ensuring that navigator.webdriver returns false, the scraper effectively masks the automation signal at the binding layer. In contrast to userland patches, such as those utilized by the stealth plugin, this method prevents detection via Object.getOwnPropertyDescriptor inspections on the Navigator.prototype. To achieve maximum coverage, developers should combine this Chromium flag with the stealth plugin rather than relying on a single approach.

The current options worth knowing, with their maintenance status (it matters here):

  • puppeteer-extra-plugin-stealth. This represents a legacy baseline that patches common userland signals. Although widely documented, its development has stalled since early 2023, making it insufficient for bypassing modern anti-bot systems that inspect CDP-level traces.
  • rebrowser-patches. This utility modifies the Puppeteer source code to suppress the Runtime.Enable leak. It's currently the most effective technical solution for maintaining Puppeteer compatibility while mitigating specific CDP-based detection fingerprints.
  • puppeteer-real-browser. This library interfaces with an existing Chrome installation via CDP attachment. By utilizing a standard browser binary instead of Chromium-for-Testing, it eliminates numerous headless tells, though it requires a more complex environment configuration.

In summary, client-side stealth techniques possess built-in limitations. In-browser modifications can reduce fingerprints, but high-security targets eventually require network-level intervention. Developers must implement residential or mobile proxies, or managed unblocking services, to ensure reliable access when client-side fingerprints are no longer sufficient to bypass protection layers.

When human-like scrolling isn't enough

Browser-side controls cover scroll behavior, mouse movement, and the visible fingerprint. They don’t modify the IP address or the TLS handshake. On hard targets, IP reputation and rate limits are evaluated prior to any in-page checks, and a well-crafted scroll loop on a flagged IP still gets blocked.

When that occurs, the typical escalation involves residential proxies with rotating IPs. Decodo's residential proxies are implemented directly via Puppeteer launch arguments:

const browser = await puppeteer.launch({
headless: true,
defaultViewport: { width: 1920, height: 1080 },
args: ['--proxy-server=http://gate.decodo.com:7000'],
});
const page = await browser.newPage();
await page.authenticate({
username: process.env.PROXY_USER,
password: process.env.PROXY_PASS,
});

Any provider that exposes an HTTP proxy with basic authentication works the same way. Only the gateway hostname and credentials change.

Scrolling solved, IPs sorted

Your Puppeteer script handles the scroll. Decodo's residential proxies handle the part where the site blocks you after the third page load. 115M+ rotating IPs, 195+ countries.

Real-world walkthrough on Dev.to

Sandboxes serve to demonstrate mechanics. Dev.to, a real production site, implements infinite scroll and features a publicly documented API at developers.forem.com. The public API currently responds without requiring authentication or aggressive rate limiting, suitable for the read volumes generated by a tutorial. Executing the same diagnostic playbook against Dev.to shows which technical patterns translate effectively outside of controlled test environments.

Diagnose Dev.to

Here's what happens when you run the diagnostic playbook from earlier in this article against Dev.to:

  • Initial HTML contains an initial batch of articles in .crayons-story cards.
  • Scroll trigger is visible in DevTools Network → Fetch/XHR. Each scroll-load fires a fetch in the same shape:
GET https://dev.to/search/feed_content?per_page=15&page=0&sort_by=hotness_score&sort_direction=desc
GET https://dev.to/search/feed_content?per_page=15&page=1&sort_by=hotness_score&sort_direction=desc
GET https://dev.to/search/feed_content?per_page=15&page=2&sort_by=hotness_score&sort_direction=desc
  • URL on scroll doesn't change. That's infinite scroll, not pagination.
  • Public API exists at https://dev.to/api/articles with page and per_page parameters, and no auth required for read access.
  • Termination signal is an empty array on pages past the end (public API) or an empty result on the internal endpoint.

That diagnosis points to one clear path, and to a separate note about why the internal endpoint isn't worth pursuing on this target.

The documented public API

When the publisher already exposes what you need, skip the browser entirely:

// fetch-devto-articles.js (no browser required)
let articles = [];
for (let page = 1; page <= 5; page++) {
const res = await fetch(`https://dev.to/api/articles?per_page=30&page=${page}`);
const batch = await res.json();
if (!Array.isArray(batch) || batch.length === 0) break;
articles.push(...batch);
await new Promise(r => setTimeout(r, 400));
}
console.log(`Captured ${articles.length} articles via public API`);
console.log(`First title: ${articles[0]?.title}`);
console.log(`First author: ${articles[0]?.user?.name} (@${articles[0]?.user?.username})`);

Here's the actual output from running this against the live site:

Captured 150 articles via public API
First title: Great Little Software: Rackula
First author: Valeria (@valeriavg)

That's 150 fully-structured article objects (title, author, tags, cover image, URL, reading time, reaction counts) with 1 Node process, no Chromium, no scroll loop, in roughly 2 seconds. When a public API gives you the fields you need, it's usually the cleanest option.

A note on the internal endpoint

You might wonder whether intercepting dev.to's scroll-triggered /search/feed_content endpoint with the Auth Replay pattern from the earlier section would be even faster, or would give you fields the public API doesn't. Tested against the live site, a Node-side replay of that endpoint with captured cookies returns an empty result:

Captured 0 feed items via internal endpoint

The endpoint validates something beyond cookies (probably the sec-ch-ua Client Hints headers that only real browsers send, or a session attribute set by JavaScript that doesn't persist into the cookie jar). This is the failure mode the Auth Replay section warned about. Token-or-signature aspects aren't reproducible from a static replay. The general fallback (run the fetch from inside an active Puppeteer page so it inherits the live session) works. Still, for dev.to specifically there's no reason to do so: the public API already returns richer fields, and using Puppeteer just to call the internal endpoint costs more than it saves. On Developer-friendly sites that publish a usable public API, the cheapest path is usually the right one.

The lesson and a maintenance note

The diagnostic step upfront decided the path: dev.to's public API delivers everything its DOM exposes. Defaulting straight to a scroll loop without first checking whether a public API exists is one of the most common mistakes readers make on real targets. The earlier patterns (Strategy 1, 2, 3, and Auth Replay) still apply to targets without a clean public API. dev.to just isn't one of them, and that's the lesson worth taking from the walkthrough.

2 maintenance notes are worth knowing. First, dev.to's public API response shape and the internal endpoint's path may both change between when this article was written and when you read it. The diagnostic process stays valid; the specific URLs don't. Run the diagnosis yourself when the article's specifics don't match what your DevTools shows. Second, even scrape-friendly sites publish acceptable-use guidelines: add a small delay between requests, identify your bot in the User-Agent at scale, and prefer the public API over any internal endpoint when both deliver the data you need.

Alternative approaches

Driving a headless browser is the most general approach, and on some targets it's the only practical option. On others, you have cleaner choices. If you're still choosing between browser-automation tools, see Puppeteer vs. Playwright or Puppeteer vs. Selenium for the detailed comparisons.

Intercept the underlying API

Most infinite scroll pages fetch data from a JSON endpoint and inject the result into the DOM. If you can call that endpoint directly, you skip the rendering step entirely, which is faster and more reliable at scale.

To find the endpoint, open DevTools → Network → Fetch/XHR, scroll the page, and watch what fires. On quotes.toscrape.com/scroll the endpoint is quotes.toscrape.com/api/quotes?page=N and the response looks like this:

{
"has_next": true,
"page": 1,
"tag": null,
"top_ten_tags": [["love", 14], ["inspirational", 13], ["life", 13]],
"quotes": [
{
"author": {
"name": "Albert Einstein",
"goodreads_link": "/author/show/9810.Albert_Einstein",
"slug": "Albert-Einstein"
},
"tags": ["change", "deep-thoughts", "thinking", "world"],
"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"
}
]
}

These 2 fields drive the loop logic: has_next tells you whether to request another page, and the integer page lets you increment without parsing the URL. The actual quote payload (author, tags, text) is structurally richer than the DOM extraction (you get the goodreads_link and slug fields the page never renders), which is part of why API interception often beats DOM scraping when the endpoint exists.

Puppeteer gives you 2 ways to capture those responses. The first is a passive listener that fires whenever a matching response arrives:

page.on('response', async response => {
const url = response.url();
if (url.includes('/api/quotes?page=')) {
const json = await response.json();
console.log(`Captured page ${json.page} with ${json.quotes.length} quotes`);
}
});

This works for opportunistic capture, but the listener isn't tied to your scroll action, so timing is loose. The deterministic pattern is page.waitForResponse() with a predicate, awaited in parallel with the scroll that triggers the request:

const [response] = await Promise.all([
page.waitForResponse(
r => r.url().includes('/api/quotes') && r.status() === 200,
{ timeout: 10_000 },
),
page.mouse.wheel({ deltaY: 2000 }),
]);
const batch = await response.json();
console.log(`Captured page ${batch.page} with ${batch.quotes.length} quotes`);

Each iteration is deterministic: the promise resolves when the matching response arrives, so you don't need delay() calls or waitForFunction() polling. The Promise.all call matters here. It registers the listener before the scroll fires, which avoids the race where the response arrives before your code starts listening.

Wrapping that in a loop and watching the has_next field gives you a clean termination condition. Against the test site, a complete run captures every page in order and stops when the feed is exhausted:

Captured page 1 with 10 quotes (has_next: true)
Captured page 2 with 10 quotes (has_next: true)
Captured page 3 with 10 quotes (has_next: true)
Captured page 4 with 10 quotes (has_next: true)
Captured page 5 with 10 quotes (has_next: true)
Captured page 6 with 10 quotes (has_next: true)
Captured page 7 with 10 quotes (has_next: true)
Captured page 8 with 10 quotes (has_next: true)
Captured page 9 with 10 quotes (has_next: true)
Captured page 10 with 10 quotes (has_next: false)

The same 100 quotes that took roughly 24 randomized scroll iterations to collect through the DOM now arrive in 10 deterministic API calls. That ratio is what people mean when they say API interception "often beats DOM scraping". You get the same data with fewer round trips and no scroll timing to tune.

Once you confirm the endpoint shape, you can drop Puppeteer entirely and call the API with plain fetch or axios in a tight loop, incrementing the page parameter until the response returns an empty array. The trade-offs are:

  • Advantages. You skip DOM parsing and scroll timing, get structured JSON output, and have fewer moving parts. It's also faster.
  • Limitations. Some APIs require auth tokens, cursor parameters, signed query strings, or session cookies that you have to extract first. Some APIs also change shape without notice.

Auth Replay: replaying authenticated API calls

The previous example works on a public test endpoint with no auth. Real targets usually wrap the same kind of endpoint in some combination of session cookies, CSRF tokens, signed query parameters, and custom headers. Switching from a passive Puppeteer listener to a direct fetch loop means cloning that authentication state into your Node code.

A workable pattern involves the following steps: let Puppeteer load the page and trigger one scroll, capture the resulting request as a template, extract the browser's cookies, then replay the template from Node while only varying the cursor parameter each round:

import puppeteer from 'puppeteer';
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
let template = null;
page.on('request', req => {
if (req.url().includes('/api/feed') && !template) {
template = {
url: req.url(),
method: req.method(),
headers: req.headers(),
body: req.postData(),
};
}
});
await page.goto('https://example.com/feed');
await page.waitForSelector('.feed-item');
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await page.waitForResponse(r => r.url().includes('/api/feed'));
const cookies = await page.cookies(template.url);
const cookieHeader = cookies.map(c => `${c.name}=${c.value}`).join('; ');
await browser.close();
// Replay against the API directly, varying only the cursor each round.
let cursor = null;
const items = [];
while (true) {
const body = template.body ? JSON.parse(template.body) : {};
if (cursor) body.cursor = cursor;
const res = await fetch(template.url, {
method: template.method,
headers: { ...template.headers, cookie: cookieHeader },
body: template.body ? JSON.stringify(body) : undefined,
});
const data = await res.json();
items.push(...(data.items ?? []));
if (!data.has_next) break;
cursor = data.next_cursor;
}
console.log(`Captured ${items.length} items via replayed API`);

3 failure modes are worth knowing before you use this pattern in production:

  • Session cookies expire mid-run on long scrapes. When authentication errors start coming back, re-open Puppeteer briefly to refresh the session.
  • CSRF tokens are sometimes single-use, which means you can't replay them statically and must re-fetch the token from the page before each API call.
  • The API has its own rate limits, usually stricter than the browser-paced ceiling. Adding a small delay between requests is the difference between "cleanest path" and "instant ban".

When token refresh or signing logic gets more expensive than driving the scroll, drive the scroll. API interception is a large speedup when the auth mechanism is simple. It's a wash when it requires constant maintenance.

API interception is often the cleanest path. Try it before you commit to DOM scraping.

The Playwright equivalent

Playwright uses the same conceptual model as Puppeteer (driving a browser engine through a remote-debugging protocol), with slightly different API details and broader engine support:

// playwright-scroll.js
import { chromium } from 'playwright';
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://quotes.toscrape.com/scroll');
let previousHeight = 0;
while (true) {
const currentHeight = await page.evaluate(() => document.body.scrollHeight);
if (currentHeight === previousHeight) break;
previousHeight = currentHeight;
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await page.waitForLoadState('networkidle');
}
await browser.close();

3 Playwright conveniences are worth knowing:

  • The page.waitForLoadState('networkidle') method replaces the manual waitForFunction and delay pair, which keeps the inner loop shorter.
  • The locator.scrollIntoViewIfNeeded() method is a cleaner scroll-to-element API than Puppeteer's element.scrollIntoView() through page.evaluate().
  • The page.mouse.wheel(0, scrollAmount) method serves the same purpose as Puppeteer's page.mouse.wheel() for human-like scrolling.

The largest functional difference is multi-browser support: Playwright drives Chromium, Firefox, and WebKit from the same script, which matters when the target's bot detection behaves differently per engine. For the full walkthrough, see the Playwright web scraping tutorial.

Community npm packages: state of the ecosystem

You can find several community packages that help you handle the scroll loop, such as puppeteer-autoscroll-down and puppeteer-infinite-scroller. However, the situation in 2026 is quite different for each one. The puppeteer-autoscroll-down package is still active because it received updates in the last year, and it works well with the current Puppeteer versions. In contrast, puppeteer-infinite-scroller is now outdated because it has not been updated for about 2.5 years, so it doesn’t support the headless mode changes from v22. Other available wrappers are in between these two states of maintenance.

Even when a package is current, the trade-off remains constant. Since the abstractions these libraries offer are short (e.g., 20 lines for a generic scroll loop, 40 lines for one with item-count termination), implementing the pattern in your own code is less costly than troubleshooting a wrapper that breaks due to a Puppeteer upgrade or a target-site DOM change. Use npm packages when you prefer not to write the loop yourself and are willing to replace them when they fall behind. Write the loop yourself when you require direct control over the termination signal, the timing tuning, and the upgrade path.

Putting it all together

Now that you've seen the techniques applied to a real site, here's the complete working scrape. It combines viewport-step scrolling with height-change termination, the lazy-image pass, and the human-like patterns into a single runnable file. Save as scrape-infinite-scroll.js and run with node scrape-infinite-scroll.js:

// scrape-infinite-scroll.js
// Full working example. Targets quotes.toscrape.com/scroll.
// - Viewport-step scroll loop with height-change termination
// - Human-like patterns: random delays, varied scroll distance,
// mouse movement, occasional upward scrolls, dynamic UA tied to bundled Chromium
// - Second pass for lazy-loaded images
// - try/finally cleanup, JSON output
import puppeteer from 'puppeteer';
import fs from 'fs/promises';
const delay = ms => new Promise(r => setTimeout(r, ms));
const randomDelay = (min, max) => new Promise(r => setTimeout(r, Math.random() * (max - min) + min));
async function humanMouseMove(page) {
const x = 100 + Math.random() * 1000;
const y = 100 + Math.random() * 700;
await page.mouse.move(x, y, { steps: 10 + Math.floor(Math.random() * 10) });
}
async function extractQuotes(page) {
return page.evaluate(() => {
const cards = document.querySelectorAll('.quote');
return Array.from(cards).map(card => ({
text: card.querySelector('.text')?.innerText.trim(),
author: card.querySelector('.author')?.innerText.trim(),
tags: Array.from(card.querySelectorAll('.tag')).map(t => t.innerText.trim()),
}));
});
}
async function loadAllImages(page) {
const viewportHeight = await page.evaluate(() => window.innerHeight);
const fullHeight = await page.evaluate(() => document.body.scrollHeight);
for (let y = 0; y <= fullHeight; y += viewportHeight) {
await page.evaluate(scrollY => window.scrollTo(0, scrollY), y);
await delay(400);
}
await page.waitForNetworkIdle({ idleTime: 1000, timeout: 30000 });
}
async function scrapeInfiniteScroll() {
const browser = await puppeteer.launch({
headless: true,
defaultViewport: { width: 1920, height: 1080 },
args: ['--disable-blink-features=AutomationControlled'],
});
const page = await browser.newPage();
const major = (await browser.version()).match(/\/(\d+)\./)?.[1];
await page.setUserAgent(
`Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ` +
`(KHTML, like Gecko) Chrome/${major}.0.0.0 Safari/537.36`,
);
try {
await page.goto('https://quotes.toscrape.com/scroll', {
waitUntil: 'domcontentloaded',
});
await page.waitForSelector('.quote');
let iteration = 0;
let stableScrolls = 0;
let lastHeight = await page.evaluate(() => document.body.scrollHeight);
while (stableScrolls < 4) {
await humanMouseMove(page);
await page.evaluate(() => {
const factor = 0.6 + Math.random() * 0.4;
window.scrollBy(0, window.innerHeight * factor);
});
if (iteration > 0 && Math.random() < 0.15) {
await page.mouse.wheel({ deltaY: -(150 + Math.random() * 200) });
await randomDelay(400, 900);
}
await randomDelay(600, 1800);
const currentHeight = await page.evaluate(() => document.body.scrollHeight);
if (currentHeight > lastHeight) {
lastHeight = currentHeight;
stableScrolls = 0;
} else {
stableScrolls++;
}
iteration++;
}
await loadAllImages(page);
const quotes = await extractQuotes(page);
await fs.writeFile('quotes.json', JSON.stringify(quotes, null, 2));
console.log(`Saved ${quotes.length} quotes across ${iteration} iterations`);
return quotes;
} finally {
await browser.close();
}
}
scrapeInfiniteScroll().catch(err => {
console.error('Scrape failed:', err);
process.exit(1);
});

A clean run against quotes.toscrape.com/scroll prints one line and produces a 100-quote quotes.json:

Saved 100 quotes across 22 iterations

The iteration count varies between runs because the scroll distance and delays are randomized. The site has 100 quotes split into 10 API-paginated batches. With 60 to 100% viewport scrolls, each batch typically takes 1 or 2 iterations to trigger, so anything in the low to mid 20s is normal here. If the count climbs into the 40s on a target that should be smaller, the stability counter is waiting longer than necessary, and you can lower the stableScrolls < 4 threshold to 3.

To route the same script through a proxy, add the –proxy-server arg and page.authenticate() pattern from the anti-detection section above. Provide any provider's gateway hostname and credentials, and the same loop runs through the proxy pool.

Next steps

You should select the optimal scraping strategy based on the target's architecture. If DevTools reveals a structured JSON endpoint during scrolling, prioritize API interception to bypass Puppeteer entirely. If the data requires JavaScript rendering, implement Strategy 1 (scroll-to-bottom with height verification). Utilize Strategy 2 for pages employing IntersectionObserver with a small threshold, and apply Strategy 3 when a distinct end-of-feed signal is present.

Before you run anything at scale, apply 3 production defaults:

  • Wrap the scrape loop in try/finally so browser.close() runs on every exit path, including unhandled errors.
  • Log a timestamped record per scroll iteration so a failure at quote 1,847 of 5,000 tells you exactly where it stopped.
  • Run a small test pass against your real target before scaling. A selector that works on quotes.toscrape.com may not match the structure of the real site.

When local stealth is insufficient (IP reputation, scaling concurrency, CDN-grade anti-bot), configure the browser launch with proxy credentials. The --proxy-server flag and page.authenticate() pattern shown earlier works with any provider that exposes an HTTP proxy with basic auth.

For the complete Puppeteer API details, see the official Puppeteer documentation.

Bottom line

The simplest method through infinite scroll is usually the one you see in DevTools first. When that path requires a browser, the patterns in this guide are effective for most production targets. When it requires a residential pool, Decodo's residential proxies are compatible with the –proxy-server arg and page.authenticate() pattern shown earlier.

Skip the Puppeteer maintenance

Infinite scroll, lazy loading, anti-bot detection. Decodo's Web Scraping API renders the full page and returns the data. No browser to manage, no selectors to debug.

Share article:

About the author

Justinas Tamasevicius

Director of Engineering

Justinas Tamaševičius is Director of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.

Connect with Justinas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may belinked therein.

Frequently asked questions

How do I scrape a website with infinite scroll?

You use a headless browser like Puppeteer to run the page's JavaScript and trigger scroll-loaded content, or you intercept the JSON API the page calls on each scroll. If DevTools shows fetch requests on scroll, the API path is usually cleaner and faster than DOM scraping.

How does Puppeteer handle infinite scroll?

Puppeteer drives a real Chromium instance so that it can call window.scrollTo() or page.mouse.wheel() inside the page context. The scroll triggers the site's native JavaScript, which fetches and injects new DOM nodes. Puppeteer then waits via page.waitForFunction() before extracting the rendered content.

What is the best way to scroll to the bottom of a page in Puppeteer?

For most pages, use page.evaluate(() => window.scrollTo(0, document.body.scrollHeight)). For IntersectionObserver loading, scroll by viewport height with window.scrollBy(). For element-specific scrolling, use element.scrollIntoView() inside page.evaluate(). Don't use page.waitForTimeout(); it was removed in v22.

How do I scrape lazy-loaded images with Puppeteer?

After the data scroll loop finishes, do a second top-to-bottom pass in viewport-height increments. This triggers IntersectionObserver events for every image. Finish with page.waitForNetworkIdle() before reading src attributes, so the image fetches have time to settle.

Is Playwright better than Puppeteer for infinite scroll?

Both work well. Playwright's built-in waitForLoadState('networkidle') and locator.scrollIntoViewIfNeeded() are slightly more convenient than Puppeteer's manual waits. Puppeteer has a larger codebase and more community examples. For Chromium-only targets, either is a reasonable choice.

Playwright Web Scraping: A Practical Tutorial

Web scraping can feel like directing a play without a script – unpredictable and chaotic. That’s where Playwright steps in: a powerful, headless browser automation tool that makes scraping modern, dynamic websites smoother than ever. In this practical tutorial, you’ll learn how to use Playwright to reliably extract data from any web page.

Mastering Web Scraping Pagination: Techniques, Challenges, and Python Solutions

Pagination is the system websites use to split large datasets across multiple pages for faster loading and better navigation. In web scraping, handling pagination is essential to capture complete datasets rather than just the first page of results. This guide explains what pagination is, the challenges it creates, and how to handle it efficiently with Python.

Puppeteer vs. Playwright: Which Tool Is Better for Web Scraping?

Puppeteer vs. Playwright is a real architectural decision for any production scraping project. The two libraries share a common origin: Playwright was built at Microsoft by engineers who previously worked on Puppeteer at Google. Yet they're different on browser coverage, language bindings, and scraping ergonomics. Performance, stealth, proxy integration, and parallel execution decide which tool fits your pipeline.

© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved