Puppeteer Download File: A Complete Guide for Node.js Developers

Puppeteer makes browser automation feel easy until you need to save a file to disk. Triggering a download in headless mode isn't the same as clicking a button in a real browser, and the default behavior in headless Chrome won't help you. This guide covers the full Puppeteer download file workflow: configuring CDP correctly, picking the right method for your scenario, detecting when a file truly finished, and scaling to batch jobs without leaking memory or corrupting your queue.

TL;DR

  • Set a custom download directory using Browser.setDownloadBehavior over CDP. Without this step, headless Chromium will not save files
  • Pick your method based on what the target does. Use page.click() for real download buttons, in-browser fetch for blob or AJAX exports, CDP events for reliable completion tracking, and Node-side HTTP for direct asset URLs
  • Detect completion with Browser.downloadProgress events. Polling for .crdownload files or watching for filename changes does not work reliably across different operating systems
  • Use the Decodo Web Scraping API if downloads are blocked by Cloudflare, DataDome, geo-restrictions, or login flows that headless Chrome cannot handle by itself

Configuring Puppeteer for file downloads

Get this section right before anything else. A wrong download path or the wrong CDP scope fails silently: no error, no file, just an empty folder.

Setting up the project

Create a dedicated project directory before writing any code:

mkdir puppeteer-downloader
cd puppeteer-downloader

Initialize a Node.js project:

npm init -y

Install Puppeteer:

# installs Puppeteer and its own Chrome for Testing binary
npm install puppeteer
# or, if you want to use your own Chrome installation
npm install puppeteer-core

Create the main script file and downloads directory:

touch downloader.js
mkdir downloads

Your project should now look like this:

puppeteer-downloader/
├── node_modules/
├── downloads/
├── package.json
└── downloader.js

All code in this section goes inside downloader.js unless stated otherwise.

Puppeteer ships as two packages:

  • puppeteer. Automatically downloads Chrome for Testing. Recommended for local development and most environments.
  • puppeteer-core. Doesn't include Chrome; you must supply your own installation. Suitable for CI containers, Lambda layers, or when minimizing binary size is necessary.

If you're using puppeteer-core, point it at a valid Chrome installation explicitly:

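const puppeteer = require("puppeteer-core");
// await needs an async function in a CommonJS script
(async () => {
  const browser = await puppeteer.launch({
    // adjust to wherever Chrome lives on your machine
    executablePath: "/usr/bin/google-chrome-stable",
    headless: true,
  });
  await browser.close();
})();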

Check your Node version before going further. Puppeteer's current release requires Node 20+:

node --version
# Expected: v20.x.x or higher

Running Puppeteer in Docker or CI containers

If you're running this in Docker, Kubernetes, or a CI environment like GitHub Actions, add these flags to every puppeteer.launch() call in the guide:

const browser = await puppeteer.launch({
  headless: true,
  args: [
    "--no-sandbox",
    "--disable-dev-shm-usage",
  ],
});

  • --no-sandbox. Chrome's sandbox requires kernel-level privileges that most container environments don't grant by default. Without this flag, Chromium crashes immediately in unprivileged containers with a confusing SUID sandbox helper binary error.
  • --disable-dev-shm-usage. By default, Chrome uses /dev/shm (shared memory) for rendering. Docker containers cap /dev/shm at 64MB. Chrome silently exceeds that limit and crashes mid-scrape with no clear error. This flag tells Chrome to use /tmp instead, which has no such restriction.

If either flag is missing, headless Chrome can crash in containers, often only under load, which makes the failure hard to diagnose. Add both to every puppeteer.launch() call if your scraper runs anywhere outside a local machine.

Why downloads break in headless mode

By default, headless Chromium does not automatically allow browser-managed downloads. When a response includes a Content-Disposition: attachment header, headless Chrome does not display a message, save the file, or show an error. The download never starts.

To resolve this, use a Chrome DevTools Protocol (CDP) command to enable file downloads during the session.

Configuring downloads with CDP

CDP (Chrome DevTools Protocol) is the underlying protocol for Puppeteer. The Browser.setDownloadBehavior command specifies where Chromium saves files and whether downloads are permitted.

Add this to downloader.js:

const puppeteer = require("puppeteer");
const path = require("path");
const fs = require("fs");
const downloadDir = path.resolve(__dirname, "downloads");
// create the downloads directory if it doesn't exist
fs.mkdirSync(downloadDir, { recursive: true });
const LAUNCH_ARGS = {
  headless: true,
  args: ["--no-sandbox", "--disable-dev-shm-usage"],
};
async function setupBrowser() {
  const browser = await puppeteer.launch(LAUNCH_ARGS);
  const page = await browser.newPage();
  // the session is created from the page, but Browser.* commands
  // sent through it are handled at browser level
  const client = await page.createCDPSession();
  await client.send("Browser.setDownloadBehavior", {
    behavior: "allowAndName",
    downloadPath: downloadDir,
    eventsEnabled: true, // required for Browser.downloadProgress events
  });
  return { browser, page, client };
}

Download options:

  • allow. This option lets users download files and uses the name suggested by the server.
  • allowAndName. This option allows downloads but names the files using their CDP GUID instead of the suggested name. Use this option for batch jobs to avoid naming conflicts.
  • deny. This option blocks all downloads. Use it on pages where you want to prevent any downloads from happening.

Browser.* vs. Page.*:

Older Puppeteer guides reference Page.setDownloadBehavior, which is page-scoped and now deprecated in CDP. Use Browser.setDownloadBehavior for browser-wide, consistent behavior in recent Puppeteer versions.

Use absolute paths only

Chromium resolves relative paths against its own working directory, not your script's. A relative path such as ./downloads can therefore leave you with no saved files and no error. Always build the path with path.resolve().

// wrong -- silently fails
downloadPath: "./downloads"
// correct
downloadPath: path.resolve(__dirname, "downloads")
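A small guard can turn this silent failure into an explicit one. The sketch below is illustrative only: resolveDownloadDir is a hypothetical helper name, not a Puppeteer API. It resolves a directory against a base path and creates it if it is missing, so the value handed to downloadPath is always absolute and always exists.

```javascript
const path = require("path");
const fs = require("fs");

// Hypothetical helper: always hand CDP an absolute, existing directory.
function resolveDownloadDir(dir, baseDir = __dirname) {
  // path.resolve() returns `dir` itself (normalized) when it is already absolute
  const absolute = path.resolve(baseDir, dir);
  fs.mkdirSync(absolute, { recursive: true });
  return absolute;
}
```

Calling resolveDownloadDir("downloads") from downloader.js would give the same path as the "correct" line above, with the directory guaranteed to exist.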

Headless mode caveats

Puppeteer’s headless: true now selects the new headless mode, which runs the regular Chrome browser without a visible window. The legacy mode, a separate chrome-headless-shell binary, behaves differently when it comes to browser-managed downloads.

This guide assumes headless: true (new mode) throughout. If you encounter unexpected download behavior, try the legacy shell mode for comparison:

const browser = await puppeteer.launch({ headless: "shell" });

The legacy shell mode has broader compatibility with older CDP download behavior but is being phased out. Stick with headless: true for new projects.

Persistent profiles

When downloads require cookies or auth from a previous session (a dashboard export that needs you to be logged in, for example), launch with userDataDir to persist storage between runs:

const browser = await puppeteer.launch({
headless: true,
userDataDir: path.resolve(__dirname, "chrome-profile"),
});

Chromium writes cookies, localStorage, and session data to that directory. The next time you launch with the same path, the session is already active; there's no need to re-authenticate.

Create the profile directory upfront:

mkdir chrome-profile

Your project directory now looks like this:

puppeteer-downloader/
├── chrome-profile/
├── downloads/
├── node_modules/
├── package.json
└── downloader.js

Per-context configuration for parallel scrapes

When running parallel download jobs, set download behavior per BrowserContext rather than per page. This keeps each job's files isolated:

async function setupParallelContexts(browser) {
  const context1 = await browser.createBrowserContext();
  const context2 = await browser.createBrowserContext();
  const page1 = await context1.newPage();
  const page2 = await context2.newPage();
  // sessions are created from a page, but Browser.* commands are browser-level;
  // browserContextId tells Chromium which context each setting applies to
  // (per the CDP docs, the default context is used when it is omitted)
  const client1 = await page1.createCDPSession();
  const client2 = await page2.createCDPSession();
  await client1.send("Browser.setDownloadBehavior", {
    behavior: "allowAndName",
    browserContextId: context1.id,
    downloadPath: path.resolve(__dirname, "downloads/job-1"),
    eventsEnabled: true,
  });
  await client2.send("Browser.setDownloadBehavior", {
    behavior: "allowAndName",
    browserContextId: context2.id,
    downloadPath: path.resolve(__dirname, "downloads/job-2"),
    eventsEnabled: true,
  });
  return { page1, page2, client1, client2 };
}

Incorrectly scoping download behavior is a common reason why files from parallel jobs are saved to the same folder or are not saved at all.

Environment prerequisites checklist

Before downloading, review the following checklist:

  • Ensure you have Node version 20 or higher by running node --version.
  • Verify write permissions for the download directory. Use ls -la in the parent folder to check permissions.
  • Confirm sufficient disk space for the largest expected file, plus additional space for temporary files ending with .crdownload.

Check your file descriptor limit. On Linux, downloading multiple files simultaneously can exhaust file descriptors. Use ulimit -n to view and increase the limit if necessary:

ulimit -n 4096
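The first two checklist items can be automated at the top of a script. This is a minimal sketch under stated assumptions: parseNodeMajor and preflight are hypothetical helper names, and the Node 20 threshold is simply the one this guide uses.

```javascript
const fs = require("fs");
const os = require("os");

// Hypothetical helper: "v20.11.1" -> 20
function parseNodeMajor(version) {
  return Number(version.replace(/^v/, "").split(".")[0]);
}

// Returns a list of human-readable problems; an empty list means "good to go".
function preflight(downloadDir, minNodeMajor = 20) {
  const problems = [];
  if (parseNodeMajor(process.version) < minNodeMajor) {
    problems.push(`Node ${process.version} is older than v${minNodeMajor}`);
  }
  try {
    // throws if the directory is missing or not writable by this process
    fs.accessSync(downloadDir, fs.constants.W_OK);
  } catch (err) {
    problems.push(`Cannot write to ${downloadDir}: ${err.code}`);
  }
  return problems;
}

console.log(preflight(os.tmpdir()));
```

Disk space and file descriptor limits are left out here because they depend on the host and on the size of the files you expect.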

If you encounter environment issues during setup, refer to the JavaScript heap out of memory guide for common Node diagnostics. If you're still evaluating browser automation tools, consider the trade-offs between Playwright and Selenium.

Methods for downloading files with Puppeteer

There's no one-size-fits-all way to download files with Puppeteer. The best method depends on how the target site handles downloads. Here are four methods, ranked by how well they work in real situations.

Method 1: Triggering a real browser download via page.click()

Use this method when a regular download button or link triggers a file download with a Content-Disposition: attachment header, and you don't need to check the file's contents first. This is the simplest case: clicking the button makes the server send the file, and the browser saves it. Puppeteer handles this for you. Chromium puts the file right in the download path you set with setDownloadBehavior. Your Node script doesn't handle the file data; it just checks when the file is done downloading.

Replace the contents of downloader.js with the following, so the require calls and constants are declared only once:

const puppeteer = require("puppeteer");
const path = require("path");
const fs = require("fs");
const downloadDir = path.resolve(__dirname, "downloads");
const LAUNCH_ARGS = { headless: true, args: [ "--no-sandbox", "--disable-dev-shm-usage", ], };
async function downloadViaClick(url, selector) {
const browser = await puppeteer.launch(LAUNCH_ARGS);
const page = await browser.newPage();
const client = await page.createCDPSession();
await client.send("Browser.setDownloadBehavior", {
behavior: "allowAndName",
downloadPath: downloadDir,
eventsEnabled: true,
});
await page.goto(url, { waitUntil: "networkidle2" });
// wait for the download button to appear
await page.waitForSelector(selector, { timeout: 10000 });
// set up CDP completion listener before clicking
const downloadComplete = new Promise((resolve, reject) => {
const timeout = setTimeout(
() => reject(new Error("Download timed out")),
60000
);
client.on("Browser.downloadProgress", (event) => {
if (event.state === "completed") {
clearTimeout(timeout);
resolve(event.guid);
} else if (event.state === "canceled") {
clearTimeout(timeout);
reject(new Error("Download was canceled"));
}
});
});
// trigger the download
await page.click(selector);
// wait for CDP to confirm completion
const guid = await downloadComplete;
console.log(`Download completed. GUID: ${guid}`);
await browser.close();
return guid;
}
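// run it -- confirm the selector against the live page before relying on it
downloadViaClick(
  "https://data.worldbank.org/indicator/NY.GDP.MKTP.CD",
  "[data-testid='download-button']"
).catch((err) => {
  console.error(err.message);
  process.exit(1);
});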

Run it:

node downloader.js

Here are some common reasons downloads might fail:

  • Same-tab navigation instead of a download: the server returned the wrong header. The file URL opened in the browser instead of triggering a save. Check the response headers with DevTools. Content-Disposition: attachment must be present.
  • Button opens a new tab: the download starts in a different page than the one you clicked. Attach a listener for new pages before clicking:
const newPagePromise = new Promise((resolve) =>
browser.once("targetcreated", (target) => resolve(target.page()))
);
await page.click(selector);
const newPage = await newPagePromise;
  • Link generated by JavaScript on hover: the href does not exist until the element is hovered. Trigger hover first:
await page.hover(selector);
await page.waitForSelector(selector + "[href]", { timeout: 5000 });
await page.click(selector);
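For the first failure mode, the header check can be done in code rather than in DevTools. The helper below is a hypothetical sketch (isAttachment is not a Puppeteer function) that only inspects the disposition type and ignores the filename parameters. In a Puppeteer script, the value could come from response.headers()["content-disposition"] on a response you captured.

```javascript
// Hypothetical helper: does a Content-Disposition value request a download?
function isAttachment(contentDisposition) {
  if (!contentDisposition) return false;
  // the disposition type is the token before the first ";"
  const type = contentDisposition.split(";")[0].trim().toLowerCase();
  return type === "attachment";
}

console.log(isAttachment('attachment; filename="report.csv"')); // true
console.log(isAttachment("inline")); // false
```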

Method 2: Capturing in-page blobs with the browser fetch API

Use this method when the download is an XHR or fetch call that returns a Blob or ArrayBuffer, which the page converts into a download URL. This is common on dashboards with “Export to CSV” buttons that generate the file in the browser. Keep file size under ~50 MB; larger files are slow to transfer as base64 over the CDP bridge.

Under the hood, the page calls URL.createObjectURL() to create a temporary blob: URL, then programmatically clicks a hidden anchor pointing to it. That blob: URL doesn’t exist outside the page context, and a plain Node fetch can’t reach it.

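The fix is to run the fetch inside page.evaluate() so the request comes from the page's origin and carries the live session's cookies. Custom headers that the page adds itself, such as a CSRF token or an Authorization header, are not attached automatically; pass them in the fetch options if the endpoint requires them: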

// reuses puppeteer, path, fs, downloadDir, and LAUNCH_ARGS from the setup above
async function downloadBlob(url, fetchUrl) {
  const browser = await puppeteer.launch(LAUNCH_ARGS);
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    // run fetch inside the browser context to inherit the session
    const base64Data = await page.evaluate(async (targetUrl) => {
      const response = await fetch(targetUrl, {
        credentials: "include", // sends cookies automatically
      });
      // fail loudly instead of saving an error page as the export
      if (!response.ok) throw new Error(`Fetch failed: ${response.status}`);
      const buffer = await response.arrayBuffer();
      const bytes = new Uint8Array(buffer);
      let binary = "";
      for (let i = 0; i < bytes.byteLength; i++) {
        binary += String.fromCharCode(bytes[i]);
      }
      return btoa(binary); // convert to base64 to ferry across CDP bridge
    }, fetchUrl);
    // convert base64 back to binary and write to disk
    const buffer = Buffer.from(base64Data, "base64");
    const outputPath = path.resolve(downloadDir, "export.csv");
    fs.writeFileSync(outputPath, buffer);
    console.log(`Saved to: ${outputPath}`);
    return outputPath;
  } finally {
    // close the browser even when the fetch or the write fails
    await browser.close();
  }
}

This works better than a plain Node fetch for two reasons. The blob: URL the page creates with URL.createObjectURL() exists only inside that browser tab, so Node can't reach it. And the underlying export endpoint usually expects the logged-in session, which an in-page fetch already has.
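The base64 step is also where the size advice comes from. The sketch below runs the same encoding loop in Node (btoa is available as a global there too) to show that the round trip is lossless and how much larger the encoded payload gets. bytesToBase64 and base64Length are hypothetical helper names used for illustration.

```javascript
// Same encoding steps as the page.evaluate() callback, run here in Node.
function bytesToBase64(bytes) {
  let binary = "";
  for (let i = 0; i < bytes.byteLength; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// base64 emits 4 characters for every 3 input bytes, rounded up
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

const original = Uint8Array.from([0, 255, 128, 10, 13, 65]);
const encoded = bytesToBase64(original);
const decoded = Buffer.from(encoded, "base64");

console.log(decoded.equals(Buffer.from(original))); // true
console.log(base64Length(50 * 1024 * 1024)); // 69905068
```

A 50 MB file therefore crosses the bridge as roughly 70 million characters, about a third more than the raw bytes.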

Method 3: CDP-driven downloads with Browser.downloadProgress

Use this method when production scrapers require precise timing and cannot tolerate filesystem polling races.

This is the most reliable way to know when a download is done. Instead of watching for .crdownload files to disappear, you can listen to CDP events and track Chromium's download progress in real time.

// reuses puppeteer, path, fs, downloadDir, and LAUNCH_ARGS from the setup above
async function downloadWithCDPEvents(url, selector, timeoutMs = 60000) {
  const browser = await puppeteer.launch(LAUNCH_ARGS);
  let timer;
  try {
    const page = await browser.newPage();
    const client = await page.createCDPSession();
    await client.send("Browser.setDownloadBehavior", {
      behavior: "allowAndName",
      downloadPath: downloadDir,
      eventsEnabled: true,
    });
    const downloads = new Map();
    client.on("Browser.downloadWillBegin", (event) => {
      downloads.set(event.guid, {
        guid: event.guid,
        suggestedFilename: event.suggestedFilename,
        state: "pending",
      });
      console.log(`Download started: ${event.suggestedFilename} (${event.guid})`);
    });
    // register the progress listener before clicking so no event is missed
    const finished = new Promise((resolve, reject) => {
      timer = setTimeout(() => reject(new Error("Download timed out")), timeoutMs);
      client.on("Browser.downloadProgress", (event) => {
        const download = downloads.get(event.guid);
        if (!download) return;
        if (event.state === "completed") {
          const guidPath = path.resolve(downloadDir, event.guid);
          // basename() guards against path separators in the suggested name
          const finalPath = path.resolve(downloadDir, path.basename(download.suggestedFilename));
          fs.renameSync(guidPath, finalPath);
          downloads.set(event.guid, { ...download, state: "completed", finalPath });
        } else if (event.state === "canceled") {
          downloads.set(event.guid, { ...download, state: "canceled" });
        }
        // check after every state change -- no polling needed
        const allDone = [...downloads.values()].every(
          (d) => d.state === "completed" || d.state === "canceled"
        );
        if (allDone) resolve([...downloads.values()]);
      });
    });
    await page.goto(url, { waitUntil: "networkidle2" });
    await page.waitForSelector(selector);
    await page.click(selector);
    return await finished;
  } finally {
    // runs on success, timeout, and navigation errors alike
    clearTimeout(timer);
    await browser.close();
  }
}

Why CDP events are better than checking for .crdownload files:

  • Chrome on macOS, Linux, and Windows handles temp file naming differently
  • Antivirus software on Windows can briefly lock the final file after Chrome writes it
  • .crdownload disappearing doesn’t mean the file is fully flushed to disk

CDP events avoid all these problems. When the state is "completed," Chromium has finished writing the file.

With behavior: "allowAndName", Chromium saves files using their CDP GUID as the filename instead of the suggested name. This stops naming conflicts in batch jobs where two downloads might both be called report.csv. After the download is complete, rename the GUID file to the suggested filename.

Method 4: Bypassing the browser with Node-side HTTP for direct asset URLs

Use this method when the file URL is stable and not restricted, such as CDN-hosted files, public S3 objects, or government open-data CSVs whose URLs you can extract from the page.

This method uses Puppeteer only to extract the final href, then passes it to a Node HTTP client. The browser does not download the file.

const https = require("https");
// reuses puppeteer, path, fs, downloadDir, and LAUNCH_ARGS from the setup above
async function downloadViaNodeHTTP(pageUrl, linkSelector) {
  // step 1 -- use Puppeteer to extract the file URL
  const browser = await puppeteer.launch(LAUNCH_ARGS);
  const page = await browser.newPage();
  await page.goto(pageUrl, { waitUntil: "networkidle2" });
  await page.waitForSelector(linkSelector);
  const fileUrl = await page.$eval(linkSelector, (el) => el.href);
  // forward cookies from Puppeteer to the Node HTTP request
  const cookies = await page.cookies();
  const cookieHeader = cookies
    .map((c) => `${c.name}=${c.value}`)
    .join("; ");
  await browser.close();
  // step 2 -- stream the file directly to disk (https: URLs only)
  const filename = path.basename(new URL(fileUrl).pathname);
  const outputPath = path.resolve(downloadDir, filename);
  return new Promise((resolve, reject) => {
    https.get(
      fileUrl,
      {
        headers: {
          Cookie: cookieHeader,
          "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        },
      },
      (response) => {
        // redirects are not followed here; anything but 200 is treated as a failure
        if (response.statusCode !== 200) {
          response.resume(); // discard the body so the socket is released
          reject(new Error(`Request failed with status ${response.statusCode}`));
          return;
        }
        // open the file only after a good response, so failures leave no empty file
        const file = fs.createWriteStream(outputPath);
        response.pipe(file);
        file.on("finish", () => {
          console.log(`Saved to: ${outputPath}`);
          resolve(outputPath);
        });
        file.on("error", (err) => fs.unlink(outputPath, () => reject(err)));
      }
    ).on("error", reject);
  });
}

This method is faster because it skips the browser's rendering and download steps, streams the response directly to disk, and lets you control how many files download at once. For big files or lots of downloads, this makes a big difference.

Read the cookies before closing the browser. If the asset host checks your session, a Node HTTP request without the right cookies will get a 403 error, and page.cookies() is no longer available once the browser is gone.
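Sending every cookie to every host is more than the asset host needs. As a sketch, assuming cookie objects with the name, value, and domain fields that Puppeteer returns, a hypothetical helper such as cookieHeaderFor can keep only the cookies whose domain matches the file URL. It is a simplification: it ignores cookie paths, the Secure flag, and the distinction between host-only and domain cookies.

```javascript
// Hypothetical helper: keep only cookies whose domain matches the asset host.
function cookieHeaderFor(cookies, fileUrl) {
  const host = new URL(fileUrl).hostname;
  return cookies
    .filter((c) => {
      const domain = c.domain.replace(/^\./, ""); // ".example.com" -> "example.com"
      return host === domain || host.endsWith("." + domain);
    })
    .map((c) => `${c.name}=${c.value}`)
    .join("; ");
}

const sampleCookies = [
  { name: "sid", value: "abc", domain: ".example.com" },
  { name: "t", value: "1", domain: "tracker.net" },
];
console.log(cookieHeaderFor(sampleCookies, "https://cdn.example.com/f.csv")); // sid=abc
```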

When this breaks:

  • The URL is signed and bound to the browser’s TLS fingerprint
  • The asset host inspects JA3/JA4 TLS fingerprints
  • The file is regenerated per-request and only valid inside a click handler

In any of these cases, fall back to Method 3 or use a managed API.

For readers coming from a shell scripting background, how to download files with cURL covers the equivalent patterns. And for Node-side HTTP with proxy integration, proxy integration with Axios goes deeper on the request configuration side.

| Scenario | Method |
| --- | --- |
| Button with Content-Disposition: attachment | Method 1: page.click() |
| "Export to CSV" that builds the file in the browser | Method 2: in-page fetch |
| Production scraper needing reliable completion detection | Method 3: CDP events |
| Stable, public file URL scraped from the page | Method 4: Node HTTP |
| Any of the above behind Cloudflare or anti-bot protection | Decodo Web Scraping API |

Handling download completion and file management

Starting a download with Puppeteer is easy. Making sure the file is fully and correctly downloaded, without naming conflicts, is where most people run into problems. This section explains reliable patterns and common mistakes.

The unreliable patterns

These methods are common in Stack Overflow answers and older tutorials. They may appear effective, but are unreliable:

  • setTimeout. Waiting a fixed number of seconds after a click assumes the download will always complete within that period. Slow connections or large files may exceed this window, while fast connections result in unnecessary delays:
// unreliable -- arbitrary wait with no completion signal
await page.click(selector);
await new Promise((resolve) => setTimeout(resolve, 5000));
  • fs.watch on a single file. Monitoring for a specific filename can fail if Chrome has not finalized the filename, if the file remains a .crdownload temporary file, or if two parallel downloads suggest the same filename:
// unreliable -- races with Chrome's temp file lifecycle
fs.watch(downloadDir, (event, filename) => {
if (filename && !filename.endsWith(".crdownload")) {
console.log("Download complete:", filename);
}
});
  • The first new file in the folder wins. In batch jobs with multiple concurrent downloads, the first new file detected may not correspond to the intended download. This can cause silent data mismatches that are difficult to diagnose:
// unreliable -- which download does this belong to?
const before = new Set(fs.readdirSync(downloadDir));
await page.click(selector);
await new Promise((resolve) => setTimeout(resolve, 3000));
const after = new Set(fs.readdirSync(downloadDir));
const newFile = [...after].find((f) => !before.has(f));

The reliable pattern: CDP downloadProgress events

The recommended approach is to subscribe to Browser.downloadProgress events and resolve a Promise when state equals "completed". Set a strict timeout for each file to ensure stalled downloads fail the job rather than causing indefinite hangs:

function waitForDownload(client, timeoutMs = 60000) {
return new Promise((resolve, reject) => {
const timeout = setTimeout(
() => reject(new Error(`Download timed out after ${timeoutMs}ms`)),
timeoutMs
);
const downloads = new Map();
client.on("Browser.downloadWillBegin", (event) => {
downloads.set(event.guid, {
guid: event.guid,
suggestedFilename: event.suggestedFilename,
receivedBytes: 0,
lastActivity: Date.now(),
});
});
client.on("Browser.downloadProgress", (event) => {
const download = downloads.get(event.guid);
if (!download) return;
// update heartbeat for stall detection
downloads.set(event.guid, {
...download,
receivedBytes: event.receivedBytes,
lastActivity: Date.now(),
});
if (event.state === "completed") {
clearTimeout(timeout);
resolve({
guid: event.guid,
suggestedFilename: download.suggestedFilename,
totalBytes: event.totalBytes,
});
} else if (event.state === "canceled") {
clearTimeout(timeout);
reject(new Error(`Download canceled: ${download.suggestedFilename}`));
}
});
});
}

Use this function with any of the download methods described in the previous section:

const downloadPromise = waitForDownload(client);
await page.click(selector);
const result = await downloadPromise;
console.log(`Completed: ${result.suggestedFilename} (${result.totalBytes} bytes)`);

Cross-platform temp files

Chrome names files that are still downloading with the extension .crdownload. Firefox uses .part, and Safari uses .download. If you switch to Playwright or use different browsers, the way you check for these temporary files will change. CDP events remove the need for temp-file checks, but they are specific to Chromium-based browsers: a Browser.downloadProgress listener does nothing in Firefox or Safari, so rely on your automation library's own download API there.

Handling filename collisions

With behavior: "allow", Chrome saves files using the server’s suggested filename. When two parallel downloads both produce report.csv, Chrome silently renames one to report (1).csv. You won’t know which is which.

With behavior: "allowAndName", Chrome names files by their CDP GUID, guaranteed unique. Rename them yourself after the completed event fires:

function renameDownload(guid, suggestedFilename, downloadDir) {
const guidPath = path.resolve(downloadDir, guid);
const timestamp = Date.now();
const ext = path.extname(suggestedFilename);
const base = path.basename(suggestedFilename, ext);
// add timestamp to prevent collisions on repeated runs
const finalName = `${base}_${timestamp}${ext}`;
const finalPath = path.resolve(downloadDir, finalName);
fs.renameSync(guidPath, finalPath);
console.log(`Renamed: ${guid} → ${finalName}`);
return finalPath;
}

You can also prefix with a jobId so files from parallel runs never collide.
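A sketch of that naming scheme, kept as a pure function so it is easy to test. buildFinalName and the "jobId_base_timestamp.ext" pattern are illustrative choices, not something Puppeteer or CDP prescribes.

```javascript
const path = require("path");

// Hypothetical naming helper: "<jobId>_<base>_<timestamp><ext>"
function buildFinalName(jobId, suggestedFilename, timestamp = Date.now()) {
  // basename() drops any directory parts a server might put into the name
  const safeName = path.basename(suggestedFilename);
  const ext = path.extname(safeName);
  const base = path.basename(safeName, ext);
  return `${jobId}_${base}_${timestamp}${ext}`;
}

console.log(buildFinalName("job-42", "report.csv", 1700000000000));
// job-42_report_1700000000000.csv
```

The result can be passed to fs.renameSync in place of the timestamp-only name built in renameDownload.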

Integrity verification

A completed event means Chrome finished writing the file; it doesn’t mean the file is correct. Truncated downloads, partial transfers, and silent server-side errors all produce files that pass completion detection but contain garbage data.

Stream-hash each file after completion. Compare the size against the Content-Length header, and compare the hash against a published checksum, or against the ETag on servers that use a content hash for it:

const crypto = require("crypto");
function hashFile(filePath) {
return new Promise((resolve, reject) => {
const hash = crypto.createHash("sha256");
const stream = fs.createReadStream(filePath);
stream.on("data", (chunk) => hash.update(chunk));
stream.on("end", () => resolve(hash.digest("hex")));
stream.on("error", reject);
});
}

A size check alongside the hash catches truncated downloads that completion detection missed. A server that closes the connection early won't always trigger canceled.

async function verifyDownload(filePath, expectedBytes) {
const stats = fs.statSync(filePath);
if (stats.size !== expectedBytes) {
throw new Error(
`Size mismatch: expected ${expectedBytes} bytes, got ${stats.size}`
);
}
const hash = await hashFile(filePath);
console.log(`SHA-256: ${hash}`);
return hash;
}

Folder hygiene

Each download job generates files that accumulate fast without cleanup.

Assign a unique download directory for each job ID to ensure unambiguous completion detection and prevent restarts from overwriting previous output:

function createJobDirectory(jobId) {
const jobDir = path.resolve(__dirname, "downloads", jobId);
fs.mkdirSync(jobDir, { recursive: true });
return jobDir;
}
// clean up on success
function cleanupJob(jobDir, keep = true) {
if (!keep) {
fs.rmSync(jobDir, { recursive: true, force: true });
console.log(`Cleaned up: ${jobDir}`);
}
}

Without this, long-running scrapers accumulate gigabytes of files with no clear ownership.

Use a timestamp or UUID to generate the job ID:

const { randomUUID } = require("crypto");
const jobDir = createJobDirectory(randomUUID());

Detecting and recovering from stalls

A stalled download differs from a failed one. Chrome may report the download as in progress, but no data is being received. Use the receivedBytes delta from Browser.downloadProgress as a heartbeat indicator:

// resolves on completion; rejects on cancel, stall, or hard timeout
function waitForDownloadWithStallDetection(client, timeoutMs = 30000, stallMs = 10000) {
  return new Promise((resolve, reject) => {
    let lastReceivedBytes = -1;
    let stallTimer = null;
    // stop both timers and detach the listener once the promise settles
    const cleanup = () => {
      clearTimeout(hardTimeout);
      clearTimeout(stallTimer);
      client.off("Browser.downloadProgress", onProgress);
    };
    const armStallTimer = () => {
      clearTimeout(stallTimer);
      stallTimer = setTimeout(() => {
        cleanup();
        reject(new Error("Download stalled -- no bytes received"));
      }, stallMs);
    };
    const hardTimeout = setTimeout(() => {
      cleanup();
      reject(new Error("Download timed out"));
    }, timeoutMs);
    const onProgress = (event) => {
      if (event.state === "inProgress") {
        // only new bytes count as a heartbeat
        if (event.receivedBytes > lastReceivedBytes) {
          lastReceivedBytes = event.receivedBytes;
          armStallTimer();
        }
      } else if (event.state === "completed") {
        cleanup();
        // downloadProgress carries no filename; look it up by guid
        // from the matching Browser.downloadWillBegin event
        resolve({ guid: event.guid, totalBytes: event.totalBytes });
      } else if (event.state === "canceled") {
        cleanup();
        reject(new Error("Download was canceled"));
      }
    };
    client.on("Browser.downloadProgress", onProgress);
  });
}

Server-side stalls occur when no data is received from the remote host, often due to rate limits or dropped connections. Retry these downloads using exponential backoff.
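Exponential backoff here means doubling the wait after each failed attempt. The wrapper below is a minimal sketch: retryWithBackoff and its retries and baseMs options are hypothetical names, and the task is any async function that rejects when a download stalls or fails.

```javascript
// Hypothetical retry wrapper: the delay doubles after each failed attempt.
async function retryWithBackoff(task, { retries = 3, baseMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === retries) break; // out of attempts
      const delayMs = baseMs * 2 ** attempt; // 1s, 2s, 4s with the defaults
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

A call might look like retryWithBackoff(() => downloadViaClick(url, selector)), which makes up to four attempts before giving up with the last error.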

Chromium-side stalls occur when the browser process becomes unresponsive. Restart the browser and retry the download.

Be aware that using setTimeout to detect stalls assumes that the Node event loop is not blocked. If you are processing large CSV files with heavy tasks, make sure to do it after the download finishes, not during the download. Mixing these blocking tasks with active downloads can delay stall timer callbacks and lead to false positive alerts.

Recovering from crashes

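Store the GUID, URL, and retry count in a small SQLite or Redis table as soon as each download starts. The GUID first appears in the Browser.downloadWillBegin event, so that handler is the natural place to write the row. If the worker dies mid-transfer, the next run knows exactly which files to retry: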

# requires node-gyp — on Alpine: apk add python3 make g++
npm install better-sqlite3

The schema below tracks each download by its CDP GUID. When Browser.downloadWillBegin fires, insert a pending row. On completion, mark it done. On the next run, getFailedDownloads() returns everything that did not finish and still has retries left. This way, you resume exactly where you left off without re-downloading files that have already completed.

// adapt to your preferred store
const Database = require("better-sqlite3");
const db = new Database("downloads.db");
db.exec(`
CREATE TABLE IF NOT EXISTS downloads (
guid TEXT PRIMARY KEY,
url TEXT NOT NULL,
filename TEXT,
status TEXT DEFAULT 'pending',
retry_count INTEGER DEFAULT 0,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
completed_at DATETIME
)
`);
function trackDownload(guid, url, filename) {
db.prepare(`
INSERT OR REPLACE INTO downloads (guid, url, filename, status)
VALUES (?, ?, ?, 'pending')
`).run(guid, url, filename);
}
function markCompleted(guid) {
db.prepare(`
UPDATE downloads
SET status = 'completed', completed_at = CURRENT_TIMESTAMP
WHERE guid = ?
`).run(guid);
}
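// call when an attempt fails, so retry_count can eventually reach maxRetries
function recordRetry(guid) {
  db.prepare(`
    UPDATE downloads
    SET status = 'failed', retry_count = retry_count + 1
    WHERE guid = ?
  `).run(guid);
}
// returns pending and failed rows that still have retries left
function getFailedDownloads(maxRetries = 3) {
  return db.prepare(`
    SELECT * FROM downloads
    WHERE status != 'completed'
    AND retry_count < ?
  `).all(maxRetries);
}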

If you encounter download failures due to network or rate-limit issues at the proxy level, refer to our guide to proxy error codes for common solutions.

Batch and automated file downloads with Puppeteer

Production scrapers rarely download a single file; they pull hundreds or thousands of files from sources that update regularly. This section explains the full batch pipeline, including link discovery, queue design, concurrency, and resumability.

Step 1: Discovering download links

Wait until you have the complete list of URLs before opening any download tabs. First, collect all download links from the listing page, then use that list as your queue and work through it with bounded concurrency.

Create a new file for the batch scraper:

touch batch-downloader.js

Add the code below:

const path = require("path");
const fs = require("fs");
async function discoverDownloadLinks(browser, listingUrl, linkSelector) {
const page = await browser.newPage();
await page.goto(listingUrl, { waitUntil: "networkidle2" });
await page.waitForSelector(linkSelector, { timeout: 10000 });
// harvest all download hrefs from the listing page
const links = await page.$$eval(linkSelector, (els) =>
els
.map((el) => ({
url: el.href,
filename: el.getAttribute("download") ||
el.href.split("/").pop().split("?")[0],
text: el.textContent.trim(),
}))
.filter((link) => link.url && link.url.startsWith("http"))
);
await page.close();
console.log(`Discovered ${links.length} download links`);
return links;
}

path.basename runs in Node, not in the browser. page.$$eval executes its callback in the browser context where Node modules don't exist. Plain string operations like .split("/").pop() work in both environments.
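Once the links are back in Node, you can afford a stricter cleanup than the in-browser string split. The sketch below assumes a hypothetical helper named safeFilename (it is not part of Puppeteer) and a fallback name of download.bin; adjust both to your naming rules.

```javascript
const path = require("path");

// Hypothetical helper: derive a filesystem-safe name from a download URL.
function safeFilename(rawUrl, fallback = "download.bin") {
  let name = "";
  try {
    // URL parsing drops the query string; basename keeps the last path segment
    name = decodeURIComponent(path.posix.basename(new URL(rawUrl).pathname));
  } catch {
    // invalid URL or malformed percent-encoding: fall through to the fallback
    name = "";
  }
  // replace path separators, control characters, and characters Windows rejects
  name = name.replace(/[\\\/:*?"<>|\x00-\x1f]/g, "_").trim();
  return name && name !== "." && name !== ".." ? name : fallback;
}
```

Running the harvested URLs through a helper like this before they reach the queue keeps an encoded "../" in a link from ever becoming part of a path on disk.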

If the listing is paginated, run the same extraction on every page and merge the results before building your queue. The per-page extraction is the same $$eval call:

const links = await page.$$eval(linkSelector, (els) =>
els.map((el) => ({
url: el.href,
filename: el.href.split("/").pop().split("?")[0],
}))
);
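One way to walk the pages is sketched below. It assumes the listing exposes a "next" control matched by a selector you supply (nextSelector is a placeholder) and that clicking it triggers a full navigation; a single-page app that swaps content in place would need a different wait condition. The helper name collectAllLinks and the maxPages safety cap are illustrative.

```javascript
// Hypothetical helper: follow "next" links and merge the links from every page.
async function collectAllLinks(page, linkSelector, nextSelector, maxPages = 50) {
  const byUrl = new Map(); // de-duplicates links that repeat across pages
  for (let pageNo = 1; pageNo <= maxPages; pageNo++) {
    const links = await page.$$eval(linkSelector, (els) =>
      els.map((el) => ({
        url: el.href,
        filename: el.href.split("/").pop().split("?")[0],
      }))
    );
    for (const link of links) byUrl.set(link.url, link);

    const next = await page.$(nextSelector);
    if (!next) break; // no "next" control: this was the last page

    // start waiting before the click so a fast navigation isn't missed
    await Promise.all([
      page.waitForNavigation({ waitUntil: "networkidle2" }),
      next.click(),
    ]);
  }
  return [...byUrl.values()];
}
```

The maxPages cap is there so a listing whose "next" button never disappears cannot loop forever.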

Make sure to gather the entire list of URLs before starting any downloads. This keeps your queue organized and makes it easier to restart if needed.

Step 2: Queue design with bounded concurrency

Avoid opening all download tabs at the same time. Use a bounded concurrency queue to limit parallel downloads to what your proxy pool and the target site's rate limits can handle. For most sites, 3-5 parallel downloads work well. If you need 20+, you'll need to rotate IPs.
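To make "bounded concurrency" concrete, here is a dependency-free sketch of what a limiter does internally. createLimiter is a hypothetical name used for illustration only; the rest of this guide relies on p-limit, which is the tested option.

```javascript
// Illustrative limiter: at most `concurrency` tasks run at the same time.
function createLimiter(concurrency) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= concurrency || waiting.length === 0) return;
    active++;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task) // also turns a synchronous throw into a rejection
      .then(resolve, reject)
      .finally(() => {
        active--;
        next(); // a slot opened up: start the next queued task
      });
  };

  // returns a function that queues a task and resolves with its result
  return (task) =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}
```

Each call to the returned function queues a task; tasks beyond the limit wait until a running one settles, whether it succeeded or failed.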

Install p-limit for concurrency control:

# p-limit v4+ is ESM-only and breaks with require()
# install v3 if your project uses CommonJS
npm install p-limit@3

This guide uses CommonJS (require()). If your project uses ES modules (import), you can install the latest version with npm install p-limit and replace require("p-limit") with import pLimit from "p-limit" at the top of your file.

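const crypto = require("crypto"); // crypto.randomUUID() for job IDs
const pLimit = require("p-limit");
// hashFile() and downloadSingleFile() are defined in the full batch script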
async function downloadQueue(browser, links, options = {}) {
const {
concurrency = 3,
downloadDir = path.resolve(__dirname, "downloads"),
timeoutMs = 60000,
} = options;
const limit = pLimit(concurrency);
const manifest = [];
const tasks = links.map((link) =>
limit(async () => {
const jobId = crypto.randomUUID();
const jobDir = path.resolve(downloadDir, jobId);
fs.mkdirSync(jobDir, { recursive: true });
const result = {
url: link.url,
filename: link.filename,
jobId,
status: "pending",
filePath: null,
sha256: null,
startedAt: new Date().toISOString(),
completedAt: null,
};
try {
const filePath = await downloadSingleFile(
browser,
link.url,
jobDir,
timeoutMs
);
result.status = "completed";
result.filePath = filePath;
result.sha256 = await hashFile(filePath);
result.completedAt = new Date().toISOString();
console.log(`✓ ${link.filename}`);
} catch (err) {
result.status = "failed";
result.error = err.message;
console.error(`✗ ${link.filename}: ${err.message}`);
}
manifest.push(result);
return result;
})
);
await Promise.all(tasks);
return manifest;
}

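Promise.all is safe here only because every task catches its own errors and returns a "failed" result instead of rejecting. Without that try/catch, the first rejected download would make Promise.all reject immediately and you would lose the results of the rest of the batch. If you prefer to let tasks throw, use Promise.allSettled instead.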
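What keeps one failed download from sinking the batch is the try/catch inside each task, not Promise.all itself. A minimal sketch of the pattern, using made-up job objects with a name and a run function:

```javascript
// Each task converts its own failure into data, so Promise.all never sees a rejection.
async function runBatch(jobs) {
  const tasks = jobs.map(async (job) => {
    try {
      return { name: job.name, status: "completed", value: await job.run() };
    } catch (err) {
      return { name: job.name, status: "failed", error: err.message };
    }
  });
  return Promise.all(tasks);
}
```

The result array always has one entry per job, in input order, which is what makes it usable as a manifest.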

Step 3: Browser and context reuse

Starting a new Chromium instance for each file uses a lot of resources. Instead, launch one Browser instance and create a separate BrowserContext for each worker. This approach is more efficient and keeps cookies separate, so parallel logged-in sessions won't interfere with each other.

async function downloadSingleFile(browser, url, jobDir, timeoutMs) {
// create an isolated context per download
const context = await browser.createBrowserContext();
const page = await context.newPage();
const client = await page.createCDPSession();
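await client.send("Browser.setDownloadBehavior", {
behavior: "allowAndName",
// scope the behavior to this context; when browserContextId is omitted,
// CDP applies it to the default browser context instead
browserContextId: context.id,
downloadPath: jobDir,
eventsEnabled: true,
});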
try {
const downloadPromise = waitForDownload(client, timeoutMs);
await page.goto(url, { waitUntil: "networkidle2" });
await page.waitForSelector("[data-testid='download-btn']", {
timeout: 10000,
});
await page.click("[data-testid='download-btn']");
const result = await downloadPromise;
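// rename from GUID to the suggested filename; basename() strips any
// directory parts, and the GUID is the fallback if no name was reported
const guidPath = path.resolve(jobDir, result.guid);
const finalName = path.basename(result.suggestedFilename || result.guid);
const finalPath = path.resolve(jobDir, finalName);
fs.renameSync(guidPath, finalPath);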
return finalPath;
} finally {
// always close the context -- never leak browser resources
await context.close();
}
}

Close each context right after its download finishes. If you leave them open, memory usage will increase quickly. Give each download its own subfolder, named by URL hash or job ID. This way, it's clear when a job is done: one folder, one file, one job. Restarts won't overwrite previous results.
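If you name folders by URL hash rather than by a random job ID, a restarted job lands in the same folder as the previous attempt. A minimal sketch, assuming a hypothetical helper named jobDirForUrl and a 16-character hash prefix (long enough that collisions are very unlikely at batch sizes of thousands of URLs):

```javascript
const crypto = require("crypto");
const path = require("path");

// Hypothetical helper: the same URL always maps to the same job folder.
function jobDirForUrl(baseDir, url) {
  const digest = crypto.createHash("sha256").update(url).digest("hex");
  return path.join(baseDir, digest.slice(0, 16));
}
```

With random UUIDs, as in downloadQueue above, every retry after a restart creates a fresh folder instead; that is simpler but leaves orphaned folders to clean up.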

Step 4: Retry policy

Not every failure is permanent. Add a retry layer that uses exponential backoff:

async function downloadWithRetry(browser, link, jobDir, options = {}) {
const { maxRetries = 3, baseDelayMs = 2000 } = options;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await downloadSingleFile(browser, link.url, jobDir, 60000);
} catch (err) {
const isLastAttempt = attempt === maxRetries;
if (isLastAttempt) {
console.error(`Dead letter: ${link.url} failed after ${maxRetries} attempts`);
throw err;
}
// exponential backoff with jitter
const delay = baseDelayMs * Math.pow(2, attempt - 1) +
Math.random() * 1000;
console.warn(
`Attempt ${attempt} failed for ${link.filename}. ` +
`Retrying in ${(delay / 1000).toFixed(1)}s...`
);
await new Promise((resolve) => setTimeout(resolve, delay));
}
}
}

Use different retry strategies depending on the error type:

  • 429/503 errors: Retry with exponential backoff
  • Transient network errors (e.g., ECONNRESET or a failed TLS handshake): Retry immediately without delay
  • Permanent failures (after N attempts): Add URL to a dead-letter list for later inspection
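The list above can be sketched as a small decision function. The names here are assumptions: retryDecision is hypothetical, err.code follows Node's convention for system errors, and the set of transient codes is a starting point rather than a complete list. HTTP 429/503 responses and anything unrecognized fall through to exponential backoff.

```javascript
// Error codes treated as transient network failures (illustrative, not exhaustive)
const TRANSIENT_CODES = new Set(["ECONNRESET", "ETIMEDOUT", "EPIPE"]);

// Hypothetical helper: decide what to do after a failed attempt (1-based).
function retryDecision(err, attempt, { maxRetries = 3, baseDelayMs = 2000 } = {}) {
  if (attempt >= maxRetries) {
    return { action: "dead-letter", delayMs: 0 }; // out of attempts
  }
  if (TRANSIENT_CODES.has(err.code)) {
    return { action: "retry", delayMs: 0 }; // retry immediately
  }
  // 429/503 and unknown errors: back off exponentially
  return { action: "retry", delayMs: baseDelayMs * 2 ** (attempt - 1) };
}
```

A retry loop such as downloadWithRetry could call this in its catch block and sleep for delayMs before the next attempt; adding jitter on top, as the loop above does, is still worthwhile.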

Step 5: Memory hygiene at scale

Chromium uses more memory the longer it runs under automation. If you don't manage this, a long batch job can use up all available RAM and crash.

Close pages and contexts as soon as a download finishes. The finally block in downloadSingleFile takes care of this. For very long runs, restart the browser process after every N jobs:

async function runBatchWithRestart(links, options = {}) {
const { restartEvery = 50, concurrency = 3 } = options;
const chunks = [];
// split links into chunks
for (let i = 0; i < links.length; i += restartEvery) {
chunks.push(links.slice(i, i + restartEvery));
}
const allResults = [];
for (const [index, chunk] of chunks.entries()) {
console.log(
`Processing chunk ${index + 1}/${chunks.length} ` +
`(${chunk.length} files)...`
);
const browser = await puppeteer.launch({ headless: true });
try {
const results = await downloadQueue(browser, chunk, { concurrency });
allResults.push(...results);
} finally {
// restart browser between chunks to clear memory
await browser.close();
}
}
return allResults;
}

Step 6: Manifest and resumability

Append an entry to the manifest after each download, not once per run, so a crash midway still leaves a usable record. On the next run, the script reads the manifest and skips files that already completed:

const manifestPath = path.resolve(__dirname, "manifest.jsonl");
function appendToManifest(result) {
fs.appendFileSync(
manifestPath,
JSON.stringify(result) + "\n",
"utf8"
);
}
function loadCompletedUrls() {
if (!fs.existsSync(manifestPath)) return new Set();
const completed = new Set();
const lines = fs.readFileSync(manifestPath, "utf8").split("\n").filter(Boolean);
for (const line of lines) {
try {
const entry = JSON.parse(line);
if (entry.status === "completed") {
completed.add(entry.url);
}
} catch {
// skip malformed lines
}
}
return completed;
}
async function resumableBatch(links, options = {}) {
const completed = loadCompletedUrls();
const remaining = links.filter((link) => !completed.has(link.url));
console.log(
`${completed.size} already completed. ` +
`${remaining.length} remaining.`
);
return runBatchWithRestart(remaining, options);
}
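A manifest entry only proves the file existed when it was written. If the downloads folder can be cleaned up or modified between runs, it may be worth re-checking each completed entry before skipping it. A minimal sketch, assuming entries carry the filePath and sha256 fields used elsewhere in this guide; isStillValid is a hypothetical name:

```javascript
const crypto = require("crypto");
const fs = require("fs");

// Hypothetical check: is the file behind a "completed" entry still intact?
function isStillValid(entry) {
  if (entry.status !== "completed" || !entry.filePath || !entry.sha256) {
    return false;
  }
  if (!fs.existsSync(entry.filePath)) return false;
  // reads the whole file into memory; stream the hash for very large files
  const digest = crypto
    .createHash("sha256")
    .update(fs.readFileSync(entry.filePath))
    .digest("hex");
  return digest === entry.sha256;
}
```

Entries that fail the check can simply be left out of the completed set, so the URL is downloaded again.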

Step 7: Scheduling recurring runs

For daily snapshots or recurring data pulls, pair the batch scraper with a cron job. Create the logs directory first:

mkdir logs

Open your crontab:

crontab -e

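Add this entry to run nightly at 3:00 AM. Percent signs must be escaped in a crontab, because cron treats an unescaped % as the end of the command. Cron also runs with a minimal PATH, so replace node with its absolute path if the job reports that the command is not found:

0 3 * * * cd ~/puppeteer-downloader && node batch-downloader.js >> logs/batch_$(date +\%F).log 2>&1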

For more complex scheduling and workflow orchestration, our guide on how to schedule web scraping tasks covers the options in depth. For non-developer pipelines that trigger on file downloads, our guide to building n8n web scraping workflows covers the low-code integration side.

When batch volume requires managed proxy rotation rather than a fixed endpoint, Decodo’s rotating proxies handle IP rotation at the infrastructure level so your scraper doesn’t have to.

Full batch script

Here’s everything wired together in batch-downloader.js:

const puppeteer = require("puppeteer");
const path = require("path");
const fs = require("fs");
const crypto = require("crypto");
const pLimit = require("p-limit");
const downloadDir = path.resolve(__dirname, "downloads");
const manifestPath = path.resolve(__dirname, "manifest.jsonl");
// utilities
function hashFile(filePath) {
return new Promise((resolve, reject) => {
const hash = crypto.createHash("sha256");
const stream = fs.createReadStream(filePath);
stream.on("data", (chunk) => hash.update(chunk));
stream.on("end", () => resolve(hash.digest("hex")));
stream.on("error", reject);
});
}
function waitForDownload(client, timeoutMs = 60000) {
return new Promise((resolve, reject) => {
const timeout = setTimeout(
() => reject(new Error(`Download timed out after ${timeoutMs}ms`)),
timeoutMs
);
const downloads = new Map();
client.on("Browser.downloadWillBegin", (event) => {
downloads.set(event.guid, { suggestedFilename: event.suggestedFilename });
});
client.on("Browser.downloadProgress", (event) => {
if (event.state === "completed") {
clearTimeout(timeout);
resolve({
guid: event.guid,
suggestedFilename: downloads.get(event.guid)?.suggestedFilename,
totalBytes: event.totalBytes,
});
} else if (event.state === "canceled") {
clearTimeout(timeout);
reject(new Error("Download canceled"));
}
});
});
}
function appendToManifest(result) {
fs.appendFileSync(manifestPath, JSON.stringify(result) + "\n", "utf8");
}
function loadCompletedUrls() {
if (!fs.existsSync(manifestPath)) return new Set();
const completed = new Set();
const lines = fs.readFileSync(manifestPath, "utf8").split("\n").filter(Boolean);
for (const line of lines) {
try {
const entry = JSON.parse(line);
if (entry.status === "completed") completed.add(entry.url);
} catch { /* skip malformed lines */ }
}
return completed;
}
// core download
async function downloadSingleFile(browser, url, jobDir, timeoutMs = 60000) {
const context = await browser.createBrowserContext();
const page = await context.newPage();
const client = await page.createCDPSession();
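await client.send("Browser.setDownloadBehavior", {
behavior: "allowAndName",
// scope the behavior to this context; when browserContextId is omitted,
// CDP applies it to the default browser context instead
browserContextId: context.id,
downloadPath: jobDir,
eventsEnabled: true,
});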
try {
const downloadPromise = waitForDownload(client, timeoutMs);
await page.goto(url, { waitUntil: "networkidle2" });
await page.waitForSelector("[data-testid='download-btn']", { timeout: 10000 });
await page.click("[data-testid='download-btn']");
const result = await downloadPromise;
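const guidPath = path.resolve(jobDir, result.guid);
// basename() strips directory parts; fall back to the GUID if no name was reported
const finalName = path.basename(result.suggestedFilename || result.guid);
const finalPath = path.resolve(jobDir, finalName);
fs.renameSync(guidPath, finalPath);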
return finalPath;
} finally {
// always close the context -- never leak browser resources
await context.close();
}
}
async function downloadWithRetry(browser, link, jobDir, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await downloadSingleFile(browser, link.url, jobDir);
} catch (err) {
if (attempt === maxRetries) throw err;
const delay = 2000 * Math.pow(2, attempt - 1) + Math.random() * 1000;
console.warn(
`Attempt ${attempt} failed. Retrying in ${(delay / 1000).toFixed(1)}s...`
);
await new Promise((resolve) => setTimeout(resolve, delay));
}
}
}
// downloadQueue wraps downloadSingleFile with manifest tracking
// used by runBatchWithRestart for large batch jobs (500+ files)
// main() below implements the same logic inline for clarity
async function downloadQueue(browser, links, options = {}) {
const {
concurrency = 3,
downloadDir: dir = path.resolve(__dirname, "downloads"),
timeoutMs = 60000,
} = options;
const limit = pLimit(concurrency);
const manifest = [];
const tasks = links.map((link) =>
limit(async () => {
const jobId = crypto.randomUUID();
const jobDir = path.resolve(dir, jobId);
fs.mkdirSync(jobDir, { recursive: true });
const result = {
url: link.url,
filename: link.filename,
jobId,
status: "pending",
filePath: null,
sha256: null,
startedAt: new Date().toISOString(),
completedAt: null,
};
try {
const filePath = await downloadSingleFile(
browser,
link.url,
jobDir,
timeoutMs
);
result.status = "completed";
result.filePath = filePath;
result.sha256 = await hashFile(filePath);
result.completedAt = new Date().toISOString();
console.log(`✓ ${link.filename}`);
} catch (err) {
result.status = "failed";
result.error = err.message;
console.error(`✗ ${link.filename}: ${err.message}`);
}
manifest.push(result);
appendToManifest(result); // persist the entry so the next run can resume
return result;
})
);
await Promise.all(tasks);
return manifest;
}
// for long runs (500+ files), use runBatchWithRestart instead of main()
// it restarts the browser every 50 jobs to prevent memory bloat
async function runBatchWithRestart(links, options = {}) {
const { restartEvery = 50, concurrency = 3 } = options;
const chunks = [];
for (let i = 0; i < links.length; i += restartEvery) {
chunks.push(links.slice(i, i + restartEvery));
}
const allResults = [];
for (const [index, chunk] of chunks.entries()) {
console.log(
`Processing chunk ${index + 1}/${chunks.length} ` +
`(${chunk.length} files)...`
);
const browser = await puppeteer.launch({ headless: true });
try {
const results = await downloadQueue(browser, chunk, { concurrency });
allResults.push(...results);
} finally {
await browser.close();
}
}
return allResults;
}
// batch runner
// main() is suitable for most batch jobs
// for 500+ files, replace with runBatchWithRestart() for memory hygiene
async function main() {
const targetUrl = "https://data.worldbank.org/indicator";
const linkSelector = "a[href$='.csv']";
fs.mkdirSync(downloadDir, { recursive: true });
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto(targetUrl, { waitUntil: "networkidle2" });
const links = await page.$$eval(linkSelector, (els) =>
els.map((el) => ({
url: el.href,
filename: el.href.split("/").pop().split("?")[0],
}))
);
await page.close();
const completed = loadCompletedUrls();
const remaining = links.filter((link) => !completed.has(link.url));
console.log(`${remaining.length} files to download.`);
const limit = pLimit(3);
await Promise.all(
remaining.map((link) =>
limit(async () => {
const jobId = crypto.randomUUID();
const jobDir = path.resolve(downloadDir, jobId);
fs.mkdirSync(jobDir, { recursive: true });
const result = {
url: link.url,
filename: link.filename,
status: "pending",
startedAt: new Date().toISOString(),
};
try {
const filePath = await downloadWithRetry(browser, link, jobDir);
result.status = "completed";
result.filePath = filePath;
result.sha256 = await hashFile(filePath);
result.completedAt = new Date().toISOString();
console.log(`✓ ${link.filename}`);
} catch (err) {
result.status = "failed";
result.error = err.message;
console.error(`✗ ${link.filename}: ${err.message}`);
}
appendToManifest(result);
})
)
);
await browser.close();
console.log("Batch complete. Manifest written to manifest.jsonl");
}
main().catch((err) => {
console.error(err);
process.exitCode = 1; // non-zero exit so cron or CI can detect the failure
});

Run it:

node batch-downloader.js

Advanced download scenarios and integrations

Basic download methods save a file to disk, but production scrapers often need more: cloud storage, post-processing, authenticated sessions, and tool integrations that keep the scraper stateless. The patterns below cover each of these.

Create a new file for the advanced patterns:

touch advanced-downloader.js

Streaming downloads directly to cloud storage

If you write to a local disk and then upload, you double the I/O, and this approach fails on ephemeral runtimes like Lambda functions, Cloud Run containers, and CI workers that do not keep state between runs. Instead, pipe the response stream directly into your cloud SDK’s upload stream.

Install the AWS SDK:

npm install @aws-sdk/client-s3 @aws-sdk/lib-storage

The function below uses Puppeteer only to extract the file URL and session cookies, then hands the stream straight to S3 using the Upload class from @aws-sdk/lib-storage. Upload switches to a multipart upload automatically once the body is larger than one part (5 MB by default), so you don't need to split the stream into parts yourself. The global fetch and Readable.fromWeb used here require Node.js 18 or later.

const { S3Client } = require("@aws-sdk/client-s3");
const { Upload } = require("@aws-sdk/lib-storage");
const { Readable } = require("stream");
const s3 = new S3Client({ region: process.env.AWS_REGION });
async function streamToS3(browser, pageUrl, linkSelector, bucket, key) {
// step 1 -- use Puppeteer to extract the file URL and session cookies
const page = await browser.newPage();
await page.goto(pageUrl, { waitUntil: "networkidle2" });
await page.waitForSelector(linkSelector);
const fileUrl = await page.$eval(linkSelector, (el) => el.href);
const cookies = await page.cookies();
const cookieHeader = cookies.map((c) => `${c.name}=${c.value}`).join("; ");
await page.close();
// step 2 -- stream directly to S3 without touching local disk
const response = await fetch(fileUrl, {
headers: {
Cookie: cookieHeader,
// reuse the browser's own User-Agent so the request matches the session
"User-Agent": await browser.userAgent(),
},
});
if (!response.ok) {
throw new Error(`Failed to fetch file: ${response.status}`);
}
const upload = new Upload({
client: s3,
params: {
Bucket: bucket,
Key: key,
Body: Readable.fromWeb(response.body),
ContentType:
response.headers.get("content-type") || "application/octet-stream",
},
});
// Upload switches to multipart once the body exceeds one part (5 MB by default)
upload.on("httpUploadProgress", (progress) => {
// total is undefined when the stream length isn't known up front
console.log(`Uploaded: ${progress.loaded} / ${progress.total ?? "?"} bytes`);
});
await upload.done();
console.log(`Streamed to s3://${bucket}/${key}`);
}

For GCS:

npm install @google-cloud/storage

The GCS client uses a write stream rather than a multipart upload object. Setting resumable: true enables the resumable upload protocol, which handles large files and recovers from dropped connections without restarting the transfer from zero.

const { Storage } = require("@google-cloud/storage");
const { Readable } = require("stream");
const { pipeline } = require("stream/promises");
const gcs = new Storage();
async function streamToGCS(fileUrl, cookieHeader, bucketName, fileName) {
  const response = await fetch(fileUrl, {
    headers: { Cookie: cookieHeader },
  });
  if (!response.ok) {
    throw new Error(`Failed to fetch file: ${response.status}`);
  }
  const file = gcs.bucket(bucketName).file(fileName);
  const writeStream = file.createWriteStream({
    resumable: true, // handles large files automatically
    contentType:
      response.headers.get("content-type") || "application/octet-stream",
  });
  // pipeline() rejects if either stream fails; a bare .pipe() would not
  // forward a network error from the source and the promise would hang
  await pipeline(Readable.fromWeb(response.body), writeStream);
  console.log(`Streamed to gs://${bucketName}/${fileName}`);
}

Note: Method 1 (browser-managed downloads using page.click()) always writes to local disk first because Chromium cannot stream to other destinations. For direct-to-cloud streaming, use Method 2 (in-page fetch) or Method 4 (Node HTTP), where you control the response stream.

Compressing and post-processing files inline

If you run batch jobs that deliver daily snapshots, it is better to compress multiple downloaded files into a single archive when the job finishes instead of sending individual files.

Install the archiver package:

npm install archiver@5

archiveDownloads() takes a job directory and zips everything in it into a single output file. The zlib: { level: 9 } option sets maximum compression. The whole operation is wrapped in a Promise so you can await it cleanly before marking the job complete.

const archiver = require("archiver");
const fs = require("fs");
async function archiveDownloads(sourceDir, outputPath) {
return new Promise((resolve, reject) => {
const output = fs.createWriteStream(outputPath);
const archive = archiver("zip", { zlib: { level: 9 } });
output.on("close", () => {
console.log(`Archive: ${outputPath} (${archive.pointer()} bytes)`);
resolve(outputPath);
});
archive.on("error", reject);
output.on("error", reject); // e.g. the output path is not writable
archive.pipe(output);
archive.directory(sourceDir, false);
archive.finalize();
});
}

For large uncompressed CSV or JSON files, gzip the stream while downloading. Text formats often shrink by 60-80%, though the exact ratio depends on the data:

const zlib = require("zlib");
const { Readable } = require("stream");
const { pipeline } = require("stream/promises");
const fs = require("fs");
async function downloadAndCompress(fileUrl, cookieHeader, outputPath) {
  const response = await fetch(fileUrl, {
    headers: { Cookie: cookieHeader },
  });
  if (!response.ok) {
    throw new Error(`Failed to fetch file: ${response.status}`);
  }
  const target = `${outputPath}.gz`;
  // pipeline() propagates an error from any stage and destroys the others;
  // chained .pipe() calls would leave the promise pending on a network error
  await pipeline(
    Readable.fromWeb(response.body),
    zlib.createGzip(),
    fs.createWriteStream(target)
  );
  console.log(`Compressed: ${target}`);
  return target;
}

Validate the schema of downloaded data files before marking a job as successful. This helps catch partial downloads and silent server-side regressions before bad data enters your pipeline:

npm install csv-parse

csv-parse provides a synchronous parser via csv-parse/sync, useful for validating small-to-medium files inline without async overhead.

const { parse } = require("csv-parse/sync");
const fs = require("fs");
function validateCSV(filePath, requiredHeaders) {
const content = fs.readFileSync(filePath, "utf8");
let records;
try {
records = parse(content, { columns: true, skip_empty_lines: true });
} catch (err) {
throw new Error(`CSV parse failed: ${err.message}`);
}
if (records.length === 0) {
throw new Error("CSV is empty -- possible partial download");
}
const headers = Object.keys(records[0]);
const missing = requiredHeaders.filter((h) => !headers.includes(h));
if (missing.length > 0) {
throw new Error(`Missing required columns: ${missing.join(", ")}`);
}
console.log(`Validated: ${records.length} rows, ${headers.length} columns`);
return records;
}

Authenticated downloads: preserving session and cookies

If the file is behind a login, authenticate once and reuse the session for each download run.

Persist session with userDataDir:

const puppeteer = require("puppeteer");
const path = require("path");
async function authenticatedBrowser(profileDir) {
return puppeteer.launch({
headless: true,
userDataDir: path.resolve(__dirname, profileDir),
});
}
async function loginOnce(browser, loginUrl, credentials) {
const page = await browser.newPage();
await page.goto(loginUrl, { waitUntil: "networkidle2" });
await page.type("#username", credentials.username);
await page.type("#password", credentials.password);
// start waiting before the click so a fast navigation isn't missed
await Promise.all([
page.waitForNavigation({ waitUntil: "networkidle2" }),
page.click("[type='submit']"),
]);
console.log("Login complete -- session persisted to profile directory");
await page.close();
}

For the next run, launch with the same userDataDir so the session remains active.

For multi-step authentication such as SSO, OAuth, or MFA, export the storage state after a successful login and reload it for each run:

const fs = require("fs");
async function exportSession(page, outputPath) {
const cookies = await page.cookies();
const localStorage = await page.evaluate(() =>
Object.fromEntries(
Object.keys(localStorage).map((key) => [key, localStorage.getItem(key)])
)
);
fs.writeFileSync(
outputPath,
JSON.stringify({ cookies, localStorage }, null, 2)
);
console.log(`Session exported to: ${outputPath}`);
}
async function restoreSession(page, sessionPath) {
const { cookies, localStorage } = JSON.parse(
fs.readFileSync(sessionPath, "utf8")
);
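await page.setCookie(...cookies);
// localStorage is per-origin: navigate the page to the target site
// before this point, or the values land on the wrong origin
await page.evaluate((storage) => {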
for (const [key, value] of Object.entries(storage)) {
localStorage.setItem(key, value);
}
}, localStorage);
console.log("Session restored");
}

Watch out for CSRF tokens: many SaaS dashboards rotate CSRF tokens every time the page loads. Do not cache the token in a config file. Instead, capture it from the page’s DOM when you make the request:

async function getCsrfToken(page) {
return page
.$eval("meta[name='csrf-token']", (el) => el.getAttribute("content"))
.catch(() => null);
}
async function downloadWithCsrf(page, downloadUrl) {
const csrfToken = await getCsrfToken(page);
return page.evaluate(
async (url, token) => {
const response = await fetch(url, {
method: "POST",
headers: {
"X-CSRF-Token": token,
"Content-Type": "application/json",
},
credentials: "include",
});
if (!response.ok) {
throw new Error(`Export request failed: ${response.status}`);
}
const buffer = await response.arrayBuffer();
const bytes = new Uint8Array(buffer);
let binary = "";
for (let i = 0; i < bytes.byteLength; i++) {
binary += String.fromCharCode(bytes[i]);
}
return btoa(binary);
},
downloadUrl,
csrfToken
);
}
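downloadWithCsrf returns a base64 string because page.evaluate can only pass serializable values back to Node. On the Node side, decode it before writing to disk. A minimal sketch; saveBase64File is a hypothetical helper name:

```javascript
const fs = require("fs");

// Hypothetical helper: decode a base64 payload returned from page.evaluate
// and write the raw bytes to disk. Returns the number of bytes written.
function saveBase64File(base64, outputPath) {
  const data = Buffer.from(base64, "base64");
  fs.writeFileSync(outputPath, data);
  return data.length;
}

// usage, inside an async function:
// const base64 = await downloadWithCsrf(page, downloadUrl);
// saveBase64File(base64, "export.csv");
```

Base64 inflates the payload by about a third and the whole file passes through memory, so this pattern suits small and medium exports better than multi-gigabyte ones.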

User-Agent mismatch warning: When you forward session cookies from Puppeteer to a Node HTTP request, copy the User-Agent header exactly. Some sites invalidate the session if the browser and the HTTP client send different User-Agent values:

const puppeteer = require("puppeteer");
const BROWSER_UA =
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";
const LAUNCH_ARGS = {
headless: true,
args: ["--no-sandbox", "--disable-dev-shm-usage"],
};
// the lines below use await, so run them inside an async function
// set the UA on the page directly, not on the context
const browser = await puppeteer.launch(LAUNCH_ARGS);
const page = await browser.newPage();
await page.setUserAgent(BROWSER_UA);
// use the same UA in your Node HTTP request
const response = await fetch(fileUrl, {
headers: {
Cookie: cookieHeader,
"User-Agent": BROWSER_UA, // must match exactly
},
});

Integration patterns

Queue downloaded files for downstream processing using ioredis as the BullMQ connection:

npm install bullmq ioredis

queueForProcessing() pushes each completed file into a BullMQ queue with its metadata. The attempts: 3 and exponential backoff options mean transient failures in downstream processing retry automatically without any extra code. Note that BullMQ requires maxRetriesPerRequest: null on ioredis connections used by workers and rejects other values at startup; setting it here lets producers and workers share the same connection options.

const { Queue } = require("bullmq");
const { Redis } = require("ioredis");
const path = require("path");
const connection = new Redis({
host: process.env.REDIS_HOST || "127.0.0.1",
port: parseInt(process.env.REDIS_PORT || "6379", 10),
maxRetriesPerRequest: null, // required by BullMQ
});
const processingQueue = new Queue("file-processing", { connection });
async function queueForProcessing(filePath, metadata) {
await processingQueue.add(
"process-file",
{
filePath,
filename: path.basename(filePath),
sha256: metadata.sha256,
downloadedAt: metadata.completedAt,
},
{
attempts: 3,
backoff: { type: "exponential", delay: 2000 },
}
);
console.log(`Queued for processing: ${path.basename(filePath)}`);
}

Trigger a webhook on each successful download:

async function triggerWebhook(webhookUrl, payload) {
const response = await fetch(webhookUrl, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
event: "download.completed",
timestamp: new Date().toISOString(),
...payload,
}),
});
if (!response.ok) {
console.error(`Webhook failed: ${response.status}`);
}
}
// call after each successful download (inside an async function)
await triggerWebhook(process.env.WEBHOOK_URL, {
filename: result.filename,
filePath: result.filePath,
sha256: result.sha256,
});

This works well with n8n, Make, or Zapier for low-code post-processing pipelines. The webhook triggers, the workflow tool picks it up, and handles the rest without extra scraper code. For a complete setup, see our web scraping with Decodo’s n8n integration walkthrough.

Use Chrome DevTools Protocol directly for more control

When Puppeteer's public API doesn't expose what you need, for example listening to Browser.downloadWillBegin to route specific file types to different folders, drop down to chrome-remote-interface directly:

npm install chrome-remote-interface

chrome-remote-interface talks to the Chrome DevTools Protocol without Puppeteer's abstraction layer. It attaches to a Chrome instance that is already running, so start Chrome with --remote-debugging-port=9222 first. The example below routes PDF downloads to a separate folder:

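const CDP = require("chrome-remote-interface");
const path = require("path");
const fs = require("fs");
async function interceptSpecificDownloads() {
  // assumes Chrome is already running with --remote-debugging-port=9222
  const client = await CDP({ port: 9222 });
  const { Browser } = client;
  const defaultDir = path.resolve(__dirname, "downloads/default");
  const pdfDir = path.resolve(__dirname, "downloads/pdfs");
  fs.mkdirSync(defaultDir, { recursive: true });
  fs.mkdirSync(pdfDir, { recursive: true });
  await Browser.setDownloadBehavior({
    behavior: "allowAndName",
    downloadPath: defaultDir,
    eventsEnabled: true,
  });
  const names = new Map(); // guid -> suggested filename
  client.on("Browser.downloadWillBegin", (event) => {
    names.set(event.guid, path.basename(event.suggestedFilename));
  });
  // changing the download path once a download has started is not a reliable
  // way to redirect it, so move each file after it completes instead
  client.on("Browser.downloadProgress", (event) => {
    if (event.state !== "completed") return;
    const name = names.get(event.guid) || event.guid;
    const targetDir = name.toLowerCase().endsWith(".pdf") ? pdfDir : defaultDir;
    // allowAndName saves the file under its GUID; give it the real name here
    fs.renameSync(path.join(defaultDir, event.guid), path.join(targetDir, name));
    names.delete(event.guid);
  });
}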

Combine Puppeteer with an MCP server for AI-driven file collection

If your AI agents need to request and process files as part of a larger automation workflow, Decodo’s MCP server lets agents collect web data directly. For orchestration patterns that combine agents with file downloads, see our AI agent orchestration tutorial with n8n and Decodo MCP Server.

When a Puppeteer download file job gets blocked

The steps in this guide work well when the target site is not blocking you. If the site tries to stop automation, you’ll notice certain symptoms. Solutions can be as simple as changing settings or as involved as using a managed API.

Recognizing the symptoms

Blocking usually doesn’t show a clear error. Instead, your scraper might seem to work but actually returns nothing useful.

  • HTTP 403 or 429 on the file endpoint. The page loads fine, but the download request gets blocked. The server is distinguishing between a browser session loading a page and an automated request fetching a file. Check the response status on the download request, specifically, not just the page load:
// intercept responses to catch silent blocks on the download endpoint
page.on("response", (response) => {
if (response.url().includes("/export") || response.url().includes("/download")) {
console.log(`Download endpoint: ${response.status()} ${response.url()}`);
}
});
  • CAPTCHA page on click. The download button triggers a challenge instead of a file. The CDP downloadWillBegin event never fires. Your waitForDownload promise times out with no output.
  • File arrives as 0 bytes or HTML. Sometimes, the file downloads but is either empty or contains an HTML error page with a .csv extension. This can be hard to spot at first. To catch this, check the file’s content after every download.
function validateDownloadedFile(filePath, expectedType = "csv") {
const stats = fs.statSync(filePath);
// catch 0-byte files
if (stats.size === 0) {
throw new Error(`Downloaded file is empty: ${filePath}`);
}
// catch HTML error pages disguised as data files
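const header = Buffer.alloc(512);
const fd = fs.openSync(filePath, "r");
const bytesRead = fs.readSync(fd, header, 0, 512, 0);
fs.closeSync(fd);
// decode only the bytes actually read and drop a leading UTF-8 BOM
const content = header
.subarray(0, bytesRead)
.toString("utf8")
.replace(/^\uFEFF/, "")
.toLowerCase()
.trim();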
if (content.startsWith("<!doctype") || content.startsWith("<html")) {
throw new Error(
`Downloaded file contains HTML -- likely a CAPTCHA or error page: ${filePath}`
);
}
console.log(`File validated: ${path.basename(filePath)} (${stats.size} bytes)`);
return true;
}
  • Silent IP ban after a handful of successful pulls. Sometimes, your first few requests work, but then you stop getting results and don’t see any errors. Response times slow down, and requests may start timing out. This usually means your IP has been flagged. Try running the same request from a different IP to check.
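The validateDownloadedFile check above accepts an expectedType argument but only looks for HTML. For binary formats you can go one step further and compare the first bytes of the file against the format's signature. The sketch below covers three well-known signatures; matchesSignature and the SIGNATURES table are hypothetical names, and formats without a fixed signature (such as CSV) are passed through unchecked.

```javascript
const fs = require("fs");

// Leading bytes ("magic numbers") for a few common download formats
const SIGNATURES = {
  pdf: Buffer.from("%PDF-", "latin1"),
  zip: Buffer.from([0x50, 0x4b, 0x03, 0x04]), // also .xlsx and .docx
  gz: Buffer.from([0x1f, 0x8b]),
};

// Hypothetical check: does the file start with the signature for its type?
function matchesSignature(filePath, expectedType) {
  const signature = SIGNATURES[expectedType];
  if (!signature) return true; // no known signature: nothing to compare
  const head = Buffer.alloc(signature.length);
  const fd = fs.openSync(filePath, "r");
  try {
    const bytesRead = fs.readSync(fd, head, 0, signature.length, 0);
    return bytesRead === signature.length && head.equals(signature);
  } finally {
    fs.closeSync(fd);
  }
}
```

A block page saved with a .pdf extension fails this check immediately, which is cheaper than discovering the problem in a downstream parser.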

Why headless Chrome loses these fights

There are 3 main reasons why headless Chromium is easier to detect than regular Chrome:

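  • TLS and browser fingerprinting. Sites running Cloudflare, Akamai, or DataDome compute a JA3/JA4 fingerprint from the TLS handshake, before any HTTP headers are sent, and compare it with what the client claims to be. Node-side HTTP requests that reuse browser cookies present Node's fingerprint rather than Chrome's, and automated sessions add JavaScript-visible tells such as navigator.webdriver. No amount of User-Agent spoofing fixes a JA3 mismatch.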
  • Single-IP request volume. A residential user visits a product page a handful of times. An automation script hits dozens of pages per minute from the same IP. The behavioral signal is obvious even without fingerprinting.
  • Challenge pages get tougher with each request. Services like Cloudflare, DataDome, and Akamai Bot Manager start with a simple check, then move to a JavaScript challenge, and finally a CAPTCHA. Each step is harder to get past, and headless Chrome usually fails at some point.

What helps before reaching for a managed API

Before reaching for a managed API, exhaust the self-hosted options:

  • Rotating residential IPs. Instead of using one proxy, use a pool of residential IPs that change with each request. This makes it much harder for sites to detect high-volume scraping.
// pick a proxy from the pool for each browser launch
// Chromium ignores credentials embedded in --proxy-server, so keep them separate
const proxies = [
  "http://gate.decodo.com:7001",
  "http://gate.decodo.com:7002",
  "http://gate.decodo.com:7003",
];
const proxyAuth = { username: "user", password: "pass" };
function getRandomProxy() {
  return proxies[Math.floor(Math.random() * proxies.length)];
}
async function launchWithRotatingProxy() {
  const proxy = getRandomProxy();
  // call page.authenticate(proxyAuth) on each new page before navigating
  return puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy}`],
  });
}
  • Randomized timing. Don’t use fixed delays. Instead, randomize the timing for every action, like between loading a page and clicking, or between hovering and clicking. This helps your script look more human.
// vary every interaction timing
async function humanizedClick(page, selector) {
await page.waitForSelector(selector);
// random delay before interaction
await new Promise((r) => setTimeout(r, Math.random() * 2000 + 500));
await page.hover(selector);
// small delay between hover and click
await new Promise((r) => setTimeout(r, Math.random() * 500 + 100));
await page.click(selector);
}
  • Persistent storage state. Some sites treat returning users differently. Use userDataDir when launching so your session keeps cookies and storage between runs.
const browser = await puppeteer.launch({
headless: true,
userDataDir: path.resolve(__dirname, "chrome-profile"),
});
  • Stealth plugins. Using puppeteer-extra with the stealth plugin helps hide signs of automation that sites look for.
npm install puppeteer-extra puppeteer-extra-plugin-stealth

puppeteer-extra is a wrapper around Puppeteer that supports plugins. The stealth plugin is the most useful one because it patches the most common automation signals that bot detection platforms check before serving any content.

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());
const LAUNCH_ARGS = {
  headless: true,
  args: ["--no-sandbox", "--disable-dev-shm-usage"],
};
const browser = await puppeteer.launch(LAUNCH_ARGS);

The stealth plugin patches navigator.webdriver, fixes chrome.runtime exposure, spoofs plugin arrays, and handles a handful of other tells that basic headless Chrome exposes. It's not a silver bullet against serious bot management platforms, but it clears most lightweight detection.

If you want to learn more about bypassing CAPTCHA with Puppeteer or see a full breakdown of anti-bot tools, check out the detailed guides on these topics.

When self-hosting stops being worth it

At some point, keeping up with fingerprint patches, proxy rotation, CAPTCHA solving, and TLS rewrites takes more time and effort than the data is worth.

The signal that you've crossed that line:

  • Selectors break weekly because the site serves different HTML to suspected bots
  • IP bans arrive faster than you can rotate new addresses into the pool
  • CAPTCHA solve rates drop below 80%, and the backlog grows faster than it clears
  • The site starts serving honeypot data: plausible-looking results that are actually wrong

When that happens, switching to a managed scraping API is usually cheaper. It saves both on infrastructure and on the time engineers spend maintaining workarounds.

How a managed API fits into a Puppeteer workflow

Decodo's Web Scraping API handles JS rendering, anti-bot bypass, and proxy rotation as a single managed endpoint. The integration is straightforward: hand it the URL, receive the rendered HTML or file response, and handle the output on your side:

npm install jsdom

jsdom parses raw HTML strings into a queryable DOM, the same API you'd use in a browser, but running in Node. It's the lightweight alternative to spinning up a full Puppeteer page just to find a single element.

async function fetchViaScrapingAPI(targetUrl) {
const response = await fetch("https://scraper-api.decodo.com/v2/scrape", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: `Basic ${process.env.DECODO_API_KEY}`,
},
body: JSON.stringify({
url: targetUrl,
target: "universal",
headless: true,
wait_for: "[data-testid='download-btn']",
}),
});
if (!response.ok) {
throw new Error(`Scraping API error: ${response.status}`);
}
const data = await response.json();
return data.results[0].content;
}
// use alongside Puppeteer -- API handles the unblock, Puppeteer handles the rest
async function hybridDownload(protectedUrl, downloadSelector) {
  // step 1 -- use the API to get past the protection layer
  const html = await fetchViaScrapingAPI(protectedUrl);
  // step 2 -- parse the rendered HTML for the download URL
  const { JSDOM } = require("jsdom");
  const dom = new JSDOM(html);
  const href = dom.window.document
    .querySelector(downloadSelector)
    ?.getAttribute("href");
  if (!href) {
    throw new Error("Download link not found in rendered HTML");
  }
  // resolve relative links against the page URL
  const downloadUrl = new URL(href, protectedUrl);
  // step 3 -- fetch the file directly
  const fileResponse = await fetch(downloadUrl);
  if (!fileResponse.ok) {
    throw new Error(`File request failed: ${fileResponse.status}`);
  }
  const buffer = await fileResponse.arrayBuffer();
  // name the file from the URL path so query strings stay out of the filename
  const fileName = path.basename(downloadUrl.pathname) || "download";
  const outputPath = path.resolve(downloadDir, fileName);
  fs.writeFileSync(outputPath, Buffer.from(buffer));
  console.log(`Downloaded via hybrid approach: ${outputPath}`);
  return outputPath;
}

If your team wants to keep using Puppeteer and handle IP rotation on your own, Decodo’s residential proxies offer an IP pool without the managed API. For more on why rotating IPs is important for ongoing download jobs, see the guide on rotating proxies.

Final thoughts

Start with configuration, choose your method based on the situation, use CDP events to confirm completion, batch tasks with limited concurrency, and escalate if you get stuck. This decision framework works whether you are downloading one file or ten thousand.

The real challenge isn’t clicking the button. It’s making sure the file actually arrives, preventing parallel jobs from interfering with each other, and staying up and running when a site blocks headless Chrome.

When protected sites block downloads, Decodo Web Scraping API takes care of the hard parts like JavaScript rendering, anti-bot bypass, and proxy rotation, all in one place. If your team prefers to use Puppeteer directly, Decodo rotating proxies let you rotate IPs while keeping full control of your browser.

Skip the boilerplate

Decodo's Web Scraping API handles proxies, CAPTCHAs, and anti-bot detection so your code stays short and your requests actually land.


About the author

Justinas Tamasevicius

Director of Engineering

Justinas Tamaševičius is Director of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.

Connect with Justinas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Why does my Puppeteer download job hang and never close the browser?

The job is usually waiting on a completion signal that never arrives. With a "click then close" approach, the opposite happens: the browser closes before the download is done, and the file never lands. In both cases, use a Browser.downloadProgress listener that waits until the state is "completed" before closing the browser. Also, set a timeout for each file so stalled downloads fail quickly and don't block your worker.
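
A minimal sketch of that listener-plus-timeout pattern follows. The helper name waitForDownloadWithTimeout and the 30-second default are hypothetical. It assumes client is a CDP session on which Browser.setDownloadBehavior was already sent with eventsEnabled: true, and that only one download is in flight at a time.

```javascript
// Hypothetical helper: resolves with the download guid once the state is
// "completed", rejects on "canceled" or when timeoutMs elapses first.
// Assumes a single in-flight download; for parallel jobs, match on event.guid.
function waitForDownloadWithTimeout(client, timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    // remove the listener and timer so nothing keeps the worker alive
    const cleanup = () => {
      clearTimeout(timer);
      client.off("Browser.downloadProgress", onProgress);
    };
    const onProgress = (event) => {
      if (event.state === "completed") {
        cleanup();
        resolve(event.guid);
      } else if (event.state === "canceled") {
        cleanup();
        reject(new Error(`Download canceled: ${event.guid}`));
      }
      // "inProgress" events are ignored; keep waiting
    };
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error(`Download timed out after ${timeoutMs} ms`));
    }, timeoutMs);
    client.on("Browser.downloadProgress", onProgress);
  });
}
```

Create the promise before clicking the download button, await it afterwards, and close the browser in a finally block so a rejected download cannot leave it open.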

Why does setDownloadBehavior throw "Cannot find context with specified id"?

This error occurs when the CDP command runs on a page that's already closed, or when the session starts from a page that was removed by navigation. To fix this, open the session from the browser-level CDP target and set the behavior for the whole browser. This way, it will keep working even if pages are closed or reloaded.

How do I download a file that opens in a new tab?

Listen for browser.on("targetcreated") to grab the new page handle, and attach the listener before the click fires; attaching it afterwards races with the download trigger. Apply setDownloadBehavior browser-wide beforehand so it covers the new tab automatically, then await the CDP download events.
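
The ordering matters more than the API, so here is a minimal sketch of it. The helper name captureNewPage and the 10-second default are hypothetical. It assumes browser is a Puppeteer Browser instance and trigger is an async function that performs the click.

```javascript
// Hypothetical helper: registers the "targetcreated" listener first, then
// runs the trigger, so the new tab cannot open before anything is listening.
async function captureNewPage(browser, trigger, timeoutMs = 10000) {
  let onTarget;
  let timer;
  const newPage = new Promise((resolve, reject) => {
    onTarget = (target) => {
      if (target.type() !== "page") return; // skip workers and other targets
      target.page().then(resolve, reject);
    };
    timer = setTimeout(
      () => reject(new Error("No new tab opened before timeout")),
      timeoutMs
    );
    browser.on("targetcreated", onTarget);
  });
  try {
    await trigger(); // e.g. () => page.click("a[target=_blank]")
    return await newPage;
  } finally {
    // always detach, even if the trigger or the wait fails
    clearTimeout(timer);
    browser.off("targetcreated", onTarget);
  }
}
```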

Should I use Puppeteer or Playwright for downloads?

Playwright has a higher-level page.waitForEvent("download") API that returns a Download object with .saveAs(), .path(), and .failure() helpers, which results in less code and fewer edge cases. Puppeteer gives you direct CDP access, which is more flexible for unusual cases such as redirecting specific file types or hooking Page.downloadWillBegin. Use Playwright for new projects unless you already have an existing Puppeteer codebase or specifically require raw CDP control.

Can I download a file without rendering the full page?

Yes. Use Puppeteer only to extract the file URL and any required cookies or tokens, then handle the actual download with a Node HTTP client (Method 4). It's faster and uses far less memory than letting Chromium manage the transfer.

Why do my downloaded files contain HTML instead of the expected content?

The server returned a challenge page (for example, from Cloudflare), a login redirect, or a rate-limit notice, and Chrome saved it under the original filename. Check the first bytes of any completed file: if it starts with <!DOCTYPE html, the download was blocked. Add IP rotation, use persistent sessions, or route the request through the Decodo Web Scraping API.

How do I make Puppeteer downloads work in Docker or AWS Lambda?

Use puppeteer-core with a slim Chromium build (chrome-aws-lambda or @sparticuz/chromium for Lambda). Set /tmp as the absolute downloadPath, since it's the only writable filesystem on Lambda, and stream files directly to S3 instead of leaving them on ephemeral disk.


© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved