Web Scraping With Node Fetch: A Practical Guide

Web scraping with Node Fetch offers a lightweight way to collect data in Node.js. By fetching raw HTML or JSON responses and pairing them with parsers like Cheerio, developers can transform unstructured pages into structured datasets. This Node Fetch tutorial explains request handling, response parsing, data extraction, proxy integration, and when managed scraping APIs are necessary to effectively bypass advanced anti-bot protections.

TL;DR

  • Scrape lightweight targets in Node.js by sending GET or POST requests with node-fetch and extracting data from the response body with Cheerio
  • There are 3 concurrency patterns for effective parallel fetching with node-fetch: Promise.all, bounded concurrency with p-limit, and streaming queues
  • Handle advanced request options by rotating a small pool of realistic User-Agent strings, installing fetch-cookie, and using POST for form submissions
  • Manage anti-scraping mechanisms like CAPTCHAs through proxies or a third-party managed web scraping API

What is node-fetch and how it relates to the Fetch API

The Fetch API is a WHATWG (Web Hypertext Application Technology Working Group) Specification that standardizes how JavaScript applications send HTTP requests and handle Responses when communicating with APIs or fetching resources from a server. Browsers have supported the Fetch API natively for roughly a decade.

node-fetch is a lightweight Node.js library that brought the same Fetch API syntax to server-side JavaScript before Node.js added native support via fetch(). Developers like node-fetch because it makes HTTP requests with the familiar fetch() syntax.

The key design difference between the Fetch API (fetch/node-fetch) and older JavaScript networking approaches is how they deliver asynchronous results. The built-in Node.js http module relies heavily on callbacks and events, which often leads to nested, difficult-to-read code. Browsers used XMLHttpRequest (XHR), an older API that required significantly more setup just to send a simple request.

The Fetch API simplified this process by returning a Promise, a JavaScript object that represents a future result. Instead of passing callback functions around, developers can wait for the result with async/await, making asynchronous code read more like regular synchronous code.

The Promise returned by fetch() resolves as soon as the Response headers arrive. At that point, the full Response body may still be downloading because it's exposed as a stream: a ReadableStream in browsers and native fetch, and a Node.js Readable stream in node-fetch. A stream delivers data in small chunks instead of loading the entire Response into memory at once. This is useful for large files, APIs with continuous data, or streaming content because the application can begin processing data immediately while the remaining chunks are still arriving.

You can consume the stream in different formats depending on the response type:

  • response.text(). Returns plain text, such as HTML or raw content.
  • response.json(). Parses the Response as JSON and returns a JavaScript object.
  • response.arrayBuffer(). Returns binary data as an ArrayBuffer. node-fetch v2 also offers response.buffer(), which returns a Node.js Buffer, but v3 deprecates it.
  • response.body. Exposes the raw stream for manual stream handling.
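
To make the chunked delivery concrete, here is a minimal sketch that reads a body piece by piece. The helper name countBodyBytes is hypothetical, and the sketch assumes the body is async-iterable, which holds for the Node.js stream that node-fetch exposes:

```javascript
// Hypothetical helper: total up a response body chunk by chunk
// instead of buffering the whole body in memory first.
async function countBodyBytes(response) {
  let total = 0;
  // The body stream is async-iterable, so each loop pass receives one chunk
  for await (const chunk of response.body) {
    total += chunk.length; // each chunk is a Buffer (or Uint8Array)
  }
  return total;
}

// Usage with node-fetch (assumed to be imported as fetch):
// const response = await fetch('https://example.com/large-file');
// console.log(await countBodyBytes(response));
```

The same loop shape works for writing chunks to disk or feeding them to a streaming parser as they arrive.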

Note that HTTP error status codes don't reject the Promise. A 404 or 500 response still resolves successfully. Only network-level failures, such as a DNS error, a refused connection, or an aborted request, reject the Promise. Always confirm response.ok (true for statuses 200-299) or response.status before passing the body to a parser.
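
The two failure paths can be sketched as follows. The helper name fetchHtml is hypothetical, and it takes the fetch function as a parameter (node-fetch's default export or the built-in fetch) so the example does not depend on either:

```javascript
// Hypothetical helper: fetchImpl is any fetch-compatible function,
// for example the default export of node-fetch or Node's global fetch.
async function fetchHtml(fetchImpl, url) {
  let response;
  try {
    response = await fetchImpl(url);
  } catch (err) {
    // Network-level failure (DNS error, refused connection, abort): the Promise rejected
    throw new Error(`Network error for ${url}: ${err.message}`);
  }
  // HTTP-level failure: the Promise resolved, so the status must be checked explicitly
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} for ${url}`);
  }
  return response.text();
}

// Usage (assuming fetch is imported from node-fetch):
// const html = await fetchHtml(fetch, 'https://example.com');
```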

Recent Node.js versions, starting with Node 18, ship with a built-in global fetch based on undici, an HTTP client library for Node.js. So why still use the node-fetch library? There are 3 reasons:

  • Projects pinned to Node versions older than 18
  • Teams that need the v2 CommonJS require() syntax rather than ESM (ECMAScript Modules) import
  • Codebases built around node-fetch-compatible plugins like fetch-cookie or node-fetch-har

Note. For new projects, use node-fetch v3 (ESM). For CommonJS codebases, v2 is the stable choice.

For a Python equivalent of a Node.js scraping task, see HTTPX vs. Requests vs. aiohttp. If you're comparing both ecosystems while learning backend development, The Best Python HTTP Clients is a useful follow-up because it shows how Python libraries like Requests, HTTPX, and aiohttp solve many of the same problems in different ways.

Setting up the project and installing dependencies

The first step to web scraping with Node Fetch is setting up a Node.js environment.

1. Install Node.js. Download and install Node.js 18 or newer (Long-Term Support recommended) from the official Node.js website. If you already use Node Version Manager (NVM), you can install Node.js from the terminal:

On Windows (PowerShell):

nvm install 18.20.8
nvm use 18.20.8

On macOS:

nvm install 18
nvm use 18

2. Verify your Node version. If you just installed Node.js, restart your terminal first, then run:

node -v

3. Initialize your node-fetch project folder:

mkdir node_fetch_scraper
cd node_fetch_scraper

4. Create your package.json file. It is the hub for your project dependencies:

npm init

5. Update your package.json file to include "type": "module". This tells Node.js to treat .js files as ES modules, which node-fetch v3 requires because it is ESM-only.

{
"name": "node-fetch-project",
"version": "1.0.0",
"type": "module",
"main": "index.js",
// ...
}

An alternative way to support ES modules is to store all your files with the .mjs extension. 

6. Dependencies to install: 

  • node-fetch. The HTTP client itself. This article installs node-fetch rather than relying on the built-in fetch() because of its plugin compatibility
  • cheerio. The server-side HTML parser that lets you query documents in jQuery style
  • https-proxy-agent. Needed later in the tutorial for routing requests through an HTTP/HTTPS proxy
  • dotenv (optional). Useful for storing proxy credentials and other sensitive values in environment variables instead of hardcoding them
  • fetch-cookie (optional). Preserves cookies across requests, which is useful for session-based targets

Note. If your Node.js version is 18 or above, the Fetch API (fetch()) is available globally without installing anything. This tutorial still installs node-fetch as a dependency for its plugin ecosystem, community support, and scraping customization. Keep in mind that node-fetch version 3 is ESM-only (ECMAScript Modules): it must be loaded with import syntax rather than require().

npm install node-fetch cheerio https-proxy-agent dotenv fetch-cookie

Project structure recommendation

Here’s a recommended project structure for production scraping workflows:

  • A fetcher module. It wraps node-fetch with default headers, timeouts, and retry logic, keeping the rest of the app clean.
  • A parsers folder. It contains one or more parsing files per target, each exporting a pure function that takes HTML and returns structured data.
  • A runner module. It manages the URL list and writes the output (CSV, JSON, or a database).

node_fetch_scraper/
├── src/
│   ├── fetcher.js - HTTP wrapper
│   ├── runner.js - Orchestrator
│   └── parsers/
│       └── parser_file.js - Parsing functions
├── data/ - Output storage
├── package.json - Dependencies
├── .env - Environment variables
└── README.md - Documentation

Separating fetching, parsing, and orchestration lets you later swap node-fetch for a more advanced service, such as a third-party scraping API, without rewriting the parsing logic when you scale up in production.

Basic fetch requests and data retrieval with Node Fetch

Let’s start by building a Node.js scraper that retrieves and parses HTML to extract data. We will then build a second scraper that retrieves JSON data via an API call, with no HTML parsing.

Sending a GET request

1. Import node-fetch. Then call fetch(url) with a target URL string, and await the returned Promise to receive a Response. Our target URL for this tutorial is the Wikipedia list of countries by area:

import fetch from 'node-fetch';
async function basicGetRequest() {
console.log('='.repeat(60));
console.log(' Section 2.1: Sending a GET Request');
console.log('='.repeat(60));
// Example url: Wikipedia country list
const url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area';
try {
console.log(`\n Fetching: Wikipedia - Countries by Area\n`);
// The returned Promise resolves to a Response once headers arrive
const response = await fetch(url);

2. Inspect the Response object

console.log(' Response Properties:');
console.log(` Status: ${response.status}`);
console.log(` Status Text: ${response.statusText}`);
console.log(` Content-Type: ${response.headers.get('content-type')}`);
console.log(` Content-Length: ${response.headers.get('content-length') || 'not provided'}`);
// Check response.ok (true for 200-299 status codes)
if (!response.ok) {
throw new Error(
`HTTP Error: ${response.status} ${response.statusText} from ${url}`
);
}

Ensure response.ok is true before passing the body to a parser.

3. Read the Response body.

// For HTML pages, use response.text()
console.log('\n Reading response body...');
const html = await response.text();

Note that reading the Response body is asynchronous because the body content is streamed from the server until the Fetch function fully consumes it. 

4. Put it all together. Here is the full running script:

import fetch from 'node-fetch';
async function basicGetRequest() {
console.log('='.repeat(60));
console.log(' Section 2.1: Sending a GET Request');
console.log('='.repeat(60));
const url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area';
try {
console.log(`\n Fetching: Wikipedia - Countries by Area\n`);
// Step 1: Import node-fetch and call fetch(url)
const response = await fetch(url);
// Step 2: Inspect the Response object
console.log(' Response Properties:');
console.log(` Status: ${response.status}`);
console.log(` Status Text: ${response.statusText}`);
console.log(` Content-Type: ${response.headers.get('content-type')}`);
console.log(` Content-Length: ${response.headers.get('content-length') || 'not provided'}`);
// Step 3: Check response.ok (true for 200-299 status codes)
if (!response.ok) {
throw new Error(
`HTTP Error: ${response.status} ${response.statusText} from ${url}`
);
}
// Step 4: Read the body
console.log('\n Reading response body...');
const html = await response.text();
// Run the result
console.log(`\n Success! Received ${html.length} characters of HTML\n`);
console.log('First 500 characters of the response:');
console.log('-'.repeat(60));
console.log(html.slice(0, 500));
console.log('-'.repeat(60));
console.log('\n Summary:');
console.log(` url: Wikipedia - Countries by Area`);
console.log(` Status: ${response.status} ${response.statusText}`);
console.log(` Body size: ${html.length} characters`);
} catch (error) {
console.error('\n Error during fetch:');
console.error(` ${error.message}`);
if (error.code === 'ENOTFOUND') {
console.error(' → DNS resolution failed. Check the url.');
} else if (error.code === 'ECONNREFUSED') {
console.error(' → Connection refused. The server may be down.');
} else if (error.name === 'AbortError') {
console.error(' → Request timed out.');
}
}
console.log('\n' + '='.repeat(60) + '\n');
}
// Run this example
basicGetRequest();

Run it with:

node your_script.js

Here is the output: 

Fetching: Wikipedia - Countries by Area
--------------------------------------------------
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-menu-disabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available skin-thumbsize-clientpref-standard" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>List of countries and dependencies by area - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-menu-disabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-mai
-----------------------------------------------------------

Retrieving JSON from an API

Many sites have API endpoints that return JSON. Hence, you could use response.json() to parse the Response body directly into a JavaScript object instead of retrieving it via HTML first. We will be using Open Library’s public API for this use case. 

1. Import node-fetch. Many requests without realistic headers are blocked; hence, fetch the URL with realistic headers: 

import fetch from 'node-fetch';
async function fetchBooksFromOpenLibrary() {
console.log('='.repeat(60));
console.log(' Section 2.2: Retrieving JSON from an API');
console.log('='.repeat(60));
const topic = 'science-fiction';
const url = `https://openlibrary.org/subjects/${topic}.json?limit=10`;
try {
console.log(`\n Fetching books from topic: "${topic}"`);
console.log(` url: ${url}\n`);
const response = await fetch(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Accept': 'application/json',
'Accept-Language': 'en-US,en;q=0.9',
},
signal: AbortSignal.timeout(10000), // 10-second timeout; node-fetch v3 removed the v2 timeout option
});

2. Check Response status. Always check if response.ok is true before parsing: 

console.log(` Status: ${response.status} ${response.statusText}`);
console.log(` Content-Type: ${response.headers.get('content-type')}\n`);
if (!response.ok) {
throw new Error(
`API Error: ${response.status} ${response.statusText} from ${url}`
);
}

3. Use response.json() to parse API endpoints. It reads the body and parses it as JSON in one step:

console.log(' Parsing JSON response...');
const data = await response.json();

Recall from the previous section that response.text() retrieves HTML. For JSON, use response.json() instead.

4. Organize the results.

console.log(`\n Success! Received API response\n`);
if (data.works && data.works.length > 0) {
console.log(` Found ${data.works.length} books:\n`);
data.works.forEach((book, index) => {
console.log(`${index + 1}. ${book.title}`);
console.log(` Author(s): ${book.authors?.map(a => a.name).join(', ') || 'Unknown'}`);
console.log(` First Published: ${book.first_publish_year || 'Unknown'}`);
console.log(` Edition Count: ${book.edition_count}`);
console.log('');
});
} else {
console.log('No books found for this topic.');
}

5. Put it all together. Here is the full running script:

import fetch from 'node-fetch';
async function fetchBooksFromOpenLibrary() {
console.log('='.repeat(60));
console.log(' Section 2.2: Retrieving JSON from an API');
console.log('='.repeat(60));
// Open Library API endpoint: Get books by subject/topic
// Documentation: https://openlibrary.org/developers/api
const topic = 'science-fiction';
const url = `https://openlibrary.org/subjects/${topic}.json?limit=10`;
try {
console.log(`\n Fetching books from topic: "${topic}"`);
console.log(` url: ${url}\n`);
// Step 1: Fetch the url with realistic headers
const response = await fetch(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Accept': 'application/json',
'Accept-Language': 'en-US,en;q=0.9',
},
signal: AbortSignal.timeout(10000), // 10-second timeout; node-fetch v3 removed the v2 timeout option
});
// Step 2: Check response status
console.log(` Status: ${response.status} ${response.statusText}`);
console.log(` Content-Type: ${response.headers.get('content-type')}\n`);
// Step 3: Always check response.ok before parsing
if (!response.ok) {
throw new Error(
`API Error: ${response.status} ${response.statusText} from ${url}`
);
}
// Step 4: Use response.json() for API endpoints
// This reads the body AND parses it as JSON in one step
console.log(' Parsing JSON response...');
const data = await response.json();
// Step 5: Display results
console.log(`\n Success! Received API response\n`);
if (data.works && data.works.length > 0) {
console.log(` Found ${data.works.length} books:\n`);
data.works.forEach((book, index) => {
console.log(`${index + 1}. ${book.title}`);
console.log(` Author(s): ${book.authors?.map(a => a.name).join(', ') || 'Unknown'}`);
console.log(` First Published: ${book.first_publish_year || 'Unknown'}`);
console.log(` Edition Count: ${book.edition_count}`);
console.log('');
});
} else {
console.log('No books found for this topic.');
}
// Step 6: Show structure of the API response
console.log(' API Response Structure:');
console.log(` Name: ${data.name}`);
console.log(` Work Count: ${data.work_count}`);
console.log(` Key: ${data.key}`);
} catch (error) {
console.error('\n Error during API call:');
console.error(` ${error.message}`);
if (error.message.includes('JSON')) {
console.error(' → The response was not valid JSON. The API may have failed.');
} else if (error.code === 'ENOTFOUND') {
console.error(' → DNS resolution failed. Check your internet connection.');
}
}
console.log('\n' + '='.repeat(60));
console.log('\n💡 Key Difference from HTML Scraping:');
console.log(' - HTML scraping: response.text() + Cheerio selector parsing');
console.log(' - API consumption: response.json() + direct data access');
console.log('\n' + '='.repeat(60) + '\n');
}
// Run
async function main() {
await fetchBooksFromOpenLibrary();
}
main();

Run it with:

node your_script.js

Here is the output:

Fetching books from topic: "science-fiction"
url: https://openlibrary.org/subjects/science-fiction.json?limit=10
Status: 200 OK
Content-Type: application/json
Parsing JSON response...
Success! Received API response
Found 10 books:
1. The Time Machine
Author(s): H. G. Wells
First Published: 1895
Edition Count: 1146
……………

Error handling

  • Wrap the call in a try-catch block to capture network errors such as DNS failures, refused connections, and AbortController timeouts
  • Inside the try block, check whether response.ok is true and, if it isn't, throw a custom error that includes the status and URL. This matters most in multi-page scraping, where logs must show which target failed and why
  • For long-running jobs, it’s best to classify failures. 4xx usually means a fix is needed, like a bad URL or a missing auth header, while 5xx and timeouts usually warrant a retry with backoff
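
One way to sketch that classification is a small retry wrapper. The name fetchWithRetry, its options, and the default values are assumptions for illustration, and the fetch function is passed in as a parameter so any fetch-compatible client can be used:

```javascript
// Small promise-based delay used between attempts
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: retries 5xx responses and network errors with
// exponential backoff, and fails fast on 4xx responses.
async function fetchWithRetry(fetchImpl, url, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    let failure;
    try {
      const response = await fetchImpl(url);
      if (response.ok) return response;
      if (response.status < 500) {
        // 4xx: the request itself needs fixing, so retrying will not help
        throw Object.assign(new Error(`HTTP ${response.status} for ${url}`), { retryable: false });
      }
      failure = new Error(`HTTP ${response.status} for ${url}`); // 5xx: worth retrying
    } catch (err) {
      if (err.retryable === false) throw err;
      failure = err; // network error or timeout: worth retrying
    }
    if (attempt >= retries) throw failure;
    // Exponential backoff: 500 ms, 1000 ms, 2000 ms with the default settings
    await sleep(baseDelayMs * 2 ** attempt);
  }
}

// Usage (assuming fetch is imported from node-fetch):
// const response = await fetchWithRetry(fetch, 'https://example.com/page/1');
```

Real jobs often add jitter to the delay and honor a Retry-After header when the server sends one. Both are left out here to keep the sketch short.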

Parsing and extracting data with Cheerio

After node-fetch returns the raw HTML from the Wikipedia country list page via a GET request, Cheerio parses it. Let’s see it in action while considering selector strategies, nested traversal, and how to avoid common scraping pitfalls. 

1. Pass the HTML into cheerio.load(). This returns a function, commonly named $, that mirrors jQuery's selector syntax. Calling $ with a selector returns a Cheerio object, a collection of DOM elements that you can query with built-in methods like .find(), .text(), and .attr('name'):

import * as cheerio from 'cheerio';
const $ = cheerio.load(html);
console.log($('title').text());

2. Prefer stable structural selectors, such as semantic HTML elements, table rows, or aria-label attributes, over fragile auto-generated class names from site-building frameworks. A class like ._eYtD2XCVieq6emjKBH3m is the kind of selector that breaks whenever the site is rebuilt. Structural selectors like table rows are far more durable:

// Good selectors
$('table tbody tr');
$('a[href]');
$('div[data-id]');
// Bad selectors
$('._xYz123'); // Minified classes break weekly
$('div:nth-child(1)'); // Position-based breaks

To master selector strategies, explore XPath vs. CSS Selectors for more guidance.

3. Iterate over a collection to extract data from similar rows. Always wrap the element in $ before calling Cheerio methods on it:

$('table tbody tr').each((i, el) => {
const $row = $(el);
const name = $row.find('td:nth-child(1)').text().trim();
console.log(`${i}: ${name}`);
});

4. Extract links and custom data attributes from href and data-* attributes. Resolve relative URLs with new URL(href, pageUrl).toString():

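// Select only anchors that actually carry an href attribute
$('a[href]').each((i, el) => {
  const href = $(el).attr('href');
  // Resolve the (possibly relative) href against the site's base URL
  const absUrl = new URL(href, 'https://en.wikipedia.org').toString();
  console.log(absUrl);
});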

Ensure that the base URL passed to the URL constructor is the address of the page you scraped; otherwise, the resolved links will be broken.
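
A short sketch shows why the base matters. The helper name toAbsoluteUrl is hypothetical; it relies only on the built-in URL class:

```javascript
// Hypothetical helper: resolve an href against the page it was found on.
function toAbsoluteUrl(href, pageUrl) {
  try {
    return new URL(href, pageUrl).toString();
  } catch {
    return null; // the href or the base could not be parsed as a URL
  }
}

const page = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area';
console.log(toAbsoluteUrl('/wiki/Russia', page)); // https://en.wikipedia.org/wiki/Russia
console.log(toAbsoluteUrl('Canada', page)); // https://en.wikipedia.org/wiki/Canada
console.log(toAbsoluteUrl('Canada', 'https://en.wikipedia.org')); // https://en.wikipedia.org/Canada
```

The last two calls resolve the same href to different addresses, which is why passing only the site origin can silently break path-relative links.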

5. Clean text with .text().trim() to remove whitespace. Collapse multi-line content into a single string with .replace(/\s+/g, ' '):

const messyText = $('p').text();
const clean = messyText.trim().replace(/\s+/g, ' ');
console.log(clean);

6. Parse defensively. Cheerio returns an empty collection (length of 0) when a selector matches nothing. Never assume a selector hit: store the result in a variable first, then guard with a length check before calling .text() on it:

const title = $('h1');
if (title.length) {
console.log(title.text());
} else {
console.log('No h1 found');
}

7. Put it all together. Here is the full running script:

import fetch from 'node-fetch';
import * as cheerio from 'cheerio';

async function scrapeCountries() {
  try {
    const url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area';
    const response = await fetch(url);
    // Confirm the request succeeded before parsing the body
    if (!response.ok) {
      throw new Error(`HTTP ${response.status} ${response.statusText} from ${url}`);
    }
    const html = await response.text();
    const $ = cheerio.load(html);
    const countries = [];
    // Each table row holds one entry; columns 2 and 3 hold the name and area
    $('table tbody tr').each((i, el) => {
      const $row = $(el);
      const name = $row.find('td:nth-child(2)').text().trim();
      const area = $row.find('td:nth-child(3)').text().trim();
      // Skip header rows and rows without a name cell
      if (name) countries.push({ name, area });
    });
    console.log(` Extracted ${countries.length} countries`);
    console.log(countries.slice(0, 5));
  } catch (error) {
    console.error(`Error: ${error.message}`);
  }
}

scrapeCountries();

Run it with:

node your_script.js

Here is the output:

Extracted 281 countries
[
{ name: 'Earth', area: '510,072,000 (196,940,000)' },
{ name: 'Russia', area: '17,098,246 (6,601,667)' },
{ name: 'Antarctica', area: '14,200,000 (5,480,000)' },
{ name: 'Canada', area: '9,984,670 (3,855,100)' },
{ name: 'China', area: '9,596,960 (3,705,410)' }
]

Tip: If your target site is a product, recipe, or article page, it often embeds JSON in a <script type="application/ld+json"> tag so machines can easily read product prices, author details, etc., via JSON Linked Data (JSON-LD). Parse this JSON-LD with Cheerio and JSON.parse. It’s usually more reliable than scraping the site’s rendered HTML:

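// A page can contain several JSON-LD blocks, so read only the first one here
const jsonLdScript = $('script[type="application/ld+json"]').first().text();
if (jsonLdScript) {
  const structuredData = JSON.parse(jsonLdScript);
  console.log(structuredData);
}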

For a deeper understanding, refer to the web scraping with Cheerio and Node.js guide.

Handling advanced request options: Headers, cookies, and POST

A bare fetch call sends a GET request with node-fetch's default User-Agent and no cookies. Many scraping targets reject it before you get any useful data.

Custom request headers

fetch() accepts a second argument, called the options object: a JavaScript object containing configuration values that control how the request is sent. One of the most important properties inside this object is headers.

The headers field is itself a JavaScript object that contains header configurations sent with the request. It allows the client to describe how it wants to communicate with the server, what content it accepts, and even what type of browser or application it appears to be. 

Note that the default node-fetch User-Agent string names the library itself (node-fetch v3 sends just node-fetch), which makes it one of the clearest signals that a request is automated. Replace it with a realistic Chrome or Firefox User-Agent string.

const response = await fetch(url, {
  headers: {
    // Full Chrome-style User-Agent, including the "(KHTML, like Gecko)" token real browsers send
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.google.com/'
  }
});

Don't reuse the same User-Agent across every request either. Rotate a small pool of realistic strings and keep them paired with matching sec-ch-ua client hints (HTTP headers that indicate which browser version sent the request). Mismatched hints create a fingerprinting inconsistency that anti-bot systems catch quickly. Also, for protected endpoints, add authorization headers directly:

headers: {
'Authorization': 'Bearer YOUR_TOKEN',
'X-Api-Key': 'YOUR_API_KEY'
}
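
The rotation advice above can be sketched as a round-robin over complete header profiles, so a User-Agent never travels without its matching client hints. The profile values and the names BROWSER_PROFILES and nextHeaders are illustrative assumptions, not a vetted list:

```javascript
// Illustrative profiles only: in practice, copy each set from a real browser session.
const BROWSER_PROFILES = [
  {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'sec-ch-ua': '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    'sec-ch-ua-platform': '"Windows"',
  },
  {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
    'sec-ch-ua': '"Google Chrome";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
    'sec-ch-ua-platform': '"macOS"',
  },
];

let nextProfile = 0;

// Hypothetical helper: returns the next profile in round-robin order,
// merged with any request-specific headers passed in.
function nextHeaders(extra = {}) {
  const profile = BROWSER_PROFILES[nextProfile];
  nextProfile = (nextProfile + 1) % BROWSER_PROFILES.length;
  return { ...profile, ...extra };
}

// Usage (assuming fetch is imported from node-fetch):
// const response = await fetch(url, { headers: nextHeaders({ Accept: 'text/html' }) });
```

Storing whole profiles rather than bare User-Agent strings is what keeps the browser version and platform consistent across headers.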

Cookies and sessions

node-fetch doesn't persist cookies between requests. There are two ways to handle this:

  1. Read the Set-Cookie headers manually. Take them from one response and forward the cookie values in the next request's Cookie header.
  2. Install fetch-cookie. It wraps node-fetch with a cookie jar (a storage object that automatically saves and sends cookies, just like a browser):

import fetchCookie from 'fetch-cookie';
import fetch from 'node-fetch';
const cookieFetch = fetchCookie(fetch);
// cookies are now persisted automatically across requests made with cookieFetch

This matters because many sites set a session ID or anti-bot challenge cookie on the first request and reject any follow-up request that doesn't echo it back. For token-style authentication, capture the cookie set after a login POST and reuse it for every protected page in the same session:

import fetch from 'node-fetch';

async function scrapeProtectedSession() {
  // 1. Send the login request (https://example.com is a placeholder URL)
  const loginResponse = await fetch('https://example.com', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ username: 'user', password: 'password' })
  });

  // 2. Capture the session cookies from the Set-Cookie headers.
  // headers.raw() is a node-fetch extension that keeps each Set-Cookie header separate.
  const setCookies = loginResponse.headers.raw()['set-cookie'] || [];
  // Keep only each cookie's name=value pair; attributes like Path or HttpOnly are not sent back
  const authCookie = setCookies.map((cookie) => cookie.split(';')[0]).join('; ');
  if (!authCookie) {
    throw new Error('Authentication failed: No cookie returned.');
  }

  // 3. Reuse the captured cookies for protected pages
  const dataResponse = await fetch('https://example.com', {
    method: 'GET',
    headers: {
      'Cookie': authCookie, // Injects the session token
      'User-Agent': 'ScraperBot/1.0'
    }
  });
  const protectedData = await dataResponse.json();
  console.log('Successfully fetched protected data:', protectedData);
}

POST requests and form submissions

A POST request isn't just about sending data; it’s the point where the request method, payload structure, and encoding format must all agree on how the data is interpreted on the server.

This is why method, body, and Content-Type are inherently coupled:

  • The method (POST). It signals that data is being sent for processing, not just retrieved.
  • The body. It contains the actual payload.
  • The Content-Type. It defines how that payload should be parsed on the server.

If any of these are mismatched, the server may receive the data but interpret it incorrectly.

GET requests work for static pages. However, you will need a POST request for search forms that return results only after submission, login endpoints that gate the content you need, or API endpoints that expect a JSON payload.

node-fetch handles all 3 cases with the same options object. The difference is in how you format the body and what Content-Type you declare:

// JSON API -- used when the endpoint expects a structured payload
const jsonResponse = await fetch('https://api.example.com/search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' }, // tells the server to expect JSON
  body: JSON.stringify({ query: 'web scraping', page: 1 }) // serializes the JS object to a JSON string
});

// HTML form -- used when replicating a standard web form submission
const formResponse = await fetch('https://example.com/search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' }, // standard form encoding
  body: new URLSearchParams({ q: 'node fetch', category: 'tech' }) // encodes key-value pairs the way a browser form would
});
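
To see how the same payload changes shape with the declared Content-Type, here is a minimal sketch. The helper name buildPostOptions is hypothetical and covers only the JSON and URL-encoded cases:

```javascript
// Hypothetical helper: pair each body format with its matching Content-Type.
function buildPostOptions(format, payload) {
  if (format === 'json') {
    return {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    };
  }
  if (format === 'form') {
    return {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body: new URLSearchParams(payload).toString(),
    };
  }
  throw new Error(`Unsupported format: ${format}`);
}

const payload = { q: 'node fetch', category: 'tech' };
console.log(buildPostOptions('json', payload).body); // {"q":"node fetch","category":"tech"}
console.log(buildPostOptions('form', payload).body); // q=node+fetch&category=tech
```

Building the body and the header in one place makes it harder for the two to drift apart.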

Here’s a realistic form-submission end-to-end scenario: 

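1. On the httpbin.org sample form (https://httpbin.org/forms/post), open DevTools (F12) and go to the Network tab.

2. Fill in the required fields and submit.

3. In the captured request, note the request URL, the request method, and the form data. httpbin echoes these back in the "url" and "form" fields of its JSON response.

4. Reproduce the request in your code and parse the response. httpbin returns JSON, so response.json() is enough here; for a target that returns HTML, parse it with Cheerio instead.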

import fetch from 'node-fetch';
async function submitForm() {
try {
console.log('='.repeat(60));
console.log('POST Form Submission Example');
console.log('='.repeat(60));
// Step 1: Prepare form data payload
const formData = {
custname: 'John Doe',
custtel: '123456789',
custemail: 'john@example.com',
size: 'medium',
topping: 'bacon',
comments: 'extra cheese',
};
console.log('\n[INFO] Form data payload:');
console.log(formData);
// Step 2: Make POST request with realistic headers
console.log('\n[INFO] Sending POST request...\n');
const response = await fetch('https://httpbin.org/post', {
method: 'POST',
headers: {
'Content-Type': 'application/x-www-form-urlencoded',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
'Referer': 'https://httpbin.org/forms/post',
},
body: new URLSearchParams(formData),
});
// Step 3: Check response status
if (!response.ok) {
throw new Error(`HTTP Error: ${response.status}`);
}
// Step 4: Parse and display response
const data = await response.json();
console.log('[INFO] Submitted form data:');
console.log(data.form);
console.log('\n' + '='.repeat(60));
console.log('Form submission completed successfully!\n');
} catch (error) {
console.error('\n[ERROR] Form submission failed:');
console.error(` ${error.message}`);
}
}
submitForm();

5. Run it with:

node your_script.js

6. Here’s the output: 

[INFO] Submitted form data:
{
comments: 'extra cheese',
custemail: 'john@example.com',
custname: 'John Doe',
custtel: '123456789',
size: 'medium',
topping: 'bacon'
}
============================================================
Form submission completed successfully!

This is useful for sites where you have to submit forms repeatedly. For your target site, check whether the form includes any hidden input fields, such as session tokens, CSRF tokens (a security value the server generates per session to verify the request came from a real form submission), or internal page identifiers the server validates before returning results. For multipart file uploads, send a FormData body (for example, via the form-data package) and let the library set the multipart/form-data Content-Type, because that header must include a generated boundary. This is less common in scraping but necessary when a form includes file inputs.

Query parameters and URL building

  • Build query strings with the native URL object rather than manual string concatenation, which breaks on special characters:

const url = new URL('https://example.com/listings');
url.searchParams.append('city', 'New York');
url.searchParams.append('page', '2');
const response = await fetch(url.toString());

  • Use URLSearchParams when iterating over a parameter map:

const params = new URLSearchParams({ category: 'electronics', sort: 'price_asc' });
const response = await fetch(`https://example.com/products?${params}`);
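
For paginated targets, the same approach extends to generating a whole list of URLs. This is a minimal sketch: the helper name buildPageUrls and the page parameter name are assumptions about the target site:

```javascript
// Hypothetical helper: build one URL per page, letting the URL object
// handle the encoding of spaces and special characters.
function buildPageUrls(baseUrl, params, pageCount) {
  const urls = [];
  for (let page = 1; page <= pageCount; page++) {
    const url = new URL(baseUrl);
    for (const [key, value] of Object.entries(params)) {
      url.searchParams.set(key, value);
    }
    url.searchParams.set('page', String(page));
    urls.push(url.toString());
  }
  return urls;
}

console.log(buildPageUrls('https://example.com/listings', { city: 'New York' }, 2));
// [
//   'https://example.com/listings?city=New+York&page=1',
//   'https://example.com/listings?city=New+York&page=2'
// ]
```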

Parallel and efficient fetching with node-fetch

Parallelism in the context of HTTP requests means executing multiple network calls simultaneously rather than waiting for each to finish before starting the next. In Node.js, this matters because HTTP requests are I/O bound. While a request is in flight, the event loop is not doing CPU work.

Why parallel fetching matters

Sequential requests waste that idle time: Node's event loop sits waiting on the network between calls. Running requests in parallel can cut a 60-second job down to a few seconds.

But unbounded parallelism backfires fast: it exhausts socket pools, triggers rate limits, and burns through proxy quotas. Choose the concurrency pattern based on the number of URLs you need to scrape.

Let’s look at 3 different patterns.

Pattern 1: Promise.all for fixed small batches

Promise.all takes an array of Promises and resolves when all of them complete. It's the right tool for a known, small list (10-50 URLs). The catch is that Promise.all rejects on the first failure. Wrap each fetch in a try-catch block that returns a result object instead of rejecting, so the batch can finish, and you collect partial results:

const results = await Promise.all(
urls.map(async (url) => {
try {
const res = await fetch(url);
const html = await res.text();
return { ok: true, url, html };
} catch (err) {
return { ok: false, url, error: err.message };
}
})
);

A safer alternative is Promise.allSettled, which returns one result per Promise (either fulfilled or rejected) without short-circuiting on failure. For scraping, this is usually the right call.
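A minimal sketch of the Promise.allSettled variant follows. The function name fetchAllSettled and the injectable fetchFn parameter are assumptions made for this example. The default relies on the global fetch in Node 18+, and you can pass node-fetch's fetch instead.

```javascript
// Sketch: fetch every URL and keep both successes and failures.
// fetchFn defaults to the global fetch (Node 18+); pass node-fetch's fetch if you use it.
async function fetchAllSettled(urls, fetchFn = fetch) {
  const settled = await Promise.allSettled(
    urls.map(async (url) => {
      const res = await fetchFn(url);
      // Treat HTTP error statuses as failures too
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return res.text();
    })
  );
  // allSettled preserves input order, so index i maps back to urls[i]
  return settled.map((outcome, i) =>
    outcome.status === 'fulfilled'
      ? { ok: true, url: urls[i], html: outcome.value }
      : { ok: false, url: urls[i], error: outcome.reason.message }
  );
}
```

The returned array has the same shape as the try-catch version above, so downstream code can filter on ok either way.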

Pattern 2: Bounded concurrency with p-limit

For thousands of URLs, cap the number of in-flight requests using p-limit, a package that limits how many async functions run at the same time. Start with 5–10 concurrent requests per domain to avoid triggering rate limits:

import pLimit from 'p-limit';
import fetch from 'node-fetch';
const limit = pLimit(5); // max 5 requests at a time
const results = await Promise.all(
urls.map(url => limit(async () => {
const res = await fetch(url);
return res.text();
}))
);

Bounded concurrency with p-limit holds up better than an unbounded Promise.all at scale because it keeps you under rate limits, keeps memory bounded, and can be matched to your proxy pool size. See concurrency vs parallelism if the distinction between the two needs more context.

Pattern 3: Streaming queues for large jobs

When the URL list comes from a database or a streamed sitemap, use an async generator (a function that yields values one at a time, on demand) paired with a worker pool – a fixed group of concurrent workers that continuously pull the next available URL as soon as one finishes processing. This approach limits concurrent requests while keeping the crawler fast, and avoids loading every URL into memory upfront. 
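One way to sketch this pattern is shown below. urlSource and runPool are hypothetical names, and the in-memory array stands in for a database cursor or streamed sitemap. The handler is whatever fetch-and-parse function you already have.

```javascript
// Hypothetical URL source: in a real job this would read rows from a database
// cursor or a streamed sitemap instead of an in-memory array.
async function* urlSource(urls) {
  for (const url of urls) {
    yield url;
  }
}

// Runs `concurrency` workers that each pull the next URL as soon as they are free.
async function runPool(source, concurrency, handler) {
  const results = [];
  const worker = async () => {
    while (true) {
      // Async generators queue concurrent next() calls, so sharing one is safe
      const { value: url, done } = await source.next();
      if (done) return;
      try {
        results.push({ ok: true, url, data: await handler(url) });
      } catch (err) {
        // Record the failure and keep the worker alive for the next URL
        results.push({ ok: false, url, error: err.message });
      }
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```

For very large jobs, the handler would typically write each result to disk or a database rather than returning it, so the results array stays small.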

Pair this with AbortController to cancel hung requests after a timeout. AbortController is a built-in Web and Node.js API used to cancel asynchronous operations, most commonly HTTP requests made with fetch():

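// fetchWithTimeout is an illustrative wrapper so that `return` is valid
async function fetchWithTimeout(url, ms = 30000) { // 30s timeout by default
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), ms);
  try {
    const res = await fetch(url, { signal: controller.signal });
    return await res.text();
  } catch (err) {
    if (err.name === 'AbortError') {
      console.log(`Timed out: ${url}`);
      return null;
    }
    throw err; // surface non-timeout errors to the caller
  } finally {
    clearTimeout(timeout); // clear the timer on success and on failure
  }
}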

Retry, backoff, and idempotency

Retry only on idempotent failures, which are failures where repeating the same request is unlikely to create duplicate side effects or change the intended outcome. In practice, this includes transient network failures, 429 Too Many Requests rate limits, and temporary server-side errors such as 502, 503, and 504.

By contrast, most 4xx client errors indicate that the request itself is invalid or unauthorized. Retrying 403 Forbidden or 404 Not Found usually wastes bandwidth because the problem is not temporary server instability; the URL, permissions, or request parameters are the issue, not the transport layer. Use exponential backoff with jitter: double the wait after each failed attempt and add a small random delay so that retries from parallel workers don't all hit the server at the same moment:

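const wait = (ms) => new Promise(resolve => setTimeout(resolve, ms));
// Exponential backoff (1s, 2s, 4s, ...) plus up to 500 ms of random jitter
const backoff = (attempt) => Math.pow(2, attempt) * 1000 + Math.random() * 500;
async function fetchWithRetry(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    const isLastAttempt = i === retries - 1;
    try {
      const res = await fetch(url);
      if (res.ok) return res;
      // Only 429 and 5xx are worth retrying; a 403 or 404 won't fix itself
      const retryable = res.status === 429 || res.status >= 500;
      if (!retryable || isLastAttempt) return res; // let the caller inspect the status
    } catch (err) {
      // Network-level failure: give up only after the final attempt
      if (isLastAttempt) throw err;
    }
    await wait(backoff(i));
  }
}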

Constant-interval retries (retrying every 5 seconds on the dot) are themselves a bot signal. For more on rate-limit handling, see the YouTube error 429 guide.

Managing proxies and CAPTCHAs in node-fetch

node-fetch doesn't have built-in proxy support. You wire it in manually using a custom agent, an object that controls how the underlying TCP (Transmission Control Protocol) connection is made. TCP is one of the core communication protocols of the internet: when you use something like node-fetch to make an HTTP request, that request ultimately travels over TCP.

Routing requests through a proxy

1. Install https-proxy-agent.

npm install https-proxy-agent

2. Use Decodo’s residential proxies as your proxy service provider. Extract your proxy credentials as described in the Decodo documentation, build an HttpsProxyAgent from them, then pass it as the agent option.

import fetch from 'node-fetch';
import { HttpsProxyAgent } from 'https-proxy-agent';
const proxyAgent = new HttpsProxyAgent(
'http://YOUR_PROXY_USERNAME:YOUR_PROXY_PASSWORD@gate.decodo.com:7000'
);
// Tip: load the proxy credentials from environment variables instead of hard-coding them
const response = await fetch('https://target-site.com', { agent: proxyAgent });
const html = await response.text();
console.log(html);

3. Build a small helper that returns the right agent based on protocol, for scrapes that hit both HTTP and HTTPS targets: 

import { HttpsProxyAgent } from 'https-proxy-agent';
import { HttpProxyAgent } from 'http-proxy-agent';
function getAgent(targetUrl, proxyUrl) {
return targetUrl.startsWith('https')
? new HttpsProxyAgent(proxyUrl)
: new HttpProxyAgent(proxyUrl);
}

Datacenter proxies are cheap but easy to fingerprint: they come from known hosting IP ranges that anti-bot systems maintain blocklists for. Residential proxies route through real consumer IPs and are significantly harder to detect.

Decodo’s residential proxy network (115M+ IPs) handles rotation through rotating endpoints and session-based controls. A rotating endpoint can send each request through a different IP, so you don't have to manage the rotation manually.

Instead of reusing a single IP for an entire job, you either pull from a rotating endpoint or cycle through proxies in the pool, which helps distribute traffic, reduce blocks, and keep large-scale requests stable. See what are rotating proxies? for how rotation works in practice.
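If you cycle through a pool yourself rather than using a rotating endpoint, a simple round-robin selector is often enough to start with. The sketch below is one possible shape. createProxyRotator is a hypothetical helper, and the proxy URLs are placeholders, not real endpoints.

```javascript
// Sketch of client-side rotation: returns a function that cycles through a pool.
function createProxyRotator(proxyUrls) {
  if (proxyUrls.length === 0) throw new Error('Proxy pool is empty');
  let index = 0;
  return function nextProxy() {
    const proxyUrl = proxyUrls[index];
    index = (index + 1) % proxyUrls.length; // wrap around at the end of the pool
    return proxyUrl;
  };
}

// Placeholder proxy URLs, for illustration only
const nextProxy = createProxyRotator([
  'http://user:pass@proxy-1.example.com:8000',
  'http://user:pass@proxy-2.example.com:8000',
]);

// Per request, build a fresh agent from the next proxy in the pool:
//   const agent = new HttpsProxyAgent(nextProxy());
//   const res = await fetch(url, { agent });
console.log(nextProxy(), nextProxy(), nextProxy());
```

Round-robin spreads traffic evenly. A production setup might also drop proxies that keep failing, which this sketch does not attempt.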

Handling CAPTCHAs

node-fetch can't solve CAPTCHAs. It doesn't render JavaScript or interact with challenge widgets. There are 3 options when you hit one:

  1. Integrate a CAPTCHA-solving service and POST the token back to the form.
  2. Avoid triggering them in the first place. Slow requests down, use clean residential IPs, and rotate User-Agent strings.
  3. Hand the request to a managed scraping API that solves CAPTCHAs transparently.

Practical signals that your scraper has hit a CAPTCHA wall: 

  • a 403 status
  • a response body containing "captcha", "challenge", or "verify you are human"
  • a redirect to a /challenge URL
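These signals can be folded into a small heuristic check. The sketch below is deliberately crude: looksLikeCaptcha is a hypothetical helper, and a 403 or the word "challenge" can also show up on legitimate pages, so treat a positive result as a hint to investigate rather than proof.

```javascript
const CAPTCHA_MARKERS = ['captcha', 'challenge', 'verify you are human'];

// Heuristic only: flags a response that matches any of the signals listed above.
function looksLikeCaptcha({ status, finalUrl, body }) {
  if (status === 403) return true;
  // finalUrl is the URL after redirects (res.url on a fetch response)
  if (new URL(finalUrl).pathname.startsWith('/challenge')) return true;
  const text = body.toLowerCase();
  return CAPTCHA_MARKERS.some((marker) => text.includes(marker));
}

// With node-fetch, something like:
//   const body = await res.text();
//   looksLikeCaptcha({ status: res.status, finalUrl: res.url, body });
```

When the check fires, back off and switch IP or hand the URL to a managed API instead of retrying immediately.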

For a full breakdown of bypass strategies, see how to bypass CAPTCHAs and anti-scraping techniques and how to outsmart them.

When to escalate to a managed scraping API

If you're already rotating proxies, randomizing headers, and retrying with backoff, yet roughly one in every 10 scraping attempts still ends in a retry, the target's anti-bot stack is winning.

Decodo's Web Scraping API handles JavaScript rendering, proxy rotation, header fingerprinting, and CAPTCHA solving within a single request. To use it with node-fetch, send your request to the Decodo API endpoint instead of the target site, pass the target URL in the request body, and include your credentials in the Authorization header. Your parsing pipeline (Cheerio plus your selectors) doesn't change at all. Get your Decodo Web Scraping API details here. Here is a sample implementation of a Web Scraping API with node-fetch and Cheerio:

// require() works with node-fetch v2; v3 is ESM-only, so use import there
const fetch = require('node-fetch');
const cheerio = require('cheerio');
async function scrapeWithDecodo() {
// 1. Define your Decodo API endpoint and credentials
const API_URL = 'https://scraper-api.decodo.com/v2/scrape';
const TOKEN = 'YOUR_BASE64_ENCODED_CREDENTIALS'; // Found in your Decodo dashboard
// 2. Prepare the request payload
const payload = {
url: 'https://example.com', // The actual website you want to scrape
target: 'universal', // Use 'universal' to get raw HTML
headless: 'html' // Options include 'html' or 'markdown'
};
try {
// 3. Send the request via node-fetch
const response = await fetch(API_URL, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Basic ${TOKEN}`,
'Accept': 'application/json'
},
body: JSON.stringify(payload)
});
if (!response.ok) {
throw new Error(`Error: ${response.statusText}`);
}
const data = await response.json();
// 4. Use Cheerio to parse the returned HTML as usual
const $ = cheerio.load(data.content);
const pageTitle = $('title').text();
console.log('Page Title:', pageTitle);
} catch (error) {
console.error('Scraping failed:', error);
}
}
scrapeWithDecodo();

Note that Decodo residential proxies are the right choice when you want to keep request logic in your own code while outsourcing only the IP pool. That fits projects already invested in their own retry and header stack. For more, see how to bypass anti-bot systems.

Before sending a request, run through these checks. They take about 5 minutes and can save you from a legal dispute.

  • Check robots.txt first. The robots.txt file lists which paths a site asks crawlers to avoid. node-fetch can retrieve it directly:
const res = await fetch('https://target-site.com/robots.txt');
const rules = await res.text();
console.log(rules);

Use the robots-parser package to programmatically check whether a URL is allowed before adding it to your queue, as shown below.

import fetch from "node-fetch";
import robotsParser from "robots-parser";
const robotsTxt = await fetch("https://target-site.com/robots.txt").then(r => r.text());
const robots = robotsParser("https://target-site.com/robots.txt", robotsTxt);
const url = "https://target-site.com/private-page";
if (robots.isAllowed(url, "MyCrawler")) {
console.log("Allowed → add to queue");
} else {
console.log("Disallowed → skip");
}
  • Read the Terms of Service. Many sites prohibit automated access even when robots.txt is silent. Public, non-logged-in data is generally safer to scrape than data behind authentication walls.
  • Respect rate limits and Retry-After headers. When a server sends a Retry-After header with a 429 response, read it and wait that long before retrying. Aggressive scraping that ignores rate limits can constitute a denial-of-service attack in some jurisdictions.
  • Handle personal data carefully. Avoid collecting PII (Personally Identifiable Information, such as names, email addresses, or phone numbers) without a lawful basis under GDPR (EU), CCPA (California), or similar data protection laws.
  • Cache and deduplicate. Don't re-fetch the same URL hourly when daily is enough. It's cheaper, faster, and more respectful of the target's infrastructure.
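For the Retry-After point, the header value is either a number of seconds or an HTTP date, so it needs a little parsing before you can wait on it. The sketch below shows one way to handle both forms. retryAfterMs and its 1-second fallback are assumptions for illustration.

```javascript
// Converts a Retry-After header value into a wait time in milliseconds.
// The value is either a number of seconds or an HTTP date.
function retryAfterMs(headerValue, fallbackMs = 1000, now = Date.now()) {
  if (!headerValue) return fallbackMs; // header missing: use the fallback
  const seconds = Number(headerValue);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const date = Date.parse(headerValue);
  if (!Number.isNaN(date)) return Math.max(0, date - now);
  return fallbackMs; // unparseable value: use the fallback
}

// With node-fetch, something like:
//   if (res.status === 429) {
//     const waitMs = retryAfterMs(res.headers.get('retry-after'));
//     await new Promise((resolve) => setTimeout(resolve, waitMs));
//   }
console.log(retryAfterMs('120'));
```

Honoring the server's own value usually beats a guessed backoff, because it tells you exactly when the rate-limit window reopens.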

When in doubt, prefer the site's official API or a licensed data feed over scraping. See Is web scraping legal? for a full breakdown of laws and cases, and how to check if a website allows scraping for a practical pre-scrape checklist.

Conclusion

Node Fetch and Cheerio form a lightweight web scraping toolkit for Node.js that does not require heavy browser automation. Although Node.js now has a built-in Fetch API, node-fetch is still useful for CommonJS projects and works with compatible plugins like fetch-cookie.

node-fetch is versatile enough for advanced request handling and efficient parallel fetching, but it lacks built-in proxy support, CAPTCHA handling, and JavaScript rendering. Decodo’s rotating residential proxies address IP-based blocking.

When that is not enough, escalate to Decodo’s Web Scraping API, which handles proxy rotation, CAPTCHA solving, header fingerprinting, and JavaScript rendering while you keep calling it from node-fetch. This is the right path for production-grade web scraping with Node.js.

About the author

Lukas Mikelionis

Senior Account Manager

Lukas is a seasoned enterprise sales professional with extensive experience in the SaaS industry. Throughout his career, he has built strong relationships with Fortune 500 technology companies, developing a deep understanding of complex enterprise needs and strategic account management.

Connect with Lukas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently Asked Questions

Is node-fetch still needed in 2026 now that Node.js has a built-in fetch?

Native fetch has been available by default since Node 18 (it was marked stable in Node 21) and is fine for most new projects. However, node-fetch is still useful for codebases pinned to older Node versions, CommonJS projects that prefer require() (which means node-fetch v2, since v3 is ESM-only), and ecosystems built around node-fetch-compatible plugins like fetch-cookie. For a comparison with other HTTP clients, see HTTPX vs Requests vs aiohttp.

Can node-fetch scrape JavaScript-rendered pages?

No. node-fetch returns whatever the server sends in the initial HTML response. Pages that build their DOM client-side will appear empty. Render them with a headless browser (Puppeteer, Playwright) or route the request through Decodo's Web Scraping API, which handles JavaScript rendering.

How do I add a proxy to node-fetch?

Install https-proxy-agent, instantiate it with your proxy URL (with credentials), and pass it as the agent option on each fetch call. For rotating IPs, swap the agent on every request. For proxy rotation strategies, see what are rotating proxies? and how to bypass anti-bot systems.

Why does my scraper get blocked even though I set a User-Agent?

A single header rarely fools modern anti-bot systems. They also fingerprint TLS, header order, missing client hints, and request timing. Combine realistic header sets, randomized delays, residential proxies, and, when needed, a managed scraping API.

node-fetch vs. axios – which is better for scraping?

Both work. node-fetch mirrors the browser standard, is unopinionated, and is lightweight. axios offers built-in interceptors, automatic JSON parsing, and request cancellation out of the box. For new scraping projects, native fetch or node-fetch is usually enough; reach for axios when your existing code already relies on interceptors as middleware that automatically handles repetitive tasks on every request. Understanding concurrency vs parallelism will help you choose the right approach for your needs.

© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved