Web Scraping With Node Fetch: A Practical Guide
Web scraping with Node Fetch offers a lightweight way to collect data in Node.js. By fetching raw HTML or JSON responses and pairing them with parsers like Cheerio, developers can transform unstructured pages into structured datasets. This Node Fetch tutorial explains request handling, response parsing, data extraction, proxy integration, and when managed scraping APIs are necessary to effectively bypass advanced anti-bot protections.
Lukas Mikelionis
Last updated: Jun 11, 2026
17 min read

TL;DR
- Achieve lightweight web scraping on Node.js by sending GET/POST requests with node-fetch and parsing the response body with Cheerio
- There are 3 concurrency patterns for effective parallel fetching with node-fetch: Promise.all, bounded concurrency with p-limit, and streaming queues
- Handle advanced request options by rotating a small pool of realistic User-Agent strings, installing fetch-cookie, and using POST for form submissions
- Manage anti-scraping mechanisms like CAPTCHAs in node-fetch scrapers through proxies or a third-party managed web scraping API
What is node-fetch and how it relates to the Fetch API
The Fetch API is a WHATWG (Web Hypertext Application Technology Working Group) Specification that standardizes how JavaScript applications send HTTP requests and handle Responses when communicating with APIs or fetching resources from a server. Browsers have supported the Fetch API natively for roughly a decade.
node-fetch is a lightweight Node.js library that brought the same Fetch API syntax to server-side JavaScript before Node.js added native support via fetch(). Developers like node-fetch because it makes HTTP requests with the familiar fetch() syntax.
The key design difference between the Fetch API (fetch/node-fetch) and older JavaScript networking approaches is that the built-in Node.js HTTP module relied heavily on callbacks, which often led to nested and difficult-to-read code. Browsers also used XMLHttpRequest (XHR), an older API that required significantly more setup just to send a simple request.
Then the Fetch API simplified this process by returning a Promise. A Promise is a JavaScript object that represents a future result. So, instead of passing callback functions around, developers can wait for the result with async/await, making asynchronous code read more like regular synchronous code.
The Promise returned by fetch() resolves as soon as the Response headers arrive. At that point, the full Response body may still be downloading because it's exposed as a ReadableStream. A ReadableStream is a stream-based interface that delivers data in small chunks instead of loading the entire Response into memory at once. This is useful for large files, APIs with continuous data, or streaming content because the application can begin processing data immediately while the remaining chunks are still arriving.
You can consume the stream in different formats depending on the response type:
- response.text(). Returns plain text, such as HTML or raw content.
- response.json(). Parses the Response as JSON and returns a JavaScript object.
- response.buffer(). Returns binary data as a Buffer in node-fetch (v3 marks it as deprecated in favor of response.arrayBuffer()).
- response.body. Exposes the raw stream for manual stream handling (a Node.js Readable stream in node-fetch).
HTTP error status codes don't reject the Promise. A 404 or 500 response still resolves with a Response object. Only network-level failures, such as a DNS error, a refused connection, or an aborted request, reject the Promise. Always confirm response.ok (true for 200-299) or response.status before passing the body to a parser.
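As a minimal sketch of that check, the helper below throws on any non-2xx status before the body is read. The name fetchText and the fetchImpl parameter are assumptions made for illustration; fetchImpl defaults to Node's built-in fetch so the snippet runs without installing anything, and passing the function imported from node-fetch should behave the same way.

```javascript
// Sketch: fail loudly on non-2xx responses before handing the body to a parser.
// `fetchText` and `fetchImpl` are illustrative names, not part of node-fetch.
async function fetchText(url, fetchImpl = globalThis.fetch) {
  // This await rejects only on network-level failures (DNS, refused connection, abort).
  const response = await fetchImpl(url);

  // A 404 or 500 still resolves, so the status has to be checked by hand.
  if (!response.ok) {
    throw new Error(`Request to ${url} failed with HTTP ${response.status}`);
  }

  // Reading the body is asynchronous because it is streamed.
  return response.text();
}
```

A caller would typically wrap this in try-catch so that both network failures and HTTP errors end up in the same handler.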
Recent Node.js versions, starting with Node 18, ship with a built-in global fetch based on undici, an HTTP client library for Node.js. So why still use the node-fetch library? There are 3 reasons:
- Projects pinned to Node versions older than 18
- Teams that need the v2 CommonJS require() syntax rather than ESM (ECMAScript Modules) import
- Codebases built around node-fetch-compatible plugins like fetch-cookie or node-fetch-har
Note. For new projects, use node-fetch v3 (ESM). For CommonJS codebases, v2 is the stable choice.
For a Python equivalent of a Node.js scraping task, see HTTPX vs. Requests vs. aiohttp. Also, if you're comparing both ecosystems while learning backend development, The Best Python HTTP Clients is a useful follow-up because it shows how Python libraries like Requests, HTTPX, and aiohttp solve many of the same problems in different ways.
Setting up the project and installing dependencies
The first step to web scraping with Node Fetch is setting up a Node.js environment.
1. Install Node.js. Download and install Node.js 18 or newer (Long-Term Support recommended) from the official Node.js website. If you already use Node Version Manager (NVM), you can install Node.js from the terminal:
On Windows (PowerShell):
On Mac:
2. Verify your Node version. Restart your terminal before verifying if you just installed Node.js. In case you already have Node.js installed, verify your Node version:
3. Initialize your node-fetch project folder:
4. Create your package.json file. It is the hub for your project dependencies:
5. Update your package.json file to include "type": "module". This tells Node.js to treat .js files as ES modules, which node-fetch v3 requires because it is ESM-only.
An alternative way to support ES modules is to store all your files with the .mjs extension.
6. Dependencies to install:
- node-fetch. The HTTP Client itself. In this article, we will install node-fetch for its compatibility and other advantages over the built-in Fetch API
- cheerio. It is the server-side HTML parser that enables data retrieval in jQuery style
- https-proxy-agent. It is needed later in the tutorial for routing requests through an HTTP/HTTPS proxy
- dotenv (optional). It is useful for storing proxy credentials and other sensitive values in environment variables instead of hardcoding them
- fetch-cookie (optional). It preserves cookies across requests, useful for session-based targets
Note. If your Node.js version is 18 or above, the Fetch API (fetch()) is automatically available in your development environment. However, this guide installs node-fetch as a dependency for its plugin ecosystem, community support, and scraping customization. Keep in mind that node-fetch version 3 is ESM-only (ECMAScript Modules): it must be loaded with import syntax and cannot be loaded with require().
Project structure recommendation
Here’s a recommended project structure for production scraping workflows:
- A fetcher module. It wraps node-fetch with default headers, timeouts, and retry logic, keeping the rest of the app clean.
- A parser folder. It contains a parsing file(s) per target URL responsible for exporting a pure function that takes HTML and returns structured data.
- A runner module. It organizes the URL list and outputs (CSV/JSON/database) appropriately.
Separating fetching, parsing, and orchestration lets you swap node-fetch for advanced scraping services, such as a third-party scraping API, later, without rewriting parsing logic when scraping at scale in production.
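To make the separation concrete, here is a rough sketch of the three pieces in one file. All names (createFetcher, parseTitle, run) are hypothetical, and the regex-based parser is only a placeholder for a real Cheerio parser; the point is that the runner depends on two small functions that can be swapped independently.

```javascript
// fetcher: wraps any fetch implementation (node-fetch or built-in) with default headers.
function createFetcher(fetchImpl, defaultHeaders = {}) {
  return async function fetchHtml(url, options = {}) {
    const headers = { ...defaultHeaders, ...(options.headers || {}) };
    const response = await fetchImpl(url, { ...options, headers });
    if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
    return response.text();
  };
}

// parser: a pure function from HTML to structured data.
// Placeholder logic only; a real parser would use Cheerio selectors for the target page.
function parseTitle(html) {
  const match = /<title>([^<]*)<\/title>/i.exec(html);
  return { title: match ? match[1].trim() : null };
}

// runner: walks the URL list and combines fetching with parsing.
async function run(urls, fetchHtml, parse) {
  const rows = [];
  for (const url of urls) {
    rows.push({ url, ...parse(await fetchHtml(url)) });
  }
  return rows;
}
```

Swapping node-fetch for a managed scraping API later would then mean passing a different fetchHtml to run, while parseTitle stays untouched.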
Basic fetch requests and data retrieval with Node Fetch
Let’s start by building a Node.js scraper that retrieves HTML so it can be parsed for data. We will then build a Node.js scraper that retrieves JSON data via an API call, without parsing HTML.
Sending a GET request
1. Import node-fetch. Then call fetch(url) with a target URL string, and await the returned Promise to receive a Response. Our target URL for this tutorial is the Wikipedia country list:
2. Inspect the Response object.
Ensure response.ok is true (status 200-299) before passing the body to a parser.
3. Read the Response body.
Note that reading the Response body is asynchronous because the body is streamed from the server, so await response.text() (or a similar method) to get the complete content.
4. Put it all together. Here is the full running script:
Run it with:
Here is the output:
Retrieving JSON from an API
Many sites have API endpoints that return JSON. Hence, you could use response.json() to parse the Response body directly into a JavaScript object instead of retrieving it via HTML first. We will be using Open Library’s public API for this use case.
1. Import node-fetch. Many requests without realistic headers are blocked; hence, fetch the URL with realistic headers:
2. Check Response status. Always check if response.ok is true before parsing:
3. Use response.json() to parse the API response. It reads the body and parses it as JSON in one step:
Recall from the previous section that response.text() is used to retrieve HTML; for JSON, use response.json() instead.
4. Organize the results.
5. Put it all together. Here is the full running script:
Run it with:
Here is the output:
Error handling
- Wrap the call in a try-catch block to capture network errors such as DNS failures, refused connections, and AbortController timeouts
- Inside the try block, check whether response.ok is true and throw a custom error that includes the status and url(s), especially in multi-page scraping, so logs show which target failed and why
- For long-running jobs, it’s best to classify failures. 4xx usually means a fix is needed, like a bad URL or a missing auth header, while 5xx and timeouts usually warrant a retry with backoff
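The three points above could be combined roughly as follows. HttpError, classifyFailure, and fetchOrReport are hypothetical names, and treating 429 as retryable is an assumption borrowed from the retry guidance later in this guide; fetchImpl defaults to the built-in fetch but the function imported from node-fetch can be passed instead.

```javascript
// Custom error that records which target failed and with what status.
class HttpError extends Error {
  constructor(status, url) {
    super(`HTTP ${status} for ${url}`);
    this.name = 'HttpError';
    this.status = status;
    this.url = url;
  }
}

// Classify a failure as 'retry' (likely transient) or 'fix' (the request needs changing).
function classifyFailure(error) {
  if (error instanceof HttpError) {
    if (error.status === 429 || error.status >= 500) return 'retry';
    return 'fix'; // other 4xx: bad URL, missing auth header, and so on
  }
  return 'retry'; // network-level errors and timeouts
}

// Returns a result object instead of throwing, so a multi-page job can keep going.
async function fetchOrReport(url, fetchImpl = globalThis.fetch) {
  try {
    const response = await fetchImpl(url);
    if (!response.ok) throw new HttpError(response.status, url);
    return { url, ok: true, body: await response.text() };
  } catch (error) {
    return { url, ok: false, action: classifyFailure(error), reason: error.message };
  }
}
```

Logging the returned action and reason per URL makes it easier to see which targets need a code change and which only need another attempt.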
Parsing and extracting data with Cheerio
After node-fetch returns the raw HTML from the Wikipedia country list page via a GET request, Cheerio parses it. Let’s see it in action while considering selector strategies, nested traversal, and how to avoid common scraping pitfalls.
1. Pass the HTML into cheerio.load(). This returns a function, commonly named $, that mirrors jQuery’s selector syntax. The $ function returns a Cheerio object, which acts as a collection of DOM elements ready to be queried using built-in methods like .find(), .text(), and .attr('name'):
2. Prefer stable structural selectors like semantic HTML, table rows, or aria-label instead of fragile auto-generated class names from site-building frameworks. For example, a class like (._eYtD2XCVieq6emjKBH3m) is the kind of selector that breaks weekly. Reliable selectors like table rows, on the other hand, are more sustainable:
To master selector strategies, explore XPath vs. CSS Selectors for more guidance.
3. Iterate over a collection to extract data from similar rows. Always wrap the element in $ before calling Cheerio methods on it:
4. Extract links and custom data attributes from href and data-* attributes (for example, data-id). Resolve relative URLs with new URL(href, pageUrl).toString():
Ensure that the base URL passed to the URL constructor is the URL of the page you actually fetched; otherwise, the resolved links will be broken.
5. Clean text with .text().trim() to remove surrounding whitespace. Collapse multi-line content into a single string with .replace(/\s+/g, ' '):
6. Parse defensively. Cheerio returns an empty collection (length === 0) when no element matches the selector. Never assume a selector hit: store the selector result in a variable first, then guard with a length check before calling .text() on it:
7. Put it all together. Here is the full running script:
Run it with:
Here is the output:
Tip: If your target site is a product, recipe, or article page, it often embeds JSON in a <script type="application/ld+json"> tag so machines can easily read product prices, author details, etc., via JSON Linked Data (JSON-LD). Extract it by selecting that tag with Cheerio and passing its contents to JSON.parse. It’s often more reliable than scraping the site’s rendered HTML:
For a deeper understanding, refer to the web scraping with Cheerio and Node.js guide
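Two of the steps above, resolving relative links (step 4) and cleaning text (step 5), rely only on built-in JavaScript, so they can be sketched without Cheerio. The helper names resolveLink and cleanText and the example.com URLs are assumptions made for illustration; in a real scraper the inputs would come from .attr('href') and .text().

```javascript
// Resolve an href against the URL of the page it was found on.
// Returns null instead of throwing when the base URL is not a valid absolute URL.
function resolveLink(href, pageUrl) {
  try {
    return new URL(href, pageUrl).toString();
  } catch {
    return null;
  }
}

// Collapse runs of whitespace (including newlines and tabs) and trim the ends.
function cleanText(raw) {
  return raw.replace(/\s+/g, ' ').trim();
}

// Hypothetical values, as they might come out of a Cheerio selection.
const pageUrl = 'https://example.com/countries/list';
const absolute = resolveLink('/wiki/France', pageUrl);
const relative = resolveLink('details?id=7', pageUrl);
const label = cleanText('  New\n   York \t City ');
```

Returning null for an unusable base URL keeps one bad row from aborting the whole parse; the caller can filter those rows out or log them.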
Handling advanced request options: Headers, cookies, and POST
A bare fetch call sends a GET request with node-fetch's default User-Agent and no cookies. Many scraping targets with bot protection reject it before you get any useful data.
Custom request headers
fetch() accepts a second argument, called the options object — a JavaScript object containing configuration values that control how the request is sent. One of the most important properties inside this object is headers.
The headers field is itself a JavaScript object that contains header configurations sent with the request. It allows the client to describe how it wants to communicate with the server, what content it accepts, and even what type of browser or application it appears to be.
Note that the default node-fetch User-Agent string identifies the library by name (node-fetch in v3, node-fetch/1.0 plus a project URL in v2), which makes it one of the clearest signals that a request is automated. Replace it with a realistic Chrome or Firefox User-Agent string.
Don't reuse the same User-Agent across every request either. Rotate a small pool of realistic strings and keep them paired with matching sec-ch-ua client hints, which are HTTP headers that indicate which browser version sent the request. Mismatched hints create a fingerprinting inconsistency that anti-bot systems catch quickly. Also, for protected endpoints, add authorization headers directly:
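One way to keep the User-Agent and client hints paired is to rotate whole header profiles rather than individual strings. In the sketch below, BROWSER_PROFILES and buildHeaders are hypothetical names, the header values are illustrative examples that may be out of date, and the Bearer scheme is an assumption; use whatever scheme your protected endpoint expects.

```javascript
// Each profile keeps a User-Agent together with the client hints that match it.
// Values are illustrative; refresh them from a real, current browser.
const BROWSER_PROFILES = [
  {
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'sec-ch-ua': '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    'sec-ch-ua-platform': '"Windows"',
  },
  {
    'User-Agent':
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
    'sec-ch-ua': '"Google Chrome";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
    'sec-ch-ua-platform': '"macOS"',
  },
];

// Pick one profile at random and optionally attach an Authorization header.
// `random` is injectable so the choice can be made deterministic in tests.
function buildHeaders({ token, random = Math.random } = {}) {
  const index = Math.floor(random() * BROWSER_PROFILES.length);
  const headers = {
    ...BROWSER_PROFILES[index],
    Accept: 'text/html,application/xhtml+xml',
  };
  if (token) headers.Authorization = `Bearer ${token}`;
  return headers;
}

// Usage (not executed here): fetch(url, { headers: buildHeaders({ token: process.env.API_TOKEN }) })
```

Because a profile is selected as a unit, a Chrome 124 User-Agent is never sent alongside Chrome 123 hints.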
Cookies and sessions
node-fetch doesn't persist cookies between requests. There are two options to navigate this:
- Read the Set-Cookie headers. Retrieve them from one response and forward them in the next request's Cookie header.
- Install fetch-cookie. It wraps node-fetch with a Cookie jar (a storage object that automatically saves and sends cookies, just like a browser).
This matters because many sites set a session ID or anti-bot challenge cookie on the first request and reject any follow-up request that doesn't echo it back. Also, for token-style authentication, capture the cookie set after a login POST and reuse it for every protected page in the same session.
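For the manual option, the core step is turning Set-Cookie values into a single Cookie header. The helper below is a simplified sketch with a hypothetical name (toCookieHeader); it ignores attributes such as Path, Domain, and expiry, which is exactly the bookkeeping fetch-cookie handles for you.

```javascript
// Convert an array of Set-Cookie header values into one Cookie header string.
// Only the leading name=value pair of each entry is kept.
function toCookieHeader(setCookieValues) {
  return setCookieValues
    .map((value) => value.split(';')[0].trim())
    .filter((pair) => pair.includes('='))
    .join('; ');
}

// Example input shaped like what a server might send on the first response.
const cookieHeader = toCookieHeader([
  'sessionid=abc123; Path=/; HttpOnly',
  'theme=dark; Max-Age=3600',
]);

// With node-fetch, the array can be read from response.headers.raw()['set-cookie'],
// then sent back on the follow-up request:
//   fetch(nextUrl, { headers: { Cookie: cookieHeader } })
```

This is enough for a short two-request flow; for longer sessions, a cookie jar is less error-prone.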
POST requests and form submissions
A POST request isn't just about sending data; it’s the point where the request method, payload structure, and encoding format must all agree on how the data is interpreted on the server.
This is why method, body, and Content-Type are inherently coupled:
- The method (POST). It signals that data is being sent for processing, not just retrieved.
- The body. It contains the actual payload.
- The Content-Type. It defines how that payload should be parsed on the server.
If any of these are mismatched, the server may receive the data but interpret it incorrectly.
GET requests work for static pages. However, you will need a POST request for search forms that return results only after submission, login endpoints that gate the content you need, or API endpoints that expect a JSON payload.
node-fetch handles all 3 body types with the same options object. The difference is in how you format the body and what Content-Type you declare:
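As a small sketch of how method, body, and Content-Type stay in agreement, the two helpers below build the options object for a JSON payload and for a URL-encoded form. The names jsonPost and formPost are hypothetical; the returned objects are plain fetch options that should work with both node-fetch and the built-in fetch.

```javascript
// Options for an API endpoint that expects a JSON payload.
function jsonPost(payload) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  };
}

// Options for a classic HTML form submission (search forms, login forms).
function formPost(fields) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    // URLSearchParams takes care of escaping special characters in values.
    body: new URLSearchParams(fields).toString(),
  };
}

// Usage (not executed here):
//   await fetch(searchUrl, formPost({ q: 'node fetch', page: '2' }));
//   await fetch(apiUrl, jsonPost({ query: 'node fetch' }));
```

Keeping the body encoding and the Content-Type in the same helper makes it hard to send, for example, a JSON string labeled as form data.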
Here’s a realistic form-submission end-to-end scenario:
1. On the httpbin.org sample form (httpbin.org/forms/post), open DevTools (F12) and go to the Network tab.
2. Fill in the required fields and submit.
3. Note the request URL (shown as "url"), the request method, and the submitted form data (shown as "form").
4. Reproduce the request in your code and parse the response with Cheerio to extract results.
5. Run it with:
6. Here’s the output:
This is useful for sites where you have to submit forms repeatedly. For your target site, check whether the form includes any hidden input fields, such as session tokens, CSRF tokens (a security value the server generates per session to verify the request came from a real form submission), or internal page identifiers the server validates before returning results. For multipart file uploads, send a FormData body (for example, via the form-data package) so the request uses Content-Type multipart/form-data with the boundary parameter generated for you. This is less common in scraping but necessary when a form includes file inputs.
Query parameters and URL building
- Build query strings with the native URL object rather than manual string concatenation, which breaks on special characters:
- Use URLSearchParams when iterating over a parameter map:
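Both points can be sketched with one helper that walks a parameter map and lets the URL object handle escaping. The name buildSearchUrl and the example.com endpoint are assumptions made for illustration.

```javascript
// Build a URL from a base and a map of query parameters.
// url.searchParams is a URLSearchParams instance, so values are escaped for us.
function buildSearchUrl(baseUrl, params) {
  const url = new URL(baseUrl);
  for (const [key, value] of Object.entries(params)) {
    // Skip parameters that were left unset instead of sending "undefined".
    if (value !== undefined && value !== null) {
      url.searchParams.set(key, String(value));
    }
  }
  return url.toString();
}

// The "&" inside the value is escaped rather than starting a new parameter.
const searchUrl = buildSearchUrl('https://example.com/search', {
  q: 'tea & coffee',
  page: 2,
  lang: undefined,
});
```

With manual concatenation, the ampersand in the query would have silently split the value into two parameters.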
Parallel and efficient fetching with node-fetch
Parallelism in the context of HTTP requests means executing multiple network calls simultaneously rather than waiting for each to finish before starting the next. In Node.js, this matters because HTTP requests are I/O bound. While a request is in flight, the event loop is not doing CPU work.
Why parallel fetching matters
Because Node's event loop sits idle while waiting for the network, sequential requests waste that time. Running them in parallel can cut a 60-second job down to a few seconds.
But unbounded parallelism backfires fast: it exhausts socket pools, triggers rate limits, and burns through proxy quotas. Choose the concurrency pattern based on the number of URLs you need to scrape.
Let’s look at 3 different patterns.
Pattern 1: Promise.all for fixed small batches
Promise.all takes an array of Promises and resolves when all of them complete. It's the right tool for a known, small list (10-50 URLs). The catch is that Promise.all rejects on the first failure. Wrap each fetch in a try-catch block that returns a result object instead of rejecting, so the batch can finish, and you collect partial results:
A safer alternative is Promise.allSettled, which returns one result per Promise (either fulfilled or rejected) without short-circuiting on failure. For scraping, this is usually the right call.
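A minimal sketch of the Promise.allSettled variant is shown below. fetchBatch is a hypothetical name, and fetchImpl defaults to the built-in fetch so the snippet has no dependencies; the function imported from node-fetch can be passed in its place.

```javascript
// Fetch a small, fixed list of URLs in parallel and keep partial results.
async function fetchBatch(urls, fetchImpl = globalThis.fetch) {
  const settled = await Promise.allSettled(
    urls.map(async (url) => {
      const response = await fetchImpl(url);
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return response.text();
    })
  );

  // allSettled preserves input order, so results can be matched back by index.
  return settled.map((result, index) =>
    result.status === 'fulfilled'
      ? { url: urls[index], ok: true, body: result.value }
      : { url: urls[index], ok: false, reason: result.reason.message }
  );
}
```

One failing URL then shows up as a single { ok: false } entry instead of rejecting the whole batch.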
Pattern 2: Bounded concurrency with p-limit
For thousands of URLs, cap the number of in-flight requests using p-limit, a package that limits how many async functions run at the same time. Start with 5–10 concurrent requests per domain to avoid triggering rate limits:
Bounded concurrency with p-limit scales better than an unbounded Promise.all because it respects rate limits, keeps memory bounded, and plays nicely with your proxy pool size. See concurrency vs parallelism if the distinction between the two needs more context.
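To show what bounded concurrency does under the hood, here is a hand-rolled limiter. It is a stand-in written for this explanation, not p-limit's actual implementation, and createLimiter is a hypothetical name; in a real project the p-limit package is the simpler choice.

```javascript
// Returns a function that runs async tasks with at most `maxConcurrent` in flight.
function createLimiter(maxConcurrent) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active += 1;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        // Free the slot and start the next queued task, if any.
        active -= 1;
        next();
      });
  };

  return (task) =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}

// Usage (not executed here): cap a large batch at 5 requests in flight.
//   const limit = createLimiter(5);
//   const pages = await Promise.all(urls.map((url) => limit(() => fetch(url))));
```

All tasks are still created up front, but only a fixed number run at once; the rest wait in the queue.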
Pattern 3: Streaming queues for large jobs
When the URL list comes from a database or a streamed sitemap, use an async generator (a function that yields values one at a time, on demand) paired with a worker pool – a fixed group of concurrent workers that continuously pull the next available URL as soon as one finishes processing. This approach limits concurrent requests while keeping the crawler fast, and avoids loading every URL into memory upfront.
Pair this with AbortController to cancel hung requests after a timeout. AbortController is a built-in Web and Node.js API used to cancel asynchronous operations, most commonly HTTP requests made with fetch():
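The pieces described above might fit together as in the following sketch. urlSource, runPool, and fetchWithTimeout are hypothetical names; urlSource stands in for a database cursor or sitemap stream, and fetchImpl defaults to the built-in fetch (node-fetch also accepts a signal option).

```javascript
// Cancel a request if it has not produced a response within timeoutMs.
async function fetchWithTimeout(url, timeoutMs, fetchImpl = globalThis.fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchImpl(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer); // avoid leaving a pending timer after the request settles
  }
}

// Async generator: yields URLs one at a time, on demand.
async function* urlSource(urls) {
  for (const url of urls) yield url;
}

// Worker pool: `workerCount` workers pull from the shared iterator until it is exhausted.
async function runPool(source, workerCount, handle) {
  const iterator = source[Symbol.asyncIterator]();
  const results = [];

  async function worker() {
    while (true) {
      const { value, done } = await iterator.next();
      if (done) return;
      results.push(await handle(value)); // completion order, not input order
    }
  }

  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage (not executed here):
//   await runPool(urlSource(urls), 5, (url) =>
//     fetchWithTimeout(url, 10000).then((res) => res.text(), (err) => ({ url, error: err.message })));
```

Note that handle should catch its own errors, as in the usage comment; an uncaught rejection would stop the whole pool.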
Retry, backoff, and idempotency
Retry only on idempotent failures, which are failures where repeating the same request is unlikely to create duplicate side effects or change the intended outcome. In practice, this includes transient network failures, 429 Too Many Requests rate limits, and temporary server-side errors such as 502, 503, and 504.
By contrast, most 4xx client errors indicate that the request itself is invalid or unauthorized. Retrying 403 Forbidden or 404 Not Found usually wastes bandwidth because the problem is not temporary server instability; the URL, permissions, or request parameters are the issue, not the transport layer. Use exponential backoff with jitter: double the wait after each failed attempt and add a small random delay so that retries from parallel workers do not all land at the same moment.
Constant-interval retries (retrying every 5 seconds on the dot) are themselves a bot signal. For more on rate-limit handling, see the YouTube error 429.
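The retry policy from this section can be sketched as follows. The names, the 500 ms base, and the 30-second cap are assumptions chosen for illustration, and the helper is meant for idempotent requests such as GET; sleep and random are injectable so the behavior can be checked without real waiting.

```javascript
// Statuses treated as transient, per the guidance above.
const RETRYABLE_STATUSES = new Set([429, 502, 503, 504]);

// Exponential backoff (base * 2^attempt, capped) plus a small random jitter.
function backoffDelay(attempt, baseMs = 500, capMs = 30000, random = Math.random) {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  return exponential + Math.floor(random() * baseMs);
}

async function fetchWithRetry(url, options = {}) {
  const {
    fetchImpl = globalThis.fetch,
    retries = 3,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    random = Math.random,
  } = options;

  for (let attempt = 0; ; attempt += 1) {
    const isLastAttempt = attempt >= retries;
    try {
      const response = await fetchImpl(url);
      // Non-retryable statuses (200, 403, 404, ...) are returned to the caller as-is.
      if (!RETRYABLE_STATUSES.has(response.status) || isLastAttempt) return response;
    } catch (error) {
      // Network-level failure: retry unless the attempts are used up.
      if (isLastAttempt) throw error;
    }
    await sleep(backoffDelay(attempt, 500, 30000, random));
  }
}
```

With the defaults, a request is attempted at most four times, and the waits grow from roughly 0.5 seconds to roughly 2 seconds before the last try.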
Managing proxies and CAPTCHAs in node-fetch
node-fetch doesn't have built-in proxy support. You wire it in manually using a custom agent, an object that controls how the underlying TCP (Transmission Control Protocol) connection is made. TCP is one of the core communication protocols of the internet. When you use something like node-fetch to make an HTTP request, that request ultimately travels over TCP.
Routing requests through a proxy
1. Install https-proxy-agent.
2. Use Decodo’s residential proxies as your proxy service provider. Get your proxy credentials as described in the Decodo documentation, build a proxy agent from them, then pass that agent to fetch() through the agent option.
3. Build a small helper that returns the right agent based on protocol, for scrapes that hit both HTTP and HTTPS targets:
Datacenter proxies are cheap but easy to fingerprint: they come from known hosting IP ranges that anti-bot systems maintain blocklists for. Residential proxies route through real consumer IPs and are significantly harder to detect.
Decodo’s large residential proxy network (115M+ IPs) rotates IPs through rotating endpoints and session-based controls. Requests sent to a rotating endpoint can exit through different IPs without you having to manage the rotation manually.
Instead of reusing a single IP for an entire job, you either pull from a rotating endpoint or cycle through proxies in the pool, which helps distribute traffic, reduce blocks, and keep large-scale requests stable. See what are rotating proxies? for how rotation works in practice.
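Before an agent can be created, the proxy credentials have to be turned into a proxy URL. The sketch below shows one way to do that safely; buildProxyUrl, the proxy.example.com host, and the environment variable names are all placeholders, so take the real endpoint and port from your provider's dashboard.

```javascript
// Build a proxy URL from separate credential fields.
// Assigning through the URL setters percent-encodes characters such as "@" and ":"
// that would otherwise break the URL.
function buildProxyUrl({ host, port, username, password, protocol = 'http' }) {
  const url = new URL(`${protocol}://${host}:${port}`);
  url.username = username;
  url.password = password;
  return url.toString();
}

// Hypothetical credentials; in practice these would come from process.env via dotenv.
const proxyUrl = buildProxyUrl({
  host: 'proxy.example.com',
  port: 8080,
  username: 'user',
  password: 'p@ss:word',
});

// The resulting string is what a proxy agent constructor expects, for example:
//   import { HttpsProxyAgent } from 'https-proxy-agent';
//   const agent = new HttpsProxyAgent(proxyUrl);
//   await fetch(targetUrl, { agent });
```

Building the URL this way avoids a common failure where a password containing "@" is concatenated into the string and the proxy host is parsed incorrectly.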
Handling CAPTCHAs
node-fetch can't solve CAPTCHAs. It doesn't render JavaScript or interact with challenge widgets. There are 3 options when you hit one:
- Integrate a CAPTCHA-solving service and POST the token back to the form.
- Avoid triggering them in the first place. Slow requests down, use clean residential IPs, and rotate User-Agent strings.
- Hand the request to a managed scraping API that solves CAPTCHAs transparently.
Practical signals that your scraper has hit a CAPTCHA wall:
- A 403 status
- A response body containing "captcha", "challenge", or "verify you are human"
- A redirect to a /challenge URL
For a full breakdown of bypass strategies, see how to bypass CAPTCHAs and anti-scraping techniques and how to outsmart them.
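The three signals listed above can be checked in code with a small heuristic. captchaSignals is a hypothetical helper, and the matching rules are deliberately rough; a 403 or the word "challenge" can also appear on pages that have nothing to do with CAPTCHAs, so treat the output as a hint for logging or escalation rather than proof.

```javascript
const CAPTCHA_MARKERS = ['captcha', 'challenge', 'verify you are human'];

// Returns the list of CAPTCHA-wall signals found in a response.
// `finalUrl` is the URL after redirects (response.url), `body` is the response text.
function captchaSignals({ status, finalUrl, body }) {
  const signals = [];

  if (status === 403) signals.push('status-403');

  try {
    if (new URL(finalUrl).pathname.startsWith('/challenge')) {
      signals.push('challenge-redirect');
    }
  } catch {
    // finalUrl missing or not a valid URL: skip the redirect check.
  }

  const text = (body || '').toLowerCase();
  if (CAPTCHA_MARKERS.some((marker) => text.includes(marker))) {
    signals.push('body-marker');
  }

  return signals;
}

// Usage (not executed here):
//   const body = await response.text();
//   const signals = captchaSignals({ status: response.status, finalUrl: response.url, body });
//   if (signals.length > 0) { /* slow down, rotate the proxy, or escalate */ }
```

Returning the individual signals, rather than a single boolean, makes it easier to tune which combinations should trigger a slowdown or a hand-off.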
When to escalate to a managed scraping API
If you're already rotating proxies, randomizing headers, and retrying with backoff, yet roughly 1 in every 10 scraping attempts still fails and needs a retry, the target's anti-bot stack is winning.
Thankfully, Decodo's Web Scraping API handles JavaScript rendering, proxy rotation, header fingerprinting, and CAPTCHA solving within a single request. To use Decodo's Web Scraping API with node-fetch, you simply swap your target URL for the Decodo API endpoint and include your credentials in the Authorization header. Your parsing pipeline (Cheerio plus your selectors) doesn't change at all. Get your Decodo Web Scraping API details here. Here is a sample implementation of a Web Scraping API with node-fetch and Cheerio:
Note that Decodo residential proxies are the right choice when you want to keep request logic in your own code while outsourcing only the IP pool. That fits projects already invested in their own retry and header stack. For more, see how to bypass anti-bot systems.
node-fetch can't rotate IPs
Your fetch calls are clean. Your single IP isn't. Decodo's residential proxies rotate through 115M+ addresses so your scraper doesn't get flagged after the first loop.
Legal and ethical considerations for web scraping with node-fetch
Before sending a request, run through these checks. They take about 5 minutes and can save you from a legal dispute.
- Check robots.txt first. The robots.txt file lists which paths a site asks crawlers to avoid. node-fetch can retrieve it directly:
Use the robots-parser package to programmatically check whether a URL is allowed before adding it to your queue, as shown below.
- Read the Terms of Service. Many sites prohibit automated access even when robots.txt is silent. Public, non-logged-in data is generally safer to scrape than data behind authentication walls
- Respect rate limits and Retry-After headers. When a server sends a Retry-After header with a 429 response, read it and wait that long before retrying. Aggressive scraping that ignores rate limits can constitute a denial-of-service in some jurisdictions
- Handle personal data carefully. Avoid collecting PII (Personally Identifiable Information, such as names, email addresses, or phone numbers) without a lawful basis under GDPR (EU), CCPA (California), or similar data protection laws
- Cache and deduplicate. Don't re-fetch the same URL hourly when daily is enough. It's cheaper, faster, and more respectful of the target's infrastructure
When in doubt, prefer the site's official API or a licensed data feed over scraping. See Is web scraping legal? for a full breakdown of laws and cases, and how to check if a website allows scraping for a practical pre-scrape checklist.
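Two of the checks above lend themselves to small helpers: locating a site's robots.txt and honoring Retry-After. The names robotsUrlFor and retryAfterMs are hypothetical; the Retry-After header can carry either a number of seconds or an HTTP date, so the sketch handles both.

```javascript
// Derive the robots.txt URL for whatever page you plan to scrape.
function robotsUrlFor(pageUrl) {
  return new URL('/robots.txt', pageUrl).toString();
}

// Convert a Retry-After header value into milliseconds to wait.
// Returns null when the header is missing or cannot be parsed.
function retryAfterMs(headerValue, now = Date.now()) {
  if (!headerValue) return null;

  // Form 1: delay in seconds, e.g. "120".
  const seconds = Number(headerValue);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);

  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const date = Date.parse(headerValue);
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}

// Usage (not executed here):
//   const rules = await (await fetch(robotsUrlFor(targetUrl))).text();
//   const waitMs = retryAfterMs(response.headers.get('retry-after'));
//   if (response.status === 429 && waitMs !== null) await new Promise((r) => setTimeout(r, waitMs));
```

When the header is absent, falling back to your normal backoff schedule is a reasonable default.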
Conclusion
Node Fetch and Cheerio form a lightweight web scraping toolkit for Node.js that does not require heavy browser automation. Although Node.js has a built-in Fetch API, node-fetch is still useful for CommonJS projects, for projects pinned to older Node versions, and for codebases built around compatible plugins like fetch-cookie.
Although node-fetch is versatile enough for advanced request handling and efficient parallel fetching, it lacks built-in proxy support, CAPTCHA handling, and JavaScript rendering. Use Decodo’s rotating residential proxies to get past IP-based blocking.
Then, easily escalate to Decodo’s Web Scraping API for proxy rotations, CAPTCHA, header fingerprinting, and JS rendering with node-fetch. This is the right path for production-grade web scraping with Node.js.
When fetch alone isn't enough
JS rendering, CAPTCHAs, and anti-bot detection. Decodo's Web Scraping API handles everything node-fetch can't and returns structured data in one call.
About the author

Lukas Mikelionis
Senior Account Manager
Lukas is a seasoned enterprise sales professional with extensive experience in the SaaS industry. Throughout his career, he has built strong relationships with Fortune 500 technology companies, developing a deep understanding of complex enterprise needs and strategic account management.
Connect with Lukas via LinkedIn.
All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.


