What is curl and why use it for web scraping?

curl ("Client URL") is a command-line tool that transfers data using various network protocols. It supports HTTP(S), FTP, and about 20 other protocols, making it incredibly versatile for fetching data from the web. Initially released in 1997, curl has become a standard utility pre-installed on Linux, macOS, and modern Windows systems, meaning you can start scraping immediately without installing anything.

Why developers love curl for scraping

The main appeal is simplicity. You can fetch a webpage's HTML with a single command, without the need for IDEs or fancy tools. It's simple, fast, and doesn't consume much memory compared to complex, browser-based tools. For quick data extraction tasks or testing APIs, curl gets the job done in seconds. It's also perfect for automation – write a curl command in a bash script, schedule it with cron, and you've got a basic scraper running on autopilot.

curl shines when you're dealing with static HTML pages, simple API calls, or need to test how a server responds to different request headers. If you're extracting data that's already present in the initial HTML response, curl handles it effortlessly.

When curl isn't enough

Here's where it gets tricky. Modern websites love JavaScript, but curl doesn't execute JavaScript. If the data you need is loaded dynamically after the page renders, curl will fetch just the base HTML while the actual content stays hidden. Sites with heavy anti-bot protections, CAPTCHAs, or complex authentication flows can also be a headache with curl alone.

For these scenarios, you'll want to reach for headless browsers like Puppeteer or Playwright, or consider using a dedicated solution like Decodo's Web Scraping API that handles JavaScript rendering and anti-bot measures automatically.

Getting started: Installing and setting up curl

Checking if curl is already installed

Before downloading anything, curl might be curled up somewhere in your system already. Open your terminal tool and type:

curl --version

The result should look similar to this:

curl 8.7.1 (x86_64-apple-darwin24.0) libcurl/8.7.1 (SecureTransport) LibreSSL/3.3.6 zlib/1.2.12 nghttp2/1.64.0
Release-Date: 2024-03-27
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns ldap ldaps mqtt pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS GSS-API HSTS HTTP2 HTTPS-proxy IPv6 Kerberos Largefile libz MultiSSL NTLM SPNEGO SSL threadsafe UnixSockets

If you see a similar response with version information and a list of supported protocols, you're already good to go.

Most modern systems ship with curl pre-installed, so there's a decent chance you can skip straight to scraping. If, for some reason, your system doesn't have curl, follow the steps below to install it based on your operating system.

Installation by operating system

Linux. Most distributions include curl by default. If yours doesn't, install it using your package manager:

# Debian/Ubuntu
sudo apt-get update && sudo apt-get install curl
# Fedora
sudo dnf install curl
# Arch Linux
sudo pacman -S curl

macOS. curl should be pre-installed on macOS. If you need to download or update to the latest version, use Homebrew:

brew install curl

Windows. Windows 10 (version 1803 or later) includes curl natively. Open Command Prompt or PowerShell and type curl --version to confirm. In Windows PowerShell 5.1, curl can resolve to an alias for Invoke-WebRequest, so type curl.exe --version there if the output looks different. If curl is missing or you're on an older version, download the Windows binary from the official curl website. Extract the files and add the folder to your system's PATH environment variable so you can run curl from any directory.

Other systems. If you're using a less popular operating system or want to download curl manually, you can find a version on the official downloads page.

Verifying your installation

Run a quick test to make sure everything works:

curl https://ip.decodo.com/

You should see HTML content printed directly to your terminal. If you get an error about SSL certificates or connection issues, check your network settings or firewall. Once you see that HTML dump, you're ready to start scraping.

Basic curl commands for web scraping

Understanding curl syntax

A curl command follows a simple structure:

curl [option] [parameter(s)] [URL]

The URL is the only required part – everything else is optional flags that modify the request behavior. The order also rarely matters, meaning that [options] can be written after the [URL] as well. Options typically start with a single dash (-o) for short form or double dash (--output) for long form. They are followed by additional parameters that add extra clarification or context. You can stack multiple options in a single command, which you'll do constantly when scraping.

Fetching a webpage's HTML

The most basic scraping command is a simple GET request:

curl https://ip.decodo.com/

This prints the entire HTML response straight to your terminal. You'll see all the raw HTML tags, scripts, and content – exactly what the server sends back. It's useful for quick checks, but scrolling through walls of HTML in your terminal to find what you need is like trying to find a needle in a haystack.

Saving output to a file

Make sure you know where your terminal is currently running commands. Check your working directory with the pwd command, then use cd to move to where you want your test files to live. Create a new folder with mkdir, then enter it with cd folder_name. This way, you won't have trouble locating where your files are being placed.

Instead of cluttering your terminal, save the HTML to a file you can actually work with:

curl https://example.com -o example.html

The -o flag writes the output to whatever filename you specify after it. If you want curl to name the file based on the URL automatically, use -O:

curl -O https://example.com/data.html

This saves the file as data.html in your current directory.

Following redirects

Many websites redirect you from one URL to another – think HTTP to HTTPS, or shortened URLs that bounce you to the actual destination. By default, curl doesn't follow these redirects, so you won't get any meaningful content by running this:

curl http://decodo.com/

On its own, this command prints little or nothing useful, because the response is only a redirect notice. If you add the --verbose flag, which prints detailed information about the request and the response, you'll see "HTTP/1.1 301 Moved Permanently". That line means the content you're trying to access has been moved elsewhere (most likely to HTTPS).

Add the -L flag to tell curl to follow redirects automatically:

curl -L http://decodo.com/

Now curl chases the redirect chain until it reaches the final destination and fetches the real content. This is essential for scraping, since you rarely want the redirect page itself – you want where it's sending you.
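When a redirect chain behaves unexpectedly, it can help to check where it leads before saving anything. The sketch below wraps curl's -w (--write-out) option in two small helpers. The function names redirect_info and final_url are illustrative and not part of curl.

```shell
# Hypothetical helpers (the names are illustrative, not part of curl).

# Print the status code and redirect target of the first response,
# without following the redirect.
redirect_info() {
  # -s hides the progress meter, -o /dev/null discards the body,
  # -w prints selected details once the transfer finishes.
  curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' "$1"
}

# Follow the whole redirect chain and print the URL it ends on.
final_url() {
  curl -s -L -o /dev/null -w '%{url_effective}\n' "$1"
}

# Usage (requires network access):
#   redirect_info http://decodo.com/
#   final_url http://decodo.com/
```

Checking the final URL first is a cheap way to spot a page that redirects to a login or consent screen instead of the content you expected.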

These basic commands cover most of the simple scraping tasks. Once you're comfortable with GET requests, saving files, and handling redirects, you're ready to tackle more sophisticated scenarios.

Advanced web scraping with curl

Customizing requests

Real-world scraping means disguising your requests to look like they're coming from a regular browser, not a command-line tool. Websites check request headers to identify bots, and a default curl request screams "automated tool" from a mile away.

Setting custom headers

The -H flag lets you add custom headers to your request. The most important one is the User-Agent, which identifies what browser you're using:

curl -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" \
  http://httpbin.org/headers

The above command tells the site that you're using Windows 10 on a 64-bit machine. AppleWebKit is the rendering engine reported by Chrome and most Chromium-based browsers (although they actually use Blink). Pay no attention to Mozilla/5.0: it's a legacy token that no longer carries any meaning, and most browsers include it only for compatibility.

The test request is sent to HTTPBin, a handy website for testing your requests. You will get a JSON response that sends the information you provided back to you, so you know that it went through.

Without a realistic User-Agent, many sites will serve you different (often broken) content or block you entirely. A real browser also sends several other headers with every request, so you'll usually want to include more than just the User-Agent. You can stack multiple headers in one command:

curl -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
  -H "Referer: https://decodo.com" \
  -H "Accept-Language: en-US,en;q=0.9" \
  http://httpbin.org/headers

The Referer header tells the server where you "came from," which some sites check before serving content. The Accept-Language header tells what language and locale the client prefers. These are common headers that are sent by browsers to websites, making them strong identifiers of legitimate users.

Working with cookies

Cookies maintain session state between requests. This is what helps sites remember you, your set preferences, login status, and more. Save cookies from a response using -c:

curl -c cookies.txt https://httpbin.org/cookies/set/decodo-test-cookie/67

We're using an HTTPBin URL to set a custom "decodo-test-cookie" with the value "67". You can do this with a real site too, but many of them also set cookies through JavaScript, which curl can't execute.

Then send those cookies back in subsequent requests with -b:

curl -b cookies.txt http://httpbin.org/cookies

This is crucial for scraping pages that require you to stay logged in or maintain a session.
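The file written by -c is plain text in the Netscape cookie format: one cookie per line, with seven tab-separated fields (domain, subdomain flag, path, secure flag, expiry, name, value). As a quick way to check what was saved, here is a minimal sketch. The helper name list_cookies is hypothetical, and the sketch assumes awk is available.

```shell
# list_cookies is a hypothetical helper: print name=value for each cookie in a jar.
list_cookies() {
  # Cookie lines have exactly 7 tab-separated fields. Lines starting with "#"
  # are comments, except the "#HttpOnly_" prefix, which marks HttpOnly cookies.
  awk -F '\t' 'NF == 7 && ($0 !~ /^#/ || $0 ~ /^#HttpOnly_/) { print $6 "=" $7 }' "$1"
}

# Usage, after saving a jar with -c:
#   list_cookies cookies.txt
```

To keep a session alive across many requests, you can pass the same file to both options (curl -b cookies.txt -c cookies.txt ...), so curl sends the stored cookies and writes back any updates.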

Sending POST requests

curl isn't limited to just GET requests. Forms, logins, and API endpoints often need POST requests with data. Use -X POST and -d to send form data:

curl -X POST http://httpbin.org/post \
  -d "query=decodo+web+scraping" \
  -d "limit=50"

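HTTPBin echoes the submitted fields back as JSON. With -d, curl sends the data as application/x-www-form-urlencoded by default. If an endpoint expects a JSON body instead, set the Content-Type header and pass the JSON as the data: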

curl -X POST http://httpbin.org/post \
  -H "Content-Type: application/json" \
  -d '{"keyword": "scraping", "count": 100}'

This pattern works for most API interactions where you need to submit data to get results back.

HTTP authentication

Some sites use basic HTTP authentication. Handle them with the -u flag:

curl -u user:pass http://httpbin.org/basic-auth/user/pass

HTTPBin's test endpoint accepts whatever username and password appear in the URL path (here, user and pass), so the credentials passed to -u must match them. curl encodes your credentials and includes them in the Authorization header automatically. For sites that don't use basic auth, you'll need to scrape the login form and submit credentials via POST instead.
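To see what -u does under the hood: Basic authentication joins the username and password with a colon, Base64-encodes the result, and sends it in an Authorization header. The sketch below builds that header by hand. The helper name basic_auth_header is hypothetical, and the sketch assumes the base64 utility is installed.

```shell
# basic_auth_header is a hypothetical helper that mimics what "curl -u" sends.
basic_auth_header() {
  # Base64-encode "user:password"; tr removes the trailing newline and any line wrapping.
  token=$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')
  printf 'Authorization: Basic %s\n' "$token"
}

basic_auth_header user pass
# Prints: Authorization: Basic dXNlcjpwYXNz

# Equivalent to "curl -u user:pass ..." (requires network access):
#   curl -H "$(basic_auth_header user pass)" http://httpbin.org/basic-auth/user/pass
```

Base64 is an encoding, not encryption, so anyone who can read the request can recover the password. Prefer HTTPS URLs whenever you send credentials this way.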

Handling pagination and multiple requests

The real power of curl shows up when you automate it with shell scripts. Most scraping jobs involve fetching multiple pages: product listings, search results, or paginated data. A simple bash loop handles this elegantly. Create a new file (touch file_name.sh in the terminal, or create it manually) and write the following command in it:

for page in {1..5}; do
  curl -L "https://scrapeme.live/shop/page/$page/" \
    -o "page_$page.html"
  sleep 2
done

Save the file and run it through the terminal with:

bash file_name.sh

This fetches pages 1 through 5, saves each to a separate file, and waits 2 seconds between requests to avoid overwhelming the server. An -L option is often used in pagination, as the first page will usually redirect to the default link without a page number in the URL.
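A fixed page range works when you know how many pages exist. When you don't, one option is to check the HTTP status code and stop at the first page that isn't a 200. The following is a minimal sketch under that assumption. The function name scrape_pages, its parameters, and the DELAY variable are illustrative, and some sites return 200 even for pages past the end, in which case you would need a different stop condition.

```shell
# scrape_pages is a hypothetical helper; base_url and max_pages are illustrative parameters.
# It saves page_1.html, page_2.html, ... and prints how many pages were saved.
scrape_pages() {
  base_url=$1
  max_pages=$2
  page=1
  while [ "$page" -le "$max_pages" ]; do
    # -s hides the progress meter, -L follows redirects, -o saves the body,
    # and -w prints only the final HTTP status code to stdout.
    status=$(curl -s -L -o "page_$page.html" -w '%{http_code}' "$base_url/page/$page/")
    if [ "$status" != "200" ]; then
      echo "Stopping at page $page (HTTP $status)" >&2
      rm -f "page_$page.html"   # discard the error page
      break
    fi
    page=$((page + 1))
    sleep "${DELAY:-2}"         # pause between requests; override with DELAY=...
  done
  echo "$((page - 1))"
}

# Usage (requires network access):
#   scrape_pages "https://scrapeme.live/shop" 5
```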

You can also read URLs from a file. Create a file named urls.txt and enter several URLs you want to scrape:

http://ip.decodo.com/
http://scrapeme.live/shop/
http://httpbin.org/

Put each URL on its own line and end the file with a newline after the last one; otherwise, the read command may skip the final URL. Then, in a different bash (.sh) file, write the following script:

while read -r url; do
  curl "$url" -o "$(basename "$url").html"
  sleep 1
done < urls.txt

Run it in your terminal as before. The script will scrape the listed websites and create a new file for each of them.
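Naming files with basename keeps only the last path segment, so two URLs that end in the same segment would overwrite each other. One possible variation derives the file name from the host and path instead, and also tolerates blank lines and a missing final newline. The helper names url_to_filename and scrape_list, and the DELAY variable, are hypothetical.

```shell
# url_to_filename is a hypothetical helper: turn a URL into a file-system-safe name.
# The sed steps strip the scheme, drop any query string or fragment,
# trim trailing slashes, and replace remaining unsafe characters with "_".
url_to_filename() {
  printf '%s\n' "$1" | sed \
    -e 's|^[A-Za-z][A-Za-z0-9+.-]*://||' \
    -e 's|[?#].*$||' \
    -e 's|/*$||' \
    -e 's|[^A-Za-z0-9._-]|_|g'
}

# scrape_list is a hypothetical helper: fetch every URL listed in the given file.
scrape_list() {
  # The extra "[ -n ... ]" check keeps the last line even without a trailing newline.
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue   # skip blank lines
    curl -s -L "$url" -o "$(url_to_filename "$url").html"
    sleep "${DELAY:-1}"         # pause between requests; override with DELAY=...
  done < "$1"
}

# Usage (requires network access):
#   scrape_list urls.txt
```

With the three example URLs, this would produce ip.decodo.com.html, scrapeme.live_shop.html, and httpbin.org.html.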

For more complex workflows like scraping data, extracting specific values, and then using those values in subsequent requests, you'll want to combine curl with other command-line tools or script it in Python. But for straightforward multi-page scraping, a bash loop with curl gets you surprisingly far.

Avoiding blocks and bans

Getting blocked is a scraper's nightmare. Websites deploy increasingly sophisticated anti-bot measures, and a few careless requests can get your IP banned for hours or days. Here's how to stay undetected during scraping.

Using proxies

Proxies are your first line of defense. They route your requests through different IP addresses, making it look like the traffic comes from multiple users instead of one relentless bot hammering the server. With curl, setting up a proxy is possible with the -x option:

curl -x proxy-host:port https://example.com

But what if your proxy is unreliable or gets banned as well? This is where Decodo's rotating residential proxies become essential. Residential proxies use real and reliable IP addresses from actual devices, making your requests virtually indistinguishable from legitimate traffic. Even in the event one fails, the rotating nature will just switch to the next IP address, and you can continue scraping as usual.

Setting up Decodo's residential proxies with curl is simple:

curl -U "username:password" -x "gate.decodo.com:7000" "https://ip.decodo.com/json"

Replace username:password with your Decodo credentials, and you're routing requests through a pool of residential IPs that rotate automatically. For high-volume scraping, this setup is non-negotiable.

Rotating user-agents and headers

We covered User-Agent headers earlier, but it's worth emphasizing: rotating them between requests makes your traffic pattern look more organic. Create a list of common User-Agent strings and cycle through them:

USER_AGENTS=(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
)
for page in {1..5}; do
  # Pick a random entry from the array
  AGENT=${USER_AGENTS[$RANDOM % ${#USER_AGENTS[@]}]}
  curl -H "User-Agent: $AGENT" "https://example.com/page/$page"
  sleep 2
done

Mix in other headers like Accept-Language, Referer, and Accept-Encoding to further randomize your fingerprint. The goal is to avoid sending identical requests that may seem bot-like.

Handling CAPTCHAs and anti-bot measures

Here's where curl hits a wall. Modern websites deploy CAPTCHAs, browser fingerprinting, JavaScript challenges, and behavioral analysis that curl simply can't handle. curl doesn't execute JavaScript, can't solve CAPTCHAs, and lacks the browser environment these systems expect.

This is precisely what Decodo's Web Scraping API was built for. It renders JavaScript, rotates user-agents and headers, and gets past anti-bot protections and CAPTCHAs, all behind the scenes. You send a simple API request, and Decodo returns clean HTML:

curl --request 'POST' \
  --url 'https://scraper-api.decodo.com/v2/scrape' \
  --header 'Accept: application/json' \
  --header 'Authorization: Basic [your basic auth token]' \
  --header 'Content-Type: application/json' \
  --data '{"url": "https://ip.decodo.com", "headless": "html"}'

The API uses headless browsers and proxies under the hood, so you get all the benefits of sophisticated scraping infrastructure without building it yourself. For sites with heavy anti-bot measures, this approach is far more reliable than trying to outsmart CAPTCHAs with raw curl commands.

When you're scraping at scale or facing aggressive bot detection, tools like Decodo's API aren't just convenient – they're the difference between a scraper that works and one that gets blocked after three requests.