
Rust Web Scraping: Step-by-Step Tutorial With Code Examples

Python is usually the first choice for web scraping, but it can struggle in high-throughput scenarios where you’re fetching many pages concurrently or need stronger reliability. That’s where Rust comes in. In this tutorial, you’ll build a Hacker News scraper in Rust, covering setup, JSON output, and scaling, along with where Rust excels, where it adds friction, and when to offload to a managed scraping API.

TL;DR

  • Rust's core scraping stack includes three crates: reqwest (HTTP), scraper (HTML parsing), and tokio (async runtime).
  • Native async concurrency via Tokio, using tokio::spawn and semaphores, lets you scrape multiple pages simultaneously without Python's GIL constraints.
  • reqwest only fetches raw HTML and doesn’t execute JavaScript, so pages built with frameworks like React or Next.js may return an empty shell.
  • Concurrency improves throughput, but without rate limiting or IP rotation, it increases the risk of being blocked.
  • Rust’s main advantage is predictable performance and stability under load, but tradeoffs include slower iteration and stricter compile-time checks.
  • For targets with aggressive bot protection or JavaScript rendering, external scraping APIs handle that layer so your Rust code can stay focused on parsing.

Advantages and disadvantages of using Rust for web scraping

Before setting everything up, it’s worth understanding what Rust actually gives you in scraping workloads and where it adds friction.

Advantages

  1. Performance. Rust compiles to native binaries with zero-cost abstractions and no garbage collector. For scrapers running against hundreds of thousands of pages, that predictability in latency and throughput is real.
  2. Native async concurrency. Tokio lets you scrape many pages simultaneously without Python's GIL constraints. You can run hundreds or even thousands of lightweight tasks, with the runtime efficiently scheduling them.
  3. Strong typing. Rust’s type system also helps catch mismatches early. It guarantees type correctness within your code, so if a struct expects a u32 and you provide a String, that’s a compile-time error.
  4. Single binary deployment. A Rust scraper compiles to a single executable with minimal runtime dependencies, which makes it easy to ship and run in containers or on servers without worrying about setting up an environment.

Disadvantages

  1. Steeper learning curve. Ownership, borrowing, lifetimes, and async patterns introduce cognitive overhead that slows down early development. What might take an afternoon in Python can take significantly longer the first time in Rust.
  2. Limited headless browser ecosystem. Rust has no native equivalent of Python tools like Playwright or Selenium. For JavaScript-heavy sites, you'll often need to rely on external services, run a separate browser process via a WebDriver client, or integrate with APIs rather than using a fully native Rust solution.
  3. Slower compile times. Compared to scripting languages, the feedback loop is longer, which makes quick experimentation and debugging less flexible.
  4. Not justified for small jobs. If you’re scraping a single site or running occasional jobs, Python, or even a simple curl pipeline, will get you there faster. Rust starts to make sense when you’re dealing with large-scale concurrency or production systems where performance and reliability matter more than speed.

When Rust web scraping makes sense

Rust isn’t the default choice for web scraping, and for most cases, it doesn’t need to be. It starts to make sense when you’re dealing with:

  • Building high-throughput data pipelines that run continuously in production
  • Scraping thousands of pages concurrently
  • Teams already working in Rust who want a unified stack
  • Performance-sensitive applications where compute or memory costs matter at scale

What you need to scrape websites in Rust

This tutorial assumes you're comfortable with basic Rust syntax such as structs, enums, functions, and error handling with Result. You should know how Cargo works, including adding dependencies and running cargo build and cargo run, and have a basic understanding of HTTP and HTML structure.

Core crates

  • reqwest is the standard async HTTP client for Rust. It supports GET and POST requests, headers, cookies, redirects, and TLS, and is widely used in scraping projects.
  • scraper is an HTML parser built on top of html5ever that supports CSS selectors. It plays a similar role to BeautifulSoup in Python for navigating parsed HTML.
  • Tokio is the async runtime that powers concurrent execution. It’s required when using async reqwest and most modern Rust networking code.
  • Serde and serde_json are used for serializing extracted data into structured formats like JSON.

Installing Rust and creating the project

If you don't have Rust installed yet, rustup is the standard installer. On macOS or Linux, open your terminal and run:

# macOS / Linux
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

On Windows, download and run rustup-init.exe from https://rustup.rs, and follow the prompts.

With rustup installed, run these two commands to make Cargo (Rust's package manager and build tool) available in your current shell session:

source $HOME/.cargo/env
rustup default stable

The first command adds Cargo to your current shell's PATH; the second tells rustup to download and install the stable toolchain. Your terminal will confirm each component as it installs, then display a final line showing the version of the stable toolchain that's now active.

Once that's done, verify both tools are working:

rustc --version
cargo --version

If both commands print version numbers, you're set.

With your dependencies sorted, create a new project directory and navigate into it:

cargo new hacker-news-scraper
cd hacker-news-scraper

This gives you a minimal project with src/main.rs and Cargo.toml files.

Open Cargo.toml and add your dependencies as shown below:

[dependencies]
reqwest = { version = "0.12", features = ["rustls-tls"] }
tokio = { version = "1", features = ["full"] }
scraper = "0.22"
serde = { version = "1", features = ["derive"] }
serde_json = "1"

Save the file, then build the project:

cargo build

Cargo will download and compile everything for you.

Before writing any Rust, open your browser dev tools on Hacker News and identify the CSS selectors you’ll need for titles, links, and metadata. Test them in the console first, as it’s much faster to get selectors right here than debugging them later in the code.

Once that’s done, you’re ready to start writing the scraper in main.rs.

How to scrape websites in Rust: a full working example

We’re going to build a scraper that fetches Hacker News, extracts structured data from each post, paginates across multiple pages, and saves everything to a JSON file.

Hacker News is a good target for a few reasons:

  • It’s static HTML (no JavaScript rendering)
  • The structure is consistent
  • The data maps cleanly to a typed Rust struct

We're using Hacker News because it's a real site with a real DOM challenge worth explaining: the sibling-row structure, where post metadata lives in a separate <tr> from the title.

You'll see what that means shortly.

The full working example (start here)

Before breaking things down step by step, here’s the complete scraper.

All of this code goes into a single file: src/main.rs.

Note: If you haven’t added the dependencies in Cargo.toml, do that first, otherwise the scraper won’t compile.

use reqwest::Client;
use scraper::{ElementRef, Html, Selector};
use serde::Serialize;
use std::time::{SystemTime, UNIX_EPOCH};

#[derive(Debug, Serialize)]
struct HackerNewsPost {
    title: String,
    url: Option<String>,
    domain: Option<String>,
    score: u32,
    comment_count: u32,
}

async fn fetch_page(client: &Client, page: u32) -> Result<String, reqwest::Error> {
    let url = format!("https://news.ycombinator.com?p={}", page);
    client
        .get(&url)
        .send()
        .await?
        .error_for_status()?
        .text()
        .await
}

fn parse_posts(html: &str) -> Vec<HackerNewsPost> {
    let document = Html::parse_document(html);

    // Compile selectors once, outside the row loop
    let row_sel = Selector::parse("tr.athing").unwrap();
    let title_sel = Selector::parse("span.titleline > a").unwrap();
    let domain_sel = Selector::parse("span.sitestr").unwrap();
    let score_sel = Selector::parse("span.score").unwrap();
    let subtext_sel = Selector::parse("td.subtext").unwrap();
    let link_sel = Selector::parse("a").unwrap();

    let mut posts = Vec::new();

    for row in document.select(&row_sel) {
        let title_el = match row.select(&title_sel).next() {
            Some(el) => el,
            None => continue,
        };
        let title = title_el.text().collect::<String>().trim().to_string();
        let url = title_el.value().attr("href").map(|s| s.to_string());

        let domain = row
            .select(&domain_sel)
            .next()
            .map(|el| el.text().collect::<String>().trim().to_string());

        // The score/comments metadata lives in the next <tr>, so walk to the
        // sibling row directly -- it can't be reached with row.select() alone.
        let subtext = row
            .next_sibling()
            .and_then(ElementRef::wrap)
            .and_then(|sib| sib.select(&subtext_sel).next());

        let score = subtext
            .and_then(|st| st.select(&score_sel).next())
            .and_then(|el| {
                el.text()
                    .collect::<String>()
                    .trim_end_matches(" points")
                    .trim()
                    .parse::<u32>()
                    .ok()
            })
            .unwrap_or(0);

        let comment_count = subtext
            .and_then(|st| st.select(&link_sel).last())
            .and_then(|el| {
                el.text()
                    .collect::<String>()
                    .split_whitespace()
                    .next()
                    .and_then(|n| n.parse::<u32>().ok())
            })
            .unwrap_or(0);

        posts.push(HackerNewsPost {
            title,
            url,
            domain,
            score,
            comment_count,
        });
    }

    posts
}

fn save_to_json(posts: &[HackerNewsPost]) -> Result<(), Box<dyn std::error::Error>> {
    let json = serde_json::to_string_pretty(posts)?;
    let timestamp = SystemTime::now().duration_since(UNIX_EPOCH)?.as_secs();
    let filename = format!("hn_posts_{}.json", timestamp);
    std::fs::write(&filename, json)?;
    println!("Saved {} posts to {}", posts.len(), filename);
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::builder()
        .user_agent("Mozilla/5.0 (compatible; Rust scraper/1.0)")
        .build()?;

    let mut all_posts = Vec::new();

    for page in 1..=3 {
        println!("Fetching page {}...", page);
        let html = fetch_page(&client, page).await?;
        let posts = parse_posts(&html);
        if posts.is_empty() {
            println!("No posts found, stopping.");
            break;
        }
        println!("Found {} posts", posts.len());
        all_posts.extend(posts);

        // Be polite: short pause between page fetches
        tokio::time::sleep(tokio::time::Duration::from_millis(500)).await;
    }

    save_to_json(&all_posts)?;
    Ok(())
}

Run it with the following command:

cargo run

You should see it fetch multiple pages and generate a JSON file in your project directory.

Now, let’s break down the code and the results it delivers.

Step 1: Defining the data structure

First, we define what a post should look like. This sets a clear structure for the data we want to extract.

Using Option<String> for fields like url and domain accounts for cases where those values don’t exist, such as internal Hacker News posts. The Serialize derive allows us to convert this struct into JSON later without extra work. At this stage, we are not scraping yet, just defining what a valid output should look like.

Step 2: Fetching the page

This function is responsible for fetching the HTML of a given page. One important detail is that the Client is passed in rather than created inside the function. The client manages connections internally, so reusing them avoids unnecessary overhead. 

The .error_for_status() call ensures that failed HTTP responses do not silently pass through. Without it, a rate limit or error page could be treated as valid HTML and break your parser later. The ? operator keeps the function clean. Instead of manually checking errors at every step, it returns early if something goes wrong.

Step 3: Parsing the HTML

If you inspect Hacker News in your browser, you'll notice each post is split across two rows: the title lives in one <tr>, while the score and comment metadata sit in the next one. That's the main challenge. The parser handles this by first selecting each post row (tr.athing) to extract the title, URL, and domain, then moving to the next sibling row to grab the metadata.

Because HTML isn’t always consistent, the code uses and_then to safely chain operations without crashing when something is missing, and unwrap_or(0) to provide sensible defaults. You’ll also see trim() used throughout to clean up extra whitespace so the final output stays consistent.

Step 4: Looping through pages

This is where everything gets tied together. For each page, the code fetches the HTML, parses it into structured posts, and adds those posts to a growing list.

The print statements show the progress, and the short delay between requests helps avoid hitting the server too aggressively.

Step 5: Saving the results

Once all the data is collected, it’s converted into JSON and written to a file. serde_json::to_string_pretty() handles the conversion from Rust structs into readable JSON, so you don’t have to format anything manually.

The timestamp in the filename ensures each run creates a new file instead of overwriting the previous one. At this point, the pipeline is complete. You’ve taken raw HTML, turned it into structured data, and saved it in a format you can actually use.

Scaling up with async concurrency in Rust

The scraper we've built so far is sequential. It fetches page 1, waits for the response, parses it, and then moves to page 2 – rinse and repeat. It’s simple and predictable, but also inefficient, because most of the time is spent waiting on network responses.

To fix that, we’ll stop awaiting each request inside the loop and will instead create all the requests upfront, then wait for them together. 

In practice, this means replacing the existing page loop inside main.rs with a concurrent version:

use futures::future::join_all;

let tasks: Vec<_> = (1..=3)
    .map(|page| fetch_page(&client, page))
    .collect();

let results = join_all(tasks).await;

for result in results {
    if let Ok(html) = result {
        let posts = parse_posts(&html);
        all_posts.extend(posts);
    }
}

Note: join_all comes from the futures crate, so add futures = "0.3" to the [dependencies] section of Cargo.toml first.

Now, all requests are sent at once, and the program waits for all of them to complete before processing the results. This works well for small batches, but firing everything at once doesn’t scale safely. 

If you try to fetch too many pages at the same time, you’ll hit rate limits or get blocked. To avoid that, you can cap how many requests run at once using a semaphore, again replacing the same loop in main.rs.

A semaphore is a way to control access to a limited resource. In our case, the resource is how many requests can run at the same time. You define a limit of how many tasks are allowed to execute concurrently, and the rest will wait until a free request slot becomes available.

use futures::future::join_all;
use std::sync::Arc;
use tokio::sync::Semaphore;

let semaphore = Arc::new(Semaphore::new(5));

let tasks: Vec<_> = (1..=10)
    .map(|page| {
        let permit = semaphore.clone().acquire_owned();
        let client = client.clone();
        tokio::spawn(async move {
            // Wait for a free slot before fetching
            let _permit = permit.await.unwrap();
            fetch_page(&client, page).await
        })
    })
    .collect();

// Each task yields Result<Result<String, reqwest::Error>, JoinError>
for task in join_all(tasks).await {
    if let Ok(Ok(html)) = task {
        all_posts.extend(parse_posts(&html));
    }
}

Here, even though many tasks are created, only 5 will run at the same time. Each task must request permission before it runs, and once it finishes, that permit is released so another task can proceed.

It’s important to understand that concurrency improves speed, but it doesn’t solve the problem of detection. If all requests come from a single IP, sending them in parallel makes your scraper easier to block. 

In practice, this is where rotating proxies prove extremely useful, distributing requests across multiple IPs, so the traffic looks more like real users rather than a single automated client.

Collect data without interruptions

Decodo's residential proxy pool covers 115M+ ethically-sourced IPs across 195+ locations, built for scrapers that need to stay undetected at scale.

Handling dynamic and JavaScript-rendered pages

Everything we’ve done so far works because Hacker News is a static site. The HTML you get from the request is the same HTML you see in the browser. But that’s not how most modern websites work.

With frameworks like React or Next.js, the initial HTML is often just a shell. The real content is loaded later with JavaScript. If you try to scrape those pages with reqwest, you’ll get an empty <div id="root"> or <div id="app"> as a result, even though the page looks fully populated in your browser.

The easiest way to spot this is to compare what your scraper receives with what you see in the browser. If the content is missing in the raw HTML, but visible on the page, it means it’s being rendered dynamically. You can confirm this in the Network tab by looking for Fetch or XHR requests that return JSON, which is often where the real data lives.

Once you’ve identified that, you have a few options:

Option 1: Find the underlying API

Many sites fetch their data from JSON endpoints behind the scenes. If you can locate that request in the Network tab, you can call it directly from Rust using reqwest and parse the response with serde_json. This is faster, more reliable, and avoids HTML parsing entirely.

Option 2: Using fantoccini for WebDriver-based control

If there’s no accessible API, the next option is using a real browser. In Rust, that typically means a WebDriver client like fantoccini, which connects to a running ChromeDriver or GeckoDriver instance and executes JavaScript just like a normal browser. This gives you access to fully rendered pages, but it’s slower and adds setup complexity.

Option 3: Using Decodo's Web Scraping API

The final option is to use our Web Scraping API to handle rendering for you. It can fetch the page, execute JavaScript, and return fully rendered HTML that you can parse into JSON, Markdown, and other formats with the same scraper setup. This keeps your Rust code simple and avoids managing a browser process yourself.

Handling anti-bot protections and blocks in Rust

On most sites, you’ll eventually run into anti-bot measures. 

The usual signs include HTTP 429 responses when you’re rate-limited, 403 responses when your requests are flagged, or pages that return a 200 status but contain CAPTCHA challenges or incomplete data. These soft blocks are easy to miss if you’re only checking status codes.

The first step is making your requests look normal. That means setting a realistic User-Agent, including standard headers like Accept and Accept-Language, and maintaining cookies across requests. On top of that, you need basic retry logic. Instead of failing immediately, retry requests with a short backoff, especially for 429 and server errors, but cap the number of attempts.

As you scale, the main constraint becomes your IP. Sending many requests from a single source is easy to detect, especially when combined with concurrency. Using rotating residential proxies helps distribute requests across real user IPs, making traffic patterns harder to detect.

Using a service like Decodo residential proxies allows requests to be distributed across multiple IP addresses, making the traffic pattern look far more like real users. This becomes especially important when running concurrent scrapers, where request volume increases quickly.

Don’t want to manage infrastructure?

Decodo's Web Scraping API takes care of the fingerprinting, proxy rotation, and CAPTCHA bypassing, ensuring your data collection doesn’t stop.

Best practices and common pitfalls

Apply these best practices as you build, and watch out for the pitfalls that tend to surface once your scraper is running at scale.

1. Don’t parse selectors inside loops. Avoid calling Selector::parse() inside a hot loop. Every time you call it, Rust validates and compiles the selector string, which is unnecessary work if the selector never changes. The better approach is to compile your selectors once before the loop and reuse them. It’s a small change, but it removes wasted CPU cycles and makes your parser more efficient.

2. Avoid .unwrap() in DOM traversal. When extracting elements from the DOM, it’s tempting to use .unwrap() because it’s quick and works during testing. The problem is that real pages are not stable and their elements can change. Using safer patterns like .and_then() or matching on Option forces you to handle those cases properly. The code gets longer, but far more resilient in production.

3. Reuse your HTTP client. reqwest::Client is designed to manage connections efficiently, including connection pooling and reuse. Creating a new client for every request throws away those benefits and adds unnecessary overhead. Instead, create a single client and reuse it across your entire scraper. Cloning the client is cheap (it's reference-counted internally), so you can safely pass it around or share it across tasks without performance issues.

4. Be mindful of async ownership ('static). When working with tokio::spawn, each task must own the data it uses. You can’t borrow local variables from the surrounding scope because the task may outlive them. The fix is usually to move owned data into the task using async move, or to wrap shared state in something like Arc. Once you understand that spawned tasks need full ownership, these errors become much easier to reason about. 

5. Expect slower iteration, but more stable results. Rust can feel slower to work with at first, especially compared to Python. The compiler forces you to handle errors, lifetimes, and edge cases upfront, which can be frustrating during development. But that upfront friction is what prevents silent failures later. Once your scraper compiles and runs correctly, it tends to be far more stable and predictable over time.

Final thoughts

We’ve built a complete scraper in Rust, from fetching and parsing Hacker News to handling pagination, adding concurrency, and exporting structured data to JSON. Along the way, you’ve seen how Rust handles data modeling, error handling, and performance in a way that’s very different from more scripting-focused approaches.

Rust is a strong choice for scraping pipelines where performance and reliability matter. It forces you to think about edge cases early, which can slow down development at first, but leads to far more stable systems once everything is in place. 

At the same time, not every problem is worth solving from scratch. When dealing with JavaScript-heavy sites or more advanced anti-bot measures, it often makes sense to offload the harder parts. 

Using something like the Decodo Web Scraping API lets you keep your Rust code focused on parsing and processing, while the underlying infrastructure handles rendering, fingerprinting, and access.

About the author

Lukas Mikelionis

Senior Account Manager

Lukas is a seasoned enterprise sales professional with extensive experience in the SaaS industry. Throughout his career, he has built strong relationships with Fortune 500 technology companies, developing a deep understanding of complex enterprise needs and strategic account management.


Connect with Lukas via LinkedIn.

All information on Decodo Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Is Rust a good language for web scraping?

Rust is a strong choice for production-grade, high-throughput scraping pipelines where performance, memory efficiency, and concurrency matter. It requires more upfront setup than Python, but the result is a more reliable and scalable scraper, especially for long-running or large-scale workloads.

What is the best HTTP client for web scraping in Rust?

reqwest is the standard choice of HTTP client for web scraping using Rust. It’s async-first, well-maintained, and integrates naturally with Tokio. It supports TLS via rustls, handles redirects and cookies, and works well for most scraping scenarios.

Can Rust scrape JavaScript-rendered websites?

Not directly. reqwest only fetches raw HTML and doesn't execute JavaScript, so Rust on its own can't scrape dynamic websites. But there are options that enable scraping JavaScript-rendered sites:

  1. Using an underlying JSON API
  2. Using fantoccini with an external WebDriver process
  3. Using our Web Scraping API, which has JavaScript rendering built in

How do I avoid getting blocked when scraping with Rust?

Set realistic browser-style headers, implement exponential backoff for rate limit responses, respect request delays between pages, and use rotating residential proxies to distribute requests across IP addresses. For targets with advanced bot management, a managed scraping API is the better option as it handles TLS fingerprinting and HTTP/2 settings.

How does Rust web scraping compare to Python?

Rust allows for faster, more memory-efficient scrapers and handles concurrency without the GIL limitations of Python. Python, on the other hand, has a richer ecosystem of purpose-built scraping libraries and is faster to prototype with. For most developers, Python is the better starting point, but Rust becomes worthwhile when Python's performance or concurrency model creates bottlenecks.
