Elixir Web Scraping: A Practical Step-by-Step Guide

Elixir web scraping solves one of the hardest problems in high-volume data collection: concurrency without thread overhead. The BEAM virtual machine (Erlang's runtime) can run each HTTP request in a lightweight process rather than an OS thread, so you can fetch thousands of pages concurrently. If a process crashes, its supervisor can restart it automatically. This guide builds a complete Elixir scraper from scratch, covering static pages, paginated targets, JavaScript-heavy sites, and anti-bot countermeasures.

TL;DR: Build a simple Elixir scraper in minutes

Here's a complete Elixir web scraping script using two dependencies (Req and Floki) targeting the Hacker News "Who Is Hiring?" thread at https://news.ycombinator.com/item?id=40224213, a real, publicly accessible, plain-HTML page. 

Start by installing Elixir with Homebrew:

brew install elixir

Create a new Elixir project:

mix new scraper_project
cd scraper_project

Then, replace the deps/0 function near the bottom of the mix.exs file with the following:

defp deps do
  [
    {:req, "~> 0.5"},
    {:floki, "~> 0.36"}
  ]
end

Install the dependencies:

mix deps.get

Then create lib/quick_scrape.ex:

defmodule QuickScrape do
  def run do
    # Fetch the HN "Who Is Hiring?" thread
    url = "https://news.ycombinator.com/item?id=40224213"

    case Req.get(url) do
      {:ok, %{status: 200, body: html}} ->
        # Parse into a queryable tree
        {:ok, doc} = Floki.parse_document(html)

        # Extract first 5 job listing text blocks
        doc
        |> Floki.find(".commtext")
        |> Enum.take(5)
        |> Enum.map(fn node ->
          node
          |> Floki.text()
          |> String.slice(0, 120)
        end)
        |> Enum.each(&IO.puts/1)

      {:ok, %{status: status}} ->
        IO.puts("Request failed with status #{status}")

      {:error, reason} ->
        IO.inspect(reason, label: "Request failed")
    end
  end
end

Run the scraper:

iex -S mix

Then inside the interactive shell:

QuickScrape.run()

The output will look something like this (the exact listings depend on which comments the thread currently shows first):

Stripe | Software Engineer, Payments Infrastructure | Remote | Full-time...
Shopify | Senior Backend Developer | Toronto or Remote | $160K-$210K...
Linear | Product Engineer | San Francisco | Competitive salary...
Vercel | Infrastructure Engineer | Remote (US) | Full-time...
Anthropic | ML Researcher | San Francisco | Full-time...

Note: Hacker News may occasionally rate limit requests and return a 429 Too Many Requests status code. If that happens, wait a bit and try running the scraper again.

That’s how Elixir web scraping works end-to-end. The rest of this guide builds on this foundation – adding pagination, JavaScript rendering, anti-bot handling, and persistent storage for production use.

Why choose Elixir for web scraping

Elixir runs on the BEAM virtual machine, which gives scrapers something Python and Ruby web scraping can't match natively: each HTTP request runs in its own isolated, lightweight process. Thousands of concurrent fetches become straightforward. 

When a process crashes, the supervisor automatically restarts it. Elixir’s pattern matching also keeps parsing logic clean and concise. Combined with libraries like Crawly for large-scale crawling, Floki for HTML parsing, and Req or HTTPoison for HTTP requests, the Elixir ecosystem covers everything needed to build production-grade web scrapers.

The BEAM virtual machine

Each HTTP request runs in its own lightweight BEAM process, using around 2 KB of memory, compared to 1 MB or more per OS thread. You can run tens of thousands of concurrent fetches on a single machine without configuring thread pools, adjusting limits, or working around the Python GIL.

The BEAM scheduler automatically distributes process execution across CPU cores. When you're scraping large sites with thousands of pages, this difference becomes the main reason to choose scraping data with Elixir over Python alternatives.
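To make the process model concrete, here is a minimal sketch that uses only the standard library. The fake_fetch/1 function is a hypothetical stand-in for a real HTTP call (it just sleeps for 50 ms), and the example.com URLs are placeholders. Task.async_stream/3 runs each call in its own BEAM process and caps how many run at once:

```elixir
defmodule ConcurrencyDemo do
  # Hypothetical stand-in for an HTTP fetch: waits briefly, returns a fake body.
  def fake_fetch(url) do
    Process.sleep(50)
    {url, "<html>#{url}</html>"}
  end

  # Runs up to `max_concurrency` fetches at once, each in its own BEAM process.
  # Results come back in the same order as the input URLs.
  def fetch_all(urls, max_concurrency \\ 100) do
    urls
    |> Task.async_stream(&fake_fetch/1, max_concurrency: max_concurrency, timeout: 5_000)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

urls = for n <- 1..1_000, do: "https://example.com/page/#{n}"
{micros, results} = :timer.tc(fn -> ConcurrencyDemo.fetch_all(urls) end)
IO.puts("Fetched #{length(results)} pages in #{div(micros, 1000)} ms")
```

Run sequentially, 1,000 calls at 50 ms each would take about 50 seconds. With 100 in flight at a time, the same work should finish in well under a second on most machines, without any thread pool configuration.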

Fault tolerance through supervisors

If a request process crashes, whether due to a timeout, a malformed response, or an unexpected HTML structure that your parser doesn't handle, the supervisor restarts it automatically. 

The crash is isolated to that single process. No other concurrent fetch is affected. Your scraper keeps running without manual intervention, which matters enormously on long crawl jobs that run for hours across millions of URLs. 

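Python scrapers need explicit exception handlers around each worker for the same resilience. In Elixir, the supervision tree handles crash recovery structurally. You define what should happen when things go wrong once, at the application level, and the runtime enforces it for every worker process. Keep in mind that a restart gives you a clean worker, not an automatic retry of the URL that failed: you still decide whether that URL gets requeued.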
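The isolation described above can be sketched with the standard library's Task.Supervisor. The fetch/1 function below is hypothetical: it raises for one "bad" input to simulate a parser hitting unexpected HTML. This is an illustration of crash isolation, not how Crawly is implemented internally:

```elixir
defmodule IsolationDemo do
  # Hypothetical stand-in for fetch + parse; raises for one "bad" URL.
  def fetch("bad"), do: raise("unexpected HTML structure")
  def fetch(url), do: {:fetched, url}

  def run(urls) do
    {:ok, sup} = Task.Supervisor.start_link()

    # The *_nolink variant keeps a crashing task from taking the caller down.
    # A crashed task shows up in the stream as {:exit, reason} instead.
    sup
    |> Task.Supervisor.async_stream_nolink(urls, &fetch/1, on_timeout: :kill_task)
    |> Enum.zip(urls)
    |> Enum.map(fn
      {{:ok, result}, _url} -> result
      {{:exit, _reason}, url} -> {:failed, url}
    end)
  end
end

IO.inspect(IsolationDemo.run(["a", "bad", "c"]))
```

The crash is logged, the other two fetches complete normally, and the caller gets back a list it can use to decide which URLs to requeue.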

Pattern matching makes parsers readable

Elixir's pattern matching lets you write extraction logic that reads like a specification. For instance, when you match on {:ok, %Req.Response{status: 200, body: body}}, you're simultaneously checking the result type, verifying the HTTP status, and binding the body to a variable, all in one expression. 

The equivalent Python code requires separate isinstance checks, status comparisons, and variable assignments.
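Here is a small sketch of that idea using plain maps in place of %Req.Response{} structs, so it runs without any dependencies. The module name and the return tags (:retry, :http_error) are illustrative choices, not part of Req:

```elixir
defmodule ResponseDemo do
  # Each clause checks the wrapper, the status, and binds what it needs
  # in a single function head.
  def classify({:ok, %{status: 200, body: body}}), do: {:ok, body}
  def classify({:ok, %{status: 429}}), do: {:retry, :rate_limited}
  def classify({:ok, %{status: status}}) when status in 500..599, do: {:retry, :server_error}
  def classify({:ok, %{status: status}}), do: {:error, {:http_error, status}}
  def classify({:error, reason}), do: {:error, reason}
end

IO.inspect(ResponseDemo.classify({:ok, %{status: 200, body: "<html></html>"}}))
IO.inspect(ResponseDemo.classify({:ok, %{status: 404, body: ""}}))
```

Clause order matters: the first head that matches wins, so the specific cases (200, 429, 5xx) come before the catch-all status clause.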

For HTML parsing, this means your error handling and your happy-path logic sit side by side in readable case blocks. The parser communicates its intent clearly, which makes debugging significantly easier when a site changes its HTML structure and your selectors stop returning expected data.

The Elixir scraping ecosystem

Web scraping with Elixir and Crawly gives you a full spider framework with URL deduplication, concurrent request management, middleware for headers and rate limiting, and configurable pipelines for output. 

Scraping data with Elixir and Floki handles HTML parsing with a clean, well-documented CSS selector API. HTTPoison wraps the battle-tested hackney HTTP client. Req brings modern features like built-in retries, compression, and middleware composition. Jason handles JSON encoding and decoding efficiently.

These libraries are all actively maintained and production-ready. The ecosystem is smaller than Python's, but the core scraping stack is solid and covers the vast majority of real-world scraping use cases.

Here are the trade-offs you must know when using Elixir for web scraping:

  • Smaller ecosystem than Python. Elixir has fewer third-party scraping/parsing tools compared to stacks like Scrapy or Beautiful Soup.
  • Less learning material. There are fewer tutorials, guides, and community examples available on Elixir.
  • Steeper initial learning curve. Especially for teams unfamiliar with functional programming concepts.
  • Different programming paradigm. Elixir’s use of immutable data, pipe operators, and pattern matching takes time to get used to.
  • Slower initial productivity. As a developer, you may need time before you can think naturally in Elixir’s data-flow style.

These drawbacks are minimal for teams already experienced or familiar with Elixir. Despite ecosystem gaps, Elixir’s strong concurrency model can outweigh downsides for large-scale scraping systems.

Prerequisites and project setup

Before starting Elixir web scraping, you need a working development environment and a structured project.

Install Elixir, create a Mix project with a supervision tree, add your dependencies, and organize files so your scraper remains maintainable as it grows.

Note: You need to get the project structure right so that every code block that follows compiles and runs without surprises. We’ll use a Hacker News jobs scraper as the running example.

Elixir web scraping environment setup

Elixir compiles to BEAM bytecode, so it requires Erlang to run. A convenient way to install both is asdf, a version manager that handles multiple runtime versions without polluting your system paths.

Install Elixir and Erlang:

# Install asdf (macOS via Homebrew)
brew install asdf
# Add Erlang and Elixir plugins
asdf plugin add erlang
asdf plugin add elixir
# Install compatible versions (Elixir 1.16 supports OTP 24 through 26)
asdf install erlang 26.2.2
asdf install elixir 1.16.2-otp-26
# Set as global defaults
# (asdf 0.16 and later replaced `asdf global` with `asdf set -u`)
asdf global erlang 26.2.2
asdf global elixir 1.16.2-otp-26
# Verify the installation
elixir --version
# Elixir 1.16.2 (compiled with Erlang/OTP 26)

This is what’s happening in the script above:

  • brew install asdf uses Homebrew, a macOS package manager, to install asdf, a version manager that lets you install and switch between multiple versions of programming languages.
  • The "Add Erlang and Elixir plugins" block tells asdf you want to manage Erlang and Elixir. Each language needs a plugin so asdf knows how to install and switch versions. The install commands that follow fetch a compatible pair: Erlang 26.2.2 and Elixir 1.16.2-otp-26.
  • The "Set as global defaults" block makes these versions the default on your system. Any time you run Elixir or Erlang, these versions are used unless a project overrides them locally.
  • elixir --version checks that everything worked. The expected output shows the Elixir version and the Erlang/OTP release it was compiled against.

On Ubuntu/Debian, install via the official Erlang Solutions repository or use asdf with the same commands. On Windows, use the official installers from elixir-lang.org – they bundle Erlang automatically.

Now, create your project with a supervision tree included. The --sup flag generates a supervisor module, which is the foundation of Elixir's fault-tolerance model:

mix new job_scraper --sup
cd job_scraper

Mix is Elixir's build tool. It handles project scaffolding, dependency management, compilation, and running tests. The mix.exs file at the root of your project is where you declare dependencies and configure your application. Think of it as the equivalent of Python's pyproject.toml or Node's package.json.

Dependencies

Open mix.exs and replace the deps/0 function with the following. Each library covers a specific part of the scraping stack:

defp deps do
  [
    {:crawly, "~> 0.17"},    # Spider framework: request scheduling, deduplication, pipelines
    {:floki, "~> 0.36"},     # HTML parsing with CSS selector support
    {:req, "~> 0.5"},        # Modern HTTP client with built-in retries
    {:httpoison, "~> 2.0"},  # Alternative HTTP client (wraps hackney)
    {:jason, "~> 1.4"}       # JSON encoding and decoding
  ]
end

Then, install the dependencies by running:

mix deps.get

Mix downloads and compiles each dependency from Hex, Elixir's package registry. The first run takes a minute or two. Subsequent runs only fetch changed packages. After installation, run mix compile to verify everything compiles cleanly before writing any scraping code.

Here's what each library does in practice:

  • Crawly. The spider framework that manages URL queues, deduplication, concurrent request scheduling, middleware, and output pipelines.
  • Floki. The standard HTML parser for Elixir that turns raw HTML strings into a queryable tree you can traverse with CSS selectors.
  • Req. The modern HTTP client you'll use in most examples. Composable, supports middleware, and handles retries cleanly. Recommended for new projects.
  • HTTPoison. Widely-used alternative HTTP client worth knowing. Wider community adoption and more Stack Overflow answers.
  • Jason. Fast JSON encoding. You'll use it to write scraped data as JSON output or to decode API responses.

In most cases, you don’t need both Req and HTTPoison. If you’re starting fresh, stick with :req. It’s cleaner and actively evolving.

Only keep :httpoison if a library you use requires it, or you already have code written around it.

Project structure

Organize the project so each concern lives in its own folder. This will make it fast to find and fix broken selectors when a target site changes its HTML.

job_scraper/
├── lib/
│   └── job_scraper/
│       ├── application.ex       # OTP Application + supervisor tree
│       ├── spiders/             # Crawly spider modules
│       │   └── hn_spider.ex     # Hacker News jobs spider
│       ├── parsers/             # HTML extraction logic
│       │   └── hn_parser.ex     # Selectors for HN HTML structure
│       └── pipeline/            # Post-extraction processing
│           ├── csv_writer.ex
│           └── database_writer.ex
├── config/
│   └── config.exs               # Crawly + app configuration
└── mix.exs

The spiders/ folder defines crawl behavior: where to start, what domain to stay in, and how to extract data. 

The parsers/ folder isolates all CSS selectors in one place, so when a site changes its HTML, you update a single file rather than hunting through spider code. 

The pipeline/ folder handles what happens after extraction: writing to CSV, JSON files, or a database.

This separation is important when you maintain scrapers for multiple targets, or when the same data source needs to feed different output formats.

Sending HTTP requests with Elixir

Every Elixir web scraping workflow starts with an HTTP GET request. Elixir offers 3 mature HTTP client libraries (HTTPoison, Req, and Finch) to help send requests efficiently while handling headers, retries, and responses. Choosing the right one depends on your scraping scale and how much control you need over connection management.

This section compares all 3, then walks you through a complete working example with proper error handling and browser-realistic headers.

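| Library   | Best for                          | Key strength                   | Trade-off                       |
|-----------|-----------------------------------|--------------------------------|---------------------------------|
| HTTPoison | General-purpose scraping          | Wide adoption, great docs      | Older API, less composable      |
| Req       | Most new scraping projects        | Built-in retries, middleware   | Fewer tutorials online          |
| Finch     | High-volume, single-domain crawls | Direct connection pool control | Lower-level, more config needed |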

Note: Use Req for most Elixir HTTP requests in scraping projects. Its middleware architecture makes it easy to add retry logic, set default headers, and handle compression without boilerplate. 

Fall back to Finch only when you need direct connection pool control at very high request volumes, for example, crawling a single large domain where reusing TCP connections per-IP matters.

Meanwhile, use HTTPoison if your team is more familiar with it or you need the broader range of community examples online.

Making a basic GET request

Here's a complete fetcher module targeting quotes.toscrape.com, a site built for scraping practice with stable, predictable HTML. 

This is the module you'll reuse throughout the guide. It wraps Req with browser-realistic headers and structured error returns so every HTTP failure is handled explicitly rather than crashing your spider mid-crawl.

Install the dependency first if you haven't already:

# mix.exs deps already include {:req, "~> 0.5"} -- run:
mix deps.get

Create lib/job_scraper/fetcher.ex:

defmodule JobScraper.Fetcher do
  # Accept-Encoding is deliberately left out: Req sets it itself and only
  # advertises encodings it can actually decompress.
  @browser_headers [
    {"User-Agent",
     "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 " <>
       "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"},
    {"Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},
    {"Accept-Language", "en-US,en;q=0.9"}
  ]

  def fetch(url, retries \\ 3)
  def fetch(_url, 0), do: {:error, :max_retries_exceeded}

  def fetch(url, retries) do
    case Req.get(url,
           headers: @browser_headers,
           max_redirects: 5,
           receive_timeout: 10_000,
           connect_options: [timeout: 5_000],
           # Turn off Req's own retries so the loop below is the only retry layer
           retry: false
         ) do
      {:ok, %Req.Response{status: 200, body: body}} ->
        {:ok, body}

      {:ok, %Req.Response{status: 429}} ->
        # Rate limited: pause before trying again
        Process.sleep(1_000)
        fetch(url, retries - 1)

      {:ok, %Req.Response{status: status}} when status in 500..599 ->
        fetch(url, retries - 1)

      {:ok, %Req.Response{status: status}} ->
        {:error, {:http_error, status}}

      {:error, _exception} ->
        # Transport-level failure (timeout, DNS, closed connection)
        fetch(url, retries - 1)
    end
  end
end

The pattern match on {:ok, %Req.Response{status: 200, body: body}} does 3 things simultaneously: checks the result wrapper is :ok, verifies the HTTP status is exactly 200, and binds the response body to the body variable. 

Non-200 responses each hit their own branch: a 429 (rate limited) error is handled differently from a 404 (not found).
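A fixed one-second pause is the simplest way to react to a 429. A common refinement is exponential backoff, where the wait doubles after each failed attempt. The sketch below is standalone and uses only the standard library. The module name, the {:retry, reason} convention, and the 500 ms base delay are assumptions made for illustration, not part of Req:

```elixir
defmodule BackoffDemo do
  # Delay before retry number `attempt` (0-based): base * 2^attempt, capped at max_ms.
  def delay_ms(attempt, base_ms \\ 500, max_ms \\ 8_000) do
    min(base_ms * Integer.pow(2, attempt), max_ms)
  end

  # Calls `fun` up to `max_attempts` times (max_attempts >= 1).
  # `fun` must return {:ok, value}, {:error, reason}, or {:retry, reason}.
  def with_retries(fun, max_attempts, base_ms \\ 500) do
    attempt(fun, 0, max_attempts, base_ms)
  end

  defp attempt(fun, n, max, base_ms) do
    case fun.() do
      {:ok, value} ->
        {:ok, value}

      {:error, reason} ->
        {:error, reason}

      # Out of attempts: give up without sleeping again
      {:retry, _why} when n + 1 >= max ->
        {:error, :max_retries_exceeded}

      {:retry, _why} ->
        Process.sleep(delay_ms(n, base_ms))
        attempt(fun, n + 1, max, base_ms)
    end
  end
end

IO.inspect(Enum.map(0..5, &BackoffDemo.delay_ms/1))
```

In the fetcher, the 429 and 5xx branches would map to {:retry, reason}, while a 404 maps to {:error, reason} so it fails fast instead of being retried.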

Always set a realistic User-Agent. Default client strings such as req/0.5.0 (Req) or hackney/1.18.1 (HTTPoison) identify your scraper as a script on every request, and many sites block known bot strings.

Include Accept and Accept-Language headers to make your requests look closer to what a real browser sends.

Test the fetcher by running it in IEx to verify the response before using it inside a spider:

iex -S mix
# Fetch a page
{:ok, html} = JobScraper.Fetcher.fetch("https://quotes.toscrape.com")
# Check first 200 characters
String.slice(html, 0, 200)

Parsing HTML with Floki

This section covers scraping data with Elixir and Floki.

Floki is the standard HTML parser for Elixir web scraping. It converts raw HTML strings into a queryable tree, lets you select nodes by class or element, and extracts text or attribute values in clean, chainable function calls using CSS selectors. 

Once you understand 4 core functions (parse_document/1, find/2, text/1, and attribute/2), you can extract most of the data you'll need from a static HTML page.

Parse a document

Web scraping with Elixir requires parsing HTML into structured data.

Pass the HTML body string from your fetcher directly to Floki.parse_document/1. This function returns a tagged tuple – {:ok, doc} on success or an error if the HTML is malformed. In practice, Floki handles even poorly-formed HTML gracefully because it uses a lenient parser:

{:ok, html} = JobScraper.Fetcher.fetch("https://quotes.toscrape.com")
{:ok, doc} = Floki.parse_document(html)
# doc is now a nested list representing the HTML tree
# Example structure: [{"html", [], [{"head", [], [...]}, {"body", [], [...]}]}]

Every subsequent Floki function takes doc as its first argument and returns nodes matching your selector. Treat doc as your parsed document object, equivalent to what you'd get from a BeautifulSoup(html, 'html.parser') call in Python.
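Because the parsed document is ordinary Elixir data ({tag, attributes, children} tuples with text stored as plain strings), you can walk it with pattern matching. The sketch below collects text from a hand-written tree without calling Floki at all. It is a simplified illustration of the data shape, not Floki's own implementation, and it ignores node types such as comments:

```elixir
defmodule TreeDemo do
  # A text node is just a string
  def text(node) when is_binary(node), do: node
  # An element: recurse into its children
  def text({_tag, _attrs, children}), do: text(children)
  # A list of nodes: join the text of each
  def text(nodes) when is_list(nodes), do: Enum.map_join(nodes, "", &text/1)
end

tree = [
  {"html", [],
   [
     {"body", [],
      [{"p", [{"class", "quote"}], ["Hello, ", {"b", [], ["world"]}]}]}
   ]}
]

IO.puts(TreeDemo.text(tree))
```

In real code, prefer Floki.text/1, which handles the node types this sketch skips.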

Select nodes with CSS selectors

Use Floki.find/2 with any CSS selector to pull matching nodes. Use the same selectors you'd verify in browser DevTools – right-click an element, select Inspect, and copy the selector.

Here's a complete example extracting all quotes, authors, and tags from quotes.toscrape.com:

{:ok, html} = JobScraper.Fetcher.fetch("https://quotes.toscrape.com")
{:ok, doc} = Floki.parse_document(html)
# Each .quote div contains one complete quote block
quote_nodes = Floki.find(doc, ".quote")
# Extract text from each quote's nested .text span
quote_texts = Floki.find(doc, ".quote .text")
# Extract author names
authors = Floki.find(doc, ".quote .author")
# Extract all tag links inside quotes
tags = Floki.find(doc, ".quote .tags .tag")
IO.inspect(length(quote_nodes), label: "Quotes found")
# Quotes found: 10

Floki supports the CSS selectors you'll use most: class selectors (.quote), element selectors (a), descendant selectors (.quote .text), and attribute selectors ([href]). It implements a subset of what browsers accept, so a selector copied from DevTools may occasionally need simplifying before Floki matches it.

Extract text content

Call Floki.text/1 on any node or list of nodes to get their plain text with all HTML tags stripped. When you pass it a list of matched nodes, it concatenates all their text into a single string, so map over the list when you want one string per node:

# Get text from a single node
first_quote_node = Floki.find(doc, ".quote .text") |> List.first()
first_quote_text = Floki.text(first_quote_node)
# => "“The world as we have created it is a process of our thinking.”"
# Get text from each node separately
all_texts =
doc
|> Floki.find(".quote .text")
|> Enum.map(&Floki.text/1)
|> Enum.map(&String.trim/1)
# Print all quote texts
Enum.each(all_texts, &IO.puts/1)

Always pipe through String.trim/1 after Floki.text/1. HTML whitespace (indentation, newlines around inline elements) often ends up in the text content and will pollute your output if you don't strip it.
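String.trim/1 only removes whitespace at the start and end of a string. When a text block contains line breaks and indentation in the middle, a small helper can collapse those runs as well. TextClean.squish/1 is a hypothetical name used for this sketch:

```elixir
defmodule TextClean do
  # String.split/1 splits on any run of whitespace and drops empty pieces,
  # so joining with a single space collapses inner newlines and indentation.
  def squish(text) do
    text
    |> String.split()
    |> Enum.join(" ")
  end
end

raw = "\n    Senior Engineer\n      Remote   (US)\n  "
IO.inspect(String.trim(raw))
IO.inspect(TextClean.squish(raw))
```

Use it only where inner line breaks carry no meaning, since it flattens them too.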

Extract attribute values

Use Floki.attribute/2 to pull HTML attribute values from matched nodes. The most common use case is extracting href values from anchor tags for link discovery. You need those URLs to queue the next requests in your spider:

# Extract all href values from anchor tags
all_hrefs =
doc
|> Floki.find("a")
|> Floki.attribute("href")
# => ["/", "/login", "/tag/love/", "/author/Albert-Einstein/", ...]
# Extract the src attribute from all images
image_urls =
doc
|> Floki.find("img")
|> Floki.attribute("src")
# Extract data attributes
job_ids =
doc
|> Floki.find(".job-listing[data-id]")
|> Floki.attribute("data-id")

Chain Floki calls with the pipe operator

Elixir's pipe operator (|>) makes extraction pipelines read left to right. The output of one function becomes the input of the next. 

For scraping data with Elixir and Floki, this is the idiomatic style: select a parent node, then scope all child selections to that node. 

Here's a complete extractor for quote cards on quotes.toscrape.com:

defmodule JobScraper.Parsers.QuoteParser do
  def extract_all(doc) do
    doc
    |> Floki.find(".quote")
    |> Enum.map(&extract_quote/1)
  end

  defp extract_quote(quote_node) do
    %{
      text: quote_node |> Floki.find(".text") |> Floki.text() |> String.trim(),
      author: quote_node |> Floki.find(".author") |> Floki.text() |> String.trim(),
      link: quote_node |> Floki.find(".author ~ a") |> Floki.attribute("href") |> List.first(),
      tags: quote_node |> Floki.find(".tag") |> Enum.map(&Floki.text/1)
    }
  end
end

Each extract_quote/1 call receives a single quote node and extracts all sub-fields from it. Scoping your Floki calls to a parent node, rather than searching the whole document, prevents selector collisions when the same class name appears in different contexts on the page.

Handle missing elements safely

When Floki.find/2 matches nothing, it returns an empty list instead of raising an error. Calling Floki.text/1 on an empty list returns an empty string, which is usually acceptable. But calling List.first/1 on an empty list returns nil, which causes a crash if you then try to call a string function on it. Guard against this pattern:

# Unsafe -- crashes if selector matches nothing
href = doc |> Floki.find("a.next") |> Floki.attribute("href") |> List.first()
URI.merge(base_url, href) # crashes if href is nil

# Safe -- handle nil explicitly (place this inside your parser module)
defp safe_next_url(doc, base_url) do
  case doc |> Floki.find("a.next") |> Floki.attribute("href") |> List.first() do
    nil -> nil # No next page
    href -> URI.merge(base_url, href) |> URI.to_string()
  end
end

# Or use Enum.empty?/1 before extracting
nodes = Floki.find(doc, ".job-title")
title = if Enum.empty?(nodes), do: nil, else: Floki.text(nodes)

Complex selectors: what to do when CSS isn't enough

CSS selectors cover most cases, but some HTML structures call for traversal that standard CSS can't express, such as selecting a parent element or finding elements by their text content. Floki only understands CSS selectors and has no XPath API, so these cases are handled with Floki's own :fl-contains pseudo-class or with a little Elixir filtering after Floki.find/2:

# Find links whose text contains 'Next' (Floki-specific pseudo-class)
next_links = Floki.find(doc, "a:fl-contains('Next')")
# Select each <li> that contains a specific <a> tag: find candidates, then filter
parent_items =
  doc
  |> Floki.find("li")
  |> Enum.filter(fn li -> Floki.find(li, "a.job-title") != [] end)
# Find all elements with a specific data attribute value
featured = Floki.find(doc, "[data-featured='true']")

If you genuinely need XPath expressions, a different parsing library such as Meeseeks supports them. Use that route sparingly: CSS selectors are more readable for the common cases, and staying with one parser keeps the project simpler.

Building a web crawler with Crawly

Crawly is the Elixir scraping framework that turns a single-page fetcher into a full crawl engine. Web scraping with Elixir and Crawly gives you URL deduplication (so you never request the same page twice), concurrent request management across multiple pages, middleware for headers and rate limiting, and pipelines that process extracted data before writing it to disk or a database. 

A Crawly spider is a single Elixir module with 3 required callbacks that define where to crawl, what domain to stay in, and how to extract data from each response.

We’ll use the Open Library search results page (openlibrary.org) for book metadata extraction.

Spider structure

Here is a spider targeting openlibrary.org, a real, publicly accessible library catalog with clean, stable HTML and no aggressive bot detection. It extracts book metadata and follows pagination. Create lib/job_scraper/spiders/open_library_spider.ex:

defmodule JobScraper.Spiders.OpenLibrarySpider do
  use Crawly.Spider

  # base_url/0 -- domain scoping. Crawly only follows links within this domain.
  # This prevents your spider from accidentally crawling the entire internet.
  @impl Crawly.Spider
  def base_url(), do: "https://openlibrary.org"

  # init/0 -- seed URLs. Where your crawl begins.
  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://openlibrary.org/search?q=elixir+programming&sort=rating"]]
  end

  # parse_item/1 -- core logic. Called for every fetched page.
  # Returns items (extracted data) and requests (URLs to crawl next).
  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, doc} = Floki.parse_document(response.body)

    books =
      doc
      |> Floki.find(".searchResultItem")
      |> Enum.map(&extract_book/1)
      |> Enum.reject(fn b -> is_nil(b.title) or b.title == "" end)

    # Discover the next page link
    next_requests =
      doc
      |> Floki.find("a.ChoosePage[rel='next']")
      |> Floki.attribute("href")
      |> List.first()
      |> case do
        nil ->
          []

        path ->
          full_url = "https://openlibrary.org" <> path
          [Crawly.Utils.request_from_url(full_url)]
      end

    %Crawly.ParsedItem{items: books, requests: next_requests}
  end

  defp extract_book(node) do
    %{
      title: node |> Floki.find(".resultTitle a") |> Floki.text() |> String.trim(),
      author: node |> Floki.find(".bookauthor a") |> Floki.text() |> String.trim(),
      year: node |> Floki.find(".publishedYear") |> Floki.text() |> String.trim(),
      url: node |> Floki.find(".resultTitle a") |> Floki.attribute("href") |> List.first()
    }
  end
end

This is an Elixir web scraper (spider) built using the Crawly library to crawl Open Library search results and extract information about books related to “Elixir programming.”

The parse_item/1 callback does 2 jobs every time it runs: extract structured data from the current page, and return a list of new URLs to crawl. 

The %Crawly.ParsedItem{} struct bundles both. items contains your extracted data maps, while requests contains Crawly.Request structs for the next URLs.

Crawly calls parse_item/1 for each URL in its request queue. 

When requests is an empty list, the spider has nowhere new to go and will eventually finish. When items is empty, but requests is non-empty, Crawly keeps crawling without extracting – useful for navigation-only pages.
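The loop that Crawly runs around parse_item/1 can be pictured as a queue of URLs plus a set of URLs already visited. The sketch below models that idea in plain Elixir. The pages map is a hypothetical stand-in for fetching and parsing, and this is a conceptual model rather than Crawly's actual implementation:

```elixir
defmodule MiniCrawl do
  # `pages` maps url => {items, links}; it stands in for fetch + parse_item/1.
  def crawl(pages, start_url), do: loop([start_url], MapSet.new(), [], pages)

  # Queue is empty: nothing left to crawl, return items in discovery order
  defp loop([], _seen, items, _pages), do: Enum.reverse(items)

  defp loop([url | rest], seen, items, pages) do
    if MapSet.member?(seen, url) do
      # Already visited: skip it (deduplication)
      loop(rest, seen, items, pages)
    else
      {page_items, links} = Map.get(pages, url, {[], []})
      loop(rest ++ links, MapSet.put(seen, url), Enum.reverse(page_items) ++ items, pages)
    end
  end
end

pages = %{
  "/search?page=1" => {["Book A", "Book B"], ["/search?page=2", "/search?page=1"]},
  "/search?page=2" => {["Book C"], ["/search?page=1"]}
}

IO.inspect(MiniCrawl.crawl(pages, "/search?page=1"))
```

Page 2 links back to page 1, but the visited set stops the crawl from looping, and the run ends once the queue is empty.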

Configuration in config.exs

Crawly's behavior is configured in config/config.exs. This is where you set User-Agent rotation, output pipelines, concurrency limits, and middleware. Open the file and add the Crawly settings:

import Config

config :crawly,
  middlewares: [
    # Stay within the base_url domain
    Crawly.Middlewares.DomainFilter,
    # Skip URLs already in the queue
    Crawly.Middlewares.UniqueRequest,
    # Rotate User-Agent on every request
    {Crawly.Middlewares.UserAgent,
     user_agents: [
       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/121.0.0.0 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/121.0.0.0 Safari/537.36",
       "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
     ]},
    Crawly.Middlewares.RequestOptions
  ],
  pipelines: [
    # Validate required fields before writing
    {Crawly.Pipelines.Validate, fields: [:title, :url]},
    # Drop duplicates based on URL
    {Crawly.Pipelines.DuplicatesFilter, item_id: :url},
    # Encode each item as JSON; WriteToFile expects already-encoded strings
    Crawly.Pipelines.JSONEncoder,
    # Write one JSON object per line to /tmp
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ],
  # Maximum concurrent requests to any single domain
  concurrent_requests_per_domain: 4,
  # Stop after 500 items (remove for production full crawls)
  closespider_itemcount: 500

Start concurrent_requests_per_domain at 2 to 4 for any new target. Crawling faster than the site allows can trigger rate limiting and IP blocks. Increase it only after confirming the site can handle your request rate without degrading.

The WriteToFile pipeline writes 1 JSON object per line to /tmp, so you can open the .jl file while the crawl runs and watch results appear in real time. The Validate pipeline drops items missing required fields before they reach storage, which keeps your output clean even when individual pages have incomplete data.

The closespider_itemcount setting is useful during development to limit how long a test run takes.

Running the spider

Start an IEx session with your project loaded and use Crawly's engine API to control the spider:

# Open the interactive Elixir shell with your project loaded
iex -S mix
# Start the spider -- Crawly begins fetching and extracting
Crawly.Engine.start_spider(JobScraper.Spiders.OpenLibrarySpider)
# Check how many items have been stored so far
Crawly.DataStorage.stats(JobScraper.Spiders.OpenLibrarySpider)
# => {:stored_items, 47}
# List the spiders that are currently running
Crawly.Engine.running_spiders()
# Graceful shutdown when you're done
Crawly.Engine.stop_spider(JobScraper.Spiders.OpenLibrarySpider)

With the configuration above, Crawly writes output to a .jl file in /tmp whose name starts with the spider's module name. Each line in the output file is a complete JSON object representing one extracted item. Open the file in a text editor or pipe it through jq to inspect results as the crawl runs:

head -5 /tmp/*OpenLibrarySpider*.jl

Scraping paginated websites with Elixir

Web scraping pagination is the most common structural challenge in web scraping. Most real-world targets, including job boards, product catalogs, and search results, split their data across multiple pages.

Web scraping in Elixir handles pagination naturally because your spider's parse_item/1 callback returns new URLs as data, and Crawly queues them automatically. There are 3 distinct pagination patterns, and each needs a different strategy:

Pattern 1: Page number parameters (?page=N)

The simplest pagination pattern puts the page number directly in the URL as a query parameter. When you can see the total number of pages on the first page (usually shown as "Page 1 of N" or a total results count), generate all page URLs upfront from your init/0 callback. When you don't, crawl until the items list is empty:

# Option A: Generate all page URLs from init/0 if you know the page count
def init() do
  pages =
    Enum.map(1..20, fn n ->
      "https://news.ycombinator.com/jobs?page=#{n}"
    end)

  [start_urls: pages]
end

# Option B: Crawl until results stop (when page count is unknown)
def parse_item(response) do
  {:ok, doc} = Floki.parse_document(response.body)
  items = JobScraper.Parsers.HNParser.parse(doc)

  # Build next page URL only if this page returned results
  next_requests =
    if Enum.empty?(items) do
      # No results = we've gone past the last page
      []
    else
      current_page = extract_page_number(response.request.url)

      [
        Crawly.Utils.request_from_url(
          "https://news.ycombinator.com/jobs?page=#{current_page + 1}"
        )
      ]
    end

  %Crawly.ParsedItem{items: items, requests: next_requests}
end

defp extract_page_number(url) do
  uri = URI.parse(url)
  params = URI.decode_query(uri.query || "")
  String.to_integer(Map.get(params, "page", "1"))
end

Option B is more robust for targets where the total page count changes over time, like a job board that adds new listings daily. The spider naturally stops when it reaches a page with no results.
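The page-number arithmetic is easy to check in isolation. This standalone sketch builds the next page's URL with the standard library's URI functions. The PageParam module name and the example.com URLs are placeholders for illustration:

```elixir
defmodule PageParam do
  # Returns the same URL with its ?page=N parameter incremented by one.
  # A URL without a page parameter is treated as page 1.
  def next_page_url(url) do
    uri = URI.parse(url)
    params = URI.decode_query(uri.query || "")
    page = params |> Map.get("page", "1") |> String.to_integer()

    query =
      params
      |> Map.put("page", Integer.to_string(page + 1))
      |> URI.encode_query()

    URI.to_string(%{uri | query: query})
  end
end

IO.puts(PageParam.next_page_url("https://example.com/jobs"))
IO.puts(PageParam.next_page_url("https://example.com/jobs?page=7"))
```

Rebuilding the query from the decoded parameters, rather than concatenating strings, keeps any other query parameters on the URL intact.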

Pattern 2: "Next" links

Many sites render a "Next" or ">>" link you can follow. This pattern works even when you have no idea how many pages exist: extract the link's href, build an absolute URL, and return it as the next request. The spider stops naturally when the final page has no Next link:

def parse_item(response) do
  {:ok, doc} = Floki.parse_document(response.body)
  items = extract_items(doc)

  # Try multiple selector variants -- sites use different markup
  next_requests =
    doc
    |> Floki.find("a[rel='next'], .next-page > a, li.next a, a.next")
    |> Floki.attribute("href")
    |> List.first()
    |> case do
      # No Next link found -- this is the last page
      nil ->
        []

      href ->
        # Convert relative path to absolute URL
        absolute =
          response.request.url
          |> URI.parse()
          |> URI.merge(href)
          |> URI.to_string()

        [Crawly.Utils.request_from_url(absolute)]
    end

  %Crawly.ParsedItem{items: items, requests: next_requests}
end

The URI.merge/2 call handles relative paths correctly: whether the href is a root-relative path like /page/2, a query-only link like ?page=2, or a full absolute URL, URI.merge/2 resolves it against the current page's URL.
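To see how the resolution behaves for the different href shapes, here is a small standalone sketch. The example.com and other.example URLs are placeholders, not real targets.

```elixir
base = "https://example.com/jobs/list?page=1"

# Resolve an href found on the page against the page's own URL
resolve = fn href -> base |> URI.merge(href) |> URI.to_string() end

IO.puts(resolve.("/page/2"))
# => https://example.com/page/2           (root-relative path)
IO.puts(resolve.("?page=2"))
# => https://example.com/jobs/list?page=2 (query-only href keeps the path)
IO.puts(resolve.("page/2"))
# => https://example.com/jobs/page/2      (relative to the current directory)
IO.puts(resolve.("https://other.example/p/2"))
# => https://other.example/p/2            (absolute URLs pass through)
```

Note that the base must itself be an absolute URL; URI.merge/2 raises if the first argument has no scheme.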

Pattern 3: Cursor or token-based pagination (APIs/infinite scroll)

Some modern targets, especially API-backed frontends and social feeds, use cursor tokens rather than page numbers.

Extract the cursor from the current response, embed it in the next request URL, and continue until the cursor is absent.

Always check the Network tab in browser DevTools first. Many infinite-scroll pages call a REST API you can hit directly without JavaScript rendering:

def parse_item(response) do
# Many APIs return JSON even when you browse to them
data = Jason.decode!(response.body)
# Extract and format items from the JSON payload
items = data["results"] |> Enum.map(&format_item/1)
# Extract the next cursor value -- nil means we're on the last page
next_requests =
case Map.get(data, "next_page_token") do
nil -> []
cursor ->
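# Illustrative endpoint and parameter names: substitute the API you found in DevTools.
# URI.encode_query/1 escapes characters such as = or + that cursor tokens often contain.
next_url =
"https://hacker-news.firebaseio.com/v0/jobstories.json?" <>
URI.encode_query(%{"cursor" => cursor, "limit" => 50})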
[Crawly.Utils.request_from_url(next_url)]
end
%Crawly.ParsedItem{items: items, requests: next_requests}
end
defp format_item(raw) do
%{
id: raw["id"],
title: raw["title"],
url: raw["url"]
}
end

Infinite scroll pages that don't expose an API require JavaScript rendering (we’ll cover this in the next section).

Scraping JavaScript-rendered pages

JavaScript rendering is one of the hardest challenges in Elixir web scraping. Elixir's HTTP clients, including HTTPoison, Req, and Finch, all fetch raw HTML. None of them can execute JavaScript. 

When a target page loads its content dynamically after the initial HTML response, your scraper returns empty containers. In Elixir, the usual fix is an external rendering service rather than a native headless browser.

Here are specific signs that can tell you a site requires JavaScript rendering:

  • You get a 200 response, but Floki finds empty containers where data should be.
  • Data only appears after scrolling or clicking.
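  • Content lives inside <script> blocks (for example, JSON in <script type="application/ld+json">) rather than in the visible markup. In that case, you can often parse the embedded JSON directly instead of rendering the page.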

Check the browser DevTools Network tab for XHR/Fetch calls. If the JavaScript is loading data from a REST API, you can call that API directly from Elixir and skip rendering entirely.

Some pages load different content depending on the User-Agent, serving a static fallback to bots. Try setting a mobile User-Agent to see if you get a simpler, static page.

Here are practical approaches for scraping JavaScript-rendered pages:

Option 1: Splash integration with Crawly

Splash is a lightweight JavaScript rendering service that runs in Docker. Crawly supports it natively via the fetcher configuration option. 

When you route requests through Splash, it renders the page in a real browser engine (Qt WebKit) and returns the fully rendered HTML to your spider. Run Splash in Docker, then switch one line in your config:

# Step 1: Start Splash in Docker
# docker run -p 8050:8050 scrapinghub/splash
# Step 2: Configure Crawly to use Splash as the fetcher
# config/config.exs
config :crawly,
# base_url points at Splash's render.html endpoint, which returns the rendered HTML
fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]},
# ... rest of your Crawly config

With this configuration, your spider code doesn't change at all. Crawly routes each request through Splash, which renders the JavaScript, then returns the populated HTML to your parse_item/1 callback as usual.

Note: Splash works well for pages with simple JavaScript rendering requirements. However, it can struggle with modern single-page applications that use sophisticated client-side routing or that run specific anti-bot JavaScript challenges. 

It also adds operational complexity since you need Docker running, and Splash itself needs to be monitored.

Option 2: Decodo Web Scraping API

Instead of managing a headless browser, route requests through the Decodo Web Scraping API. The API handles JavaScript rendering, proxy rotation, CAPTCHA solving, and geo-targeting in a single API call. From Elixir, it's just an HTTP POST to a different endpoint. Your Floki parsing code stays the same.

It also handles automatic retries, TLS fingerprint rotation, and real browser fingerprints. You get rendered HTML back without maintaining any browser infrastructure.

Create a new Mix project if you haven’t:

mix new job_scraper
cd job_scraper

Update mix.exs:

defp deps do
[
{:req, "~> 0.5"},
{:floki, "~> 0.36"}
]
end

Run mix deps.get, then create lib/job_scraper/decodo_script.ex and paste the full script:

defmodule JobScraper.DecodoScript do
@api_url "https://scraper-api.decodo.com/v2/scrape"
def run do
url = "https://quotes.toscrape.com/js/"
case fetch_rendered(url) do
{:ok, html} ->
IO.puts("Page fetched successfully\n")
quotes = extract_quotes(html)
IO.puts("Extracted Quotes:\n")
Enum.each(quotes, fn quote ->
IO.puts("- #{quote}")
end)
{:error, reason} ->
IO.puts("Failed: #{reason}")
end
end
def fetch_rendered(target_url) do
username = System.get_env("DECODO_USERNAME")
password = System.get_env("DECODO_PASSWORD")
payload = %{url: target_url, headless: "html", geo: "us"}
case Req.post(@api_url, json: payload, auth: {username, password}) do
{:ok, %Req.Response{status: 200, body: body}} -> {:ok, body["html"]}
{:ok, %Req.Response{status: status}} -> {:error, "API error HTTP #{status}"}
{:error, reason} -> {:error, inspect(reason)}
end
end
def extract_quotes(html) do
{:ok, document} = Floki.parse_document(html)
document
|> Floki.find(".quote .text")
|> Enum.map(fn element -> element |> Floki.text() |> String.trim()
end)
end
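end

Set credentials:

export DECODO_USERNAME=your_username
export DECODO_PASSWORD=your_password

Because the module lives under lib/, Mix compiles it with the rest of the project, so there is no need for a top-level call inside the file. Run it with:

mix run -e "JobScraper.DecodoScript.run()"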

Expected console output will look like this:

Page fetched successfully
Extracted Quotes:
- “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
- “It is our choices, Harry, that show what we truly are, far more than our abilities.”
- “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
- “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
- “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”

Handling anti-scraping mechanisms

Anti-scraping techniques have become sophisticated over the past few years. A basic User-Agent rotation and a small delay between requests aren't enough to stay undetected with Elixir web scraping at scale. 

You need to understand exactly what a target deploys (rate limiting, TLS fingerprinting, behavioral scoring) and how to counter it in Elixir so you can build scrapers that hold up rather than break silently after a few hundred requests.

Common anti-scraping mechanisms

  • Rate limiting and IP blocking. Servers track request frequency per IP address. When you exceed the threshold, you'll likely see 429 responses, or the site silently serves a page that looks correct but contains placeholder data or fewer results than normal. IP blocks are often temporary (15-minute to 24-hour windows), but repeated violations can escalate to permanent bans.
  • User-agent fingerprinting. Sites maintain blocklists of User-Agent strings associated with known scraping libraries. Default library identifiers, such as hackney's hackney/1.18.1, are common entries on these lists, so a request that keeps a library-style User-Agent is easy to flag.
  • CAPTCHA challenges. These are typically triggered after sustained high-frequency traffic from the same IP. CAPTCHA requires human interaction or a solving service. Standard image recognition or audio CAPTCHAs can be solved programmatically, but reCAPTCHA v3 uses behavioral scoring that can't be bypassed with just a solver.
  • TLS fingerprinting. Advanced sites inspect the TLS handshake signature, not just HTTP headers. Erlang's TLS client produces a different signature and extension set compared to Chrome's TLS stack. Cloudflare's bot detection uses this signal. Therefore, sending browser-realistic HTTP headers doesn't help if the TLS fingerprint screams "Erlang."
  • Behavioral analysis. Machine learning systems score requests based on timing patterns, header completeness, navigation sequences, and mouse/keyboard event data injected via JavaScript. A scraper that hits 50 pages in 2 seconds with perfectly uniform timing looks nothing like a human user who takes 3-8 seconds per page with random variation.

How to overcome these anti-scraping mechanisms in Elixir

Apply these mitigations in order. Start simple and escalate.
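The simplest mitigation for rate limiting is to back off when the server answers with a 429 instead of retrying immediately. The sketch below shows exponential backoff with jitter around any fetch function. The Backoff module and its option names are hypothetical, and the fetch function is passed in so the logic stays independent of the HTTP client. Req also ships its own retry option, so treat this as an illustration of the idea rather than something you must hand-roll.

```elixir
defmodule Backoff do
  @moduledoc "Hypothetical retry helper; not part of Req or Crawly."

  # Calls fetch_fun until it returns something other than a 429 response
  # or the attempts run out. The delay doubles on every attempt.
  def with_retry(fetch_fun, opts \\ []) do
    max_attempts = Keyword.get(opts, :max_attempts, 4)
    base_ms = Keyword.get(opts, :base_ms, 1_000)
    do_retry(fetch_fun, 1, max_attempts, base_ms)
  end

  defp do_retry(fetch_fun, attempt, max_attempts, base_ms) do
    case fetch_fun.() do
      {:ok, %{status: 429}} when attempt < max_attempts ->
        Process.sleep(delay_ms(attempt, base_ms))
        do_retry(fetch_fun, attempt + 1, max_attempts, base_ms)

      other ->
        # Success, a non-429 error, or the final 429 once attempts are exhausted
        other
    end
  end

  # attempt 1 -> base, attempt 2 -> 2x base, attempt 3 -> 4x base,
  # plus up to 25% random jitter so parallel workers don't retry in lockstep
  def delay_ms(attempt, base_ms) do
    backoff = base_ms * Integer.pow(2, attempt - 1)
    backoff + :rand.uniform(div(backoff, 4) + 1) - 1
  end
end
```

With Req, a call could look like Backoff.with_retry(fn -> Req.get(url, retry: false) end), where retry: false turns off Req's built-in retries so the two mechanisms don't stack.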

In your Crawly project, create lib/job_scraper/spiders/hn_spider.ex and add the following spider module:

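defmodule JobScraper.Spiders.HNSpider do
  use Crawly.Spider
  # Rotate between realistic browser UA strings
  @user_agents [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.3 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
  ]
  @impl Crawly.Spider
  def base_url, do: "https://news.ycombinator.com"
  @impl Crawly.Spider
  def init, do: [start_urls: ["https://news.ycombinator.com/jobs"]]
  @impl Crawly.Spider
  def parse_item(response) do
    # Randomized delay -- fixed delays are easier to detect than random ranges
    :timer.sleep(Enum.random(2000..5000))
    {:ok, doc} = Floki.parse_document(response.body)
    items = JobScraper.Parsers.HNParser.parse(doc)
    requests = discover_links(doc, response.request.url)
    %Crawly.ParsedItem{items: items, requests: requests}
  end
  # Follow pagination links and attach browser-like headers to each request.
  # "a.morelink" is an assumed selector; adjust it to the target's markup.
  defp discover_links(doc, current_url) do
    doc
    |> Floki.find("a.morelink")
    |> Floki.attribute("href")
    |> Enum.map(fn href ->
      url = current_url |> URI.merge(href) |> URI.to_string()
      %{Crawly.Utils.request_from_url(url) | headers: full_headers()}
    end)
  end
  # Standard browser headers. Accept-Encoding is left out on purpose: the default
  # HTTPoison fetcher does not decompress gzip or Brotli bodies, so Floki would
  # receive compressed bytes. If Crawly.Middlewares.UserAgent is enabled in your
  # config, it may override the User-Agent set here.
  defp full_headers do
    [
      {"User-Agent", Enum.random(@user_agents)},
      {"Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"},
      {"Accept-Language", "en-US,en;q=0.9"},
      {"Referer", "https://www.google.com/"},
      {"DNT", "1"}
    ]
  end
end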

Proxy rotation for IP-level blocks

When an IP is blocked, header tuning won't help. You need a different IP address. Residential proxies route traffic through real consumer IPs, which are far harder for sites to detect and block than data center IPs. Configure Decodo's residential proxies directly in your Req requests:

defmodule JobScraper.ProxyFetcher do
  @proxy_host "gate.decodo.com"
  @proxy_port 7000

  def fetch(url) do
    proxy_username = "YOUR_PROXY_USERNAME"
    proxy_password = "YOUR_PROXY_PASSWORD"

    # Req hands :proxy to Mint as a {scheme, host, port, opts} tuple, not a URL string.
    # Proxy credentials travel in a Proxy-Authorization header.
    proxy_auth = "Basic " <> Base.encode64("#{proxy_username}:#{proxy_password}")

    headers = [
      {"User-Agent",
       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/121.0.0.0 Safari/537.36"},
      {"Accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"}
    ]

    case Req.get(url,
           headers: headers,
           connect_options: [
             proxy: {:http, @proxy_host, @proxy_port, []},
             proxy_headers: [{"proxy-authorization", proxy_auth}]
           ]
         ) do
      {:ok, %Req.Response{status: 200, body: body}} -> {:ok, body}
      {:ok, %Req.Response{status: status}} -> {:error, "HTTP #{status}"}
      {:error, reason} -> {:error, inspect(reason)}
    end
  end
end

Run it exactly as before. Swap JobScraper.Fetcher.fetch for JobScraper.ProxyFetcher.fetch anywhere in your spider code.

Each request through Decodo's gateway rotates to a different residential IP automatically. From the target site's perspective, each request comes from a different home internet connection in the configured country. 

For targets that block aggressively, the Decodo Web Scraping API handles proxy rotation, CAPTCHA solving, and retry logic transparently. This means you focus on parsing, not infrastructure.

Storing and structuring scraped data

Extracting data is only half the job. You need to save your scraped data reliably and in a format that fits your downstream use case, whether that's a spreadsheet for a one-off analysis, a JSON feed for an API, or a production database that gets queried daily. Here are three storage options. Pick the simplest one that meets your requirements.

Option 1: CSV export

CSV is the fastest path from raw scrape to a format any analyst can open in Excel or Google Sheets. Use the nimble_csv library to stream rows directly to a file without loading everything into memory. This is critical for large scrape jobs.

Add the dependency to mix.exs:

# mix.exs
{:nimble_csv, "~> 1.2"}

Then run mix deps.get and create the writer module:

NimbleCSV.define(JobScraper.CSV, separator: ",", escape: "\"")
defmodule JobScraper.Pipeline.CSVWriter do
alias JobScraper.CSV
def write(items, path \\ "/tmp/jobs.csv") do
headers = [["title", "company", "location", "url", "scraped_at"]]
now = DateTime.utc_now() |> DateTime.to_iso8601()
stream =
items
|> Stream.map(fn item ->
[
item[:title] || "",
item[:company] || "",
item[:location] || "",
item[:url] || "",
now
]
end)
(headers |> Stream.concat(stream))
|> CSV.dump_to_stream()
|> Stream.into(File.stream!(path))
|> Stream.run()
IO.puts("Wrote items to #{path}")
end
end
# Usage (from iex, with your list of scraped item maps):
# JobScraper.Pipeline.CSVWriter.write(scraped_items)

Run it:

iex -S mix
JobScraper.Pipeline.CSVWriter.write(scraped_items)
# Prints: Wrote items to /tmp/jobs.csv

The Stream.into/2 pipe means each row gets written as it's processed rather than buffering all rows in memory first.

In practice, CSV export is best for one-off data scraping tasks, reporting, or importing into spreadsheet tools. If you need to query or filter the data programmatically after collection, a database is a better fit.

Option 2: JSON output

Crawly's WriteToFile pipeline, combined with its JSON encoder pipeline, produces JSON Lines: 1 JSON object per line, which is memory-efficient for large files and trivial to process line-by-line in any downstream tool.

For structured nested output collected in your own pipeline, encode the full results list directly. JSON is also the most natural format when the target itself returns JSON from an API, because the payload maps onto the output without an HTML parsing step. The module below assumes the jason library is in your dependencies:

defmodule JobScraper.Pipeline.JSONWriter do
def write_pretty(items, path \\ "/tmp/jobs.json") do
json = Jason.encode!(items, pretty: true)
File.write!(path, json)
IO.puts("Saved #{length(items)} items to #{path}")
end
# For large datasets, write JSON Lines (one object per line)
def write_jsonl(items, path \\ "/tmp/jobs.jl") do
File.open!(path, [:write], fn file ->
Enum.each(items, fn item ->
IO.write(file, Jason.encode!(item) <> "\n")
end)
end)
IO.puts("Wrote #{length(items)} lines to #{path}")
end
end

This storage option is best for API consumption, feeding data pipelines, or when you need nested data structures (arrays within objects) that CSV can't represent without extra encoding.

Option 3: Database storage with Ecto

For production pipelines that run repeatedly, storing directly to a database with Ecto is the right approach. It enables deduplication by primary key, structured querying after collection, and integration with the rest of your Elixir application. 

Add the dependencies to mix.exs, then run mix deps.get:

# For PostgreSQL (production)
{:ecto_sql, "~> 3.11"},
{:postgrex, "~> 0.17"}
# Or SQLite (local development -- simpler, no server needed)
# ecto_sqlite3 provides Ecto.Adapters.SQLite3 and pulls in exqlite
{:ecto_sql, "~> 3.11"},
{:ecto_sqlite3, "~> 0.17"}

Configure your Repo in config/config.exs and create the database:

# config/config.exs
# ecto_repos tells the mix ecto.* tasks which repos to manage
config :job_scraper, ecto_repos: [JobScraper.Repo]
config :job_scraper, JobScraper.Repo,
database: "priv/repo/job_scraper.db"
# lib/job_scraper/repo.ex
defmodule JobScraper.Repo do
use Ecto.Repo, otp_app: :job_scraper, adapter: Ecto.Adapters.SQLite3
end
# Then run (fill in the generated migration before migrating):
mix ecto.create
mix ecto.gen.migration create_jobs
mix ecto.migrate
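The mix ecto.gen.migration step only creates an empty migration file, which needs a table definition before mix ecto.migrate does anything useful. A minimal sketch follows, assuming column names that mirror the schema in lib/job_scraper/job.ex. The file name prefix is generated by Mix, and the unique index on url is an assumption made here so that URL-based deduplication at insert time has a constraint to rely on.

```elixir
# priv/repo/migrations/<timestamp>_create_jobs.exs
defmodule JobScraper.Repo.Migrations.CreateJobs do
  use Ecto.Migration

  def change do
    create table(:jobs) do
      add :title, :string
      add :company, :string
      add :location, :string
      add :url, :string, null: false
      add :scraped_at, :utc_datetime

      # inserted_at / updated_at columns
      timestamps(type: :utc_datetime)
    end

    # on_conflict with conflict_target: :url needs a unique index on that column
    create unique_index(:jobs, [:url])
  end
end
```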

Define a schema that maps to your scraped data structure in lib/job_scraper/job.ex:

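defmodule JobScraper.Job do
  use Ecto.Schema
  import Ecto.Changeset

  schema "jobs" do
    field :title, :string
    field :company, :string
    field :location, :string
    field :url, :string
    field :scraped_at, :utc_datetime
    # UTC timestamps so inserted_at/updated_at accept DateTime values
    timestamps(type: :utc_datetime)
  end

  # Validates a single record before it is stored
  def changeset(job, attrs) do
    job
    |> cast(attrs, [:title, :company, :location, :url, :scraped_at])
    |> validate_required([:title, :url])
    |> unique_constraint(:url)
  end
end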

Use batch inserts for efficiency. Inserting one record at a time during a large crawl can be a bottleneck to your pipeline. Create lib/job_scraper/pipeline/database_writer.ex:

defmodule JobScraper.Pipeline.DatabaseWriter do
alias JobScraper.{Repo, Job}
def insert_batch(items) do
now = DateTime.utc_now() |> DateTime.truncate(:second)
records = Enum.map(items, fn item ->
%{
title: String.trim(item[:title] || ""),
company: String.trim(item[:company] || ""),
location: String.trim(item[:location] || ""),
url: item[:url],
scraped_at: now,
inserted_at: now,
updated_at: now
}
end)
{count, _} = Repo.insert_all(Job, records,
on_conflict: :nothing, # Skip URLs already in the database (needs a unique index on url)
conflict_target: :url
)
IO.puts("Inserted #{count} new jobs (#{length(records) - count} duplicates skipped)")
end
end

Run it:

iex -S mix
JobScraper.Pipeline.DatabaseWriter.insert_batch(scraped_items)
# => Inserted 138 new jobs (4 duplicates skipped)

Start with SQLite locally, switch to PostgreSQL for production. This storage option is ideal for recurring scrape jobs, production pipelines, or any case where you need to query the data after collection. 

Data quality considerations

Raw HTML rarely produces clean data without some normalization. You need to build these data cleaning steps into your extraction pipeline before any record hits storage:

  • Normalize every text field: trim leading and trailing whitespace (Floki.text/1 often returns strings with newlines from HTML indentation), collapse runs of spaces, and strip non-printable characters that sometimes appear in scraped text.
  • Deduplicate by URL at insert time using on_conflict: :nothing.
  • Log records that fail Ecto validation rather than crashing the pipeline. Validation failures usually mean a selector stopped matching after the site changed its HTML.
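The text normalization step can be a single small function applied to every field before storage. The sketch below is one possible version; TextCleaner and normalize/1 are hypothetical names, and the exact set of characters to strip is a judgment call that may differ for your targets.

```elixir
defmodule TextCleaner do
  @moduledoc "Hypothetical normalization helper for scraped strings."

  # Missing fields become empty strings so downstream code can rely on binaries
  def normalize(nil), do: ""

  def normalize(text) when is_binary(text) do
    text
    # Strip ASCII control characters, except the whitespace ones (tab, newline, etc.)
    |> String.replace(~r/[\x{00}-\x{08}\x{0E}-\x{1F}\x{7F}]/u, "")
    # Collapse any run of whitespace, including newlines, into a single space
    |> String.replace(~r/\s+/u, " ")
    |> String.trim()
  end
end

IO.inspect(TextCleaner.normalize("\n   Senior  Engineer\t Remote \n"))
# => "Senior Engineer Remote"
```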

Scaling and advanced techniques

Elixir web scraping scales in ways that Python scrapers can't match without significant additional infrastructure. The BEAM concurrency model lets you run thousands of concurrent requests on a single node, restart failed workers automatically, and maintain long-running crawl jobs without manual intervention.

Concurrent scraping with Task.async_stream

For a known list of URLs, like all 50 pages of a job board, Task.async_stream/3 runs one task per URL, up to max_concurrency at a time, and collects the results concurrently. It's simpler than Crawly for targeted, finite crawls where you already know every URL upfront:

defmodule JobScraper.ConcurrentFetcher do
def fetch_all(urls, concurrency \\ 10) do
urls
|> Task.async_stream(
fn url ->
# Stagger requests with randomized delay
:timer.sleep(Enum.random(500..2000))
JobScraper.Fetcher.fetch(url)
end,
max_concurrency: concurrency,
timeout: 30_000,
on_timeout: :kill_task
)
|> Enum.reduce([], fn result, acc ->
case result do
{:ok, {:ok, html}} -> [html | acc]
{:ok, {:error, _}} -> acc # Log and skip failed requests
{:exit, _reason} -> acc # Skip timed-out tasks
end
end)
end
end
# Fetch all 20 pages of HN jobs concurrently with 8 parallel workers
urls = Enum.map(1..20, &"https://news.ycombinator.com/jobs?page=#{&1}")
html_pages = JobScraper.ConcurrentFetcher.fetch_all(urls, 8)
IO.puts("Fetched #{length(html_pages)} pages")

Set max_concurrency to match your proxy pool size and the target's rate limits. Task.async_stream returns {:ok, result} or {:exit, reason} for each task. Always pattern-match on both. 

The Enum.reduce in the example above silently skips failed tasks and timeouts, so the batch completes regardless of individual failures. 

Use Task.Supervisor.async_stream_nolink/4 instead when you need finer control over task supervision. It prevents task crashes from propagating to the caller process.
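A minimal sketch of the nolink variant follows. The supervisor is started inline so the example runs on its own; in an application you would use the named Task.Supervisor from your supervision tree. The fake_fetch function and the example.com URLs are stand-ins for a real fetcher, and one URL deliberately raises to simulate a crash.

```elixir
# Started inline so the sketch is self-contained
{:ok, sup} = Task.Supervisor.start_link()

# Stand-in for a real fetch function: raises for one URL to simulate a crash
fake_fetch = fn
  "https://example.com/bad" -> raise "boom"
  url -> {:ok, "<html>#{url}</html>"}
end

urls = ["https://example.com/a", "https://example.com/bad", "https://example.com/b"]

results =
  sup
  |> Task.Supervisor.async_stream_nolink(urls, fake_fetch,
    max_concurrency: 2,
    timeout: 5_000,
    on_timeout: :kill_task
  )
  |> Enum.map(fn
    {:ok, {:ok, html}} -> {:ok, html}
    # A crashed or timed-out task shows up here instead of taking the caller down
    {:exit, reason} -> {:error, reason}
  end)

IO.inspect(length(results))
# => 3
```

The crashed task still logs its error, but the caller keeps running and receives one result per URL, in input order.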

Supervision trees for resilient scrapers

For long-running production scrapers, wrap your spider processes under a DynamicSupervisor so they restart automatically if they crash. Separate the crawler process from the storage process so database slowness doesn't block fetching:

defmodule JobScraper.Application do
use Application
def start(_type, _args) do
children = [
# Database connection pool
JobScraper.Repo,
# Supervisor for dynamically started spider processes
{DynamicSupervisor,
name: JobScraper.SpiderSupervisor,
strategy: :one_for_one},
# Supervised task pool for concurrent fetching
{Task.Supervisor, name: JobScraper.TaskSupervisor}
]
Supervisor.start_link(children, strategy: :one_for_one)
end
end

The :one_for_one strategy means that if the database pool crashes, only it gets restarted – the spider supervisor keeps running. If a spider process crashes, only that spider restarts – not the database connection pool. In other words, each process fails and recovers in isolation.
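To make the restart behaviour concrete, the sketch below starts a worker under a DynamicSupervisor, kills it, and checks that a replacement process appears. DemoWorker is a hypothetical stand-in for a spider process, and the supervisor is started inline rather than through the application tree so the example is self-contained.

```elixir
defmodule DemoWorker do
  # Hypothetical stand-in for a long-running spider process
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name)

  @impl true
  def init(name), do: {:ok, name}
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

# Children are added at runtime, one per spider
{:ok, pid1} = DynamicSupervisor.start_child(sup, {DemoWorker, "hn"})

# Simulate a crash, then give the supervisor a moment to restart the child
Process.exit(pid1, :kill)
Process.sleep(100)

[{_, pid2, :worker, _}] = DynamicSupervisor.which_children(sup)
IO.inspect(pid2 != pid1 and Process.alive?(pid2))
# => true
```

The worker restarts because use GenServer defaults to restart: :permanent in the generated child spec. For spiders that should not restart after finishing normally, :transient is the usual choice.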

Modularizing the scraper for maintainability

Sites change their HTML. When a target updates its structure, you want to find and fix the broken selector in under a minute.

Define a scraper behaviour with shared callbacks so all your spiders implement a consistent interface: init/0, parse_item/1, and base_url/0. Use application config or environment variables to manage per-environment settings, including rate limits, proxy credentials, and database URLs, without hardcoding anything in spider modules.

Then extract all CSS selectors into dedicated parser modules so there's one obvious place to look:

# lib/job_scraper/parsers/hn_parser.ex
defmodule JobScraper.Parsers.HNParser do
@moduledoc """
Selectors for Hacker News Who's Hiring threads.
Update selectors here when HN changes its HTML structure.
Last verified: 2024-06-01
"""
# Selector constants -- easy to find and update
@listing_selector ".commtext"
@title_selector "b:first-child"
def parse(doc) do
doc
|> Floki.find(@listing_selector)
|> Enum.map(&extract_listing/1)
|> Enum.reject(&is_nil/1)
end
defp extract_listing(node) do
text = Floki.text(node) |> String.trim()
if String.length(text) > 20, do: %{text: text}, else: nil
end
end

Because every spider implements the same behaviour, adding a new target is straightforward: implement the callbacks, drop the module in spiders/, and the orchestration layer picks it up without changes.
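A shared behaviour of this kind could be sketched as follows. JobScraper.Scraper and ExampleSpider are hypothetical names, and the callback types are illustrative. Crawly.Spider already defines the same three callbacks for Crawly spiders, so a separate behaviour like this is mainly useful for scrapers that don't run through Crawly.

```elixir
defmodule JobScraper.Scraper do
  @moduledoc "Hypothetical shared behaviour; callback names follow the text above."

  @callback base_url() :: String.t()
  @callback init() :: keyword()
  @callback parse_item(response :: map()) :: map()
end

defmodule JobScraper.Spiders.ExampleSpider do
  @behaviour JobScraper.Scraper

  @impl true
  def base_url, do: "https://example.com"

  @impl true
  def init, do: [start_urls: ["https://example.com/jobs"]]

  # A real implementation would parse response.body here
  @impl true
  def parse_item(_response), do: %{items: [], requests: []}
end
```

With @behaviour and @impl in place, the compiler warns when a spider forgets a callback or misspells one, which catches interface drift before a crawl runs.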

Scheduling and automation

For recurring scrape jobs, such as collecting new listings every morning or refreshing pricing data hourly, use the quantum library to trigger scrape jobs on a schedule. It's a cron-like scheduler that runs natively inside your Elixir supervision tree:

# mix.exs
{:quantum, "~> 3.5"},
# config/config.exs
config :job_scraper, JobScraper.Scheduler,
jobs: [
# Run the HN spider every Monday at 9:00 AM UTC
{{:cron, "0 9 * * MON"},
{JobScraper.Runners.HNRunner, :run, []}},
# Refresh Open Library data every Sunday at midnight
{{:cron, "0 0 * * SUN"},
{JobScraper.Runners.LibraryRunner, :run, []}}
]
# lib/job_scraper/scheduler.ex
defmodule JobScraper.Scheduler do
use Quantum, otp_app: :job_scraper
end

Add the scheduler to your supervision tree in application.ex and it starts automatically when your app boots. Combine it with a GenServer state machine to track run history, prevent overlapping runs when a slow crawl runs past its scheduled interval, and send alerts on consecutive failures.
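The overlap-prevention and failure-tracking idea can be sketched with a small GenServer. RunGuard and its function names are hypothetical; a runner module would call begin_run/1 before starting a crawl and finish_run/2 when it ends.

```elixir
defmodule RunGuard do
  @moduledoc "Hypothetical guard that refuses to start a run while one is in progress."
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Returns :ok if the run may start, {:error, :already_running} otherwise
  def begin_run(guard), do: GenServer.call(guard, :begin)

  # Records the outcome (:ok or :error) and frees the guard for the next run
  def finish_run(guard, outcome), do: GenServer.call(guard, {:finish, outcome})

  def consecutive_failures(guard), do: GenServer.call(guard, :failures)

  @impl true
  def init(:ok), do: {:ok, %{running?: false, failures: 0}}

  @impl true
  def handle_call(:begin, _from, %{running?: true} = state),
    do: {:reply, {:error, :already_running}, state}

  def handle_call(:begin, _from, state),
    do: {:reply, :ok, %{state | running?: true}}

  # A successful run resets the failure streak
  def handle_call({:finish, :ok}, _from, state),
    do: {:reply, :ok, %{state | running?: false, failures: 0}}

  def handle_call({:finish, :error}, _from, state),
    do: {:reply, :ok, %{state | running?: false, failures: state.failures + 1}}

  def handle_call(:failures, _from, state), do: {:reply, state.failures, state}
end
```

One limitation of this sketch: if a runner crashes before calling finish_run/2, the guard stays locked. A production version would monitor the runner process and release the lock when it exits.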

Web scraping with Elixir and Crawly, plus a Quantum scheduler, gives you a fully automated data collection pipeline that runs without any manual trigger.

Final thoughts

Scraping data from websites is harder than it looks. Sites change their HTML, deploy new anti-bot systems, alter pagination structure, and start serving different content to different IPs, all without notice. 

Web scraping in Elixir handles the structural challenges well: the supervision tree restarts failed processes, the BEAM scheduler distributes concurrent fetches across CPU cores, and pattern matching keeps parsing logic readable as selectors get complex.

The trade-off is a smaller ecosystem, with fewer third-party integrations and less community documentation than Python's Scrapy stack.

Don't forget that a reliable proxy infrastructure is crucial for building highly efficient, low-maintenance scraping pipelines capable of overcoming the web's most rigid defenses. For larger projects where maintenance becomes more work than the data is worth, Decodo's Web Scraping API removes the proxy and anti-bot setup entirely.

About the author

Justinas Tamasevicius

Director of Engineering

Justinas Tamaševičius is Director of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Can Elixir scrape JavaScript websites?

Yes, but Elixir's HTTP clients can't execute JavaScript natively. To scrape JavaScript-rendered pages with Elixir, you have 2 options.

First, integrate Splash, a Docker-based JavaScript rendering service that Crawly supports natively via its fetcher configuration.

Second, use the Decodo Web Scraping API, which handles JavaScript rendering, proxy rotation, and CAPTCHA solving in a single API call with no browser infrastructure to manage.

What is the best HTTP client for Elixir web scraping?

Req is the best HTTP client for most Elixir web scraping projects. It offers built-in retry logic, automatic compression handling, composable middleware, and a clean API designed for modern Elixir development.

Use HTTPoison if you need wider community documentation and examples. It's the most widely adopted Elixir HTTP client and has the most Stack Overflow coverage. Use Finch when you're crawling a single domain at very high volume and need direct connection pool control to maximize TCP connection reuse.

Is Crawly better than Scrapy for web scraping?

Crawly and Scrapy solve the same problem differently.

Crawly runs on the BEAM VM, so concurrent requests are lightweight processes – you can run thousands in parallel without thread pool management.

Scrapy runs on Python's Twisted event loop (with optional asyncio integration) and has a much larger ecosystem of extensions, middlewares, and community plugins.

Crawly is better if your team already uses Elixir and needs serious concurrency. Scrapy is better if your team uses Python and needs broad third-party library support or extensive community resources.

Is Elixir good for large-scale web scraping?

Yes, Elixir is well-suited for large-scale web scraping. The BEAM VM runs each HTTP request as a lightweight process costing roughly 2KB of memory, so you can handle thousands of concurrent connections without the threading overhead that limits Python scrapers.

Supervisor trees automatically restart failed processes, keeping long crawl jobs running without manual intervention. The main limitation is ecosystem breadth. Python's scraping ecosystem is larger and offers more third-party integrations.

© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved