Elixir Web Scraping: A Practical Step-by-Step Guide
Elixir web scraping solves one of the hardest problems in high-volume data collection: concurrency without thread overhead. The BEAM virtual machine (Erlang's runtime) runs each HTTP request as a lightweight process, not an OS thread, so you can fetch thousands of pages concurrently. If a supervised process crashes, its supervisor restarts it automatically. This guide builds a complete Elixir scraper from scratch, covering static pages, paginated targets, JavaScript-heavy sites, and anti-bot countermeasures.
Justinas Tamasevicius
Last updated: May 19, 2026

TL;DR: Build a simple Elixir scraper in minutes
Here's a complete Elixir web scraping script using two dependencies (Req and Floki) targeting the Hacker News "Who Is Hiring?" thread at https://news.ycombinator.com/item?id=40224213, a real, publicly accessible, plain-HTML page.
Start by installing Elixir with Homebrew:
Create a new Elixir project:
Then, replace the last few lines in the mix.exs file with the following:
Install the dependencies:
Then create lib/quick_scrape.ex:
Run the scraper:
Then inside the interactive shell:
The output will look like this:
Note: Hacker News may occasionally rate limit requests and return a 429 Too Many Requests status code. If that happens, wait a bit and try running the scraper again.
That’s how Elixir web scraping works end-to-end. The rest of this guide builds on this foundation – adding pagination, JavaScript rendering, anti-bot handling, and persistent storage for production use.
Why choose Elixir for web scraping
Elixir runs on the BEAM virtual machine, which gives scrapers something Python and Ruby web scraping can't match natively: each HTTP request runs in its own isolated, lightweight process. Thousands of concurrent fetches become straightforward.
When a process crashes, the supervisor automatically restarts it. Elixir’s pattern matching also keeps parsing logic clean and concise. Combined with libraries like Crawly for large-scale crawling, Floki for HTML parsing, and Req or HTTPoison for HTTP requests, the Elixir ecosystem covers everything needed to build production-grade web scrapers.
The BEAM virtual machine
Each HTTP request runs in its own lightweight BEAM process, using around 2 KB of memory, compared to 1 MB or more per OS thread. You can run tens of thousands of concurrent fetches on a single machine without configuring thread pools, adjusting limits, or working around the Python GIL.
The BEAM scheduler automatically distributes process execution across CPU cores. When you're scraping large sites with thousands of pages, this is the main reason to choose Elixir over Python alternatives.
Fault tolerance through supervisors
If a request process crashes, whether due to a timeout, a malformed response, or an unexpected HTML structure that your parser doesn't handle, the supervisor restarts it automatically.
The crash is isolated to that single process. No other concurrent fetch is affected. Your scraper keeps running without manual intervention, which matters enormously on long crawl jobs that run for hours across millions of URLs.
Python scrapers need explicit retry logic and exception handlers for the same resilience. In Elixir, the supervision tree handles it structurally. You define what should happen when things go wrong once, at the application level, and the runtime enforces it for every worker process.
Pattern matching makes parsers readable
Elixir's pattern matching lets you write extraction logic that reads like a specification. For instance, when you match on {:ok, %Req.Response{status: 200, body: body}}, you're simultaneously checking the result type, verifying the HTTP status, and binding the body to a variable, all in one expression.
The equivalent Python code requires separate isinstance checks, status comparisons, and variable assignments.
For HTML parsing, this means your error handling and your happy-path logic sit side by side in readable case blocks. The parser communicates its intent clearly, which makes debugging significantly easier when a site changes its HTML structure and your selectors stop returning expected data.
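As a minimal sketch of that idea, the module below maps an HTTP result onto explicit tagged tuples, one function clause per case. The module name ResponseHandler and the error atoms are illustrative assumptions. It matches on plain maps so it runs without any dependency; because structs are maps, the same clauses also match a %Req.Response{}.

```elixir
defmodule ResponseHandler do
  @moduledoc "Hypothetical helper: turns an HTTP client result into a tagged tuple."

  # Happy path: :ok wrapper, status 200, and the body bound in a single match.
  def handle({:ok, %{status: 200, body: body}}), do: {:ok, body}

  # Each failure mode gets its own clause, next to the happy path.
  def handle({:ok, %{status: 429}}), do: {:error, :rate_limited}
  def handle({:ok, %{status: 404}}), do: {:error, :not_found}
  def handle({:ok, %{status: status}}), do: {:error, {:unexpected_status, status}}

  # Transport-level failures (timeouts, DNS errors, refused connections).
  def handle({:error, reason}), do: {:error, {:request_failed, reason}}
end

IO.inspect(ResponseHandler.handle({:ok, %{status: 200, body: "<html></html>"}}))
IO.inspect(ResponseHandler.handle({:ok, %{status: 429, body: ""}}))
```

Clause order matters here: the catch-all status clause comes after the specific ones, so a 429 never falls through to the generic branch.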
The Elixir scraping ecosystem
Web scraping with Elixir and Crawly gives you a full spider framework with URL deduplication, concurrent request management, middleware for headers and rate limiting, and configurable pipelines for output.
Floki handles HTML parsing with a clean, well-documented CSS selector API. HTTPoison wraps the battle-tested hackney HTTP client. Req brings modern features like built-in retries, compression, and middleware composition. Jason handles JSON encoding and decoding efficiently.
These libraries are all actively maintained and production-ready. The ecosystem is smaller than Python's, but the core scraping stack is solid and covers the vast majority of real-world scraping use cases.
Here are the trade-offs you must know when using Elixir for web scraping:
- Smaller ecosystem than Python. Elixir has fewer third-party scraping/parsing tools compared to stacks like Scrapy or Beautiful Soup.
- Less learning material. There are fewer tutorials, guides, and community examples available on Elixir.
- Steeper initial learning curve. Especially for teams unfamiliar with functional programming concepts.
- Different programming paradigm. Elixir’s use of immutable data, pipe operators, and pattern matching takes time to get used to.
- Slower initial productivity. As a developer, you may need time before you can think naturally in Elixir’s data-flow style.
These drawbacks are minimal for teams already experienced or familiar with Elixir. Despite ecosystem gaps, Elixir’s strong concurrency model can outweigh downsides for large-scale scraping systems.
Prerequisites and project setup
Before starting Elixir web scraping, you need a working development environment and a structured project.
Install Elixir, create a Mix project with a supervision tree, add your dependencies, and organize files so your scraper remains maintainable as it grows.
Note: You need to get the project structure right so that every code block that follows compiles and runs without surprises. We’ll use a Hacker News jobs scraper as the running example.
Elixir web scraping environment setup
Elixir compiles to BEAM bytecode, so it requires Erlang to run. You need to install both using asdf, a version manager that handles multiple runtime versions without polluting your system paths.
Install Elixir and Erlang:
This is a setup script for installing and managing specific versions of Erlang and Elixir on macOS.
This is what’s happening in the script above:
- brew install asdf uses Homebrew, a macOS package manager, to install asdf, a version manager that lets you install and switch between multiple versions of programming languages.
- The Add Erlang and Elixir plugins block tells asdf you want to manage Erlang and Elixir. Each language needs a plugin so asdf knows how to install and switch versions. In this case, the script is installing compatible versions of Erlang 26.2.2 and Elixir 1.16.2-otp-26.
- The Set as global defaults block makes these versions the default on your system. Anytime you run Elixir or Erlang, these versions will be used unless overridden locally.
- Verify installation: elixir --version checks that everything worked. The output should show both the Elixir version and the Erlang/OTP version it runs on.
In short, the script installs a version manager, uses it to install compatible versions of Erlang and Elixir, sets them as defaults, and then confirms everything is working.
On Ubuntu/Debian, install via the official Erlang Solutions repository or use asdf with the same commands. On Windows, use the official installers from elixir-lang.org – they bundle Erlang automatically.
Now, create your project with a supervision tree included. The --sup flag generates a supervisor module, which is the foundation of Elixir's fault-tolerance model:
Mix is Elixir's build tool. It handles project scaffolding, dependency management, compilation, and running tests. The mix.exs file at the root of your project is where you declare dependencies and configure your application. Think of it as the equivalent of Python's pyproject.toml or Node's package.json.
Dependencies
Open mix.exs and replace the deps/0 function with the following. Each library covers a specific part of the scraping stack:
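A deps/0 function along these lines would pull in the five libraries discussed in this section. The version requirements are assumptions chosen for illustration, not values taken from this guide; check Hex for the current releases and make sure they resolve together.

```elixir
# mix.exs (fragment): goes inside your project's Mix.Project module.
# Version requirements below are illustrative assumptions.
defp deps do
  [
    {:crawly, "~> 0.17"},    # spider framework
    {:floki, "~> 0.36"},     # HTML parsing with CSS selectors
    {:req, "~> 0.5"},        # HTTP client used in most examples
    {:httpoison, "~> 2.2"},  # optional alternative HTTP client
    {:jason, "~> 1.4"}       # JSON encoding and decoding
  ]
end
```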
Then, install the dependencies by running:
Mix downloads and compiles each dependency from Hex, Elixir's package registry. The first run takes a minute or two. Subsequent runs only fetch changed packages. After installation, run mix compile to verify everything compiles cleanly before writing any scraping code.
Here's what each library does in practice:
- Crawly. The spider framework that manages URL queues, deduplication, concurrent request scheduling, middleware, and output pipelines.
- Floki. The standard HTML parser for Elixir that turns raw HTML strings into a queryable tree you can traverse with CSS selectors.
- Req. The modern HTTP client you'll use in most examples. Composable, supports middleware, and handles retries cleanly. Recommended for new projects.
- HTTPoison. Widely-used alternative HTTP client worth knowing. Wider community adoption and more Stack Overflow answers.
- Jason. Fast JSON encoding. You'll use it to write scraped data as JSON output or to decode API responses.
In most cases, you don’t need both Req and HTTPoison. If you’re starting fresh, stick with :req. It’s cleaner and actively evolving.
Only keep :httpoison if a library you use requires it, or you already have code written around it.
Project structure
Organize the project so each concern lives in its own folder. This will make it fast to find and fix broken selectors when a target site changes its HTML.
The spiders/ folder defines crawl behavior: where to start, what domain to stay in, and how to extract data.
The parsers/ folder isolates all CSS selectors in one place, so when a site changes its HTML, you update a single file rather than hunting through spider code.
The pipeline/ folder handles what happens after extraction: writing to CSV, JSON files, or a database.
This separation is important when you maintain scrapers for multiple targets, or when the same data source needs to feed different output formats.
Sending HTTP requests with Elixir
Every Elixir web scraping workflow starts with an HTTP GET request. Elixir offers 3 mature HTTP client libraries (HTTPoison, Req, and Finch) to help send requests efficiently while handling headers, retries, and responses. Choosing the right one depends on your scraping scale and how much control you need over connection management.
This section compares all 3, then walks you through a complete working example with proper error handling and browser-realistic headers.
| Library | Best for | Key strength | Trade-off |
| --- | --- | --- | --- |
| HTTPoison | General-purpose scraping | Wide adoption, great docs | Older API, less composable |
| Req | Most new scraping projects | Built-in retries, middleware | Fewer tutorials online |
| Finch | High-volume, single-domain crawls | Direct connection pool control | Lower-level, more config needed |
Note: Use Req for most Elixir HTTP requests in scraping projects. Its middleware architecture makes it easy to add retry logic, set default headers, and handle compression without boilerplate.
Fall back to Finch only when you need direct connection pool control at very high request volumes, for example, crawling a single large domain where reusing TCP connections per-IP matters.
Meanwhile, use HTTPoison if your team is more familiar with it or you need the broader range of community examples online.
Making a basic GET request
Here's a complete fetcher module targeting quotes.toscrape.com, a site built for scraping practice with stable, predictable HTML.
This is the module you'll reuse throughout the guide. It wraps Req with browser-realistic headers and structured error returns so every HTTP failure is handled explicitly rather than crashing your spider mid-crawl.
Install the dependency first if you haven't already:
Create lib/job_scraper/fetcher.ex:
The pattern match on {:ok, %Req.Response{status: 200, body: body}} does 3 things simultaneously: checks the result wrapper is :ok, verifies the HTTP status is exactly 200, and binds the response body to the body variable.
Non-200 responses each hit their own branch: a 429 (rate limited) error is handled differently from a 404 (not found).
Always set a realistic User-Agent. HTTP libraries identify themselves by default (for example hackney/1.18.1 when using HTTPoison, or a req/<version> string when using Req), and many sites maintain blocklists of known bot strings.
Include Accept and Accept-Language headers to make your requests look closer to what a real browser sends.
Test the fetcher by running it in IEx to verify the response before using it inside a spider:
Parsing HTML with Floki
This section covers scraping data with Elixir and Floki.
Floki is the standard HTML parser for Elixir web scraping. It converts raw HTML strings into a queryable tree, lets you select nodes by class or element, and extracts text or attribute values in clean, chainable function calls using CSS selectors.
Once you understand 4 core functions, which are parse_document/1, find/2, text/1, and attribute/2, you can extract any data from any HTML page.
Parse a document
Web scraping with Elixir requires parsing HTML into structured data.
Pass the HTML body string from your fetcher directly to Floki.parse_document/1. This function returns a tagged tuple – {:ok, doc} on success or an error if the HTML is malformed. In practice, Floki handles even poorly-formed HTML gracefully because it uses a lenient parser:
The Floki query functions used in the rest of this section take doc as their first argument. Treat doc as your parsed document object, equivalent to what you'd get from BeautifulSoup's BeautifulSoup(html, 'html.parser') call in Python.
Select nodes with CSS selectors
Use Floki.find/2 with any CSS selector to pull matching nodes. Use the same selectors you'd verify in browser DevTools – right-click an element, select Inspect, and copy the selector.
Here's a complete example extracting all quotes, authors, and tags from quotes.toscrape.com:
Floki supports the CSS selectors you are most likely to need: class selectors (.quote), element selectors (a), descendant selectors (.quote .text), and attribute selectors ([href]). It implements a subset of what browsers support, so if a selector copied from DevTools returns nothing, simplify it before assuming the data is missing.
Verify each selector in DevTools on your target page before writing it into code.
Extract text content
Call Floki.text/1 on any node or list of nodes to get the plain text with all HTML tags stripped. When you pass it a list of matched nodes, it concatenates all their text into a single string. To get one string per node, map over the list and call Floki.text/1 on each node.
Always pipe through String.trim/1 after Floki.text/1. HTML whitespace (indentation, newlines around inline elements) often ends up in the text content and will pollute your output if you don't strip it.
Extract attribute values
Use Floki.attribute/2 to pull HTML attribute values from matched nodes. The most common use case is extracting href values from anchor tags for link discovery. You need those URLs to queue the next requests in your spider:
Chain Floki calls with the pipe operator
Elixir's pipe operator (|>) makes extraction pipelines read left to right. The output of one function becomes the input of the next.
For scraping data with Elixir and Floki, this is the idiomatic style: select a parent node, then scope all child selections to that node.
Here's a complete extractor for quote cards on quotes.toscrape.com:
Each extract_quote/1 call receives a single quote node and extracts all sub-fields from it. Scoping your Floki calls to a parent node, rather than searching the whole document, prevents selector collisions when the same class name appears in different contexts on the page.
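The scoping approach described above can be sketched as a standalone script. The inline HTML is a hand-written stand-in that mimics the class names used on quotes.toscrape.com, and the module name QuoteExtractor and the Floki version requirement are assumptions. Inside the Mix project, Floki is already a dependency, so the Mix.install line is not needed there.

```elixir
# Standalone script: fetches Floki on first run. Version is an assumption.
Mix.install([{:floki, "~> 0.36"}])

defmodule QuoteExtractor do
  # Parses the page once, then hands each .quote node to extract_quote/1.
  def extract_all(html) do
    html
    |> Floki.parse_document!()
    |> Floki.find(".quote")
    |> Enum.map(&extract_quote/1)
  end

  # Every selector here is scoped to one quote node, not the whole document.
  defp extract_quote(quote_node) do
    %{
      text: quote_node |> Floki.find(".text") |> Floki.text() |> String.trim(),
      author: quote_node |> Floki.find(".author") |> Floki.text() |> String.trim(),
      # Map over the tag nodes so each tag becomes its own string.
      tags:
        quote_node
        |> Floki.find(".tags a.tag")
        |> Enum.map(fn tag -> tag |> Floki.text() |> String.trim() end)
    }
  end
end

html = """
<div class="quote">
  <span class="text">The world is a process of our thinking.</span>
  <small class="author">Albert Einstein</small>
  <div class="tags"><a class="tag" href="/tag/change/">change</a> <a class="tag" href="/tag/thinking/">thinking</a></div>
</div>
<div class="quote">
  <span class="text">It is our choices that show what we truly are.</span>
  <small class="author">J.K. Rowling</small>
  <div class="tags"></div>
</div>
"""

IO.inspect(QuoteExtractor.extract_all(html))
```

Because each field is looked up inside its own quote node, the second quote ends up with an empty tag list instead of inheriting tags from the first.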
Handle missing elements safely
When Floki.find/2 matches nothing, it returns an empty list instead of raising an error. Calling Floki.text/1 on an empty list returns an empty string, which is usually acceptable. But calling List.first/1 on an empty list returns nil, which causes a crash if you then try to call a string function on it. Guard against this pattern:
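One way to guard against the nil case is a small helper that pattern-matches on the list instead of calling List.first/1 and a string function back to back. SafeExtract and first_trimmed are hypothetical names used only for this sketch.

```elixir
defmodule SafeExtract do
  @moduledoc "Hypothetical helper for lists returned by selector lookups."

  # Returns the first value trimmed, or the default when the list is empty.
  def first_trimmed(values, default \\ nil)

  def first_trimmed([value | _rest], _default) when is_binary(value) do
    String.trim(value)
  end

  # Empty list (selector matched nothing) or unexpected input.
  def first_trimmed(_values, default), do: default
end

IO.inspect(SafeExtract.first_trimmed(["  /page/2/ "]))
IO.inspect(SafeExtract.first_trimmed([]))
IO.inspect(SafeExtract.first_trimmed([], "N/A"))
```

In a parser you would typically feed it the list returned by Floki.attribute/3, for example the href values matched by a "next page" selector, and treat nil as "no such element on this page".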
XPath traversal for complex selectors
CSS selectors cover most cases, but some HTML structures require traversal that CSS can't express, such as selecting a parent element or navigating certain sibling relationships. Note that Floki itself only accepts CSS selectors and does not ship an XPath function. For matching on text content, Floki offers the :fl-contains('text') pseudo-class. For true XPath queries, parse the page with a library that supports them, such as Meeseeks:
Use XPath sparingly. CSS selectors are faster and more readable for the common cases. Reserve XPath for the specific situations where CSS genuinely can't express what you need.
Building a web crawler with Crawly
Crawly is the Elixir scraping framework that turns a single-page fetcher into a full crawl engine. Web scraping with Elixir and Crawly gives you URL deduplication (so you never request the same page twice), concurrent request management across multiple pages, middleware for headers and rate limiting, and pipelines that process extracted data before writing it to disk or a database.
A Crawly spider is a single Elixir module with 3 required callbacks (init/0, base_url/0, and parse_item/1) that define where to start crawling, what domain to stay in, and how to extract data from each response.
We’ll use the Open Library search results page (openlibrary.org) for book metadata extraction.
Spider structure
Here is a spider targeting openlibrary.org, a real, publicly accessible library catalog with clean, stable HTML and no aggressive bot detection. It extracts book metadata and follows pagination. Create lib/job_scraper/spiders/open_library_spider.ex:
This is an Elixir web scraper (spider) built using the Crawly library to crawl Open Library search results and extract information about books related to “Elixir programming.”
The parse_item/1 callback does 2 jobs every time it runs: extract structured data from the current page, and return a list of new URLs to crawl.
The %Crawly.ParsedItem{} struct bundles both. items contains your extracted data maps, while requests contains Crawly.Request structs for the next URLs.
Crawly calls parse_item/1 for each URL in its request queue.
When requests is an empty list, the spider has nowhere new to go and will eventually finish. When items is empty, but requests is non-empty, Crawly keeps crawling without extracting – useful for navigation-only pages.
Configuration in config.exs
Crawly's behavior is configured in config/config.exs. This is where you set User-Agent rotation, output pipelines, concurrency limits, and middleware:
Open config/config.exs and add Crawly settings:
Start concurrent_requests_per_domain at 2 to 4 for any new target. Crawling faster than the site allows can trigger rate limiting and IP blocks. Increase it only after confirming the site can handle your request rate without degrading.
The WriteToFile pipeline writes 1 JSON object per line to /tmp. Open the .jl file while the crawl runs to watch results appear in real time. The Validate pipeline drops items missing required fields before they reach storage, which keeps your output clean even when individual pages have incomplete data.
The closespider_itemcount setting is useful during development to limit how long a test run takes.
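A config/config.exs fragment covering the settings described in this section might look like the sketch below. The numeric values, the User-Agent string, and the required field names (:title, :url) are assumptions for illustration; adjust them to your target and item shape.

```elixir
# config/config.exs (fragment). Values are illustrative assumptions.
import Config

config :crawly,
  # Keep this low (2 to 4) until you know what the target tolerates.
  concurrent_requests_per_domain: 2,
  # Stop the spider after this many items; handy for short test runs.
  closespider_itemcount: 100,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent,
     user_agents: [
       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
     ]}
  ],
  pipelines: [
    # Drop items that are missing any of these fields.
    {Crawly.Pipelines.Validate, fields: [:title, :url]},
    # Drop items whose :url was already seen in this run.
    {Crawly.Pipelines.DuplicatesFilter, item_id: :url},
    Crawly.Pipelines.JSONEncoder,
    # One JSON object per line, written under /tmp with a .jl extension.
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
```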
Running the spider
Start an IEx session with your project loaded and use Crawly's engine API to control the spider:
Crawly writes output to /tmp/*.jl by default (based on your config). Each line in the output file is a complete JSON object representing one extracted item. Open the file in a text editor or pipe it through jq to inspect results as the crawl runs:
Scraping paginated websites with Elixir
Pagination is the most common structural challenge in web scraping. Most real-world targets, including job boards, product catalogs, and search results, split their data across multiple pages.
Web scraping in Elixir handles pagination naturally because your spider's parse_item/1 callback returns new URLs as data, and Crawly queues them automatically. There are 3 distinct pagination patterns, and each needs a different strategy:
Pattern 1: Page number parameters (?page=N)
The simplest pagination pattern puts the page number directly in the URL as a query parameter. When you can see the total number of pages on the first page (usually shown as "Page 1 of N" or a total results count), generate all page URLs upfront from your init/0 callback (Option A). When you can't, request one page at a time and stop when a page returns an empty items list (Option B):
Option B is more robust for targets where the total page count changes over time, like a job board that adds new listings daily. The spider naturally stops when it reaches a page with no results.
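Both strategies reduce to a few lines of URL arithmetic. The sketch below uses a hypothetical PageUrls module and assumes the base URL has no existing query string; in a spider, the returned URLs would be wrapped into requests.

```elixir
defmodule PageUrls do
  @moduledoc "Hypothetical helpers for ?page=N pagination."

  # Known page count: build every page URL upfront.
  def all(base_url, total_pages) when is_integer(total_pages) and total_pages >= 1 do
    for page <- 1..total_pages, do: "#{base_url}?page=#{page}"
  end

  # Unknown page count: an empty items list means the last page was passed,
  # so no further URL is returned.
  def next([], _base_url, _current_page), do: []

  def next(_items, base_url, current_page) do
    ["#{base_url}?page=#{current_page + 1}"]
  end
end

IO.inspect(PageUrls.all("https://example.com/jobs", 3))
IO.inspect(PageUrls.next([%{title: "Dev"}], "https://example.com/jobs", 1))
IO.inspect(PageUrls.next([], "https://example.com/jobs", 7))
```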
Pattern 2: Next page link discovery
Many sites render a "Next" or ">>" link you can follow. Extract its href, build an absolute URL, and return it as the next request. This pattern works even when you have no idea how many pages exist, because the spider stops naturally when the final page has no Next link:
The URI.merge/2 call handles relative paths correctly – whether the href is /page/2, ?page=2, or a full absolute URL, URI.merge resolves it against the current page's URL.
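The three href shapes mentioned above can be checked with the standard library alone. NextLink is a hypothetical wrapper name, and the URLs are examples.

```elixir
defmodule NextLink do
  # Resolves an href found on `current_url` into an absolute URL string.
  def absolute(current_url, href) do
    current_url
    |> URI.merge(href)
    |> URI.to_string()
  end
end

# Root-relative path
IO.puts(NextLink.absolute("https://quotes.toscrape.com/page/1/", "/page/2/"))
# Query-only href: keeps the current path, swaps the query string
IO.puts(NextLink.absolute("https://example.com/jobs?page=1", "?page=2"))
# Already absolute: returned unchanged
IO.puts(NextLink.absolute("https://example.com/jobs", "https://example.com/archive/3"))
```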
Pattern 3: Cursor or token-based pagination (APIs/infinite scroll)
Some modern targets, especially API-backed frontends and social feeds, use cursor tokens rather than page numbers.
Extract the cursor from the current response, embed it in the next request URL, and continue until the cursor is absent.
Always check the Network tab in browser DevTools first. Many infinite-scroll pages call a REST API you can hit directly without JavaScript rendering:
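The cursor loop itself can be sketched independently of any particular API. In the module below, fetch_page is a hypothetical function you supply: it takes a cursor (nil for the first page) and returns {:ok, %{items: list, next_cursor: cursor_or_nil}} or {:error, reason}. The field names are assumptions; real APIs name their cursor fields differently.

```elixir
defmodule CursorPager do
  @moduledoc "Hypothetical helper: lazily walks a cursor-paginated endpoint."

  def stream(fetch_page) when is_function(fetch_page, 1) do
    {:cont, nil}
    |> Stream.unfold(fn
      # Previous page had no cursor: stop.
      :halt ->
        nil

      {:cont, cursor} ->
        case fetch_page.(cursor) do
          # Last page: emit its items, then halt on the next step.
          {:ok, %{items: items, next_cursor: nil}} -> {items, :halt}
          # More pages: emit items and carry the cursor forward.
          {:ok, %{items: items, next_cursor: next}} -> {items, {:cont, next}}
          # On error this sketch simply ends the stream.
          {:error, _reason} -> nil
        end
    end)
    # Flatten the per-page lists into one stream of items.
    |> Stream.flat_map(fn items -> items end)
  end
end

# Fake three-page API keyed by cursor, standing in for real HTTP calls.
pages = %{
  nil => %{items: [1, 2], next_cursor: "a"},
  "a" => %{items: [3], next_cursor: "b"},
  "b" => %{items: [4, 5], next_cursor: nil}
}

fetch = fn cursor -> {:ok, Map.fetch!(pages, cursor)} end
IO.inspect(CursorPager.stream(fetch) |> Enum.to_list())
```

Because the result is a lazy stream, piping it through Enum.take/2 stops requesting pages as soon as enough items have been collected.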
Infinite scroll pages that don't expose an API require JavaScript rendering (we’ll cover this in the next section).
Scraping JavaScript-rendered pages
JavaScript rendering is one of the hardest challenges in Elixir web scraping. Elixir's HTTP clients, including HTTPoison, Req, and Finch, all fetch raw HTML. None of them can execute JavaScript.
When a target page loads its content dynamically after the initial HTML response, your scraper returns empty containers. Web scraping with Elixir addresses this through external rendering services rather than native headless browser support.
Here are specific signs that can tell you a site requires JavaScript rendering:
- You get a 200 response, but Floki finds empty containers where data should be.
- Data only appears after scrolling or clicking.
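- The data is missing from the visible markup and lives only inside <script> blocks, for example <script type="application/ld+json">. In that case you can often parse the embedded JSON directly instead of rendering the page.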
Check the browser DevTools Network tab for XHR/Fetch calls. If the JavaScript is loading data from a REST API, you can call that API directly from Elixir and skip rendering entirely.
Some pages load different content depending on the User-Agent, serving a static fallback to bots. Try setting a mobile User-Agent to see if you get a simpler, static page.
Here are practical approaches for scraping JavaScript-rendered pages:
Option 1: Splash integration with Crawly
Splash is a lightweight JavaScript rendering service that runs in Docker. Crawly supports it natively via the fetcher configuration option.
When you route requests through Splash, it renders the page in a real browser engine (Qt WebKit) and returns the fully rendered HTML to your spider. Run Splash in Docker, then switch one line in your config:
With this configuration, your spider code doesn't change at all. Crawly routes each request through Splash, which renders the JavaScript, then returns the populated HTML to your parse_item/1 callback as usual.
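For reference, the fetcher switch might look like the fragment below. It assumes Splash is listening on localhost port 8050 (for example, started from the scrapinghub/splash Docker image) and uses Crawly's Splash fetcher module; verify the option names against the Crawly version you have installed.

```elixir
# config/config.exs (fragment). Host and port are assumptions for a local Splash.
import Config

config :crawly,
  # Route every request through Splash's render.html endpoint
  # instead of fetching the target page directly.
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
```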
Note: Splash works well for pages with simple JavaScript rendering requirements. However, it can struggle with modern single-page applications that use sophisticated client-side routing or that run specific anti-bot JavaScript challenges.
It also adds operational complexity since you need Docker running, and Splash itself needs to be monitored.
Option 2: Decodo Web Scraping API (recommended for production)
Instead of managing a headless browser, route requests through the Decodo Web Scraping API. The API handles JavaScript rendering, proxy rotation, CAPTCHA solving, and geo-targeting in a single API call. From Elixir, it's just an HTTP POST to a different endpoint. Your Floki parsing code stays the same.
It also handles automatic retries, TLS fingerprint rotation, and real browser fingerprints. You get rendered HTML back without maintaining any browser infrastructure.
Create a new Mix project if you haven’t:
Update mix.exs:
Run mix deps.get, then create lib/job_scraper/decodo_script.ex and paste the full script:
Set credentials:
Run the script with:
Expected console output will look like this:
Handling anti-scraping mechanisms
Anti-scraping techniques have become sophisticated over the past few years. A basic User-Agent rotation and a small delay between requests aren't enough to stay undetected with Elixir web scraping at scale.
You need to understand exactly what a target deploys (rate limiting, TLS fingerprinting, behavioral scoring) and how to counter it in Elixir so you can build scrapers that hold up rather than break silently after a few hundred requests.
Common anti-scraping mechanisms
- Rate limiting and IP blocking. Servers track request frequency per IP address. When you exceed the threshold, you'll likely see 429 responses, or the site silently serves a page that looks correct but contains placeholder data or fewer results than normal. IP blocks are often temporary (15-minute to 24-hour windows), but repeated violations can escalate to permanent bans.
- User-agent fingerprinting. Sites maintain blocklists of User-Agent strings associated with known scraping libraries. Library defaults such as hackney's (hackney/1.18.1) are easy to recognize and likely to be on these lists, so any request with a library-style User-Agent string risks being flagged immediately.
- CAPTCHA challenges. These are typically triggered after sustained high-frequency traffic from the same IP. CAPTCHA requires human interaction or a solving service. Standard image recognition or audio CAPTCHAs can be solved programmatically, but reCAPTCHA v3 uses behavioral scoring that can't be bypassed with just a solver.
- TLS fingerprinting. Advanced sites inspect the TLS handshake signature, not just HTTP headers. Erlang's TLS client produces a different signature and extension set compared to Chrome's TLS stack. Cloudflare's bot detection uses this signal. Therefore, sending browser-realistic HTTP headers doesn't help if the TLS fingerprint screams "Erlang."
- Behavioral analysis. Machine learning systems score requests based on timing patterns, header completeness, navigation sequences, and mouse/keyboard event data injected via JavaScript. A scraper that hits 50 pages in 2 seconds with perfectly uniform timing looks nothing like a human user who takes 3-8 seconds per page with random variation.
How to overcome these anti-scraping mechanisms in Elixir
Apply these mitigations in order. Start simple and escalate.
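The simplest mitigations need nothing beyond the standard library: rotate the User-Agent, add random delays between requests, and back off after a 429. The sketch below is one possible shape; the Politeness module name, the User-Agent strings, and the timing defaults are illustrative assumptions.

```elixir
defmodule Politeness do
  @moduledoc "Hypothetical helpers for header rotation and request pacing."

  @user_agents [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15"
  ]

  # Picks a different browser-like User-Agent per call.
  def random_headers do
    [
      {"user-agent", Enum.random(@user_agents)},
      {"accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"},
      {"accept-language", "en-US,en;q=0.9"}
    ]
  end

  # Random pause in milliseconds, so request timing is not uniform.
  def jittered_delay(min_ms \\ 3_000, max_ms \\ 8_000) when min_ms <= max_ms do
    Enum.random(min_ms..max_ms)
  end

  # Exponential backoff with jitter for retries after a 429:
  # attempt 1 waits about base_ms, attempt 2 about 2x, capped at cap_ms.
  def backoff_delay(attempt, base_ms \\ 1_000, cap_ms \\ 60_000) when attempt >= 1 do
    exp = min(cap_ms, base_ms * Integer.pow(2, attempt - 1))
    exp + Enum.random(0..div(exp, 2))
  end
end

IO.inspect(Politeness.random_headers())
IO.inspect(Politeness.jittered_delay())
IO.inspect(Politeness.backoff_delay(3))
```

A fetch loop would call Process.sleep(Politeness.jittered_delay()) between requests and pass Politeness.random_headers() as the request headers.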
In your Crawly-based Elixir project, create lib/job_scraper/spiders/hn_spider.ex and add this code as the spider module:
Proxy rotation for IP-level blocks
When an IP is blocked, header tuning won't help. You need a different IP address. Residential proxies route traffic through real consumer IPs, which are far harder for sites to detect and block than data center IPs. Configure Decodo's residential proxies directly in your Req requests:
Run it exactly as before. Swap JobScraper.Fetcher.fetch for JobScraper.ProxyFetcher.fetch anywhere in your spider code.
Each request through Decodo's gateway rotates to a different residential IP automatically. From the target site's perspective, each request comes from a different home internet connection in the configured country.
For targets that block aggressively, the Decodo Web Scraping API handles proxy rotation, CAPTCHA solving, and retry logic transparently. This means you focus on parsing, not infrastructure.
Storing and structuring scraped data
Extracting data is only half the job. You need to save your scraped data reliably and in a format that fits your downstream use case – whether that's a spreadsheet for a one-off analysis, a JSON feed for an API, or a production database that gets queried daily. Here are three storage options. Pick the simplest one that meets your requirements.
Option 1: CSV export
CSV is the fastest path from raw scrape to a format any analyst can open in Excel or Google Sheets. Use the nimble_csv library to stream rows directly to a file without loading everything into memory. This is critical for large scrape jobs.
Add the dependency to mix.exs:
Then run mix deps.get and create the writer module:
Run it:
The Stream.into/2 pipe means each row gets written as it's processed rather than buffering all rows in memory first.
In practice, CSV export is best for one-off data scraping tasks, reporting, or importing into spreadsheet tools. If you need to query or filter the data programmatically after collection, a database is a better fit.
Option 2: JSON output
Crawly's WriteToFile pipeline outputs JSON Lines by default – 1 JSON object per line, which is memory-efficient for large files and trivial to process line-by-line in any downstream tool.
For structured nested output collected in your own pipeline, encode the full results list directly with Jason. JSON is also the most practical format when the target itself responds with JSON: Req decodes a JSON response body automatically, so there's no HTML parsing step.
This storage option is best for API consumption, feeding data pipelines, or when you need nested data structures (arrays within objects) that CSV can't represent without extra encoding.
Option 3: Database storage with Ecto
For production pipelines that run repeatedly, storing directly to a database with Ecto is the right approach. It enables deduplication by primary key, structured querying after collection, and integration with the rest of your Elixir application.
Add the dependencies to mix.exs:
Configure your Repo in config/config.exs and create the database:
Define a schema that maps to your scraped data structure in lib/job_scraper/job.ex:
Use batch inserts for efficiency. Inserting one record at a time during a large crawl can be a bottleneck to your pipeline. Create lib/job_scraper/pipeline/database_writer.ex:
Run it:
Start with SQLite locally, switch to PostgreSQL for production. This storage option is ideal for recurring scrape jobs, production pipelines, or any case where you need to query the data after collection.
Data quality considerations
Raw HTML rarely produces clean data without some normalization. You need to build these data cleaning steps into your extraction pipeline before any record hits storage:
- Trim whitespace from every text field – Floki.text often returns strings with leading/trailing newlines from HTML indentation.
- Deduplicate by URL before insert using on_conflict: :nothing.
- Log records that fail Ecto validation rather than crashing the pipeline – validation failures usually mean a selector stopped matching after the site changed its HTML.
- Normalize text fields: collapse multiple spaces and strip non-printable characters that sometimes appear in scraped text.
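The text cleanup and in-memory deduplication steps from the list above can be sketched with the standard library. Clean is a hypothetical module name, and the sketch assumes each item is a map with a :url key.

```elixir
defmodule Clean do
  @moduledoc "Hypothetical cleaning helpers applied before records hit storage."

  def normalize_text(nil), do: nil

  def normalize_text(text) when is_binary(text) do
    text
    # Strip ASCII control characters that are not whitespace.
    |> String.replace(~r/[\x00-\x08\x0E-\x1F\x7F]/, "")
    # Collapse any run of whitespace (newlines, tabs, spaces) into one space.
    |> String.replace(~r/\s+/u, " ")
    |> String.trim()
  end

  # Keeps the first item seen for each URL within one batch.
  def dedupe_by_url(items), do: Enum.uniq_by(items, fn item -> item.url end)
end

IO.inspect(Clean.normalize_text("  Senior\n   Elixir\tDeveloper  "))

IO.inspect(
  Clean.dedupe_by_url([
    %{url: "https://example.com/a", title: "A"},
    %{url: "https://example.com/b", title: "B"},
    %{url: "https://example.com/a", title: "A again"}
  ])
)
```

This only removes duplicates inside a single batch. Duplicates across runs still need a unique index on the URL column together with on_conflict: :nothing.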
Scaling and advanced techniques
Elixir web scraping scales in ways that Python scrapers can't match without significant additional infrastructure. The BEAM concurrency model lets you run thousands of parallel requests on a single BEAM node, restart failed workers automatically, and maintain long-running crawl jobs without manual intervention.
Concurrent scraping with Task.async_stream
For a known list of URLs, like all 50 pages of a job board, Task.async_stream/3 spawns one supervised task per URL and collects results concurrently. It's simpler than Crawly for targeted, finite crawls where you already know every URL upfront:
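Set max_concurrency to match your proxy pool size and the target's rate limits. Task.async_stream emits {:ok, result} for each task that completes. It emits {:exit, reason} for a timeout only when you pass on_timeout: :kill_task, and for a crashed task only when the caller traps exits; with the defaults, either failure takes the calling process down. Always pattern-match on both.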
The Enum.reduce in the example above silently skips failed tasks and timeouts, so the batch completes regardless of individual failures.
Use Task.Supervisor.async_stream_nolink/4 instead when you need finer control over task supervision. It prevents task crashes from propagating to the caller process.
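As a sketch of the unlinked variant, the hypothetical module below runs the same kind of fetch function under a Task.Supervisor and turns crashes and timeouts into {:error, reason} tuples instead of letting them reach the caller. The default option values are assumptions:

```elixir
defmodule JobScraper.SafeFetcher do
  @moduledoc "Hypothetical wrapper around Task.Supervisor.async_stream_nolink/4."

  @doc "Returns one `{:ok, value}` or `{:error, reason}` per URL, in input order."
  def fetch_all(supervisor, urls, fetch_fun, opts \\ []) do
    stream_opts =
      Keyword.merge([max_concurrency: 5, timeout: 15_000, on_timeout: :kill_task], opts)

    supervisor
    |> Task.Supervisor.async_stream_nolink(urls, fetch_fun, stream_opts)
    |> Enum.map(fn
      {:ok, value} -> {:ok, value}
      # A raise, exit, or timeout inside the task surfaces here; the caller keeps running.
      {:exit, reason} -> {:error, reason}
    end)
  end
end
```

The supervisor can be started ad hoc with Task.Supervisor.start_link/0 or, more typically, as a named child in the application's supervision tree.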
Supervision trees for resilient scrapers
For long-running production scrapers, wrap your spider processes under a DynamicSupervisor so they restart automatically if they crash. Separate the crawler process from the storage process so database slowness doesn't block fetching:
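One way to lay this out is sketched below, assuming a JobScraper.Repo module for storage and hypothetical module names for everything else. Spiders are started on demand under the DynamicSupervisor, while the Repo (and its connection pool) sits beside it as a sibling:

```elixir
defmodule JobScraper.SpiderSupervisor do
  @moduledoc "Starts and restarts spider processes on demand."
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts), do: DynamicSupervisor.init(strategy: :one_for_one)

  @doc "Starts a spider from any valid child spec, e.g. `{MySpider, args}`."
  def start_spider(child_spec) do
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end
end

defmodule JobScraper.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Storage: the Ecto repo owns the database connection pool.
      JobScraper.Repo,
      # Crawling: spiders live under their own supervisor.
      JobScraper.SpiderSupervisor
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: JobScraper.Supervisor)
  end
end
```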
The :one_for_one strategy means that if the database pool crashes, only it gets restarted – the spider supervisor keeps running. If a spider process crashes, only that spider restarts – not the database connection pool. In other words, each process fails and recovers in isolation.
Modularizing the scraper for maintainability
Sites change their HTML. When a target updates its structure, you want to find and fix the broken selector in under a minute. Extract all CSS selectors into dedicated parser modules so there's one obvious place to look:
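A sketch of such a parser module is shown below. Every selector and the module name are placeholders for a hypothetical job board, so replace them with the ones that match your target's markup:

```elixir
defmodule JobScraper.Parsers.JobListing do
  @moduledoc "Keeps every CSS selector for one target in a single place."

  # Placeholder selectors for a hypothetical site; update these when the HTML changes.
  @selectors %{
    item: "div.job-listing",
    title: "h2.job-title",
    company: "span.company",
    link: "a.job-link"
  }

  @doc "Parses a listing page into a list of job maps."
  def parse(html) when is_binary(html) do
    {:ok, document} = Floki.parse_document(html)

    document
    |> Floki.find(@selectors.item)
    |> Enum.map(&parse_item/1)
  end

  defp parse_item(item) do
    %{
      title: text_at(item, @selectors.title),
      company: text_at(item, @selectors.company),
      # Floki.attribute/2 returns a list; take the first href or nil.
      url: item |> Floki.find(@selectors.link) |> Floki.attribute("href") |> List.first()
    }
  end

  defp text_at(node, selector) do
    node |> Floki.find(selector) |> Floki.text() |> String.trim()
  end
end
```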
Define a scraper behaviour with shared callbacks so all your spiders implement a consistent interface: init/0, parse_item/1, and base_url/0. Use application config or environment variables to manage per-environment settings, including rate limits, proxy credentials, and database URLs, without hardcoding anything in spider modules:
Because every spider implements the same behaviour, adding a new target is straightforward: implement the behaviour, drop the module in spiders/, and the orchestration layer picks it up without changes.
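A minimal sketch of such a behaviour and one implementation follows. The callback return types, the :example_base_url config key, and the example.com URL are all assumptions made for illustration:

```elixir
defmodule JobScraper.Spider do
  @moduledoc "Shared interface that every spider module implements."

  @callback init() :: keyword()
  @callback base_url() :: String.t()
  @callback parse_item(html :: String.t()) :: {:ok, map()} | {:error, term()}
end

defmodule JobScraper.Spiders.Example do
  @behaviour JobScraper.Spider

  @impl true
  def init, do: [start_urls: [base_url() <> "/jobs"]]

  # Read per-environment settings from application config instead of hardcoding them.
  @impl true
  def base_url do
    Application.get_env(:job_scraper, :example_base_url, "https://example.com")
  end

  # Placeholder parsing: a real spider would delegate to a parser module here.
  @impl true
  def parse_item(html) when is_binary(html) do
    {:ok, %{raw_size: byte_size(html)}}
  end
end
```

Secrets such as proxy credentials are usually read with System.get_env/1 in config/runtime.exs, so they never end up in source control.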
Scheduling and automation
For recurring scrape jobs, like collecting new listings every morning or refreshing pricing data hourly, use the Quantum library to trigger them on a schedule. It's a cron-like scheduler that runs natively inside your Elixir supervision tree:
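A configuration sketch for those two schedules is shown below. It assumes a scheduler module named JobScraper.Scheduler defined with use Quantum, otp_app: :job_scraper, and a hypothetical JobScraper.Runner.run/1 function that starts a crawl. Cron expressions are evaluated in UTC unless you configure a timezone:

```elixir
# config/config.exs
import Config

config :job_scraper, JobScraper.Scheduler,
  jobs: [
    # Every day at 06:00: collect new listings.
    {"0 6 * * *", {JobScraper.Runner, :run, [:listings]}},
    # At the start of every hour: refresh pricing data.
    {"@hourly", {JobScraper.Runner, :run, [:pricing]}}
  ]
```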
Add the scheduler to your supervision tree in application.ex and it starts automatically when your app boots. Combine it with a GenServer state machine to track run history, prevent overlapping runs when a slow crawl runs past its scheduled interval, and send alerts on consecutive failures.
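The overlap guard and failure counter can be sketched as a small GenServer. The module name, state shape, and return values below are assumptions made for illustration, and the caller decides when the failure count warrants an alert:

```elixir
defmodule JobScraper.RunTracker do
  @moduledoc "Tracks whether a crawl is running and how many runs failed in a row."
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, Keyword.put_new(opts, :name, __MODULE__))
  end

  @doc "Returns :ok if a run may start, or {:error, :already_running}."
  def start_run(server \\ __MODULE__), do: GenServer.call(server, :start_run)

  @doc "Records :ok or :error and returns {:ok, consecutive_failures}."
  def finish_run(result, server \\ __MODULE__) when result in [:ok, :error] do
    GenServer.call(server, {:finish_run, result})
  end

  @impl true
  def init(:ok), do: {:ok, %{status: :idle, consecutive_failures: 0, history: []}}

  @impl true
  def handle_call(:start_run, _from, %{status: :running} = state) do
    # A previous run is still going: refuse to start an overlapping one.
    {:reply, {:error, :already_running}, state}
  end

  def handle_call(:start_run, _from, state) do
    {:reply, :ok, %{state | status: :running}}
  end

  def handle_call({:finish_run, result}, _from, state) do
    failures = if result == :ok, do: 0, else: state.consecutive_failures + 1
    entry = {result, DateTime.utc_now()}

    new_state = %{
      state
      | status: :idle,
        consecutive_failures: failures,
        history: [entry | state.history]
    }

    {:reply, {:ok, failures}, new_state}
  end
end
```

A scheduled job would call start_run/0 first, skip the crawl on {:error, :already_running}, and call finish_run/1 when it ends. This sketch does not handle a crawl process that dies before reporting back; monitoring the caller would close that gap.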
Web scraping with Elixir and Crawly, plus a Quantum scheduler, gives you a fully automated data collection pipeline that runs without any manual trigger.
Final thoughts
Scraping data from websites is harder than it looks. Sites change their HTML, deploy new anti-bot systems, alter pagination structure, and start serving different content to different IPs, all without notice.
Web scraping in Elixir handles the structural challenges well: the supervision tree restarts failed processes, the BEAM scheduler distributes concurrent fetches across CPU cores, and pattern matching keeps parsing logic readable as selectors get complex.
The trade-off is a smaller ecosystem, with fewer third-party integrations and less community documentation than Python's Scrapy stack.
Don't forget that a reliable proxy infrastructure is crucial for building highly efficient, low-maintenance scraping pipelines capable of overcoming the web's most rigid defenses. For larger projects where maintenance becomes more work than the data is worth, Decodo's Web Scraping API removes the proxy and anti-bot setup entirely.
About the author

Justinas Tamasevicius
Director of Engineering
Justinas Tamaševičius is Director of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.