Crawlee

Crawlee is an open-source web scraping and browser automation library for Node.js. It provides a high-level framework for building scalable crawlers that handle both static and dynamic websites. Crawlee supports plain HTTP scraping as well as headless browsers driven through Puppeteer or Playwright, so developers can switch scraping strategies depending on how complex the target website is.

Also known as: Crawling framework, Crawlee.js

Comparisons

Crawlee vs. Puppeteer: Puppeteer is a library for controlling a browser. Crawlee can run on top of Puppeteer and adds crawl management around it: URL queueing, session handling, proxy rotation, and retries of failed requests.
Crawlee vs. Cheerio: Cheerio is a parser for static HTML and does no fetching or crawling of its own. Crawlee handles the full crawling workflow, including scraping dynamic pages and tracking crawl state, and uses Cheerio as the parser in its CheerioCrawler class.
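
To make the comparison concrete, the following is a minimal sketch of the bookkeeping a crawling framework layers on top of a browser controller or HTML parser: a queue that deduplicates URLs and a loop that retries failed requests. The names SimpleRequestQueue, crawl, and maxRetries are hypothetical and chosen for illustration; this is not Crawlee's implementation, and it runs synchronously with no concurrency.

```javascript
// Illustrative sketch only: hypothetical names, not Crawlee's implementation.
class SimpleRequestQueue {
  constructor() {
    this.pending = [];     // requests waiting to be handled
    this.seen = new Set(); // every URL ever enqueued, used for deduplication
  }

  // Enqueue a URL unless it was added before. Returns true if it was added.
  add(url) {
    const key = new URL(url).href; // normalizes, e.g. adds the root slash
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.pending.push({ url: key, retryCount: 0 });
    return true;
  }

  // Take the oldest pending request, or undefined when the queue is empty.
  next() {
    return this.pending.shift();
  }

  // Put a failed request back at the end of the queue for another attempt.
  reclaim(request) {
    request.retryCount += 1;
    this.pending.push(request);
  }
}

// Process every queued URL, retrying a failing handler up to maxRetries times.
function crawl(startUrls, handler, maxRetries = 2) {
  const queue = new SimpleRequestQueue();
  const handled = [];
  const failed = [];
  for (const url of startUrls) queue.add(url);

  let request;
  while ((request = queue.next()) !== undefined) {
    try {
      handler(request, queue); // the handler may enqueue newly found links
      handled.push(request.url);
    } catch (err) {
      if (request.retryCount < maxRetries) queue.reclaim(request);
      else failed.push(request.url);
    }
  }
  return { handled, failed };
}
```

A handler passed to crawl receives the current request and the queue, so it can call queue.add() for links it discovers. URLs that still fail after the retry budget is spent end up in the failed list instead of stopping the crawl.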

Pros

Scalable: Built to handle large-scale crawling with request queuing, concurrency, and auto-scaling.

Pluggable architecture: Supports various scraping strategies (e.g., HTTP, headless browsers).

Smart defaults: Built-in session management and request retries, plus proxy rotation once a list of proxies is configured.

Developer-friendly: Offers robust APIs and integrates well with the Apify platform.
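
As a rough illustration of what proxy rotation and session management involve, the sketch below cycles through a list of proxies in round-robin order and retires a proxy after it has failed too often. The ProxyRotator class, its maxErrors parameter, and the proxy URLs are assumptions made for this example; Crawlee's own proxy and session handling is more elaborate and is configured through its own options.

```javascript
// Illustrative sketch only: hypothetical class, not Crawlee's API.
class ProxyRotator {
  constructor(proxyUrls, maxErrors = 3) {
    // One session per proxy, each with its own error counter.
    this.sessions = proxyUrls.map((proxyUrl) => ({ proxyUrl, errors: 0 }));
    this.maxErrors = maxErrors;
    this.cursor = 0;
  }

  // Return the next usable session, cycling through them in order.
  next() {
    const usable = this.sessions.filter((s) => s.errors < this.maxErrors);
    if (usable.length === 0) throw new Error('No usable proxies left');
    const session = usable[this.cursor % usable.length];
    this.cursor += 1;
    return session;
  }

  // Record a failure (for example a blocked response) against a session.
  markBad(session) {
    session.errors += 1;
  }
}

// Hypothetical proxy URLs, for illustration only.
const rotator = new ProxyRotator(
  ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
  1,
);
const first = rotator.next();  // session for proxy-1
const second = rotator.next(); // session for proxy-2
rotator.markBad(first);        // proxy-1 reaches maxErrors and is retired
const third = rotator.next();  // only proxy-2 is still usable
```

Retiring sessions by error count keeps one blocked proxy from consuming the retry budget of every request routed through it.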

Cons

Heavier footprint: More complex and resource-intensive than lightweight scrapers.

Learning curve: Requires understanding of concepts like request queues and datasets.
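
The two concepts can be summarized briefly: a request queue holds the URLs still to be visited, while a dataset is an append-only store for the records a crawl produces. The sketch below models only the dataset side. SimpleDataset and toJsonLines are hypothetical names; the method name pushData echoes Crawlee's dataset API, but the class is a simplified in-memory stand-in rather than the real storage.

```javascript
// Illustrative sketch only: an in-memory stand-in for a crawler dataset.
class SimpleDataset {
  constructor() {
    this.items = []; // records in the order they were stored
  }

  // Append one record or an array of records; stored records are never edited.
  // Returns the total number of records held afterwards.
  pushData(data) {
    const records = Array.isArray(data) ? data : [data];
    for (const record of records) this.items.push({ ...record });
    return this.items.length;
  }

  // Serialize the records as JSON Lines: one JSON object per line.
  toJsonLines() {
    return this.items.map((item) => JSON.stringify(item)).join('\n');
  }
}

const dataset = new SimpleDataset();
dataset.pushData({ url: 'https://example.com/', title: 'Example Domain' });
dataset.pushData([
  { url: 'https://example.com/a', title: 'Page A' },
  { url: 'https://example.com/b', title: 'Page B' },
]);
const exported = dataset.toJsonLines();
```

Keeping results append-only means a handler only ever adds records, which makes it simple to export everything once the crawl finishes.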

Example

A developer sets up a simple web crawler to extract page titles from a list of URLs using Crawlee:

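// Run as an ES module (for example a .mjs file), since top-level await is used.
import { CheerioCrawler } from 'crawlee';

// CheerioCrawler fetches pages over plain HTTP and parses the HTML with Cheerio.
const crawler = new CheerioCrawler({
  // Called once for every successfully downloaded page.
  async requestHandler({ request, $ }) {
    // $ is the Cheerio handle for the parsed page.
    const title = $('title').text();
    console.log(`Title of ${request.url}: ${title}`);
  },
});

// Start the crawl with the initial list of URLs.
await crawler.run([
  { url: 'https://example.com' },
  { url: 'https://another-site.com' },
]);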

This example demonstrates how Crawlee streamlines the process of crawling and extracting data from multiple pages with minimal setup.