Crawlee
Date: 2025-04-15
Crawlee is an open-source web scraping and browser automation library for Node.js. It provides a high-level framework for building scalable crawlers that handle both static and dynamic (JavaScript-rendered) websites. Crawlee supports headless-browser crawling through Puppeteer and Playwright as well as plain HTTP scraping, so developers can switch strategies depending on how complex the target website is.
Also known as: Crawling framework, Crawlee.js
Comparisons
- Crawlee vs. Puppeteer: Puppeteer controls a browser, while Crawlee builds on top of Puppeteer to manage tasks like queueing URLs, session handling, proxy rotation, and error retries.
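- Crawlee vs. Cheerio: Cheerio only parses and queries HTML that has already been fetched; it does not download pages or execute JavaScript. Crawlee handles the full crawling workflow, including scraping dynamic pages and managing crawl state, and can use Cheerio as its HTML parser (as in CheerioCrawler).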
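To make the comparison more concrete, the sketch below shows in plain JavaScript the kind of bookkeeping a crawling framework takes off the developer's hands: a queue that skips already-seen URLs and retries failed requests. It is a conceptual illustration only, not Crawlee's implementation; `TinyRequestQueue`, `maxRetries`, and the shape of the returned `stats` object are hypothetical names chosen for this example.

```javascript
// Conceptual sketch only: NOT Crawlee's internals. All names are hypothetical.
// A tiny request queue that deduplicates URLs and retries failed requests.
class TinyRequestQueue {
  constructor(maxRetries = 2) {
    this.maxRetries = maxRetries;
    this.pending = [];      // requests waiting to be processed
    this.seen = new Set();  // URLs already enqueued, used for deduplication
  }

  // Add a URL unless it was enqueued before. Returns true if it was added.
  add(url) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push({ url, retryCount: 0 });
    return true;
  }

  // Process requests one at a time; re-enqueue failures until retries run out.
  async run(handler) {
    const stats = { succeeded: [], failed: [] };
    while (this.pending.length > 0) {
      const request = this.pending.shift();
      try {
        await handler(request);
        stats.succeeded.push(request.url);
      } catch (err) {
        if (request.retryCount < this.maxRetries) {
          request.retryCount += 1;
          this.pending.push(request); // try again later
        } else {
          stats.failed.push(request.url); // give up on this URL
        }
      }
    }
    return stats;
  }
}
```

With a bare browser-control or parsing library, logic like this has to be written and maintained by hand; a crawling framework is meant to provide it, along with persistence and proxy handling, out of the box.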
Pros
- Scalable: Built to handle large-scale crawling with request queuing, concurrency, and auto-scaling.
- Pluggable architecture: Supports various scraping strategies (e.g., HTTP, headless browsers).
- Smart defaults: Built-in proxy rotation, session management, and request retries.
- Developer-friendly: Offers robust APIs and integrates well with the Apify platform.
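The concurrency point above can be illustrated with a small plain-JavaScript sketch that processes a list of URLs while keeping at most a fixed number of requests in flight. This is a simplified illustration under stated assumptions, not Crawlee's auto-scaling logic: `runWithConcurrency`, `handler`, and `maxConcurrency` are hypothetical names for this example, and the limit here is fixed rather than adjusted to system load.

```javascript
// Conceptual sketch only: NOT Crawlee's auto-scaling logic. All names are hypothetical.
// Runs `handler` over `urls` with at most `maxConcurrency` calls in flight.
async function runWithConcurrency(urls, handler, maxConcurrency = 2) {
  const results = new Array(urls.length);
  let nextIndex = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker() {
    while (nextIndex < urls.length) {
      const i = nextIndex++;
      results[i] = await handler(urls[i]);
    }
  }

  const workerCount = Math.min(maxConcurrency, urls.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // results keep the same order as `urls`
}
```

In this sketch a handler that throws rejects the whole run; a real framework would typically combine a pool like this with per-request retries so that one failing page does not stop the crawl.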
Cons
- Heavier footprint: More complex and resource-intensive than lightweight scrapers.
- Learning curve: Requires understanding of concepts like request queues and datasets.
Example
A developer sets up a simple web crawler to extract page titles from a list of URLs using Crawlee:
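import { CheerioCrawler } from 'crawlee';

// CheerioCrawler fetches each page over HTTP and parses the HTML with Cheerio.
const crawler = new CheerioCrawler({
    // Called for each fetched page; `$` is the Cheerio handle for that page's HTML.
    async requestHandler({ request, $ }) {
        const title = $('title').text();
        console.log(`Title of ${request.url}: ${title}`);
    },
});

// Top-level await requires an ES module (a .mjs file or "type": "module" in package.json).
await crawler.run([
    { url: 'https://example.com' },
    { url: 'https://another-site.com' },
]);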
This example demonstrates how Crawlee streamlines the process of crawling and extracting data from multiple pages with minimal setup.