Date: 2025-02-20

What is a web crawler in Python?

As of January 2025, there are at least 3.98 billion indexed web pages on the internet, and most of that data is publicly accessible. However, accessing it at scale is a different challenge entirely.

Whether you need to track prices across hundreds of product pages, audit a site with thousands of URLs, or build a research dataset spread across dozens of domains, opening pages one by one and copying content out of them is a huge time sink that grows linearly with the amount of data you need.

A web crawler solves this by systematically automating the discovery and retrieval of pages. Not to be confused with a scraper (we'll get to the distinction), a crawler's primary function is discovery: it starts from a URL, fetches the page, extracts all the links on it, and repeats that process across as many pages as you define.

Search engines run this at planetary scale. You can run a version of the same process in Python with a few dozen lines of code.

The difficulty, as with most things that sound simple, is in the details: fetching the right pages in the right order, avoiding duplicates, staying within rate limits, handling errors cleanly, and eventually scaling beyond a single machine.
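Duplicate avoidance is a good illustration of those details. The same page is often linked as a relative path, with a fragment, or with different letter casing in the host name. The sketch below shows one possible way to reduce such variants to a single key using only the standard library. The function name `normalize_url` and the example.com URLs are illustrative assumptions, and the rules are deliberately minimal (it does not sort query parameters or special-case credentials embedded in the host part).

```python
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def normalize_url(base_url: str, href: str) -> str:
    """Resolve href against base_url and return a canonical form for deduplication."""
    absolute = urljoin(base_url, href)         # resolve relative links against the current page
    absolute, _fragment = urldefrag(absolute)  # "#section" does not change which page is fetched
    parts = urlsplit(absolute)
    path = parts.path or "/"                   # treat "https://host" and "https://host/" alike
    # Scheme and host are case-insensitive, so lowercase them; path and query stay as-is.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


# Four different-looking links that point at only two distinct pages.
seen: set[str] = set()
for href in ["/page/2/", "/page/2/#top", "https://Example.com/page/2/", "about"]:
    seen.add(normalize_url("https://example.com/page/1/", href))

print(sorted(seen))
```

Checking each normalized URL against a `seen` set before queueing it keeps the crawler from fetching the same page twice.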

Web crawling vs. web scraping: What's the difference?

Web crawling and web scraping get used interchangeably, but they describe different parts of the same workflow.

Web crawling is the process of discovering pages. A crawler starts with one or more seed URLs, fetches those pages, finds every link on them, adds those links to a queue, and keeps going. The output of a crawl is typically a list of URLs and the raw HTML of the pages it visited.
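The loop described above can be sketched as a breadth-first traversal with a queue and a set of already-seen URLs. To keep the example runnable without network access, it crawls a hypothetical in-memory site (`FAKE_SITE`) instead of fetching real pages. The `crawl` function, its `get_links` callback, and the `max_pages` limit are assumptions made for illustration, not part of any library.

```python
from collections import deque
from typing import Callable, Iterable

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], list[str]],
    max_pages: int = 100,
) -> list[str]:
    """Breadth-first crawl that returns URLs in the order they were visited."""
    queue = deque(seeds)   # URLs waiting to be fetched
    seen = set(queue)      # every URL ever queued, so nothing is queued twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):  # in a real crawler: fetch the page, then parse its links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl(["https://example.com/"], lambda url: FAKE_SITE.get(url, []))
print(order)
```

In a real crawler, `get_links` would make an HTTP request and parse the response. Keeping it as a parameter means the traversal logic can be tested separately from the networking code.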

Web scraping is the process of extracting structured data from specific pages. A scraper knows where it's going and what it wants: product prices, review text, job titles, contact information, etc. The output is structured data, usually written to a file or database.

In practice, most real-world data collection projects combine both. You crawl to discover the pages you care about, then scrape those pages to extract the data you need. To learn more about the differences between web crawling and web scraping, read our guide.
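To make the distinction concrete, the sketch below reads one small HTML fragment and produces both kinds of output: a list of links (what a crawler wants) and a list of structured records (what a scraper wants). It uses the standard library's `html.parser` rather than Beautiful Soup so that it runs without any installs. The HTML, the class names `quote`, `text`, and `author`, and the `PageParser` class are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the markup and class names are assumptions.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">An example quote.</span>
  <small class="author">Jane Doe</small>
  <a href="/author/jane-doe">(about)</a>
</div>
<a href="/page/2/">Next</a>
"""


class PageParser(HTMLParser):
    """Collects links (crawling output) and quote records (scraping output)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []
        self.records: list[dict[str, str]] = []
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])   # discovery: where to go next
        css_class = attrs.get("class", "")
        if css_class == "quote":
            self.records.append({})            # a new record starts here
        elif css_class in ("text", "author") and self.records:
            self._field = css_class            # extraction: what to keep

    def handle_data(self, data):
        if self._field and data.strip():
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


parser = PageParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
print(parser.records)
```

The links would go back into the crawl queue, while the records would be written to a file or database.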

Common use cases for Python web crawlers

Before getting into the code, it's worth understanding what people actually build crawlers for. The use cases span industries and project sizes.

Search engine indexing

Search engine indexing is the canonical example of web crawling at scale. Google's crawler (often referred to as Googlebot) discovers new content by following links across billions of pages. When you publish a new post and it shows up in search results a few days later, that means a crawler found it. Smaller teams build internal versions of this for site search, documentation indexing, or content audits.

SEO analysis and monitoring

Crawlers are genuinely useful for SEO work. You can map a site's internal link structure, catch broken links and redirect chains, track how pages change over time, and understand how content is distributed across a domain. For competitor research, crawling a rival site gives you a structural view of what they're prioritizing without having to click through it manually.

Price monitoring and market intelligence

Retailers use crawlers to keep tabs on competitor pricing across thousands of product pages on eCommerce platforms, often on a daily or even hourly schedule. Without automated discovery, you'd have to manually maintain a list of every URL you want to monitor, which breaks down fast as product catalogs grow or change.

Academic and research data collection

Researchers across fields rely on crawlers to gather data at a scale that manual collection simply can't match. Training data for ML models, NLP research corpora, and economic datasets all depend on systematic collection of publicly available web data. Government data portals, academic repositories, and news archives are common targets.

Lead generation and business intelligence

Business directories, job boards, and professional listings are all fair game for crawlers when you're building a sales pipeline or doing market research. It's generally faster and more targeted than buying pre-built contact lists.

Preparing your crawler environment

Let’s move to the practical side and prepare a clean Python environment to keep dependencies predictable and isolated.

Python 3.10 or higher is recommended. Libraries tend to drop support for older Python versions as those versions reach end of life, so anything below 3.10 is increasingly likely to cause compatibility issues as you add dependencies.
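If you're not sure which interpreter a project will end up running on, a small guard at the top of a script can flag the problem early. This is a minimal sketch. The helper name `python_is_supported` and the warning text are illustrative, and the `(3, 10)` threshold simply mirrors the recommendation above.

```python
import sys

MIN_VERSION = (3, 10)  # mirrors the recommended minimum; adjust to your project's needs


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    found = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Warning: Python {found} found, 3.10 or higher is recommended.")
```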

Set up a virtual environment before installing anything:

python -m venv crawler-env
source crawler-env/bin/activate

Or if you're using uv, which handles dependency resolution faster than plain pip:

uv init my-crawler
cd my-crawler
uv add requests beautifulsoup4 lxml

For the basic crawler in the next section, you need these three packages:

pip install requests beautifulsoup4 lxml

Requests handles HTTP communication. Beautiful Soup parses the HTML. lxml is a faster parser backend for Beautiful Soup that's worth using over Python's built-in html.parser for anything beyond a quick prototype.
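After installing, it can be worth confirming that the active environment actually sees all three packages, since a forgotten `activate` step is a common cause of import errors. The sketch below checks this with the standard library's `importlib.util.find_spec`. Note that the import name differs from the install name for Beautiful Soup (`beautifulsoup4` is imported as `bs4`). The `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of any of these libraries.

```python
from importlib.util import find_spec

# Maps the pip package name to the module name you import in code.
REQUIRED = {"requests": "requests", "beautifulsoup4": "bs4", "lxml": "lxml"}


def missing_packages(required=REQUIRED) -> list[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items() if find_spec(module) is None]


missing = missing_packages()
print("All set" if not missing else f"Missing: {', '.join(missing)}")
```

If anything is reported missing, check that the virtual environment is activated before running the install command again.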

For VS Code users, the Python extension handles autocomplete, linting, and debugging well. PyCharm is the heavier-weight alternative with more built-in support for virtual environments and project management.

Building a basic crawler with Requests and Beautiful Soup

Let's build a crawler that systematically visits pages on Quotes to Scrape, a site designed for exactly this kind of practice. It has clean HTML, pagination, author pages, and tag pages, which gives enough link variety to make it a realistic test without any anti-bot headaches.