
How To Find All URLs on a Domain

Whether you're running an SEO audit, planning a site migration, or hunting down broken links, there's one task you'll inevitably face – finding every URL on a website. It sounds simple, but it isn't. Search engines don't index everything, sitemaps are often outdated, and dynamic pages hide behind JavaScript. This guide walks you through every major discovery method, from quick Google search operators and no-code scrapers to custom Python scripts.

Method 1: The technical basics (sitemaps & robots.txt)

Before reaching for enterprise crawling tools, start with what websites give you directly: their robots.txt file and XML sitemaps. These are the fastest ways to get a baseline list of a website's pages. They're specifically designed to declare a site's structure to search engines and crawlers.

Checking robots.txt

The robots.txt file lives at the root of every domain. To find it, simply append /robots.txt to the domain's root URL:

https://www.producthunt.com/robots.txt

Open this URL in your browser, and you'll see a plain text file with directives that tell crawlers which pages they're allowed (or not allowed) to access. Here's what a typical robots.txt looks like:
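A simplified, illustrative example (real files vary widely from site to site):

User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /cart/help

Sitemap: https://www.example.com/sitemap.xml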

What to look for:

Directive | What it tells you
User-agent: | Different rules for different crawlers (Googlebot, Bingbot, etc.)
Allow: | Exceptions to disallow rules
Disallow: | Paths the site doesn't want crawled, which often reveal hidden sections like /auth/, /admin/, or /internal/
Sitemap: | Direct links to the site's XML sitemaps – your primary source for URL discovery

Locating and parsing XML sitemaps

If robots.txt doesn't list a sitemap (many sites forget to include it), try these common locations:

  • /sitemap.xml
  • /sitemap_index.xml
  • /sitemap.xml.gz (compressed version)
  • /sitemaps/sitemap.xml
  • /wp-sitemap.xml (WordPress sites)
  • /sitemap.txt (plain text list of URLs, common on Shopify)
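
If you'd rather not try each path by hand, here's a minimal Python sketch (using the requests library) that probes these common locations and reports which ones respond. Note that some sites return a 200 status with an HTML error page, so skim the results before trusting them:

import requests

CANDIDATES = [
    "/sitemap.xml", "/sitemap_index.xml", "/sitemap.xml.gz",
    "/sitemaps/sitemap.xml", "/wp-sitemap.xml", "/sitemap.txt",
]

def find_sitemaps(domain):
    """Return the candidate sitemap URLs that respond with HTTP 200."""
    found = []
    for path in CANDIDATES:
        url = f"https://{domain}{path}"
        try:
            r = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
            if r.status_code == 200:
                found.append(url)
        except requests.RequestException:
            pass  # unreachable or timed out – skip this candidate
    return found

print(find_sitemaps("example.com"))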

Open the sitemap URL in your browser. You'll see XML that looks like this:
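A trimmed, illustrative example:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/how-to-find-urls</loc>
    <lastmod>2026-01-20</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/pricing</loc>
    <lastmod>2026-01-15</lastmod>
  </url>
</urlset>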

Key elements:

  • <loc> – the actual URL (this is what you're after)
  • <lastmod> – when the page was last modified
  • <changefreq> and <priority> – suggest how often a page changes and its relative importance, but Google officially ignores both tags. Don't rely on them for SEO decisions.

Handling sitemap index files

Because the standard protocol limits a single sitemap to 50,000 URLs (or 50MB), large websites use a sitemap index that points to multiple sub-sitemaps:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-01-20</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-01-19</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-categories.xml</loc>
    <lastmod>2026-01-18</lastmod>
  </sitemap>
</sitemapindex>

You'll need to fetch each sub-sitemap and extract the URLs from each one. (See Method 4 for a Python script that handles this automatically.)

Quick sitemap parsing (no code required)

If you just need to grab URLs from a sitemap without writing code, several free tools can help:

  • Screaming Frog has a dedicated sitemap mode (covered in Method 2).
  • Google Sheets – use the IMPORTXML() function:

    =IMPORTXML("https://example.com/sitemap.xml", "//loc")

Note: this formula times out on large sitemaps (2,000+ URLs) and fails on sites with anti-bot protection like Cloudflare.

The limitation of sitemaps: they only contain what the site owner has chosen to include. Sitemaps are often outdated, incomplete, or missing pages that exist but were never added. That's why you'll often need to combine sitemap parsing with actual crawling.

Method 2: SEO crawling tools (no coding required)

If you want to find all pages on a website without writing a single line of code, dedicated crawling tools are your best bet. They systematically follow every link on a site and build a URL inventory.

However, crawlers can't find "orphan pages" – pages with no internal links pointing to them. To discover orphan pages, cross-reference with Google Search Console, Analytics data, or parse your server access logs for Googlebot activity.
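
For example, assuming you've exported your indexed pages from Search Console or analytics into a one-column CSV (the filename and column name below are placeholders), a quick set difference against your crawl output surfaces orphan candidates:

import csv

def load_urls(path, column="URL"):
    """Read a single-column CSV of URLs into a set of normalized URLs."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[column].strip().rstrip("/") for row in csv.DictReader(f)}

crawled = load_urls("crawled.csv")             # output of your crawler (Method 4)
known = load_urls("search_console_pages.csv")  # hypothetical GSC/analytics export

orphans = known - crawled
print(f"{len(orphans)} potential orphan pages")
for url in sorted(orphans):
    print(url)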

Screaming Frog SEO Spider

Screaming Frog is one of the most widely adopted crawlers in the SEO industry. It's a desktop application that works on Windows, Mac, and Linux.

How to use it:

  1. Download and install from the official website.
  2. Enter the target URL in the search bar at the top.
  3. Click Start.
  4. The crawler will instantly begin following every internal link, cataloging every reachable URL (pages, resources, documents), and presenting them in a structured table.

Exporting your URLs:

Once the crawl completes, click the Export button on the Internal tab to download a CSV containing every discovered URL.

Limitations:

  • The free version caps crawling at 500 URLs.
  • For larger sites, you'll need a paid license (currently ~$279/year).
  • No built-in proxy rotation: because it runs on your desktop, it uses your local IP address. If you try to crawl a site with advanced anti-bot protection (like Cloudflare or Datadome), your IP will get blocked instantly.
  • Heavy on local RAM: JavaScript-rendered crawling requires launching headless Chrome instances on your machine, which can consume significant memory and CPU on large sites.


Method 3: Google search operators & no-code automation

Sometimes you don't need every URL, just a quick way to see every page on a website that Google knows about, or a specific subset of pages. Google search operators give you that instantly.

Google search operators (the quickest method)

Open Google Search and type:

site:decodo.com

This returns every page Google has indexed for that domain. The number shown under the search bar gives you an approximate count of indexed URLs.

Advanced operators to narrow your search:

Operator | What it does | Example
site: | Shows all indexed pages from a domain | site:example.com
site: inurl: | Finds pages with specific URL patterns | site:example.com inurl:blog
site: intitle: | Finds pages with specific words in the title | site:example.com intitle:pricing
site: filetype: | Finds specific file types | site:example.com filetype:pdf
site: -inurl: | Excludes pages with certain URL patterns | site:example.com -inurl:login

Useful combinations:

# Find all blog and content pages
site:decodo.com inurl:blog
# Find technical documentation and guides
site:decodo.com inurl:docs
# Find all scraping and proxy solution pages
site:decodo.com inurl:scraping OR inurl:proxies
# Find core pages (excluding login, legal, and registration)
site:decodo.com -inurl:login -inurl:legal -inurl:register

The limitation: Google only shows what it's indexed. Pages blocked by robots.txt, marked as noindex, or simply not yet discovered won't appear. This method also caps display results at around 400 URLs, so very large sites will be incomplete. To extract these URLs at scale without getting blocked by Google's CAPTCHAs, use a dedicated SERP Scraper API.

No-code automation with n8n

If you need to regularly monitor URLs or automate the collection process without coding, workflow automation tools like n8n can help.

Example workflow for sitemap monitoring:

  1. Schedule Trigger – runs the workflow daily or weekly
  2. HTTP Request node – fetches the sitemap XML
  3. XML node – parses XML to JSON and extracts all <loc> elements
  4. Item Lists node (Split Out) – splits the array into individual items (critical: without this, you get 1 row with all URLs)
  5. Google Sheets node – saves each URL as a separate row

If you don't want to build this from scratch, download the pre-built n8n JSON template and import it directly into your n8n workspace.

This approach lets you track URL changes over time – new pages added and old pages removed – without manual intervention.

What n8n can do:

  • Fetch and parse sitemaps on a schedule
  • Send alerts when new pages are added or when pages disappear
  • Export data to Google Sheets, Airtable, or databases

Note: n8n's XML-to-JSON node loads the entire sitemap into memory. For large sitemaps (50,000+ URLs), this can crash the workflow or cause it to time out. Split oversized sitemaps into batches or use a dedicated script (see Method 4).

Limitations of native n8n HTTP requests:

  • Struggle with JavaScript-heavy sites (SPAs built with React, Vue, or Angular)
  • Can be blocked by anti-bot protections
  • Limited control compared to custom scripts

The solution: if you hit these walls, pair n8n with a scraping API that handles JS rendering and anti-bot bypasses. Decodo offers a native n8n integration for this.


Use no-code when you need ongoing monitoring rather than a one-time crawl. Native n8n HTTP requests work well for sites with reliable sitemaps or simple HTML. For JS-heavy or protected sites, pair n8n with a scraping API that handles rendering and anti-bot protection.

Method 4: Building custom scripts (Python & Node.js)

When off-the-shelf tools hit their limits – the site is too large, too complex, or requires specific logic – it's time to build your own crawler. Keep in mind that building your own crawler also means you're responsible for managing server infrastructure, rotating proxies, and updating selectors whenever target sites change their structure.

Solution A: Sitemap extraction script

This script fetches a sitemap and extracts all URLs, handling regular sitemaps, sitemap index files, and gzip-compressed sitemaps (.xml.gz).

Required libraries:

pip install requests beautifulsoup4 lxml

Here’s the code:

import requests
from bs4 import BeautifulSoup
import csv
import sys
import gzip


def fetch_sitemap_urls(sitemap_url, all_urls=None, visited=None):
    """Recursively collect every <loc> URL from a sitemap or sitemap index."""
    if all_urls is None:
        all_urls = []
    if visited is None:
        visited = set()
    if sitemap_url in visited:
        return all_urls
    visited.add(sitemap_url)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
    }
    try:
        response = requests.get(sitemap_url, headers=headers, timeout=30)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error fetching {sitemap_url}: {e}")
        return all_urls
    content = response.content
    # Decompress .xml.gz payloads (requests already handles transparent gzip encoding)
    if (
        sitemap_url.endswith(".gz")
        or response.headers.get("Content-Encoding") == "gzip"
    ):
        try:
            content = gzip.decompress(content)
        except gzip.BadGzipFile:
            pass
    soup = BeautifulSoup(content, "lxml-xml")
    sitemap_tags = soup.find_all("sitemap")
    if sitemap_tags:
        # Sitemap index: recurse into each sub-sitemap
        print(f"Found sitemap index with {len(sitemap_tags)} sub-sitemaps")
        for sitemap in sitemap_tags:
            loc = sitemap.find("loc")
            if loc and loc.text:
                print(f"  Processing: {loc.text.strip()}")
                fetch_sitemap_urls(loc.text.strip(), all_urls, visited)
    else:
        # Regular sitemap: collect the page URLs
        url_tags = soup.find_all("url")
        for url in url_tags:
            loc = url.find("loc")
            if loc and loc.text:
                all_urls.append(loc.text.strip())
        print(f"Found {len(url_tags)} URLs in {sitemap_url}")
    return all_urls


def save_to_csv(urls, filename="urls.csv"):
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["URL"])
        for url in urls:
            writer.writerow([url])
    print(f"Saved {len(urls)} URLs to {filename}")


if __name__ == "__main__":
    args = sys.argv[1:]
    sitemap_url = args[0] if args else "https://example.com/sitemap.xml"
    output_file = args[1] if len(args) > 1 else "urls.csv"
    urls = fetch_sitemap_urls(sitemap_url)
    if urls:
        save_to_csv(urls, output_file)
        print(f"\nTotal URLs found: {len(urls)}")

What this script does:

  1. Fetches the sitemap XML with proper headers.
  2. Automatically decompresses gzip-compressed sitemaps (.xml.gz).
  3. Tracks visited sitemaps to prevent infinite loops from circular references.
  4. Detects sitemap index files and recursively processes sub-sitemaps.
  5. Saves everything to a CSV file.

Usage:

python sitemap_parser.py <sitemap_url> [output_file]

Examples:

python sitemap_parser.py https://example.com/sitemap.xml
python sitemap_parser.py https://example.com/sitemap.xml output.csv
python sitemap_parser.py https://example.com/sitemap.xml.gz compressed.csv

Alternative parsing libraries:

  • xmltodict – converts XML to Python dictionaries, which some developers find easier to work with than Beautiful Soup for pure XML parsing (see the sketch after this list).
  • ultimate-sitemap-parser – a specialized library that handles edge cases like compressed sitemaps, malformed XML, and automatic sitemap discovery from robots.txt.
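
Here's a rough sketch of the xmltodict approach, assuming a plain urlset sitemap (a sitemap index would need the same recursion as the script above):

import requests
import xmltodict

def sitemap_locs(sitemap_url):
    """Extract <loc> values from a regular (non-index) sitemap using xmltodict."""
    xml = requests.get(sitemap_url, timeout=30).content
    data = xmltodict.parse(xml)
    entries = data["urlset"]["url"]
    if isinstance(entries, dict):  # single-URL sitemaps parse to a dict, not a list
        entries = [entries]
    return [entry["loc"] for entry in entries]

print(sitemap_locs("https://example.com/sitemap.xml"))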

If you prefer JavaScript for sitemap parsing, here's a modern script using ESM imports and native fetch (Node.js 22+).

Required packages:

npm install cheerio

Here’s the code:

import { writeFileSync } from 'fs';
import { gunzipSync } from 'zlib';
import * as cheerio from 'cheerio';

const USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36';

// Fetch a URL and return its body as text, decompressing .gz payloads if needed
async function fetchUrl(url) {
  const response = await fetch(url, {
    headers: { 'User-Agent': USER_AGENT, 'Accept-Encoding': 'gzip, deflate' },
    signal: AbortSignal.timeout(30000)
  });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const buffer = await response.arrayBuffer();
  let content;
  if (url.endsWith('.gz') || response.headers.get('content-encoding') === 'gzip') {
    try {
      content = gunzipSync(Buffer.from(buffer)).toString('utf-8');
    } catch {
      content = Buffer.from(buffer).toString('utf-8');
    }
  } else {
    content = Buffer.from(buffer).toString('utf-8');
  }
  return content;
}

// Recursively walk sitemap indexes and collect every <loc> URL
async function fetchSitemapUrls(sitemapUrl, allUrls = [], visited = new Set()) {
  if (visited.has(sitemapUrl)) return allUrls;
  visited.add(sitemapUrl);
  let content;
  try {
    content = await fetchUrl(sitemapUrl);
  } catch (err) {
    console.log(`Error fetching ${sitemapUrl}: ${err.message}`);
    return allUrls;
  }
  const $ = cheerio.load(content, { xmlMode: true });
  const sitemaps = $('sitemap loc');
  if (sitemaps.length > 0) {
    console.log(`Found sitemap index with ${sitemaps.length} sub-sitemaps`);
    for (const el of sitemaps.toArray()) {
      const loc = $(el).text().trim();
      if (loc) {
        console.log(`  Processing: ${loc}`);
        await fetchSitemapUrls(loc, allUrls, visited);
      }
    }
  } else {
    const urls = $('url loc');
    urls.each((_, el) => {
      const loc = $(el).text().trim();
      if (loc) allUrls.push(loc);
    });
    console.log(`Found ${urls.length} URLs in ${sitemapUrl}`);
  }
  return allUrls;
}

function saveToCsv(urls, filename = 'urls.csv') {
  writeFileSync(filename, 'URL\n' + urls.join('\n'));
  console.log(`Saved ${urls.length} URLs to ${filename}`);
}

// CLI entry point (run with "type": "module" in package.json, or save as .mjs)
const args = process.argv.slice(2);
const url = args[0] || 'https://example.com/sitemap.xml';
const output = args[1] || 'urls.csv';
const urls = await fetchSitemapUrls(url);
if (urls.length > 0) saveToCsv(urls, output);
console.log(`\nTotal URLs found: ${urls.length}`);

export { fetchSitemapUrls };

Usage:

node sitemap_parser.js <sitemap_url> [output_file]

Examples:

node sitemap_parser.js https://example.com/sitemap.xml
node sitemap_parser.js https://example.com/sitemap.xml output.csv
node sitemap_parser.js https://example.com/sitemap.xml.gz compressed.csv

Note: for massive sitemaps (100k+ URLs), this in-memory approach may exceed Node's default heap limits. For enterprise production, refactor it to stream the response body (for example, via response.body.getReader()) and process chunks incrementally.

Solution B: Full site crawler with Scrapy

When a site doesn't have a sitemap, or the sitemap is incomplete, you need to crawl the site by following links. Scrapy is a powerful, production-ready framework that handles concurrency, rate limiting, and link extraction efficiently.

Required libraries:

pip install scrapy

Here’s the code:

import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urlparse
import csv
import time
import sys


def crawl_site(start_url, max_pages=500, output_file="crawled.csv"):
    found_urls = []

    class CollectorSpider(scrapy.Spider):
        name = "collector"

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.start_urls = [start_url]
            self.domain = urlparse(start_url).netloc
            self.visited = set()
            self.page_count = 0

        def parse(self, response):
            if self.page_count >= max_pages:
                return
            if response.url in self.visited:
                return
            self.visited.add(response.url)
            self.page_count += 1
            found_urls.append(response.url)
            print(f"[{self.page_count}/{max_pages}] {response.status} - {response.url}")
            # Follow every internal link, skipping assets and admin/cart paths
            for href in response.css("a::attr(href)").getall():
                if href and not href.startswith(("javascript:", "mailto:", "tel:", "#")):
                    full_url = response.urljoin(href)
                    parsed = urlparse(full_url)
                    if parsed.netloc != self.domain:
                        continue
                    skip_ext = [
                        ".jpg", ".jpeg", ".png", ".gif", ".pdf", ".css", ".js",
                        ".ico", ".svg", ".woff", ".woff2", ".ttf", ".mp4", ".webp",
                    ]
                    if any(parsed.path.lower().endswith(ext) for ext in skip_ext):
                        continue
                    skip_paths = ["/wp-admin", "/wp-includes", "/cart", "/checkout", "/api/"]
                    if any(skip in parsed.path.lower() for skip in skip_paths):
                        continue
                    if full_url not in self.visited and self.page_count < max_pages:
                        yield scrapy.Request(full_url, callback=self.parse)

    print(f"Starting crawl of {start_url}")
    print(f"Max pages: {max_pages}")
    print("-" * 50)
    start_time = time.time()
    process = CrawlerProcess(
        settings={
            "LOG_ENABLED": False,
            "USER_AGENT": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36",
            "CONCURRENT_REQUESTS": 16,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
            "DOWNLOAD_DELAY": 0.1,
            "ROBOTSTXT_OBEY": False,
            "COOKIES_ENABLED": False,
        }
    )
    process.crawl(CollectorSpider)
    process.start()
    elapsed = time.time() - start_time
    print("-" * 50)
    print(f"Crawl complete. Found {len(found_urls)} pages in {elapsed:.2f}s")
    if found_urls:
        with open(output_file, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["URL"])
            for url in sorted(found_urls):
                writer.writerow([url])
        print(f"Saved to {output_file}")
    return found_urls


if __name__ == "__main__":
    args = sys.argv[1:]
    url = args[0] if args else "https://example.com"
    max_pages = int(args[1]) if len(args) > 1 else 500
    output = args[2] if len(args) > 2 else "crawled.csv"
    crawl_site(url, max_pages, output)

Key features of this Scrapy crawler:

  • High concurrency – handles 16 simultaneous requests by default.
  • Built-in rate limiting – configurable download delay between requests.
  • Configurable robots.txt handling – set to False in this example for maximum discovery coverage.
  • Automatic link extraction – uses CSS selectors for fast parsing.
  • Domain filtering – only follows links within the same domain.
  • Smart filtering – skips images, CSS, JavaScript, and admin paths.

Usage:

python site_crawler.py <url> [max_pages] [output_file]

Examples:

python site_crawler.py https://example.com
python site_crawler.py https://example.com 100
python site_crawler.py https://example.com 500 output.csv

Why Scrapy over Requests? Scrapy is purpose-built for web crawling. It handles connection pooling, automatic retries, redirect following, and concurrent requests out of the box. For serious crawling projects, Scrapy is significantly faster and more robust than standard synchronous implementations using requests.

Note: Run this as a standalone script (e.g., python site_crawler.py), not as an imported module inside another Python file. Scrapy's internal event loop shuts down permanently after a crawl finishes, so calling crawl_site() a second time in the same process will throw an error.

Solution C: Fast JavaScript crawling with Crawlee

The Scrapy crawler above works great for traditional websites, but it can't see content rendered by JavaScript. For Single Page Applications (SPAs), you need a headless browser.

Crawlee is a modern crawling library that wraps Playwright with auto-scaling parallel execution. It can be several times faster than driving a single Playwright page because it runs multiple browser pages simultaneously.

Required libraries:

pip install 'crawlee[playwright]'
playwright install chromium

Here's the code:

import asyncio
from urllib.parse import urlparse
from crawlee.crawlers import PlaywrightCrawler
from crawlee import ConcurrencySettings
import csv
import time
import sys


async def crawl_js_site(start_url, max_pages=100, output_file="js_crawled.csv"):
    domain = urlparse(start_url).netloc
    found_urls = []
    visited = set()
    print(f"Starting JS crawl of {start_url}")
    print(f"Max pages: {max_pages}")
    print("-" * 50)
    start_time = time.time()
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=max_pages,
        concurrency_settings=ConcurrencySettings(min_concurrency=5, max_concurrency=20),
        headless=True,
        browser_type="chromium",
    )

    @crawler.router.default_handler
    async def handler(context):
        url = context.request.url
        parsed = urlparse(url)
        # Normalize trailing slashes so the same page isn't counted twice
        normalized = f"{parsed.scheme}://{parsed.netloc}{parsed.path.rstrip('/') or '/'}"
        if normalized in visited:
            return
        visited.add(normalized)
        found_urls.append(normalized)
        print(f"[{len(found_urls)}/{max_pages}] {url}")
        # Collect every link from the fully rendered page
        links = await context.page.eval_on_selector_all(
            "a[href]", "els => els.map(e => e.href)"
        )
        for link in links:
            try:
                p = urlparse(link)
                if p.netloc != domain:
                    continue
                skip = [
                    ".jpg", ".jpeg", ".png", ".gif", ".pdf", ".css", ".js",
                    ".ico", ".svg", ".mp4", ".webp", ".zip",
                ]
                if any(p.path.lower().endswith(ext) for ext in skip):
                    continue
                clean = f"{p.scheme}://{p.netloc}{p.path.rstrip('/') or '/'}"
                if clean not in visited and len(found_urls) < max_pages:
                    await context.add_requests([clean])
            except Exception:
                continue

    await crawler.run([start_url])
    elapsed = time.time() - start_time
    print("-" * 50)
    print(f"Crawl complete. Found {len(found_urls)} pages in {elapsed:.2f}s")
    print(f"Speed: {len(found_urls)/elapsed:.2f} pages/sec")
    if found_urls:
        with open(output_file, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["URL"])
            for url in sorted(found_urls):
                writer.writerow([url])
        print(f"Saved to {output_file}")
    return found_urls


if __name__ == "__main__":
    args = sys.argv[1:]
    url = args[0] if args else "https://example.com"
    max_pages = int(args[1]) if len(args) > 1 else 100
    output = args[2] if len(args) > 2 else "js_crawled.csv"
    asyncio.run(crawl_js_site(url, max_pages, output))

Usage:

python dynamic_crawler.py <url> [max_pages] [output_file]

Examples:

python dynamic_crawler.py https://react.dev
python dynamic_crawler.py https://vuejs.org 50
python dynamic_crawler.py https://nextjs.org 100 nextjs.csv

While Scrapy uses lightweight HTTP requests, Crawlee spins up entire Chrome browsers in the background.

Key differences between Scrapy and Crawlee:

Criteria | Scrapy | Crawlee
Speed | Fast (100s of pages/min) | Medium (10 to 50 pages/min)
JavaScript | Can't execute | Full JS support
Concurrency | Request-level | Browser page-level (auto-scaling)
Memory | Low (~50MB) | Medium (~100MB per parallel page)
Use case | Static HTML sites | SPAs, React, Vue, Angular

Default to Scrapy for less intensive tasks – it's the much faster solution. Switch to Crawlee only when you've confirmed that Scrapy returns empty or incomplete pages, which signals JavaScript-rendered content.

Tip: Running headless browsers locally is RAM-intensive and easily fingerprinted by anti-bot systems. If you'd rather skip infrastructure management, managed scraping APIs (like Decodo) can return fully rendered HTML via a single API call.

When to use custom scripts:

  • The site has no sitemap or an incomplete one
  • You need to crawl more than 500 pages (Screaming Frog's free limit)
  • You need custom logic (specific URL patterns, authentication, etc.)
  • You want to integrate URL discovery into a larger pipeline
  • You're building a recurring monitoring system

Addressing common challenges & advanced scalability

Real-world URL discovery rarely goes smoothly. As you scale from crawling small sites to enterprise-level projects, you'll encounter predictable obstacles. Here's how to overcome each one.

Challenge 1: Missing sitemaps

Problem: many websites don't have a sitemap at all, or their sitemap is severely outdated and missing hundreds of pages.

Solution: implement a web crawling script that starts at the homepage and discovers links dynamically. Rather than relying on what the site owner has declared, you follow every internal link to build a complete picture.

This is exactly what the Python crawler in Method 4 does. It:

  1. Starts at a seed URL (usually the homepage).
  2. Parses all <a> tags to find internal links.
  3. Adds new links to a queue.
  4. Repeats until all discoverable pages are visited.
# The core discovery loop (pseudo-code)
while urls_to_visit:
    url = urls_to_visit.pop()
    html = fetch(url)
    new_links = extract_internal_links(html)
    urls_to_visit.extend(new_links)
    visited.add(url)

Scaling tip: for large sites, use the concurrent crawling approach with ThreadPoolExecutor (Python) or Promise.all (Node.js) to crawl multiple pages simultaneously while respecting rate limits.
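
Here's a minimal sketch of that pattern with Python's ThreadPoolExecutor – swap the fetch function for your own fetching and link-extraction logic:

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(url):
    """Fetch one page; returns (url, html) or (url, None) on failure."""
    try:
        r = requests.get(url, timeout=15, headers={"User-Agent": "Mozilla/5.0"})
        return url, r.text if r.ok else None
    except requests.RequestException:
        return url, None

def fetch_batch(urls, max_workers=10):
    """Crawl a batch of URLs with a bounded thread pool to cap concurrency."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for future in as_completed(futures):
            url, html = future.result()
            results[url] = html
    return results

pages = fetch_batch(["https://example.com/", "https://example.com/about"])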

Challenge 2: JavaScript-heavy websites (SPAs)

Problem: plain HTTP libraries like requests (Python) or axios (Node.js) only fetch raw HTML. They can't execute JavaScript. This means Single Page Applications built with React, Vue, or Angular appear nearly empty.

When you fetch a typical React app with standard requests, you get this blank shell:

<div id="root"></div>
<script src="/bundle.js"></script>

The navigation, product listings, blog posts – everything – is generated by JavaScript that standard scrapers never execute.

Solution 1: Crawlee with Playwright. As detailed in Method 4 (Solution C), you can use a headless browser framework like Crawlee to actually render the JavaScript before extracting the links. Trade-off: this requires managing heavy server infrastructure and high RAM usage.

Solution 2: Scraping APIs with JavaScript rendering (recommended). For large-scale crawling without managing infrastructure, scraping APIs offer a simpler approach. Instead of managing browser instances yourself, you make a single API call with a JS rendering parameter:

import requests

url = "https://scraper-api.decodo.com/v2/scrape"
payload = {
    "url": "https://react.dev/",
    "headless": "html",  # Execute JavaScript before returning HTML
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": "Basic AUTH_TOKEN",
}

# Using Decodo's API with JS rendering
response = requests.post(url, json=payload, headers=headers)
print(response.text)  # Fully rendered HTML containing all URLs

This approach eliminates browser management entirely – the API renders JavaScript server-side and returns the final HTML to your script.

Challenge 3: Anti-bot blocking & rate limits

Problem: websites protect themselves from aggressive crawling. Making too many requests too quickly triggers anti-bot systems, resulting in:

  • 403 Forbidden – you've been identified as a bot (Cloudflare, Akamai, DataDome)
  • 429 Too Many Requests – you've exceeded the rate limit
  • CAPTCHA challenges – the site wants to verify you're human
  • IP bans – your IP address is blocked entirely

Solution 1: Basic countermeasures (mimic a real browser fingerprint)

Start by mimicking real user traffic. Anti-bot systems immediately flag requests missing standard browser headers:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Sec-Ch-Ua': '"Google Chrome";v="145", "Chromium";v="145", "Not.A/Brand";v="24"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"'
}

Solution 2: Managed proxy & scraping services 

Headers only get you so far. For serious crawling projects, you need IP rotation to prevent bans. Managed services like Decodo handle proxy rotation, automatic retries, and CAPTCHA bypassing so you don't have to build that infrastructure yourself.
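
If you wire up a rotating proxy gateway yourself, the request-level change is small. The snippet below uses a placeholder gateway address and credentials – substitute your provider's actual endpoint:

import requests

# Hypothetical rotating-proxy gateway – replace host, port, and credentials
# with the values from your proxy provider's dashboard.
PROXY = "http://username:password@gate.example-proxy.com:7000"

response = requests.get(
    "https://example.com/",
    proxies={"http": PROXY, "https": PROXY},
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=30,
)
print(response.status_code)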

Scaling considerations

As your crawling needs grow, consider these architectural patterns:

Scale | Approach | Recommended stack
< 1,000 pages | Single-threaded | Python requests + Beautiful Soup
1,000 to 10,000 pages | Concurrent crawling | Python asyncio or Node.js Promise.all
10,000 to 100,000 pages | Distributed queue | Celery, Redis queue, multiple workers
100,000+ pages | Managed infrastructure | Decodo Web Scraping API, serverless functions

Key scaling principles:

  1. Control request concurrency – hitting a server with 500 requests per second is the fastest way to get your IP banned. Implement limits to crawl politely without triggering DDoS protection.
  2. Implement exponential backoff – when you hit a 429 error, wait progressively longer before retrying (e.g., 2s, 4s, 8s); see the sketch after this list.
  3. Cache aggressively – store HTML responses locally in Redis or S3 so you don't re-crawl identical pages unnecessarily.
  4. Monitor error rates – track success rates by domain. If your 403s spike, your current proxy pool has been burned and needs rotation.
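
As a minimal sketch of the backoff principle above, wrapping a single fetch in retries with exponentially growing waits:

import time
import requests

def fetch_with_backoff(url, max_retries=5):
    """Retry on 429/5xx responses, waiting 2s, 4s, 8s, ... between attempts."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code not in (429, 500, 502, 503):
            return response
        wait = 2 ** (attempt + 1)
        print(f"Got {response.status_code}, retrying in {wait}s...")
        time.sleep(wait)
    return response  # last response if all retries were exhausted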

What to do after collecting URLs

Now that you have a comprehensive map of your target website, the raw URL data is your baseline. Here's how enterprise teams operationalize this data:

1. Large-scale SEO audits

Feed your URL list into an auditing pipeline to check status codes (404s, 301 redirects) and extract metadata at scale. Finding orphaned pages or missing canonical tags is impossible without a complete URL baseline.
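
As a starting point, here's a small sketch that reads the urls.csv produced in Method 4 and records each URL's status code (add concurrency from the scaling section for large lists):

import csv
import requests

def audit_status_codes(input_file="urls.csv", output_file="audit.csv"):
    """Check the HTTP status of every URL in a one-column CSV."""
    with open(input_file, newline="", encoding="utf-8") as f:
        urls = [row["URL"] for row in csv.DictReader(f)]
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["URL", "Status"])
        for url in urls:
            try:
                # allow_redirects=False keeps 301/302 codes visible in the report
                r = requests.get(url, timeout=15, allow_redirects=False,
                                 headers={"User-Agent": "Mozilla/5.0"})
                writer.writerow([url, r.status_code])
            except requests.RequestException as e:
                writer.writerow([url, f"error: {e}"])

audit_status_codes()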

2. LLM & AI training data

Raw URLs are the starting point for building proprietary datasets. Feed your URL list into a headless scraper to extract the raw text and HTML, which is then used for AI and LLM training and custom knowledge bases.

3. Competitor content gap analysis

By mapping a competitor's entire URL structure, you can reverse-engineer their content strategy. Categorize their URLs by subfolder (e.g., /blog/, /features/, /integrations/) to discover which content pillars they're investing in heavily, and where your own site is lacking.

4. Flawless site migrations

When redesigning an enterprise website, a single missed redirect can destroy years of SEO value. Your scraped URL list acts as the master checklist to ensure every old page is successfully mapped to its new destination.

Final thoughts

Finding every URL on a website doesn't have to be a manual nightmare. Whether you are using a simple Google search operator or building a massive Scrapy pipeline, the key is matching the right method to the scale of your project.

If you're scaling beyond what scripts and free tools can handle, Decodo's Web Scraping API can take the infrastructure burden off your plate. Try it free and see how it fits your workflow.



About the author

Justinas Tamasevicius

Head of Engineering

Justinas Tamaševičius is Head of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.


Connect with Justinas via LinkedIn.

All information on Decodo Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.


Frequently asked questions

What if the website blocks me?

If you're getting 403 or 429 errors, start by slowing down your request rate and adding realistic browser headers. For persistent blocks, know that modern anti-bot systems like Cloudflare also check your TLS fingerprint (JA3/JA4) – if your User-Agent says "Chrome" but your TLS handshake looks like a Python library, you'll still be blocked regardless of headers. At that point, managed scraping services with residential proxy rotation become necessary.

Can I find all URLs without writing code?

Yes. Google search operators (like site:example.com) are the fastest way to see indexed pages. For a visual interface, tools like Screaming Frog work well for small sites (under 500 URLs).

How do I find hidden pages?

"Hidden" or "orphan" pages (those with no internal links) can't be discovered by standard crawling because no "path" exists for the bot to follow.

To find them, look outside the HTML:

  • Google Search Console/Analytics. The best source for finding "live" pages receiving traffic but missing from internal navigation.
  • Wayback Machine. Reveals historical URLs that may still be live but were removed from the current site's links.
  • Sitemap cross-referencing. Compare crawl results against sitemap.xml to find URLs the bot missed.
  • Server logs. If you own the site, Nginx/Apache logs are the "source of truth" for bot hits.

What's the difference between crawling and scraping?

Crawling is the process of discovery, mapping the architecture, and finding what URLs exist on a site. Scraping is the process of extraction, pulling specific data (prices, text, images) from those discovered URLs.

How often should I re-crawl a site?

It depends on the site's "freshness" requirements. News sites require daily or hourly crawls, while a corporate site might only change monthly. For eCommerce, weekly crawls are recommended to detect new product launches or price changes. Regular monitoring ensures your URL list stays accurate as the site's architecture evolves.
