
How to Scrape Indeed for Job Data: A Comprehensive Guide

Indeed hosts millions of job listings across industries and locations, making it a valuable data source for analysts, recruiters, data engineers, and founders who need real-time job intelligence. Scraping job data is challenging because sites change and anti-bot defenses evolve. This guide walks you through a resilient, modern approach that works reliably today – and scales when you need it to.

Zilvinas Tamulis

Sep 12, 2025

13 min read

What data can you extract?

Indeed is a popular job search platform operating in over 60 countries, with 615M+ job-seeker profiles and 3.3M+ employers, and it facilitates roughly 27 hires per minute. It offers various job types across country-specific sites, making its dataset a widely used source for labor-market analysis.

Standard Indeed job scraping yields the essentials:

  • Job titles, company data, locations
  • Posting timestamps, job URLs/IDs
  • Descriptions, benefits, and salary ranges where disclosed
  • Job type (full-time, part-time, or contract)

Why it matters – data engineers use this to build real-time job intelligence pipelines. Analysts track hiring velocity across tech stacks and geographies. Founders monitor competitor hiring patterns to spot market opportunities.

Now that you know what data you can collect, let's understand how Indeed's website is structured and how that affects our approach to scraping.

Understanding Indeed's data architecture

Indeed organizes job information in a consistent structure that allows efficient extraction once you understand the moving parts.

How Indeed search works

Indeed constructs search URLs with stable parameters that you can modify programmatically. A basic search looks like:

https://www.indeed.com/jobs?q=data+analyst&l=Chicago%2C+IL

Key parameters you’ll use:

  • q – query keywords (for example, data analyst)
  • l – location (for example, Chicago, IL; use remote for remote roles)
  • start – pagination offset in increments of 10 (0, 10, 20, …)
  • sort=date – newest results first
  • fromage – posting age filter (for example, 1 = last 24 hours)
  • radius – distance from the location center (for example, 100 = within 100 miles)

Regional domains include:

  • USA – www.indeed.com
  • Canada – ca.indeed.com
  • UK – uk.indeed.com
  • Australia – au.indeed.com

Others vary by country.
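To put these pieces together programmatically, here's a minimal sketch that assembles a search URL from a region code and the parameters above. The build_search_url helper is purely illustrative, but the parameter names match Indeed's public URL scheme:

from urllib.parse import urlencode

def build_search_url(query, location, region="www", start=0, fromage=None, radius=None):
    """Assemble an Indeed search URL from the documented query parameters."""
    params = {"q": query, "l": location, "start": start, "sort": "date"}
    if fromage is not None:
        params["fromage"] = fromage  # posting age filter, e.g. 1 = last 24 hours
    if radius is not None:
        params["radius"] = radius  # distance from the location center
    return f"https://{region}.indeed.com/jobs?{urlencode(params)}"

# Example: second page of fresh data analyst roles near Chicago on the US site
print(build_search_url("data analyst", "Chicago, IL", start=10, fromage=1, radius=25))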

A reliable shortcut – embedded JSON beats brittle HTML

Scraping Indeed is challenging due to its dynamic, JavaScript-rendered content and complex HTML structure, which can be difficult to navigate reliably. Targeting embedded JSON data offers a more stable and efficient alternative to parsing the DOM. Rather than maintaining many CSS selectors, parse the structured payload that Indeed injects into the page. The most useful data appears under:

window.mosaic.providerData["mosaic-provider-jobcards"]

This JSON contains the job listings before HTML rendering and typically offers:

  • A more stable structure than the rendered DOM
  • Complete listing fields in one place
  • Faster extraction without deep DOM parsing

To locate this data:

Step 1. Open an Indeed search results page:


Step 2. Open your browser developer tools.

Step 3. Go to the Network tab, refresh the page to populate results, then filter by Doc:

Step 4. Select the main HTML response.

Step 5. Search for window.mosaic.providerData["mosaic-provider-jobcards"] in the response to view the embedded JSON:

Inside you’ll find results (an array of job records):

The hierarchy commonly looks like:

window.mosaic.providerData["mosaic-provider-jobcards"]
└── metaData
    └── mosaicProviderJobCardsModel
        └── results  # array of job objects
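Once the assignment is captured and parsed with json.loads, the results array is reached by walking that same path. Here's a minimal sketch; raw_json is a stand-in for the JSON object you copied from the page source:

import json

# raw_json stands in for the object assigned to window.mosaic.providerData["mosaic-provider-jobcards"]
raw_json = '{"metaData": {"mosaicProviderJobCardsModel": {"results": []}}}'
data = json.loads(raw_json)
results = data.get("metaData", {}).get("mosaicProviderJobCardsModel", {}).get("results", [])
for job in results[:3]:
    print(job.get("title"), "-", job.get("company"))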

With Indeed’s URL patterns and JSON structure in mind, we can now cover common scraping challenges.

Common anti-scraping challenges

Indeed employs multiple layers of bot mitigation to flag and block automated traffic.

  • CAPTCHA and behavioral detection. Cloudflare Turnstile or sign-in gates appear when traffic shows automated patterns such as high request rates, headless browser defaults, or mismatched geo.
  • Rate limiting and IP blocks. Excessive bursts from a single IP trigger throttling or temporary bans.
  • Browser fingerprinting. TLS signatures and JavaScript execution patterns that don’t resemble human browsing are flagged quickly.
  • Login walls. After rapid navigation, you may be required to create an account or sign in.

With the challenges in mind, let’s pick the right tool for the job.

Choosing the right technology stack

Indeed is a high-friction target: basic HTTP clients such as Requests often hit CAPTCHAs or login walls before any job data appears, and vanilla Playwright, Puppeteer, or Selenium setups are easily fingerprinted. We'll use SeleniumBase (rather than plain Selenium) for its built-in antidetect capabilities:

  • Stealth modes (UC/CDP) for high-friction sites.
  • UC mode (based on undetected-chromedriver) to minimize detection.
  • User-agent handling and Chromium flags; optional selenium-stealth integration for deeper fingerprint masking.
  • Antidetect helpers (disconnect/reconnect flows, incognito mode, CAPTCHA helpers).
  • Simplified API for common actions.

For targets like Indeed, these features strengthen stealth and enable more human-like interaction patterns. Pairing SeleniumBase with disciplined retry/backoff patterns helps bypass CAPTCHAs and provides a durable baseline for collection.
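Before building the full scraper, a minimal UC-mode session is a useful sanity check. This sketch simply opens one Indeed search page in undetected-Chrome mode and confirms that HTML comes back; the search URL is just an example:

from seleniumbase import SB

with SB(uc=True, headless=False) as sb:
    # Open the page through UC mode's reconnect flow to reduce detection
    sb.uc_open_with_reconnect("https://www.indeed.com/jobs?q=data+analyst&l=remote")
    sb.sleep(5)  # give the job cards time to render
    html = sb.get_page_source()
    print(f"Retrieved {len(html)} characters of HTML")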

We’ll cover additional techniques for scaling Indeed job scraping later in this guide. For a deeper dive into bypass strategies, see our pro tips on navigating anti-bot systems.

Building an Indeed scraper – step-by-step implementation

Let’s build a scraper that works reliably on Indeed’s listings – step by step.

Step 1 – installation and setup

Make sure you have a recent version of Python 3 installed. If not, download it from the official Python website. Use a virtual environment to isolate dependencies:

# Create project directory
mkdir indeed_scraper
cd indeed_scraper
# Set up virtual environment
py -m venv venv # Windows
python3 -m venv venv # macOS/Linux
# Activate environment
venv\Scripts\activate.bat # Windows (CMD)
source venv/bin/activate # macOS/Linux
# Install required packages
pip install seleniumbase inscriptis

The inscriptis library converts HTML, such as the job description snippets Indeed embeds in each listing, into clean plain text.
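For example, a bullet-list snippet is flattened into readable text like this (exact whitespace may vary slightly between inscriptis versions):

from inscriptis import get_text

snippet = "<ul><li>Strong SQL skills</li><li>Experience with Tableau</li></ul>"
print(get_text(snippet).strip())
# Prints the list items as plain text bullets, roughly:
#   * Strong SQL skills
#   * Experience with Tableau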

Create an indeed_scraper.py file in your editor, and you're ready to build.

Step 2 – import dependencies

Import the libraries needed for browser control, text parsing, and data handling:

import json
import re
from urllib.parse import quote_plus
from typing import List, Dict, Optional
import random
from seleniumbase import SB
from inscriptis import get_text

Here, json and re handle JSON parsing and regular expressions, urllib.parse provides URL encoding, typing hints keep the code readable, random generates human-like delays, SeleniumBase drives the browser, and inscriptis extracts clean text from HTML snippets.

Step 3 – class initialization

Set up the scraper class with configuration, methods, and regex patterns:

class IndeedJobScraperSB:
    """Indeed job scraper with antidetect capabilities."""

    MOSAIC_PATTERN = (
        r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*({.*?});'
    )

    def __init__(self, region: str = "www") -> None:
        self.region = region
        self.base_url = f"https://{region}.indeed.com/jobs"

The MOSAIC_PATTERN regex constant captures the embedded job data, while the region parameter builds the correct base URL, such as ca.indeed.com or uk.indeed.com.

Step 4 – set up the scrape

Orchestrate the main scraping loop to construct search URLs, iterate through pages, and collect results:

def scrape_jobs(
    self, role: str, location: str, radius: int = 25, max_pages: int = 3
) -> List[Dict]:
    all_jobs = []
    print(f"Scraping {max_pages} pages for '{role}' in '{location}'...")
    with SB(uc=True, headless=False, test=True) as sb:
        for page_num in range(max_pages):
            url = f"{self.base_url}?q={quote_plus(role)}&l={quote_plus(location)}&radius={radius}&start={page_num * 10}"
            print(f"Page {page_num + 1}/{max_pages}...", end=" ")
            page_jobs = self._scrape_single_page(sb, url)
            if page_jobs:
                print(f"found {len(page_jobs)} jobs")
                all_jobs.extend(page_jobs)
            else:
                print("no jobs")
            if page_num < max_pages - 1:
                sb.sleep(random.uniform(5, 10))
    # Deduplicate jobs that appear across multiple pages
    seen_keys = set()
    unique_jobs = []
    for job in all_jobs:
        key = job.get("job_key")
        if key and key not in seen_keys:
            seen_keys.add(key)
            unique_jobs.append(job)
    return unique_jobs

Use uc=True for undetected Chrome mode, headless=False for debugging visibility, and quote_plus for URL encoding. The start parameter handles pagination (0, 10, 20...), while built-in deduplication filters overlapping results.

Step 5 – anti-detection and retry logic

Build resilience with retry logic, exponential backoff, and human-like delays:

def _scrape_single_page(self, sb, url: str, max_retries: int = 3) -> List[Dict]:
    for attempt in range(max_retries):
        try:
            sb.uc_open_with_reconnect(url)
            sb.sleep(random.uniform(3, 7) + attempt * 2)
            sb.wait_for_element_visible("[data-jk]", timeout=15)
            mosaic_data = self._extract_mosaic_data(sb.get_page_source())
            return self._process_job_listings(mosaic_data) if mosaic_data else []
        except Exception:
            if attempt == max_retries - 1:
                return []
            sb.sleep(random.uniform(2**attempt, 2 ** (attempt + 1)))
    return []

Combine exponential backoff with randomized delays, waiting for [data-jk] elements to load, and handling failures gracefully. Return empty results instead of crashing when all retries are exhausted.

Step 6 – extract job data

Target Indeed's embedded JSON data and extract the job listings:

def _extract_mosaic_data(self, page_html: str) -> Optional[Dict]:
    match = re.search(self.MOSAIC_PATTERN, page_html, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(1))
            return data.get("metaData", {}).get("mosaicProviderJobCardsModel")
        except json.JSONDecodeError:
            return None
    return None

Use regex to find the JavaScript variable containing job data, then safely navigate Indeed's nested structure with chained .get() methods. Handle JSON parsing errors gracefully without crashing.

Step 7 – processing and normalization

Transform Indeed's raw data into clean, structured job objects:

def _process_job_listings(self, mosaic_data: Dict) -> List[Dict]:
    jobs = []
    for job in mosaic_data.get("results", []):
        if job.get("jobkey") and job.get("title") and job.get("company"):
            jobs.append(
                {
                    "job_key": job["jobkey"],
                    "title": job["title"],
                    "company": job["company"],
                    "location": job.get("formattedLocation", ""),
                    "description_snippet": (
                        get_text(job.get("snippet", "")).strip()
                        if job.get("snippet")
                        else ""
                    ),
                    "job_url": (
                        f"https://{self.region}.indeed.com{job['viewJobLink']}"
                        if job.get("viewJobLink")
                        else ""
                    ),
                    "relative_time": job.get("formattedRelativeTime"),
                }
            )
    return jobs

Filter for essential fields, clean HTML snippets with inscriptis, and build complete URLs.

Step 8 – save results

After scraping, results are saved as JSON.

def save_jobs(self, jobs: List[Dict], filename: str) -> None:
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(jobs, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(jobs)} jobs to {filename}")

Step 9 – usage example

This implementation accepts flexible parameters, including region codes (e.g., in, ca, au), job keywords (e.g., content manager, AI engineer), location targeting (e.g., city, state, or remote positions), radius, and page limits. Since each page typically yields around 15 job listings, the max_pages parameter controls the scope of your extraction.

def main():
    scraper = IndeedJobScraperSB(region="ca")
    jobs = scraper.scrape_jobs("data analyst", "toronto, on", radius=100, max_pages=5)
    if jobs:
        print(f"\nTotal: {len(jobs)} unique jobs")
        scraper.save_jobs(jobs, "jobs.json")
    else:
        print("\nNo jobs found")

if __name__ == "__main__":
    main()

Complete code

Here's the complete implementation combining all the steps:

import json
import re
from urllib.parse import quote_plus
from typing import List, Dict, Optional
import random
from seleniumbase import SB
from inscriptis import get_text


class IndeedJobScraperSB:
    """Indeed job scraper with anti-detection capabilities."""

    # Regex to extract Indeed's embedded job data from JavaScript
    MOSAIC_PATTERN = (
        r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*({.*?});'
    )

    def __init__(self, region: str = "www") -> None:
        self.region = region
        self.base_url = f"https://{region}.indeed.com/jobs"

    def scrape_jobs(
        self, role: str, location: str, radius: int = 25, max_pages: int = 3
    ) -> List[Dict]:
        all_jobs = []
        print(f"Scraping {max_pages} pages for '{role}' in '{location}'...")
        with SB(uc=True, headless=False, test=True) as sb:
            for page_num in range(max_pages):
                url = f"{self.base_url}?q={quote_plus(role)}&l={quote_plus(location)}&radius={radius}&start={page_num * 10}"
                print(f"Page {page_num + 1}/{max_pages}...", end=" ")
                page_jobs = self._scrape_single_page(sb, url)
                if page_jobs:
                    print(f"found {len(page_jobs)} jobs")
                    all_jobs.extend(page_jobs)
                else:
                    print("no jobs")
                if page_num < max_pages - 1:
                    sb.sleep(random.uniform(5, 10))
        # Deduplicate jobs that appear across multiple pages
        seen_keys = set()
        unique_jobs = []
        for job in all_jobs:
            key = job.get("job_key")
            if key and key not in seen_keys:
                seen_keys.add(key)
                unique_jobs.append(job)
        return unique_jobs

    def _scrape_single_page(self, sb, url: str, max_retries: int = 3) -> List[Dict]:
        for attempt in range(max_retries):
            try:
                sb.uc_open_with_reconnect(url)
                sb.sleep(random.uniform(3, 7) + attempt * 2)
                sb.wait_for_element_visible("[data-jk]", timeout=15)
                mosaic_data = self._extract_mosaic_data(sb.get_page_source())
                return self._process_job_listings(mosaic_data) if mosaic_data else []
            except Exception:
                if attempt == max_retries - 1:
                    return []
                sb.sleep(random.uniform(2**attempt, 2 ** (attempt + 1)))
        return []

    def _extract_mosaic_data(self, page_html: str) -> Optional[Dict]:
        match = re.search(self.MOSAIC_PATTERN, page_html, re.DOTALL)
        if match:
            try:
                data = json.loads(match.group(1))
                return data.get("metaData", {}).get("mosaicProviderJobCardsModel")
            except json.JSONDecodeError:
                return None
        return None

    def _process_job_listings(self, mosaic_data: Dict) -> List[Dict]:
        jobs = []
        for job in mosaic_data.get("results", []):
            if job.get("jobkey") and job.get("title") and job.get("company"):
                jobs.append(
                    {
                        "job_key": job["jobkey"],
                        "title": job["title"],
                        "company": job["company"],
                        "location": job.get("formattedLocation", ""),
                        "description_snippet": (
                            get_text(job.get("snippet", "")).strip()
                            if job.get("snippet")
                            else ""
                        ),
                        "job_url": (
                            f"https://{self.region}.indeed.com{job['viewJobLink']}"
                            if job.get("viewJobLink")
                            else ""
                        ),
                        "relative_time": job.get("formattedRelativeTime"),
                    }
                )
        return jobs

    def save_jobs(self, jobs: List[Dict], filename: str) -> None:
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(jobs, f, indent=2, ensure_ascii=False)
        print(f"Saved {len(jobs)} jobs to {filename}")


def main():
    scraper = IndeedJobScraperSB(region="www")
    jobs = scraper.scrape_jobs("data analyst", "remote", radius=100, max_pages=15)
    if jobs:
        print(f"\nTotal: {len(jobs)} unique jobs")
        scraper.save_jobs(jobs, "jobs.json")
    else:
        print("\nNo jobs found")


if __name__ == "__main__":
    main()

Run the code using python indeed_scraper.py, and you'll see the browser open and start scraping jobs from multiple pages. It saves the JSON data to your project directory, looking like this:

[
  {
    "job_key": "deadbeefcafebabe",
    "title": "Data Analyst",
    "company": "Acme Health Systems",
    "location": "Northbridge, ON A1B 2C3",
    "description_snippet": "• Strong skills in SQL, Excel, and other data manipulation tools.\n• Proficiency in data visualization tools such as Power BI or Tableau.",
    "job_url": "https://ca.indeed.com/viewjob?jk=deadbeefcafebabe",
    "relative_time": "9 days ago"
  },
  {
    "job_key": "baddc0ffee0dd00d",
    "title": "Data Analyst, Metrics",
    "company": "Nimbus Mobility",
    "location": "Remote (Canada)",
    "description_snippet": "• Demonstrated data analysis and problem-solving skills.\n• Continuously improve data processes and analytical methodologies.",
    "job_url": "https://ca.indeed.com/viewjob?jk=baddc0ffee0dd00d",
    "relative_time": "30+ days ago"
  }
]

You've built a functional Indeed scraper that navigates result pages and collects structured data. Next, we'll explore advanced techniques to make the scraper more resilient and harder to detect.

Scaling up scraping operations

The basic scraper works, but scaling up quickly collides with Indeed's anti-bot defenses. If you're only fetching a handful of pages, your script is fine. However, when you need to collect hundreds of thousands of job listings across multiple locations on a regular basis, a more robust setup becomes essential. Here's how to prepare for large-scale scraping.

Use proxies

Proxies are the foundation of large-scale scraping. By routing traffic through rotating residential proxies, requests appear to originate from real devices worldwide. This distributes requests across diverse user footprints, reducing automation signals.

Key advantages:

  • Rotation – IPs change automatically per request or session, minimizing detection patterns.
  • Geographic flexibility – match IPs to the target’s Indeed domains.
  • Anonymity and stability – mask your origin while maintaining high success rates.

Not all proxies deliver. Some are slow, easily blocked, or insecure. Before deploying at scale, test proxies to confirm performance and reliability.

For production workloads, proxy rotation is critical. Premium providers like Decodo offer 115M+ residential IPs with geo-targeting and automatic rotation. Setup takes only a few steps via Decodo’s residential proxies quick-start guide or this 2-minute video. Once configured, route your scraper through proxies aligned to the Indeed domains you’re targeting.
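SeleniumBase accepts a proxy string directly, so routing the scraper through a residential endpoint is a small change. The host, port, and credentials below are placeholders; substitute the values from your own proxy dashboard:

from seleniumbase import SB

# Placeholder endpoint and credentials - replace with your own proxy details
PROXY = "username:password@gate.example-proxy.com:7000"

with SB(uc=True, headless=False, proxy=PROXY) as sb:
    sb.uc_open_with_reconnect("https://ca.indeed.com/jobs?q=data+analyst&l=toronto%2C+on")
    sb.sleep(5)
    print(f"Fetched {len(sb.get_page_source())} characters through the proxy")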

Use a Web Scraping API for tough cases

Even with proxies and careful throttling, tough targets like Indeed can trigger CAPTCHAs, rate limits, or fingerprinting. When this happens, it’s often more efficient to use a Web Scraping API.

Decodo’s Web Scraping API abstracts away the hardest parts of scaling:

  • Automatic proxy rotation
  • Built-in CAPTCHA solving
  • JavaScript rendering
  • Retries on failures

With a single API call, you can reliably fetch an Indeed page. The API shifts the heavy lifting – including headless browsers, IP pools, and concurrency – into the cloud, allowing you to focus on data rather than infrastructure. It also uses 100% success-based billing – you only pay for successful scrapes.

For developers, the API includes code examples, monitoring dashboards, and structured outputs. Setup requires only a few steps through the Web Scraping API quick-start guide.
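As an illustration only, a scraping-API request usually reduces to a single authenticated POST. The endpoint, parameter names, and token below are placeholders, so follow the quick-start guide for the exact request format:

import requests

# Placeholder endpoint and token - consult the quick-start guide for real values
API_ENDPOINT = "https://scraping-api.example.com/v1/scrape"
API_TOKEN = "YOUR_API_TOKEN"

payload = {
    "url": "https://www.indeed.com/jobs?q=data+analyst&l=remote",
    "render_js": True,  # hypothetical option requesting a fully rendered page
}
response = requests.post(
    API_ENDPOINT,
    json=payload,
    headers={"Authorization": f"Basic {API_TOKEN}"},
    timeout=60,
)
response.raise_for_status()
print(response.text[:500])  # first 500 characters of the returned HTML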

Conclusion

This guide walked through building an Indeed scraper step by step and showed how to scale from small projects to production workloads. Reliable large-scale scraping combines two core strategies: starting with residential proxies to distribute requests, and transitioning to a Web Scraping API when challenges become too complex. Together, these tools provide the reliability, coverage, and scale required for enterprise-grade data collection.

Further reading

To continue learning and stay current, explore these resources:

  • Tutorials and blogs. Explore Decodo’s articles for guides on scraping techniques, scaling data collection, AI-driven workflows, and strategies for bypassing anti-bot systems.
  • AI for scraping. Discover how Claude and ChatGPT enhance productivity throughout your scraping workflow.
  • MCP integration. Explore the Decodo MCP Web Scraper to see how Model Context Protocol (MCP) connects with tools like Cursor, VS Code, and Claude Desktop to extend scraping workflows.
  • Comparison of providers. If you’re evaluating vendors, review Decodo’s web scraping services comparison to make an informed choice.
  • Best practices. Follow web scraping best practices to ensure efficiency, compliance, and reliability.


About the author

Zilvinas Tamulis

Technical Copywriter

A technical writer with over 4 years of experience, Žilvinas blends his studies in Multimedia & Computer Design with practical expertise in creating user manuals, guides, and technical documentation. His work includes developing web projects used by hundreds daily, drawing from hands-on experience with JavaScript, PHP, and Python.


Connect with Žilvinas via LinkedIn

All information on Decodo Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

What is Indeed scraping?

Indeed scraping is the process of extracting job listings and related data from the Indeed website using automated tools. Developers often use techniques like web scraping with Python to collect job postings for analysis, aggregation, or the development of job search tools. It’s a practical way to access real-time job market data without manual searching.

How to scrape Indeed without getting blocked?

To scrape Indeed without getting blocked, use rotating proxies to distribute your requests and mimic real user behavior by adding delays and randomizing user agents. By avoiding overwhelming the server with rapid requests, your scraper can gather Indeed job postings reliably while staying undetected.

Can I scrape multiple locations at once?

Yes, you can scrape multiple locations at once by iterating over location parameters in your scraping script. For example, you can loop through a list of cities or postal codes and append them to the search URL, allowing you to web scrape Indeed job postings across different regions in a single run.
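Using the scraper class built earlier in this guide, a multi-location run is a simple loop; the city list here is just an example:

# Example only: reuses the IndeedJobScraperSB class defined in the guide above
locations = ["Chicago, IL", "Austin, TX", "remote"]
scraper = IndeedJobScraperSB(region="www")

all_jobs = []
for loc in locations:
    all_jobs.extend(scraper.scrape_jobs("data analyst", loc, max_pages=2))

scraper.save_jobs(all_jobs, "jobs_multi_location.json")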

How often should I scrape job listings?

The frequency of scraping Indeed job listings depends on your use case, but scraping once every few hours daily is generally enough to keep data fresh without raising flags. Over-scraping can lead to blocks, so balance timeliness with caution to maintain uninterrupted access.
