How To Scrape JSON Data in Python: Complete Tutorial

JSON is the format that most web APIs and modern websites use to send their data. This tutorial shows how to scrape JSON data in Python – fetching it, parsing it, modifying it, and exporting clean files. You'll also learn about the tools for messy or oversized responses, and how to get data when sites block you with fingerprinting.

Justinas Tamasevicius

Last updated: Jul 03, 2026

19 min read

TL;DR

The built-in json module handles most JSON parsing and exporting. The hard part is getting clean access, since sites block based on your fingerprint and behavior, not just your IP.

Parse JSON with Python's built-in json module – json.loads() reads a string, json.load() reads a file.
Modify, validate, and export records with json.dumps() and json.dump(), and keep ensure_ascii=False for non-English text.
Fetch JSON directly from APIs with Requests and response.json(), or read embedded JSON-LD (a page's built-in structured data) from its HTML.
Use jmespath, chompjs, ijson, and orjson on nested, malformed, or very large responses.
When sites block you, match a real browser fingerprint with curl_cffi (or a full browser like Playwright) behind rotating residential proxies – or hand the unblocking to a managed web scraping API like Decodo when CAPTCHAs and behavioral checks pile up.
When a page has no clean JSON, an LLM with a schema can extract the data, and increasingly, scraped JSON feeds AI agents through an MCP server.

What is JSON, and why does it matter for web scraping?

JSON (JavaScript Object Notation) is a text format for storing and exchanging structured data. It's built from 2 structures: objects are written with curly braces and hold key-value pairs, while arrays are written with square brackets and hold ordered lists. Values can be strings, numbers, booleans, null, or more objects and arrays nested inside.

JSON became the standard on the web because it's lightweight, human-readable, and language-independent. Almost every REST API returns JSON, and most data-heavy sites now load their content from background JSON endpoints instead of placing it directly into the HTML. That shift matters for scraping, because the cleanest target is often the raw JSON that a site already sends to its own front end, not the rendered page.

JSON vs. Python dictionaries

JSON and Python dictionaries look almost identical, but they're different things. JSON is a string of plain text that you receive over the network or read from a file. A Python dictionary is a live object in memory that you can index and edit. What connects them is parsing, which turns a JSON string into Python objects you can work with, and then serializes them back when you're done.

JSON-to-Python type mapping

When you parse JSON, each type maps to a native Python equivalent.

| JSON type | Python type |
| --- | --- |
| object | dict |
| array | list |
| string | str |
| number (integer) | int |
| number (decimal) | float |
| true or false | True or False |
| null | None |

The mapping isn't perfectly symmetric, so a few details change when you round-trip data. JSON has a single number type that Python splits into int and float. JSON object keys are always strings, so a dictionary with integer keys comes back with string keys after a round-trip through JSON. Python tuples are also serialized as arrays, so they return as lists, not tuples.
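
To make those round-trip changes concrete, here is a minimal sketch using only the built-in json module. The sample dictionary is made up for illustration:

```python
import json

original = {1: "one", "point": (3, 4), "count": 2, "ratio": 2.0}

# Serialize to JSON text, then parse it back into Python objects.
restored = json.loads(json.dumps(original))

print(restored)
# {'1': 'one', 'point': [3, 4], 'count': 2, 'ratio': 2.0}

print(list(restored)[0] == "1")      # True: the integer key 1 came back as a string
print(type(restored["point"]))       # <class 'list'>: the tuple came back as a list
print(type(restored["count"]), type(restored["ratio"]))   # int and float survive as-is
```

If your code depends on integer keys or tuples, convert them back explicitly after parsing.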

If you're new to web scraping, start with what web scraping is, and then keep what JSON is open as a reference.

Before you start

The code needs Python 3.9 or newer. Create a virtual environment first, then install libraries as each section introduces them:

python -m venv .venv
source .venv/bin/activate      # on Windows: .venv\Scripts\activate

Every code block lists its own pip install line, so you add only what you use.

Reading and parsing JSON with the built-in json module

The json module comes with Python, so there's nothing to install. Once you import it, you have everything you need to read and write JSON.

The json.loads() function parses a JSON string into Python objects (the trailing s stands for "string"). Here's a product listing, the kind of payload that an eCommerce API returns:

import json

raw = '''{"product": {"name": "Wireless Mouse", "price": 29.99,
  "specs": {"color": "black", "dpi": [800, 1600, 3200]}},
  "in_stock": true}'''

data = json.loads(raw)
print(data["product"]["name"])           # Wireless Mouse
print(data["product"]["specs"]["dpi"][-1])  # 3200
print(type(data["in_stock"]))             # <class 'bool'>

Notice how true became a Python bool automatically. To access nested data, you chain dictionary keys and list indices, like data["product"]["specs"]["dpi"][-1].

When the JSON is in a file, use json.load() (no s) with a context manager so the file is closed automatically:

with open("product.json", "r", encoding="utf-8") as file:   # a JSON file you've saved locally
    data = json.load(file)

The difference between the two is the input type: json.loads() takes a string that you already have in memory, while json.load() takes a file-like object and reads it for you. For a refresher on the underlying idea, see what parsing is. If you're still setting up, running Python code in the terminal covers the basics.

Modifying JSON data: Adding, updating, and deleting elements

Once JSON is parsed into a dictionary, you edit it with plain Python. This is where you clean and enrich your records before saving them.

product = {"name": "Wireless Mouse", "price": 29.99, "tags": ["tech"]}

product["currency"] = "USD"                       # add a new key
product["price"] = 24.99                           # update a value
product.update({"price": 19.99, "on_sale": True})  # bulk update several keys

removed = product.pop("tags", [])                  # delete safely, with a default
del product["currency"]                            # delete a key you know exists

print(product)   # {'name': 'Wireless Mouse', 'price': 19.99, 'on_sale': True}
print(removed)   # ['tech']

The del keyword raises KeyError if the key is missing, while pop() takes a default and won't raise when the key is missing. Use pop() whenever a field might be absent, which is the norm with scraped data.

Merging records from several endpoints is common, too. The {**a, **b} syntax combines 2 dictionaries, with the second one overriding any shared keys. On Python 3.9 and newer, base | extra is the cleaner equivalent:

base = {"id": 1, "name": "Mouse"}
extra = {"price": 19.99, "rating": 4.6}
merged = {**base, **extra}
print(merged)   # {'id': 1, 'name': 'Mouse', 'price': 19.99, 'rating': 4.6}

For deeply nested structures, reach the parent before assigning: record["product"]["specs"]["weight_g"] = 90. When your cleanup grows into full pipelines, what data cleaning is goes deeper, and the Python pandas tutorial shows a table-first alternative for bulk edits.
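
Assigning into a nested path only works when every parent already exists. The sketch below shows one way to create a missing parent with dict.setdefault(), alongside the | merge operator mentioned earlier. The record contents are illustrative:

```python
record = {"product": {"name": "Wireless Mouse"}}

# "specs" doesn't exist yet, so record["product"]["specs"]["weight_g"] = 90
# would raise KeyError. setdefault() creates the missing parent first.
specs = record["product"].setdefault("specs", {})
specs["weight_g"] = 90

# The | operator (Python 3.9+) merges like {**a, **b}: the right side wins.
base = {"id": 1, "name": "Mouse", "price": 24.99}
extra = {"price": 19.99, "rating": 4.6}
merged = base | extra

print(record)   # {'product': {'name': 'Wireless Mouse', 'specs': {'weight_g': 90}}}
print(merged)   # {'id': 1, 'name': 'Mouse', 'price': 19.99, 'rating': 4.6}
```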

Converting Python objects back to JSON

Serialization is the reverse of parsing: it turns Python objects back into a JSON string or file. After you've parsed, cleaned, and enriched your data, you serialize it for storage or for the next step in your pipeline.

The json.dumps() function returns a JSON string, and 3 options cover most cases:

import json

product = {"name": "Café Latte Mug", "price": 12.5, "colors": ["red", "blue"]}

print(json.dumps(product))                       # compact, ASCII-escaped
print(json.dumps(product, indent=2, sort_keys=True))  # pretty, ordered keys
print(json.dumps(product, ensure_ascii=False))   # keeps é instead of \u00e9

The first and last lines differ only in that accented field:

{"name": "Caf\u00e9 Latte Mug", "price": 12.5, "colors": ["red", "blue"]}   # default (ASCII-escaped)
{"name": "Café Latte Mug", "price": 12.5, "colors": ["red", "blue"]}        # ensure_ascii=False

The output escapes non-ASCII characters by default, so Café becomes Caf\u00e9. Setting ensure_ascii=False keeps human-readable accents and any non-Latin script intact. Use indent while developing to read the output, and sort_keys for stable, diff-friendly files.

To write straight to disk, json.dump() takes a file object:

with open("products.json", "w", encoding="utf-8") as file:
    json.dump(product, file, indent=2, ensure_ascii=False)

For large or streaming datasets, write JSONL instead – one JSON object per line. It's append-friendly and streamable line by line, which is why data and LLM pipelines favor it:

import json

records = [{"name": "Wireless Mouse", "price": 24.99}]   # your list of parsed dicts

with open("products.jsonl", "w", encoding="utf-8") as file:
    for record in records:
        file.write(json.dumps(record, ensure_ascii=False) + "\n")
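
Reading JSONL back is the mirror image of writing it. The sketch below writes 2 sample records to a temporary file (so it doesn't touch your real output) and then loads them one line at a time:

```python
import json
import os
import tempfile

records = [{"name": "Wireless Mouse", "price": 24.99},
           {"name": "Café Latte Mug", "price": 12.5}]

# Write to a temporary directory; in a real run this would be products.jsonl.
path = os.path.join(tempfile.mkdtemp(), "products.jsonl")
with open(path, "w", encoding="utf-8") as file:
    for record in records:
        file.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read it back line by line; only one record is parsed at a time.
loaded = []
with open(path, "r", encoding="utf-8") as file:
    for line in file:
        line = line.strip()
        if line:                      # skip blank lines
            loaded.append(json.loads(line))

print(loaded == records)   # True
```

Because every line is a complete JSON document, a half-written last line after a crash only costs you that one record, not the whole file.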

Serializing custom objects

Scrapers often hold data in custom classes, which aren't part of JSON's standard types. If you call json.dumps() on one, it raises TypeError: Object of type ScrapedProduct is not JSON serializable. To fix that, pass a JSONEncoder subclass that tells the module how to handle your types:

import json
from datetime import datetime, timezone

class ScrapedProduct:
    def __init__(self, name, price, scraped_at):
        self.name = name
        self.price = price
        self.scraped_at = scraped_at

class ProductEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, ScrapedProduct):
            return {"__product__": True, "name": obj.name,
                    "price": obj.price, "scraped_at": obj.scraped_at.isoformat()}
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)

item = ScrapedProduct("Mouse", 29.99, datetime(2026, 6, 9, tzinfo=timezone.utc))
encoded = json.dumps(item, cls=ProductEncoder, indent=2)
print(encoded)
# {
#   "__product__": true,
#   "name": "Mouse",
#   "price": 29.99,
#   "scraped_at": "2026-06-09T00:00:00+00:00"
# }

To rebuild the object when reading the data back, pass an object_hook to the decoder. It runs on every JSON object and lets you reconstruct your class. Run the encoder block above first, so encoded and ScrapedProduct exist:

# continues the encoder block above (needs json, datetime, ScrapedProduct, encoded)
def product_hook(d):
    if d.get("__product__"):
        return ScrapedProduct(d["name"], d["price"],
                              datetime.fromisoformat(d["scraped_at"]))
    return d

restored = json.loads(encoded, object_hook=product_hook)
print(type(restored).__name__, restored.name)   # ScrapedProduct Mouse

If your objects hold mostly standard types, a Pydantic model removes the need for the encoder altogether. Its model_dump_json() serializes datetime and other common types for you. Keep a JSONEncoder subclass for the cases where you need full control over an unusual type.

JSON is rarely your only output format. Flattening nested records into rows for CSV, Excel, or a database is its own step – see how to save your scraped data and what CSV is.
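
As a rough idea of what that flattening step can look like, here is a minimal sketch with the standard library. The flatten() helper, the dotted key names, and the "; " list separator are choices made for this example, not a fixed convention:

```python
import csv
import io

def flatten(record, parent_key="", sep="."):
    """Turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            flat[full_key] = "; ".join(str(item) for item in value)  # a list becomes one cell
        else:
            flat[full_key] = value
    return flat

records = [
    {"name": "Wireless Mouse", "price": 24.99,
     "specs": {"color": "black", "dpi": [800, 1600, 3200]}},
    {"name": "Keyboard", "price": 49.0, "specs": {"color": "white"}},
]
rows = [flatten(r) for r in records]

# Collect every column seen, in first-seen order, because records can differ.
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

# An in-memory buffer keeps the sketch side-effect free; for a real file use
# open("products.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The second record has no dpi field, so restval="" leaves that cell empty instead of raising an error.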

Fetching JSON data from web APIs and web pages

To scrape JSON data in Python, almost everything you'll fetch comes from 2 sources: APIs that return JSON directly, and JSON embedded inside HTML pages.

Calling an API with Requests

Requests is the standard way to make HTTP(S) calls in Python. Its response.json() method parses the body for you, so you skip the manual json.loads() step. This live price feed is the kind of endpoint that any price-monitoring or market-intelligence scraper relies on:

# pip install requests
import requests

url = "https://api.coinbase.com/v2/prices/BTC-USD/spot"
response = requests.get(url, timeout=15)
response.raise_for_status()        # stop early on 4xx/5xx responses

data = response.json()
price = float(data["data"]["amount"])   # APIs often send numbers as strings
print(f"BTC/USD: {price}")

The raise_for_status() call turns a failed request into a clear exception instead of letting bad data continue through your code. The value data["data"]["amount"] arrives as a string – a common API quirk – so you cast it to float before doing any math on it. For a deeper look at the library, read Mastering Python Requests. If you're choosing between async clients, httpx vs. Requests vs. aiohttp compares the options.

Find the hidden API first

Before you parse any HTML, check whether the site has its own JSON endpoint. Many modern pages render in the browser by fetching JSON in the background.

To find it:

Open your browser's DevTools and switch to the Network tab.
Filter by Fetch/XHR, so you only see data calls.
Reload the page and watch the requests appear.
Click any request that returns JSON to inspect its response.
Copy its URL, headers, and query parameters, then replay the call with Requests for clean, structured data without any HTML parsing.

The same page, 2 views: open the Network tab, filter to Fetch/XHR, select the quotes?page=1 request, and read its structured JSON in the Preview pane – then copy that URL to replay it with Requests.

The guide to inspecting elements explains the DevTools workflow.

Capturing JSON responses with Playwright

When a site loads its JSON through calls that you can't easily replay – signed URLs, short-lived tokens, or heavy JavaScript – you can let a real browser load the page and capture the response as it arrives. Playwright drives a headless browser, and the page.on("response") listener gives you every response object, including the parsed JSON:

# pip install playwright
# then download the browser once: playwright install chromium
from playwright.sync_api import sync_playwright

captured = []

def handle_response(response):
    if "/api/quotes" in response.url:
        captured.append(response.json())

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto("https://quotes.toscrape.com/scroll", wait_until="networkidle")
    page.mouse.wheel(0, 5000)          # scroll to trigger more AJAX calls
    page.wait_for_timeout(1500)
    browser.close()

print(len(captured), "JSON responses captured")
print(captured[0]["quotes"][0]["author"]["name"])   # Albert Einstein

The quotes.toscrape.com/scroll page fetches the quotes API as you scroll, so the listener collects each batch of JSON without parsing the HTML. A plain Playwright browser still emits detectable automation signals, so a protected target can spot it. One well-known example is the Runtime.enable call that it makes over the DevTools protocol. For this reason, you'd use a patched runtime like Patchright (an ordinary stealth plugin won't close the Runtime.enable leak), or you'd send the target to the Site Unblocker. To learn more about controlling the browser, see Playwright web scraping.

Handling authentication and pagination

Many APIs authenticate with a key, header, or bearer token. The headers argument carries them, and you read secrets from environment variables instead of hardcoding them:

# pip install requests
import os
import requests

url = "https://quotes.toscrape.com/api/quotes"   # swap in your authenticated endpoint
headers = {"Accept": "application/json"}
token = os.getenv("API_TOKEN")     # never commit credentials to your code
if token:
    headers["Authorization"] = f"Bearer {token}"

response = requests.get(url, headers=headers, timeout=15)

APIs typically return data one page at a time. They paginate, and you follow the pages until there are none left. The examples below use the same open practice API (quotes.toscrape.com), which you can use without a key or rate limits. It exposes a has_next flag, which makes the loop simple:

# pip install requests
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

all_quotes, page = [], 1
while True:
    response = session.get(
        "https://quotes.toscrape.com/api/quotes",
        params={"page": page},
        timeout=15,
    )
    response.raise_for_status()        # stop on 4xx/5xx instead of parsing an error page
    payload = response.json()
    all_quotes.extend(payload["quotes"])
    if not payload["has_next"]:
        break
    page += 1

print(len(all_quotes))   # 100

A Session reuses the connection and shared headers across every page, which is faster and lighter than standalone calls. Your own API may use a different end signal – a total page count, an offset/limit pair, a cursor token, or a Link header. The same loop structure handles them all, because in every case, you fetch a page, accumulate the results, and stop on the end signal. For the API concepts behind this, see the characteristics of a REST API.
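
To show that the loop shape really is the same for a different end signal, here is a sketch of cursor-based pagination. It runs against a fake in-memory API, and the items and next_cursor field names are hypothetical; a real API will use its own names:

```python
# A fake in-memory API standing in for a real cursor-paginated endpoint.
FAKE_PAGES = {
    None: {"items": [1, 2, 3], "next_cursor": "abc"},
    "abc": {"items": [4, 5, 6], "next_cursor": "def"},
    "def": {"items": [7], "next_cursor": None},
}

def fetch_page(cursor):
    """Hypothetical fetcher; a real one would call session.get(url, params={"cursor": cursor})."""
    return FAKE_PAGES[cursor]

def paginate(fetch, max_pages=100):
    """Fetch a page, accumulate the results, stop on the end signal (no next cursor)."""
    items, cursor = [], None
    for _ in range(max_pages):          # hard cap in case the end signal never arrives
        payload = fetch(cursor)
        items.extend(payload["items"])
        cursor = payload.get("next_cursor")
        if not cursor:
            break
    return items

print(paginate(fetch_page))   # [1, 2, 3, 4, 5, 6, 7]
```

Passing the fetcher in as a function keeps the loop independent of the transport, so the same paginate() can be tested offline and then pointed at a live endpoint.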

Making your scraper production-ready

The loops above call raise_for_status() and crash on the first failure. That's acceptable for a demo, but a serious problem for a real run, because a deep page eventually returns a 429, a 503, or a block page, and that single failure stops the whole job. A loop becomes production-ready with 3 small changes: retry transient failures, pace your requests, and save as you go.

First, wrap the request so transient errors retry with backoff instead of crashing, and respect Retry-After when the server sends it:

# pip install requests
import json
import random
import time
import requests

def get_json(session, url, **kwargs):
    for attempt in range(5):
        try:
            response = session.get(url, timeout=15, **kwargs)
            if response.status_code in (429, 500, 502, 503, 504):
                wait = response.headers.get("Retry-After")
                time.sleep(int(wait) if wait and wait.isdigit()
                           else 2 ** attempt + random.random())   # backoff + jitter
                continue
            response.raise_for_status()      # real 4xx (404, 401) fail fast
            return response.json()
        except (requests.Timeout, requests.ConnectionError):
            time.sleep(2 ** attempt + random.random())            # transient – retry
    raise RuntimeError(f"gave up on {url}")

Then run the loop so that one bad page is skipped instead of stopping the run. Write each page to disk as you fetch it, so that a crash on page 500 doesn't lose pages 1 to 499:

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

failed = []
with open("quotes.jsonl", "a", encoding="utf-8") as out:
    for page in range(1, 11):
        try:
            for quote in get_json(session, "https://quotes.toscrape.com/api/quotes",
                                  params={"page": page})["quotes"]:
                out.write(json.dumps(quote, ensure_ascii=False) + "\n")
            out.flush()                       # progress lands on disk page by page
        except Exception:
            failed.append(page)               # log it, keep going
        time.sleep(1 + random.random())       # pace requests; tune to the target

print(f"{10 - len(failed)} pages saved, failed: {failed}")   # 10 pages saved, failed: []

That time.sleep is pacing – a small delay so a fast endpoint doesn't see a burst of requests. To resume after a crash, keep track of which pages are already saved (for example, by storing the page number with each record in quotes.jsonl) and skip them on restart, so you never refetch finished work. You should also never run an unbounded while True against a target that you don't control. Cap it with for page in range(1, MAX_PAGES + 1) so a buggy end signal can't loop forever.
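
One way to make a run resumable is sketched below. It assumes each JSONL line carries a page field, which this example adds and the loop above does not write, and it uses a fake_fetch() stand-in for get_json() so it runs offline:

```python
import json
import os
import tempfile

MAX_PAGES = 5   # hard cap for the run

def pages_done(path):
    """Return the set of page numbers already saved in a JSONL file."""
    done = set()
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as file:
            for line in file:
                if line.strip():
                    done.add(json.loads(line)["page"])
    return done

def fake_fetch(page):
    """Stand-in for get_json(); returns one quote per page."""
    return [{"text": f"quote from page {page}"}]

def run(path, stop_after=None):
    done = pages_done(path)
    fetched = []
    with open(path, "a", encoding="utf-8") as out:
        for page in range(1, MAX_PAGES + 1):
            if page in done:
                continue                      # saved on an earlier run
            if stop_after is not None and len(fetched) >= stop_after:
                break                         # simulate a crash mid-run
            for quote in fake_fetch(page):
                out.write(json.dumps({"page": page, **quote}, ensure_ascii=False) + "\n")
            out.flush()
            fetched.append(page)
    return fetched

path = os.path.join(tempfile.mkdtemp(), "quotes.jsonl")
first = run(path, stop_after=2)    # "crashes" after pages 1 and 2
second = run(path)                 # resumes with pages 3, 4, and 5
print(first, second)               # [1, 2] [3, 4, 5]
```

This sketch treats a page as finished as soon as any of its records is on disk. If a crash can interrupt a page halfway through, write a separate marker line once the page is complete and check for that instead.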

If you'd rather not write all of this yourself, the tenacity library wraps the retry logic in a decorator. Either way, every line here (retries, pacing, resume logic, and block handling) is code that you now have to maintain. A managed Web Scraping API does all of it in one call, which is worth considering once you depend on a scraper in production.

Scaling up: Fetch pages concurrently with asyncio

The loop above fetches one page at a time, so most of the run is spent waiting on the network. Once you're fetching hundreds of pages, the standard choice is httpx with asyncio: send many requests at once, capped by a semaphore so you don't overload the server.

# pip install httpx
import asyncio
import httpx

API_URL = "https://quotes.toscrape.com/api/quotes"


async def fetch_page(client, page, semaphore):
    async with semaphore:                        # cap how many run at once
        response = await client.get(API_URL, params={"page": page}, timeout=15)
        response.raise_for_status()
        return response.json()["quotes"]


async def scrape_pages(total_pages):
    semaphore = asyncio.Semaphore(5)             # 5 in flight, not 500
    async with httpx.AsyncClient(headers={"User-Agent": "Mozilla/5.0"}) as client:
        tasks = [fetch_page(client, p, semaphore) for p in range(1, total_pages + 1)]
        pages = await asyncio.gather(*tasks)
    return [quote for page in pages for quote in page]


quotes = asyncio.run(scrape_pages(10))
print(len(quotes))   # 100

The example hardcodes 10 pages. Against a real API, you'd read the total from the first response, then request the rest concurrently. The asyncio.Semaphore matters here because it stops you from sending 500 requests to a target at once, which is both rude and a fast way to get blocked. The speedup grows with the number of pages and the per-request latency. For resilience at scale, give fetch_page the same retry logic shown above (adapted to async, with await asyncio.sleep(…) instead of time.sleep and httpx's exception types). Also, pass return_exceptions=True to asyncio.gather so that one failed page doesn't stop the whole batch.
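
The sketch below shows the shape of that async retry plus return_exceptions=True, using only asyncio. The fake_fetch() coroutine is a stand-in for the httpx call and fails on page 3 on purpose. In real code you would catch httpx's exception types instead of ConnectionError:

```python
import asyncio

async def fake_fetch(page):
    """Stand-in for an HTTP call; page 3 always fails."""
    await asyncio.sleep(0.01)
    if page == 3:
        raise ConnectionError(f"page {page} failed")
    return [f"quote {page}-{i}" for i in range(2)]

async def fetch_with_retry(page, semaphore, attempts=3):
    async with semaphore:
        for attempt in range(attempts):
            try:
                return await fake_fetch(page)
            except ConnectionError:
                if attempt == attempts - 1:
                    raise                                   # out of retries
                await asyncio.sleep(0.01 * 2 ** attempt)    # non-blocking backoff

async def scrape(total_pages):
    semaphore = asyncio.Semaphore(5)
    tasks = [fetch_with_retry(p, semaphore) for p in range(1, total_pages + 1)]
    # With return_exceptions=True, a failed task yields its exception object
    # in the results list instead of cancelling the whole batch.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    quotes, failed = [], []
    for page, result in enumerate(results, start=1):
        if isinstance(result, Exception):
            failed.append(page)
        else:
            quotes.extend(result)
    return quotes, failed

quotes, failed = asyncio.run(scrape(5))
print(len(quotes), failed)   # 8 [3]
```

asyncio.gather returns results in task order, which is why enumerate() can map each result back to its page number.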

Extracting embedded JSON from HTML

When a site has no public API, the data is often still JSON, hidden inside the page. The most common case is JSON-LD: a structured-data format that search engines read, found in a <script type="application/ld+json"> tag. Beautiful Soup selects the tag and json.loads() parses its contents:

# pip install requests beautifulsoup4
import json
import requests
from bs4 import BeautifulSoup

html = requests.get(
    "https://www.scrapingcourse.com/ecommerce/",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=15,
).text

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("script", type="application/ld+json")
ld = json.loads(tag.string)

website = next(node for node in ld["@graph"] if node.get("@type") == "WebSite")
print(website["name"])   # eCommerce Test Site to Learn Web Scraping

If you want to avoid searching for the tag yourself, the extruct library (pip install extruct) extracts JSON-LD – plus microdata, OpenGraph, and RDFa – from a page in one call with extruct.extract(html, syntaxes=["json-ld"]).

Some pages instead store data in a plain JavaScript variable, like window.__DATA__ = {…}. That content often isn't valid JSON, so you need chompjs rather than json.loads(). For more on tag selection, see Beautiful Soup web scraping.
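
When the assigned value does happen to be strict JSON, the standard library can still pull it out of the surrounding script. This is a minimal sketch on a made-up HTML snippet; it relies on JSONDecoder.raw_decode(), which parses one JSON value from the start of a string and ignores what follows:

```python
import json

# A made-up page fragment; real pages will differ.
html = """
<script>
  window.__DATA__ = {"product": {"name": "Wireless Mouse", "price": 29.99}};
  console.log("loaded");
</script>
"""

marker = "window.__DATA__ ="
start = html.index(marker) + len(marker)   # raises ValueError if the marker is absent

# raw_decode() doesn't skip leading whitespace, so strip it first.
text = html[start:].lstrip()
data, end = json.JSONDecoder().raw_decode(text)

print(data["product"]["name"])   # Wireless Mouse
print(text[end])                 # ;  (the first character after the JSON value)
```

If raw_decode() raises json.JSONDecodeError here, the object is JavaScript rather than JSON, and that is the cue to switch to chompjs.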

Getting past blocks: Proxies and browser fingerprints

Scraping now depends on your identity as much as on your proxy, because the block usually doesn't start with your IP. Instead, it starts with your fingerprint and your behavior. Every HTTPS request carries a TLS fingerprint (labeled JA4, which mostly replaced JA3 after browsers started randomizing their TLS extension order), and a plain Python client's fingerprint doesn't match any real browser. Recent browsers also send a post-quantum TLS key share by default, so a client that claims to be Chrome but leaves it out clearly isn't a real browser. Services like Cloudflare and DataDome score that mismatch before they ever look at your IP. They also check signals like mouse movement, scroll speed, and JavaScript challenges that they regenerate on every page load.

A proxy is an intermediary server that forwards your request, so the site sees its IP, not yours. Rotating residential proxies spread requests across many real IP addresses, which fixes the IP reputation problem. curl_cffi handles the other part of the problem, because it impersonates a real browser's TLS and HTTP/2 handshake (a current version impersonates a recent Chrome build, post-quantum key share included) while keeping the Requests API that you already know:

# pip install curl_cffi
from curl_cffi import requests   # drop-in replacement for the Requests API

response = requests.get(
    "https://quotes.toscrape.com/api/quotes?page=1",
    impersonate="chrome",        # match a real Chrome TLS fingerprint
    timeout=20,
)
print(response.json()["quotes"][0]["author"]["name"])   # Albert Einstein

To rotate the exit IP too, route it through residential proxies – add proxies={"http": PROXY, "https": PROXY} to the call, with PROXY = "http://USER:PASS@gate.decodo.com:7000". A real fingerprint plus rotating IPs clears most fingerprint- and IP-based blocks, but it doesn't clear all of them. The fingerprint checks, CAPTCHAs, and behavioral checks that guard commercial targets change constantly. To compare proxy types, read how residential proxies work, or browse the residential proxies plans.

When a site adds CAPTCHAs and behavioral checks on top of this, doing that ongoing work yourself is a constant maintenance cost. Decodo's Site Unblocker handles the fingerprints, CAPTCHAs, and rendering behind one endpoint, and the Web Scraping API goes a step further. You send it a URL, and it returns the page itself, as raw HTML or as parsed JSON for supported targets.

Scraping a real protected site end-to-end

The examples so far run on open endpoints, so you can reproduce them. A real target (an Amazon listing, a flight fare, a marketplace page) is protected by the anti-bot stack just described. A managed Web Scraping API is built for exactly this. You send a target and query, then it handles the proxies, fingerprint, CAPTCHAs, and JavaScript, and returns the data already parsed as JSON. The example below runs it against Amazon search:

# pip install requests
import json
import requests

# Token from your dashboard, under Web Scraping API.
AUTH = "Basic YOUR_BASE64_TOKEN"

response = requests.post(
    "https://scraper-api.decodo.com/v2/scrape",
    headers={"Content-Type": "application/json", "Authorization": AUTH},
    json={
        "target": "amazon_search",   # a dedicated, anti-bot-protected target
        "query": "wireless mouse",
        "parse": True,               # return structured JSON, not raw HTML
    },
    timeout=120,
)
response.raise_for_status()

# Access and parsing both happened server-side; the listing comes back as JSON.
# The adjacent results.results is the API's own nesting – read it off the raw response once.
content = response.json()["results"][0]["content"]
organic = content["results"]["results"]["organic"]

# Keep the fields you need and save as JSONL – append-friendly and streamable.
with open("products.jsonl", "w", encoding="utf-8") as file:
    for p in organic:
        record = {
            "title": p["title"],
            "asin": p["asin"],
            "price": p.get("price"),
            "currency": p.get("currency"),
            "rating": p.get("rating"),
            "reviews": p.get("reviews_count"),
        }
        file.write(json.dumps(record, ensure_ascii=False) + "\n")

print(f"Saved {len(organic)} products")

Each line in products.jsonl is one clean record:

{"title": "Logitech M185 Wireless Mouse", "asin": "B004YAVF8I", "price": 13.42, "currency": "USD", "rating": 4.5, "reviews": 52834}

From start to finish, one call plus the techniques in this tutorial gives you access to a protected site, parsed results, and a clean dataset. That single call replaced the unblocking stack and parsing that you'd otherwise build yourself, and the Web Scraping API provides pre-built targets like this for Amazon and Google. For a general site with no dedicated target, use "target": "universal" (add "headless": "html" to render JavaScript). The API then returns the page's HTML in content, which you parse with the JSON-LD technique shown earlier, or the JMESPath queries covered below.

Error handling and common pitfalls

Scraped JSON is never as clean as the documentation suggests, so a few failures appear again and again.

JSONDecodeError on invalid responses

json.JSONDecodeError is raised when the text isn't valid JSON: a truncated body, an empty response, or, most often in scraping, an HTML block page returned instead of data. The exception carries useful detail:

import json
try:
    json.loads("{'bad': True,}")   # single quotes aren't valid JSON
except json.JSONDecodeError as error:
    print(error.msg, "at position", error.pos)
    # Expecting property name enclosed in double quotes at position 1

When a request gets blocked, the server often returns an HTML page with a CAPTCHA, and response.json() then raises this error. This is usually a signal that you should slow down or send your requests through proxies, and it isn't a bug in your parser. See what to do about parsing errors in Python for more patterns.
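
A small guard can tell a block page apart from real data before the error reaches the rest of your pipeline. The helper below is a sketch: parse_json_body() is a name invented for this example, and the HTML check is a simple heuristic, not a complete detector:

```python
import json

def parse_json_body(text, content_type=""):
    """Return parsed JSON, or None when the body isn't usable JSON."""
    looks_like_html = text.lstrip().lower().startswith(("<!doctype", "<html"))
    if "html" in content_type.lower() or looks_like_html:
        return None                       # HTML where JSON was expected: likely a block page
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None                       # truncated or empty body

ok = parse_json_body('{"quotes": []}', "application/json")
blocked = parse_json_body(
    "<!DOCTYPE html><html><body>Verify you are human</body></html>", "text/html")
empty = parse_json_body("", "application/json")

print(ok, blocked, empty)   # {'quotes': []} None None
```

With Requests, you would call it as parse_json_body(response.text, response.headers.get("Content-Type", "")) and treat None as a signal to back off or retry.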

Missing keys and defensive access

Accessing a missing key with square brackets raises KeyError. Call .get() for a default instead, which keeps optional fields from crashing a long scrape:

quote = {"author": "Albert Einstein"}
print(quote.get("rating", "N/A"))   # N/A

For anything beyond a simple default, the Python idiom is EAFP (easier to ask forgiveness than permission): instead of checking every field first (LBYL, look before you leap), you try the access and catch the failure, since you can't always predict a remote response.

record = {"price": 19.99}
total = 0

# LBYL
if "price" in record and isinstance(record["price"], (int, float)):
    total += record["price"]

# EAFP
try:
    total += record["price"]
except (KeyError, TypeError):
    pass

Validating structure before you trust it

For production scrapers, validate the structure of every record before processing. The jsonschema library checks types and required fields against a schema:

# pip install jsonschema
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "required": ["name", "price"],
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0},
    },
}

try:
    validate({"name": "Mouse", "price": -5}, schema)
except ValidationError as error:
    print(error.message)   # -5 is less than the minimum of 0

Pydantic goes further than schema validation: it coerces real-world values into a typed model and raises a clear error when they don't fit:

# pip install pydantic
from pydantic import BaseModel, field_validator

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool = True

    @field_validator("price")
    @classmethod
    def non_negative(cls, value):
        if value < 0:
            raise ValueError("price must be >= 0")
        return value

product = Product(name="Wireless Mouse", price="29.99")   # the string is coerced to a float
print(product.model_dump())
# {'name': 'Wireless Mouse', 'price': 29.99, 'in_stock': True}

Pick Pydantic when you want typed objects flowing through your pipeline, and jsonschema when you need a language-agnostic schema to share with non-Python services.

Encoding is the final detail that you need to get right. If a response arrives in a non-UTF-8 charset, set response.encoding before reading the text, and always write output with ensure_ascii=False to keep international characters readable. When invalid JSON comes from flaky endpoints, retrying failed Python requests is often the right answer.
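
The effect of a wrong charset is easy to reproduce offline. The sketch below builds a Latin-1 body by hand as an assumed example of a non-UTF-8 response. With Requests, the equivalent fix is setting response.encoding = "latin-1" before reading response.text:

```python
import json

# A body as it might arrive from a server that uses Latin-1 instead of UTF-8.
body = '{"name": "Café Latte Mug"}'.encode("latin-1")

try:
    json.loads(body)                  # for bytes, json.loads() assumes UTF-8, UTF-16, or UTF-32
except UnicodeDecodeError as error:
    print("wrong charset:", error.reason)

data = json.loads(body.decode("latin-1"))    # decode with the real charset first
print(data["name"])                          # Café Latte Mug
print(json.dumps(data, ensure_ascii=False))  # {"name": "Café Latte Mug"}
```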

Advanced JSON parsing: Querying, malformed JavaScript, and big files

The built-in json module covers most jobs, while a handful of libraries handle the rest. That means querying complex data, parsing malformed JavaScript, processing files too large for memory, decoding quickly, and aggregating folders of files.

Querying nested data with JMESPath

JMESPath is a query language for JSON, much like SQL is for tables. Instead of writing nested loops, you describe the data you want with a path expression and call jmespath.search():

# pip install jmespath
import jmespath

data = {"quotes": [
    {"text": "A", "author": {"name": "Einstein"}, "tags": ["change", "world"]},
    {"text": "B", "author": {"name": "Twain"}, "tags": ["humor"]},
]}

# pull every author name in one expression
print(jmespath.search("quotes[].author.name", data))
# ['Einstein', 'Twain']

# filter by a condition, then reshape the result
print(jmespath.search("quotes[?contains(tags, 'humor')].author.name", data))
# ['Twain']

print(jmespath.search("quotes[].{who: author.name, count: length(tags)}", data))
# [{'who': 'Einstein', 'count': 2}, {'who': 'Twain', 'count': 1}]

On deeply nested API responses, JMESPath replaces manual indexing that would be fragile and hard to read, because filtering, projection, and reshaping all fit in a single string.
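
For comparison, here is roughly what the last 2 queries look like in plain Python, using the same sample data. It's a sketch of the manual equivalent, not a description of how JMESPath works internally.

```python
data = {"quotes": [
    {"text": "A", "author": {"name": "Einstein"}, "tags": ["change", "world"]},
    {"text": "B", "author": {"name": "Twain"}, "tags": ["humor"]},
]}

# Manual equivalent of: quotes[?contains(tags, 'humor')].author.name
humor_authors = [
    quote["author"]["name"]
    for quote in data.get("quotes", [])
    if "humor" in quote.get("tags", [])
]
print(humor_authors)   # ['Twain']

# Manual equivalent of: quotes[].{who: author.name, count: length(tags)}
summary = [
    {"who": quote["author"]["name"], "count": len(quote["tags"])}
    for quote in data["quotes"]
]
print(summary)
```

The manual version raises a KeyError as soon as a record lacks a key, whereas a JMESPath expression evaluates a missing key to None. That difference is a large part of why the path expression is less fragile on inconsistent API responses.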

Parsing messy JavaScript objects with chompjs

Embedded JavaScript objects often use single quotes, trailing commas, or unquoted keys. That's valid JavaScript but not valid strict JSON, so you handle them with chompjs rather than json.loads(). The library reads them and returns a clean Python dictionary:

# pip install chompjs
import chompjs

js = "{name: 'Wireless Mouse', price: 29.99, inStock: true, tags: ['tech', 'sale',],}"
print(chompjs.parse_js_object(js))
# {'name': 'Wireless Mouse', 'price': 29.99, 'inStock': True, 'tags': ['tech', 'sale']}

Use chompjs.parse_js_object() when you read data from an inline <script> tag during scraping, because its main strength is extracting the object from the surrounding code. For a relaxed object on its own, json5 and pyjson5 handle the same single quotes, trailing commas, and unquoted keys in one call. Both tools parse the object literal as written, so for values computed at runtime, like price: 10 * 2, you need a real browser engine such as Playwright. For the browser-side equivalent, see JSON.parse() in JavaScript, and for script-built pages, scraping JavaScript-rendered content.
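
To see why the standard parser can't stand in here, this short sketch feeds the same string to json.loads(). It uses only the standard library, and strict_ok is just an illustrative variable name.

```python
import json

js = "{name: 'Wireless Mouse', price: 29.99, inStock: true, tags: ['tech', 'sale',],}"

try:
    json.loads(js)
    strict_ok = True
except json.JSONDecodeError as error:
    # The strict parser stops at the first rule violation (here, the unquoted key)
    strict_ok = False
    print(f"Strict JSON rejected it: {error.msg} at position {error.pos}")
```

The parser reports only the first problem it meets, so fixing the string by hand with find-and-replace tends to surface the next violation instead of a clean parse.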

Handling very large JSON files

Loading a multi-gigabyte file with json.load() reads the whole thing into memory at once, which can crash your process. The ijson library streams the file and yields items one at a time:

# pip install ijson
import ijson

# large_products.json is your own file shaped like {"products": [{"price": <number>}, ...]}
total = 0
with open("large_products.json", "rb") as file:
    for price in ijson.items(file, "products.item.price"):
        total += price
print(total)

When files still fit in memory but speed matters, orjson is a near drop-in replacement that parses and serializes far faster than the standard module:

# pip install orjson
import orjson

data = orjson.loads('{"name": "Café", "price": 29.99}')   # accepts str or bytes
print(orjson.dumps(data))   # b'{"name":"Caf\xc3\xa9","price":29.99}'

On large payloads, orjson.dumps() runs roughly 5–10x faster than json.dumps(). It also returns bytes rather than a string and keeps UTF-8 characters without escaping by default. In short, stream with ijson when a file won't fit in memory, and switch to orjson when throughput is the limit. When you're validating every record too, msgspec decodes JSON straight into typed structs and validates in one pass.

Querying a folder of scraped JSON

The tools above each work on one document. Once you've scraped many files, the next question is how to aggregate and query them. DuckDB answers it with plain SQL over a glob (a wildcard path like data/*.jsonl), with no schema setup and with larger-than-memory handling included by default:

# pip install duckdb
import duckdb

# data/*.jsonl = your own JSONL exports, one object per line with title/price/rating;
# create a data/ folder and put your files there before running this
cheap = duckdb.sql("""
    SELECT title, price, rating
    FROM read_json_auto('data/*.jsonl')
    WHERE price < 30
    ORDER BY rating DESC
    LIMIT 5
""").fetchall()
print(cheap)

# Save the cleaned result as Parquet – tiny files, fast to re-query.
duckdb.sql("COPY (SELECT * FROM read_json_auto('data/*.jsonl')) TO 'products.parquet' (FORMAT parquet)")

The read_json_auto call reads every matching file, infers the schema, and streams from disk when the data won't fit in memory. If you'd rather stay in Python data frames, Polars (pip install polars) covers the same step: pl.read_ndjson("products.jsonl") reads one of those saved files into a typed columnar frame (column-oriented, fast for bulk operations) to clean, dedupe, and cast. It pairs with orjson or msgspec for decoding.

Pick the tool based on what "too big" means in your case: ijson for one massive document at fixed memory, DuckDB or Polars for a folder of files or millions of rows to aggregate, and msgspec when you also need validation on a known structure.

When to let an LLM extract the JSON

Everything above assumes the data is already JSON. When the data isn't JSON – which happens when the values are buried in unstructured HTML that changes shape from page to page – there's a different route: you send the page to an LLM with a schema and let it return the JSON. The model reads the content the way a person would, so it survives markup changes that break CSS selectors.

This complements parsing, and it doesn't replace it. An LLM call is far slower and more expensive per page than reading JSON that you already have. It can also invent fields, which is why you save it for pages whose structure varies too much to parse.

The pattern pairs a Pydantic schema with an LLM that supports structured outputs, so the response is validated against your types before you use it. Here, Claude extracts products from a page that has no public API:

# pip install anthropic pydantic beautifulsoup4 requests
import requests
from bs4 import BeautifulSoup
from pydantic import BaseModel
import anthropic


class Product(BaseModel):
    name: str
    price: float


class ProductList(BaseModel):
    products: list[Product]


# 1. Fetch the page (behind a proxy or the Web Scraping API on protected sites)
html = requests.get(
    "https://www.scrapingcourse.com/ecommerce/",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=15,
).text

# 2. Strip the markup down to the text the model needs to read
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
page_text = soup.select_one("ul.products").get_text("\n", strip=True)   # scope to the product block; the model handles the fields

# 3. Hand the text plus a schema to the model and get back validated objects
client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
response = client.messages.parse(
    model="claude-opus-4-8",          # any current Claude model with structured outputs works (pick by budget)
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": f"Extract every product as name and price:\n\n{page_text}",
    }],
    output_format=ProductList,
)

products = response.parsed_output.products   # validated list of Product objects
print(f"Extracted {len(products)} products")
print(products[0].name, products[0].price)

The output_format=ProductList argument constrains the model to your schema at decode time, not just validating afterward, using Claude's structured outputs. The result, response.parsed_output, comes back as typed Pydantic objects, so a malformed field fails with a clear error instead of slipping into your dataset. A schema guarantees the shape, not the facts, which is why you still spot-check the values, as with any extracted data. The same pattern works with any LLM that supports structured outputs.
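
One way to spot-check the values is a grounding check: confirm that each extracted name actually appears in the text the model was given. The sketch below uses plain dictionaries and made-up sample text in place of the real page_text and Product objects. find_ungrounded is a hypothetical helper name, not part of any library.

```python
def find_ungrounded(products, page_text):
    """Return extracted products whose name never appears in the source text."""
    haystack = page_text.casefold()
    return [p for p in products if p["name"].casefold() not in haystack]


# Made-up sample text and extraction result, for illustration only
page_text = "Wireless Mouse\n$29.99\nUSB-C Hub\n$45.00"
extracted = [
    {"name": "Wireless Mouse", "price": 29.99},
    {"name": "USB-C Hub", "price": 45.0},
    {"name": "Phantom Gadget", "price": 42.0},   # not in the text: likely invented
]

suspicious = find_ungrounded(extracted, page_text)
print([p["name"] for p in suspicious])   # ['Phantom Gadget']
```

A substring match is deliberately crude. It catches invented items but not a wrong price attached to a real name, so it complements manual spot checks and doesn't replace them.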

You still need a reliable way to fetch the page, either proxies or a Web Scraping API, before any model can read it, and you still validate the result. The LLM replaces the parser, but it doesn't replace the fetching layer. For most JSON scraping, the deterministic tools earlier in this tutorial are faster, cheaper, and exact. Use an LLM when the page structure varies too much for selectors.

The DIY path above means you own the prompt, the API key, the per-page cost, and the re-tuning each time a model version changes. Decodo's AI Parser is the no-code version: you describe the fields you want, and it returns structured JSON from the page, without managing your own prompt or API key.

Scraping JSON in the agent era

Much scraped JSON no longer ends up in a CSV file. It feeds an AI agent or a RAG pipeline instead, and that has changed the interface. The Model Context Protocol (MCP), an open standard created by Anthropic and now governed by the Linux Foundation, lets an AI agent call a scraping tool directly and get structured JSON back. Decodo's MCP server is one of these. It connects apps like Claude, Cursor, and Windsurf to the Web Scraping API, so an agent fetches and extracts on demand with no custom integration to build. With anti-bot systems now scoring behavior and regenerating their defenses per page load, running your own unblocking stack takes real effort, which turns the choice into a routing decision:

Situation | Approach
--- | ---
Data is already JSON on a stable target | Parse it yourself with the tools in this tutorial
Structure varies, or there's no clean source | LLM extraction with a schema
Target is heavily protected, or an agent needs it on demand | The Site Unblocker to keep your own parser, the Web Scraping API to get it parsed, or an MCP server to feed an agent

When the data feeds a model rather than a database, the format itself becomes part of the cost. You pay for every token, and JSON is verbose. TOON (Token-Oriented Object Notation) re-encodes the same data losslessly into a more compact form. Uniform record lists become a header plus CSV-style rows, which the project benchmarks at roughly 30–60% fewer input tokens than JSON:

# pip install git+https://github.com/toon-format/toon-python.git
from toon_format import encode

records = [
    {"title": "Logitech M185", "price": 13.42, "rating": 4.5},
    {"title": "Amazon Basics Mouse", "price": 8.99, "rating": 4.4},
]
print(encode({"products": records}))
# products[2]{title,price,rating}:
#   Logitech M185,13.42,4.5
#   Amazon Basics Mouse,8.99,4.4

Treat it as a conversion layer, not a storage format: keep your scraped data as JSON and encode to TOON only when you send it into a prompt. It helps most on the uniform record lists that scraping tends to produce. The format and its Python libraries are new and still pre-1.0, from late 2025, so the snippet installs the official implementation from source while it stabilizes. It's an early-adopter optimization. Pin a version you have tested, and use it deliberately rather than by default.

The fundamentals still hold. You still read JSON, handle errors, and validate output, whether the request came from your own script or an agent calling a tool.

Approaching a new target

When you start with a URL you haven't scraped before, work through it in this order:

Look for a JSON endpoint first. Open DevTools, filter the Network tab to Fetch/XHR, and reload. If the page loads its data from a background JSON call, replay that call with Requests. It's faster and cleaner than parsing HTML.
If the request gets blocked, fix the fingerprint before the IP. Match a real browser with curl_cffi (impersonate="chrome"), route through rotating residential proxies, and send the hardest, CAPTCHA-guarded targets to the Web Scraping API or Site Unblocker.
If the JSON loads through calls that you can't replay, use Playwright. Drive a real browser to capture responses from signed URLs or short-lived tokens as they arrive.
If there's no JSON endpoint, check the HTML. Extract embedded JSON-LD with extruct. If the data is unstructured or its layout changes from page to page, send the text to an LLM with a Pydantic schema.
Once you have the JSON, complete the work. Validate it with Pydantic or jsonschema, reshape it with JMESPath, and save it as JSON or JSONL – then query a folder of those files with DuckDB once the dataset grows.
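
For the embedded JSON-LD step, the idea is to collect every script block whose type is application/ld+json and parse its contents. The tutorial uses extruct for this. The sketch below swaps in the standard library's html.parser so it runs without extra installs, and both the sample HTML and the class name JsonLdCollector are invented for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of every JSON-LD script block."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass   # skip blocks that are not valid JSON


html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Wireless Mouse", "offers": {"price": "29.99"}}
</script>
<script>console.log("not JSON-LD");</script>
</head></html>
"""

collector = JsonLdCollector()
collector.feed(html)
print(collector.items[0]["name"])   # Wireless Mouse
```

Note that JSON-LD values often arrive as strings, like the price here, so the validation step that follows is where you cast them to numbers.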

Most of the time, the data is already JSON somewhere, and the rest of these steps are the fallback for when it isn't.

Best practices for scraping JSON

These habits separate a one-off script from a production scraper.

Scrape responsibly – check the target's robots.txt and Terms of Service first, prefer an official or public API where one exists, honor rate limits, and collect only public data you're allowed to use.
Validate before you process – check for expected keys, guard against None, and confirm types.
Keep ensure_ascii=False when writing JSON that holds international characters.
Pretty-print with indent during development, then switch to compact output for storage.
Route requests through rotating residential proxies once you scale, to avoid rate limits and bans.
Store the raw API response next to the parsed data, so you can debug and reprocess without re-scraping.
Add a timestamp and source metadata to every saved file for easy versioning.
Match a real browser fingerprint with curl_cffi before you assume you need a full browser, because it clears most TLS-based blocks on its own.
Save large or streaming datasets as JSONL for append-friendly, pipeline-ready output.
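
Several of these habits fit in a few lines. The sketch below appends each record to a JSONL file together with a timestamp and source URL, then streams the file back one line at a time. The file name, the field names source and scraped_at, and both helper names are illustrative choices, not a fixed convention.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_record(path, record, source_url):
    """Append one record as a single compact JSON line, with metadata attached."""
    row = {
        **record,
        "source": source_url,
        "scraped_at": datetime.now(timezone.utc).replace(microsecond=0).isoformat(),
    }
    with open(path, "a", encoding="utf-8") as file:
        file.write(json.dumps(row, ensure_ascii=False, separators=(",", ":")) + "\n")


def stream_records(path):
    """Yield records one at a time, so the whole file never sits in memory."""
    with open(path, encoding="utf-8") as file:
        for line in file:
            if line.strip():
                yield json.loads(line)


# Write to a temporary folder so the example leaves nothing behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "quotes.jsonl")
    append_record(path, {"author": "Zoë", "text": "First"}, "https://example.com/page/1")
    append_record(path, {"author": "Twain", "text": "Second"}, "https://example.com/page/2")
    loaded = list(stream_records(path))

print(len(loaded), loaded[0]["author"])   # 2 Zoë
```

Because every line is a complete document, a crash halfway through a run costs at most the record being written, and the file can be queried as-is.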

For more on rotation, read why rotating proxies work best.

The complete script

Here's a working scraper that ties the pieces together. It pages through the quotes API, reshapes each record with JMESPath, enriches it with metadata, and saves a clean JSON file. Install the dependencies with pip install requests jmespath (a virtual environment keeps them isolated), save it as scraper.py, and run it. It works as-is, and the proxy turns on the moment you add your residential proxy credentials.

import json
from datetime import datetime, timezone

import jmespath
import requests

# Decodo residential proxy. Add your credentials to route through rotating IPs;
# until you do, the script runs directly against the open quotes API.
PROXY = "http://YOUR_PROXY_USERNAME:YOUR_PROXY_PASSWORD@gate.decodo.com:7000"

API_URL = "https://quotes.toscrape.com/api/quotes"


def fetch_all_quotes(session):
    """Page through the quotes API until has_next turns False."""
    quotes, page = [], 1
    while True:
        response = session.get(API_URL, params={"page": page}, timeout=15)
        response.raise_for_status()
        payload = response.json()
        quotes.extend(payload["quotes"])
        if not payload["has_next"]:
            break
        page += 1
    return quotes


def main():
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    if "YOUR_PROXY" not in PROXY:                  # enabled once you add credentials
        session.proxies.update({"http": PROXY, "https": PROXY})

    raw_quotes = fetch_all_quotes(session)

    # Keep only the fields you need, in one JMESPath pass.
    records = jmespath.search(
        "[].{author: author.name, text: text, tags: tags}", raw_quotes
    )

    # Enrich each record before saving.
    scraped_at = datetime.now(timezone.utc).replace(microsecond=0).isoformat()
    for record in records:
        record["tag_count"] = len(record["tags"])
        record["scraped_at"] = scraped_at

    with open("quotes.json", "w", encoding="utf-8") as file:
        json.dump(records, file, indent=2, ensure_ascii=False)

    print(f"Saved {len(records)} quotes to quotes.json")


if __name__ == "__main__":
    main()

Run it from your terminal:

python scraper.py

You'll see Saved 100 quotes to quotes.json, and a quotes.json file appears in the same folder. Each record looks like this, with scraped_at set to your run's UTC time:

{
  "author": "Albert Einstein",
  "text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
  "tags": [
    "change",
    "deep-thoughts",
    "thinking",
    "world"
  ],
  "tag_count": 4,
  "scraped_at": "2026-06-09T11:11:28+00:00"
}

That script fetches the pages sequentially with Requests and, to stay readable, leaves out the retry harness from earlier. On a real run, wrap each request in the get_json helper so a transient failure doesn't stop the job. Beyond that, 2 more changes adapt it to harder loads. For hundreds of pages, use the httpx + asyncio version from the scaling-up section. On a fingerprint-protected site, install curl_cffi and change 2 lines: swap the import for from curl_cffi import requests, then build the session with requests.Session(impersonate="chrome"). curl_cffi mirrors the Requests API, so nothing else changes.
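
The get_json helper itself is defined earlier in the tutorial. As a stand-in for readers who jump straight to this section, here is a minimal sketch of the same idea: retry a callable with exponential backoff. The name with_retries, its parameters, and the simulated flaky function are all hypothetical, chosen only for illustration.

```python
import time


def with_retries(func, attempts=3, base_delay=0.5, retry_on=(Exception,)):
    """Call func(); on a listed exception, wait and retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise   # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))


# Simulated endpoint that fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"quotes": [], "has_next": False}


payload = with_retries(flaky_fetch, base_delay=0.01, retry_on=(ConnectionError,))
print(calls["count"], payload["has_next"])   # 3 False
```

In the script, one option is to move the session.get() and raise_for_status() calls into a small function and pass it to with_retries with retry_on=(requests.RequestException,), so that both connection errors and HTTP error statuses trigger a retry.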

Bottom line

The part that breaks most often isn't your code. It's getting blocked by IP bans, TLS fingerprinting, and CAPTCHAs before the data arrives. Match the tool to the target: use rotating residential proxies for volume, and use the unblocking and agent options covered above when a site blocks you or an LLM needs the data on demand. The Web Scraping API has a free tier if you'd rather let it handle the unblocking stack entirely.


About the author

Justinas Tamasevicius

Director of Engineering

Justinas Tamaševičius is Director of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.

Connect with Justinas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.


Frequently asked questions

What is the difference between json.load() and json.loads() in Python?

The json.load() function reads from a file-like object, while json.loads() parses a string already in memory. Use json.load(file) for a file, and json.loads('{"a": 1}') for a string.
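
A quick side-by-side, with io.StringIO standing in for a real open file:

```python
import io
import json

text = '{"a": 1}'

from_string = json.loads(text)             # parses a str (or bytes) already in memory
from_file = json.load(io.StringIO(text))   # reads from any object with a .read() method

print(from_string == from_file)   # True
```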

Can I parse JavaScript objects with Python's json module?

No, and that's by design. json.loads() enforces strict JSON, so JavaScript objects with single quotes, trailing commas, or unquoted keys raise a JSONDecodeError. Parse those with chompjs and its parse_js_object() function, or with json5 for a standalone object.

How do I handle very large JSON files without running out of memory?

Use ijson to stream and parse the file incrementally, so you never load it all at once. If the file fits in memory but parsing is slow, orjson is a faster near drop-in replacement, and msgspec is faster still when you also need validation.

Why does response.json() raise a JSONDecodeError?

It means the response body wasn't valid JSON. The usual culprit in scraping is a block page or CAPTCHA returned as HTML instead of data. Check the status code and response.text first, and route through proxies if you're being blocked.
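
A sketch of that check, written as a plain function so it runs without a network call. parse_or_explain is a hypothetical helper. With Requests, you would pass it response.status_code and response.text.

```python
import json


def parse_or_explain(status_code, body):
    """Parse a response body as JSON, or raise an error that shows what came back."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as error:
        preview = body[:80].replace("\n", " ")
        raise ValueError(
            f"Expected JSON but got status {status_code}, body starts with: {preview!r}"
        ) from error


print(parse_or_explain(200, '{"ok": true}'))   # {'ok': True}

try:
    parse_or_explain(403, "<html><title>Just a moment...</title></html>")
except ValueError as error:
    message = str(error)
    print(message)
```

Putting the status code and the first few characters of the body in the error message usually makes a block page obvious at a glance.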

Why am I getting blocked even when my code is correct?

The block is usually your fingerprint, not your IP. A plain Python client's TLS handshake (its JA4 signature) doesn't match any real browser, so services like Cloudflare flag it immediately. Match a real browser fingerprint with curl_cffi, route through rotating residential proxies, and escalate to a Web Scraping API when CAPTCHAs and JavaScript challenges multiply.

Should I use an LLM to extract JSON when scraping?

Use one only when there's no clean source. If the data is already JSON – an API, JSON-LD, or a hidden endpoint – parse it directly, because that's quicker and deterministic. Use an LLM with a schema when the data is in unstructured or frequently changing HTML, and validate its output with Pydantic.

How do I extract JSON from a website in Python?

Check for a hidden JSON API first: open DevTools, filter to Fetch/XHR, and if you find one, call that endpoint with Requests and read response.json(). If the data is embedded in the HTML as JSON-LD, extract it with extruct or Beautiful Soup. Both give you structured JSON without parsing the rendered page.

How do I save JSON data to a file in Python?

Use json.dump(data, file) to write a Python object to a file, keeping ensure_ascii=False for non-English text. For large or streaming datasets, write JSONL so you can append records while you scrape and process the file without loading it all at once.

How do I scrape JSON from an API in Python?

Send a request with Requests and call response.json() to get a Python dictionary. Add headers or a token if the API needs authentication, follow pagination until there are no pages left, and retry transient failures when they occur. When the API is behind anti-bot protection, route through rotating residential proxies or a Web Scraping API.
