Scraping Yelp: A Step-by-Step Tutorial

Yelp doesn't make scraping easy. The data you need is spread across multiple backend systems (no single endpoint gives you everything), and standard HTTP libraries get blocked before the first response. This guide covers every extraction method with Python, including the TLS impersonation and anti-bot techniques you need to avoid blocks at scale.

Justinas Tamasevicius

Last updated: Mar 12, 2026

15 min read

TL;DR

Yelp stores data in 4 separate locations: Hypernova JSON (search results), Apollo cache (business details), GraphQL API (reviews), and server-rendered HTML (not-recommended reviews)
Standard HTTP libraries (Requests, httpx) get blocked – Yelp's bot detection checks TLS fingerprints before the HTTP request begins
Use curl_cffi for browser-level TLS impersonation and residential proxies for IP rotation
This tutorial builds 4 standalone Python scrapers, one per data source
All code is tested as of early 2026 – complete scraper files available for download

Why scrape Yelp data?

For data teams working on market research, competitive analysis, lead generation, or sentiment analysis, Yelp is one of the richest public data sources available.

The typical data points worth extracting include:

Business details – name, address, phone, hours, coordinates, categories, price range, amenities
Reviews – full text, star rating, date, author info, useful/funny/cool vote counts
Search rankings – which businesses rank for specific queries in specific cities
Not-recommended (NR) reviews – the ones Yelp's algorithm suppresses (useful for sentiment analysis and review authenticity research)

What about the Yelp Fusion API?

Yelp offers a Places API (formerly the Fusion API) for programmatic access. But there are real limitations:

Limited free trial – a 30-day trial gives you 5,000 API calls for evaluation only (no commercial use). Paid tiers: Starter at $7.99 per 1,000 calls (business data only), Plus at $9.99 per 1,000 calls (adds review excerpts and photos), Premium at $14.99 per 1,000 calls
Strict rate limits – base allocation is 30,000 calls/month with a 5,000/day cap. The Starter plan limits you to 300 calls/day.
Only short review excerpts – the Plus plan returns 3 excerpts per business (~160 characters each), while the Premium plan gets 7 via the Review Highlights endpoint. No plan returns full review text.
No not-recommended reviews – these aren't available through the API at all
No photo captions, limited attributes – the API returns a subset of what's on the actual page

When you need complete review datasets, full business attributes, or not-recommended reviews, scraping the site directly is the realistic option.

Setting up your Python environment

You need Python 3.10+ and 3 libraries:

pip install curl_cffi beautifulsoup4 python-dotenv

What each library does:

curl_cffi – HTTP client that impersonates real browser TLS fingerprints at the network level. Regular HTTP clients like Requests or httpx get blocked by Yelp's bot detection because their TLS handshakes don't match real browsers. Headless browsers like Playwright also pass TLS checks, but they're overkill here because Yelp's data lives in embedded JSON and a GraphQL API, so you don't need to render the page.
Beautiful Soup 4 and python-dotenv – HTML parsing and .env proxy credential loading.

Project structure

Each scraper is a standalone file with no cross-dependencies:

yelp-scraping/
├── .env                        # Proxy credentials
├── yelp_search.py              # Search results scraper
├── yelp_business.py            # Business details scraper
├── yelp_scraper.py             # Reviews scraper (GraphQL)
└── yelp_not_recommended.py     # Not-recommended reviews scraper

Proxy setup

Create a .env file with your proxy credentials. You can get yours from the Decodo dashboard (see the residential proxy quick-start guide for setup details):

PROXY_URL=http://username:password@us.decodo.com:10000

This example uses Decodo residential proxies with a rotating gateway, so each request gets a fresh IP from a pool of 115M+ residential addresses. The us.decodo.com endpoint ensures US-based IPs, which matters because Yelp serves different data based on geographic location.

The key requirement for Yelp is residential IPs. Datacenter proxies tend to get blocked quickly on Yelp because bot detection systems like DataDome can identify IPs that belong to hosting providers.
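
Before a long run, it can help to confirm that the PROXY_URL value is well-formed. The following is a minimal sketch using only the standard library; check_proxy_url is a hypothetical helper name, not part of the tutorial's scrapers, and the list of accepted schemes is an assumption:

```python
from urllib.parse import urlsplit


def check_proxy_url(proxy_url):
    """Return (host, port) if the proxy URL looks usable, else raise ValueError."""
    if not proxy_url:
        raise ValueError("PROXY_URL is not set")
    parts = urlsplit(proxy_url)
    # Assumed set of schemes; adjust to whatever your proxy provider supports
    if parts.scheme not in ("http", "https", "socks5", "socks5h"):
        raise ValueError(f"Unsupported proxy scheme: {parts.scheme!r}")
    if not parts.hostname or parts.port is None:
        raise ValueError("Proxy URL needs a host and a port")
    if not parts.username or not parts.password:
        raise ValueError("Proxy URL is missing credentials")
    return parts.hostname, parts.port


# Placeholder credentials, same shape as the .env example
print(check_proxy_url("http://username:password@us.decodo.com:10000"))
# ('us.decodo.com', 10000)
```

In a scraper, the argument would come from os.environ.get("PROXY_URL") after load_dotenv() has run.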

Understanding Yelp's page structure and data locations

Before writing any scraping code, you need to know where Yelp actually stores its data. A common approach is to parse the visible HTML, but Yelp uses dynamically generated CSS classes that can change without notice. The more reliable data lives in 4 separate locations.

Yelp URL patterns

Each data source uses a different endpoint:

Search results

https://www.yelp.com/search?find_desc={query}&find_loc={location}

Business pages

https://www.yelp.com/biz/{business-slug}

NR reviews

https://www.yelp.com/not_recommended_reviews/{business-slug}

Reviews API

POST https://www.yelp.com/gql/batch
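
The URL patterns above can be wrapped in small builder functions. This is an illustrative sketch; the function names are hypothetical, and the start parameter follows the search pagination described later in this guide:

```python
from urllib.parse import quote, quote_plus

BASE = "https://www.yelp.com"


def search_url(query, location, start=0):
    """Build a search URL; `start` is the result offset for pagination."""
    url = f"{BASE}/search?find_desc={quote_plus(query)}&find_loc={quote_plus(location)}"
    if start:
        url += f"&start={start}"
    return url


def business_url(slug):
    return f"{BASE}/biz/{quote(slug)}"


def not_recommended_url(slug):
    return f"{BASE}/not_recommended_reviews/{quote(slug)}"


print(search_url("tacos", "Los Angeles, CA"))
# https://www.yelp.com/search?find_desc=tacos&find_loc=Los+Angeles%2C+CA
print(business_url("flour-bakery-cafe-boston"))
# https://www.yelp.com/biz/flour-bakery-cafe-boston
```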

A typical Yelp business page with the data points these scrapers extract:

A Yelp business page for Flour Bakery + Café showing the business name, 4.3-star rating, 1.4K reviews, price range, categories, and operating hours

Data source 1 – Hypernova JSON (search results)

Search result pages embed their data in a <script data-hypernova-key> tag. This Hypernova JSON contains all the business listings, ratings, review counts, categories, and ranking positions.

Yelp search results page with DevTools showing the Hypernova JSON script tag containing search result data

The data path is deeply nested: legacyProps → searchAppProps → searchPageProps → mainContentComponentsListProps. Addresses are stored separately in an Apollo state cache (<script data-apollo-state>) on the same page.

Data source 2 – Apollo cache (business details)

Open any Yelp business page, inspect the HTML, and search for ROOT_QUERY. You'll find a massive <script type="application/json"> tag containing the Apollo Client normalized cache, a flat JSON store with every business detail the page needs to render.

Apollo Client cache visible in Chrome DevTools on a Yelp business page, showing the large application/json script tag containing ROOT_QUERY data

This cache stores data as key-value entities with cross-references. For example, categories aren't stored directly on the business. Instead, they're referenced with {"__ref": "Category:bakeries"} and resolved from the same cache.

Data source 3 – GraphQL batch API (reviews)

Reviews load through an internal GraphQL API at POST https://www.yelp.com/gql/batch. The operation is called GetBusinessReviewFeed and returns structured review data (full text, ratings, dates, author info, vote counts) without any HTML parsing needed.

Yelp reviews page with Chrome DevTools Network tab open, showing the GraphQL batch request payload with GetBusinessReviewFeed operation

Data source 4 – server-rendered HTML (not-recommended reviews)

Not-recommended reviews live at /not_recommended_reviews/{slug} and use server-rendered HTML. No JSON blobs, no GraphQL – just HTML that you parse with Beautiful Soup. These reviews are split into 2 sections: "not currently recommended" and "removed for TOS violations", each with its own pagination parameter.

Yelp's not-recommended reviews page showing filtered reviews with the "not currently recommended" header

Scraping Yelp search results

Search pages embed their listings in a Hypernova JSON blob. Each page returns 10 results, and pagination uses a start query parameter.

Creating a session with browser impersonation

Every scraper in this tutorial uses the same imports and session setup (individual scrapers add a few more as needed):

import csv
import json
import os
import re
import time
import random
from html import unescape
from urllib.parse import quote_plus, unquote
from pathlib import Path

from bs4 import BeautifulSoup
from curl_cffi import CurlOpt, requests
from curl_cffi.requests.exceptions import RequestException
from dotenv import load_dotenv

load_dotenv(Path(__file__).parent / ".env")

The curl_cffi session impersonates a real browser's TLS fingerprint:

RETRY_IMPERSONATIONS = ["safari2601", "safari184", "safari260", "chrome133a", "safari170"]

def _get_session(impersonate="safari2601"):
    proxy_url = os.environ.get("PROXY_URL")
    return requests.Session(
        impersonate=impersonate,
        proxy=proxy_url,
        timeout=(10, 30),
        curl_options={
            CurlOpt.TCP_KEEPALIVE: 1,
            CurlOpt.TCP_KEEPIDLE: 60,
            CurlOpt.TCP_KEEPINTVL: 30,
            CurlOpt.DNS_CACHE_TIMEOUT: 300,
        },
    )

The impersonate parameter tells curl_cffi which browser to mimic. safari2601 (Safari 26.0.1) is a reliable default. The TCP_KEEPALIVE options prevent stale connections during pagination loops.

Important: don't set User-Agent, Accept, or Accept-Language headers manually. The impersonation already handles these. Setting them yourself can create mismatches that expose you as a scraper.
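
The RETRY_IMPERSONATIONS list suggests rotating browser profiles when a request is blocked. The retry loop itself isn't shown in this section, so the following is only a sketch of one way it could work. The fetch callable and fake_fetch are hypothetical stand-ins that keep the example runnable without a network connection:

```python
import time

RETRY_IMPERSONATIONS = ["safari2601", "safari184", "safari260", "chrome133a", "safari170"]


def fetch_with_rotation(fetch, profiles=RETRY_IMPERSONATIONS, base_delay=0.0):
    """Call fetch(profile) for each profile in turn until one returns HTTP 200.

    `fetch` is a caller-supplied function returning (status_code, body).
    """
    last_status = None
    for attempt, profile in enumerate(profiles):
        status, body = fetch(profile)
        if status == 200:
            return profile, body
        last_status = status
        # Wait a little longer after each blocked attempt
        time.sleep(base_delay * (attempt + 1))
    raise RuntimeError(f"All impersonation profiles failed (last status {last_status})")


# Stand-in for a real request: pretend only chrome133a gets through
def fake_fetch(profile):
    return (200, "<html>ok</html>") if profile == "chrome133a" else (403, "")


print(fetch_with_rotation(fake_fetch))
# ('chrome133a', '<html>ok</html>')
```

In a real scraper, fetch would open a session with _get_session(profile), request the page, and return resp.status_code and resp.text.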

Extracting the Hypernova JSON

Look for the <script> tag with a data-hypernova-key attribute and strip the HTML comment wrappers before parsing:

def _parse_hypernova(soup):
    """Extract search results from the Hypernova JSON blob."""
    script = soup.find("script", attrs={"data-hypernova-key": True})
    if not script or not script.string:
        return None

    text = script.string.strip()
    # Strip HTML comment wrappers
    if text.startswith("<!--"):
        text = text[4:]
    if text.endswith("-->"):
        text = text[:-3]

    return json.loads(text)

The JSON is wrapped in HTML comments (<!-- ... -->), which you need to strip before parsing. Miss this step and json.loads() throws a JSONDecodeError.
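
To see the failure mode in isolation, here is a small self-contained demo. The raw string is a made-up miniature of the script contents, not real Yelp output, and strip_comment_wrapper is a hypothetical helper that mirrors the stripping logic in _parse_hypernova:

```python
import json


def strip_comment_wrapper(text):
    """Remove a leading <!-- and trailing --> around an embedded JSON blob."""
    text = text.strip()
    if text.startswith("<!--"):
        text = text[4:]
    if text.endswith("-->"):
        text = text[:-3]
    return text.strip()


# Made-up sample shaped like the comment-wrapped script contents
raw = '<!--{"legacyProps": {"searchAppProps": {}}}-->'

try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    print("Unstripped input fails:", type(exc).__name__)

data = json.loads(strip_comment_wrapper(raw))
print(list(data))  # ['legacyProps']
```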

Extracting business data from results

Business listings are deeply nested in the Hypernova JSON, and sponsored results need special URL decoding. The addresses dict is extracted separately from the <script data-apollo-state> tag on the same page (see the full source for that extraction code):

def _extract_businesses(data, addresses):
    components = (
        data.get("legacyProps", {})
        .get("searchAppProps", {})
        .get("searchPageProps", {})
        .get("mainContentComponentsListProps", [])
    )

    businesses = []
    for comp in components:
        srb = comp.get("searchResultBusiness")
        if not srb:
            continue

        alias = srb.get("alias", "")
        is_ad = comp.get("isAd", False)

        # Sponsored results use redirect URLs -- decode to get the real alias
        biz_url = srb.get("businessUrl", "")
        if "/adredir" in biz_url:
            match = re.search(r"redirect_url=([^&]+)", biz_url)
            if match:
                alias = unquote(match.group(1)).split("/biz/")[-1].split("?")[0]

        categories = [unescape(c.get("title", "")) for c in (srb.get("categories") or [])]

        businesses.append({
            "rank": comp.get("ranking"),
            "name": unescape(srb.get("name", "")),
            "alias": alias,
            "url": f"https://www.yelp.com/biz/{alias}",
            "biz_id": comp.get("bizId", ""),
            "rating": srb.get("rating"),
            "review_count": srb.get("reviewCount"),
            "price_range": srb.get("priceRange", ""),
            "categories": categories,
            "phone": srb.get("phone", ""),
            "neighborhoods": srb.get("neighborhoods", []),
            "address": addresses.get(alias, ""),
            "is_ad": is_ad,
            "is_closed": srb.get("isClosed", False),
        })

    return businesses

2 parsing pitfalls to watch for in the Hypernova data:

HTML entities in Hypernova data. Category names come through with HTML entities (like &amp; instead of &), so "Coffee & Tea" appears as "Coffee &amp; Tea" in the raw data. Always run html.unescape() on text extracted from Hypernova JSON.
Sponsored results use redirect URLs. Ad listings don't link to /biz/{alias} directly – they use /adredir?redirect_url=… with the real URL encoded in the query string.
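
Both pitfalls can be reproduced with the standard library alone. In this sketch, the ad redirect URL, its business slug, and its extra query parameters are invented for illustration, and alias_from_ad_url is a hypothetical helper that applies the same decoding as _extract_businesses:

```python
import re
from html import unescape
from urllib.parse import unquote

# Pitfall 1: entity-encoded text
print(unescape("Coffee &amp; Tea"))  # Coffee & Tea

# Pitfall 2: sponsored results hide the real URL in a query parameter.
# Hypothetical redirect URL; the slug and parameters are made up.
ad_url = (
    "/adredir?ad_business_id=abc&redirect_url="
    "https%3A%2F%2Fwww.yelp.com%2Fbiz%2Fexample-tacos-los-angeles%3Foverride_cta%3DOrder"
)


def alias_from_ad_url(biz_url):
    """Return the business alias from an /adredir URL, or None if absent."""
    match = re.search(r"redirect_url=([^&]+)", biz_url)
    if not match:
        return None
    # Decode the percent-encoding, then keep only the slug after /biz/
    return unquote(match.group(1)).split("/biz/")[-1].split("?")[0]


print(alias_from_ad_url(ad_url))  # example-tacos-los-angeles
```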

Search pagination – single session is critical

Pagination for Yelp search requires the same session with cookies across all pages. If you create a new session for each page, Yelp's bot detection is more likely to flag the requests because cookies and connection state don't carry over. Here's how to keep a single session across all pages:

def scrape_search(query, location, max_pages=None, delay=5.0):
    # ONE session for the entire search -- cookies carry over
    session = _get_session("safari2601")

    first_url = f"https://www.yelp.com/search?find_desc={quote_plus(query)}&find_loc={quote_plus(location)}"
    resp = session.get(first_url, headers={"Referer": "https://www.google.com/"})

    # ... parse the first page, compute total_pages from result count ...

    prev_url = first_url

    for page in range(2, total_pages + 1):
        time.sleep(delay + random.uniform(0, delay * 0.3))
        offset = (page - 1) * 10

        session.upkeep()  # Keep HTTP/2 connections alive

        page_url = f"{first_url}&start={offset}"
        # Referer must be the previous page URL -- mimics clicking "Next"
        resp = session.get(page_url, headers={"Referer": prev_url})

        prev_url = page_url

3 things that keep search pagination working:

session.upkeep() – call this between pagination requests. Without it, HTTP/2 connections go stale during the 5+ second delays, and the next request fails.
Referer chain – page 1 uses Referer: https://www.google.com/ (user arrived from search). Page 2+ uses the previous Yelp page URL.
5-second delay minimum – search pages need longer delays than other endpoints. The code defaults to 5 seconds (configurable via --delay), and the retry logic handles 403 responses automatically if the delay is too short.
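
The simplified listing leaves out how total_pages is computed. One possible approach is sketched below, assuming the total result count has already been read from the first page (where that number lives in the JSON is not covered here). plan_pages is a hypothetical helper name:

```python
import math

RESULTS_PER_PAGE = 10  # each search page returns 10 results


def plan_pages(total_results, max_pages=None):
    """Return the list of `start` offsets to request, first page included."""
    total_pages = math.ceil(total_results / RESULTS_PER_PAGE)
    if max_pages is not None:
        total_pages = min(total_pages, max_pages)
    return [(page - 1) * RESULTS_PER_PAGE for page in range(1, total_pages + 1)]


print(plan_pages(37))                 # [0, 10, 20, 30]
print(plan_pages(240, max_pages=3))   # [0, 10, 20]
```

Offset 0 corresponds to the first page, which is requested without a start parameter.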

Running the search scraper

The code blocks above are simplified for readability. See the full yelp_search.py source for the complete working script with error handling and CLI parsing.

Pass a search term, location, and optional page limit. The scraper outputs ranked results as JSON or CSV:

python3 yelp_search.py "tacos" "Los Angeles, CA" --max-pages 3

The output JSON contains ranked business listings with all extracted fields:

[
  {
    "rank": 1,
    "name": "Avenue 26 Tacos",
    "alias": "avenue-26-tacos-los-angeles-2",
    "url": "https://www.yelp.com/biz/avenue-26-tacos-los-angeles-2",
    "biz_id": "boqeEN38XuEKimgKisrqSA",
    "rating": 4.4,
    "review_count": 650,
    "price_range": "$",
    "categories": ["Food Trucks"],
    "phone": "(213) 375-3300",
    "neighborhoods": ["Little Tokyo"],
    "address": "353 S Alameda St, Los Angeles",
    "is_ad": false,
    "is_closed": false
  }
]
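
For CSV output, list-valued fields such as categories and neighborhoods have to be flattened into single cells. The full script's CSV logic isn't shown here, so this is a sketch of one way to do it. rows_to_csv and the "; " separator are assumptions:

```python
import csv
import io


def rows_to_csv(businesses):
    """Serialize search results to CSV text, joining list fields with '; '."""
    if not businesses:
        return ""
    fieldnames = list(businesses[0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for biz in businesses:
        # Lists become one delimited string; scalars pass through unchanged
        row = {
            key: "; ".join(value) if isinstance(value, list) else value
            for key, value in biz.items()
        }
        writer.writerow(row)
    return buffer.getvalue()


# Subset of the fields from the sample output
sample = [{
    "rank": 1,
    "name": "Avenue 26 Tacos",
    "categories": ["Food Trucks"],
    "neighborhoods": ["Little Tokyo"],
    "is_ad": False,
}]
print(rows_to_csv(sample))
```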

Extracting business details from Yelp business pages

The business details scraper parses the Apollo Client cache, the ROOT_QUERY JSON blob from data source 2. This gives you the full set of business attributes that search results don't include.

Parsing the Apollo cache

Identify the Apollo cache by size (50KB+) and the presence of ROOT_QUERY:

def _parse_apollo_cache(soup):
    """Find and parse the Apollo state cache -- it's the biggest JSON blob on the page."""
    for tag in soup.find_all("script", type="application/json"):
        text = tag.string or ""
        if len(text) > 50000 and "ROOT_QUERY" in text:
            clean = unescape(text).strip()
            if clean.startswith("<!--"):
                clean = clean[4:]
            if clean.endswith("-->"):
                clean = clean[:-3]
            return json.loads(clean.strip())
    return None

The page has multiple <script type="application/json"> tags, but the Apollo cache is the big one (50KB+). Check for ROOT_QUERY to identify it. Like the Hypernova data, it may be wrapped in HTML comments and contains encoded entities.

Resolving Apollo cache references

The Apollo cache uses a normalized format where related entities aren't nested. Instead, they're referenced by key:

{
  "Business:abc123": {
    "name": "Flour Bakery",
    "categories": [
      {"__ref": "Category:bakeries"},
      {"__ref": "Category:coffee"}
    ]
  },
  "Category:bakeries": {
    "title": "Bakeries",
    "alias": "bakeries"
  }
}

You need a resolver function to follow these references:

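def _resolve_ref(cache, ref_or_val):
    """Follow an Apollo {"__ref": key} pointer; return other values unchanged."""
    if isinstance(ref_or_val, dict) and "__ref" in ref_or_val:
        # Missing keys resolve to an empty dict so callers can chain .get()
        return cache.get(ref_or_val["__ref"], {})
    return ref_or_val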

Extracting operating hours

Hours need a content check, not just a truthiness check: ["Closed"] is a non-empty list, so bool(day_hours) is True even on days the business is closed:

def _extract_hours(biz):
    op = biz.get("operationHours")
    if not op:
        return []
    weekly = op.get("regularHoursMergedWithSpecialHoursForCurrentWeek", [])
    hours = []
    for day in weekly:
        day_hours = day.get("regularHours", [])
        # Edge case: bool(["Closed"]) is True in Python -- check content, not just truthiness
        is_open = bool(day_hours) and day_hours != ["Closed"]
        hours.append({
            "day": day.get("dayOfWeekShort", ""),
            "hours": day_hours,
            "is_open": is_open,
        })
    return hours
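
The edge case is easy to verify in isolation. In this sketch, is_open is a hypothetical helper that uses the same expression as _extract_hours:

```python
def is_open(day_hours):
    """True only for a non-empty hours list that isn't the ["Closed"] marker."""
    return bool(day_hours) and day_hours != ["Closed"]


for day_hours in (["7:00 AM - 7:00 PM"], ["Closed"], []):
    # Columns: raw value, plain truthiness, content-aware check
    print(day_hours, bool(day_hours), is_open(day_hours))
```

The middle row is the trap: plain truthiness reports True for ["Closed"], while the content-aware check reports False.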

Extracting photos

Photos are the trickiest part of the Apollo cache. They aren't stored under a simple key. Instead, they're nested under biz["media"] with GraphQL argument syntax in the key names:

def _extract_photos(biz, cache, limit=10):
    photos = []
    media = biz.get("media", {})

    for key in media:
        if "orderedMediaItems" not in key:
            continue
        edges = media[key].get("edges", [])
        for edge in edges:
            node = _resolve_ref(cache, edge.get("node", {}))

            # Skip non-photo items (videos also appear here)
            if node.get("__typename") != "BusinessPhoto":
                continue

            # Photo URLs use parameterized keys like url({"size":"LARGE"})
            photo_url = ""
            photo_url_obj = node.get("photoUrl", {})
            if isinstance(photo_url_obj, dict):
                for url_key, url_val in photo_url_obj.items():
                    if isinstance(url_val, str) and url_val.startswith("http"):
                        if "LARGE" in url_key or "ORIGINAL" in url_key:
                            photo_url = url_val
                            break

            photos.append({
                "id": node.get("encid", ""),
                "caption": node.get("caption") or "",  # or "" because None is common
                "url": photo_url,
            })
            if len(photos) >= limit:
                break
    return photos

3 edge cases in the photo extraction logic:

Videos mixed with photos. The orderedMediaItems list contains both BusinessPhoto and BusinessVideo items. Filter by __typename.
Parameterized URL keys. Photo URLs aren't under a simple url key – they're under url({"size":"LARGE"}). You need to search for the key containing "LARGE" or "ORIGINAL."
Null captions. node.get("caption", "") returns None when the JSON value is null, not "". Use node.get("caption") or "" instead.
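
The null-caption and parameterized-key cases can be checked with a few lines. The node and URL values below are invented samples on example.com, and pick_large_url is a hypothetical helper that mirrors the key search in _extract_photos:

```python
import json

# JSON null becomes Python None, and dict.get() only uses its default
# when the key is missing, not when the value is None
node = json.loads('{"encid": "abc", "caption": null}')
print(repr(node.get("caption", "")))    # None
print(repr(node.get("caption") or ""))  # ''

# Made-up sample: key names embed the GraphQL arguments
photo_url_obj = {
    'url({"size":"SMALL"})': "https://example.com/s.jpg",
    'url({"size":"LARGE"})': "https://example.com/l.jpg",
}


def pick_large_url(photo_url_obj):
    """Return the first http URL whose key mentions LARGE or ORIGINAL."""
    for key, value in photo_url_obj.items():
        if isinstance(value, str) and value.startswith("http"):
            if "LARGE" in key or "ORIGINAL" in key:
                return value
    return ""


print(pick_large_url(photo_url_obj))  # https://example.com/l.jpg
```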

Running the business scraper

Pass any Yelp business URL. The scraper extracts the Apollo cache and outputs a single JSON object with the full profile. See the full yelp_business.py source for the complete script.

python3 yelp_business.py "https://www.yelp.com/biz/flour-bakery-cafe-boston"

The output is a single JSON object with the full business profile (truncated here for readability):

{
  "biz_id": "-5gWvrcKOPmhlcZju3tpbw",
  "name": "Flour Bakery + Café",
  "alias": "flour-bakery-café-boston-4",
  "url": "https://www.yelp.com/biz/flour-bakery-café-boston-4",
  "is_closed": false,
  "rating": 4.3,
  "review_count": 1436,
  "price_range": "$$",
  "phone": "(617) 338-4333",
  "address": {
    "line1": "12 Farnsworth St",
    "city": "Boston",
    "state": "MA",
    "postal_code": "02210",
    "country": "US"
  },
  "neighborhoods": ["Waterfront", "South Boston"],
  "coordinates": { "latitude": 42.35123, "longitude": -71.048747 },
  "categories": [
    { "title": "Bakeries", "alias": "bakeries" },
    { "title": "Coffee & Tea", "alias": "coffee" },
    { "title": "Sandwiches", "alias": "sandwiches" }
  ],
  "hours": [
    { "day": "Mon", "hours": ["7:00 AM - 7:00 PM"], "is_open": true },
    { "day": "Tue", "hours": ["7:00 AM - 7:00 PM"], "is_open": true },
    { "day": "Wed", "hours": ["7:00 AM - 7:00 PM"], "is_open": true },
    // ... Thu, Fri, Sun omitted ...
    { "day": "Sat", "hours": ["8:00 AM - 6:00 PM"], "is_open": true }
  ],
  "website": "https://flourbakery.com",
  "attributes": [
    { "name": "Offers delivery", "alias": "RestaurantsDelivery", "is_active": true },
    { "name": "Free Wi-Fi", "alias": "wifi_options", "is_active": true },
    { "name": "Outdoor seating", "alias": "has_outdoor_seating", "is_active": true }
  ],
  "photos": [
    {
      "id": "iKA2Ynabpt9q9pQ7iLpEcw",
      "caption": "Broccoli melt (half size)",
      "url": "https://s3-media0.fl.yelpcdn.com/bphoto/iKA2Ynabpt9q9pQ7iLpEcw/l.jpg"
    }
  ]
}

Scraping Yelp reviews with the GraphQL API

The reviews scraper uses Yelp's internal GraphQL batch endpoint, the same one the frontend calls when you scroll through reviews on a business page. This returns structured data directly.

Getting the business ID

Before you can query reviews, you need the encBizId (encrypted business ID). It's exposed in the yelp-biz-id meta tag of the business page HTML:

def extract_biz_id(session, url):
    resp = session.get(url, headers={"Referer": "https://www.google.com/"})
    soup = BeautifulSoup(resp.text, "html.parser")

    meta = soup.find("meta", attrs={"name": "yelp-biz-id"})
    if not meta:
        raise RuntimeError("Could not find yelp-biz-id meta tag.")

    enc_biz_id = str(meta["content"])
    return enc_biz_id

Building the GraphQL payload

The request uses a persisted query hash (documentId) instead of raw GraphQL, and pagination cursors are base64-encoded:

from base64 import b64encode

GQL_URL = "https://www.yelp.com/gql/batch"
DOC_ID = "ef51f33d1b0eccc958dddbf6cde15739c48b34637a00ebe316441031d4bf7681"

def build_gql_payload(enc_biz_id, offset=0):
    variables = {
        "encBizId": enc_biz_id,
        "reviewsPerPage": 10,
        "selectedReviewEncId": "",
        "hasSelectedReview": False,
        "sortBy": "DATE_DESC",
        "languageCode": "en",
        "ratings": [5, 4, 3, 2, 1],
        "isSearching": False,
        "isTranslating": False,
        "translateLanguageCode": "en",
        "reactionsSourceFlow": "businessPageReviewSection",
        "minConfidenceLevel": "HIGH_CONFIDENCE",
        "highlightType": "",
        "highlightIdentifier": "",
        "isHighlighting": False,
    }

    if offset > 0:
        token = b64encode(
            json.dumps({"version": 1, "type": "offset", "offset": offset}).encode()
        ).decode()
        variables["after"] = token

    return [{
        "operationName": "GetBusinessReviewFeed",
        "variables": variables,
        "extensions": {
            "operationType": "query",
            "documentId": DOC_ID,
        },
    }]

Key parameters in this payload:

documentId – a stable hash of the GraphQL query stored on Yelp's server. You don't send the actual query text – just this hash. If this hash stops working, see the recovery steps below.
after – the pagination cursor, sent only when the offset is greater than 0. It takes a base64-encoded JSON object: {"version": 1, "type": "offset", "offset": 10}.
sortBy: "DATE_DESC" – returns newest reviews first. Other options: "RELEVANCE_DESC", "ELITES_DESC".
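
As a minimal sketch of the token format described above, the helpers below encode an offset into an after token and decode one back into JSON. The function names encode_offset_token and decode_offset_token are illustrative and don't come from the scraper files.

```python
import json
from base64 import b64decode, b64encode


def encode_offset_token(offset: int) -> str:
    """Build the base64 pagination token for a given review offset."""
    raw = json.dumps({"version": 1, "type": "offset", "offset": offset})
    return b64encode(raw.encode()).decode()


def decode_offset_token(token: str) -> dict:
    """Reverse the encoding, e.g. to inspect a token copied from DevTools."""
    return json.loads(b64decode(token).decode())


token = encode_offset_token(10)
print(token)
print(decode_offset_token(token))
```

Decoding a token captured in the browser is a quick way to confirm that the format hasn't changed before a long scraping run.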

If documentId stops working: open any Yelp business page in Chrome DevTools, go to the Network tab, filter by batch, and look for the GetBusinessReviewFeed request. The current hash is in the request payload under extensions.documentId.

Chrome DevTools Network tab on a Yelp business page, showing the batch request payload with the documentId hash highlighted under extensions

Making the GraphQL request

The request needs specific headers to pass Yelp's server-side validation, especially x-apollo-operation-name:

def fetch_reviews_page(session, enc_biz_id, offset, referer):
    headers = {
        "Content-Type": "application/json",
        "Origin": "https://www.yelp.com",
        "Referer": referer,
        "x-apollo-operation-name": "GetBusinessReviewFeed",
    }

    payload = build_gql_payload(enc_biz_id, offset)
    resp = session.post(GQL_URL, json=payload, headers=headers)

    data = resp.json()
    if isinstance(data, list):
        data = data[0]

    return data["data"]["business"]["reviews"]

Include the x-apollo-operation-name header. Without it, Yelp rejects the request. The Origin and Referer headers help the request look like it's coming from the actual Yelp frontend.
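
To walk through several pages, the offset has to advance by the page size on each call. The sketch below shows one way to drive that loop. It takes any fetch_page(offset) callable that returns a list of review items, so it runs here against a stand-in fetcher with no network access. The name iter_review_pages and the stop condition (an empty or short page) are assumptions for illustration; the real response may expose its own end-of-results signal.

```python
from typing import Callable, Iterator, List, Optional


def iter_review_pages(
    fetch_page: Callable[[int], List[dict]],
    page_size: int = 10,
    max_pages: Optional[int] = None,
) -> Iterator[List[dict]]:
    """Yield pages of reviews until a page comes back short or empty."""
    page = 0
    while max_pages is None or page < max_pages:
        items = fetch_page(page * page_size)  # offset = page index * page size
        if not items:
            break
        yield items
        if len(items) < page_size:
            break  # a short page suggests there is nothing left to fetch
        page += 1


# Stand-in fetcher for demonstration: 25 fake reviews, no HTTP involved.
FAKE_REVIEWS = [{"encid": f"review-{i}"} for i in range(25)]


def fake_fetch(offset: int) -> List[dict]:
    return FAKE_REVIEWS[offset:offset + 10]


pages = list(iter_review_pages(fake_fetch))
print([len(p) for p in pages])  # [10, 10, 5]
```

In a real run, fetch_page would wrap the GraphQL call and sleep between pages.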

Parsing review data

Each review node from the GraphQL response contains nested objects for author, text, and feedback. Flatten them into a single dict:

def parse_review(node):
    author = node.get("author", {})
    text_obj = node.get("text", {})
    feedback = node.get("feedback", {})

    return {
        "review_id": node.get("encid", ""),
        "rating": node.get("rating"),
        "date": node.get("createdAt", {}).get("utcDateTime", ""),
        "text": text_obj.get("full", ""),
        "language": text_obj.get("language", ""),
        "author_name": author.get("displayName", ""),
        "author_location": author.get("displayLocation", ""),
        "useful_count": feedback.get("usefulCount", 0),
        "funny_count": feedback.get("funnyCount", 0),
        "cool_count": feedback.get("coolCount", 0),
        "photo_count": len(node.get("businessPhotos") or []),
    }

Running the reviews scraper

Pass a business URL and an optional page limit. Each page returns 10 reviews sorted by date. See the full yelp_scraper.py source for the complete script.

python3 yelp_scraper.py "https://www.yelp.com/biz/flour-bakery-cafe-boston" --max-pages 3

Each review includes the full text, rating, date, author details, and engagement counts:

[
  {
    "review_id": "7gQYVzBSXupXrsS2aNo-Vw",
    "rating": 4,
    "date": "2026-02-18T18:10:01Z",
    "text": "I recently ordered delivery from Flour Bakery + Cafe(Boston location), and it was absolutely delicious! I tried the breakfast egg sandwich and the chicken tikka masala naan...",
    "language": "en",
    "author_name": "Neha W.",
    "author_location": "Belmont, MA",
    "useful_count": 0,
    "funny_count": 0,
    "cool_count": 0,
    "photo_count": 1
  },
  {
    "review_id": "NFCMmAZ4Id1BWd6Cam1QFw",
    "rating": 5,
    "date": "2026-01-16T17:31:37Z",
    "text": "Good food. Lots of options. I got the achiote chicken as a salad. I appreciated that I could get any sandwich as a salad",
    "language": "en",
    "author_name": "Jess L.",
    "author_location": "Seabrook, NH",
    "useful_count": 0,
    "funny_count": 0,
    "cool_count": 0,
    "photo_count": 1
  }
]

This is data that Yelp's Places API doesn't expose: full review text, vote counts (useful, funny, cool), and author details.

Scraping not-recommended reviews

This is the scraper most Yelp tutorials skip entirely. Yelp suppresses reviews that its algorithm deems unreliable. They aren't included in the star rating, and they're hidden behind a small link at the bottom of the reviews section.

Not-recommended reviews use server-rendered HTML at /not_recommended_reviews/{business-slug}. There are 2 separate sections, each with its own pagination parameter:

Not-recommended reviews – paginated with ?not_recommended_start=10
Removed reviews (TOS violations) – paginated with ?removed_start=10
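
The URL scheme above can be sketched as a small builder. The helper names slug_from_url and build_nr_url are hypothetical, and the sketch assumes the first page of each section is requested without a start parameter.

```python
from urllib.parse import urlencode, urlparse

NR_BASE = "https://www.yelp.com/not_recommended_reviews"

# Each section has its own pagination parameter.
SECTION_PARAMS = {
    "not_recommended": "not_recommended_start",
    "removed": "removed_start",
}


def slug_from_url(url_or_slug: str) -> str:
    """Accept a full /biz/ URL or a bare slug and return the slug."""
    if "://" not in url_or_slug:
        return url_or_slug.strip("/")
    path = urlparse(url_or_slug).path  # e.g. /biz/flour-bakery-cafe-boston
    return path.rstrip("/").split("/")[-1]


def build_nr_url(slug: str, section: str = "not_recommended", start: int = 0) -> str:
    """Build the not-recommended page URL for one section and offset."""
    url = f"{NR_BASE}/{slug}"
    if start > 0:
        url += "?" + urlencode({SECTION_PARAMS[section]: start})
    return url


slug = slug_from_url("https://www.yelp.com/biz/flour-bakery-cafe-boston")
print(build_nr_url(slug))
print(build_nr_url(slug, "removed", start=10))
```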

Parsing a review from HTML

Each review sits in a div with a data-review-id attribute. Ratings, dates, and text live in predictable CSS classes:

def _parse_review_div(div, section_type="not_recommended"):
    review_id = div.get("data-review-id", "")
    if not review_id:
        return None

    # Rating from the i-stars div title attribute
    rating = None
    stars_div = div.find("div", class_=re.compile(r"i-stars"))
    if stars_div:
        match = re.search(r"([\d.]+)\s*star", stars_div.get("title", ""))
        if match:
            rating = float(match.group(1))

    # Date -- extract just the M/D/YYYY portion
    date_str = ""
    date_span = div.find("span", class_="rating-qualifier")
    if date_span:
        raw_date = date_span.get_text(strip=True)
        # The span sometimes contains extra text like "Updated review"
        date_match = re.match(r"(\d{1,2}/\d{1,2}/\d{4})", raw_date)
        date_str = date_match.group(1) if date_match else raw_date

    # Review text
    text_p = div.find("p", attrs={"lang": True})
    text = text_p.get_text(strip=True) if text_p else ""

    # Skip empty placeholder reviews (Yelp stripped all content)
    if rating is None and not text:
        return None

    return {
        "review_id": review_id,
        "section": section_type,
        "rating": rating,
        "date": date_str,
        "text": text,
        # ... author fields ...
    }

A few edge cases to handle:

Date pollution. The span.rating-qualifier element sometimes captures extra text beyond the date: "11/28/2019Updated review". Use a regex to extract just the date part.
Empty placeholder reviews. Some review divs have an ID but Yelp has stripped all content. Skip these if both rating and text are empty.
2 separate pagination loops. The not-recommended section and the removed section paginate independently. You need 2 loops with different query parameters.
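
The date-pollution case can be checked in isolation. The sketch below applies the same regex to a polluted string and, as an optional extra step that isn't part of the original scraper, converts M/D/YYYY to ISO format so NR dates can sit next to GraphQL dates. The names clean_nr_date and to_iso_date are illustrative.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"(\d{1,2}/\d{1,2}/\d{4})")


def clean_nr_date(raw: str) -> str:
    """Keep only the leading M/D/YYYY part; fall back to the raw text."""
    match = DATE_RE.match(raw.strip())
    return match.group(1) if match else raw


def to_iso_date(date_str: str) -> str:
    """Convert M/D/YYYY to YYYY-MM-DD."""
    return datetime.strptime(date_str, "%m/%d/%Y").date().isoformat()


cleaned = clean_nr_date("11/28/2019Updated review")
print(cleaned)               # 11/28/2019
print(to_iso_date(cleaned))  # 2019-11-28
```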

Running the not-recommended scraper

Pass a business URL or alias. The scraper fetches both not-recommended and removed sections with their separate pagination. See the full yelp_not_recommended.py source for the complete script.

python3 yelp_not_recommended.py "https://www.yelp.com/biz/flour-bakery-cafe-boston" --max-pages 2

Not-recommended reviews tend to skew older, as the NR page accumulates filtered content over time. Here's what the output looks like:

[
  {
    "review_id": "DK8feUfZ6_JP_7XwFztwVg",
    "section": "not_recommended",
    "rating": 1.0,
    "date": "7/23/2012",
    "text": "I have been to flour numerous times and love the food, usually. The service has always been a case study for the most inefficient cafe ever, and today was no different.",
    "language": "en",
    "author_name": "Tera L.",
    "author_location": "Boston, MA",
    "author_friends": 0,
    "author_reviews": 1,
    "author_photos": 0
  },
  {
    "review_id": "sRLS_SiuC66J5IvjRuxaBw",
    "section": "removed",
    "rating": 1.0,
    "date": "4/22/2012",
    "text": "This review has been removed for violating our Terms of Service",
    "language": "en",
    "author_name": "Jim M.",
    "author_location": "Alexandria, VA",
    "author_friends": 77,
    "author_reviews": 4,
    "author_photos": 0
  }
]

Storing and exporting scraped Yelp data

All 4 scrapers save output as JSON by default. The search, reviews, and not-recommended scrapers also support CSV as an optional format (the business scraper outputs a single nested object, so it's JSON-only). JSON preserves nested structures (categories, hours, photos) without data loss, while CSV flattens everything for spreadsheet analysis.

For JSON, use ensure_ascii=False in json.dump() to keep non-ASCII characters (café, résumé) readable instead of escape sequences. For CSV exports, flatten nested fields before writing because csv.DictWriter doesn't handle nested objects:

for r in results:
    r["categories"] = ", ".join(r["categories"])
    r["neighborhoods"] = ", ".join(r["neighborhoods"])
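
For a fuller picture, here is a sketch of the flatten-then-write step using only the standard library. The sample rows are made up for illustration, and flatten_row and to_csv are hypothetical helper names. Unlike the loop above, this version copies each row instead of mutating it, and it tolerates None where a list was expected.

```python
import csv
import io

# Made-up sample rows, shaped like search results with list fields.
results = [
    {"name": "Café Uno", "categories": ["Bakeries", "Cafes"], "neighborhoods": ["South End"]},
    {"name": "Dos", "categories": [], "neighborhoods": None},
]


def flatten_row(row: dict) -> dict:
    """Return a copy of the row with lists joined and None replaced by ''."""
    flat = {}
    for key, value in row.items():
        if isinstance(value, (list, tuple)):
            flat[key] = ", ".join(str(v) for v in value)
        elif value is None:
            flat[key] = ""
        else:
            flat[key] = value
    return flat


def to_csv(rows) -> str:
    """Serialize rows to CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    flat_rows = [flatten_row(r) for r in rows]
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(flat_rows[0].keys()))
    writer.writeheader()
    writer.writerows(flat_rows)
    return buffer.getvalue()


print(to_csv(results))
```

When writing to a file instead of a string, open it with newline="" and encoding="utf-8" so accented characters and line endings survive.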

Data validation considerations

Validate output before storing:

Ratings should be between 1.0 and 5.0 – anything outside this range means a parsing error
Review IDs should be unique – duplicates indicate a pagination bug
Dates should parse cleanly – GraphQL reviews use ISO 8601 (2026-02-18T18:10:01Z), NR reviews use M/D/YYYY
Photo URLs should start with https:// – an empty URL means the extraction missed a video item
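
The first 3 checks can be sketched as a single pass over the scraped reviews. This is an illustrative helper, not code from the scraper files: validate_reviews and parse_review_date are hypothetical names, and the field names follow the sample output shown earlier.

```python
from datetime import datetime


def parse_review_date(value: str) -> datetime:
    """Accept ISO 8601 (GraphQL reviews) or M/D/YYYY (NR reviews)."""
    try:
        # Swap a trailing Z for an offset so older Python versions accept it.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return datetime.strptime(value, "%m/%d/%Y")


def validate_reviews(reviews) -> list:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    seen = set()
    for review in reviews:
        rid = review.get("review_id", "")
        rating = review.get("rating")
        date = str(review.get("date") or "")

        if rating is None or not 1.0 <= rating <= 5.0:
            problems.append(f"{rid}: rating out of range ({rating!r})")
        if rid in seen:
            problems.append(f"{rid}: duplicate review ID")
        seen.add(rid)
        try:
            parse_review_date(date)
        except ValueError:
            problems.append(f"{rid}: unparseable date ({date!r})")
    return problems


sample = [
    {"review_id": "a", "rating": 4, "date": "2026-02-18T18:10:01Z"},
    {"review_id": "b", "rating": 1.0, "date": "7/23/2012"},
    {"review_id": "b", "rating": 7, "date": "yesterday"},
]
for problem in validate_reviews(sample):
    print(problem)
```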

For more on data cleaning and storing scraped data, check out the dedicated guides.

Bypassing Yelp's anti-scraping measures

Yelp's bot detection checks TLS fingerprints, browser behavior, and request patterns, and it's aggressive. Blocks usually show up in one of 2 ways:

A CAPTCHA challenge. Yelp asks you to prove you're human:

Yelp's CAPTCHA challenge page showing "We want to make sure you are not a robot" with a slider puzzle

A hard block with no option to retry. Your IP or fingerprint is flagged:

Yelp's hard block page showing "You have been blocked" with an explanation about suspicious browser behavior

Why standard HTTP libraries fail on Yelp

When a browser connects to Yelp, the parameters of its TLS handshake (cipher suites, extensions, and their order) can be hashed into a fingerprint (JA3/JA4) that identifies the browser family and version. Python's Requests library produces its own fingerprint, which is well known to bot detection systems. The detection layer checks this fingerprint during the handshake, before any HTTP request is sent. If it doesn't match a known browser, you get blocked.

curl_cffi solves this by replicating the full TLS behavior of real browsers (cipher suites, extensions, handshake order), not just the User-Agent header.

Browser impersonation rotation

The first technique is rotating between browser profiles on each retry:

RETRY_IMPERSONATIONS = ["safari2601", "safari184", "safari260", "chrome133a", "safari170"]

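def _request_with_retry(session, method, url, max_retries=5, **kwargs):
    resp = None
    for attempt in range(max_retries):
        # Cycle to a different browser fingerprint on every attempt
        imp = RETRY_IMPERSONATIONS[attempt % len(RETRY_IMPERSONATIONS)]
        try:
            resp = getattr(session, method)(url, impersonate=imp, **kwargs)
        except RequestException:
            # Network-level failure: back off, then retry
            wait = (attempt + 1) * 3 + random.uniform(1, 3)
            time.sleep(wait)
            continue

        if resp.status_code in (200, 201):
            return resp

        # 403/429 signal a block or rate limit: back off and switch fingerprint
        if resp.status_code in (403, 429):
            wait = (attempt + 1) * 3 + random.uniform(1, 3)
            time.sleep(wait)
            continue

        # Any other status goes back to the caller unchanged
        return resp
    # Retries exhausted: return the last response (None if every attempt raised)
    return resp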

This retry strategy combines 2 techniques:

Impersonation cycling – each retry uses a different browser fingerprint. If Safari 26.0.1 gets blocked, the next attempt tries Safari 18.4, then Chrome 133, etc.
Escalating backoff – the base wait grows by 3 seconds per attempt (3s → 6s → 9s → 12s → 15s), with 1–3 seconds of random jitter added on top, so the first retry waits roughly 4–6 seconds. Yelp returns 403 on blocks, so the retry catches them and switches fingerprints automatically.
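
To see the wait ranges the formula actually produces, the sketch below isolates the backoff calculation from the retry function. backoff_wait is an illustrative name; the arithmetic is copied from the code above.

```python
import random


def backoff_wait(attempt: int, rng=random) -> float:
    """Base wait of 3s per attempt (1-indexed) plus 1-3s of random jitter."""
    return (attempt + 1) * 3 + rng.uniform(1, 3)


for attempt in range(5):
    base = (attempt + 1) * 3
    # The jitter adds between 1 and 3 seconds to the base wait.
    print(f"attempt {attempt}: {base + 1}-{base + 3}s")
```

If every attempt fails, the base waits alone add up to 45 seconds before the function gives up, which is worth knowing when sizing batch jobs.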

The code blocks in the 4 scraper sections use direct session.get()/session.post() calls for clarity. The downloadable scraper files route every HTTP call through _request_with_retry.

Why residential proxies matter for Yelp

Yelp uses DataDome for bot detection, which goes beyond TLS fingerprinting. DataDome also evaluates IP reputation, and datacenter IPs get flagged because they belong to hosting providers, not real ISPs. Residential proxies route your requests through real ISP addresses, which score higher in IP reputation checks. With Decodo's rotating residential proxy, requests are also distributed across different IPs, so they don't cluster on a single range.

Page validation

Yelp blocks typically return HTTP 403, but as a safety net, validate that a 200 response contains real page content:

meta = soup.find("meta", attrs={"name": "yelp-biz-id"})
if not meta and len(resp.text) < 50000:
    raise RuntimeError("Got blocked -- page is a captcha/redirect")

Yelp business pages with full content are substantially larger than block or CAPTCHA pages. A response under 50KB with no yelp-biz-id meta tag means something went wrong: either a redirect, a partial load, or an edge case where the block came through as 200.
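
The same check can be wrapped in a reusable function. The sketch below uses a regex instead of BeautifulSoup so that it runs with the standard library alone; that swap is a simplification for illustration, and looks_blocked is a hypothetical name. The 50,000-character threshold is taken from the snippet above.

```python
import re

# Matches a <meta> tag whose name attribute is yelp-biz-id.
BIZ_ID_META = re.compile(r'<meta[^>]+name=["\']yelp-biz-id["\']', re.IGNORECASE)
MIN_PAGE_CHARS = 50_000


def looks_blocked(html: str) -> bool:
    """True when the page is small and lacks the yelp-biz-id meta tag."""
    has_meta = BIZ_ID_META.search(html) is not None
    return not has_meta and len(html) < MIN_PAGE_CHARS


print(looks_blocked("<html><body>Please verify you are human</body></html>"))  # True
print(looks_blocked('<html><meta name="yelp-biz-id" content="abc"></html>'))   # False
```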

Managed alternatives

If building and maintaining anti-bot bypass logic isn't your focus, Decodo offers 3 managed options:

Web Scraping API – handles proxy rotation, anti-bot bypass, and JavaScript rendering in a single call. You send a URL, get back the page content.
Site Unblocker – handles anti-bot at the proxy layer. Point your existing HTTP client at the endpoint and add a few lines of proxy config.
AI Parser – skip code entirely. Send a URL and a plain-language description, get back structured JSON.

Best practices for scraping Yelp

These patterns apply across all 4 scrapers and will help you avoid blocks at scale.

Rate limiting

| Endpoint | Recommended delay | Notes |
| --- | --- | --- |
| Search pages | 5s + random jitter | Code default; retry logic handles 403s if too short |
| Business pages | 2s + jitter (when batching) | Single requests don't need delays; batch scraping does |
| Reviews (GraphQL) | 1.5s + jitter | Code default for GraphQL pagination |
| Not-recommended reviews | 1.5s + jitter | Code default for HTML pagination |

Always add random jitter to avoid predictable request patterns. Each scraper uses a slightly different jitter range – here's the pattern from the search scraper:

wait = delay + random.uniform(0, delay * 0.3)
time.sleep(wait)
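
One way to keep these delays in a single place is a lookup keyed by endpoint, sketched below. The base delays come from the rate limiting recommendations in this section. Applying the search scraper's 30% jitter ratio to every endpoint is an assumption, since each scraper uses a slightly different range, and the names BASE_DELAYS, jittered_delay, and polite_sleep are illustrative.

```python
import random
import time

# Base delays in seconds, per endpoint.
BASE_DELAYS = {
    "search": 5.0,
    "business": 2.0,
    "reviews": 1.5,
    "not_recommended": 1.5,
}


def jittered_delay(endpoint: str, jitter_ratio: float = 0.3, rng=random) -> float:
    """Base delay plus up to jitter_ratio * delay of random jitter."""
    delay = BASE_DELAYS[endpoint]
    return delay + rng.uniform(0, delay * jitter_ratio)


def polite_sleep(endpoint: str) -> None:
    """Sleep for the jittered delay before the next request to this endpoint."""
    time.sleep(jittered_delay(endpoint))


print(round(jittered_delay("search"), 2))
```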

Keep scrapers independent

Run each scraper as a separate process. Reusing the same session across different endpoints can lead to unexpected blocks. Avoid combining them into a single orchestration script. If you need one, create a fresh session per scraper.

Test on small samples first

Run each scraper on 2–3 pages before scaling up. Inspect the JSON output for missing fields or unexpected values – check for None ratings, empty addresses, and HTML entities in text fields.

Handle optional fields gracefully

Not every business has a website, price range, or phone number. Use defensive extraction:

website = (ext.get("website") or {}).get("url", "")

The or {} pattern handles None values without crashing.
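
The reason a plain default isn't enough: dict.get only falls back to its default when the key is missing, not when the key is present with a None value. The sketch below shows the difference on made-up data and generalizes the pattern into a small helper. dig is a hypothetical name, and the phone field is invented for the example.

```python
# Made-up record: "website" is present but null, as JSON APIs often return it.
ext = {"website": None, "phone": {"formatted": "(617) 555-0100"}}

# A default argument does not help here, because the key exists.
try:
    ext.get("website", {}).get("url", "")
except AttributeError as exc:
    print("default-only lookup failed:", exc)

# The `or {}` pattern turns None into an empty dict before the second lookup.
website = (ext.get("website") or {}).get("url", "")
print(repr(website))  # ''


def dig(obj, *keys, default=""):
    """Walk nested dicts, treating None at any level as an empty dict."""
    for key in keys:
        obj = (obj or {}).get(key)
    return default if obj is None else obj


print(dig(ext, "phone", "formatted"))
print(repr(dig(ext, "website", "url")))
```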

Bottom line

You now have 4 standalone Python scrapers, one for each Yelp data source covered in this guide. Each file runs independently with no shared dependencies. Set up your .env with proxy credentials and start extracting.

The scrapers handle the hard parts: TLS fingerprint impersonation, session-based pagination, automatic retry on blocks, and Apollo cache resolution. The code blocks in this post are simplified for readability. The complete scraper files include full error handling, retry guards, and CLI argument parsing.

If you'd rather skip the infrastructure, Decodo's Web Scraping API handles proxy rotation and anti-bot bypass in a single API call. You send a URL, get back the page content.

About the author

Justinas Tamasevicius

Head of Engineering

Justinas Tamaševičius is Head of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.

Connect with Justinas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Is it legal to scrape Yelp?

Publicly available data can generally be accessed without logging in, and many pages on Yelp, such as business listings, reviews, and search results, fall into that category. However, Yelp's Terms of Service restrict automated access, so it's a good idea to review their guidelines before building large-scale scraping workflows, especially for commercial projects.

Does Yelp offer an official API for business data?

Yes, the Yelp Places API (formerly Fusion API) covers business search and details. But it has limitations: a 30-day free trial (5,000 calls, evaluation only), strict rate limits, only short review excerpts (not full text), and no access to not-recommended reviews. For larger datasets or complete reviews, direct scraping is the practical alternative.

Can I scrape Yelp without proxies?

For small one-off scrapes, you can run the scrapers without proxies – use curl_cffi for TLS impersonation and keep delays at 3–5 seconds between requests. For any repeatable or multi-business workflow, residential proxies reduce the likelihood of blocks because DataDome scores IP reputation.

How do I avoid CAPTCHAs while scraping Yelp?

Use a TLS-impersonating HTTP client like curl_cffi paired with residential proxies. This addresses the 2 detection vectors most relevant to non-browser scrapers: TLS fingerprinting and IP reputation. Keeping delays between requests (1.5–5 seconds, depending on the endpoint) further reduces detection risk.

Why do Requests get blocked, but curl_cffi works?

Python's Requests library has a TLS fingerprint (JA3/JA4 hash) that bot detection systems recognize as non-browser traffic. curl_cffi replicates the exact TLS behavior of real browsers (cipher suites, extensions, handshake order), so the fingerprint matches Safari or Chrome. Yelp uses DataDome for bot detection, which checks TLS fingerprints as one of its detection layers.

How many reviews can I scrape before getting blocked?

With residential proxies and the default delays, paginating through a single business's reviews is unlikely to trigger blocks. The risk increases with volume – scraping reviews across hundreds of businesses in a single session is where blocks become more likely. Rotating residential IPs and adding longer delays between businesses helps.
