Date: 2026-02-18

[Image: Yelp's not-recommended reviews page showing filtered reviews with the "not currently recommended" header]

Scraping Yelp search results

Search pages embed their listings in a Hypernova JSON blob. Each page returns 10 results, and pagination uses a start query parameter.

Creating a session with browser impersonation

Every scraper in this tutorial uses the same imports and session setup (individual scrapers add a few more as needed):

import csv
import json
import os
import re
import time
import random
from html import unescape
from urllib.parse import quote_plus, unquote
from pathlib import Path

from bs4 import BeautifulSoup
from curl_cffi import CurlOpt, requests
from curl_cffi.requests.exceptions import RequestException
from dotenv import load_dotenv

# Load settings such as PROXY_URL from a .env file next to this script
load_dotenv(Path(__file__).parent / ".env")

The curl_cffi session impersonates a real browser's TLS fingerprint:

RETRY_IMPERSONATIONS = ["safari2601", "safari184", "safari260", "chrome133a", "safari170"]


def _get_session(impersonate="safari2601"):
    proxy_url = os.environ.get("PROXY_URL")
    return requests.Session(
        impersonate=impersonate,
        proxy=proxy_url,
        timeout=(10, 30),
        curl_options={
            CurlOpt.TCP_KEEPALIVE: 1,
            CurlOpt.TCP_KEEPIDLE: 60,
            CurlOpt.TCP_KEEPINTVL: 30,
            CurlOpt.DNS_CACHE_TIMEOUT: 300,
        },
    )

The impersonate parameter tells curl_cffi which browser to mimic. safari2601 (Safari 26.0.1) is a reliable default. The TCP_KEEPALIVE options prevent stale connections during pagination loops.

Important: don't set User-Agent, Accept, or Accept-Language headers manually. The impersonation already handles these. Setting them yourself can create mismatches that expose you as a scraper.
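The RETRY_IMPERSONATIONS list suggests rotating to a different browser profile when a request is blocked. The following is a minimal sketch of that idea, not the tutorial's actual retry logic. The names fetch_with_rotation and fake_fetch are hypothetical, and the fetch callable stands in for whatever function opens a session with the given profile and returns a status code and body.

```python
def fetch_with_rotation(fetch, url, profiles):
    """Try each impersonation profile in order until one is not blocked.

    `fetch` is a hypothetical callable: fetch(url, profile) -> (status, body).
    """
    last_status = None
    for profile in profiles:
        status, body = fetch(url, profile)
        if status == 200:
            return profile, body
        # Treat 403 as "this fingerprint was blocked" and try the next one
        if status != 403:
            raise RuntimeError(f"unexpected status {status} for {url}")
        last_status = status
    raise RuntimeError(f"all profiles blocked (last status {last_status})")


# Stand-in for a real network call, so the sketch runs offline
def fake_fetch(url, profile):
    return (200, "<html>ok</html>") if profile == "safari260" else (403, "")


profile, body = fetch_with_rotation(
    fake_fetch,
    "https://www.yelp.com/search?find_desc=tacos",
    ["safari2601", "safari184", "safari260"],
)
print(profile)
```

Passing the fetch function in as an argument keeps the rotation logic testable without touching the network.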

Extracting the Hypernova JSON

Look for the <script> tag with a data-hypernova-key attribute and strip the HTML comment wrappers before parsing:

def _parse_hypernova(soup):
    """Extract search results from the Hypernova JSON blob."""
    script = soup.find("script", attrs={"data-hypernova-key": True})
    if not script or not script.string:
        return None
    text = script.string.strip()
    # Strip the HTML comment wrapper around the JSON
    if text.startswith("<!--"):
        text = text[4:]
    if text.endswith("-->"):
        text = text[:-3]
    return json.loads(text)

The JSON is wrapped in HTML comments (<!-- ... -->), which you need to strip before parsing. Miss this step and json.loads() raises a JSONDecodeError.
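To see the failure mode in isolation, here is a small sketch using a made-up miniature blob (the real Hypernova payload is far larger). It assumes Python 3.9+ for str.removeprefix and str.removesuffix.

```python
import json

# Made-up miniature blob in the same comment-wrapped shape
raw = '<!--{"legacyProps": {"searchAppProps": {}}}-->'

try:
    json.loads(raw)
    failed = False
except json.JSONDecodeError:
    failed = True  # "<" is not a valid start of a JSON document

# Remove the wrapper, then parse
clean = raw.strip().removeprefix("<!--").removesuffix("-->")
data = json.loads(clean)
print(failed, data)
```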

Extracting business data from results

Business listings are deeply nested in the Hypernova JSON, and sponsored results need special URL decoding. The addresses dict is extracted separately from the <script data-apollo-state> tag on the same page (see the full source for that extraction code):

def _extract_businesses(data, addresses):
    components = (
        data.get("legacyProps", {})
        .get("searchAppProps", {})
        .get("searchPageProps", {})
        .get("mainContentComponentsListProps", [])
    )
    businesses = []
    for comp in components:
        srb = comp.get("searchResultBusiness")
        if not srb:
            continue
        alias = srb.get("alias", "")
        is_ad = comp.get("isAd", False)
        biz_url = srb.get("businessUrl", "")
        # Sponsored results hide the real /biz/ URL inside a redirect parameter
        if "/adredir" in biz_url:
            match = re.search(r"redirect_url=([^&]+)", biz_url)
            if match:
                alias = unquote(match.group(1)).split("/biz/")[-1].split("?")[0]
        categories = [
            unescape(c.get("title", ""))
            for c in (srb.get("categories") or [])
        ]
        businesses.append({
            "rank": comp.get("ranking"),
            "name": unescape(srb.get("name", "")),
            "alias": alias,
            "url": f"https://www.yelp.com/biz/{alias}",
            "biz_id": comp.get("bizId", ""),
            "rating": srb.get("rating"),
            "review_count": srb.get("reviewCount"),
            "price_range": srb.get("priceRange", ""),
            "categories": categories,
            "phone": srb.get("phone", ""),
            "neighborhoods": srb.get("neighborhoods", []),
            "address": addresses.get(alias, ""),
            "is_ad": is_ad,
            "is_closed": srb.get("isClosed", False),
        })
    return businesses

2 parsing pitfalls to watch for in the Hypernova data:

HTML entities in Hypernova data. Category names come through with HTML entities (like &amp; instead of &), so "Coffee & Tea" appears as "Coffee &amp; Tea" in the raw data. Always run html.unescape() on text extracted from Hypernova JSON.

Sponsored results use redirect URLs. Ad listings don't link to /biz/{alias} directly – they use /adredir?redirect_url=… with the real URL encoded in the query string.
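Both pitfalls can be reproduced with the standard library alone. This sketch uses a hypothetical ad URL (the ad_business_id parameter and the business alias are invented for illustration) and swaps the regex for urllib.parse.parse_qs, which percent-decodes the value in one step.

```python
from html import unescape
from urllib.parse import parse_qs, urlsplit

# Pitfall 1: entities survive in the raw JSON text
print(unescape("Coffee &amp; Tea"))

# Pitfall 2: hypothetical ad URL in the /adredir?redirect_url=... shape
ad_url = (
    "/adredir?ad_business_id=abc"
    "&redirect_url=https%3A%2F%2Fwww.yelp.com%2Fbiz%2Fexample-tacos-los-angeles%3Fosq%3Dtacos"
)
# parse_qs percent-decodes the value, leaving the real business URL
target = parse_qs(urlsplit(ad_url).query)["redirect_url"][0]
# Taking .path drops the target's own query string (?osq=...)
alias = urlsplit(target).path.split("/biz/")[-1]
print(alias)
```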

Search pagination – single session is critical

Pagination for Yelp search requires the same session with cookies across all pages. If you create a new session for each page, Yelp's bot detection is more likely to flag the requests because cookies and connection state don't carry over. Here's how to keep a single session across all pages:

def scrape_search(query, location, max_pages=None, delay=5.0):
    # One session for the whole run so cookies and connection state carry over
    session = _get_session("safari2601")
    first_url = (
        f"https://www.yelp.com/search?find_desc={quote_plus(query)}"
        f"&find_loc={quote_plus(location)}"
    )
    resp = session.get(first_url, headers={"Referer": "https://www.google.com/"})
    prev_url = first_url
    # Simplifying assumption for this excerpt: the page count comes from max_pages
    total_pages = max_pages or 1
    for page in range(2, total_pages + 1):
        time.sleep(delay + random.uniform(0, delay * 0.3))
        offset = (page - 1) * 10
        session.upkeep()  # keep the idle connection alive after the delay
        page_url = f"{first_url}&start={offset}"
        resp = session.get(page_url, headers={"Referer": prev_url})
        prev_url = page_url

3 things that keep search pagination working:

session.upkeep() – call this between pagination requests. Without it, HTTP/2 connections go stale during the 5+ second delays, and the next request fails.

Referer chain – page 1 uses Referer: https://www.google.com/, and each later page sends the previous page's URL as its Referer.

5-second delay minimum – search pages need longer delays than other endpoints. The code defaults to 5 seconds (configurable via --delay), and the retry logic handles 403 responses automatically if the delay is too short.
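The offset arithmetic, Referer chain, and jittered delay can be checked without sending any requests. The sketch below only builds the request plan. plan_pages is a hypothetical helper, not part of the tutorial's source, and the 10-results-per-page figure comes from the description of search pages earlier in this section.

```python
import random
from urllib.parse import quote_plus


def plan_pages(query, location, total_pages, delay=5.0, per_page=10):
    """Return (url, referer, sleep_seconds) for each page, without fetching."""
    first_url = (
        f"https://www.yelp.com/search?find_desc={quote_plus(query)}"
        f"&find_loc={quote_plus(location)}"
    )
    # Page 1 claims to come from Google and needs no delay
    plan = [(first_url, "https://www.google.com/", 0.0)]
    for page in range(2, total_pages + 1):
        offset = (page - 1) * per_page
        url = f"{first_url}&start={offset}"
        # Base delay plus up to 30% random jitter
        pause = delay + random.uniform(0, delay * 0.3)
        # Each page's Referer is the URL of the page before it
        plan.append((url, plan[-1][0], pause))
    return plan


for url, referer, pause in plan_pages("tacos", "Los Angeles, CA", 3):
    print(f"{pause:4.1f}s  {url}  (Referer: {referer})")
```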

Running the search scraper

The code blocks above are simplified for readability. See the full yelp_search.py source for the complete working script with error handling and CLI parsing.

Pass a search term, location, and optional page limit. The scraper outputs ranked results as JSON or CSV:

python3 yelp_search.py "tacos" "Los Angeles, CA" --max-pages 3

The output JSON contains ranked business listings with all extracted fields:

[
  {
    "rank": 1,
    "name": "Avenue 26 Tacos",
    "alias": "avenue-26-tacos-los-angeles-2",
    "url": "https://www.yelp.com/biz/avenue-26-tacos-los-angeles-2",
    "biz_id": "boqeEN38XuEKimgKisrqSA",
    "rating": 4.4,
    "review_count": 650,
    "price_range": "$",
    "categories": ["Food Trucks"],
    "phone": "(213) 375-3300",
    "neighborhoods": ["Little Tokyo"],
    "address": "353 S Alameda St, Los Angeles",
    "is_ad": false,
    "is_closed": false
  }
]
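For the CSV output, list fields such as categories and neighborhoods need to be flattened into single cells. How the full script does this isn't shown here, so the following is only one possible approach: to_csv is a hypothetical helper, and joining with "; " is an assumption.

```python
import csv
import io


def to_csv(businesses):
    """Serialize result dicts to CSV text, joining list values with '; '."""
    if not businesses:
        return ""
    buf = io.StringIO()
    # Column order follows the keys of the first record
    writer = csv.DictWriter(buf, fieldnames=list(businesses[0].keys()))
    writer.writeheader()
    for biz in businesses:
        row = {
            key: "; ".join(map(str, value)) if isinstance(value, list) else value
            for key, value in biz.items()
        }
        writer.writerow(row)
    return buf.getvalue()


sample = [{
    "rank": 1,
    "name": "Avenue 26 Tacos",
    "categories": ["Food Trucks"],
    "neighborhoods": ["Little Tokyo"],
    "is_ad": False,
}]
print(to_csv(sample))
```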

Extracting business details from Yelp business pages

The business details scraper parses the Apollo Client cache, the ROOT_QUERY JSON blob from data source 2. This gives you the full set of business attributes that search results don't include.

Parsing the Apollo cache

Identify the Apollo cache by size (50KB+) and the presence of ROOT_QUERY:

def _parse_apollo_cache(soup):
    """Find and parse the Apollo state cache -- it's the biggest JSON blob on the page."""
    for tag in soup.find_all("script", type="application/json"):
        text = tag.string or ""
        if len(text) > 50000 and "ROOT_QUERY" in text:
            # Decode entities, then strip the HTML comment wrapper if present
            clean = unescape(text).strip()
            if clean.startswith("<!--"):
                clean = clean[4:]
            if clean.endswith("-->"):
                clean = clean[:-3]
            return json.loads(clean.strip())
    return None

The page has multiple <script type="application/json"> tags, but the Apollo cache is the big one (50KB+). Check for ROOT_QUERY to identify it. Like the Hypernova data, it may be wrapped in HTML comments and contains encoded entities.

Resolving Apollo cache references

The Apollo cache uses a normalized format where related entities aren't nested. Instead, they're referenced by key:

{
  "Business:abc123": {
    "name": "Flour Bakery",
    "categories": [
      {"__ref": "Category:bakeries"},
      {"__ref": "Category:coffee"}
    ]
  },
  "Category:bakeries": {
    "title": "Bakeries",
    "alias": "bakeries"
  }
}

You need a resolver function to follow these references:

def _resolve_ref(cache, ref_or_val):
    # Follow {"__ref": "Type:id"} pointers; pass plain values through unchanged
    if isinstance(ref_or_val, dict) and "__ref" in ref_or_val:
        return cache.get(ref_or_val["__ref"], {})
    return ref_or_val

Extracting operating hours

Hours need an explicit check for closed days: ["Closed"] is a non-empty list and therefore truthy in Python, so bool(day_hours) alone would mark a closed day as open:

def _extract_hours(biz):
    op = biz.get("operationHours")
    if not op:
        return []
    weekly = op.get("regularHoursMergedWithSpecialHoursForCurrentWeek", [])
    hours = []
    for day in weekly:
        day_hours = day.get("regularHours", [])
        # ["Closed"] is truthy, so compare against it explicitly
        is_open = bool(day_hours) and day_hours != ["Closed"]
        hours.append({
            "day": day.get("dayOfWeekShort", ""),
            "hours": day_hours,
            "is_open": is_open,
        })
    return hours
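The edge case is easy to confirm in isolation. This sketch pulls the check into a small helper (is_open as a function is a hypothetical name) and runs it against the three shapes a day's hours can take:

```python
def is_open(day_hours):
    # Non-empty and not the literal ["Closed"] marker
    return bool(day_hours) and day_hours != ["Closed"]


for sample in (["7:00 AM - 7:00 PM"], ["Closed"], []):
    # Columns: raw value, naive truthiness, corrected check
    print(sample, bool(sample), is_open(sample))
```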

Extracting photos

Photos are the trickiest part of the Apollo cache. They aren't stored under a simple key. Instead, they're nested under biz["media"] with GraphQL argument syntax in the key names:

def _extract_photos(biz, cache, limit=10):
    photos = []
    media = biz.get("media", {})
    for key in media:
        # Keys look like orderedMediaItems(...) with GraphQL arguments attached
        if "orderedMediaItems" not in key:
            continue
        edges = media[key].get("edges", [])
        for edge in edges:
            node = _resolve_ref(cache, edge.get("node", {}))
            if node.get("__typename") != "BusinessPhoto":
                continue  # skip videos
            photo_url = ""
            photo_url_obj = node.get("photoUrl", {})
            if isinstance(photo_url_obj, dict):
                for url_key, url_val in photo_url_obj.items():
                    if isinstance(url_val, str) and url_val.startswith("http"):
                        if "LARGE" in url_key or "ORIGINAL" in url_key:
                            photo_url = url_val
                            break
            photos.append({
                "id": node.get("encid", ""),
                "caption": node.get("caption") or "",
                "url": photo_url,
            })
            if len(photos) >= limit:
                break
    return photos

3 edge cases in the photo extraction logic:

Videos mixed with photos. The orderedMediaItems list contains both BusinessPhoto and BusinessVideo items. Filter by __typename.

Parameterized URL keys. Photo URLs aren't under a simple url key – they're under url({"size":"LARGE"}). You need to search for the key containing "LARGE" or "ORIGINAL".

Null captions. node.get("caption", "") returns None when the JSON value is null, not "". Use node.get("caption") or "" instead.
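The last two edge cases can be demonstrated on a hand-built node. The node below is made up for illustration: the example.com URLs, the SMALL size key, and the BusinessPhotoUrl type name are assumptions, and only the url({"size":"LARGE"}) key shape comes from the description above.

```python
# Made-up photo node in the shape described above
node = {
    "__typename": "BusinessPhoto",
    "caption": None,  # JSON null becomes None after json.loads
    "photoUrl": {
        "__typename": "BusinessPhotoUrl",
        'url({"size":"SMALL"})': "https://example.com/s.jpg",
        'url({"size":"LARGE"})': "https://example.com/l.jpg",
    },
}

# The key exists, so the "" default is never used
wrong = node.get("caption", "")
right = node.get("caption") or ""

# Pick the first http value whose key mentions LARGE or ORIGINAL
large = next(
    (
        value
        for key, value in node["photoUrl"].items()
        if isinstance(value, str)
        and value.startswith("http")
        and ("LARGE" in key or "ORIGINAL" in key)
    ),
    "",
)
print(repr(wrong), repr(right), large)
```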

Running the business scraper

Pass any Yelp business URL. The scraper extracts the Apollo cache and outputs a single JSON object with the full profile. See the full yelp_business.py source for the complete script.

python3 yelp_business.py "https://www.yelp.com/biz/flour-bakery-cafe-boston"

The output is a single JSON object with the full business profile (truncated here for readability):

{
  "biz_id": "-5gWvrcKOPmhlcZju3tpbw",
  "name": "Flour Bakery + Café",
  "alias": "flour-bakery-café-boston-4",
  "url": "https://www.yelp.com/biz/flour-bakery-café-boston-4",
  "is_closed": false,
  "rating": 4.3,
  "review_count": 1436,
  "price_range": "$$",
  "phone": "(617) 338-4333",
  "address": {
    "line1": "12 Farnsworth St",
    "city": "Boston",
    "state": "MA",
    "postal_code": "02210",
    "country": "US"
  },
  "neighborhoods": ["Waterfront", "South Boston"],
  "coordinates": {"latitude": 42.35123, "longitude": -71.048747},
  "categories": [
    {"title": "Bakeries", "alias": "bakeries"},
    {"title": "Coffee & Tea", "alias": "coffee"},
    {"title": "Sandwiches", "alias": "sandwiches"}
  ],
  "hours": [
    {"day": "Mon", "hours": ["7:00 AM - 7:00 PM"], "is_open": true},
    {"day": "Tue", "hours": ["7:00 AM - 7:00 PM"], "is_open": true},
    {"day": "Wed", "hours": ["7:00 AM - 7:00 PM"], "is_open": true},
    // ... Thu, Fri, Sun omitted ...
    {"day": "Sat", "hours": ["8:00 AM - 6:00 PM"], "is_open": true}
  ],
  "website": "https://flourbakery.com",
  "attributes": [
    {"name": "Offers delivery", "alias": "RestaurantsDelivery", "is_active": true},
    {"name": "Free Wi-Fi", "alias": "wifi_options", "is_active": true},
    {"name": "Outdoor seating", "alias": "has_outdoor_seating", "is_active": true}
  ],
  "photos": [
    {
      "id": "iKA2Ynabpt9q9pQ7iLpEcw",
      "caption": "Broccoli melt (half size)",
      "url": "https://s3-media0.fl.yelpcdn.com/bphoto/iKA2Ynabpt9q9pQ7iLpEcw/l.jpg"
    }
  ]
}

Scraping Yelp reviews with the GraphQL API

The reviews scraper uses Yelp's internal GraphQL batch endpoint, the same one the frontend calls when you scroll through reviews on a business page. This returns structured data directly.

Getting the business ID

Before you can query reviews, you need the encBizId (encrypted business ID). It's in the HTML:

def extract_biz_id(session, url):
    resp = session.get(url, headers={"Referer": "https://www.google.com/"})
    soup = BeautifulSoup(resp.text, "html.parser")
    # The encrypted business ID lives in a <meta name="yelp-biz-id"> tag
    meta = soup.find("meta", attrs={"name": "yelp-biz-id"})
    if not meta:
        raise RuntimeError("Could not find yelp-biz-id meta tag.")
    enc_biz_id = str(meta["content"])
    return enc_biz_id

Building the GraphQL payload

The request uses a persisted query hash (documentId) instead of raw GraphQL, and pagination cursors are base64-encoded:

from base64 import b64encode

GQL_URL = "https://www.yelp.com/gql/batch"
DOC_ID = "ef51f33d1b0eccc958dddbf6cde15739c48b34637a00ebe316441031d4bf7681"


def build_gql_payload(enc_biz_id, offset=0):
    variables = {
        "encBizId": enc_biz_id,
        "reviewsPerPage": 10,
        "selectedReviewEncId": "",
        "hasSelectedReview": False,
        "sortBy": "DATE_DESC",
        "languageCode": "en",
        "ratings": [5, 4, 3, 2, 1],
        "isSearching": False,
        "isTranslating": False,
        "translateLanguageCode": "en",
        "reactionsSourceFlow": "businessPageReviewSection",
        "minConfidenceLevel": "HIGH_CONFIDENCE",
        "highlightType": "",
        "highlightIdentifier": "",
        "isHighlighting": False,
    }
    # Pages after the first carry a base64-encoded offset cursor
    if offset > 0:
        token = b64encode(
            json.dumps({"version": 1, "type": "offset", "offset": offset}).encode()
        ).decode()
        variables["after"] = token
    # The batch endpoint expects a list of operations
    return [
        {
            "operationName": "GetBusinessReviewFeed",
            "variables": variables,
            "extensions": {
                "operationType": "query",
                "documentId": DOC_ID,
            },
        }
    ]

Key parameters in this payload:

documentId – a stable hash of the GraphQL query stored on Yelp's server. You don't send the actual query text – just this hash. If this hash stops working, see the recovery steps below.

Pagination uses base64-encoded offset tokens. The after parameter takes a base64-encoded JSON object: {"version": 1, "type": "offset", "offset": 10}.

sortBy: "DATE_DESC" – returns newest reviews first. Other options: "RELEVANCE_DESC", "ELITES_DESC".
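To make the cursor format concrete, this sketch encodes an offset the same way and decodes it back. encode_cursor and decode_cursor are hypothetical helper names. The decoder isn't needed for scraping, but it can help when inspecting an after value copied from DevTools.

```python
import json
from base64 import b64decode, b64encode


def encode_cursor(offset):
    """Build the base64 token for the `after` variable."""
    payload = {"version": 1, "type": "offset", "offset": offset}
    return b64encode(json.dumps(payload).encode()).decode()


def decode_cursor(token):
    """Reverse of encode_cursor, for inspecting tokens."""
    return json.loads(b64decode(token))


token = encode_cursor(10)
print(token)
print(decode_cursor(token))
```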

If documentId stops working: open any Yelp business page in Chrome DevTools, go to the Network tab, filter by batch, and look for the GetBusinessReviewFeed request. The current hash is in the request payload under extensions.documentId.