How To Scrape Shopify Stores: Complete Developer Guide

Most Shopify stores have a built-in JSON endpoint for product data: prices, variants, inventory, images. Web scraping Shopify means requesting /products.json, paginating, and getting the catalog as JSON. But the endpoint is limited to 250 products per page, and some merchants disable it. This guide covers both: the JSON approach for stores that have it, and the fallback for stores that don't.

TL;DR

  • Web scraping Shopify stores starts with the /products.json endpoint, which returns structured product data without HTML parsing.
  • The endpoint works on most public stores: 6 of the 8 we tested returned data directly.
  • For stores that disable it, the XML sitemap plus JSON-LD extraction from product pages covers the core fields (name, price, availability).
  • You need Python, curl_cffi (for Cloudflare TLS fingerprinting), and, optionally, residential proxies for multi-store scraping
  • The complete standalone script at the end handles both approaches automatically. Change the URL and run it.

What Shopify product data is worth scraping

Each use case maps to specific fields in the products.json response:

  • Competitor price monitoring. Extract price, compare_at_price, and variants across multiple stores on a daily schedule. A non-null compare_at_price indicates an active promotion.
  • Product trend research. Extract title, tags, and product_type to map which categories a brand is expanding into.
  • Inventory tracking. The available boolean indicates stock status per variant. Use inventory_quantity via /products/{handle}.js for exact counts.
  • Feed generation. The products.json response maps to most product feed schemas with minimal transformation.

The web scraping for market research guide describes additional competitor research patterns.

Confirm a site runs on Shopify

Before writing any scraping code, install curl_cffi and verify your target site runs on Shopify:

pip install curl_cffi beautifulsoup4

The fastest detection method is to request the products.json endpoint and check for valid JSON:

from curl_cffi import requests


def is_shopify_store(url):
    try:
        response = requests.get(
            f"{url}/products.json?limit=1",
            impersonate="chrome",
            timeout=10
        )
        if response.status_code == 200:
            data = response.json()
            return "products" in data
    except Exception:
        pass
    return False


print(is_shopify_store("https://www.allbirds.com"))
print(is_shopify_store("https://example.com"))

Allbirds runs on Shopify, so the function returns True. example.com doesn't, so it returns False:

True
False

The code uses curl_cffi with impersonate="chrome" because most Shopify stores run behind Cloudflare, which checks TLS fingerprints. Plain Requests with a fake User-Agent still looks like Python at the TLS layer.

If the endpoint returns a 403 or 404, the store may still run on Shopify, but with the JSON endpoint disabled. You can check 2 additional signals: asset URLs pointing to cdn.shopify.com in the page source, and an x-shopify-shop-id response header.
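The two secondary signals can be checked with a small helper once you have a homepage response in hand. This is a minimal sketch: shopify_signals is a hypothetical name, the sample HTML and header values are made up, and the header name is taken from the paragraph above rather than verified against every store.

```python
def shopify_signals(html, headers):
    """Report which secondary Shopify signals appear in a response.

    html: page source as a string.
    headers: mapping of response header names to values.
    """
    # Header names are case-insensitive, so normalize before the lookup
    lowered = {name.lower() for name in headers}
    return {
        "cdn_assets": "cdn.shopify.com" in html,
        "shop_id_header": "x-shopify-shop-id" in lowered,
    }


# Made-up sample data, for illustration only
sample_html = '<link rel="stylesheet" href="https://cdn.shopify.com/s/files/theme.css">'
sample_headers = {"X-Shopify-Shop-Id": "12345", "Content-Type": "text/html"}
print(shopify_signals(sample_html, sample_headers))
```

Either signal alone is a hint rather than proof, so treat a store as Shopify-based only when the evidence lines up with what the product pages look like.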

Test your target store first

Before targeting a specific store, run this diagnostic. It checks which endpoints are open and what scraping approach to use:

from curl_cffi import requests
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import json


def diagnose_shopify_store(store_url):
    """Run a full diagnostic on a Shopify store."""
    report = {"store": store_url}
    ns = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    # 1. products.json availability + get a sample handle
    sample_handle = None
    try:
        resp = requests.get(
            f"{store_url}/products.json?limit=1",
            impersonate="chrome", timeout=10
        )
        report["products_json"] = resp.status_code
        if resp.status_code == 200:
            data = resp.json()
            report["is_shopify"] = "products" in data
            if data.get("products"):
                sample_handle = data["products"][0]["handle"]
        else:
            report["is_shopify"] = False
    except Exception:
        report["products_json"] = "error"
        report["is_shopify"] = False
    # 2. Catalog size from sitemap (checks all product sub-sitemaps)
    try:
        resp = requests.get(
            f"{store_url}/sitemap.xml",
            impersonate="chrome", timeout=10
        )
        root = ET.fromstring(resp.content)
        total = 0
        for sm in root.findall("ns:sitemap", ns):
            loc = sm.find("ns:loc", ns).text
            if "products" in loc:
                r2 = requests.get(
                    loc, impersonate="chrome", timeout=15
                )
                r2root = ET.fromstring(r2.content)
                total += sum(
                    1 for u in r2root.findall("ns:url", ns)
                    if "/products/" in u.find("ns:loc", ns).text
                )
        report["catalog_size"] = total if total else None
    except Exception:
        report["catalog_size"] = None
    # 3. /products/{handle}.js endpoint
    if sample_handle:
        try:
            r = requests.get(
                f"{store_url}/products/{sample_handle}.js",
                impersonate="chrome", timeout=10
            )
            report["js_endpoint"] = r.status_code == 200
        except Exception:
            report["js_endpoint"] = False
    # 4. /collections.json endpoint
    try:
        r = requests.get(
            f"{store_url}/collections.json?limit=1",
            impersonate="chrome", timeout=10
        )
        report["collections_json"] = (
            r.status_code == 200 and "collections" in r.json()
        )
    except Exception:
        report["collections_json"] = False
    # 5. JSON-LD on HTML product pages
    if sample_handle:
        try:
            r = requests.get(
                f"{store_url}/products/{sample_handle}",
                impersonate="chrome", timeout=15
            )
            soup = BeautifulSoup(r.text, "html.parser")
            has_jsonld = False
            for s in soup.find_all(
                "script", type="application/ld+json"
            ):
                try:
                    d = json.loads(s.string)
                    items = d if isinstance(d, list) else [d]
                    for item in items:
                        if item.get("@type") in (
                            "Product", "ProductGroup"
                        ):
                            has_jsonld = True
                except Exception:
                    continue
            report["jsonld_on_html"] = has_jsonld
        except Exception:
            report["jsonld_on_html"] = False
    return report


report = diagnose_shopify_store("https://www.allbirds.com")
for key, value in report.items():
    print(f" {key:20} {value}")

For Allbirds, every endpoint is open:

store https://www.allbirds.com
products_json 200
is_shopify True
catalog_size 1693
js_endpoint True
collections_json True
jsonld_on_html True

Diagnostic results across famous Shopify stores

2 out of 8 stores block products.json. JSON-LD presence varies by theme:

Store | products.json | Catalog (sitemap) | .js | /collections.json | JSON-LD | Recommended path
allbirds.com | 200 | 1,693 | yes | yes | yes | products.json + .js
gymshark.com | 403 | 3,944 | blocked | blocked | | sitemap + JSON-LD fallback
taylorstitch.com | 200 | 1,610 | | | yes | products.json + .js
skims.com | 404 | 3,092 | blocked | blocked | | sitemap + JSON-LD fallback
fentybeauty.com | 200 | 773 | | | | products.json + .js
kyliecosmetics.com | 200 | 242 | | | | products.json + .js
jeffreestarcosmetics.com | 200 | 238 | | | no | products.json + HTML selectors
redbullshopus.com | 200 | 685 | | | | products.json + .js

Merchant-disabled vs. anti-bot blocked. When a store returns 403 or 404 on products.json, the cause matters. A merchant-disabled endpoint (common on Shopify Plus and headless setups) stays blocked regardless of IP. An anti-bot block is often removed when you use a residential proxy.

To distinguish them, retry the same request through a residential proxy. If the status code stays the same, the merchant likely disabled it. If it changes to 200, it was anti-bot blocking:

from curl_cffi import requests

proxy_url = (
    "http://user-YOUR_PROXY_USERNAME-country-us"
    ":YOUR_PROXY_PASSWORD@gate.decodo.com:7000"
)
proxies = {"http": proxy_url, "https": proxy_url}

for store in ["https://www.gymshark.com", "https://skims.com"]:
    direct = requests.get(
        f"{store}/products.json?limit=1",
        impersonate="chrome", timeout=15
    ).status_code
    via_proxy = requests.get(
        f"{store}/products.json?limit=1",
        impersonate="chrome",
        proxies=proxies, timeout=45
    ).status_code
    print(f"{store}: direct={direct}, via US proxy={via_proxy}")

Neither store's status code changes through the proxy:

https://www.gymshark.com: direct=403, via US proxy=403
https://skims.com: direct=404, via US proxy=404

Both are merchant-disabled. For these stores, you skip products.json and use the sitemap to discover product URLs, then extract JSON-LD from each HTML page. Proxies still help during HTML scraping because you're making thousands of individual page requests, but they don't re-enable the disabled endpoint.
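The decision rule for the direct-versus-proxy comparison can be written down as a tiny function. This is a sketch only: classify_block and its return labels are hypothetical names, and a single retry cannot rule out a proxy IP that happens to be flagged as well.

```python
def classify_block(direct_status, proxy_status):
    """Interpret the status codes of a direct request and a proxied retry."""
    if direct_status == 200:
        return "open"
    if proxy_status == 200:
        # The block disappeared once the IP changed
        return "anti-bot block"
    if proxy_status == direct_status:
        # Same refusal from a different IP: the IP is probably not the cause
        return "likely merchant-disabled"
    return "inconclusive"


print(classify_block(403, 403))
print(classify_block(403, 200))
```

Applied to the results above, both (403, 403) and (404, 404) fall into the "likely merchant-disabled" case.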

Scrape Shopify product data with products.json

Test the endpoint in a browser first. Paste https://www.allbirds.com/products.json?limit=1, and the browser shows raw JSON.

Set up the request function

Every scraping method below uses the same retry wrapper. Define it once:

from curl_cffi import requests
import time

MAX_RETRIES = 3


def fetch_with_retry(url, max_retries=MAX_RETRIES):
    """GET with exponential backoff on failure or rate limit."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(
                url, impersonate="chrome", timeout=30
            )
            if resp.status_code == 429:
                wait = 2 ** (attempt + 1)
                print(f" Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            resp.raise_for_status()
            return resp
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    # Every attempt was rate limited: raise instead of returning None,
    # so callers never call .json() on a missing response
    raise RuntimeError(
        f"Still rate limited after {max_retries} attempts: {url}"
    )

Fetch and paginate

The scraper paginates through the full catalog:

import json
from datetime import datetime

STORE_URL = "https://www.allbirds.com"


def scrape_shopify_products(store_url):
    """Scrape all products via the products.json endpoint."""
    all_products = []
    page = 1
    while True:
        url = f"{store_url}/products.json?limit=250&page={page}"
        response = fetch_with_retry(url)
        products = response.json().get("products", [])
        if not products:
            break
        all_products.extend(products)
        print(f"Page {page}: {len(products)} products")
        page += 1
        time.sleep(2)
    return all_products


products = scrape_shopify_products(STORE_URL)
print(f"Total: {len(products)} products")

Allbirds has 917 products across 4 pages:

Page 1: 250 products
Page 2: 250 products
Page 3: 250 products
Page 4: 167 products
Total: 917 products

Without the 2-second delay, Shopify can return 429 responses or empty arrays. The default limit is 30. Set limit=250 to reduce total requests.

Note that catalog sizes fluctuate as stores update inventory. The diagnostic section shows 1,693 from the sitemap (includes color-variant URLs), while products.json returned 917 unique product entries during this test. Both numbers are correct – they count different things.

Parse key product fields

The endpoint returns prices as strings (like "110.00"), not numbers. Convert them if you need numeric comparisons:

for product in products[:1]:
    print(f"Title: {product['title']}")
    print(f"Handle: {product['handle']}")
    print(f"Vendor: {product['vendor']}")
    print(f"Type: {product['product_type']}")
    for variant in product.get("variants", [])[:1]:
        price = float(variant["price"])
        compare = variant.get("compare_at_price")
        print(f" Variant: {variant['title']}")
        print(f" Price: ${price:.2f}")
        if compare:
            print(f" Compare at: ${float(compare):.2f}")
        else:
            print(" Compare at: N/A")
        print(f" SKU: {variant.get('sku', 'N/A')}")
        print(f" Available: {variant.get('available')}")
    print("---")

The first product in the response is a slip-on with one variant shown:

Title: Women's Cruiser Slip On Terry - Warm White (Warm White Sole)
Handle: womens-cruiser-slip-on-terry
Vendor: Allbirds
Type: Shoes
Variant: 5
Price: $110.00
Compare at: N/A
SKU: A12372W050
Available: True
---

compare_at_price is null when there's no sale. When a promotion is active, it holds the original price, and price holds the discounted value. For exact inventory counts, the /products/{handle}.js endpoint (see "Get more detailed product data") has inventory_quantity.
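To turn the price and compare_at_price pair into a discount figure, a helper along these lines is one option. It is a sketch, not part of the original scraper: discount_percent is a hypothetical name, the sample prices are made up, and Decimal is used so the string prices are not pushed through binary floats.

```python
from decimal import Decimal


def discount_percent(price, compare_at_price):
    """Return the discount as a percentage, or None when no sale is active."""
    if not compare_at_price:
        # null (None) or empty string: no promotion
        return None
    current = Decimal(price)
    original = Decimal(compare_at_price)
    if original <= current:
        return None
    return ((original - current) / original * 100).quantize(Decimal("0.1"))


# Made-up prices in the same string format products.json returns
print(discount_percent("88.00", "110.00"))
print(discount_percent("110.00", None))
```

Returning None for the no-sale case keeps "no discount" distinct from a zero-percent discount when the values land in a CSV or database.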

Export to CSV

Flatten variants into one row each for CSV export:

import csv

csv_file = f"allbirds_products_{datetime.now():%Y%m%d}.csv"
with open(csv_file, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([
        "title", "handle", "vendor", "type",
        "variant", "sku", "price",
        "compare_at_price", "available"
    ])
    for p in products:
        for v in p.get("variants", []):
            writer.writerow([
                p["title"], p["handle"], p["vendor"],
                p["product_type"], v["title"],
                v.get("sku", ""), v["price"],
                v.get("compare_at_price", ""),
                v.get("available", "")
            ])
print(f"Exported to {csv_file}")

The script writes a date-stamped file like allbirds_products_20260414.csv, with one header row (title, handle, vendor, type, variant, sku, price, compare_at_price, available) followed by one row per variant.

Each product has multiple size variants (6–10 per product on Allbirds), so the CSV output is several thousand rows. For JSON output with metadata, the guide to saving scraped data describes additional export patterns.

Get more detailed product data with the .js endpoint

The listing endpoint omits inventory counts, barcodes, and media dimensions. The /products/{handle}.js endpoint has them:

from curl_cffi import requests


def get_product_details(store_url, handle):
    """Fetch enriched product data via the .js endpoint."""
    url = f"{store_url}/products/{handle}.js"
    response = fetch_with_retry(url)
    return response.json()


data = get_product_details(
    "https://www.allbirds.com", "mens-tree-runners"
)
print(f"Title: {data['title']}")
print(f"Available: {data['available']}")
print(f"Price range: "
      f"${data['price_min']/100:.2f} - "
      f"${data['price_max']/100:.2f}")
print(f"Variants: {len(data['variants'])}")
print(f"Media items: {len(data.get('media', []))}")
v = data["variants"][0]
print("\nEnriched variant fields:")
print(f" SKU: {v['sku']}")
print(f" Price: ${v['price']/100:.2f}")
print(f" Inventory qty: {v.get('inventory_quantity')}")
print(f" Barcode: {v.get('barcode')}")

The output for the Men's Tree Runner includes the extra fields that products.json doesn't have:

Title: Men's Tree Runner - Jet Black (White Sole)
Available: True
Price range: $100.00 - $100.00
Variants: 7
Media items: 4
Enriched variant fields:
SKU: TR3MJBW080
Price: $100.00
Inventory qty: 305
Barcode: 843416184854

The .js endpoint returns prices in the minor currency unit, unlike the listing endpoint. For USD, that means cents (10000 = $100.00). For GBP, that means pence (10000 = £100.00). And inventory_quantity is the reported stock count – 305 units at the time of scraping. The media[] array includes aspect_ratio, width, height, and media_type per image.
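Because the two endpoints report prices in different units, it helps to normalize .js prices before comparing them with products.json values. The sketch below assumes a two-decimal currency by default, which matches the USD and GBP examples; minor_to_major and its exponent parameter are hypothetical names.

```python
from decimal import Decimal


def minor_to_major(amount, exponent=2):
    """Convert an integer minor-unit price (e.g. cents) to a Decimal major-unit price."""
    # scaleb shifts the decimal point without any float rounding
    return Decimal(amount).scaleb(-exponent)


print(minor_to_major(10000))  # the .js value from the example above
print(minor_to_major(11000) == Decimal("110.00"))  # matches the products.json string
```

Comparing Decimal values rather than floats avoids false "price changed" alerts caused by rounding.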

Field | /products.json | /products/{handle}.js
title, handle, vendor | |
price format | string ("110.00") | cents (11000)
compare_at_price | |
available | |
inventory_quantity | no | yes
barcode | no | yes
media[] with dimensions | no | yes
tags | |
selling_plan_groups | |

Some stores encode metadata in tags using custom conventions. For example, Allbirds uses namespace::key => value strings for material and carbon scores. Inspect a sample product's tags before building your parser.
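A parser for the namespace::key => value convention might look like the sketch below. The function name and the sample tags are invented for illustration; real tag formats differ per store, so check a live product first.

```python
def parse_structured_tags(tags):
    """Split a product's tags into structured metadata and plain labels."""
    metadata, plain = {}, []
    for tag in tags:
        namespace, sep1, rest = tag.partition("::")
        key, sep2, value = rest.partition("=>")
        if sep1 and sep2:
            metadata.setdefault(namespace.strip(), {})[key.strip()] = value.strip()
        else:
            # Anything that does not match the convention stays a plain tag
            plain.append(tag)
    return metadata, plain


# Invented sample tags that follow the convention described above
sample_tags = [
    "allbirds::material => Tree",
    "allbirds::carbon-score => 4.99",
    "new-arrival",
]
metadata, plain = parse_structured_tags(sample_tags)
print(metadata)
print(plain)
```

Values stay as strings here; convert them to numbers only after confirming which keys are numeric on the store you are scraping.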

Scrape a specific Shopify collection

Append /products.json to any collection URL to get only the products in that category.

The function below reuses fetch_with_retry() from the setup section:

def scrape_collection(store_url, collection_handle):
    """Scrape products from a specific collection."""
    all_products = []
    page = 1
    while True:
        url = (
            f"{store_url}/collections/{collection_handle}"
            f"/products.json?limit=250&page={page}"
        )
        response = fetch_with_retry(url)
        products = response.json().get("products", [])
        if not products:
            break
        all_products.extend(products)
        print(f"Page {page}: {len(products)} products")
        page += 1
        time.sleep(2)
    return all_products


sneakers = scrape_collection(
    "https://www.allbirds.com", "mens-sneakers"
)
print(f"Total in mens-sneakers: {len(sneakers)} products")

The mens-sneakers collection is returned in a single page:

Page 1: 241 products
Total in mens-sneakers: 241 products

If you only need pricing data for one category, send a single request to /collections/mens-sneakers/products.json?limit=250. That returns 241 products without paginating the full catalog.

List all collections

The /collections.json endpoint returns the category structure. On Allbirds, that's 1,334 collections across 6 pages:

def get_all_collections(store_url):
    """Fetch all collections from a Shopify store."""
    all_collections = []
    page = 1
    while True:
        url = (
            f"{store_url}/collections.json"
            f"?limit=250&page={page}"
        )
        response = fetch_with_retry(url)
        collections = response.json().get("collections", [])
        if not collections:
            break
        all_collections.extend(collections)
        print(f"Page {page}: {len(collections)} collections")
        page += 1
        time.sleep(1)
    return all_collections


collections = get_all_collections("https://www.allbirds.com")
print(f"\nTotal: {len(collections)} collections")
for c in collections[:3]:
    print(f" {c['handle']} ({c.get('products_count', 0)} products)")

The run prints a count per page, followed by the first three collection handles:

Page 1: 250 collections
Page 2: 250 collections
Page 3: 250 collections
Page 4: 250 collections
Page 5: 250 collections
Page 6: 84 collections
Total: 1334 collections
womens-accessories (28 products)
add-on-essentials (97 products)
mens-sneakers (241 products)

Extract product URLs from the XML sitemap

Shopify appends query parameters to sitemap child URLs. Parse the parent sitemap at /sitemap.xml first to get the exact product sitemap path. The function uses fetch_with_retry() from the setup section:

import xml.etree.ElementTree as ET


def get_product_urls(store_url):
    """Extract all product URLs from the Shopify sitemap."""
    response = fetch_with_retry(f"{store_url}/sitemap.xml")
    root = ET.fromstring(response.content)
    ns = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    # Collect ALL product sub-sitemaps (large stores have multiple)
    product_sitemaps = [
        sm.find("ns:loc", ns).text
        for sm in root.findall("ns:sitemap", ns)
        if "products" in sm.find("ns:loc", ns).text
    ]
    if not product_sitemaps:
        return []
    # Parse product URLs from each sub-sitemap
    urls = []
    for sitemap_url in product_sitemaps:
        response = fetch_with_retry(sitemap_url)
        sub_root = ET.fromstring(response.content)
        urls.extend(
            url_elem.find("ns:loc", ns).text
            for url_elem in sub_root.findall("ns:url", ns)
            if "/products/" in url_elem.find("ns:loc", ns).text
        )
    return urls


product_urls = get_product_urls("https://www.allbirds.com")
print(f"Found {len(product_urls)} product URLs")
for url in product_urls[:3]:
    print(url)

The function extracts all product URLs from every sub-sitemap:

Found 917 product URLs
https://www.allbirds.com/products/mens-wool-runners-natural-black
https://www.allbirds.com/products/mens-wool-runners
https://www.allbirds.com/products/mens-wool-runners-natural-white

The sitemap returns URLs, not product data. Combine it with per-page scraping for stores that disable products.json.

Deduplication note. The sitemap count is often higher than the products.json count. Color variants share a parent product but have separate sitemap URLs. Group by the product handle or product_id rather than treating each URL as unique.
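One way to group sitemap URLs by handle is sketched below. It only collapses URLs that resolve to the same handle (for example, a locale-prefixed copy, which is an assumed URL shape here); color variants with their own handles still need product_id from the product data. Both function names are hypothetical.

```python
from urllib.parse import urlparse


def handle_from_url(url):
    """Return the product handle from a /products/{handle} URL, or None."""
    path = urlparse(url).path.rstrip("/")
    marker = "/products/"
    if marker not in path:
        return None
    return path.split(marker, 1)[1]


def unique_by_handle(urls):
    """Keep the first URL seen for each handle, preserving sitemap order."""
    seen = {}
    for url in urls:
        handle = handle_from_url(url)
        if handle and handle not in seen:
            seen[handle] = url
    return seen


sample_urls = [
    "https://www.allbirds.com/products/mens-wool-runners",
    # Assumed locale-prefixed duplicate, for illustration only
    "https://www.allbirds.com/en-ca/products/mens-wool-runners",
    "https://www.allbirds.com/products/mens-wool-runners-natural-black",
]
print(list(unique_by_handle(sample_urls)))
```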

Scrape Shopify HTML when products.json is disabled

Some Shopify Plus merchants disable public access to products.json. When the endpoint returns a 403 or 404, extract product data from the HTML pages instead.

Many Shopify themes embed a <script type="application/ld+json"> block on product pages with schema.org Product structured data. The schema varies – some themes use Product, others use ProductGroup with nested hasVariant arrays.

from curl_cffi import requests
from bs4 import BeautifulSoup
import json


def scrape_product_from_html(url):
    """Extract product data from JSON-LD on a product page."""
    response = fetch_with_retry(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for script in soup.find_all(
        "script", type="application/ld+json"
    ):
        try:
            data = json.loads(script.string)
            # A script tag may hold one object or a list of objects
            items = data if isinstance(data, list) else [data]
            for item in items:
                if item.get("@type") == "Product":
                    return item
                # Some themes use ProductGroup
                if item.get("@type") == "ProductGroup":
                    variants = item.get("hasVariant", [])
                    if variants:
                        first = variants[0]
                        first["name"] = item.get("name")
                        first["brand"] = item.get("brand")
                        return first
        except (json.JSONDecodeError, TypeError, AttributeError):
            continue
    return None


product = scrape_product_from_html(
    "https://www.taylorstitch.com/products/"
    "drift-boardshort-in-ivory-floral-2604"
)
if product:
    offers = product.get("offers", {})
    if isinstance(offers, list):
        offers = offers[0]
    print(f"Name: {product.get('name')}")
    print(f"Price: {offers.get('price')}")
    print(f"Currency: {offers.get('priceCurrency')}")
    print(f"Available: {offers.get('availability')}")
    print(f"SKU: {product.get('sku')}")

The JSON-LD on this Taylor Stitch product page has the same core fields:

Name: The Drift Boardshort
Price: 118.0
Currency: USD
Available: https://schema.org/InStock
SKU: 2604DBIF28

Not every Shopify theme includes JSON-LD. The diagnostic table shows Jeffree Star Cosmetics as an example that lacks it, while most others embed a ProductGroup block. For stores without JSON-LD, use data- attributes like data-product-id and data-handle rather than CSS class names. For a deeper tutorial on HTML parsing, see the Beautiful Soup web scraping guide.
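The extraction function above keeps only the first variant of a ProductGroup. If you need every variant, a flattening step like this sketch is one approach. The function name is hypothetical and the sample block is invented, following the schema.org field names already used in this section.

```python
def flatten_jsonld_product(data):
    """Return one row per variant from a Product or ProductGroup JSON-LD object."""
    if data.get("@type") == "ProductGroup":
        variants = data.get("hasVariant", [])
    elif data.get("@type") == "Product":
        variants = [data]
    else:
        return []
    rows = []
    for variant in variants:
        offers = variant.get("offers", {})
        if isinstance(offers, list):
            offers = offers[0] if offers else {}
        rows.append({
            # Fall back to the group name when a variant has none
            "name": variant.get("name") or data.get("name"),
            "sku": variant.get("sku"),
            "price": offers.get("price"),
            "currency": offers.get("priceCurrency"),
            "availability": offers.get("availability"),
        })
    return rows


# Invented ProductGroup block, for illustration only
sample_group = {
    "@type": "ProductGroup",
    "name": "Example Boardshort",
    "hasVariant": [
        {"@type": "Product", "sku": "EX-28",
         "offers": {"price": "118.0", "priceCurrency": "USD",
                    "availability": "https://schema.org/InStock"}},
        {"@type": "Product", "sku": "EX-30",
         "offers": [{"price": "118.0", "priceCurrency": "USD",
                     "availability": "https://schema.org/OutOfStock"}]},
    ],
}
rows = flatten_jsonld_product(sample_group)
for row in rows:
    print(row["sku"], row["price"], row["availability"])
```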

Build a fallback pipeline

Combine both methods into a single function that tries products.json first and falls back to HTML:

def scrape_store(store_url):
    """Try JSON endpoint, fall back to HTML scraping."""
    try:
        products = scrape_shopify_products(store_url)
        if products:
            print(f"JSON endpoint: {len(products)} products")
            return products
    except Exception as e:
        print(f"products.json failed: {e}")
    product_urls = get_product_urls(store_url)
    products = []
    failed = []
    for url in product_urls:
        try:
            product = scrape_product_from_html(url)
            if product:
                products.append(product)
        except Exception as e:
            failed.append({"url": url, "error": str(e)})
        time.sleep(1)
    if failed:
        with open("failed_urls.json", "w") as f:
            json.dump(failed, f, indent=2)
        print(f"{len(failed)} URLs failed - logged")
    print(f"HTML fallback: {len(products)} products")
    return products

Failed URLs are logged to failed_urls.json for retry:

[
{"url": "https://store.com/products/broken-link", "error": "HTTP Error 404"},
{"url": "https://store.com/products/timeout", "error": "ReadTimeout"}
]

For large catalogs (1,000+ URLs), the HTML loop takes 15–30 minutes at 1 request per second. The complete script at the end includes progress logging.

Handle rate limits and anti-bot protections

Shopify doesn't publish official rate limits for products.json, but enforcement exists. Rapid requests from the same IP can trigger 429 responses or silent empty arrays.

Respect the request cadence

A 2-second delay between paginated requests avoids most rate limit issues. For individual product pages through HTML, use 1–2 seconds. Larger gaps (5–10 seconds) are safer when scraping multiple stores in sequence. For more retry patterns, see the retry guide for Python requests.

Common failure patterns to recognize. These are the signatures you'll see in the terminal when a request fails:

  • Products: 0 after a successful 200 response → silent rate limit, treat it as a 429 and back off
  • curl_cffi.requests.exceptions.HTTPError: HTTP Error 403 on .myshopify.com domains → merchant has password-protected the store (B2B or pre-launch) and no proxy helps
  • xml.etree.ElementTree.ParseError: no element found on sitemap → the store's sitemap URL needs query parameters; fetch the parent /sitemap.xml first to get the exact child URL
  • JSONDecodeError on products.json → the endpoint returned HTML instead of JSON (usually a 5xx error page or a Cloudflare challenge); retry through a residential proxy
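Several of these signatures can be detected in code before they surface as exceptions. The sketch below is an assumption-laden helper (the function name and labels are hypothetical) that inspects a status code and raw response body.

```python
import json


def classify_products_response(status_code, body):
    """Label a products.json response using the failure patterns above."""
    if status_code == 429:
        return "rate_limited"
    if status_code in (403, 404):
        return "blocked_or_disabled"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # HTML error page or challenge page served instead of JSON
        return "not_json"
    if not isinstance(payload, dict) or "products" not in payload:
        return "unexpected_json"
    if not payload["products"]:
        return "empty_products"
    return "ok"


print(classify_products_response(200, '{"products": []}'))
print(classify_products_response(200, "<html>Just a moment...</html>"))
```

An "empty_products" result on page 1 points to a silent rate limit, while the same result on a later page is usually just the end of the catalog, so the caller needs to know which page it requested.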

Some stores require cookie consent before serving content. Pass a consent cookie in request headers or use browser automation to handle the consent flow. If is_shopify_store() returns False or the diagnostic reports a 403 or 404 for products.json, switch to the HTML extraction method.

Rotate proxies for multi-store scraping

Scraping a single Shopify store from one IP address works for small catalogs. But when you scrape multiple stores on a recurring schedule, Cloudflare flags your IP. Most Shopify stores route traffic through Cloudflare, so the protection behaves similarly from store to store. Some headless setups (like Gymshark) use different CDNs such as AWS CloudFront, but curl_cffi is still good practice there because it keeps requests consistently browser-like.

For Shopify scraping at scale, residential proxies use IP addresses assigned to real household devices. They're far less likely to be blocked than datacenter IPs. Rotating proxies provide a different IP for each request automatically through a gateway.

Configure Decodo residential proxies

To get started, create an account and generate proxy credentials from the dashboard. The residential proxy quick start guide explains the full setup.

Everything goes through one endpoint: gate.decodo.com:7000. Location, session type, and duration are all controlled through username parameters:

user-USERNAME-country-COUNTRY:PASSWORD@gate.decodo.com:7000

For a random global IP without country targeting, drop the country parameter:

from curl_cffi import requests

# Replace YOUR_PROXY_USERNAME and YOUR_PROXY_PASSWORD
proxy_url = (
    "http://user-YOUR_PROXY_USERNAME"
    ":YOUR_PROXY_PASSWORD@gate.decodo.com:7000"
)
response = requests.get(
    "https://ip.decodo.com/json",
    impersonate="chrome",
    proxies={"http": proxy_url, "https": proxy_url},
    timeout=30
)
ip_data = response.json()
print(f"Proxy IP: {ip_data['proxy']['ip']}")
print(f"Country: {ip_data['country']['name']}")
print(f"ISP: {ip_data['isp']['isp']}")

The proxy provides a residential IP from a random country:

Proxy IP: 136.158.70.151
Country: Philippines
ISP: Globe Telecom

By default, each request gets a new IP. To keep the same IP across multiple requests (better for paginating a single store), add session and sessionduration to the username: user-USERNAME-session-1-sessionduration-10. The default session duration is 10 minutes.
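Since every option lives in the username, a small builder keeps the string assembly in one place. This is a sketch based only on the two username patterns shown in this section; build_proxy_url is a hypothetical helper, and combining country with session parameters in this order is an assumption to confirm against the provider's documentation.

```python
def build_proxy_url(username, password, country=None,
                    session=None, session_minutes=None):
    """Assemble a gateway proxy URL from username parameters."""
    parts = ["user", username]
    if country:
        parts += ["country", country]
    if session is not None:
        parts += ["session", str(session)]
        if session_minutes is not None:
            parts += ["sessionduration", str(session_minutes)]
    return f"http://{'-'.join(parts)}:{password}@gate.decodo.com:7000"


# Sticky session: same IP across requests for 10 minutes
sticky = build_proxy_url("USERNAME", "PASSWORD", session=1, session_minutes=10)
# Rotating US IP: new address on each request
rotating_us = build_proxy_url("USERNAME", "PASSWORD", country="us")
print(sticky)
print(rotating_us)
```

Passwords containing characters such as @ or : would need URL-encoding before they go into the URL; the sketch leaves that out.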

Target a specific country

For stores that serve different prices by region, add the country parameter to the username. This routes requests through a residential IP in that country:

from curl_cffi import requests

# US-targeted proxy via country parameter
proxy_url = (
    "http://user-YOUR_PROXY_USERNAME-country-us"
    ":YOUR_PROXY_PASSWORD@gate.decodo.com:7000"
)
response = requests.get(
    "https://www.allbirds.com/products.json?limit=1",
    impersonate="chrome",
    proxies={"http": proxy_url, "https": proxy_url},
    timeout=30
)
product = response.json()["products"][0]
print(f"Product: {product['title']}")
print(f"Price: ${product['variants'][0]['price']}")

The request goes through a US IP and returns US pricing:

Product: Women's Cruiser Slip On Terry - Warm White (Warm White Sole)
Price: $110.00

Scrape Shopify with the Decodo Web Scraping API

The code above works for most public Shopify stores. But at 50+ stores on a daily schedule, maintaining retry logic, proxy rotation, and fallback chains becomes its own project. The Decodo Web Scraping API reduces that to a single HTTP call per page. You need an API token. The Web Scraping API quick start guide explains how to get one:

from curl_cffi import requests
from bs4 import BeautifulSoup
import json

API_URL = "https://scraper-api.decodo.com/v2/scrape"
API_TOKEN = "YOUR_API_TOKEN"

# Scrape a Gymshark product page - this store
# blocks all JSON endpoints (products.json, .js, collections.json)
response = requests.post(
    API_URL,
    headers={
        "Accept": "application/json",
        "Authorization": f"Basic {API_TOKEN}",
        "Content-Type": "application/json"
    },
    json={
        "url": "https://www.gymshark.com/products/"
               "gymshark-running-t-shirt-ss-tops",
        "headless": "html",
        "proxy_pool": "premium"
    },
    timeout=90
)
result = response.json()["results"][0]
print(f"Status code: {result['status_code']}")

# Parse JSON-LD from the rendered HTML
soup = BeautifulSoup(result["content"], "html.parser")
for script in soup.find_all(
    "script", type="application/ld+json"
):
    try:
        data = json.loads(script.string)
        if data.get("@type") == "Product":
            offers = data.get("offers", {})
            print(f"Product: {data['name']}")
            print(f"Price: ${offers.get('price')}")
            print(f"In stock: {offers.get('availability')}")
    # AttributeError covers script tags that hold a list, not an object
    except (json.JSONDecodeError, TypeError, AttributeError):
        continue

The API renders the Gymshark page and returns the product data that direct requests cannot retrieve:

Status code: 200
Product: Running T-Shirt
Price: $30.4
In stock: https://schema.org/InStock

The API has a free tier for testing. It also supports geo for country targeting and browser actions for JavaScript-rendered pages. The full parameter list is in the API docs.

Automate scraping on a schedule

Price monitoring and inventory tracking need recurring runs.

Schedule with cron

On Linux or macOS, add a cron job to run the scraper at a fixed time. Open the crontab with crontab -e and add:

# Run Shopify scraper daily at 3:00 AM
0 3 * * * cd /home/user/scraper && python shopify_scraper.py >> scraper.log 2>&1

On Windows, use Task Scheduler to create a daily trigger for the same script.

Add a lock file to prevent overlapping runs when the previous scrape hasn't finished:

import os
import sys

LOCK_FILE = "scraper.lock"

if os.path.exists(LOCK_FILE):
    print("Previous run still active. Exiting.")
    sys.exit(1)
try:
    with open(LOCK_FILE, "w") as f:
        f.write(str(os.getpid()))
    products = scrape_shopify_products(STORE_URL)
    # ... save output, export CSV ...
finally:
    if os.path.exists(LOCK_FILE):
        os.remove(LOCK_FILE)
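The check-then-create pattern above leaves a small window in which two runs can both pass the exists() check. A variation that creates the lock file atomically is sketched below; single_run_lock is a hypothetical helper, and like the original it does not recover from a stale lock left behind by a crashed process.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_run_lock(path="scraper.lock"):
    """Yield True if this process acquired the lock, False if another run holds it."""
    try:
        # O_EXCL makes creation fail if the file already exists
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        fd = None
    if fd is None:
        yield False
        return
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        yield True
    finally:
        os.remove(path)


# Demo in a temporary directory: the second attempt is refused
# while the first still holds the lock
demo_path = os.path.join(tempfile.mkdtemp(), "scraper.lock")
with single_run_lock(demo_path) as first:
    with single_run_lock(demo_path) as second:
        results = (first, second)
print(results)
```

In the scraper, the body of the with block would check the yielded value and exit early when it is False.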

Detect price changes between runs

Compare each scrape against the previous file using the product handle and variant ID as the composite key:

import json


def detect_changes(current_products, previous_file):
    """Compare current scrape against previous run."""
    try:
        with open(previous_file) as f:
            previous = json.load(f)["products"]
    except (FileNotFoundError, KeyError):
        return []
    prev_map = {}
    for p in previous:
        for v in p.get("variants", []):
            prev_map[f"{p['handle']}_{v['id']}"] = v
    changes = []
    for p in current_products:
        for v in p.get("variants", []):
            key = f"{p['handle']}_{v['id']}"
            prev = prev_map.get(key)
            if not prev:
                changes.append({
                    "handle": p["handle"],
                    "change": "new_product",
                    "price": v["price"]
                })
            elif prev["price"] != v["price"]:
                changes.append({
                    "handle": p["handle"],
                    "change": "price_changed",
                    "old_price": prev["price"],
                    "new_price": v["price"]
                })
    return changes

Push changed records to a database, spreadsheet, or notification system.

For teams that prefer visual workflows over cron scripts, the Decodo n8n integration is an alternative.

Complete script

Save as shopify_scraper.py and run with python shopify_scraper.py:

"""
Shopify Store Scraper
"""
from curl_cffi import requests
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import json
import csv
import time
import sys
from datetime import datetime

# ---- Configuration ----
STORE_URL = "https://www.allbirds.com"
# Uncomment for Decodo residential proxies
# PROXY_USER = "YOUR_PROXY_USERNAME"
# PROXY_PASS = "YOUR_PROXY_PASSWORD"
# PROXIES = {
#     "http": f"http://user-{PROXY_USER}-country-us:"
#             f"{PROXY_PASS}@gate.decodo.com:7000",
#     "https": f"http://user-{PROXY_USER}-country-us:"
#              f"{PROXY_PASS}@gate.decodo.com:7000",
# }
PROXIES = None
DELAY_BETWEEN_PAGES = 2
MAX_RETRIES = 3


# ---- Core functions ----
def fetch_with_retry(url, max_retries=MAX_RETRIES):
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, impersonate="chrome", proxies=PROXIES, timeout=30)
            if resp.status_code == 429:
                wait = 2 ** (attempt + 1)
                print(f" Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            resp.raise_for_status()
            return resp
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2**attempt)
    return None


def is_shopify_store(store_url):
    try:
        r = fetch_with_retry(f"{store_url}/products.json?limit=1")
        return r is not None and "products" in r.json()
    except Exception:
        return False


def scrape_products_json(store_url):
    all_products = []
    page = 1
    while True:
        url = f"{store_url}/products.json?limit=250&page={page}"
        r = fetch_with_retry(url)
        products = r.json().get("products", [])
        if not products:
            break
        all_products.extend(products)
        print(f" Page {page}: {len(products)} products")
        page += 1
        time.sleep(DELAY_BETWEEN_PAGES)
    return all_products


def get_product_urls_from_sitemap(store_url):
    r = fetch_with_retry(f"{store_url}/sitemap.xml")
    root = ET.fromstring(r.content)
    ns = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    product_sitemaps = [
        sm.find("ns:loc", ns).text
        for sm in root.findall("ns:sitemap", ns)
        if "products" in sm.find("ns:loc", ns).text
    ]
    if not product_sitemaps:
        return []
    urls = []
    for sitemap_url in product_sitemaps:
        r = fetch_with_retry(sitemap_url)
        sub_root = ET.fromstring(r.content)
        urls.extend(
            u.find("ns:loc", ns).text
            for u in sub_root.findall("ns:url", ns)
            if "/products/" in u.find("ns:loc", ns).text
        )
    return urls


def extract_jsonld(url):
    r = fetch_with_retry(url)
    soup = BeautifulSoup(r.text, "html.parser")
    for s in soup.find_all("script", type="application/ld+json"):
        try:
            d = json.loads(s.string)
            items = d if isinstance(d, list) else [d]
            for item in items:
                if item.get("@type") == "Product":
                    return item
                if item.get("@type") == "ProductGroup":
                    variants = item.get("hasVariant", [])
                    if variants:
                        first = variants[0]
                        first["name"] = item.get("name")
                        return first
        except Exception:
            continue
    return None


def scrape_with_fallback(store_url):
    try:
        products = scrape_products_json(store_url)
        if products:
            return products, "products.json"
    except Exception as e:
        print(f" products.json failed: {e}")
    print(" Falling back to sitemap + JSON-LD...")
    urls = get_product_urls_from_sitemap(store_url)
    products = []
    for i, url in enumerate(urls):
        if i % 50 == 0:
            print(f" HTML scrape: {i}/{len(urls)}")
        try:
            p = extract_jsonld(url)
            if p:
                products.append(p)
        except Exception:
            pass
        time.sleep(1)
    return products, "sitemap+jsonld"
# ---- Output ----
def save_json(products, store_url, method):
output = {
"metadata": {
"store_url": store_url,
"scraped_at": datetime.now().isoformat(),
"method": method,
"total_products": len(products),
},
"products": products,
}
filename = f"products_{datetime.now():%Y%m%d_%H%M%S}.json"
with open(filename, "w") as f:
json.dump(output, f, indent=2)
print(f"\nSaved {len(products)} products to {filename}")
def save_csv(products, store_url):
filename = f"products_{datetime.now():%Y%m%d_%H%M%S}.csv"
with open(filename, "w", newline="") as f:
w = csv.writer(f)
w.writerow(
[
"title",
"handle",
"vendor",
"type",
"variant",
"sku",
"price",
"compare_at_price",
"available",
]
)
for p in products:
for v in p.get("variants", []):
w.writerow(
[
p.get("title"),
p.get("handle"),
p.get("vendor"),
p.get("product_type"),
v.get("title"),
v.get("sku"),
v.get("price"),
v.get("compare_at_price"),
v.get("available"),
]
)
print(f"Saved CSV to {filename}")
# ---- Entry point ----
def main():
print(f"Target store: {STORE_URL}")
if not is_shopify_store(STORE_URL):
print("ERROR: target does not appear to be a Shopify store")
sys.exit(1)
print("Confirmed Shopify store. Starting scrape...")
products, method = scrape_with_fallback(STORE_URL)
if not products:
print("ERROR: no products scraped")
sys.exit(1)
save_json(products, STORE_URL, method)
save_csv(products, STORE_URL)
print(f"\nDone. Method: {method}")
if __name__ == "__main__":
main()

This is the output when running it against allbirds.com:

Target store: https://www.allbirds.com
Confirmed Shopify store. Starting scrape...
Page 1: 250 products
Page 2: 250 products
Page 3: 250 products
Page 4: 250 products
Page 5: 250 products
Page 6: 250 products
Page 7: 193 products
Saved 1693 products to products_20260415_123517.json
Saved CSV to products_20260415_123517.csv
Done. Method: products.json

The JSON file includes metadata and the full product array:

{
  "metadata": {
    "store_url": "https://www.allbirds.com",
    "scraped_at": "2026-04-15T12:35:17.042869",
    "method": "products.json",
    "total_products": 1693
  },
  "products": [
    {
      "title": "Women's Cruiser Slip On Terry - Warm White (Warm White Sole)",
      "handle": "womens-cruiser-slip-on-terry",
      "vendor": "Allbirds",
      ...
    }
  ]
}

This is the structure that detect_changes() from the automation section reads with json.load(f)["products"].
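As a rough sketch of how a consumer might read that structure, the snippet below indexes variant prices and compares two snapshots. The helper names `load_snapshot`, `price_index`, and `changed_prices` are hypothetical and are not the `detect_changes()` function from the automation section. It assumes each variant carries `title` and `price`, as in the CSV export above.

```python
import json


def load_snapshot(path):
    """Read a file written by save_json() (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def price_index(snapshot):
    """Map (product handle, variant title) -> price for one snapshot dict."""
    index = {}
    for product in snapshot.get("products", []):
        for variant in product.get("variants", []):
            key = (product.get("handle"), variant.get("title"))
            index[key] = variant.get("price")
    return index


def changed_prices(old_snapshot, new_snapshot):
    """Return {key: (old_price, new_price)} for variants present in both snapshots."""
    old, new = price_index(old_snapshot), price_index(new_snapshot)
    return {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
```

A typical call would be `changed_prices(load_snapshot(old_file), load_snapshot(new_file))`, where both paths point at files produced by the script on different days. Variants that exist in only one snapshot are ignored here; tracking additions and removals would need a separate comparison.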

Replace STORE_URL with your target and uncomment the PROXIES block if you need rotation.

We tested it against 4 stores with different endpoint configurations:

| Store | Method used | Result |
| --- | --- | --- |
| fentybeauty.com | products.json (direct) | 862 products |
| kyliecosmetics.com | products.json (direct) | 246 products |
| skims.com | sitemap + JSON-LD (fallback) | full catalog available |
| gymshark.com | sitemap + JSON-LD (fallback) | full catalog available |

The script handled all 4 without configuration changes. For SKIMS and Gymshark, the fallback retrieved 3,000+ product URLs via sitemap and extracted JSON-LD from each page.
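Note that save_csv() in the script iterates over each product's variants, a key that JSON-LD records do not have, so fallback results need their own flattening step before they can be written as rows. The sketch below is one possible way to do that. `jsonld_to_row` is a hypothetical helper, and it assumes the common schema.org layout where `offers` is an object (or a list of objects) with `price`, `priceCurrency`, and `availability`; themes vary, so check a real page first.

```python
def jsonld_to_row(item):
    """Flatten a schema.org Product dict into a small, CSV-friendly dict."""
    offers = item.get("offers") or {}
    if isinstance(offers, list):
        # Some themes emit one offer per variant; take the first one
        offers = offers[0] if offers else {}
    availability = str(offers.get("availability", ""))
    return {
        "title": item.get("name"),
        "sku": item.get("sku"),
        # AggregateOffer blocks use lowPrice instead of price
        "price": offers.get("price", offers.get("lowPrice")),
        "currency": offers.get("priceCurrency"),
        # availability is usually a URL such as https://schema.org/InStock
        "available": availability.endswith("InStock"),
    }
```

Each returned dict can be passed to `csv.DictWriter.writerow()` with the same five field names.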

What to build next

Here are 3 ideas to extend the scraper:

  • Multi-store price tracker. Loop through competitor store URLs on a daily cron schedule. Send the change detection output to a Slack webhook or email alert.
  • Collection-level trend monitor. Use /collections.json to map a store's full category structure, then track which collections add or remove products over time.
  • Inventory restock alerter. Poll inventory_quantity via the .js endpoint on high-demand SKUs. Trigger an alert when stock changes from 0 to a positive value.
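The restock idea comes down to comparing two polls. Here is a minimal sketch of that comparison, assuming you have already reduced each poll to a plain dict of SKU to quantity. `find_restocks` is a hypothetical helper name, and fetching the quantities is left out.

```python
def find_restocks(previous, current):
    """Return sorted SKUs whose quantity went from 0 (or less) to a positive value.

    Both arguments map SKU -> quantity. SKUs missing from the previous poll
    are treated as new products, not restocks. A quantity of None counts as 0.
    """
    restocked = []
    for sku, qty in current.items():
        if sku not in previous:
            continue
        was_out = (previous[sku] or 0) <= 0
        now_in = (qty or 0) > 0
        if was_out and now_in:
            restocked.append(sku)
    return sorted(restocked)
```

For example, `find_restocks({"A": 0, "B": 5}, {"A": 3, "B": 4, "C": 9})` returns `["A"]`: B never sold out and C is new. The returned list is what you would hand to an alerting step.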

For related eCommerce scraping tutorials, see how to scrape Target product data, scraping Amazon product data, and scraping Etsy.

Wrapping up

The biggest surprise from testing 8 stores was how many still have products.json enabled – 6 out of 8 returned data directly. For those, the scraper finishes in under a minute. The fallback path takes longer (one request per product URL), but it covers the 2 stores that disable the JSON endpoint.

If you're scraping more than a few stores, add proxies from the start. Cloudflare detects patterns across stores and blocks appear sooner than expected. The complete script includes proxy support. Uncomment the configuration block and add your credentials.
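If you target several countries, a small helper can build the proxy dict instead of editing the commented-out block by hand. This is a sketch that reuses the `user-{name}-country-{cc}` username format and the gate.decodo.com:7000 endpoint from the script's configuration. `build_proxies` is a hypothetical helper, and percent-encoding the credentials is an added precaution for passwords with special characters, not something the original block does.

```python
from urllib.parse import quote


def build_proxies(username, password, country="us",
                  host="gate.decodo.com", port=7000):
    """Return a proxies dict in the shape the script's PROXIES setting expects."""
    # Encode credentials so characters like "@" or ":" do not break the URL
    user = quote(username, safe="")
    secret = quote(password, safe="")
    proxy_url = f"http://user-{user}-country-{country}:{secret}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}
```

Assigning `PROXIES = build_proxies("YOUR_PROXY_USERNAME", "YOUR_PROXY_PASSWORD", country="de")` would route requests through the same gateway with a different country parameter.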


About the author

Lukas Mikelionis

Senior Account Manager

Lukas is a seasoned enterprise sales professional with extensive experience in the SaaS industry. Throughout his career, he has built strong relationships with Fortune 500 technology companies, developing a deep understanding of complex enterprise needs and strategic account management.




Frequently asked questions

How does products.json work?

Append /products.json to any Shopify store domain. It returns up to 250 products per page as JSON: titles, prices, variants, images, and stock status. Add ?page=N&limit=250 to paginate. No authentication is needed on public stores, but some merchants disable it.

What data can you scrape from Shopify?

Prices, inventory, variants, and images from /products.json. Exact stock counts and barcodes from the .js endpoint. Product URLs from the XML sitemap. These endpoints cover product catalog data, not orders, customers, or analytics.

What if products.json is disabled?

Parse the sitemap for product URLs, then extract the JSON-LD block from each product page. Most Shopify themes include name, price, and availability in schema.org format. We tested this on SKIMS and Gymshark – both returned data through the HTML extraction method.

What proxies work for Shopify scraping?

Residential proxies. We tested 8 stores and direct requests without proxies were blocked on several. Residential IPs from real devices passed without issues. For region-specific pricing, use geo-targeted proxies with a country parameter.


© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved