How to Scrape Shopify Stores: Complete Developer Guide

Most Shopify stores have a built-in JSON endpoint for product data: prices, variants, inventory, images. Web scraping Shopify means requesting /products.json, paginating, and getting the catalog as JSON. But the endpoint is limited to 250 products per page, and some merchants disable it. This guide covers both: the JSON approach for stores that have it, and the fallback for stores that don't.

Lukas Mikelionis

Last updated: Apr 22, 2026

TL;DR

Web scraping Shopify stores starts with the /products.json endpoint, which returns structured product data without HTML parsing.
The endpoint works on most public stores: 6 out of 8 we tested returned data directly.
For stores that disable it, the XML sitemap plus JSON-LD extraction from product pages covers the core fields (name, price, availability).
You need Python, curl_cffi (to present a browser-like TLS fingerprint to Cloudflare), and, optionally, residential proxies for multi-store scraping.
The complete standalone script at the end handles both approaches automatically. Change the URL and run it.

What Shopify product data is worth scraping

Each use case maps to specific fields in the products.json response:

Competitor price monitoring. Extract price, compare_at_price, and variants across multiple stores on a daily schedule. A compare_at_price that is set and higher than price indicates an active promotion.
Product trend research. Extract title, tags, and product_type to map which categories a brand is expanding into.
Inventory tracking. The available boolean indicates stock status per variant. Use inventory_quantity via /products/{handle}.js for exact counts.
Feed generation. The products.json response maps to most product feed schemas with minimal transformation.

The web scraping for market research guide describes additional competitor research patterns.

Confirm a site runs on Shopify

Before writing any scraping code, install curl_cffi and verify your target site runs on Shopify:

pip install curl_cffi beautifulsoup4

The fastest detection method is to request the products.json endpoint and check for valid JSON:

from curl_cffi import requests

def is_shopify_store(url):
    try:
        response = requests.get(
            f"{url}/products.json?limit=1",
            impersonate="chrome",
            timeout=10
        )
        if response.status_code == 200:
            data = response.json()
            return "products" in data
    except Exception:
        pass
    return False

print(is_shopify_store("https://www.allbirds.com"))
print(is_shopify_store("https://example.com"))

Allbirds runs on Shopify, so the function returns True. example.com doesn't, so it returns False:

True
False

The code uses curl_cffi with impersonate="chrome" because most Shopify stores run behind Cloudflare, which checks TLS fingerprints. Plain Requests with a fake User-Agent still looks like Python at the TLS layer.

If the endpoint returns a 403 or 404, the store may still run on Shopify, but with the JSON endpoint disabled. You can check 2 additional signals: asset URLs pointing to cdn.shopify.com in the page source, and an x-shopify-shop-id response header.
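The two secondary signals can be checked with a small helper. This is a minimal sketch, not part of the original scripts: the names shopify_signals and looks_like_shopify are hypothetical, and it assumes you pass in the page source and the response headers from a request to the store's homepage (for example response.text and response.headers).

```python
def shopify_signals(html, headers):
    """Return which secondary Shopify signals a response shows.

    html:    page source as a string
    headers: mapping of response header names to values
    """
    # Header names are case-insensitive, so normalize before checking
    header_names = {name.lower() for name in headers}
    return {
        "cdn_assets": "cdn.shopify.com" in html,
        "shop_id_header": "x-shopify-shop-id" in header_names,
    }


def looks_like_shopify(html, headers):
    """True if at least one secondary signal is present."""
    return any(shopify_signals(html, headers).values())
```

Neither signal is proof on its own, so treat a positive result as a hint to try the sitemap and JSON-LD route rather than as a guarantee.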

Test your target store first

Before targeting a specific store, run this diagnostic. It checks which endpoints are open and what scraping approach to use:

from curl_cffi import requests
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import json

def diagnose_shopify_store(store_url):
    """Run a full diagnostic on a Shopify store."""
    report = {"store": store_url}
    ns = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    # 1. products.json availability + get a sample handle
    sample_handle = None
    try:
        resp = requests.get(
            f"{store_url}/products.json?limit=1",
            impersonate="chrome", timeout=10
        )
        report["products_json"] = resp.status_code
        if resp.status_code == 200:
            data = resp.json()
            report["is_shopify"] = "products" in data
            if data.get("products"):
                sample_handle = data["products"][0]["handle"]
        else:
            report["is_shopify"] = False
    except Exception:
        report["products_json"] = "error"
        report["is_shopify"] = False

    # 2. Catalog size from sitemap (checks all product sub-sitemaps)
    try:
        resp = requests.get(
            f"{store_url}/sitemap.xml",
            impersonate="chrome", timeout=10
        )
        root = ET.fromstring(resp.content)
        total = 0
        for sm in root.findall("ns:sitemap", ns):
            loc = sm.find("ns:loc", ns).text
            if "products" in loc:
                r2 = requests.get(
                    loc, impersonate="chrome", timeout=15
                )
                r2root = ET.fromstring(r2.content)
                total += sum(
                    1 for u in r2root.findall("ns:url", ns)
                    if "/products/" in u.find("ns:loc", ns).text
                )
        report["catalog_size"] = total if total else None
    except Exception:
        report["catalog_size"] = None

    # 3. /products/{handle}.js endpoint
    if sample_handle:
        try:
            r = requests.get(
                f"{store_url}/products/{sample_handle}.js",
                impersonate="chrome", timeout=10
            )
            report["js_endpoint"] = r.status_code == 200
        except Exception:
            report["js_endpoint"] = False

    # 4. /collections.json endpoint
    try:
        r = requests.get(
            f"{store_url}/collections.json?limit=1",
            impersonate="chrome", timeout=10
        )
        report["collections_json"] = (
            r.status_code == 200 and "collections" in r.json()
        )
    except Exception:
        report["collections_json"] = False

    # 5. JSON-LD on HTML product pages
    if sample_handle:
        try:
            r = requests.get(
                f"{store_url}/products/{sample_handle}",
                impersonate="chrome", timeout=15
            )
            soup = BeautifulSoup(r.text, "html.parser")
            has_jsonld = False
            for s in soup.find_all(
                "script", type="application/ld+json"
            ):
                try:
                    d = json.loads(s.string)
                    items = d if isinstance(d, list) else [d]
                    for item in items:
                        if item.get("@type") in (
                            "Product", "ProductGroup"
                        ):
                            has_jsonld = True
                except Exception:
                    continue
            report["jsonld_on_html"] = has_jsonld
        except Exception:
            report["jsonld_on_html"] = False

    return report

report = diagnose_shopify_store("https://www.allbirds.com")
for key, value in report.items():
    print(f"  {key:20} {value}")

For Allbirds, every endpoint is open:

  store                https://www.allbirds.com
  products_json        200
  is_shopify           True
  catalog_size         1693
  js_endpoint          True
  collections_json     True
  jsonld_on_html       True

Diagnostic results across famous Shopify stores

2 out of 8 stores block products.json. JSON-LD presence varies by theme:

| Store | products.json | Catalog (sitemap) | .js | /collections.json | JSON-LD | Recommended path |
|---|---|---|---|---|---|---|
| allbirds.com | 200 | 1,693 | ✅ | ✅ | ✅ | products.json + .js |
| gymshark.com | 403 | 3,944 | blocked | blocked | ✅ | sitemap + JSON-LD fallback |
| taylorstitch.com | 200 | 1,610 | ✅ | ✅ | ✅ | products.json + .js |
| skims.com | 404 | 3,092 | blocked | | ✅ | sitemap + JSON-LD fallback |
| fentybeauty.com | 200 | 773 | ✅ | ✅ | ✅ | products.json + .js |
| kyliecosmetics.com | 200 | 242 | ✅ | ✅ | ✅ | products.json + .js |
| jeffreestarcosmetics.com | 200 | 238 | ✅ | | ❌ | products.json + HTML selectors |
| redbullshopus.com | 200 | 685 | ✅ | ✅ | ✅ | products.json + .js |

Merchant-disabled vs. anti-bot blocked. When a store returns 403 or 404 on products.json, the cause matters. A merchant-disabled endpoint (common on Shopify Plus and headless setups) stays blocked regardless of IP. An anti-bot block is often removed when you use a residential proxy.

To distinguish them, retry the same request through a residential proxy. If the status code stays the same, the merchant likely disabled it. If it changes to 200, it was anti-bot blocking:

from curl_cffi import requests

proxy_url = (
    "http://user-YOUR_PROXY_USERNAME-country-us"
    ":YOUR_PROXY_PASSWORD@gate.decodo.com:7000"
)
proxies = {"http": proxy_url, "https": proxy_url}

for store in ["https://www.gymshark.com", "https://skims.com"]:
    direct = requests.get(
        f"{store}/products.json?limit=1",
        impersonate="chrome", timeout=15
    ).status_code
    via_proxy = requests.get(
        f"{store}/products.json?limit=1",
        impersonate="chrome",
        proxies=proxies, timeout=45
    ).status_code
    print(f"{store}: direct={direct}, via US proxy={via_proxy}")

Neither store's status code changes through the proxy:

https://www.gymshark.com: direct=403, via US proxy=403
https://skims.com: direct=404, via US proxy=404

Both are merchant-disabled. For these stores, you skip products.json and use the sitemap to discover product URLs, then extract JSON-LD from each HTML page. Proxies still help during HTML scraping because you're making thousands of individual page requests, but they don't re-enable the disabled endpoint.
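The decision rule for telling the two cases apart can be written as a small function. This is a sketch under the assumptions stated in the text: classify_block is a hypothetical name, and the labels it returns are illustrative rather than anything Shopify or Cloudflare reports.

```python
def classify_block(direct_status, proxy_status):
    """Interpret products.json status codes from a direct request
    and from the same request sent through a residential proxy."""
    if direct_status == 200:
        return "open"
    if proxy_status == 200:
        # The proxy got through, so the block was tied to the IP
        return "anti-bot block"
    if proxy_status == direct_status:
        # Same refusal from a different IP points at the store config
        return "likely merchant-disabled"
    # Different error codes: not enough evidence either way
    return "inconclusive"
```

For the two stores above, classify_block(403, 403) and classify_block(404, 404) both land in the merchant-disabled branch.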

Scrape Shopify product data with products.json

Test the endpoint in a browser first. Paste https://www.allbirds.com/products.json?limit=1, and the browser shows raw JSON.

[Screenshot: Allbirds product JSON (allbirds.com/products.json?limit=1) showing product fields and variant prices in a browser window]

Set up the request function

Every scraping method below uses the same retry wrapper. Define it once:

from curl_cffi import requests
import time

MAX_RETRIES = 3

def fetch_with_retry(url, max_retries=MAX_RETRIES):
    """GET with exponential backoff on failure or rate limit."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(
                url, impersonate="chrome", timeout=30
            )
            if resp.status_code == 429:
                wait = 2 ** (attempt + 1)
                print(f"  Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            resp.raise_for_status()
            return resp
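        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    # Every attempt was rate limited: raise instead of returning
    # None, which would crash callers on response.json()
    raise RuntimeError(
        f"Still rate limited after {max_retries} attempts: {url}"
    )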

Fetch and paginate

The scraper paginates through the full catalog:

import json
from datetime import datetime

STORE_URL = "https://www.allbirds.com"

def scrape_shopify_products(store_url):
    """Scrape all products via the products.json endpoint."""
    all_products = []
    page = 1

    while True:
        url = f"{store_url}/products.json?limit=250&page={page}"
        response = fetch_with_retry(url)
        products = response.json().get("products", [])

        if not products:
            break

        all_products.extend(products)
        print(f"Page {page}: {len(products)} products")
        page += 1
        time.sleep(2)

    return all_products

products = scrape_shopify_products(STORE_URL)
print(f"Total: {len(products)} products")

Allbirds has 917 products across 4 pages:

Page 1: 250 products
Page 2: 250 products
Page 3: 250 products
Page 4: 167 products
Total: 917 products

Without the 2-second delay, Shopify can return 429 responses or empty arrays. The default limit is 30. Set limit=250 to reduce total requests.

Note that catalog sizes fluctuate as stores update inventory. The diagnostic section shows 1,693 from the sitemap (includes color-variant URLs), while products.json returned 917 unique product entries during this test. Both numbers are correct – they count different things.

Parse key product fields

The endpoint returns prices as strings (like "110.00"), not numbers. Convert them if you need numeric comparisons:

for product in products[:1]:
    print(f"Title:   {product['title']}")
    print(f"Handle:  {product['handle']}")
    print(f"Vendor:  {product['vendor']}")
    print(f"Type:    {product['product_type']}")

    for variant in product.get("variants", [])[:1]:
        price = float(variant["price"])
        compare = variant.get("compare_at_price")
        print(f"  Variant:    {variant['title']}")
        print(f"  Price:      ${price:.2f}")
        if compare:
            print(f"  Compare at: ${float(compare):.2f}")
        else:
            print("  Compare at: N/A")
        print(f"  SKU:        {variant.get('sku', 'N/A')}")
        print(f"  Available:  {variant.get('available')}")
    print("---")

The first product in the response is a slip-on with one variant shown:

Title:   Women's Cruiser Slip On Terry - Warm White (Warm White Sole)
Handle:  womens-cruiser-slip-on-terry
Vendor:  Allbirds
Type:    Shoes
  Variant:    5
  Price:      $110.00
  Compare at: N/A
  SKU:        A12372W050
  Available:  True

compare_at_price is null when there's no sale. When a promotion is active, it holds the original price, and price holds the discounted value. For exact inventory counts, the /products/{handle}.js endpoint (see "Get more detailed product data") has inventory_quantity.
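For price monitoring, the relationship between the two fields can be turned into a discount percentage. The helper below is a sketch, not part of the original scraper: discount_percent is a hypothetical name, and it assumes the products.json variant shape shown above, where both prices are strings or compare_at_price is null.

```python
from decimal import Decimal


def discount_percent(variant):
    """Return the discount as a percentage rounded to one decimal
    place, or None when the variant is not on sale."""
    compare = variant.get("compare_at_price")
    if not compare:
        return None  # null or empty: no promotion recorded
    price = Decimal(variant["price"])
    original = Decimal(compare)
    if original <= price:
        return None  # compare-at price set but not above the price
    return ((original - price) / original * 100).quantize(Decimal("0.1"))
```

Decimal avoids the rounding noise that float arithmetic can introduce when you compare prices between runs.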

Export to CSV

Flatten variants into one row each for CSV export:

import csv

csv_file = f"allbirds_products_{datetime.now():%Y%m%d}.csv"
with open(csv_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow([
        "title", "handle", "vendor", "type",
        "variant", "sku", "price",
        "compare_at_price", "available"
    ])
    for p in products:
        for v in p.get("variants", []):
            writer.writerow([
                p["title"], p["handle"], p["vendor"],
                p["product_type"], v["title"],
                v.get("sku", ""), v["price"],
                v.get("compare_at_price", ""),
                v.get("available", "")
            ])

print(f"Exported to {csv_file}")

The script writes a date-stamped file like allbirds_products_20260414.csv. The first few rows of the CSV look like this:

[Screenshot: CSV rows for 'Women's Cruiser Slip On Terry - Warm White (Warm White Sole)', one row per variant with its SKU]

Each product has multiple size variants (6–10 per product on Allbirds), so the CSV output is several thousand rows. For JSON output with metadata, the guide to saving scraped data describes additional export patterns.

Get more detailed product data with the .js endpoint

The listing endpoint omits inventory counts, barcodes, and media dimensions. The /products/{handle}.js endpoint has them:

from curl_cffi import requests

def get_product_details(store_url, handle):
    """Fetch enriched product data via the .js endpoint."""
    url = f"{store_url}/products/{handle}.js"
    response = fetch_with_retry(url)
    return response.json()

data = get_product_details(
    "https://www.allbirds.com", "mens-tree-runners"
)
print(f"Title:       {data['title']}")
print(f"Available:   {data['available']}")
print(f"Price range: "
      f"${data['price_min']/100:.2f} - "
      f"${data['price_max']/100:.2f}")
print(f"Variants:    {len(data['variants'])}")
print(f"Media items: {len(data.get('media', []))}")

v = data["variants"][0]
print("\nEnriched variant fields:")
print(f"  SKU:            {v['sku']}")
print(f"  Price:          ${v['price']/100:.2f}")
print(f"  Inventory qty:  {v.get('inventory_quantity')}")
print(f"  Barcode:        {v.get('barcode')}")

The output for the Men's Tree Runner includes the extra fields that products.json doesn't have:

Title:       Men's Tree Runner - Jet Black (White Sole)
Available:   True
Price range: $100.00 - $100.00
Variants:    7
Media items: 4

Enriched variant fields:
  SKU:            TR3MJBW080
  Price:          $100.00
  Inventory qty:  305
  Barcode:        843416184854

The .js endpoint returns prices in the minor currency unit, unlike the listing endpoint. For USD, that means cents (10000 = $100.00). For GBP, that means pence (10000 = £100.00). The inventory_quantity field is the reported stock count, 305 units at the time of scraping. The media[] array includes aspect_ratio, width, height, and media_type for each item.

| Field | /products.json | /products/{handle}.js |
|---|---|---|
| title, handle, vendor | ✅ | ✅ |
| price format | string ("110.00") | cents (11000) |
| compare_at_price | ✅ | ✅ |
| available | ✅ | ✅ |
| inventory_quantity | ❌ | ✅ |
| barcode | ❌ | ✅ |
| media[] with dimensions | ❌ | ✅ |
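Because the two endpoints report prices differently, it helps to normalize both to the same type before comparing them. This is a minimal sketch with hypothetical helper names (listing_price, js_price). It assumes a currency with two decimal places, as in the USD and GBP examples above.

```python
from decimal import Decimal


def listing_price(value):
    """Convert a products.json price string such as "110.00"."""
    return Decimal(value)


def js_price(value):
    """Convert a .js endpoint price in minor units such as 11000.

    Assumes 100 minor units per major unit (cents, pence).
    """
    return Decimal(value) / 100
```

With both values as Decimal, a price from one endpoint can be compared directly against the same variant's price from the other.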

Scrape a specific Shopify collection

Append /products.json to any collection URL to get only the products in that category.

The function below reuses fetch_with_retry() from the setup section:

def scrape_collection(store_url, collection_handle):
    """Scrape products from a specific collection."""
    all_products = []
    page = 1

    while True:
        url = (
            f"{store_url}/collections/{collection_handle}"
            f"/products.json?limit=250&page={page}"
        )
        response = fetch_with_retry(url)

        products = response.json().get("products", [])
        if not products:
            break

        all_products.extend(products)
        print(f"Page {page}: {len(products)} products")
        page += 1
        time.sleep(2)

    return all_products

sneakers = scrape_collection(
    "https://www.allbirds.com", "mens-sneakers"
)
print(f"Total in mens-sneakers: {len(sneakers)} products")

The mens-sneakers collection is returned in a single page:

Page 1: 241 products
Total in mens-sneakers: 241 products

If you only need pricing data for one category, send a single request to /collections/mens-sneakers/products.json?limit=250. That returns 241 products without paginating the full catalog.

List all collections

The /collections.json endpoint returns the category structure. On Allbirds, that's 1,334 collections across 6 pages:

def get_all_collections(store_url):
    """Fetch all collections from a Shopify store."""
    all_collections = []
    page = 1

    while True:
        url = (
            f"{store_url}/collections.json"
            f"?limit=250&page={page}"
        )
        response = fetch_with_retry(url)
        collections = response.json().get("collections", [])
        if not collections:
            break

        all_collections.extend(collections)
        print(f"Page {page}: {len(collections)} collections")
        page += 1
        time.sleep(1)

    return all_collections

collections = get_all_collections("https://www.allbirds.com")
print(f"\nTotal: {len(collections)} collections")
for c in collections[:3]:
    print(f"  {c['handle']} ({c.get('products_count', 0)} products)")

The run pages through all six responses, then prints the first three collections:

Page 1: 250 collections
Page 2: 250 collections
Page 3: 250 collections
Page 4: 250 collections
Page 5: 250 collections
Page 6: 84 collections

Total: 1334 collections
  womens-accessories (28 products)
  add-on-essentials (97 products)
  mens-sneakers (241 products)

Extract product URLs from the XML sitemap

Shopify appends query parameters to sitemap child URLs. Parse the parent sitemap at /sitemap.xml first to get the exact product sitemap path. The function uses fetch_with_retry() from the setup section:

import xml.etree.ElementTree as ET

def get_product_urls(store_url):
    """Extract all product URLs from the Shopify sitemap."""
    response = fetch_with_retry(f"{store_url}/sitemap.xml")
    root = ET.fromstring(response.content)
    ns = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    # Collect ALL product sub-sitemaps (large stores have multiple)
    product_sitemaps = [
        sm.find("ns:loc", ns).text
        for sm in root.findall("ns:sitemap", ns)
        if "products" in sm.find("ns:loc", ns).text
    ]

    if not product_sitemaps:
        return []

    # Parse product URLs from each sub-sitemap
    urls = []
    for sitemap_url in product_sitemaps:
        response = fetch_with_retry(sitemap_url)
        sub_root = ET.fromstring(response.content)
        urls.extend(
            url_elem.find("ns:loc", ns).text
            for url_elem in sub_root.findall("ns:url", ns)
            if "/products/" in url_elem.find("ns:loc", ns).text
        )

    return urls

product_urls = get_product_urls("https://www.allbirds.com")
print(f"Found {len(product_urls)} product URLs")
for url in product_urls[:3]:
    print(url)

The function extracts all product URLs from every sub-sitemap:

Found 917 product URLs
https://www.allbirds.com/products/mens-wool-runners-natural-black
https://www.allbirds.com/products/mens-wool-runners
https://www.allbirds.com/products/mens-wool-runners-natural-white

The sitemap returns URLs, not product data. Combine it with per-page scraping for stores that disable products.json.

Deduplication note. The sitemap count is often higher than the products.json count. Color variants share a parent product but have separate sitemap URLs. Group by the product handle or product_id rather than treating each URL as unique.
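Grouping by handle can be sketched with the standard library. The helper names below (handle_from_url, unique_by_handle) are hypothetical. This version only collapses URLs that resolve to the same handle, for example copies that differ by a locale prefix, a query string, or a trailing slash. Color variants published under their own handles would still need product_id from the scraped data to be grouped.

```python
from urllib.parse import urlparse


def handle_from_url(url):
    """Return the product handle from a /products/{handle} URL."""
    path = urlparse(url).path.rstrip("/")
    _, _, tail = path.partition("/products/")
    return tail.split("/")[0] if tail else None


def unique_by_handle(urls):
    """Keep the first URL seen for each handle, preserving order."""
    seen = {}
    for url in urls:
        handle = handle_from_url(url)
        if handle and handle not in seen:
            seen[handle] = url
    return list(seen.values())
```

Running the sitemap output through unique_by_handle() before the per-page loop avoids fetching the same product page more than once.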

Scrape Shopify HTML when products.json is disabled

Some Shopify Plus merchants disable public access to products.json. When the endpoint returns a 403 or 404, extract product data from the HTML pages instead.

Many Shopify themes embed a <script type="application/ld+json"> block on product pages with schema.org Product structured data. The schema varies – some themes use Product, others use ProductGroup with nested hasVariant arrays.

from curl_cffi import requests
from bs4 import BeautifulSoup
import json

def scrape_product_from_html(url):
    """Extract product data from JSON-LD on a product page."""
    response = fetch_with_retry(url)
    soup = BeautifulSoup(response.text, "html.parser")

    for script in soup.find_all(
        "script", type="application/ld+json"
    ):
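        try:
            data = json.loads(script.string)
        except (json.JSONDecodeError, TypeError):
            continue

        # A JSON-LD block can hold one object or a list of them
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue

            if item.get("@type") == "Product":
                return item

            # Some themes use ProductGroup; return the first
            # variant, carrying over the group's name and brand
            if item.get("@type") == "ProductGroup":
                variants = item.get("hasVariant", [])
                if variants:
                    first = dict(variants[0])
                    first["name"] = item.get("name")
                    first["brand"] = item.get("brand")
                    return first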
    return None

product = scrape_product_from_html(
    "https://www.taylorstitch.com/products/"
    "drift-boardshort-in-ivory-floral-2604"
)
if product:
    offers = product.get("offers", {})
    if isinstance(offers, list):
        offers = offers[0]
    print(f"Name:      {product.get('name')}")
    print(f"Price:     {offers.get('price')}")
    print(f"Currency:  {offers.get('priceCurrency')}")
    print(f"Available: {offers.get('availability')}")
    print(f"SKU:       {product.get('sku')}")

The JSON-LD on this Taylor Stitch product page has the same core fields:

Name:      The Drift Boardshort
Price:     118.0
Currency:  USD
Available: https://schema.org/InStock
SKU:       2604DBIF28

Not every Shopify theme includes JSON-LD. The diagnostic table shows Jeffree Star Cosmetics as an example that lacks it, while most others embed a ProductGroup block. For stores without JSON-LD, use data- attributes like data-product-id and data-handle rather than CSS class names. For a deeper tutorial on HTML parsing, see the Beautiful Soup web scraping guide.
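Reading data- attributes can be sketched as follows. This example uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra dependencies. DataAttrParser and extract_data_attrs are hypothetical names, and whether a theme emits data-product-id or data-handle at all is an assumption you need to verify against the target store's markup.

```python
from html.parser import HTMLParser


class DataAttrParser(HTMLParser):
    """Collect the first data-product-id and data-handle values."""

    WANTED = ("data-product-id", "data-handle")

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names
        for name, value in attrs:
            if name in self.WANTED and value and name not in self.found:
                self.found[name] = value


def extract_data_attrs(html):
    """Return a dict of the wanted data- attributes found in html."""
    parser = DataAttrParser()
    parser.feed(html)
    return parser.found
```

An empty dict means the theme doesn't expose these attributes, in which case you need selectors specific to that store.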

Build a fallback pipeline

Combine both methods into a single function that tries products.json first and falls back to HTML:

def scrape_store(store_url):
    """Try JSON endpoint, fall back to HTML scraping."""
    try:
        products = scrape_shopify_products(store_url)
        if products:
            print(f"JSON endpoint: {len(products)} products")
            return products
    except Exception as e:
        print(f"products.json failed: {e}")

    product_urls = get_product_urls(store_url)
    products = []
    failed = []

    for url in product_urls:
        try:
            product = scrape_product_from_html(url)
            if product:
                products.append(product)
        except Exception as e:
            failed.append({"url": url, "error": str(e)})
        time.sleep(1)

    if failed:
        with open("failed_urls.json", "w") as f:
            json.dump(failed, f, indent=2)
        print(f"{len(failed)} URLs failed - logged")

    print(f"HTML fallback: {len(products)} products")
    return products

Failed URLs are logged to failed_urls.json for retry:

[
  {"url": "https://store.com/products/broken-link", "error": "HTTP Error 404"},
  {"url": "https://store.com/products/timeout", "error": "ReadTimeout"}
]

For large catalogs (1,000+ URLs), the HTML loop takes 15–30 minutes at 1 request per second. The complete script at the end includes progress logging.
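The runtime estimate is simple arithmetic: each URL costs the fixed delay plus however long the request itself takes. The sketch below makes that explicit. estimate_minutes is a hypothetical helper, and the 0.5-second average response time is an assumption for illustration, not a measured value.

```python
def estimate_minutes(url_count, delay=1.0, avg_response=0.5):
    """Estimate the duration of a sequential scrape in minutes.

    delay:        seconds slept between requests
    avg_response: assumed average seconds per request
    """
    return url_count * (delay + avg_response) / 60
```

With these assumptions, 1,000 URLs work out to 25 minutes, which sits inside the 15–30 minute range quoted above. Slower pages or retries push the figure higher.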

Handle rate limits and anti-bot protections

Shopify doesn't publish official rate limits for products.json, but enforcement exists. Rapid requests from the same IP can trigger 429 responses or silent empty arrays.

Respect the request cadence

A 2-second delay between paginated requests avoids most rate limit issues. For individual product pages through HTML, use 1–2 seconds. Larger gaps (5–10 seconds) are safer when scraping multiple stores in sequence. For more retry patterns, see the retry guide for Python requests.

Common failure patterns to recognize. These are the signatures you'll see in the terminal when a request fails:

Products: 0 after a successful 200 response → silent rate limit, treat it as a 429 and back off
curl_cffi.requests.exceptions.HTTPError: HTTP Error 403 on .myshopify.com domains → merchant has password-protected the store (B2B or pre-launch) and no proxy helps
xml.etree.ElementTree.ParseError: no element found on sitemap → the store's sitemap URL needs query parameters; fetch the parent /sitemap.xml first to get the exact child URL
JSONDecodeError on products.json → the endpoint returned HTML instead of JSON (usually a 5xx error page or a Cloudflare challenge); retry through a residential proxy
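The patterns above can be folded into one classifier that a scraper calls before parsing a products.json response. This is a sketch: classify_response and the action labels it returns are hypothetical. It adds one assumption to the list above: an empty array on a later page may simply be the end of the catalog, so only page 1 is treated as a definite silent rate limit.

```python
import json


def classify_response(status_code, body, page=1):
    """Map a products.json response to a suggested next action."""
    if status_code == 429:
        return "back_off"
    if status_code in (403, 404):
        return "use_html_fallback"
    if status_code != 200:
        return "retry"
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # HTML came back instead of JSON (error or challenge page)
        return "retry_via_proxy"
    if not isinstance(data, dict) or not data.get("products"):
        # Empty on page 1 is suspicious; later it may be the last page
        return "back_off" if page == 1 else "maybe_done"
    return "ok"
```

A "maybe_done" result is ambiguous by nature. Comparing the running total against the sitemap count is one way to decide whether to retry the page.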

Some stores require cookie consent before serving content. Pass a consent cookie in the request headers or use browser automation to handle the consent flow. If is_shopify_store() returns False or the diagnostic reports a 403 or 404 for products.json, switch to the HTML extraction method.

Rotate proxies for multi-store scraping

Scraping a single Shopify store from one IP address works for small catalogs. But when you scrape multiple stores on a recurring schedule, Cloudflare is likely to flag your IP. Most Shopify stores route traffic through Cloudflare, so the protection behaves similarly from store to store. Some headless setups (like Gymshark) use a different CDN such as AWS CloudFront, but curl_cffi is still good practice there because it keeps requests consistently browser-like.

Residential proxies use IP addresses assigned to real household devices, so they're far less likely to be blocked than datacenter IPs when you scrape Shopify at scale. Rotating proxies provide a different IP for each request automatically through a gateway.

Configure Decodo residential proxies

To get started, create an account and generate proxy credentials from the dashboard. The residential proxy quick start guide explains the full setup.

Everything goes through one endpoint: gate.decodo.com:7000. Location, session type, and duration are all controlled through username parameters:

user-USERNAME-country-COUNTRY:PASSWORD@gate.decodo.com:7000

For a random global IP without country targeting, drop the country parameter:

from curl_cffi import requests

# Replace YOUR_PROXY_USERNAME and YOUR_PROXY_PASSWORD
proxy_url = (
    "http://user-YOUR_PROXY_USERNAME"
    ":YOUR_PROXY_PASSWORD@gate.decodo.com:7000"
)

response = requests.get(
    "https://ip.decodo.com/json",
    impersonate="chrome",
    proxies={"http": proxy_url, "https": proxy_url},
    timeout=30
)

ip_data = response.json()
print(f"Proxy IP: {ip_data['proxy']['ip']}")
print(f"Country:  {ip_data['country']['name']}")
print(f"ISP:      {ip_data['isp']['isp']}")

The proxy provides a residential IP from a random country:

Proxy IP: 136.158.70.151
Country:  Philippines
ISP:      Globe Telecom

By default, each request gets a new IP. To keep the same IP across multiple requests (better for paginating a single store), add session and sessionduration to the username: user-USERNAME-session-1-sessionduration-10. The default session duration is 10 minutes.
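Because every option lives in the username, a small builder keeps the proxy URL consistent across scripts. This is a sketch: build_proxy_url is a hypothetical helper. The individual parameters come from the formats shown above, but placing country before session when both are used is an assumption to check against the provider's documentation.

```python
def build_proxy_url(username, password, country=None,
                    session=None, session_minutes=None):
    """Assemble a gateway URL from username parameters."""
    parts = ["user", username]
    if country:
        parts += ["country", country.lower()]
    if session is not None:
        # A session ID keeps the same IP across requests
        parts += ["session", str(session)]
        if session_minutes is not None:
            parts += ["sessionduration", str(session_minutes)]
    user = "-".join(parts)
    return f"http://{user}:{password}@gate.decodo.com:7000"
```

Pass the result as both the http and https entries of the proxies dictionary, as in the examples above. Use a sticky session while paginating one store and a fresh session ID when moving to the next.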

Target a specific country

For stores that serve different prices by region, add the country parameter to the username. This routes requests through a residential IP in that country:

from curl_cffi import requests

# US-targeted proxy via country parameter
proxy_url = (
    "http://user-YOUR_PROXY_USERNAME-country-us"
    ":YOUR_PROXY_PASSWORD@gate.decodo.com:7000"
)

response = requests.get(
    "https://www.allbirds.com/products.json?limit=1",
    impersonate="chrome",
    proxies={"http": proxy_url, "https": proxy_url},
    timeout=30
)

product = response.json()["products"][0]
print(f"Product: {product['title']}")
print(f"Price:   ${product['variants'][0]['price']}")

The request goes through a US IP and returns US pricing:

Product: Women's Cruiser Slip On Terry - Warm White (Warm White Sole)
Price:   $110.00

Scrape Shopify with the Decodo Web Scraping API

The code above works for most public Shopify stores. But at 50+ stores on a daily schedule, maintaining retry logic, proxy rotation, and fallback chains becomes its own project. The Decodo Web Scraping API reduces that to a single HTTP call per page. You need an API token. The Web Scraping API quick start guide explains how to get one:

from curl_cffi import requests
from bs4 import BeautifulSoup
import json

API_URL = "https://scraper-api.decodo.com/v2/scrape"
API_TOKEN = "YOUR_API_TOKEN"

# Scrape a Gymshark product page - this store
# blocks all JSON endpoints (products.json, .js, collections.json)
response = requests.post(
    API_URL,
    headers={
        "Accept": "application/json",
        "Authorization": f"Basic {API_TOKEN}",
        "Content-Type": "application/json"
    },
    json={
        "url": "https://www.gymshark.com/products/"
               "gymshark-running-t-shirt-ss-tops",
        "headless": "html",
        "proxy_pool": "premium"
    },
    timeout=90
)

result = response.json()["results"][0]
print(f"Status code: {result['status_code']}")

# Parse JSON-LD from the rendered HTML
soup = BeautifulSoup(result["content"], "html.parser")
for script in soup.find_all(
    "script", type="application/ld+json"
):
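    try:
        data = json.loads(script.string)
    except (json.JSONDecodeError, TypeError):
        continue
    # A JSON-LD block can hold one object or a list of them
    for item in data if isinstance(data, list) else [data]:
        if isinstance(item, dict) and item.get("@type") == "Product":
            offers = item.get("offers", {})
            if isinstance(offers, list):
                offers = offers[0] if offers else {}
            print(f"Product: {item['name']}")
            print(f"Price:   ${offers.get('price')}")
            print(f"In stock: {offers.get('availability')}")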

The API renders the Gymshark page and returns the product data that direct requests cannot retrieve:

Status code: 200
Product: Running T-Shirt
Price:   $30.4
In stock: https://schema.org/InStock

The API has a free tier for testing. It also supports geo for country targeting and browser actions for JavaScript-rendered pages. The full parameter list is in the API docs.

Automate scraping on a schedule

Price monitoring and inventory tracking need recurring runs.

Schedule with cron

On Linux or macOS, add a cron job to run the scraper at a fixed time. Open the crontab with crontab -e and add:

# Run Shopify scraper daily at 3:00 AM
0 3 * * * cd /home/user/scraper && python shopify_scraper.py >> scraper.log 2>&1

On Windows, use Task Scheduler to create a daily trigger for the same script.

Add a lock file to prevent overlapping runs when the previous scrape hasn't finished:

import os
import sys

LOCK_FILE = "scraper.lock"

if os.path.exists(LOCK_FILE):
    print("Previous run still active. Exiting.")
    sys.exit(1)

try:
    with open(LOCK_FILE, "w") as f:
        f.write(str(os.getpid()))
    products = scrape_shopify_products(STORE_URL)
    # ... save output, export CSV ...
finally:
    if os.path.exists(LOCK_FILE):
        os.remove(LOCK_FILE)
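
The check-then-create pattern above leaves a short window in which two processes can both pass the os.path.exists() check. A minimal sketch of an atomic variant, assuming a local filesystem; single_instance and AlreadyRunning are hypothetical names introduced for illustration:

```python
import os
from contextlib import contextmanager


class AlreadyRunning(Exception):
    """Raised when another run already holds the lock file."""


@contextmanager
def single_instance(lock_path="scraper.lock"):
    """Create the lock file atomically and remove it when the block exits."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so the
        # existence check and the creation happen in a single step.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(lock_path) from None
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        yield
    finally:
        os.remove(lock_path)
```

Wrap the scrape call in "with single_instance():" and exit when AlreadyRunning is raised. As with the simpler version, a process that is killed abruptly can leave a stale lock file behind, so check the stored PID before deleting it by hand.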

Detect price changes between runs

Compare each scrape against the previous file using the product handle and variant ID as the composite key:

import json

def detect_changes(current_products, previous_file):
    """Compare current scrape against previous run."""
    try:
        with open(previous_file) as f:
            previous = json.load(f)["products"]
    except (FileNotFoundError, KeyError):
        return []

    prev_map = {}
    for p in previous:
        for v in p.get("variants", []):
            prev_map[f"{p['handle']}_{v['id']}"] = v

    changes = []
    for p in current_products:
        for v in p.get("variants", []):
            key = f"{p['handle']}_{v['id']}"
            prev = prev_map.get(key)

            if not prev:
                changes.append({
                    "handle": p["handle"],
                    "change": "new_product",
                    "price": v["price"]
                })
            elif prev["price"] != v["price"]:
                changes.append({
                    "handle": p["handle"],
                    "change": "price_changed",
                    "old_price": prev["price"],
                    "new_price": v["price"]
                })
    return changes

Push changed records to a database, spreadsheet, or notification system.
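
To wire this into a scheduled run, the scraper needs the most recent saved file and a readable form of each change record. The sketch below assumes the products_YYYYMMDD_HHMMSS.json naming used by the complete script; find_previous_file and format_change are hypothetical helpers:

```python
import glob
import os


def find_previous_file(directory=".", pattern="products_*.json"):
    """Return the newest saved scrape in the directory, or None."""
    # Timestamps in the form YYYYMMDD_HHMMSS sort chronologically as text.
    files = sorted(glob.glob(os.path.join(directory, pattern)))
    return files[-1] if files else None


def format_change(change):
    """Turn one detect_changes() record into a one-line message."""
    if change["change"] == "price_changed":
        return (f"{change['handle']}: "
                f"{change['old_price']} -> {change['new_price']}")
    return f"{change['handle']}: new at {change['price']}"
```

Call find_previous_file() before saving the current run, otherwise the newest file is the one just written and no changes are reported.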

For teams that prefer visual workflows over cron scripts, the Decodo n8n integration is an alternative.

Complete script

Save as shopify_scraper.py and run with python shopify_scraper.py:

"""
Shopify Store Scraper
"""

from curl_cffi import requests
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import json
import csv
import time
import sys
from datetime import datetime

# ---- Configuration ----
STORE_URL = "https://www.allbirds.com"

PROXY_USER = "YOUR_PROXY_USERNAME"
PROXY_PASS = "YOUR_PROXY_PASSWORD"
PROXIES = {
    "http": f"http://user-{PROXY_USER}-country-us:"
             f"{PROXY_PASS}@gate.decodo.com:7000",
    "https": f"http://user-{PROXY_USER}-country-us:"
             f"{PROXY_PASS}@gate.decodo.com:7000",
}

DELAY_BETWEEN_PAGES = 2
MAX_RETRIES = 3


# ---- Core functions ----
def fetch_with_retry(url, max_retries=MAX_RETRIES):
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, impersonate="chrome", proxies=PROXIES, timeout=30)
            if resp.status_code == 429:
                wait = 2 ** (attempt + 1)
                print(f"  Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            resp.raise_for_status()
            return resp
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2**attempt)
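    # Every attempt returned 429: fail loudly instead of returning None
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")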


def is_shopify_store(store_url):
    try:
        r = fetch_with_retry(f"{store_url}/products.json?limit=1")
        return r is not None and "products" in r.json()
    except Exception:
        return False


def scrape_products_json(store_url):
    all_products = []
    page = 1
    while True:
        url = f"{store_url}/products.json?limit=250&page={page}"
        r = fetch_with_retry(url)
        products = r.json().get("products", [])
        if not products:
            break
        all_products.extend(products)
        print(f"  Page {page}: {len(products)} products")
        page += 1
        time.sleep(DELAY_BETWEEN_PAGES)
    return all_products


def get_product_urls_from_sitemap(store_url):
    r = fetch_with_retry(f"{store_url}/sitemap.xml")
    root = ET.fromstring(r.content)
    ns = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    product_sitemaps = [
        sm.find("ns:loc", ns).text
        for sm in root.findall("ns:sitemap", ns)
        if "products" in sm.find("ns:loc", ns).text
    ]
    if not product_sitemaps:
        return []

    urls = []
    for sitemap_url in product_sitemaps:
        r = fetch_with_retry(sitemap_url)
        sub_root = ET.fromstring(r.content)
        urls.extend(
            u.find("ns:loc", ns).text
            for u in sub_root.findall("ns:url", ns)
            if "/products/" in u.find("ns:loc", ns).text
        )
    return urls


def extract_jsonld(url):
    r = fetch_with_retry(url)
    soup = BeautifulSoup(r.text, "html.parser")
    for s in soup.find_all("script", type="application/ld+json"):
        try:
            d = json.loads(s.string)
            items = d if isinstance(d, list) else [d]
            for item in items:
                if item.get("@type") == "Product":
                    return item
                if item.get("@type") == "ProductGroup":
                    variants = item.get("hasVariant", [])
                    if variants:
                        first = variants[0]
                        first["name"] = item.get("name")
                        return first
        except Exception:
            continue
    return None


def scrape_with_fallback(store_url):
    try:
        products = scrape_products_json(store_url)
        if products:
            return products, "products.json"
    except Exception as e:
        print(f"  products.json failed: {e}")

    print("  Falling back to sitemap + JSON-LD...")
    urls = get_product_urls_from_sitemap(store_url)
    products = []
    for i, url in enumerate(urls):
        if i % 50 == 0:
            print(f"  HTML scrape: {i}/{len(urls)}")
        try:
            p = extract_jsonld(url)
            if p:
                products.append(p)
        except Exception:
            pass
        time.sleep(1)
    return products, "sitemap+jsonld"


# ---- Output ----
def save_json(products, store_url, method):
    output = {
        "metadata": {
            "store_url": store_url,
            "scraped_at": datetime.now().isoformat(),
            "method": method,
            "total_products": len(products),
        },
        "products": products,
    }
    filename = f"products_{datetime.now():%Y%m%d_%H%M%S}.json"
    with open(filename, "w") as f:
        json.dump(output, f, indent=2)
    print(f"\nSaved {len(products)} products to {filename}")


def save_csv(products, store_url):
    filename = f"products_{datetime.now():%Y%m%d_%H%M%S}.csv"
    with open(filename, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(
            [
                "title",
                "handle",
                "vendor",
                "type",
                "variant",
                "sku",
                "price",
                "compare_at_price",
                "available",
            ]
        )
        for p in products:
            for v in p.get("variants", []):
                w.writerow(
                    [
                        p.get("title"),
                        p.get("handle"),
                        p.get("vendor"),
                        p.get("product_type"),
                        v.get("title"),
                        v.get("sku"),
                        v.get("price"),
                        v.get("compare_at_price"),
                        v.get("available"),
                    ]
                )
    print(f"Saved CSV to {filename}")


# ---- Entry point ----
def main():
    print(f"Target store: {STORE_URL}")
    if not is_shopify_store(STORE_URL):
        print("ERROR: target does not appear to be a Shopify store")
        sys.exit(1)

    print("Confirmed Shopify store. Starting scrape...")
    products, method = scrape_with_fallback(STORE_URL)

    if not products:
        print("ERROR: no products scraped")
        sys.exit(1)

    save_json(products, STORE_URL, method)
    save_csv(products, STORE_URL)
    print(f"\nDone. Method: {method}")


if __name__ == "__main__":
    main()

This is the output when running it against allbirds.com:

Target store: https://www.allbirds.com
Confirmed Shopify store. Starting scrape...
  Page 1: 250 products
  Page 2: 250 products
  Page 3: 250 products
  Page 4: 250 products
  Page 5: 250 products
  Page 6: 250 products
  Page 7: 193 products

Saved 1693 products to products_20260415_123517.json
Saved CSV to products_20260415_123517.csv

Done. Method: products.json

The JSON file includes metadata and the full product array:

{
  "metadata": {
    "store_url": "https://www.allbirds.com",
    "scraped_at": "2026-04-15T12:35:17.042869",
    "method": "products.json",
    "total_products": 1693
  },
  "products": [
    {
      "title": "Women's Cruiser Slip On Terry - Warm White (Warm White Sole)",
      "handle": "womens-cruiser-slip-on-terry",
      "vendor": "Allbirds",
      ...
    }
  ]
}

This is the structure that detect_changes() from the automation section reads with json.load(f)["products"].

Replace STORE_URL with your target and set PROXY_USER and PROXY_PASS. As written, the script sends every request through PROXIES; to run without a proxy, set PROXIES = None.
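
One way to switch the proxy on and off without editing the script is to build PROXIES from environment variables. This is a minimal sketch that reuses the gateway format from the configuration block; build_proxies and the environment variable names are hypothetical:

```python
import os


def build_proxies(user, password, country="us"):
    """Return a proxies dict, or None when credentials are missing."""
    if not user or not password:
        return None
    proxy_url = (f"http://user-{user}-country-{country}:"
                 f"{password}@gate.decodo.com:7000")
    return {"http": proxy_url, "https": proxy_url}


# DECODO_PROXY_USER / DECODO_PROXY_PASS are assumed names, not an official convention.
PROXIES = build_proxies(
    os.environ.get("DECODO_PROXY_USER"),
    os.environ.get("DECODO_PROXY_PASS"),
)
```

When both variables are unset, PROXIES is None and requests go out directly. This also keeps credentials out of the source file.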

We tested it against 4 stores with different endpoint configurations:

| Store | Method used | Result |
| --- | --- | --- |
| fentybeauty.com | products.json (direct) | 862 products |
| kyliecosmetics.com | products.json (direct) | 246 products |
| skims.com | sitemap + JSON-LD (fallback) | full catalog available |
| gymshark.com | sitemap + JSON-LD (fallback) | full catalog available |

The script handled all 4 without configuration changes. For SKIMS and Gymshark, the fallback retrieved 3,000+ product URLs via sitemap and extracted JSON-LD from each page.

What to build next

Here are 3 ideas to extend the scraper:

Multi-store price tracker. Loop through competitor store URLs on a daily cron schedule. Send the change detection output to a Slack webhook or email alert.
Collection-level trend monitor. Use /collections.json to map a store's full category structure, then track which collections add or remove products over time.
Inventory restock alerter. Poll inventory_quantity via the .js endpoint on high-demand SKUs. Trigger an alert when stock changes from 0 to a positive value.
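
As a starting point for the restock alerter, the per-variant available boolean from products.json is enough to detect a restock between two runs, without exact counts. A minimal sketch, assuming both inputs are product lists in the products.json shape; detect_restocks is a hypothetical helper:

```python
def detect_restocks(previous_products, current_products):
    """Return variants whose `available` flag flipped from False to True."""
    was_available = {
        (p["handle"], v["id"]): bool(v.get("available"))
        for p in previous_products
        for v in p.get("variants", [])
    }

    restocked = []
    for p in current_products:
        for v in p.get("variants", []):
            key = (p["handle"], v["id"])
            # Report only variants seen before and previously unavailable.
            if (key in was_available
                    and not was_available[key]
                    and v.get("available")):
                restocked.append({
                    "handle": p["handle"],
                    "variant_id": v["id"],
                    "title": v.get("title"),
                })
    return restocked
```

Variants that appear for the first time are ignored here, since detect_changes() already reports them as new.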

For related eCommerce scraping tutorials, see how to scrape Target product data, scraping Amazon product data, and scraping Etsy.

Wrapping up

The biggest surprise from testing 8 stores was how many still have products.json enabled – 6 out of 8 returned data directly. For those, the scraper finishes in under a minute. The fallback path takes longer (one request per product URL), but it covers the 2 stores that disable the JSON endpoint.

If you're scraping more than a few stores, add proxies from the start. Cloudflare detects patterns across stores, and blocks appear sooner than expected. The complete script includes proxy support: add your credentials to PROXY_USER and PROXY_PASS in the configuration block.

About the author

Lukas Mikelionis

Senior Account Manager

Lukas is a seasoned enterprise sales professional with extensive experience in the SaaS industry. Throughout his career, he has built strong relationships with Fortune 500 technology companies, developing a deep understanding of complex enterprise needs and strategic account management.

Connect with Lukas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

How does products.json work?

Append /products.json to any Shopify store domain. It returns up to 250 products per page as JSON: titles, prices, variants, images, and stock status. Add ?page=N&limit=250 to paginate. No authentication is needed on public stores, but some merchants disable it.

What data can you scrape from Shopify?

Prices, inventory, variants, and images from /products.json. Exact stock counts and barcodes from the .js endpoint. Product URLs from the XML sitemap. These endpoints cover product catalog data, not orders, customers, or analytics.

What if products.json is disabled?

Parse the sitemap for product URLs, then extract the JSON-LD block from each product page. Most Shopify themes include name, price, and availability in schema.org format. We tested this on SKIMS and Gymshark – both returned data through the HTML extraction method.

What proxies work for Shopify scraping?

Residential proxies. We tested 8 stores and direct requests without proxies were blocked on several. Residential IPs from real devices passed without issues. For region-specific pricing, use geo-targeted proxies with a country parameter.
