
Price Scraping: How To Build a Scraper, Test It, and Scale With Confidence


Price data is important for monitoring competitors in eCommerce, enforcing MAP policies, and receiving deal alerts. Doing this manually isn't effective for scaling. A practical approach is price scraping, which helps automatically collect product pricing data from eCommerce websites. This guide will show you how to build a Python scraper using Playwright. It will help you gather real prices, deal with anti-bot measures, and create structured JSON data.


TL;DR

  • Price scraping is the automated extraction of product pricing data, including name, price, currency, availability, and discount status from eCommerce websites
  • This guide builds a working Python scraper using Playwright for rendering and BeautifulSoup for parsing, with structured JSON output via Pydantic
  • Real eCommerce sites use CAPTCHA, IP blocks, JavaScript rendering, and obfuscated selectors. Each obstacle has a concrete fix.
  • Rotating residential proxies, randomized request timing, and realistic browser fingerprints are the three biggest factors in staying undetected
  • AI cuts the manual work of selector generation, price normalization, and anomaly detection, but adds cost and latency, so use it selectively
  • Price scraping legality varies by jurisdiction and target site ToS. Amazon explicitly prohibits it; publicly available product prices are generally lower risk.
  • For production use, store selectors in config files, validate every output record, monitor for silent failures, and schedule scrapes based on data volatility
  • For high-value or heavily protected targets, Decodo's eCommerce price scraper API returns structured pricing data without the maintenance burden

What is price scraping, and why does it matter

Price scraping is the process of collecting product pricing data from eCommerce websites using automated tools. A typical scrape gathers the product name, current price, currency, availability, and discount status. This data is collected repeatedly and at scale from hundreds or thousands of product pages.

Price scraping is a type of web scraping focused specifically on structured product data that updates often and has strong protection against bots. Unlike general scraping, which might collect blog content or contact details, price scraping targets information that impacts revenue decisions. As a result, the websites that try to protect this data take their defenses seriously.

Why businesses use it

Competitive price monitoring

Retailers keep an eye on their competitors' prices to make quick, informed decisions about their own pricing. For example, a UK electronics retailer tracks GPU prices at Currys and Argos every day. This allows them to match a competitor's price drop within hours instead of days. Being able to respond this quickly gives them a strong advantage in the market.

MAP enforcement

Brands use price scraping to ensure that authorized sellers follow minimum advertised price rules. Checking hundreds of reseller websites by hand isn't practical, so automated scraping helps identify violations quickly. This way, brands can address issues before they harm their reputation.

Deal alerts and price tracking

Consumer-facing services send alerts when a product's price falls below a set level. The scraper runs on a schedule, compares the current price against a stored baseline, and fires an alert when the condition is met. This logic is simple, but it needs reliable and accurate data to work well.

Market research

Aggregating prices from many sellers shows trends that you can't see by just looking at a single product. This includes patterns like seasonal changes, price differences by region, and shifts across different product categories. For analysts and pricing teams, this information is essential for building a solid strategy.

If you want more than pricing data, the guide on scraping eCommerce websites offers a complete overview. And if you're running any kind of monitoring use case, the guide on proxies for real-time price monitoring covers the infrastructure side.

Building the price scraper step by step

This section explains how to build a price scraper in Python. We'll use Playwright to render the website and BeautifulSoup to extract information. Our target site is books.toscrape.com, a practice eCommerce site set up for scraping, so you can develop without worrying about terms of service issues.

Setting up the project

Create a dedicated project directory and set up a virtual environment before installing anything. This keeps your dependencies isolated from other Python projects on your machine.

mkdir price-scraper
cd price-scraper

Create and activate a virtual environment:

# Mac/Linux
python3 -m venv venv
source venv/bin/activate
# Windows (WSL)
python3 -m venv venv
source venv/bin/activate

You should see (venv) appear at the start of your terminal prompt, which confirms the virtual environment is active. Any packages you install from this point are scoped to this project only.

Create the main script file:

touch price_scraper.py

Your project directory should now look like this:

price-scraper/
├── venv/
└── price_scraper.py

Now install the dependencies:

pip install playwright beautifulsoup4 pydantic requests python-dotenv
python -m playwright install chromium

The python -m playwright install chromium step downloads the Chromium browser that Playwright uses. This browser is separate from any other browsers on your computer, and the download takes about a minute to finish.

Setting up your proxy credentials

The scraper routes requests through Decodo's residential proxies to rotate IPs automatically and avoid blocks. Before you write any scraping code, get your credentials from the Decodo dashboard:

  1. Log in to your Decodo dashboard.
  2. Navigate to Residential → Proxy setup in the left sidebar.
  3. Scroll to the Authentication section. Your first proxy username and password are created automatically.
  4. Copy your username and password. Click the eye icon to reveal the password, or click directly on it to copy.

Your endpoint and port are fixed. gate.decodo.com:7000 is the default residential proxy gateway for rotating requests. Only the username and password are specific to your account.

Before entering your credentials into the scraper, make sure your proxy is working. Go to the Proxy setup page, scroll to Code examples, choose Python, copy the snippet, and run it.

import requests
username = "your_username"
password = "your_password"
proxy = f"http://{username}:{password}@gate.decodo.com:7000"
result = requests.get("https://ip.decodo.com/json", proxies={
    "http": proxy,
    "https": proxy
})
print(result.text)

If the response shows a different IP address from your own, the proxy is working correctly.

Store your credentials in a .env file instead of hardcoding them into your script. This is especially important if you plan to share your code publicly. 

To do this, create a .env file in the root of your project:

DECODO_PROXY_USERNAME=your_username
DECODO_PROXY_PASSWORD=your_password

Add .env to your .gitignore so it never gets committed:

echo ".env" >> .gitignore

Load them in your script:

import os
from dotenv import load_dotenv
load_dotenv()
proxy = {
    "server": "http://gate.decodo.com:7000",
    "username": os.getenv("DECODO_PROXY_USERNAME"),
    "password": os.getenv("DECODO_PROXY_PASSWORD")
}

To start, run Playwright in headless mode. Headless mode means the browser operates in the background without a visible window. This makes it faster and better for automated scraping.

import asyncio
import os
from dotenv import load_dotenv
from playwright.async_api import async_playwright
load_dotenv()
async def launch_browser():
    # start Playwright without "async with" so the browser is still open
    # after this function returns
    playwright = await async_playwright().start()
    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context(
        proxy={
            "server": "http://gate.decodo.com:7000",
            "username": os.getenv("DECODO_PROXY_USERNAME"),
            "password": os.getenv("DECODO_PROXY_PASSWORD")
        },
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
    page = await context.new_page()
    # the caller is responsible for cleanup when scraping is done:
    # await browser.close(), then await playwright.stop()
    return playwright, browser, context, page

Block unnecessary resources to speed up page loads

Images, fonts, and stylesheets add load time without adding useful data. Block them:

async def block_resources(page):
    await page.route(
        "**/*",
        lambda route: route.abort()
        if route.request.resource_type in ["image", "font", "stylesheet"]
        else route.continue_()
    )

Navigate to the product listing page and wait for prices to render:

import asyncio
import random
async def navigate_to_page(page, url):
    await page.goto(url, wait_until="domcontentloaded")
    # wait for price elements to render
    await page.wait_for_selector(".price_color", timeout=10000)
    # randomized delay to mimic human browsing patterns
    await asyncio.sleep(random.uniform(2, 5))
    return await page.content()

The wait_for_selector function pauses the program until the price element shows up on the page. This is important for prices that load with JavaScript through AJAX, as it ensures the scraper only tries to read the page after the data has loaded. The random delay of 2 to 5 seconds simulates how a real person browses, making it harder for anti-bot systems to identify the scraping activity.

Parsing price data from the page

After capturing the page content, pass the source to Beautiful Soup to begin the extraction process. BeautifulSoup is a Python library that converts messy HTML into a navigable structure, allowing you to isolate specific product data using CSS selectors with precision.

from bs4 import BeautifulSoup
def parse_prices(html):
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for article in soup.select("article.product_pod"):
        try:
            # extract product name
            name = article.select_one("h3 a")["title"]
            # extract and normalize price
            raw_price = article.select_one(".price_color").text.strip()
            price = float(raw_price.replace("£", "").replace(",", ""))
            # extract availability
            availability = article.select_one(".availability").text.strip()
            # extract rating (second class name, e.g. "Three")
            rating = article.select_one("p.star-rating")["class"][1]
            products.append({
                "name": name,
                "price": price,
                "currency": "GBP",
                "availability": availability,
                "rating": rating
            })
        except (AttributeError, TypeError, ValueError, KeyError, IndexError):
            # skip products with missing or malformed data
            continue
    return products

A few things worth noting here:

  • CSS selectors help target elements by their class name and structure. For example, the selector article.product_pod targets all product cards on the page. If you want to learn more about choosing the right selector, see XPath vs. CSS selectors.
  • When it comes to prices, you should normalize them by removing the currency symbol and converting the string into a float. Different countries have different ways of writing numbers. For instance, 1.299,00 is the German format, while 1,299.00 is the US format. Always normalize the price before saving it.
  • Using try/except is a good practice as it allows you to handle issues if an element is missing without crashing the entire scraping process. Learn more about data parsing and its importance.
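The number-format point above can be sketched as a small helper. This is a minimal illustration, not the article's method: the function name parse_price_string is hypothetical, and it assumes that a lone separator followed by exactly three digits is a thousands separator, which will misread rare prices such as 1.299 meant as a decimal.

```python
import re

def parse_price_string(raw: str) -> float:
    """Convert a localized price string such as '1.299,00 €' or '$1,299.00' to a float."""
    # keep only digits and the two possible separators, then trim stray punctuation
    digits = re.sub(r"[^\d.,]", "", raw).strip(".,")
    if not digits:
        raise ValueError(f"no numeric value in {raw!r}")
    last_dot, last_comma = digits.rfind("."), digits.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # both present: whichever appears last is the decimal separator
        decimal_sep = "." if last_dot > last_comma else ","
    elif last_dot != -1 or last_comma != -1:
        sep = "." if last_dot != -1 else ","
        tail = digits[digits.rfind(sep) + 1:]
        # assumption: a repeated separator, or one followed by exactly
        # three digits, marks thousands rather than decimals
        decimal_sep = None if (len(tail) == 3 or digits.count(sep) > 1) else sep
    else:
        decimal_sep = None
    if decimal_sep is None:
        cleaned = digits.replace(".", "").replace(",", "")
    else:
        thousands_sep = "," if decimal_sep == "." else "."
        cleaned = digits.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)
```

Calling parse_price_string(raw_price) in place of the chained replace() calls would let the same parser handle both the German and the US format, at the cost of the ambiguity noted above.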

Structuring and saving the output

Define a Pydantic model to ensure that every scraped record has a consistent structure. Pydantic is a Python library that validates data types at runtime. For example, if a price comes back as a non-numeric string such as "N/A" instead of a float, it raises a validation error before the record reaches your output file.

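from pydantic import BaseModel, Field
from typing import Optional
from datetime import datetime
class ProductPrice(BaseModel):
    product_name: str
    price: float
    currency: str
    original_price: Optional[float] = None
    discount: Optional[float] = None
    availability: str
    source_url: str
    # default_factory stamps each record when it is created; a plain
    # datetime.utcnow() default would be evaluated once at class definition
    scraped_at: datetime = Field(default_factory=datetime.utcnow)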

Serialize the output to a timestamped JSON file:

import json
from datetime import datetime
def save_to_json(products, source_url):
    timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
    filename = f"prices_{timestamp}.json"
    validated = []
    for product in products:
        try:
            record = ProductPrice(
                product_name=product["name"],
                price=product["price"],
                currency=product["currency"],
                availability=product["availability"],
                source_url=source_url,
            )
            validated.append(record.model_dump())
        except Exception as e:
            print(f"Validation error: {e}")
            continue
    with open(filename, "w") as f:
        json.dump(validated, f, indent=2, default=str)
    print(f"Saved {len(validated)} products to {filename}")
    return filename

Print a formatted summary to the console so you can sanity-check results without opening the file:

def print_summary(products):
    print(f"\n{'Product':<50} {'Price':>10} {'Availability':<20}")
    print("-" * 82)
    for p in products[:10]:  # show first 10
        print(f"{p['product_name'][:48]:<50} {p['currency']} {p['price']:>7.2f} {p['availability']:<20}")

The full script puts it all together:

import asyncio
import json
import os
import random
from datetime import datetime
from typing import Optional
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from playwright.async_api import async_playwright
from pydantic import BaseModel, Field
load_dotenv()
class ProductPrice(BaseModel):
    product_name: str
    price: float
    currency: str
    original_price: Optional[float] = None
    discount: Optional[float] = None
    availability: str
    source_url: str
    # default_factory stamps each record when it is created
    scraped_at: datetime = Field(default_factory=datetime.utcnow)
def parse_prices(html, source_url):
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for article in soup.select("article.product_pod"):
        try:
            name = article.select_one("h3 a")["title"]
            raw_price = article.select_one(".price_color").text.strip()
            price = float(raw_price.replace("£", "").replace(",", ""))
            availability = article.select_one(".availability").text.strip()
            record = ProductPrice(
                product_name=name,
                price=price,
                currency="GBP",
                availability=availability,
                source_url=source_url,
            )
            products.append(record.model_dump())
        except (AttributeError, TypeError, ValueError, KeyError):
            # skip products with missing or malformed data
            continue
    return products
def save_to_json(products):
    timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
    filename = f"prices_{timestamp}.json"
    with open(filename, "w") as f:
        json.dump(products, f, indent=2, default=str)
    print(f"Saved {len(products)} products to {filename}")
def print_summary(products):
    print(f"\n{'Product':<50} {'Price':>10} {'Availability':<20}")
    print("-" * 82)
    for p in products[:10]:
        print(
            f"{p['product_name'][:48]:<50} "
            f"{p['currency']} {p['price']:>7.2f} "
            f"{p['availability']:<20}"
        )
async def main():
    url = "https://books.toscrape.com/catalogue/category/books_1/index.html"
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            proxy={
                "server": "http://gate.decodo.com:7000",
                "username": os.getenv("DECODO_PROXY_USERNAME"),
                "password": os.getenv("DECODO_PROXY_PASSWORD"),
            },
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
        )
        page = await context.new_page()
        # block unnecessary resources
        await page.route(
            "**/*",
            lambda route: route.abort()
            if route.request.resource_type in ["image", "font", "stylesheet"]
            else route.continue_(),
        )
        await page.goto(url, wait_until="domcontentloaded")
        await page.wait_for_selector(".price_color", timeout=10000)
        await asyncio.sleep(random.uniform(2, 5))
        html = await page.content()
        await browser.close()
    products = parse_prices(html, url)
    print_summary(products)
    save_to_json(products)
asyncio.run(main())

For alternative output formats, CSV for spreadsheet analysis or SQLite for historical price tracking, see how to save scraped data for a full breakdown. If you'd rather skip building a custom scraper entirely, Decodo's eCommerce price scraper API returns structured pricing data without writing or maintaining any selector logic.



Testing your scraper with real results

Building the scraper is the first step. To make sure it works properly and to catch any errors, you need to check its results against the actual website. Run the scraper on books.toscrape.com and compare the output to what you see on the page.

Running the scraper

Make sure your virtual environment is active, then run:

source venv/bin/activate
python3 price_scraper.py

You should see a summary table printed to the console immediately:

[Screenshot: the summary table printed in the terminal]

And a timestamped JSON file in your project directory:

[
{
"product_name": "A Light in the Attic",
"price": 51.77,
"currency": "GBP",
"original_price": null,
"discount": null,
"availability": "In stock",
"source_url": "https://books.toscrape.com/catalogue/category/books_1/index.html",
"scraped_at": "2026-04-28 21:02:04.653382"
},
...
]

Verifying accuracy

Open https://books.toscrape.com/catalogue/category/books_1/index.html in your browser and manually check 5 products against your JSON output:

  • Does the product name match exactly?
  • Does the price match down to the decimal?
  • Is the availability status correct?

A well-configured scraper should achieve 100% accuracy on books.toscrape.com. This site is simple and has no anti-bot measures, JavaScript-rendered content, or changing prices. If you notice any errors, the problem is with your selector logic, not the website itself.

Calculate your capture rate:

# add this to price_scraper.py to check coverage
def calculate_accuracy(products, expected_count):
    captured = len(products)
    accuracy = (captured / expected_count) * 100
    print(f"\nCapture rate: {captured}/{expected_count} products ({accuracy:.1f}%)")

The website books.toscrape.com shows 20 products on each page. If you see fewer than 20 products, it means some are missing.

Diagnosing common failures

Zero products returned

The selector isn't finding anything. This might happen because the page didn't load completely before the parser started, or the HTML layout is different from what you expected. To troubleshoot, add a step to check what the scraper actually sees:

# temporarily add to main() after getting page content
with open("debug.html", "w") as f:
    f.write(html)

Open debug.html in your browser. If it looks different from the live page, such as missing products or incomplete HTML, the page did not finish loading before page.content() was called. To fix this, increase the timeout or wait for a more specific selector with wait_for_selector.

Prices returning as None or 0.0

The price element is in the HTML, but your selector isn't working. Open your browser's DevTools on the page you're checking. Right-click the price element and choose Inspect. Make sure the class name matches your selector exactly, as any small difference will cause it to fail.

Stale selectors after a site redesign

Websites often change their HTML structure unexpectedly. This means a selector that worked last week might not work today. This problem is known as selector rot, and it's the main reason why many scrapers stop working. To fix this issue, follow best practices: store selectors in a config file instead of hardcoding them in your script.
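As a rough sketch of the config-file approach, the selectors for each site could live in a JSON file and be loaded at startup. The file name selectors.json, the key names, and the load_selectors helper are all hypothetical; only the selector strings themselves come from the books.toscrape.com example earlier in this guide.

```python
import json
from pathlib import Path

# keys every site entry is expected to define (hypothetical naming)
REQUIRED_KEYS = {"product_card", "name", "price", "availability"}

def load_selectors(path: str, site: str) -> dict:
    """Load the selector set for one site from a JSON config file.

    Assumed file layout:
    {
      "books.toscrape.com": {
        "product_card": "article.product_pod",
        "name": "h3 a",
        "price": ".price_color",
        "availability": ".availability"
      }
    }
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    if site not in config:
        raise KeyError(f"no selectors configured for {site}")
    selectors = config[site]
    missing = REQUIRED_KEYS - selectors.keys()
    if missing:
        # fail loudly at startup instead of silently scraping nothing
        raise ValueError(f"{site} is missing selectors: {sorted(missing)}")
    return selectors
```

The parser would then call soup.select(selectors["product_card"]) instead of a hardcoded string, so a redesign only requires editing the JSON file rather than the script.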

Incorrect prices from cached content

Some pages serve older, cached prices that may no longer be accurate. If a scraped price seems strange, either much lower or much higher than expected, validate it against a sensible range:

def validate_price(price, min_price=0.50, max_price=1000.00):
    if not min_price <= price <= max_price:
        print(f"Warning: price {price} outside expected range")
        return False
    return True

Flag outliers for review instead of simply discarding them; an exceptionally low price often signals a genuine sale rather than a processing error.

Why periodic test scrapes matter

A scraper that functions today offers no guarantee for next week. Websites frequently update their layouts, rotate class names, and deploy new anti-bot layers without warning. By running a test scrape against a known baseline, even on a weekly schedule, you can identify selector rot and breakage before they result in significant gaps in your collected data.

Set a simple check: if the number of products returned drops below a threshold, send an alert:

def check_scraper_health(products, minimum_expected=15):
    if len(products) < minimum_expected:
        print(f"Alert: only {len(products)} products returned. Scraper may be broken.")

This won't catch every failure mode, but it catches the most common one: a selector breaking silently and returning an empty list.

Common obstacles and how to handle them

books.toscrape.com is easy to access. It doesn't have anti-bot measures, JavaScript rendering, or IP blocks. Real eCommerce sites will present challenges. Here’s what you can expect and how to deal with each issue.

CAPTCHAs and bot detection

Major retailers use aggressive methods to detect bots. They check your browser settings and look at how you make requests. If they see a single IP address making many requests to the same product pages quickly, it raises a red flag. They might ask you to complete a CAPTCHA if they suspect something is automated.

Three things reduce detection risk significantly:

  • Rotate IPs. Decodo's residential proxies send each request through a unique residential IP address. This makes your scraper appear like regular traffic from various users
  • Realistic browser fingerprints. Playwright runs a real Chromium browser, which passes many of the checks that catch plain HTTP clients, although stricter systems can still detect headless mode. You can improve your odds by setting a realistic user_agent and viewport size
  • Request pacing. Adding random delays between requests is more important than many realize. Fixed delays are easy to detect, but random ones are not

For tough websites that use Cloudflare, Akamai, or custom bot protection, Decodo's Web Scraping API handles fingerprinting, CAPTCHA solving, and JavaScript rendering as a managed layer. You provide a URL, and it returns the rendered HTML. For more details, check out the guide on how to bypass CAPTCHA, which explains all the methods.

IP blocks and rate limiting

Getting blocked by a website isn't always obvious. You might see a 403 error. More often, the site returns a soft block, like a CAPTCHA page, an empty product page, or a redirect to a login screen, while your scraper saves the incorrect HTML.
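One way to catch soft blocks before they reach your output is a cheap heuristic check on every response. The sketch below is illustrative only: the marker phrases, the function name looks_like_soft_block, and the default expected_marker (the product card class from books.toscrape.com) are assumptions that you would tune for each target site.

```python
# phrases that commonly appear on challenge or block pages (assumed list)
BLOCK_MARKERS = ("captcha", "verify you are human", "access denied", "unusual traffic")

def looks_like_soft_block(html: str, status: int = 200,
                          expected_marker: str = "product_pod") -> bool:
    """Return True when a response looks like a block page rather than product data."""
    # hard blocks: explicit refusal or rate-limit status codes
    if status in (403, 429):
        return True
    lowered = html.lower()
    # challenge pages usually mention one of these phrases
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return True
    # a page without the expected product markup is treated as suspicious
    return expected_marker.lower() not in lowered
```

A page that legitimately embeds a CAPTCHA script would trigger a false positive here, so treat a True result as a reason to retry or inspect the saved HTML, not as proof of a block.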

Here are some practical guidelines for most major retailers:

  • Wait 5 to 10 seconds between requests to the same website.
  • Limit to 1 or 2 active sessions per IP address.
  • Change user agents along with IPs. Using the same user agent string for thousands of requests can identify your activity.

import random
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
]
def get_random_user_agent():
    return random.choice(USER_AGENTS)
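The pacing guideline above (5 to 10 seconds between requests to the same website) could be enforced with a small per-domain throttle. This is a sketch under assumptions: the DomainThrottle class is hypothetical, and it is written for a scraper that awaits pages one at a time, so it does not guard against several concurrent tasks hitting the same domain.

```python
import asyncio
import random
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Keep a randomized gap between consecutive requests to the same domain."""

    def __init__(self, min_delay: float = 5.0, max_delay: float = 10.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._next_allowed = {}  # domain -> earliest monotonic time for the next request

    async def wait(self, url: str) -> float:
        """Sleep until the domain of `url` may be requested again; return seconds slept."""
        domain = urlparse(url).netloc
        now = time.monotonic()
        pause = max(0.0, self._next_allowed.get(domain, now) - now)
        if pause:
            await asyncio.sleep(pause)
        # schedule the next slot with a random gap so the rhythm is not fixed
        gap = random.uniform(self.min_delay, self.max_delay)
        self._next_allowed[domain] = time.monotonic() + gap
        return pause
```

In the scraper, calling await throttle.wait(url) immediately before page.goto(url) would space out requests per site while leaving requests to other domains unaffected.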

Dynamic content and JavaScript rendering

Prices loaded with AJAX or client-side rendering are not included in the initial HTML response. They only appear after JavaScript runs. When you make a basic Requests call, you get the page structure without any price data.

Playwright addresses this issue automatically. It uses a real browser and waits for the page to load completely before you access the content. However, for pages where prices show up only after user actions like scrolling, clicking a size selector, or accepting a cookie banner, you need to be more specific in your approach:

# wait for a specific price element to appear
await page.wait_for_selector("[data-testid='price']", timeout=15000)
# or wait for network activity to settle
await page.wait_for_load_state("networkidle")

Use Playwright for pages that need JavaScript to load their content. For static pages where prices are already in the HTML, using requests with BeautifulSoup is faster and costs less. The guide on what a headless browser is and when you need one covers the trade-offs. For more details on scraping dynamic content, refer to the deeper guide.

Anti-scraping countermeasures

Beyond CAPTCHAs and IP blocks, eCommerce sites use subtler traps:

  • Honeypot links. Invisible links in the HTML that legitimate users never click, but scrapers that follow every <a> tag will. Clicking one flags your session immediately. Filter links by visibility before following them:
# only follow visible links
links = await page.query_selector_all("a:visible")
  • Obfuscated class names. Dynamically generated class names like a3B9x_price change on every deploy. Selector rot sets in fast when you target these; they break silently after every site update. Build selectors that rely on structural position or ARIA attributes instead:
# fragile -- breaks when class name changes
soup.select_one(".a3B9x_price")
# more resilient -- targets semantic role
soup.select_one("[aria-label='price']")
soup.select_one("[data-testid='product-price']")
  • ARIA attributes (aria-label, aria-describedby, role) are set by developers for accessibility purposes. They're tied to the element's function, not its visual styling, which makes them far more stable across redesigns than class names.

For a full breakdown, see the guide on anti-scraping techniques and how to handle them.

Geo-specific pricing

The same product can have different prices based on where you are. For example, a pair of headphones might cost $299 in the US, €319 in Germany, and £259 in the UK. These differences are not just due to currency conversion; they reflect actual pricing differences.

Use Decodo's geo-targeted residential proxies to scrape from specific regions:

# target a specific country via a username parameter
proxy = {
    "server": "http://gate.decodo.com:7000",
    "username": f"{os.getenv('DECODO_PROXY_USERNAME')}-country-de",
    "password": os.getenv("DECODO_PROXY_PASSWORD")
}

Run the same scraper with different geographic targets and compare the results to understand pricing in different regions.
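To run the scraper against several regions, the proxy dictionary can be built per country. The helper below is a hypothetical sketch that simply reuses the username-suffix pattern from the snippet above; the country codes in the loop are examples, and you should confirm the exact targeting syntax and supported codes in your Decodo dashboard.

```python
import os

GATEWAY = "http://gate.decodo.com:7000"  # gateway quoted earlier in this guide

def build_geo_proxy(country_code: str) -> dict:
    """Return a Playwright-style proxy dict that targets one country."""
    code = country_code.strip().lower()
    if len(code) != 2 or not code.isalpha():
        raise ValueError(f"expected a two-letter country code, got {country_code!r}")
    username = os.getenv("DECODO_PROXY_USERNAME", "")
    return {
        "server": GATEWAY,
        # same "-country-xx" suffix pattern as the snippet above
        "username": f"{username}-country-{code}",
        "password": os.getenv("DECODO_PROXY_PASSWORD", ""),
    }

# one proxy config per region (example codes)
proxies_by_country = {code: build_geo_proxy(code) for code in ("us", "de", "gb")}
```

Each entry can be passed as the proxy argument of browser.new_context(), and the resulting records stored with their country code so regional prices stay comparable.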

Data quality issues

Bad data can cause more problems than having no data at all. If a price seems valid but is incorrect, it can mess up any analysis without showing an error.

Here are some common sources of bad price data:

  • Cached pages. Sometimes, a server sends an outdated version of a price from the previous day. Check scraped prices against a sensible range and mark any that are too far off
  • Missing discount information. If the sale price is recorded but the original price is missing, you can't calculate the discount
  • Currency mismatches. A price taken from a German website may be stored as if it were in USD, which leads to confusion
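The missing-discount case can be made explicit in code so that an absent original price yields None instead of a misleading number. A minimal sketch, assuming a hypothetical compute_discount helper whose result would fill the discount field of the ProductPrice model:

```python
from typing import Optional

def compute_discount(price: float, original_price: Optional[float]) -> Optional[float]:
    """Return the discount as a percentage, or None when it cannot be computed."""
    # without a usable original price there is nothing to compare against
    if original_price is None or original_price <= 0:
        return None
    # a "sale" price above the original usually means a parsing mix-up
    if price > original_price:
        return None
    return round((original_price - price) / original_price * 100, 1)
```

Returning None keeps the gap visible downstream, which is easier to audit than a discount of 0 that looks like real data.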

Add a validation layer before saving:

def validate_price(price, currency, min_price=0.01, max_price=50000.00):
    if not min_price <= price <= max_price:
        print(f"Warning: {currency} {price} outside expected range -- flagging for review")
        return False
    return True

Flag outliers instead of throwing them away. An unusually low price might be a real flash sale, not an error from data scraping. To learn more about cleaning and preparing scraped data before using it, check out what is data cleaning?

Using AI to automate and optimize price scraping

Writing selectors by hand works well when you're scraping one website. When you try to scrape 50 different retailers, each with its own HTML structure, class names, and page layout, this method stops scaling. AI can improve this process in several important ways.

AI-assisted selector generation

Instead of manually inspecting the DOM for every new target site, send the page HTML to an LLM, then ask the model to provide the CSS selector or XPath for the price element.

Here's a practical example using the Anthropic API. First, install the Anthropic SDK:

pip install anthropic

Create a new file, get_price_selector.py, and add the code below:

import asyncio
import os
from dotenv import load_dotenv
from playwright.async_api import async_playwright
import anthropic
load_dotenv()
async def get_price_selector(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            proxy={
                "server": "http://gate.decodo.com:7000",
                "username": os.getenv("DECODO_PROXY_USERNAME"),
                "password": os.getenv("DECODO_PROXY_PASSWORD"),
            },
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
        )
        page = await context.new_page()
        # wait for dynamic content to finish loading
        await page.goto(url, wait_until="networkidle")
        await asyncio.sleep(3)
        # strip junk and pass only the body
        html = await page.evaluate('''() => {
            const clone = document.body.cloneNode(true);
            clone.querySelectorAll('script, style, noscript, svg').forEach(el => el.remove());
            return clone.innerHTML;
        }''')
        await browser.close()
    # cap the sample size to keep the prompt within a reasonable token budget
    html_sample = html[:50000]
    # the client reads ANTHROPIC_API_KEY from the environment
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": f"""Analyze this HTML and return only the CSS selector
for the product price element. Return nothing else -- just the selector string.
No markdown, no code fences, no explanation.
HTML:
{html_sample}"""
            }
        ]
    )
    selector = message.content[0].text.strip()
    fence = "`" * 3
    # clean up code fences without touching "css" inside the selector itself
    if selector.startswith(fence):
        # drop the opening fence line (with any language tag) and the closing fence
        selector = selector.split("\n", 1)[-1].rsplit(fence, 1)[0]
    return selector.strip().strip("`").strip()

Cache the returned selector in a config file. Use it for each scrape until it stops returning results, then generate a new one. This way, the AI does the hard discovery work once per layout change, not with every request.
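The caching step might look like the sketch below. Everything here is an assumption for illustration: the get_cached_selector name, the JSON cache file, and the two callables. In practice, discover could wrap the LLM call from get_price_selector, and still_works could test the cached selector against freshly fetched HTML.

```python
import json
from pathlib import Path
from typing import Callable

def get_cached_selector(site: str, cache_path: str,
                        discover: Callable[[str], str],
                        still_works: Callable[[str], bool]) -> str:
    """Return a cached selector for `site`, regenerating it only when it stops matching."""
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    selector = cache.get(site)
    if selector and still_works(selector):
        # cache hit: no model call needed
        return selector
    # cache miss or stale selector: run the expensive discovery once, then persist it
    selector = discover(site)
    cache[site] = selector
    path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return selector
```

Keeping discovery behind a callable also makes the cache easy to test without network access or API credits.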

LLM-based extraction from unstructured pages

Some product pages have prices that are hard to find because they sit in inconsistent HTML or plain text. This makes them difficult to target with CSS selectors. An LLM can read these pages and return the information in a clear JSON format without needing any selectors.

import json
import anthropic
def normalize_prices_with_llm(raw_prices: list) -> list:
    client = anthropic.Anthropic()
    # We pre-fill the assistant response with '[' to lock it into a JSON array
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        system="You're a data normalization expert. Return ONLY valid JSON arrays.",
        messages=[
            {
                "role": "user",
                "content": f"Normalize to JSON (amount: float, currency: ISO 4217): {raw_prices}"
            },
            {
                "role": "assistant",
                "content": "["  # This 'pre-fill' trick steers the model toward JSON output
            }
        ]
    )
    # Re-attach the bracket we forced
    full_json = "[" + message.content[0].text.strip()
    return json.loads(full_json)

This is especially helpful for:

  • Websites that have prices stored in JavaScript variables or in inline scripts
  • Pages where prices are shown in unstructured text, like "Now only £29.99!"
  • Aggregator pages that combine different formats from various sellers

Price normalization and currency conversion

Scraped prices come in inconsistent formats across different sites and regions. For example, you might see them as £1,299.00, 1.299,00 €, USD 1299, or $1,299. If you put these raw formats directly into a database, it can cause problems when you try to analyze the data later. Use an LLM to normalize prices into a consistent schema:

import json
import anthropic
def normalize_prices_with_llm(raw_prices: list) -> list:
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": f"""Normalize these raw price strings into consistent JSON objects.
Each object should have: amount (float), currency (ISO 4217 code).
Return only a valid JSON array. No explanation, no markdown.
Raw prices:
{raw_prices}"""
            }
        ]
    )
    raw = message.content[0].text.strip()
    # strip code fences in case the model adds them anyway
    clean = raw.replace("`" * 3 + "json", "").replace("`" * 3, "").strip()
    return json.loads(clean)
# example
raw = ["£1,299.00", "1.299,00 €", "USD 1299", "$1,299"]
normalized = normalize_prices_with_llm(raw)
print(normalized)
# [
# {"amount": 1299.00, "currency": "GBP"},
# {"amount": 1299.00, "currency": "EUR"},
# {"amount": 1299.00, "currency": "USD"},
# {"amount": 1299.00, "currency": "USD"}
# ]

Anomaly detection

Price changes, like a product dropping 80% overnight or its price jumping 10 times, can mean a real sale, a mistake in data entry, or a problem with data collection. Finding these issues early helps stop bad data from affecting other systems.

A basic statistical approach catches most outliers:

import statistics

def flag_price_anomalies(prices: list[float], threshold: float = 2.5) -> list:
    if len(prices) < 3:
        return []
    mean = statistics.mean(prices)
    stdev = statistics.stdev(prices)
    anomalies = []
    for price in prices:
        z_score = abs((price - mean) / stdev) if stdev > 0 else 0
        if z_score > threshold:
            anomalies.append({
                "price": price,
                "z_score": round(z_score, 2),
                "deviation": f"{round(((price - mean) / mean) * 100, 1)}% from mean"
            })
    return anomalies
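One caveat with z-scores on short price histories: a single outlier inflates the standard deviation it is measured against, and the largest possible z-score in a sample of n points is (n - 1) / sqrt(n). With fewer than nine data points, nothing can exceed a threshold of 2.5. As a hedged alternative (a swapped-in technique, not the method above), the sketch below uses the median and the median absolute deviation (MAD), with the commonly used 0.6745 scaling constant and 3.5 cutoff. The function name is illustrative.

```python
import statistics


def flag_price_anomalies_mad(prices: list, threshold: float = 3.5) -> list:
    """Flag outliers using the modified z-score (median and MAD based)."""
    if len(prices) < 3:
        return []
    median = statistics.median(prices)
    # Median absolute deviation: the typical distance from the median
    mad = statistics.median([abs(p - median) for p in prices])
    if mad == 0:
        return []  # at least half the prices are identical; no usable spread
    anomalies = []
    for price in prices:
        modified_z = 0.6745 * abs(price - median) / mad
        if modified_z > threshold:
            anomalies.append({"price": price, "modified_z": round(modified_z, 2)})
    return anomalies


# Six daily prices with one suspicious drop
history = [19.99, 20.49, 19.99, 20.99, 20.49, 3.99]
flagged = flag_price_anomalies_mad(history)
```

On this six-point history the 3.99 entry is flagged, while a mean-based z-score with a 2.5 threshold would stay silent because the maximum attainable z-score for six points is about 2.04.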

If context is important, for example, to tell the difference between a real flash sale and a scraping error, send the flagged prices along with the relevant details to the LLM:

def interpret_anomaly_with_llm(product_name: str, current_price: float,
                               historical_prices: list, currency: str) -> str:
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": f"""A price anomaly was detected for this product.
Product: {product_name}
Current price: {currency} {current_price}
Historical prices (last 30 days): {historical_prices}
Is this likely a genuine sale, a data error, or a scraping failure?
Give a brief 1-2 sentence assessment."""
            }
        ]
    )
    return message.content[0].text.strip()

When to use AI and when not to

AI can slow down processes and increase costs for every request. A well-designed scraper that visits 1,000 product pages doesn’t need an LLM for each page.

Use AI for:

  • Finding selectors for new sites. Do this once per site and cache the results
  • Handling exceptions. If a selector fails, switch to LLM extraction instead of letting it crash
  • Normalizing data. Process raw data in batches, not during the scraping
  • Understanding anomalies. Only check for outliers when statistics indicate a problem

Don't use AI for:

  • Every page request in a high-volume scrape. The cost and latency add up fast
  • Replacing working selectors on stable sites. If a selector functions well, keep using it
  • Simple tasks that a regular expression or string operation can handle
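To illustrate the last point, many common price strings can be normalized with plain string handling before any LLM is involved. The sketch below is a simplified example rather than a full locale parser: parse_price is a hypothetical helper, the symbol table is illustrative, and it assumes that a final separator followed by one or two digits is the decimal mark. Strings it can't parse return None, which makes them candidates for a batched LLM pass.

```python
import re
from typing import Optional

# Illustrative symbol/code table; extend it for the markets you track
CURRENCY_HINTS = {"£": "GBP", "€": "EUR", "$": "USD",
                  "GBP": "GBP", "EUR": "EUR", "USD": "USD"}


def parse_price(raw: str) -> Optional[dict]:
    """Parse strings like '£1,299.00' or '1.299,00 €' into amount and currency."""
    currency = next((code for hint, code in CURRENCY_HINTS.items() if hint in raw), None)
    match = re.search(r"\d[\d.,]*", raw)
    if currency is None or match is None:
        return None  # leave ambiguous strings for a fallback path
    number = match.group().rstrip(".,")
    # Assumption: a final separator followed by 1-2 digits is the decimal mark
    decimal = re.search(r"[.,](\d{1,2})$", number)
    if decimal:
        whole = re.sub(r"[.,]", "", number[: decimal.start()])
        amount = float(f"{whole}.{decimal.group(1)}")
    else:
        # No decimal part: every separator is a thousands separator
        amount = float(re.sub(r"[.,]", "", number))
    return {"amount": amount, "currency": currency}


samples = ["£1,299.00", "1.299,00 €", "USD 1299", "$1,299", "Now only £29.99!"]
parsed = [parse_price(s) for s in samples]
```

The first four samples all resolve to an amount of 1299.0, and the last to 29.99 in GBP, with no API call.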

Using a smaller, faster model like claude-haiku-4-5-20251001 keeps costs low for high-volume tasks like normalization and anomaly detection. Reserve larger models for complex extraction tasks where accuracy matters more than speed.

For connecting AI agents to live web data as part of a larger pipeline, Decodo's MCP server integration covers the infrastructure side.

Best practices for reliable price scraping at scale

A scraper that works well on one site may not work the same way when it's used on 50 sites with thousands of SKUs. These practices help keep it reliable over time, not just on the first day.

Rotate everything

Predictable patterns get you blocked quickly. Here are four things to rotate:

  • IPs. Decodo's residential proxies handle IP rotation automatically. Each request goes out through a different residential address, making repeated scrapes look like organic traffic from different users.
  • User agents. Store a collection of realistic user agent strings and change them for each session, rather than for each request:
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
]

def get_random_user_agent():
    return random.choice(USER_AGENTS)
  • Viewport sizes. A single viewport across all sessions is a fingerprint. Vary it:
VIEWPORTS = [
    {"width": 1920, "height": 1080},
    {"width": 1440, "height": 900},
    {"width": 1366, "height": 768},
]

context = await browser.new_context(
    viewport=random.choice(VIEWPORTS),
    user_agent=get_random_user_agent()
)
  • Request timing. Randomize delays between requests. Never use fixed intervals:
import asyncio
import random
# instead of this
await asyncio.sleep(5)
# do this
await asyncio.sleep(random.uniform(3, 8))

Version-control your selectors

Hardcoding selectors in your script is a major problem for maintaining production scrapers. When a website changes its layout, which they all do eventually, you end up editing Python files instead of a simple config file.

It's better to store selectors in a separate config file:

# selectors.json
{
    "books.toscrape.com": {
        "product_container": "article.product_pod",
        "product_name": "h3 a",
        "price": ".price_color",
        "availability": ".availability"
    },
    "another-retailer.com": {
        "product_container": ".product-card",
        "product_name": "[data-testid='product-title']",
        "price": "[data-testid='product-price']",
        "availability": "[aria-label='stock status']"
    }
}

Load them at runtime:

import json

def load_selectors(domain: str) -> dict:
    with open("selectors.json") as f:
        all_selectors = json.load(f)
    return all_selectors.get(domain, {})

When a site redesigns and a selector breaks, you update one JSON entry. The scraper logic stays untouched.

Monitor for breakage

Silent failures can be very risky. When a scraper returns 0 results or incorrect data, it doesn't show an error. Instead, it just saves bad information, leading you to believe that everything is fine.

Make sure to set up health checks that run after every scrape:

def check_scraper_health(products: list, domain: str,
                         minimum_expected: int = 15) -> bool:
    if len(products) < minimum_expected:
        print(
            f"Alert: {domain} returned only {len(products)} products. "
            f"Expected at least {minimum_expected}. Check selectors or connectivity."
        )
        return False
    prices = [p["price"] for p in products]
    if not any(prices):
        print(f"Alert: {domain} returned products but all prices are null.")
        return False
    return True

If a scrape fails the health check, log it and skip writing to your main output. Bad data in is worse than no data in.

Also, keep a log for every run. Apache Nutch users will recognize the habit of checking logs/ after each crawl. For scrapers, build the equivalent: a per-run log file that captures fetch status, parse success rate, and record count, so you have a paper trail when something breaks.
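One way to build that paper trail is to append a single JSON line per run. This is a sketch under assumptions: the logs/ directory, the runs.jsonl file name, the write_run_log helper, and the field names are all illustrative choices rather than a convention from any particular tool.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_run_log(domain: str, fetched: int, fetch_errors: int,
                  parsed: int, records: int, log_dir: str = "logs") -> dict:
    """Append a one-line JSON summary of a scrape run to <log_dir>/runs.jsonl."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "domain": domain,
        "pages_fetched": fetched,
        "fetch_errors": fetch_errors,
        # Share of fetched pages that parsed cleanly; 0.0 when nothing was fetched
        "parse_success_rate": round(parsed / fetched, 3) if fetched else 0.0,
        "record_count": records,
    }
    path = Path(log_dir)
    path.mkdir(parents=True, exist_ok=True)
    with open(path / "runs.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

An append-only JSON Lines file keeps each run independent, so a sudden drop in parse_success_rate or record_count for one domain is easy to spot by comparing adjacent entries.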

Validate output data

A price that passes the health check can still be wrong. Add field-level validation before any scraped record touches your database:

from typing import Optional

from pydantic import BaseModel, field_validator

class ProductPrice(BaseModel):
    product_name: str
    price: float
    currency: str
    original_price: Optional[float] = None
    discount: Optional[float] = None
    availability: str
    source_url: str

    # Pydantic v2 API; on Pydantic v1, use @validator instead of @field_validator
    @field_validator("price")
    @classmethod
    def price_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError(f"Price must be positive, got {v}")
        return v

    @field_validator("discount")
    @classmethod
    def discount_must_be_valid(cls, v):
        if v is not None and not 0 <= v <= 100:
            raise ValueError(f"Discount must be between 0-100%, got {v}")
        return v

    @field_validator("currency")
    @classmethod
    def currency_must_be_valid(cls, v):
        valid_currencies = {"GBP", "USD", "EUR", "CAD", "AUD"}
        if v not in valid_currencies:
            raise ValueError(f"Unrecognized currency: {v}")
        return v

Pydantic will raise a ValidationError if any record doesn't meet the required criteria. Make sure to catch this error, log it, and continue processing the rest instead of allowing it to disrupt the entire batch.
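A sketch of that catch, log, and continue loop. To stay self-contained it uses a trimmed stand-in model (PriceRecord, a hypothetical name) rather than the full ProductPrice class, and validate_batch is likewise an illustrative helper.

```python
import logging

from pydantic import BaseModel, Field, ValidationError

logger = logging.getLogger("price_scraper")


class PriceRecord(BaseModel):
    """Trimmed stand-in for the full model, just enough to show the pattern."""
    product_name: str
    price: float = Field(gt=0)  # rejects zero and negative prices
    currency: str


def validate_batch(raw_records: list) -> tuple:
    """Return (valid, rejected) without letting one bad row stop the batch."""
    valid, rejected = [], []
    for index, raw in enumerate(raw_records):
        try:
            valid.append(PriceRecord(**raw))
        except ValidationError as exc:
            # Log the failure and keep going with the remaining records
            logger.warning("Record %d rejected: %s", index, exc.errors())
            rejected.append({"index": index, "record": raw})
    return valid, rejected


valid, rejected = validate_batch([
    {"product_name": "Book A", "price": 51.77, "currency": "GBP"},
    {"product_name": "Book B", "price": -1, "currency": "GBP"},
    {"product_name": "Book C", "price": "12.50", "currency": "USD"},
])
```

Keeping the rejected rows alongside their index makes it possible to review them later instead of silently dropping them.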

Schedule scrapes thoughtfully

When you scrape is just as important as how you scrape.

  • Run your scrapes during off-peak hours for the site you're targeting. For example, a US retailer might see peak traffic between 9 AM and 9 PM Eastern time. Scraping at 3 AM local time puts less pressure on their servers and lowers the risk of hitting rate limits during busy times.
  • Don’t scrape the same page multiple times in a short period. If you're tracking 500 products on one site, spread your requests out instead of sending them all at once.
  • Adjust your scraping frequency based on how often the data changes. Flash sale prices can change every hour, while standard retail prices might change weekly. Scraping stable data too often wastes resources and increases the chance of being blocked. Scraping volatile data too infrequently means you might miss important updates.

A simple frequency config per domain:

SCRAPE_INTERVALS = {
    "flash-sale-site.com": 3600,       # every hour
    "standard-retailer.com": 86400,    # every 24 hours
    "slow-moving-catalog.com": 604800  # every 7 days
}
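A small helper can turn that config into a scheduling decision. This sketch assumes you store the last successful run per domain as a Unix timestamp; is_due, due_domains, and the 24-hour default for unlisted domains are illustrative choices. The config is repeated so the block runs on its own.

```python
import time
from typing import Optional

SCRAPE_INTERVALS = {
    "flash-sale-site.com": 3600,        # every hour
    "standard-retailer.com": 86400,     # every 24 hours
    "slow-moving-catalog.com": 604800,  # every 7 days
}
DEFAULT_INTERVAL = 86400  # assumed fallback for domains missing from the config


def is_due(domain: str, last_run: Optional[float], now: Optional[float] = None) -> bool:
    """Return True when the domain's interval has elapsed since its last run."""
    if last_run is None:
        return True  # never scraped before
    now = time.time() if now is None else now
    interval = SCRAPE_INTERVALS.get(domain, DEFAULT_INTERVAL)
    return now - last_run >= interval


def due_domains(last_runs: dict, now: Optional[float] = None) -> list:
    """List the configured domains that should be scraped now."""
    return [d for d in SCRAPE_INTERVALS if is_due(d, last_runs.get(d), now)]


# Example with a fixed clock so the result is reproducible
NOW = 1_000_000
last_runs = {
    "flash-sale-site.com": NOW - 3600,    # exactly one interval ago
    "standard-retailer.com": NOW - 100,   # scraped very recently
}
to_scrape = due_domains(last_runs, now=NOW)
```

Here the flash-sale site and the never-scraped catalog are due, while the standard retailer is skipped until its 24 hours have passed.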

Build in retry logic

A single failed request shouldn't kill the entire scrape. Network hiccups, temporary rate limits, and transient server errors are normal. The scraper should handle them automatically and move on.

Use tenacity for clean, configurable retry behavior:

pip install tenacity

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)
from playwright.async_api import TimeoutError as PlaywrightTimeout

@retry(
    retry=retry_if_exception_type((PlaywrightTimeout, ConnectionError)),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(3)
)
async def fetch_with_retry(page, url: str) -> str:
    await page.goto(url, wait_until="domcontentloaded")
    await page.wait_for_selector(".price_color", timeout=10000)
    return await page.content()

How this works:

  • retry_if_exception_type only retries on specific errors. A 404 isn't worth retrying. A timeout is.
  • wait_exponential applies exponential backoff: the delay grows with each failed attempt but is clamped between a 4-second floor (min=4) and a 60-second ceiling (max=60). This avoids hammering a struggling server while still recovering automatically.
  • stop_after_attempt(3) gives up after 3 attempts and raises the exception so you can log it and move on

If you'd rather not add a dependency, a basic backoff loop achieves a similar result. Note that this version retries on any exception, not just timeouts and connection errors:

import asyncio
import random

async def fetch_with_backoff(page, url: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            await page.goto(url, wait_until="domcontentloaded")
            await page.wait_for_selector(".price_color", timeout=10000)
            return await page.content()
        except Exception as e:
            if attempt == max_retries - 1:
                print(f"Failed after {max_retries} attempts: {url} -- {e}")
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed. Retrying in {wait:.1f}s...")
            await asyncio.sleep(wait)

The random.uniform(0, 1) adds jitter: a small random offset that prevents multiple concurrent scrapers from retrying in lockstep and hitting the server at the same moment.

For a production scraper hitting hundreds of URLs, retry logic is the difference between a run that completes with a few logged failures and one that crashes halfway through and leaves you with incomplete data.

Use a scraping API for critical targets

For high-value websites like major retailers and platforms using Cloudflare or Akamai, keeping a custom scraper updated is a constant challenge. When these sites strengthen their defenses, your scraping methods can break, or your IPs can get blocked.

Decodo's eCommerce price scraper API returns structured pricing data without the maintenance burden. You send a URL, it handles rendering, fingerprinting, CAPTCHA bypassing, and IP rotation, and then returns clean JSON. For production-grade price monitoring across critical targets, the time saved on maintenance alone justifies the switch. For more on the infrastructure considerations, see the guide on using proxies and scraping solutions to monitor pricing.

Final thoughts

Price scraping at scale comes down to three things: getting the data, keeping the scraper running, and trusting the output. This guide covered all three: building a working Python scraper with Playwright and BeautifulSoup, handling the real-world obstacles that break production scrapers, and validating output before it touches anything downstream.

The jump from a working scraper to a reliable one is mostly infrastructure: rotating proxies, versioned selectors, health checks, and thoughtful scheduling. Get those right, and the scraper largely runs itself. For targets where maintaining custom selectors isn't worth the effort, Decodo's eCommerce price scraper API and residential proxies handle the heavy lifting so you can focus on the data itself.


About the author

Lukas Mikelionis

Senior Account Manager

Lukas is a seasoned enterprise sales professional with extensive experience in the SaaS industry. Throughout his career, he has built strong relationships with Fortune 500 technology companies, developing a deep understanding of complex enterprise needs and strategic account management.

Connect with Lukas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

What is price scraping?

Price scraping is the automated extraction of product pricing data from ecommerce websites. A typical scraper pulls the product name, current price, currency, availability status, and sometimes historical pricing from product pages. It's used for competitive monitoring, dynamic pricing, MAP compliance, and market research. The first section of this guide covers the concept and common use cases in more detail.

How do I scrape prices from Amazon?

Amazon's terms of service explicitly prohibit scraping, and the site uses some of the most aggressive anti-bot detection in eCommerce: CAPTCHAs, TLS fingerprinting, behavioral analysis, and frequent layout changes. A custom scraper with rotating residential proxies can work, but it requires constant maintenance and carries the risk of account or IP bans. For a more reliable approach, Decodo's Web Scraping API handles the anti-bot layer and returns structured pricing data without you building and maintaining the bypass infrastructure yourself.

Can AI help with price scraping?

Yes, in a few practical ways. AI can generate selectors automatically when you're onboarding new sites, extract pricing from unstructured or inconsistent markup, normalize price formats across different currencies and locales, and flag anomalies like sudden price drops or suspected mislabels. That said, running an LLM on every single scrape adds latency and cost that usually isn't justified at scale. The sweet spot is using AI for initial setup, edge cases, and exception handling rather than the main extraction loop. Section 6 of this guide covers the full breakdown.


© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved