
How To Use ScrapeGraph AI for Web Scraping in 2026

Web scraping used to mean extracting data with CSS selectors, then rebuilding your scraper every time a target changed its layout. Here's the good news: ScrapeGraph AI takes a different approach. It uses LLMs to extract data from websites based on meaning, so you can describe what you need in natural language and let the library handle the rest. In this guide, you'll learn how ScrapeGraph AI works and how to configure it to export structured datasets in the right formats. The tools we'll use are Python, ScrapeGraph AI, and Decodo proxies.

TL;DR

  • ScrapeGraph AI is a Python scraping library that uses LLMs to extract structured data from websites and local files based on meaning, not fixed selectors.
  • It works well for messy, semi-structured, or frequently changing pages where selector-based scrapers are harder to maintain.
  • You can use different graphs for different tasks, including SmartScraperGraph for single pages, SmartScraperMultiGraph for multiple URLs, and JSONScraperGraph for local JSON files.
  • ScrapeGraph AI supports both hosted and local models, including OpenAI, Groq, Gemini, and Ollama-based setups.
  • For live targets, Playwright handles JavaScript rendering, while Decodo residential proxies help reduce blocks, support geo-targeting, and make large-scale scraping more reliable.
  • It can also process local data formats like JSON, XML, CSV, and Markdown, which makes it useful for post-processing API responses and exported datasets.
  • Once extracted, the output can be validated with Pydantic and saved in formats like JSON or CSV for downstream use. 

What is ScrapeGraph AI, and why use it?

ScrapeGraph AI is a powerful Python scraping library that uses Large Language Models (LLMs) to extract structured data from websites and documents. 

It's designed to be user-friendly and efficient, as users can simply specify the information they need instead of manually defining selectors, and ScrapeGraph AI handles the rest.

ScrapeGraph AI does this by working from meaning rather than fixed page patterns. This makes it useful when page structures are inconsistent or likely to change over time. It can also automate the creation of scraping pipelines from user prompts, which reduces the need for manual coding. If you want to compare it with similar tools, see our roundup of the best AI data collection tools.

Lastly, it's compatible with multiple LLM providers like OpenAI, Claude, and Gemini, as well as local models through Ollama. It also supports both single-page and multi-page scraping, making it flexible enough for a range of data extraction workflows.

Key features

  • Multi-LLM support. Works with OpenAI, Anthropic, Google Gemini, Groq, Mistral, DeepSeek, and local models via Ollama. That's 20+ providers and counting.
  • Multiple document formats. Beyond HTML, it can also extract from JSON, XML, CSV, Markdown, and PDFs.
  • Graph-based architecture. Each stage of the scraping pipeline is a node in a directed graph, which makes it easier to swap components, reuse steps, or build custom flows.
  • Developer-friendly workflow. You describe the output you want, and the library builds much of the extraction flow around that prompt, reducing setup friction.

When to use ScrapeGraph AI vs. traditional scraping

  • ScrapeGraph AI. Best for unstructured, semi-structured, or frequently changing targets – it extracts by meaning, so it's easier to maintain when layouts shift.
  • Traditional scraping. Best for stable, high-volume, simple extractions – selectors are faster, more predictable, and more cost-efficient for fixed page structures.

Understanding ScrapeGraph AI's architecture

The name "ScrapeGraph" comes from the library's graph-based pipeline architecture. Every scraping operation runs as a directed graph, where each node handles one task, and data moves from node to node until you get a structured result. 

This design makes the library more modular and easier to follow. Instead of treating scraping as one large process, ScrapeGraph AI breaks it into smaller steps that can be reused, adjusted, or extended when needed.

The core pipeline stages

A typical SmartScraperGraph run moves through 5 stages:

  1. Content acquisition. Fetch the page from a URL or load a local file. For web pages, Playwright spins up a headless browser in the background and renders JavaScript before passing the content downstream.
  2. Preprocessing and chunking. Raw HTML is cleaned and split into smaller chunks that fit within the LLM's context window. This is how the library handles large pages without running into token limits.
  3. LLM analysis. The chunks and your prompt go to the configured LLM. The model interprets what's on the page and matches it against what you're asking for.
  4. Intelligent extraction. The LLM pulls the fields you requested based on meaning rather than fixed markup or selectors.
  5. Result formatting. Output is structured as a dictionary, JSON, or whatever shape you specified in the prompt.
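
To build intuition for how data moves through these stages, here's a minimal sketch of the flow as plain functions. This is purely illustrative – the function names and stubbed logic are not ScrapeGraph AI's internal APIs:

```python
# Conceptual sketch of the five pipeline stages, not ScrapeGraph AI internals.

def acquire(source: str) -> str:
    """Stage 1: fetch a URL or read a local file (stubbed here)."""
    return "<html><h1>Example headline</h1><p>Apr 20, 2026</p></html>"

def preprocess(html: str, chunk_size: int = 1000) -> list[str]:
    """Stage 2: clean the raw HTML and split it into LLM-sized chunks."""
    text = html.replace("<", " <")  # crude cleanup stand-in
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def analyze_and_extract(chunks: list[str], prompt: str) -> dict:
    """Stages 3-4: an LLM would match the prompt against each chunk."""
    return {"headline": "Example headline", "date": "Apr 20, 2026"}

def format_result(raw: dict) -> dict:
    """Stage 5: shape the output as requested in the prompt."""
    return {"content": [raw]}

result = format_result(
    analyze_and_extract(preprocess(acquire("https://example.com")),
                        "Extract headlines and dates.")
)
print(result)
```

Each stage only depends on the previous stage's output, which is exactly what makes the real graph nodes easy to swap or extend.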

LLM integration

ScrapeGraph AI supports multiple LLM providers, allowing you to pick the model that best fits your needs and budget. It also handles the complexities of token limits, prompt engineering, and response parsing, making it easy to use these powerful models without getting bogged down in technical details.

Embedding models

For larger pages, ScrapeGraph AI uses an embedding model to chunk content semantically rather than just splitting by character count. Chunks that are topically similar stay together, which improves extraction accuracy.

To learn more about how modern AI systems process and structure web content, read our blog on how AI processes data.

Prerequisites and installation

Before you start, make sure you've got a few basics in place. You should be comfortable writing simple Python scripts and running commands in your terminal. Here's what you need to follow along:

  • Python 3.10 or newer – confirm your version with python --version
  • pip package manager – comes with most Python installations
  • An OpenAI API key, or a machine that can run a local LLM through Ollama
  • At least 8 GB of RAM for smaller local models, plus enough disk space for model files

Install ScrapeGraph AI

Start with a clean virtual environment to keep dependencies isolated:

python -m venv scrapegraph-env
source scrapegraph-env/bin/activate

On Windows:

scrapegraph-env\Scripts\activate

If you like to keep your Python projects organized, you can also create a new environment with Poetry and add scrapegraphai as a dependency there. However, this is optional.

Next, proceed to install the ScrapeGraph AI library:

pip install scrapegraphai

Recent ScrapeGraph AI releases require Python 3.10+. Some older tutorials still mention Python 3.9, but that no longer applies to current versions.

Installing ScrapeGraph AI also installs Playwright, a tool for automating browsers that powers ScrapeGraph AI's web scraping abilities. We've covered Playwright more thoroughly in a previous blog post. 

If this is your first time using Playwright, there's an extra step. Run this command to download the necessary browser binaries:

playwright install

Local LLM setup with Ollama

If you want to run ScrapeGraph AI locally, Ollama is the easiest way to get started. It handles downloading and serving local models so that ScrapeGraph AI can connect to them through a local endpoint. 

If you're looking to save some extra money, this is a smart call because every ScrapeGraph AI extraction is an LLM call, and API bills add up fast at scale. Running locally means you pay for electricity instead of tokens.

Here's how to install Ollama:

  1. Visit the official Ollama website.
  2. Select the version that matches your operating system.
  3. Follow the installation guide on their website.

Once the install is complete, run ollama --version, and you should see the installed version number:

Ollama 0.21.0

Installing Ollama doesn't automatically include a language model. You'll need to select and pull one separately.

Choosing a model

The right model depends on your hardware and the kind of extraction you want to run. Here are some solid options:

  • LLaMA is a good general-purpose option
  • Mistral works well for many scraping and text-heavy tasks
  • Phi can be a lighter choice for precise tasks
  • Gemma can be useful for multilingual work

Understanding parameter counts

You'll see model names like llama3:8b or llama3:70b. The 'B' stands for billions of parameters. More parameters usually mean better reasoning and language understanding, but also higher memory and storage requirements.

For reference, LLaMA 8B needs at least 8GB of RAM and around 4.9GB of disk space. LLaMA 70B is significantly larger at 40GB and requires over 32GB of RAM to run smoothly.

Note that bigger isn't always better. It's more about balancing the model to your task and your hardware. For this guide, LLaMA 8B is a safe starting point unless you're running a high-spec machine with plenty of disk space.
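
As a rough rule of thumb – this is a back-of-envelope approximation, not an official sizing formula – a quantized model's footprint is about parameters × bits-per-parameter ÷ 8:

```python
def approx_model_size_gib(params_billions: float, bits_per_param: float) -> float:
    """Back-of-envelope disk/memory estimate for a quantized model."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1024**3  # bytes -> GiB

# An 8B model at ~4.5-bit quantization lands near the ~4.9 GB download size:
print(round(approx_model_size_gib(8, 4.5), 1))   # ≈ 4.2 GiB
# A 70B model at the same quantization is an order of magnitude larger:
print(round(approx_model_size_gib(70, 4.5), 1))  # ≈ 36.7 GiB
```

Actual sizes vary with the quantization scheme Ollama ships for each model, so treat these numbers as ballpark figures.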

Pulling and running the model

In your terminal, download Llama 3.1 along with an embedding model ScrapeGraph AI will use for semantic chunking:

ollama pull llama3.1 && ollama pull nomic-embed-text

We use llama3.1 here instead of llama3 because the 8B version has a 128K context window rather than 8K. That means it can keep much more page content in working memory at once, which is useful for larger documents, longer prompts, and extraction tasks that need more context. 

The download may take a few minutes, depending on your connection. Once it's done, verify the setup by running the model:

ollama run llama3.1

You should see an interactive prompt. Type something like "introduce yourself" to confirm the model responds, then type /bye to exit.

If you get a response, it means everything is working as expected. Finally, start serving the model so that ScrapeGraph AI can reach it:

ollama serve

This spins up a local Ollama instance at 127.0.0.1:11434. Keep the terminal window open while you work. If the port is already in use, Ollama may have started automatically after installation, so check your running processes before troubleshooting further.
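
With the server running, your ScrapeGraph AI config can point at the local endpoint instead of a cloud provider. The keys below follow the patterns used in ScrapeGraph AI's documentation, but treat the exact model names and base_url as assumptions to adapt to your own setup:

```python
# Sketch of a local-model config for ScrapeGraph AI. Model names and the
# base_url are assumptions based on a default Ollama install; adjust as needed.
graph_config = {
    "llm": {
        "model": "ollama/llama3.1",            # provider/model format
        "temperature": 0,
        "base_url": "http://127.0.0.1:11434",  # default Ollama endpoint
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",    # used for semantic chunking
        "base_url": "http://127.0.0.1:11434",
    },
    "verbose": True,
    "headless": True,
}
print(graph_config["llm"]["model"])
```

Everything else in this guide stays the same – you only swap the llm (and embeddings) block.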

Cloud LLM setup

If you'd rather skip the local setup, you can point ScrapeGraph AI at a cloud LLM provider. OpenAI is the most common choice, but ScrapeGraph AI also supports Anthropic, Google Gemini, Azure OpenAI, and Groq out of the box.

For OpenAI, generate an API key from your OpenAI dashboard and keep it handy:

OPENAI_API_KEY="sk-..."

For other providers, use the corresponding variable (ANTHROPIC_API_KEY, GOOGLE_API_KEY, etc.). 
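
Rather than hardcoding the key in your script, it's safer to read it from the environment at run time. A minimal sketch (the placeholder fallback is only for demonstration):

```python
import os

# Read the API key from the environment so it never lands in source control.
# The "sk-placeholder" fallback is illustrative; in practice you'd fail fast
# or prompt the user when the variable is missing.
api_key = os.environ.get("OPENAI_API_KEY", "sk-placeholder")

graph_config = {
    "llm": {
        "api_key": api_key,
        "model": "openai/gpt-5-mini",
        "temperature": 0,
    },
}
```

This pattern works identically for ANTHROPIC_API_KEY, GOOGLE_API_KEY, and the other provider variables.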

Configuring and using scraping pipelines

After installation, you'll need to configure ScrapeGraph AI to use your preferred LLM provider. The library supports various providers, including popular ones like OpenAI, Google Gemini, and Anthropic Claude, as well as local models through Ollama. Once that is in place, you can choose the pipeline that matches your task.

ScrapeGraph AI comes with several built-in graphs, each designed for a specific use case. Instead of forcing every job into one workflow, you pick the graph that matches your input type and output goal.

  • SmartScraperGraph. Scraping a single web page.
  • SmartScraperMultiGraph. Extracting data from multiple URLs with one prompt.
  • JSONScraperGraph. Working with JSON files or JSON content.
  • XMLScraperGraph. Extracting from XML.
  • SearchGraph. Searching the web, then extracting from the top results.
  • ScriptCreatorGraph. Generating reusable Python extraction scripts.
  • SpeechGraph. Turning extracted content into audio.
For most use cases, SmartScraperGraph is the starting point. It's the main graph for single-page scraping and the one you'll likely use most often.

Basic usage example

ScrapeGraph AI provides an intuitive API that makes it easy to extract data from websites. You tell it what to extract, provide the source, and pass a config object that defines the model settings. The library then handles the technical details of finding and extracting that information.

For a quick demo, we'll use SmartScraperGraph to scrape recent blog titles and publication dates from the Decodo blog.

Declaring imports

Start by importing the graph class you need:

from scrapegraphai.graphs import SmartScraperGraph

The graph_config dictionary

Every pipeline takes a graph_config dict with at least an LLM block. A simple config looks like this:

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-5-mini",
        "temperature": 0,
    },
    "verbose": True,
    "headless": True,
}

The model key uses a provider/model format. Setting temperature to 0 keeps the output more deterministic, which is what you usually want for scraping tasks. If you're extracting product prices, dates, or structured fields, randomness isn't your friend.

The other two args are straightforward:

  • verbose helps you inspect what the pipeline is doing
  • headless runs the browser without opening a visible window

Instantiating the scraper

This is where the prompt does the magic that selectors used to do. Instead of telling the scraper where every field sits in the markup, you describe the data you want.

Pass your prompt, source URL, and config into the SmartScraperGraph object:

scraper = SmartScraperGraph(
    prompt="Extract all article headlines and publication dates.",
    source="https://decodo.com/blog",
    config=graph_config,
)

Running a pipeline

Every graph follows the same pattern – create the graph, run it, and collect the result. This is the full script:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-5-mini",
        "temperature": 0,
    },
    "verbose": True,
    "headless": True,
}

scraper = SmartScraperGraph(
    prompt="Extract all article headlines and publication dates.",
    source="https://decodo.com/blog",
    config=graph_config,
)

result = scraper.run()
print(result)

The result is a Python list of dictionaries shaped by your prompt:

[
    {
        "headline": "How To Scrape Emails From a Website: Python Tutorial",
        "date": "Apr 20, 2026"
    },
    {
        "headline": "Browser-use: Step-by-Step AI Browser Automation Guide",
        "date": "Apr 17, 2026"
    },
    {
        "headline": "How to Scrape All Text From a Website: Methods, Tools, and Best Practices",
        "date": "Apr 15, 2026"
    },
    {
        "headline": "Rust Web Scraping: Step-by-Step Tutorial With Code Examples",
        "date": "Apr 15, 2026"
    },
    {
        "headline": "Crawl4AI Tutorial: Build Powerful AI Web Scrapers",
        "date": "Apr 15, 2026"
    },
    {
        "headline": "No-Code Web Scraper With Playwright MCP: How to Scrape Any Website With Playwright MCP",
        "date": "Apr 14, 2026"
    },
    {
        "headline": "What Is a Characteristic of the REST API? A Complete Guide",
        "date": "Apr 13, 2026"
    },
    {
        "headline": "How to Scrape Glassdoor: Tools, Methods, and Tips",
        "date": "Apr 13, 2026"
    }
]

If you ask for headlines and dates, that's what the graph returns. If you ask for titles, authors, summaries, and links, the output will follow that structure instead.

Prompt engineering tips

With ScrapeGraph AI, the prompt is the new selector. If you give it a vague prompt, it won't return the best output, so it helps to be precise about both the fields and the format.

Here are some handy tips:

  • Be explicit about structure. For example, "Return a JSON array of objects, each with title, author, published_at, and summary fields."
  • Specify types when they matter. For example, ask for prices as numbers without currency symbols, or availability as a boolean.
  • List the fields clearly
  • Add fallback instructions. If some pages may not contain every field, tell the model to return null for missing values.

Knowing this, a better prompt for our task would have been:

Extract all blog posts as a JSON array. Each object should include title, url, published_at, and author. If an author is missing, return null.

That gives you cleaner and more predictable output than a shorter prompt like "Extract blog post details."
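
One way to keep prompts consistent across jobs is to generate them from a field specification. The helper below is purely illustrative (it's not part of ScrapeGraph AI), but it applies the tips above mechanically:

```python
# Hypothetical helper: build a precise extraction prompt from a field spec.
FIELDS = {
    "title": "string",
    "url": "string",
    "published_at": "ISO 8601 date string",
    "author": "string, or null if missing",
}

def build_prompt(fields: dict[str, str]) -> str:
    """Turn a {field: expected type} mapping into an explicit extraction prompt."""
    field_list = "; ".join(f"{name} ({kind})" for name, kind in fields.items())
    return (
        "Extract all blog posts as a JSON array. "
        f"Each object must contain exactly these fields: {field_list}. "
        "Return null for any value that is missing on the page."
    )

print(build_prompt(FIELDS))
```

Centralizing the field list this way also keeps your prompt in sync with any validation schema you define later.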

Extracting data from JSON and local files

ScrapeGraph AI isn't limited to websites. It can also process local JSON, XML, CSV, and Markdown files, which makes it useful for post-processing API responses, cleaning exported datasets, or normalizing stored content before you save or analyze it further.

If you want to go deeper into output handling after extraction, see our guide on how to save scraped data.

A working JSONScraperGraph example

For this example, we're going to extract some data from a JSON file using JSONScraperGraph. Create a local catalog.json file in your Python project. 

You can use any JSON data for this task or simply copy and paste this sample data below:

[
    {
        "id": 1,
        "name": "Apple MacBook Air 13-inch (2020, M1, 8GB RAM, 256GB SSD)",
        "price": 589.0,
        "availability": true,
        "description": "Pre-owned MacBook Air in good working condition with the Apple M1 chip, 13-inch Retina display, and strong battery health. Typical light wear on the casing.",
        "categories": ["laptops", "ultrabooks", "apple"],
        "manufacturer": {
            "name": "Apple",
            "location": "Cupertino, California, USA"
        },
        "reviews": [
            {
                "user": "MilesCarter",
                "rating": 5,
                "comment": "Battery life is still excellent and the laptop feels fast for school and office work."
            },
            {
                "user": "NinaRowe",
                "rating": 4,
                "comment": "Arrived with minor cosmetic marks, but performance matched the listing."
            }
        ]
    },
    {
        "id": 2,
        "name": "Apple MacBook Air 13-inch (2020, M1, 8GB RAM, 512GB SSD)",
        "price": 679.0,
        "availability": true,
        "description": "Used MacBook Air with upgraded 512GB storage, responsive keyboard, and quiet fanless design. Suitable for students, browsing, and productivity tasks.",
        "categories": ["laptops", "ultrabooks", "apple"],
        "manufacturer": {
            "name": "Apple",
            "location": "Cupertino, California, USA"
        },
        "reviews": [
            {
                "user": "TheoGrant",
                "rating": 4,
                "comment": "Good value compared with newer Air models, especially with the larger SSD."
            },
            {
                "user": "AishaMorgan",
                "rating": 5,
                "comment": "Runs quietly, wakes instantly, and handles everyday workloads without any issues."
            }
        ]
    },
    {
        "id": 3,
        "name": "Apple MacBook Air 13.6-inch (2022, M2, 8GB RAM, 256GB SSD)",
        "price": 789.0,
        "availability": false,
        "description": "Slim redesigned MacBook Air with the M2 chip, MagSafe charging, and brighter display. This listing reflects a sold-out unit in very good condition.",
        "categories": ["laptops", "ultrabooks", "apple"],
        "manufacturer": {
            "name": "Apple",
            "location": "Cupertino, California, USA"
        },
        "reviews": [
            {
                "user": "JordanPrice",
                "rating": 5,
                "comment": "The newer design feels more premium than the M1 version and the screen is noticeably better."
            },
            {
                "user": "KemiAdesina",
                "rating": 4,
                "comment": "Great machine overall, though the base storage fills up quickly if you keep lots of media."
            }
        ]
    },
    {
        "id": 4,
        "name": "Apple MacBook Air 13.6-inch (2022, M2, 16GB RAM, 512GB SSD)",
        "price": 969.0,
        "availability": true,
        "description": "Higher-spec used MacBook Air with 16GB memory and 512GB SSD. Good fit for heavier multitasking, development work, and longer-term daily use.",
        "categories": ["laptops", "ultrabooks", "apple"],
        "manufacturer": {
            "name": "Apple",
            "location": "Cupertino, California, USA"
        },
        "reviews": [
            {
                "user": "VictorLang",
                "rating": 5,
                "comment": "Worth paying more for the extra RAM if you keep many apps and browser tabs open."
            },
            {
                "user": "RitaSolomon",
                "rating": 4,
                "comment": "Condition was better than expected and the upgrade in storage made a difference."
            }
        ]
    },
    {
        "id": 5,
        "name": "Apple MacBook Air 15-inch (2023, M2, 8GB RAM, 256GB SSD)",
        "price": 1049.0,
        "availability": true,
        "description": "Large-screen MacBook Air with the M2 chip, thin chassis, and strong everyday performance. Best suited for buyers who want more screen space without moving to a MacBook Pro.",
        "categories": ["laptops", "ultrabooks", "apple"],
        "manufacturer": {
            "name": "Apple",
            "location": "Cupertino, California, USA"
        },
        "reviews": [
            {
                "user": "SamuelIbe",
                "rating": 5,
                "comment": "The 15-inch display is the main reason I picked this model and it has been great for spreadsheets."
            },
            {
                "user": "LeahTurner",
                "rating": 4,
                "comment": "Excellent screen and battery life, but I would still prefer more storage at this price."
            }
        ]
    }
]

This sample JSON contains structured product data, including prices, availability, manufacturer details, and user reviews. It's readable enough on its own, but pulling out useful data manually still takes time. 

But with JSONScraperGraph, you can simply describe the output you want and let the graph handle the extraction.

from scrapegraphai.graphs import JSONScraperGraph

with open("catalog.json", "r", encoding="utf-8") as f:
    raw_json = f.read()

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-5-mini",
        "temperature": 0,
    },
    "verbose": False,
}

scraper = JSONScraperGraph(
    prompt=(
        "For each product, return the product name, price, availability, "
        "and the average rating across all reviews rounded to one decimal place."
    ),
    source=raw_json,
    config=graph_config,
)

result = scraper.run()
print(result)

This script does 3 things. It loads the local JSON file, passes it to JSONScraperGraph, and asks the model to return a smaller, cleaner dataset. 

In this case, we're extracting the product name, price, availability, and average review rating for each item.

When the script finishes running, you'll get back a Python dictionary with exactly the fields you asked for:

{
    "content": [
        {
            "name": "Apple MacBook Air 13-inch (2020, M1, 8GB RAM, 256GB SSD)",
            "price": 589.0,
            "availability": true,
            "average_rating": 4.5
        },
        {
            "name": "Apple MacBook Air 13-inch (2020, M1, 8GB RAM, 512GB SSD)",
            "price": 679.0,
            "availability": true,
            "average_rating": 4.5
        },
        {
            "name": "Apple MacBook Air 13.6-inch (2022, M2, 8GB RAM, 256GB SSD)",
            "price": 789.0,
            "availability": false,
            "average_rating": 4.5
        },
        {
            "name": "Apple MacBook Air 13.6-inch (2022, M2, 16GB RAM, 512GB SSD)",
            "price": 969.0,
            "availability": true,
            "average_rating": 4.5
        },
        {
            "name": "Apple MacBook Air 15-inch (2023, M2, 8GB RAM, 256GB SSD)",
            "price": 1049.0,
            "availability": true,
            "average_rating": 4.5
        }
    ]
}

Notice how you never told the LLM where the review data lives in the nested structure or how to calculate an average. 

You just described what you wanted, and the model worked out the rest. The same pattern applies to other formats, too. Swap JSONScraperGraph for XMLScraperGraph or CSVScraperGraph, and the rest of your code stays identical.

Scraping web pages

Now it's time for a proper live target. Wikipedia's Current Events portal publishes an ongoing daily record of major news stories grouped by category, with source publications cited in parentheses.

Each day's block contains multiple categories, each category holds several news items, and each item ends with one or more news outlets.

Notice how structurally, a lot is going on in the page. The content is grouped by date, divided into categories, and written as natural-language bullet points rather than clean table rows or fixed card layouts. 

That's where LLM-based extraction becomes useful. Instead of relying on selectors that may change, you can describe the fields you want and let the model work from meaning.

For this task, we'll use SmartScraperGraph, keeping the main configuration the same as before:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-5-mini",
        "temperature": 0,
    },
    "verbose": True,
    "headless": True,
}

scraper = SmartScraperGraph(
    prompt=("Extract every news item from the page. For each item, return "
            "the date, the category it falls under, the summary text, and "
            "the list of source publications cited in parentheses."),
    source="https://en.wikipedia.org/wiki/Portal:Current_events",
    config=graph_config,
)

result = scraper.run()
print(result)

What you'll get back is a list of objects with date, category, summary, and sources fields, all parsed out of natural-language text:

{
    "content": [
        {
            "date": "2026-04-21",
            "category": "Law and crime",
            "summary": "South Korean police seek an arrest warrant for Hybe founder and chairperson Bang Si-hyuk over alleged violations of capital market laws related to Hybe's initial public offering, accusing him of misleading early investors and receiving profits through a related private equity fund.",
            "sources": ["Reuters"]
        },
        {
            "date": "2026-04-21",
            "category": "Politics and elections",
            "summary": "New Zealand prime minister Christopher Luxon secures the support of his caucus after initiating and winning a confidence vote on his leadership within the National Party.",
            "sources": ["Reuters"]
        },
        {
            "date": "2026-04-21",
            "category": "Politics and elections",
            "summary": "The Indonesian parliament passes the Domestic Protection Workers Bill into law after 22 years of deliberation.",
            "sources": ["Tempo English"]
        },
        {
            "date": "2026-04-20",
            "category": "Armed conflicts and attacks",
            "summary": "An Iranian official says Iran may attend ceasefire talks with the United States in Islamabad following moves by Pakistan to end the blockade of Iranian ports, but a decision had yet to be made.",
            "sources": ["Al Jazeera"]
        },
        … data truncated to conserve space
    ]
}

See how easy it is to extract structured data from messy, sentence-based content without having to map every nested list, heading, and citation pattern by hand. 

If the page layout changes later, you're much less exposed than you would be with a selector-based parser.

Handling dynamic and JavaScript-heavy sites

Some websites load content only after the initial page render. In those cases, Playwright does the heavy lifting. Because SmartScraperGraph runs with browser automation under the hood, it can work with JavaScript-heavy pages and single-page apps more reliably than a simple request-based scraper.

If a page loads content slowly or asynchronously, you can tune the browser behavior with loader_kwargs:

graph_config["loader_kwargs"] = {
    "timeout": 60,
    "wait_until": "networkidle",
}

The timeout value is in seconds, and wait_until: "networkidle" tells Playwright to wait until there are no active network requests before handing the page off to the LLM. This is often enough to catch slow-loading content that a default fetch would miss.

For a deeper look at dynamic rendering, see how to scrape websites with dynamic content and our guide on Playwright web scraping.

Using proxies with ScrapeGraph AI

Scraping without proxies works fine for small tests, but the moment you need to scale up, proxies become an integral part of your setup. They help reduce rate limits, lower the chance of IP-based blocks, and let you access location-specific versions of a page. 

Since every ScrapeGraph AI request involves an LLM call, it already costs more than a simple HTTP request, so it makes sense to protect that request with more reliable infrastructure instead of risking wasted runs.

Residential proxies are usually the best fit here. Each IP comes from a real device connected to a local network, so the traffic looks more like ordinary user activity to the target site. Datacenter proxies are cheaper and faster, but they're also easier to flag because large IP ranges are tied to cloud providers and server infrastructure. 

Decodo's residential proxies are a strong fit for this kind of workflow. They give you access to 115M+ IPs across 195+ locations, support both rotating and sticky sessions, and offer targeting at country, state, city, ASN, ZIP, and continent-level. If you want a deeper look at residential proxies, see our guide on what is a residential proxy network.

Stay risk-free with residential proxies

Activate your 3-day free trial and collect data without CAPTCHAs or IP bans.

Configuring Decodo residential proxies

Getting started with Decodo is simple, and you can do so in a few easy steps:

  1. Register or log in to the Decodo dashboard.
  2. Then, open the residential proxies section, and activate your 3-day free trial or choose a subscription that best matches your scraping needs.
  3. Once you're up and running, Decodo provides the proxy credentials you need for authentication, including the server address, username, and password.

Integrating Decodo residential proxies

In ScrapeGraph AI, proxy settings are passed through loader_kwargs. That lets the browser and request layer route traffic through your proxy session while keeping the rest of the graph config unchanged.

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-5-mini",
        "temperature": 0,
    },
    "loader_kwargs": {
        "proxy": {
            "server": "http://gate.decodo.com:PORT",
            "username": "YOUR_DECODO_USERNAME",
            "password": "YOUR_DECODO_PASSWORD",
        }
    },
    "verbose": True,
    "headless": True,
}

This tells ScrapeGraph AI to run the scraping session using proxies while still using your chosen LLM provider for extraction. Once the proxy credentials are in place, you can use them with SmartScraperGraph just like any other scraping task.

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-5-mini",
        "temperature": 0,
    },
    "loader_kwargs": {
        "proxy": {
            "server": "http://gate.decodo.com:PORT",
            "username": "YOUR_DECODO_USERNAME",
            "password": "YOUR_DECODO_PASSWORD",
        }
    },
    "verbose": True,
    "headless": True,
}

scraper = SmartScraperGraph(
    prompt="Extract every news item with date, category, summary, and sources.",
    source="https://en.wikipedia.org/wiki/Portal:Current_events",
    config=graph_config,
)

result = scraper.run()
print(result)

This is usually enough for standard residential proxy use. If you're scraping a site that's sensitive to request volume or repeated access, you can switch between rotating and sticky proxies depending on the job or website. 
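
If you switch session types per job, a small helper keeps the proxy block tidy. Note the username parameter format below is an assumption for illustration only – copy the exact credential strings from your Decodo dashboard:

```python
from typing import Optional

# Hypothetical helper: build a proxy dict per job. The "-session-" username
# parameter is an assumed format; use the values your Decodo dashboard shows.
def proxy_config(username: str, password: str,
                 sticky_session: Optional[str] = None) -> dict:
    user = f"user-{username}"
    if sticky_session is not None:
        user += f"-session-{sticky_session}"  # assumed sticky-session parameter
    return {
        "server": "http://gate.decodo.com:7000",
        "username": user,
        "password": password,
    }

rotating = proxy_config("YOUR_USERNAME", "YOUR_PASSWORD")
sticky = proxy_config("YOUR_USERNAME", "YOUR_PASSWORD", sticky_session="job42")
print(sticky["username"])
```

Either dict drops straight into loader_kwargs["proxy"] without touching the rest of the graph config.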

Working with output data and formats

ScrapeGraph AI returns structured output, which is one of the main reasons it's easier to work with than a raw scraping setup.

By default, the result comes back as a Python dictionary or list from executing .run(). 

This might be fine for quick experiments, but in production, you'll need to validate the shape of your data before it hits a database or feeds into another system. 

This is where Pydantic comes in.

Validating extractions with Pydantic

Pydantic lets you define the exact structure you expect from an extraction and then check every result against it. If the LLM returns something malformed, like a missing field, a wrong type, or an unexpected nested object, you can catch it immediately.

Install Pydantic if you don't have it already:

pip install pydantic

Then define a schema that matches your expected output:

from datetime import date
from typing import List
from pydantic import BaseModel, ValidationError
class NewsItem(BaseModel):
date: date
category: str
summary: str
sources: List[str] = []
class NewsExtraction(BaseModel):
content: List[NewsItem]
raw_result = scraper.run()
try:
validated = NewsExtraction(**raw_result)
print(f"Extracted {len(validated.content)} validated items.")
except ValidationError as e:
print("Schema mismatch, raw output kept for review:")
print(e.errors())

This gives you 2 practical wins. First, you can reference the schema directly in your prompt ("Return data matching this schema: …"), which nudges the LLM toward producing the right shape on the first try. 

Second, your downstream code works with typed objects rather than loose dictionaries. news_item.date is an actual date object, not a string you'd have to parse again later.

Saving the output

Once your data is validated, serialize it to JSON with proper Unicode handling so international characters can survive:

import json
with open("news.json", "w", encoding="utf-8") as f:
json.dump(validated.model_dump(mode="json"), f, ensure_ascii=False, indent=2)

For CSV output, iterate over the validated items and write them row by row:

import csv

with open("news.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "category", "summary", "sources"])
    writer.writeheader()
    for item in validated.content:
        writer.writerow({
            "date": str(item.date),
            "category": item.category,
            "summary": item.summary,
            "sources": ", ".join(item.sources),
        })

The same pattern extends to SQLite, Postgres, or any other storage backend. Validate first, then write. For a full reference on output handling and storage options, see our guide on how to save scraped data.
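As a sketch of the SQLite variant using only the standard library, with placeholder rows standing in for your validated items (the table name and columns are illustrative):

```python
import sqlite3

# Placeholder rows standing in for validated.content items.
rows = [
    ("2025-06-01", "Politics", "Example summary", "reuters.com"),
    ("2025-06-02", "Science", "Another summary", "nature.com"),
]

conn = sqlite3.connect(":memory:")  # swap for "news.db" to persist
conn.execute(
    "CREATE TABLE IF NOT EXISTS news ("
    "date TEXT, category TEXT, summary TEXT, sources TEXT)"
)
conn.executemany("INSERT INTO news VALUES (?, ?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
conn.close()
```

Parameterized queries (the `?` placeholders) keep the insert safe even if scraped text contains quotes or other SQL-significant characters.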

Real-world applications and examples

ScrapeGraph AI fits best in scenarios where the data is messy, the sources are varied, or the target layouts change often. Here are a few practical use cases where it earns its place over a traditional scraper.

eCommerce price monitoring

Product pages look different on every retailer's site. One store might put the price inside a span, while another renders it dynamically in a JavaScript object or promo banner. An LLM-based extractor handles all of these with a single prompt, with no need for per-site parsing logic.

A good example is price monitoring across different stores. Let's say you want to compare prices for the Sony WH-1000XM5 across eBay and Walmart.

Both pages show product listings, but they don't present them in the same way. ScrapeGraph AI handles this with a single graph:

from scrapegraphai.graphs import SmartScraperMultiGraph

sources = [
    "https://www.ebay.com/sch/i.html?_nkw=sony+wh-1000xm5",
    "https://www.walmart.com/search?q=sony+wh-1000xm5",
]

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-5-mini",
        "temperature": 0,
    },
    "loader_kwargs": {
        "proxy": {
            "server": "http://gate.decodo.com:7000",
            "username": "user-YOUR_USERNAME-country-us",
            "password": "YOUR_PASSWORD",
        },
    },
    "verbose": True,
    "headless": True,
}

scraper = SmartScraperMultiGraph(
    prompt=(
        "For each product listing, extract the product name, current price, "
        "currency, seller or retailer name, and whether it is in stock."
    ),
    source=sources,
    config=graph_config,
)

results = scraper.run()
print(results)

Even though the HTML is different on each site, the model can still return the same output shape from both. To learn more about eCommerce scraping, see our guide on how to scrape Amazon prices.

Financial data extraction

Financial pages are also a good fit because they often mix structured data with narrative text. Earnings pages on sites like Yahoo Finance pack a lot of information into dense tables, with figures scattered across different elements and formats.

Instead of writing regex or selectors for each table row, you can ask ScrapeGraph AI to extract the metrics directly:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-5-mini",
        "temperature": 0,
    },
    "verbose": True,
    "headless": True,
}

scraper = SmartScraperGraph(
    prompt=(
        "Extract all financial metrics from the page. For each metric, "
        "return the metric name, the reported value as a number, the unit, "
        "and the comparison period if mentioned."
    ),
    source="https://finance.yahoo.com/quote/AAPL/financials/",
    config=graph_config,
)

print(scraper.run())

For more finance-focused scraping patterns, see our guide on how to scrape Google Finance.

Content aggregation

Content aggregation is another strong use case. News sites, blogs, and publisher pages all present the same general information, such as headline, date, author, and summary, but each one uses a different layout.

With ScrapeGraph AI, one prompt can pull the same fields from multiple pages and return them in one format, which makes it easier to organize, deduplicate, and store the results in a research database.
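For instance, after aggregating items from several publishers, a small deduplication pass keeps one copy of each story. The field names below mirror a hypothetical "headline/date/source" prompt, and the sample items are illustrative:

```python
# Hypothetical aggregated output from multiple publisher pages.
items = [
    {"headline": "Markets Rally on Rate Cut", "date": "2025-06-01", "source": "site-a"},
    {"headline": "Markets rally on rate cut", "date": "2025-06-01", "source": "site-b"},
    {"headline": "New Chip Announced", "date": "2025-06-02", "source": "site-a"},
]

def dedupe(items):
    """Keep the first occurrence of each headline, case-insensitively."""
    seen, unique = set(), []
    for item in items:
        key = item["headline"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

unique_items = dedupe(items)
```

For fuzzier duplicates (reworded headlines), you'd need a similarity measure rather than an exact key, but exact normalized matching covers the common syndication case.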

Research and academic use

Research pages are another natural fit. Listings for papers, journals, and conference content often include useful fields such as title, authors, abstract, publication date, or citations, but the markup isn't always consistent.

Lead generation

Lead generation is another practical use case, especially when working with business directories and company listing pages. These pages often include company names, industries, locations, websites, and other public business details, but the layouts can vary widely from one directory to another.

Across all of these use cases, the pattern is the same. ScrapeGraph AI is most useful when the content is readable, but the structure isn't stable enough to justify maintaining a separate parser for every source.

Advanced features and configurations

Once you're comfortable with SmartScraperGraph, ScrapeGraph AI gives you a few ways to go beyond simple single-page extraction. This is where the library starts to feel more like a flexible scraping framework – you can scrape multiple pages with a single prompt, search the web before extraction, generate reusable scripts, or build custom pipelines.

SmartScraperMultiGraph for batch scraping

If you need to extract the same kind of data from several pages at once, SmartScraperMultiGraph is usually the next step. Instead of running the same graph over and over for each URL, you pass in a list of sources and use one prompt to extract from all of them:

from scrapegraphai.graphs import SmartScraperMultiGraph

sources = [
    "https://en.wikipedia.org/wiki/Portal:Current_events",
    "https://en.wikipedia.org/wiki/Portal:Current_events/June_2025",
    "https://en.wikipedia.org/wiki/Portal:Current_events/May_2025",
]

multi = SmartScraperMultiGraph(
    prompt="Extract every news item with date, category, and summary.",
    source=sources,
    config=graph_config,
)

aggregated = multi.run()

This is useful for comparison workflows where you need the same fields extracted from multiple sources in one run. 

SearchGraph for research workflows

Sometimes you don't have a fixed list of URLs. You just know the topic you want to research. That's where SearchGraph becomes useful. Instead of starting with a page, it starts with a search query, pulls the top results, and then extracts the data you ask for.

from scrapegraphai.graphs import SearchGraph

search = SearchGraph(
    prompt=(
        "Find the top 5 open-source Python web scraping frameworks and "
        "list their name, GitHub stars, and primary use case."
    ),
    config=graph_config,
)

results = search.run()
print(results)

This can be useful for:

  • Market research
  • Competitor tracking
  • Collecting articles or public references on a topic

SpeechGraph for audio output

SpeechGraph takes extracted content and turns it into audio output. That's not a core scraping feature, but it can be useful if your workflow includes accessibility or summaries.

from scrapegraphai.graphs import SpeechGraph

speech_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-5-mini",
    },
    "tts_model": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "tts-1-hd",
        "voice": "alloy",
    },
    "output_path": "summary.mp3",
}

speech = SpeechGraph(
    prompt="Summarize the main news events on this page in 3 paragraphs.",
    source="https://en.wikipedia.org/wiki/Portal:Current_events",
    config=speech_config,
)

speech.run()

The result is both a text summary and an .mp3 file saved to your specified output path.

Custom graph pipelines

For advanced users, ScrapeGraph AI lets you build your own graphs by composing built-in nodes like FetchNode, ParseNode, RAGNode, and GenerateAnswerNode. You can also write custom nodes for specialized preprocessing, filtering, or post-processing steps.

This is power-user territory and goes beyond most scraping workflows, but it's there when you need it. The ScrapeGraph AI documentation covers node composition in detail.

Best practices and troubleshooting

Here are some tips to keep in mind:

Choose the smallest model that gets the job done

Larger models can handle harder extraction tasks, but they also cost more and usually run slower. For simple fields such as names, prices, dates, and availability, a smaller model is often enough. Save larger models for pages that are text-heavy or ambiguous. 

Keep prompts narrow

It's better to start with a small, well-defined structure, confirm that it works, and then expand it than to ask for too many fields at once. This also makes debugging easier because you can tell whether the problem comes from the page, the model, or the prompt itself. 

Validate output before saving it

Even when the structure looks good, model output should still be checked before it moves into a file, spreadsheet, or database. A simple validation layer helps catch missing fields, wrong types, or malformed values early.

Running ScrapeGraph AI in Google Colab or Jupyter notebooks

If you're running ScrapeGraph AI in Google Colab or a Jupyter notebook, you'll likely hit this error the first time you call scraper.run():

asyncio.run() cannot be called from a running event loop

This is a Colab/Jupyter event loop issue, not a ScrapeGraph AI logic issue. Notebooks already run an asyncio event loop in the kernel, and ScrapeGraph AI internally uses asyncio.run() for parts of its pipeline. When the two collide, Python raises the error above.

The fix is simple. Install and apply nest_asyncio before creating your scraper:

!pip install nest_asyncio
import nest_asyncio
nest_asyncio.apply()

Then run your existing ScrapeGraph AI code as normal.

Error handling with retry logic

Network hiccups, LLM rate limits, and transient page failures all need retries. Exponential backoff is a solid baseline:

import time

def run_with_retry(scraper, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return scraper.run()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            wait = 2 ** attempt
            print(f"Attempt {attempt} failed: {exc}. Retrying in {wait}s.")
            time.sleep(wait)

For more advanced retry patterns with jitter and decorator-based configuration, the tenacity library is worth exploring. See our guide on Python requests retry for the wider pattern.
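If you'd rather not add a dependency, full jitter is easy to sketch with the standard library. The flaky task below just simulates two transient failures before succeeding:

```python
import random
import time

def run_with_backoff(task, max_attempts=4, base=1.0, cap=30.0):
    """Retry `task` with capped exponential backoff plus full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to the capped backoff,
            # so many clients retrying at once don't synchronize.
            wait = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(wait)

# Demo: a task that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = run_with_backoff(flaky, base=0.01)
```

The jitter matters at scale: without it, a fleet of scrapers hitting the same rate limit all retry at the same instant and trip it again.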

Polite scraping

Respect robots.txt, read the terms of service, and throttle your requests. A 1 to 2 second delay between calls is enough to stay off anti-bot radars in most cases. Add retry logic where necessary, and avoid hammering a target with back-to-back requests.
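The standard library's urllib.robotparser can handle the robots.txt check. Here it parses a hard-coded, illustrative robots.txt body so the sketch needs no network access; in practice you'd fetch the file from the target site first:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules; fetch the real file from the target site.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("my-scraper", "https://example.com/products")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/x")
```

Run the check once per host before scraping, and skip any URL the parser disallows for your user agent.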

Debugging tips

  • Set "verbose": True in your config to watch each node execute. You'll spot exactly where things go wrong.
  • Call scraper.get_execution_info() after a run to inspect token counts and per-node timing.
  • Test your prompt against a single page before fanning out to SmartScraperMultiGraph. 
  • If the LLM returns empty results, try a simpler prompt first.

Final thoughts

In this article, you learned how to set up ScrapeGraph AI, configure its scraping pipelines, and use it to extract structured data from local files and live web pages.

ScrapeGraph AI is one of the easiest ways to build AI-powered scrapers for bulky or fast-changing pages. It gives you a practical way to extract structured data using natural language, rather than relying solely on CSS selectors.

If you need a more resilient setup for live targets, Decodo residential proxies can help reduce blocks, support geo-targeting, and make large-scale scraping workflows more reliable.

About the author

Kipras Kalzanauskas

Senior Account Manager

Kipras is a strategic account expert with a strong background in sales, IT support, and data-driven solutions. Born and raised in Vilnius, he studied history at Vilnius University before spending time in the Lithuanian Military. For the past 3.5 years, he has been a key player at Decodo, working with Fortune 500 companies in eCommerce and Market Intelligence.


Connect with Kipras on LinkedIn.

All information on Decodo Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

What is ScrapeGraph AI and how does it work?

ScrapeGraph AI is an open-source Python library that uses large language models (LLMs) to extract data based on meaning rather than fixed HTML patterns. Instead of writing complex CSS selectors or XPath queries, you describe what you want in plain language, and the library handles the extraction. It implements a graph-based architecture where each node represents a specific operation in the scraping pipeline, making it modular and adaptable to changing website structures.

What programming languages does ScrapeGraph AI support?

ScrapeGraph AI offers SDKs in both Python and Node.js. The core open-source library is Python-based and installable via pip, while the hosted API can be accessed from any language that supports HTTP requests. It's also compatible with multiple LLM providers like OpenAI, Groq, Azure, and Gemini, as well as local models through Ollama.

How does ScrapeGraph AI compare to Beautiful Soup or Scrapy?

Traditional methods like Beautiful Soup require carefully inspecting HTML, selecting the right elements, and writing specific code to extract data – and those scripts break whenever the site layout changes. ScrapeGraph AI uses LLMs to understand page content semantically, so your script is far more likely to keep working when the layout changes. The trade-off is that it depends on an LLM (either cloud-hosted or local), which adds cost and latency compared to the purely rule-based approach of Beautiful Soup or Scrapy.

What's the difference between ScrapeGraph AI and using ChatGPT for scraping?

ChatGPT can help you write scraping code or analyze pasted HTML, but it doesn't directly fetch or interact with live web pages on its own. ScrapeGraph AI is a purpose-built scraping tool that handles the full pipeline: fetching pages, rendering JavaScript via Playwright, and using an LLM to extract structured data – all in an automated, repeatable workflow. It also offers features like multi-page crawling, search-based extraction, and monitoring schedules that a chat-based LLM can't replicate.

Does ScrapeGraph AI work with JavaScript-heavy websites?

Yes. Installing ScrapeGraph AI also installs Playwright for fetching website content, which means it launches a real browser under the hood to render JavaScript before passing the page to the LLM for extraction. This lets it handle dynamic, single-page applications built with React, Vue, Angular, and similar frameworks. You can also configure it with proxies and headless browser settings for more demanding scraping scenarios.

© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved