
How to Build Production-Ready RAG with LlamaIndex and Web Scraping (2025 Guide)

Production RAG fails when it relies on static knowledge that goes stale. This guide shows you how to build RAG systems that scrape live web data, integrate with LlamaIndex, and actually survive production. You'll learn to architect resilient scraping pipelines, optimize vector storage for millions of documents, and deploy systems that deliver real-time intelligence at scale.

Zilvinas Tamulis

Oct 24, 2025

16 min read

Why traditional RAG systems struggle in production

Let's cut through the hype. RAG isn't magic.

A RAG system combines two things: a search system that finds relevant information and an LLM that uses that information to answer questions. When someone asks a question, your system searches your knowledge base, pulls the most relevant chunks, and feeds them to an LLM. The LLM generates an answer based on that context.

Sounds simple, right? Here's the catch: your answers are only as good as your data. In other words, stale data means stale answers. LLMs trained on frozen datasets can't tell you about product launches from last week, breaking news from this morning, or market shifts happening right now.

This is where web scraping makes a significant difference.

Web scraping turns RAG from a static research assistant into a real-time intelligence engine. By continuously feeding scraped content into your pipeline, you ensure answers stay accurate, relevant, and trustworthy.

This approach beats using an LLM alone because:

  • You get accurate answers grounded in real data
  • You can cite sources for every claim
  • You can update knowledge without retraining models
  • You control what information the LLM has access to

When building a RAG system for production, one of the best tools to use is LlamaIndex. LlamaIndex handles all the annoying infrastructure work for you. It manages document loading, text chunking, embedding generation, vector storage, and query processing. You don't need to build these components from scratch or figure out how they fit together.

LlamaIndex supports major vector databases like Pinecone, Weaviate, Chroma, and Qdrant. It integrates with OpenAI, Anthropic, and local models, while including smart chunking tools that actually understand document structure.

The framework is designed to scale from prototypes to production without major rewrites. You can start with local storage during development, then switch to a production vector database with minimal code changes.
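To make that claim concrete, here's a minimal sketch of the swap, assuming the LlamaIndex and Chroma packages from the setup section below, a local ./data folder of documents, and an OpenAI key for the default embeddings and LLM. The indexing and query calls stay the same; only the storage context changes:

import chromadb
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore

documents = SimpleDirectoryReader("./data").load_data()

# Development: default in-memory vector store
index = VectorStoreIndex.from_documents(documents)

# Production: the same call, pointed at a persistent Chroma collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Querying is identical either way
print(index.as_query_engine().query("What does this project do?"))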

Planning your production RAG architecture

Want to know why most RAG systems crash at scale? They weren't architected for it from the start. Retrofitting production requirements onto prototypes costs 10 times more than designing for them up front, so let's plan the system right.

System design essentials

Your RAG architecture needs four core components working together.

  • Data ingestion pipelines. Design pipelines that handle both real-time streams and batch processes. Each stage should fail independently, so one broken scraper shouldn't take down your entire system.
  • Vector storage. Plan for index sizes 10x your initial estimates, as they always grow faster than expected. Choose between managed services (easier to maintain) or self-hosted (more control, more work).
  • Query processing. Map your entire workflow: embedding generation, similarity search, reranking, context assembly, LLM call, response streaming. Each step adds latency, so profile each one individually to know where to optimize.
  • Monitoring. Instrument from the start with metrics for pipeline health, data freshness, retrieval quality, query latency, and cost per query, then build dashboards that catch problems before users notice them (see the sketch below).
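As a starting point for that last item, here's a hedged sketch of per-stage instrumentation using the prometheus_client library. The metric names and port are illustrative, not a standard:

import time
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; adjust the labels to your own pipeline stages
QUERY_LATENCY = Histogram("rag_query_latency_seconds", "Per-stage query latency", ["stage"])
QUERY_ERRORS = Counter("rag_query_errors_total", "Failed pipeline stages")

def timed(stage, fn, *args, **kwargs):
    """Run one pipeline stage and record how long it took."""
    start = time.perf_counter()
    try:
        return fn(*args, **kwargs)
    except Exception:
        QUERY_ERRORS.inc()
        raise
    finally:
        QUERY_LATENCY.labels(stage=stage).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes metrics from :9100/metrics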

Web scraping infrastructure

Your scraping systems need to be more resilient than your retrieval layer, because if scraping breaks, your entire knowledge base goes stale.

Select appropriate proxy types for each target and implement rotation logic proactively. Respect rate limits, robots.txt, and back off exponentially when hitting 429 errors. Patient scraping runs indefinitely while aggressive scraping gets blocked permanently.

Validate content at the time of scraping and implement circuit breakers with dead letter queues for enhanced reliability.
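Here's a minimal sketch of that retry-and-back-off behavior using only the requests library; the delay values and retry count are illustrative:

import random
import time
import requests

def fetch_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    """GET a URL, backing off exponentially on 429s and transient errors."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 429:
                # Honor Retry-After when the server provides it
                delay = float(response.headers.get("Retry-After", delay))
            elif response.status_code < 500:
                return response  # Success or a permanent client error: stop retrying
        except requests.RequestException:
            pass  # Network hiccup: fall through to the backoff below
        time.sleep(delay + random.uniform(0, 0.5))  # Jitter avoids synchronized retries
        delay *= 2
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")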

Scalability planning

Design for horizontal scaling from the start using stateless workers. Partition scraping targets by domain, distribute query processing with load balancing, and shard indexing by content type or timestamp.

Track costs per component and optimize your most expensive operations first, typically embedding generation or LLM calls.

Now that you've planned your architecture, let's set up a development environment that supports it.

Setting up your environment

Let's build this thing. Start with a clean, reproducible setup.

First, ensure you have Python 3.9 or higher. Next, create a virtual environment and install the core packages:

python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install llama-index llama-index-vector-stores-chroma llama-index-embeddings-openai \
llama-index-embeddings-huggingface llama-index-llms-openai \
fastapi uvicorn chromadb sentence-transformers \
python-dotenv requests beautifulsoup4

Optionally, install Ollama and pull a local model if you want to experiment with running the LLM on your own hardware (the example code in this guide uses OpenAI, but you can swap providers later):

ollama pull llama3.2:3b

Organize your project so code, configs, and data are clearly separated:

your-project/
├─ rag_system.py # your FastAPI app
├─ scraper.py # your scraping app
├─ .env
└─ requirements.txt

Create your .env file with these settings:

# API
RAG_API_KEY=your-secret-api-key
HOST=0.0.0.0
PORT=8000
# Vector store
CHROMA_PERSIST_DIR=./chroma_db
# Decodo API (used by scraper.py)
DECODO_USERNAME=your-decodo-username
DECODO_PASSWORD=your-decodo-password
# LLM (optional, set ENABLE_LLM=true to use)
ENABLE_LLM=false
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
OPENAI_MODEL=gpt-3.5-turbo

Create a requirements.txt that looks like this:

# Web Framework
fastapi==0.115.*
uvicorn[standard]==0.30.*
# LlamaIndex
llama-index
llama-index-vector-stores-chroma
llama-index-embeddings-openai
llama-index-embeddings-huggingface
llama-index-llms-openai
# Vector Database
chromadb==0.5.*
# Embeddings
sentence-transformers==2.7.*
# Web Scraping
requests==2.32.*
beautifulsoup4==4.12.*
# Environment Configuration
python-dotenv==1.0.*
# Data Validation (FastAPI dependency)
pydantic==2.5.*
# Optional for full RAG
openai==1.40.*

Building the web scraping data pipeline

Your RAG system is only as good as the data feeding it. Let's build a scraper that doesn't break.

Production scrapers need three things: proxy rotation to avoid bans, proper error handling for when things go wrong, and validation to catch bad data before it enters your system.

Here's a production-grade scraper using Decodo's Web Scraping API:

import os
import requests
from bs4 import BeautifulSoup
import logging
from typing import Optional
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

logger = logging.getLogger(__name__)


class DecodoScraper:
    """Web scraper using Decodo's Web Scraping API"""

    def __init__(self):
        self.username = os.getenv("DECODO_USERNAME")
        self.password = os.getenv("DECODO_PASSWORD")
        self.base_url = "https://scraper-api.decodo.com/v2/scrape"
        self.session = requests.Session()
        if self.username and self.password:
            self.session.auth = (self.username, self.password)
            self.session.headers.update({'Content-Type': 'application/json'})
        else:
            logger.warning("Decodo credentials not found in environment variables")

    def scrape_url(self, url: str) -> str:
        """Scrape a webpage using Decodo's Web Scraping API"""
        try:
            logger.info(f"Scraping URL: {url}")
            task_params = {
                'url': url,
                'parse': True  # Enable automatic parsing
            }
            response = self.session.post(
                self.base_url,
                json=task_params,
                timeout=30
            )
            logger.info(f"Response status: {response.status_code}")
            if response.status_code == 200:
                data = response.json()
                logger.info(f"Response data keys: {list(data.keys())}")
                logger.info(f"Full response structure: {data}")
                # Check if we have results
                if 'results' in data and len(data['results']) > 0:
                    result = data['results'][0]
                    logger.info(f"Result keys: {list(result.keys())}")
                    # Try to extract the actual HTML content
                    if 'content' in result:
                        content = result['content']
                        logger.info(f"Content keys: {list(content.keys()) if isinstance(content, dict) else 'Not a dict'}")
                        # If we have HTML content directly
                        if 'html' in content:
                            soup = BeautifulSoup(content['html'], 'html.parser')
                            text = soup.get_text()
                            return text
                        # If we have text content directly
                        elif 'text' in content:
                            return content['text']
                        # If we have a description
                        elif 'description' in content:
                            return content['description']
                        # If content is a string
                        elif isinstance(content, str):
                            return content
                        # If we have parsed content
                        elif 'results' in content:
                            return str(content['results'])
                        else:
                            return str(content)
                    # If the result has HTML directly
                    elif 'html' in result:
                        soup = BeautifulSoup(result['html'], 'html.parser')
                        text = soup.get_text()
                        return text
                    # If the result has text directly
                    elif 'text' in result:
                        return result['text']
                    # If the result has a description
                    elif 'description' in result:
                        return result['description']
                    else:
                        # Return the entire result as a string for debugging
                        logger.warning(f"Unexpected result structure: {result}")
                        return str(result)
                else:
                    raise Exception(f"No results found in response: {data}")
            else:
                raise Exception(f"HTTP error {response.status_code}: {response.text}")
        except requests.exceptions.RequestException as e:
            logger.error(f"Request failed: {e}")
            raise Exception(f"Request failed: {e}")
        except Exception as e:
            logger.error(f"Scraping failed: {e}")
            raise Exception(f"Scraping failed: {e}")

    def is_ready(self) -> bool:
        """Check if the scraper is ready (has credentials)"""
        return bool(self.username and self.password)

    def test_connection(self) -> bool:
        """Test the connection to the Decodo API"""
        try:
            # Test with a simple URL
            test_url = "https://httpbin.org/html"
            content = self.scrape_url(test_url)
            return len(content) > 0
        except Exception as e:
            logger.error(f"Connection test failed: {e}")
            return False


# Global scraper instance
_scraper_instance = None


def get_scraper() -> DecodoScraper:
    """Get the global scraper instance"""
    global _scraper_instance
    if _scraper_instance is None:
        _scraper_instance = DecodoScraper()
    return _scraper_instance


# Test function
if __name__ == "__main__":
    import asyncio

    # Set up logging to see debug information
    logging.basicConfig(level=logging.INFO)

    async def test_scraper():
        scraper = get_scraper()
        if not scraper.is_ready():
            print("Scraper not ready - missing credentials")
            print("Make sure to set DECODO_USERNAME and DECODO_PASSWORD in .env file")
            return
        print("Scraper ready")
        try:
            # Test with a simple URL
            test_url = "https://httpbin.org/html"
            print(f"Testing with URL: {test_url}")
            content = scraper.scrape_url(test_url)
            print(f"Successfully scraped {len(content)} characters")
            print(f"Content preview: {content[:200]}...")

            # Also test with a more complex URL
            print("\n" + "=" * 50)
            test_url2 = "https://example.com"
            print(f"Testing with URL: {test_url2}")
            content2 = scraper.scrape_url(test_url2)
            print(f"Successfully scraped {len(content2)} characters")
            print(f"Content preview: {content2[:200]}...")
        except Exception as e:
            print(f"Scraping failed: {e}")

    asyncio.run(test_scraper())

This scraper automatically loads your Decodo credentials from the .env file and uses them for all requests. It handles JavaScript rendering, proxy rotation, and anti-bot measures through Decodo's Web Scraping API. You just pass in a URL and get clean text back.

The scraper includes proper error handling, logging, and a test_connection() method to verify everything works before you start scraping real data. The get_scraper() function provides a global instance, so you don't need to create new connections for every request.

Raw scraped data needs cleaning before it enters your vector database. Start by stripping out HTML tags, scripts, ads, and irrelevant sections using Beautiful Soup or regex, then fix spacing and encoding issues.

Once cleaned, break large texts into chunks of 500 to 1000 tokens, a size small enough for LLM context windows but large enough to preserve meaning. Enrich each chunk with metadata like source URL, timestamp, and section title so you can filter results and provide citations later. Finally, remove duplicate content by either keeping the newest version or maintaining multiple versions tagged by source.

This preprocessing step is critical because clean data during ingestion saves you weeks of debugging poor retrieval quality caused by garbage in your index.
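Here's a hedged sketch of that cleanup-and-chunk step, assuming BeautifulSoup for stripping markup and LlamaIndex's SentenceSplitter for chunking; the 512-token chunk size is just a starting point:

from datetime import datetime, timezone
from bs4 import BeautifulSoup
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

def clean_and_chunk(html: str, source_url: str) -> list:
    """Strip boilerplate from raw HTML, then chunk it with source metadata attached."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()  # Drop scripts, styles, and navigation chrome
    text = " ".join(soup.get_text(separator=" ").split())  # Normalize whitespace

    doc = Document(
        text=text,
        metadata={
            "url": source_url,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
        },
    )
    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
    return splitter.get_nodes_from_documents([doc])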

Tired of managing proxy pools and fighting CAPTCHAs?

Decodo's residential proxies deliver 99.95% success rates across 115M+ IPs in 195+ locations with <0.6s response times. Your scraping pipeline remains resilient while you focus on RAG architecture.

Integrating web-scraped data with LlamaIndex

Now comes the good part. Feeding your clean scraped data into LlamaIndex and querying it.

A production RAG loop has three parts: ingesting fresh web data, indexing it efficiently, and querying with precision. This implementation uses FastAPI for endpoints, ChromaDB for vector storage, and LlamaIndex for all the AI heavy lifting.

Data ingestion workflows

How often you update your data should match how often your sources change. Real-time ingestion reduces the delay between when information is published and when users can find it in your system. This matters most for time-sensitive content like news feeds and market data, where minutes can make a difference.

First, your RAG system needs to be able to handle scraping through background tasks, so API requests return immediately while scraping happens asynchronously. This prevents timeout issues when scraping large sites:

# rag_system.py
import os
import asyncio
import logging
from typing import List, Dict, Any, Optional
from datetime import datetime, timezone
from contextlib import asynccontextmanager
import uuid

# Core dependencies
from fastapi import FastAPI, HTTPException, Depends, BackgroundTasks, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, HttpUrl, Field
import uvicorn
from dotenv import load_dotenv

load_dotenv()

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# LlamaIndex imports
try:
    from llama_index.core import Document, VectorStoreIndex, StorageContext
    from llama_index.core import Settings as LISettings
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.vector_stores.chroma import ChromaVectorStore
    from llama_index.llms.openai import OpenAI as LIOpenAI
    import chromadb
    from chromadb.config import Settings as ChromaSettings

    LLAMAINDEX_AVAILABLE = True
except ImportError as e:
    LLAMAINDEX_AVAILABLE = False
    logger.warning(f"LlamaIndex not available: {e}")


class Settings:
    """Application settings from environment variables"""
    # API Configuration
    API_KEY = os.getenv("RAG_API_KEY", "your-secret-api-key")
    HOST = os.getenv("HOST", "0.0.0.0")
    PORT = int(os.getenv("PORT", 8000))
    # Database Configuration
    CHROMA_PERSIST_DIR = os.getenv("CHROMA_PERSIST_DIR", "./chroma_db")
    # LLM Configuration
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-3.5-turbo")
    ENABLE_LLM = os.getenv("ENABLE_LLM", "true").lower() == "true"
    # Embedding Configuration
    EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "openai")
    HUGGINGFACE_MODEL = os.getenv("HUGGINGFACE_MODEL", "BAAI/bge-small-en-v1.5")
    # Chunking Configuration
    CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", 1024))
    CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", 200))

The configuration loads from environment variables, making it easy to deploy across different environments without requiring code changes. You can add your API key to .env.

Vector storage and indexing

LlamaIndex integrates with ChromaDB to handle vector storage with automatic persistence. The system lets LlamaIndex manage chunking, embeddings, and indexing.

class LlamaIndexVectorDB:
    """Vector DB via LlamaIndex on top of Chroma"""

    def __init__(self, persist_directory: str):
        self.persist_directory = persist_directory
        self.client = None
        self.is_initialized = False

    async def initialize(self, settings: Settings):
        if not LLAMAINDEX_AVAILABLE:
            raise Exception("LlamaIndex not available. Install with: pip install llama-index")
        os.makedirs(self.persist_directory, exist_ok=True)
        self.client = chromadb.PersistentClient(
            path=self.persist_directory,
            settings=ChromaSettings(anonymized_telemetry=False, allow_reset=True),
        )
        # Configure the embedding model through LlamaIndex
        if settings.EMBEDDING_MODEL.lower() == "openai" and settings.OPENAI_API_KEY:
            LISettings.embed_model = OpenAIEmbedding(
                model="text-embedding-3-small",
                api_key=settings.OPENAI_API_KEY
            )
        else:
            LISettings.embed_model = HuggingFaceEmbedding(
                model_name=settings.HUGGINGFACE_MODEL
            )
        # Configure chunking through LlamaIndex
        LISettings.node_parser = SentenceSplitter(
            chunk_size=settings.CHUNK_SIZE,
            chunk_overlap=settings.CHUNK_OVERLAP
        )
        self.is_initialized = True
        logger.info("LlamaIndexVectorDB initialized")

    async def add_document(self, collection_name: str, content: str, metadata: Dict[str, Any]):
        collection = self.client.get_or_create_collection(collection_name)
        vector_store = ChromaVectorStore(chroma_collection=collection)
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        # Let LlamaIndex handle chunking automatically
        docs = [Document(text=content, metadata=metadata)]
        VectorStoreIndex.from_documents(docs, storage_context=storage_context)
        logger.info(f"Indexed document into '{collection_name}'")

    async def query(self, collection_name: str, query_text: str, n_results: int = 5) -> Dict[str, Any]:
        collection = self.client.get_or_create_collection(collection_name)
        vector_store = ChromaVectorStore(chroma_collection=collection)
        index = VectorStoreIndex.from_vector_store(vector_store)
        qe = index.as_query_engine(similarity_top_k=n_results)
        resp = await asyncio.to_thread(qe.query, query_text)
        docs, metas, dists = [], [], []
        for sn in getattr(resp, "source_nodes", []):
            docs.append(sn.get_text())
            metas.append(sn.node.metadata or {})
            # sn.score is a similarity score (higher is better)
            relevance_score = max(0.0, min(1.0, sn.score or 0.0))
            dists.append(relevance_score)
        return {"documents": docs, "metadatas": metas, "distances": dists}

There are several reasons why you should use this approach:

  • Flexible embedding models. You can choose between OpenAI embeddings (high quality, paid) or HuggingFace embeddings (free, local) and configure them through environment variables without code changes.
  • Smart chunking. LlamaIndex's SentenceSplitter respects sentence boundaries while hitting your target chunk size. The default here is 1024 tokens with a 200-token overlap for context preservation.
  • Persistent storage. ChromaDB automatically saves to disk, so your index survives restarts without manual saves.
  • Collection-based organization. Each scraped site or project gets its own collection, so you can query documentation separately from blog posts and product data separately from support tickets.

Now we tie it all together. Scraping, indexing, and querying in one system:

class LlamaIndexRAGSystem:
    """Production-ready RAG system using LlamaIndex"""

    def __init__(self, settings: Settings):
        self.settings = settings
        self.vector_db = LlamaIndexVectorDB(settings.CHROMA_PERSIST_DIR)
        self.is_initialized = False

    async def initialize(self):
        """Initialize LlamaIndex components"""
        try:
            await self.vector_db.initialize(self.settings)
            # Configure the LlamaIndex LLM
            if self.settings.ENABLE_LLM and self.settings.OPENAI_API_KEY:
                LISettings.llm = LIOpenAI(
                    model=self.settings.OPENAI_MODEL,
                    api_key=self.settings.OPENAI_API_KEY
                )
                logger.info(f"LlamaIndex LLM configured: {self.settings.OPENAI_MODEL}")
            self.is_initialized = True
            logger.info("LlamaIndex RAG system initialized successfully")
        except Exception as e:
            logger.error(f"Failed to initialize LlamaIndex RAG system: {e}")
            raise

    async def add_document(self, content: str, metadata: Dict[str, Any], collection_name: str = None):
        """Add a document to LlamaIndex"""
        try:
            if not self.is_initialized:
                raise Exception("RAG system not initialized")
            await self.vector_db.add_document(
                collection_name or "default",
                content,
                metadata
            )
            logger.info(f"Added document to LlamaIndex with {len(content)} characters")
        except Exception as e:
            logger.error(f"Failed to add document: {e}")
            raise

    async def query(self, query_text: str, collection_name: str = None) -> Dict[str, Any]:
        """Query the LlamaIndex RAG system"""
        try:
            if not self.is_initialized:
                raise Exception("RAG system not initialized")
            # Query with LlamaIndex
            results = await self.vector_db.query(
                collection_name or "default",
                query_text,
                n_results=5
            )
            if results["documents"]:
                context = "\n\n".join(results["documents"])
                # Try to use the LlamaIndex LLM for response generation
                if hasattr(LISettings, 'llm') and LISettings.llm:
                    try:
                        prompt = f"""Based on the following context, please answer the question.
Context:
{context}
Question: {query_text}
Answer:"""
                        response = await asyncio.to_thread(LISettings.llm.complete, prompt)
                        answer = str(response)
                    except Exception as e:
                        logger.warning(f"LlamaIndex LLM generation failed, using simple assembly: {e}")
                        answer = f"Based on the context: {context[:500]}..."
                else:
                    # Simple assembly (R+A mode)
                    answer = f"Based on the context: {context[:500]}..."
                sources = []
                for i, (doc, metadata) in enumerate(zip(results["documents"], results["metadatas"])):
                    sources.append({
                        "content": doc[:200] + "..." if len(doc) > 200 else doc,
                        "metadata": metadata,
                        "relevance_score": results["distances"][i] if i < len(results["distances"]) else 0
                    })
                mode = "LlamaIndex RAG" if getattr(LISettings, "llm", None) else "LlamaIndex R+A"
                return {
                    "answer": answer,
                    "sources": sources,
                    "mode": mode
                }
            else:
                mode = "LlamaIndex RAG" if getattr(LISettings, "llm", None) else "LlamaIndex R+A"
                return {
                    "answer": "No relevant information found.",
                    "sources": [],
                    "mode": mode
                }
        except Exception as e:
            logger.error(f"Query failed: {e}")
            raise

    async def scrape_and_store(self, url: str, collection_name: str = None, max_content_length: int = 1_000_000) -> str:
        """Scrape a URL and store the content in LlamaIndex"""
        try:
            from scraper import get_scraper

            scraper = get_scraper()
            content = scraper.scrape_url(url)
            # Guard against empty pages
            if not content or len(content.strip()) < 50:
                raise Exception(f"Scraped content too short or empty: {len(content)} characters")
            # Size limit check
            if len(content) > max_content_length:
                logger.warning(f"Content truncated from {len(content)} to {max_content_length} characters")
                content = content[:max_content_length]
            # Create metadata
            metadata = {
                "url": url,
                "scraped_at": datetime.now(timezone.utc).isoformat(),
                "task_id": uuid.uuid4().hex,
                "content_length": len(content)
            }
            await self.add_document(content, metadata, collection_name)
            logger.info(f"Successfully scraped and stored content from {url}")
            return collection_name or "default"
        except Exception as e:
            logger.error(f"Failed to scrape and store: {e}")
            raise

The LlamaIndexRAGSystem orchestrates the complete RAG pipeline, initializing the vector database and optionally configuring an OpenAI LLM for answer generation. It operates in two modes: full RAG mode (when an LLM is available) generates natural language answers from retrieved context, while R+A mode (fallback) simply assembles the retrieved context without generation.

Each search result includes a relevance score between 0 and 1, and source documents are truncated to 200 characters for display. The system includes a scrape_and_store method that fetches web content, validates it meets minimum length requirements, and indexes it with metadata like URL and timestamp. Most errors propagate up via exceptions, with the main graceful fallback occurring when LLM generation fails.

Now, it’s time to create the FastAPI endpoints to expose your RAG system:

# Initialize settings and the RAG system
settings = Settings()
rag_system = LlamaIndexRAGSystem(settings)


# STARTUP / SHUTDOWN
@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("Starting LlamaIndex RAG system...")
    try:
        await rag_system.initialize()
        logger.info("LlamaIndex RAG system started successfully")
        yield
    finally:
        logger.info("Shutting down RAG system...")


# Initialize FastAPI app (the lifespan hook runs rag_system.initialize() on startup)
app = FastAPI(
    title="LlamaIndex RAG System",
    description="Production-ready RAG system with LlamaIndex and web scraping",
    version="1.0.0",
    lifespan=lifespan
)

# Add CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Security
security = HTTPBearer()


# API MODELS
class ScrapeRequest(BaseModel):
    url: HttpUrl
    collection_name: Optional[str] = Field(default=None, description="Custom collection name")
    max_content_length: Optional[int] = Field(default=1000000, ge=1000, le=5000000, description="Maximum content length")


class QueryRequest(BaseModel):
    query: str = Field(..., min_length=1, max_length=1000)
    collection_name: Optional[str] = Field(default=None)


class QueryResponse(BaseModel):
    answer: str
    sources: List[Dict[str, Any]]
    timestamp: datetime
    mode: str


# AUTHENTICATION
async def verify_api_key(credentials: HTTPAuthorizationCredentials = Depends(security)):
    """Verify the API key"""
    if credentials.credentials != settings.API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return True


# API ENDPOINTS
@app.get("/")
async def root():
    """Root endpoint"""
    return {"message": "LlamaIndex RAG System with Web Scraping"}


@app.post("/scrape", status_code=status.HTTP_202_ACCEPTED)
async def scrape_url(
    request: ScrapeRequest,
    background_tasks: BackgroundTasks,
    _: bool = Depends(verify_api_key)
):
    """Scrape a URL and store it in LlamaIndex"""
    try:
        task_id = uuid.uuid4().hex
        background_tasks.add_task(
            process_scraping_task,
            task_id,
            str(request.url),
            request.collection_name,
            request.max_content_length
        )
        return {
            "task_id": task_id,
            "status": "accepted",
            "message": "Scraping started in background"
        }
    except Exception as e:
        logger.error(f"Scraping request failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/query", response_model=QueryResponse)
async def query_rag(
    request: QueryRequest,
    _: bool = Depends(verify_api_key)
):
    """Query the LlamaIndex RAG system"""
    try:
        result = await rag_system.query(
            query_text=request.query,
            collection_name=request.collection_name
        )
        return QueryResponse(
            answer=result["answer"],
            sources=result["sources"],
            timestamp=datetime.now(timezone.utc),
            mode=result["mode"]
        )
    except Exception as e:
        logger.error(f"Query failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
async def health_check():
    """Health check endpoint"""
    from scraper import get_scraper

    scraper = get_scraper()
    return {
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc),
        "mode": "LlamaIndex RAG",
        "services": {
            "rag_system": "healthy" if rag_system.is_initialized else "unhealthy",
            "scraper": "healthy" if scraper.is_ready() else "unhealthy",
            "llamaindex": "enabled" if LLAMAINDEX_AVAILABLE else "disabled"
        }
    }


# BACKGROUND TASKS
async def process_scraping_task(task_id: str, url: str, collection_name: str, max_content_length: int = 1_000_000):
    """Background task to process scraping"""
    try:
        logger.info(f"Processing scraping task {task_id} for {url}")
        await rag_system.scrape_and_store(url, collection_name, max_content_length)
        logger.info(f"Scraping task {task_id} completed successfully")
    except Exception as e:
        logger.error(f"Scraping task {task_id} failed: {e}")


# MAIN
if __name__ == "__main__":
    uvicorn.run(
        app,
        host=settings.HOST,
        port=settings.PORT,
        reload=False,
        log_level="info"
    )

With this setup, scraping runs in the background so your API requests return immediately with a task ID instead of making you wait for large sites to finish. Bearer token authentication also keeps your API secure by blocking unauthorized users.

The system includes a health check endpoint that monitors both the RAG system and scraper status, making it easy to see if everything is running properly. All responses use structured formats through Pydantic models, ensuring consistency and automatic data validation.

Running your RAG API

Finally, you can start the server:

python rag_system.py

The API runs on http://localhost:8000. You'll see LlamaIndex initialize, ChromaDB connect, and your embedding model load. Once you see "LlamaIndex RAG system started successfully," you're ready to roll.

Hit the API directly from your terminal:

# Scrape a URL
curl -X POST "http://localhost:8003/scrape" \
-H "Authorization: Bearer your-secret-api-key" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/docs", "collection_name": "my_docs"}'
# Query the scraped content
curl -X POST "http://localhost:8000/query" \
-H "Authorization: Bearer your-secret-api-key" \
-H "Content-Type: application/json" \
-d '{"query": "What are the main features?", "collection_name": "my_docs", "n_results": 5}'
# Check health
curl "http://localhost:8000/health"

The /scrape endpoint returns a task ID instantly. Your scraping runs in the background while your API stays responsive. No timeouts, no blocked requests.

Integrate with your Python applications:

import requests

API_URL = "http://localhost:8000"
API_KEY = "your-secret-api-key"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Scrape a URL
scrape_response = requests.post(
    f"{API_URL}/scrape",
    json={"url": "https://example.com/docs", "collection_name": "docs"},
    headers=headers
)
print(scrape_response.json())

# Query after scraping completes
query_response = requests.post(
    f"{API_URL}/query",
    json={
        "query": "How do I get started?",
        "collection_name": "docs"
    },
    headers=headers
)
print(query_response.json())

The query responses include everything you need:

{
  "answer": "To get started, install the required packages...",
  "sources": [
    {
      "content": "Installation instructions: First, create a virtual environment...",
      "metadata": {
        "url": "https://example.com/docs/getting-started",
        "scraped_at": "2025-10-17T10:30:00Z",
        "content_length": 5234
      },
      "relevance_score": 0.89
    }
  ],
  "timestamp": "2025-10-17T10:35:00Z",
  "mode": "LlamaIndex RAG"
}

The "mode" field tells you whether you're running full RAG with LLM-generated answers or R+A mode with assembled context. The “relevance_score” shows how well each source matches your query.

So far, the system utilises asynchronous operations throughout, which means it can handle multiple requests simultaneously without any of them blocking each other. Long-running scraping jobs run in the background, so users receive immediate responses instead of waiting.

ChromaDB automatically saves all data to disk, ensuring your data remains safe even if the server restarts. Each project gets its own collection to keep data separate and searches focused. This production-ready system combines reliable scraping with a clean API interface, and FastAPI automatically generates documentation at http://localhost:8000/docs so you can easily test and connect it to other applications.

With the API running, you can easily scrape and organize multiple sources into separate collections:

import requests

API_URL = "http://localhost:8000"
API_KEY = "your-secret-api-key"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Scrape multiple sources into organized collections
sources = {
    "documentation": [
        "https://example.com/docs/getting-started",
        "https://example.com/docs/api-reference"
    ],
    "blog": [
        "https://example.com/blog/latest",
        "https://example.com/blog/tutorials"
    ]
}

for collection, urls in sources.items():
    for url in urls:
        response = requests.post(
            f"{API_URL}/scrape",
            json={"url": url, "collection_name": collection},
            headers=headers
        )
        print(f"Scraping {url} into {collection}: {response.json()}")

# Query specific collections
docs_query = requests.post(
    f"{API_URL}/query",
    json={
        "query": "How do I authenticate?",
        "collection_name": "documentation"
    },
    headers=headers
)
print(docs_query.json())

To enable full RAG mode with LLM-generated answers, update your .env file:

# Enable LLM mode
ENABLE_LLM=true
OPENAI_API_KEY=sk-your-actual-api-key-here
OPENAI_MODEL=gpt-3.5-turbo # or gpt-4

Restart your RAG API. Query responses will now report "mode": "LlamaIndex RAG", and the health check's services status will confirm the system is healthy. In RAG mode, the LLM reads the retrieved context and generates a natural language answer. If LLM generation fails for any reason (API error, timeout, rate limit), the system automatically falls back to R+A mode.

You have three main options for choosing an LLM to power your RAG system.

  • The simplest approach is running Ollama locally on your own hardware, which eliminates API costs and keeps everything private, making it ideal for development and moderate production workloads.
  • If you need faster responses and higher quality answers, cloud APIs like OpenAI or Anthropic work well for high-volume production, though you'll pay for each query.
  • For maximum control, self-hosted options like vLLM or TGI let you run larger models on your own GPU servers with higher upfront costs but unlimited queries afterward.

Choose based on your budget, query volume, and quality requirements, and remember that your code structure remains the same across providers, so switching later is straightforward.
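Because the LLM is configured through LlamaIndex's global Settings object, swapping providers is effectively a one-line change. A sketch, assuming the llama-index-llms-ollama package is installed and an Ollama server is running locally with the model pulled earlier:

from llama_index.core import Settings as LISettings

# Cloud provider (the default in this guide)
from llama_index.llms.openai import OpenAI
LISettings.llm = OpenAI(model="gpt-3.5-turbo")

# Local provider: same Settings hook, different backend
from llama_index.llms.ollama import Ollama
LISettings.llm = Ollama(model="llama3.2:3b", request_timeout=120.0)

# The rest of the pipeline (indexing, query engines) picks up Settings.llm automatically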

Production deployment considerations

Deployment reveals whether your planning was solid or full of shortcuts. There are no second chances in production.

Infrastructure needs

Production RAG systems handling 100 to 1000 daily queries need 8 to 16 CPU cores, 32 to 64GB RAM, and SSD storage. Add a GPU if you're generating embeddings at high volume. When you need more capacity, scale horizontally by adding workers rather than upgrading single machines for better fault tolerance.

Your database setup is equally critical. Use connection pooling to reuse connections efficiently, set up read replicas to distribute query load, and configure automated backups with point-in-time recovery. When queries slow down, rebuild your indexes before throwing hardware at the problem.

Security comes next. Protect your network with TLS encryption, API authentication, rate limiting, CORS policies, and load balancers. For scraping infrastructure, deploy residential proxies with credential rotation and fallback strategies. Decodo's residential proxies achieve 99.95% success rates with sub-second response times across 115M+ IPs in 195+ locations, handling the complexity for you.

Finally, prepare for failures. Automate vector database backups, keep configurations in version control, save production snapshots for rollback, and test your recovery procedures before you need them.

Monitoring and observability

You can't fix what you can't see. Start by tracking query latency at the 50th, 95th, and 99th percentiles, along with embedding generation time and throughput. These metrics help you catch bottlenecks before users notice.

Monitor data quality continuously by checking content staleness, retrieval accuracy, and answer relevance. Set up alerts that warn you when scraped data becomes outdated or quality drops.

Beyond system metrics, watch how users interact with your RAG system. Log errors, scraping failures, API rate limits, and performance issues; also track user satisfaction scores, query patterns, response times, and feedback to understand real-world performance.

Security and compliance

Start with the basics by encrypting data at rest and in transit. Implement role-based access control to limit who can access data, API endpoints, and admin functions. Keep detailed audit logs of system access, data modifications, and configuration changes for security reviews and incident investigations.

For regulatory compliance, implement data retention policies, user data deletion workflows, and consent management. Document your data sources and processing activities to meet GDPR, CCPA, or other relevant regulations.

Troubleshooting common issues

Even well-built systems fail. What separates production-ready RAG from prototypes is how gracefully your system handles failures and how quickly you can debug problems.

Pipeline failures

Your scraping infrastructure faces constant challenges as target websites change structure, rate limits trigger unexpectedly, proxies fail, and network issues interrupt requests. Handle these by:

  • Categorizing errors to distinguish transient from permanent failures
  • Implementing retry logic with exponential backoff
  • Deploying circuit breakers that pause scraping when error rates spike
  • Maintaining dead letter queues for URLs requiring manual review

When scraping fails consistently, check if targets changed their HTML structure, updated anti-bot measures, implemented new rate limits, or blocked your IPs. Monitor validation failure rates to catch encoding issues and broken content extraction before they corrupt your index.

Beyond scraping, vector databases become unavailable, LLM APIs rate limit requests, and authentication tokens expire. Implement health checks for all dependencies, automatic retry with backoff, fallback to cached data, and alerts when failures persist.

Performance problems

Common bottlenecks include vector database query latency degrading with index size, embedding generation limiting throughput, memory exhaustion from large documents, and network bandwidth constraints. Profile under realistic load before deploying.

Address memory issues by streaming large documents, configuring cache limits with LRU eviction, and restarting workers periodically. Optimize slow queries by reducing retrieved chunks, caching results, using faster embedding models, and streaming responses.

Monitoring and maintenance

Implement synthetic queries that verify functionality, check database responsiveness, and monitor data freshness. Deploy to staging first, use blue-green deployments for zero-downtime updates, and maintain rollback procedures. Track retrieval relevance scores, query success rates, and user feedback over time, catching problems during development instead of production.
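One way to implement those synthetic checks is a small script run on a schedule (cron, CI, or a monitoring agent) that exercises the API built above; the query, expected phrase, and latency threshold here are placeholders you'd tune to your own index:

import sys
import time
import requests

API_URL = "http://localhost:8000"
HEADERS = {"Authorization": "Bearer your-secret-api-key"}

def synthetic_check(query: str, expected_phrase: str, max_latency_s: float = 3.0) -> bool:
    """Fire a known query and verify the system answers sensibly and quickly."""
    start = time.perf_counter()
    resp = requests.post(f"{API_URL}/query", json={"query": query}, headers=HEADERS, timeout=30)
    latency = time.perf_counter() - start
    if resp.status_code != 200:
        print(f"FAIL: HTTP {resp.status_code}")
        return False
    body = resp.json()
    ok = expected_phrase.lower() in body["answer"].lower() and latency <= max_latency_s
    print(f"{'OK' if ok else 'FAIL'}: {latency:.2f}s, {len(body['sources'])} sources")
    return ok

if __name__ == "__main__":
    # Placeholder query and phrase: use a question your index should always answer
    sys.exit(0 if synthetic_check("How do I get started?", "install") else 1)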

Real-world RAG implementation examples

Customer support knowledge base

An AI assistant can scrape your product documentation, FAQ pages, and ticket logs to provide on-demand answers. For example, if a customer asks a support bot “How do I reset my device?” the system retrieves the relevant manual section (scraped from your docs site) and answers precisely. Any changes to the documents (new firmware) are scraped and indexed immediately.

ServiceNow's Now Assist in AI Search demonstrates this approach in production by using RAG to retrieve relevant knowledge articles and generate actionable Q&A cards for customer support queries. The system retrieves articles from the customer's knowledge base, augments queries with context from top-ranked content, and generates answers that cite sources, delivering "answers instead of links" for features launched as recently as the previous week.

Market research automation

You can also build a RAG system that constantly scrapes news sites, competitor blogs, and social media feeds for mentions of your industry. Analysts can query it for trends or competitor moves. For instance, “What did Company X announce this week?” triggers a search over the latest scraped press releases and news articles, yielding a summary.

AlphaSense uses RAG technology with over 500 million premium business documents to deliver competitive intelligence at scale. Its Generative Search feature interprets natural language queries like an analyst and provides cited responses to minimize hallucinations.

The platform automates competitive benchmarking, tracks pricing and product changes, and monitors market trends in real-time, condensing hours of research into seconds. With hundreds of new expert interviews added weekly, it enables firms to spot demand shifts, supply chain disruptions, and emerging opportunities as they happen.

Content & insights pipeline

RAG systems let you aggregate and analyze public data (e.g., social media posts, review sites, forums). A RAG model can answer questions like “What are common complaints about Product Y?” by retrieving scraped user reviews and summarizing sentiment. Live monitoring (scraping Twitter or Reddit) lets the system alert on shifts in public opinion.

Microsoft Copilot Studio uses RAG to monitor public websites and generate conversational responses by retrieving relevant content from specified domains and providing cited summaries. The system performs grounding checks, provenance validation, and semantic similarity analysis on retrieved content while applying content moderation to filter inappropriate material.

Knowledge sources can include public websites and internal SharePoint sites, enabling organizations to aggregate news, monitor sentiment, and synthesize insights from multiple sources. The platform reduces manual work by finding and presenting information from internal and external sources as either primary responses or fallback when predefined topics can't address queries.

Best practices and lessons learned

  • Test everything before production. Run unit tests for scraping logic, integration tests for pipeline stages, end-to-end tests for queries, and load tests with realistic data volumes. Systems that work with 1,000 documents may break with 1 million.
  • Simulate failures during development. Build test fixtures for different website structures, content edge cases, database outages, and API failures. Catch problems early, not in production.
  • Automate quality checks with CI. Set up pipelines that test every code change, check for security issues, and deploy to staging automatically. This catches bugs immediately instead of days later.
  • Review code thoroughly. Check scraping logic, error handling, logging, and documentation. Fresh eyes catch what the original developer missed.
  • Document everything clearly. Write down scraping target details, pipeline transformations, deployment steps, and troubleshooting guides. Turn individual knowledge into team knowledge.
  • Plan for incidents. Define severity levels and response times, create playbooks for common problems, and review every incident to prevent repeats. Learning from failures separates good teams from great ones.
  • Monitor growth and scale proactively. Track data volume, query load, and resource usage trends. Add capacity before you run out, not during peak traffic.
  • Review performance regularly. Schedule periodic checks of system speed, costs, data quality, and user feedback. Focus improvements where they matter most.
  • Share knowledge across the team. Document architecture decisions, share incident lessons, rotate on-call duties, and hold knowledge transfer sessions. Your system is only as reliable as your team.

Conclusion and next steps

Building production-ready RAG applications is about designing for resilience, scale, and continuous freshness from day one. By integrating web scraping, you transform your application into a real-time intelligence engine that adapts at the speed of the internet.

LlamaIndex provides the backbone for scaling vector storage, retrieval, and optimization, but your data pipeline is only as strong as its weakest scraper. With Decodo’s Web Scraping API handling proxy rotation, dynamic rendering, and anti-bot challenges automatically, you can focus on building and optimizing your RAG architecture rather than firefighting scrapers.

If your next step is moving from prototype to production, the path is clear:

  • Architect for scale from the start
  • Keep your pipelines clean and resilient
  • Optimize retrieval for performance and cost
  • Leverage Decodo to scrape reliably at global scale

Production RAG is no longer about "can it work?" It's about "can it survive?" With the right design choices and Decodo powering your data pipelines, the answer is "yes".

Ready to build production-ready RAG?

Stop worrying about broken scrapers and stale indexes. Decodo gives you reliable, scalable web scraping infrastructure out of the box.

About the author

Zilvinas Tamulis

Technical Copywriter

A technical writer with over 4 years of experience, Žilvinas blends his studies in Multimedia & Computer Design with practical expertise in creating user manuals, guides, and technical documentation. His work includes developing web projects used by hundreds daily, drawing from hands-on experience with JavaScript, PHP, and Python.


Connect with Žilvinas via LinkedIn


Frequently asked questions

What is RAG and how does web scraping improve it?

Retrieval-augmented generation combines LLMs with external knowledge bases to generate accurate, verifiable responses by retrieving relevant context before answering questions. Web scraping transforms RAG from static to dynamic by keeping knowledge bases current with fresh data, enabling your system to answer questions about what's happening right now instead of relying on outdated information.

How do you build a RAG pipeline with LlamaIndex?

Install LlamaIndex, choose a vector database, load your documents, chunk them into 512 tokens with 50 token overlap, generate embeddings, index the chunks, and create a query engine that retrieves relevant content and passes it to an LLM. Add web scraping by scheduling jobs that fetch new content and update your index incrementally.
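In code, that minimal pipeline looks roughly like this (assuming documents in a local ./data folder and an OpenAI key available for the default embeddings and LLM):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter

# Chunk into 512-token pieces with a 50-token overlap
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)  # Embeds and indexes the chunks

query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("What does the documentation say about authentication?"))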

What are the best practices for implementing RAG in production?

Architect for horizontal scalability from day one, implement comprehensive monitoring for query latency and data freshness, build robust error handling, use task queues instead of cron jobs, validate data quality before indexing, cache common queries, test with production-scale data, maintain documentation, and plan for capacity growth proactively.

How do you optimize chunk size for RAG in production?

Start with 512 tokens and 50 tokens overlap as defaults, then test with your specific content to find the right balance between context preservation and retrieval precision. Technical documentation often needs larger chunks (1024 tokens) while news articles work well with smaller chunks (256 tokens), so measure retrieval quality and adjust based on answer quality.

What infrastructure is needed for production RAG systems?

Start with 8 to 16 CPU cores, 32 to 64GB RAM, and SSD storage, with adequate network bandwidth for sustained data ingestion. Choose between managed services like Pinecone for simplicity or self-hosted options like Weaviate for more control, add GPU instances if embedding generation becomes a bottleneck, and design for horizontal scaling.

How do you handle data freshness in RAG applications?

Implement scheduled scraping that matches content update patterns (hourly for news, daily for documentation), use incremental updates to process only changed content, track document modification times, validate freshness, implement version control for rollback capability, and balance real-time versus batch processing based on requirements and costs.
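A simple way to schedule those refreshes against the API built earlier is the third-party schedule package (pip install schedule); the URLs and cadences below are placeholders. Note that the /scrape endpoint above appends rather than replaces, so a production pipeline would also deduplicate or version documents as discussed earlier:

import time
import requests
import schedule

API_URL = "http://localhost:8000"
HEADERS = {"Authorization": "Bearer your-secret-api-key"}

def rescrape(url: str, collection: str):
    """Re-submit a source so the index picks up changed content."""
    requests.post(f"{API_URL}/scrape", json={"url": url, "collection_name": collection}, headers=HEADERS)

# Match the schedule to how often each source actually changes
schedule.every().hour.do(rescrape, "https://example.com/news", "news")
schedule.every().day.at("03:00").do(rescrape, "https://example.com/docs", "documentation")

while True:
    schedule.run_pending()
    time.sleep(60)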

What are the main challenges when implementing RAG in real-world applications?

Data quality is the hidden challenge, as scraped content contains noise that degrades retrieval. Scalability hits harder than expected, with systems failing when document counts jump from thousands to hundreds of thousands. Performance requires continuous tuning of embeddings and chunk sizes, costs escalate quickly at scale, and operational complexity increases significantly.

How do you integrate web scraping data into RAG systems?

Build pipelines that scrape content on schedules, validate quality, clean and normalize text, chunk appropriately, generate embeddings, and update indexes incrementally using task schedulers like Celery. Implement real-time ingestion for time-sensitive sources or batch processing for high-volume sources, add QA checkpoints, monitor pipeline health, and consider using Decodo's API to handle proxy rotation and anti-bot measures.

What are the best practices for web scraping in RAG applications?

Respect rate limits with delays between requests, use residential proxies with automatic rotation, implement exponential backoff for failures, validate scraped content before ingestion, monitor robots.txt, add error handling with dead letter queues, and log activity for debugging. Consider Decodo's residential proxies with 99.95% success rates to eliminate infrastructure complexity while maintaining reliable access.

