How to Build Production-Ready RAG with LlamaIndex and Web Scraping (2025 Guide)
Production RAG fails when it relies on static knowledge that goes stale. This guide shows you how to build RAG systems that scrape live web data, integrate with LlamaIndex, and actually survive production. You'll learn to architect resilient scraping pipelines, optimize vector storage for millions of documents, and deploy systems that deliver real-time intelligence at scale.
Zilvinas Tamulis
Oct 24, 2025
16 min read

Why traditional RAG systems struggle in production
Let's cut through the hype. RAG isn't magic.
A RAG system combines two things: a search system that finds relevant information and an LLM that uses that information to answer questions. When someone asks a question, your system searches your knowledge base, pulls the most relevant chunks, and feeds them to an LLM. The LLM generates an answer based on that context.
Sounds simple, right? Here's the catch: your answers are only as good as your data. In other words, stale data means stale answers. LLMs trained on frozen datasets can't tell you about product launches from last week, breaking news from this morning, or market shifts happening right now.
This is where web scraping makes a significant difference.
Web scraping turns RAG from a static research assistant into a real-time intelligence engine. By continuously feeding scraped content into your pipeline, you ensure answers stay accurate, relevant, and trustworthy.
This approach beats using an LLM alone because:
- You get accurate answers grounded in real data
- You can cite sources for every claim
- You can update knowledge without retraining models
- You control what information the LLM has access to
When building a RAG system for production, one of the best tools to use is LlamaIndex. LlamaIndex handles all the annoying infrastructure work for you. It manages document loading, text chunking, embedding generation, vector storage, and query processing. You don't need to build these components from scratch or figure out how they fit together.
LlamaIndex supports major vector databases like Pinecone, Weaviate, Chroma, and Qdrant. It integrates with OpenAI, Anthropic, and local models, while including smart chunking tools that actually understand document structure.
The framework is designed to scale from prototypes to production without major rewrites. You can start with local storage during development, then switch to a production vector database with minimal code changes.
Planning your production RAG architecture
Want to know why most RAG systems crash at scale? They weren't architected for it from the start. Retrofitting production requirements onto prototypes costs 10 times more than designing correctly from the start. To avoid that pitfall, let’s plan our system right.
System design essentials
Your RAG architecture needs four core components working together.
- Data ingestion pipelines. Design pipelines that handle both real-time streams and batch processes. Each stage should fail independently, so one broken scraper shouldn't take down your entire system.
- Vector storage. Plan for index sizes 10x your initial estimates, as they always grow faster than expected. Choose between managed services (easier to maintain) or self-hosted (more control, more work).
- Query processing. Map your entire workflow: embedding generation, similarity search, reranking, context assembly, LLM call, response streaming. Each step adds latency, so profile them individually to know where to optimize.
- Monitoring. Instrument from the start with metrics for pipeline health, data freshness, retrieval quality, query latency, and cost per query. You can do this by building dashboards that catch problems before users notice them.
Web scraping infrastructure
Your scraping systems need to be more resilient than your retrieval layer, because if scraping breaks, your entire knowledge base goes stale.
Select appropriate proxy types for each target and implement rotation logic proactively. Respect rate limits, robots.txt, and back off exponentially when hitting 429 errors. Patient scraping runs indefinitely while aggressive scraping gets blocked permanently.
Validate content at the time of scraping and implement circuit breakers with dead letter queues for enhanced reliability.
Scalability planning
Design for horizontal scaling from the start using stateless workers. Partition scraping targets by domain, distribute query processing with load balancing, and shard indexing by content type or timestamp.
Track costs per component and optimize your most expensive operations first, typically embedding generation or LLM calls.
Now that you've planned your architecture, let's set up a development environment that supports it.
Setting up your environment
Let's build this thing. Start with a clean, reproducible setup.
First, ensure you have Python 3.9 or higher. Next, create a virtual environment and install the core packages:
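Here's one way to do it; the package list matches what the rest of this guide uses:

```bash
# Create and activate an isolated environment
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate

# Core packages for scraping, indexing, and serving
pip install llama-index chromadb fastapi "uvicorn[standard]" requests python-dotenv beautifulsoup4
```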
Install Ollama (for the local LLM):
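On Linux, the official install script works; on macOS and Windows, grab the installer from ollama.com/download. Pulling a model (llama3 here is just an example) gets you ready to serve locally:

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3
```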
Organize your project so code, configs, and data are clearly separated:
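A layout along these lines works well; the file names are illustrative, not required by any tool:

```
rag-project/
├── app/
│   ├── main.py          # FastAPI endpoints
│   ├── scraper.py       # Decodo scraping client
│   ├── rag_system.py    # LlamaIndex + ChromaDB orchestration
│   └── vector_store.py  # Index and embedding setup
├── data/
│   └── chroma/          # ChromaDB persistence directory
├── .env                 # Credentials and settings (never commit this)
└── requirements.txt
```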
Create your .env file with these settings:
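The variable names below are the ones assumed by the code sketches in this guide; adjust them to whatever your configuration code actually reads:

```bash
# Decodo Web Scraping API credentials
DECODO_API_KEY=your-decodo-api-key

# Optional: OpenAI key enables full RAG mode (leave empty for R+A mode)
OPENAI_API_KEY=

# Bearer token protecting your own API endpoints
API_BEARER_TOKEN=change-me

# Storage and embedding settings
CHROMA_PERSIST_DIR=./data/chroma
EMBEDDING_MODEL=local  # "local" for HuggingFace, "openai" for OpenAI
```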
Create a requirements.txt that looks like this:
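A minimal set that matches the sketches in this guide; pin exact versions once your setup stabilizes so builds stay reproducible:

```
llama-index
llama-index-vector-stores-chroma
llama-index-embeddings-huggingface
chromadb
fastapi
uvicorn[standard]
requests
python-dotenv
beautifulsoup4
```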
Building the web scraping data pipeline
Your RAG system is only as good as the data feeding it. Let's build a scraper that doesn't break.
Production scrapers need three things: proxy rotation to avoid bans, proper error handling for when things go wrong, and validation to catch bad data before it enters your system.
Here's a production-grade scraper using Decodo's Web Scraping API:
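The version below is a condensed sketch of such a client. The endpoint URL and payload fields are assumptions for illustration; check Decodo's API documentation for the exact request format:

```python
# scraper.py - a minimal scraping client sketch.
# NOTE: SCRAPER_ENDPOINT and the payload shape are assumptions; consult
# Decodo's Web Scraping API docs for the exact values.
import logging
import os
from typing import Optional

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv

load_dotenv()
logger = logging.getLogger(__name__)

SCRAPER_ENDPOINT = "https://scraper-api.decodo.com/v2/scrape"  # assumed URL


class WebScraper:
    def __init__(self) -> None:
        api_key = os.getenv("DECODO_API_KEY")
        if not api_key:
            raise ValueError("DECODO_API_KEY is not set in .env")
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Basic {api_key}"

    def scrape(self, url: str, timeout: int = 60) -> str:
        """Fetch a URL through the scraping API and return clean text."""
        try:
            response = self.session.post(
                SCRAPER_ENDPOINT,
                json={"url": url, "headless": "html"},  # assumed payload shape
                timeout=timeout,
            )
            response.raise_for_status()
        except requests.RequestException as exc:
            logger.error("Scrape failed for %s: %s", url, exc)
            raise
        # Strip scripts, styles, and navigation so only readable text remains
        soup = BeautifulSoup(response.text, "html.parser")
        for tag in soup(["script", "style", "nav", "footer"]):
            tag.decompose()
        return soup.get_text(separator=" ", strip=True)

    def test_connection(self) -> bool:
        """Verify credentials and connectivity against a known-good page."""
        try:
            return len(self.scrape("https://example.com")) > 0
        except Exception:
            return False


_scraper: Optional[WebScraper] = None


def get_scraper() -> WebScraper:
    """Return a shared instance instead of reconnecting on every request."""
    global _scraper
    if _scraper is None:
        _scraper = WebScraper()
    return _scraper
```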
This scraper automatically loads your Decodo credentials from the .env file and uses them for all requests. It handles JavaScript rendering, proxy rotation, and anti-bot measures through Decodo's Web Scraping API. You just pass in a URL and get clean text back.
The scraper includes proper error handling, logging, and a test_connection() method to verify everything works before you start scraping real data. The get_scraper() function provides a global instance, so you don't need to create new connections for every request.
Raw scraped data needs cleaning before it enters your vector database. Start by stripping out HTML tags, scripts, ads, and irrelevant sections using Beautiful Soup or regex, then fix spacing and encoding issues.
Once cleaned, break large texts into chunks of 500 to 1000 tokens, a size small enough for LLM context windows but large enough to preserve meaning. Enrich each chunk with metadata like source URL, timestamp, and section title so you can filter results and provide citations later. Finally, remove duplicate content by either keeping the newest version or maintaining multiple versions tagged by source.
This preprocessing step is critical because clean data during ingestion saves you weeks of debugging poor retrieval quality caused by garbage in your index.
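Here's a condensed sketch of that preprocessing flow, using LlamaIndex's SentenceSplitter for the chunking step (the sizes follow the guidance above; the helper itself is illustrative):

```python
import hashlib
from datetime import datetime, timezone

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter


def preprocess(text: str, url: str, seen_hashes: set) -> list:
    """Clean, deduplicate, chunk, and enrich scraped text with metadata."""
    # Collapse the whitespace left over from HTML extraction
    text = " ".join(text.split())

    # Drop exact duplicates across scrapes
    digest = hashlib.sha256(text.encode()).hexdigest()
    if digest in seen_hashes:
        return []
    seen_hashes.add(digest)

    # Attach metadata so results can be filtered and cited later
    doc = Document(
        text=text,
        metadata={
            "source_url": url,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
        },
    )

    # Split on sentence boundaries into ~512-token chunks with overlap
    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
    return splitter.get_nodes_from_documents([doc])
```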
Tired of managing proxy pools and fighting CAPTCHAs?
Decodo's residential proxies deliver 99.95% success rates across 115M+ IPs in 195+ locations with <0.6s response times. Your scraping pipeline remains resilient while you focus on RAG architecture.
Integrating web-scraped data with LlamaIndex
Now comes the good part. Feeding your clean scraped data into LlamaIndex and querying it.
A production RAG loop has three parts: ingesting fresh web data, indexing it efficiently, and querying with precision. This implementation uses FastAPI for endpoints, ChromaDB for vector storage, and LlamaIndex for all the AI heavy lifting.
Data ingestion workflows
How often you update your data should match how often your sources change. Real-time ingestion reduces the delay between when information is published and when users can find it in your system. This matters most for time-sensitive content like news feeds and market data, where minutes can make a difference.
First, your RAG system needs to be able to handle scraping through background tasks, so API requests return immediately while scraping happens asynchronously. This prevents timeout issues when scraping large sites:
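A minimal version of the pattern looks like this; the in-memory task registry is a stand-in for Redis or a database in production:

```python
# ingestion.py - background-task helpers so HTTP requests return instantly
import uuid
from typing import Callable

from dotenv import load_dotenv
from fastapi import BackgroundTasks

load_dotenv()  # settings come from .env, so deployments differ only in env files

TASKS: dict = {}  # task_id -> status


def run_task(task_id: str, work: Callable[[], None]) -> None:
    """Executed by FastAPI's background runner after the response is sent."""
    TASKS[task_id] = "running"
    try:
        work()
        TASKS[task_id] = "done"
    except Exception as exc:
        TASKS[task_id] = f"failed: {exc}"


def schedule(background_tasks: BackgroundTasks, work: Callable[[], None]) -> str:
    """Register work and return a task ID without waiting for it to finish."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = "queued"
    background_tasks.add_task(run_task, task_id, work)
    return task_id
```

An endpoint can then call schedule(background_tasks, lambda: ...) and return the task ID straight away.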
The configuration loads from environment variables, making it easy to deploy across different environments without requiring code changes. You can add your API key to .env.
Vector storage and indexing
LlamaIndex integrates with ChromaDB to handle vector storage with automatic persistence. The system lets LlamaIndex manage chunking, embeddings, and indexing.
There are several reasons why you should use this approach:
- It has flexible embedding models. You can choose between OpenAI embeddings (high quality, paid) and HuggingFace embeddings (free, local), and configure them through environment variables without code changes.
- There’s smart chunking. LlamaIndex's SentenceSplitter respects sentence boundaries while hitting your target chunk size. The default is 1024 tokens with a 200-token overlap for optimal context preservation.
- It has persistent storage. ChromaDB automatically saves to disk, so your index survives restarts without manual saves.
- It offers collection-based organization. Each scraped site or project gets its own collection, so you can query documentation separately from blog posts and product data separately from support tickets.
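Here's a sketch of that wiring, with the embedding model selected via environment variables (the HuggingFace model name and the helper's shape are illustrative assumptions):

```python
# vector_store.py - LlamaIndex + ChromaDB wiring (a sketch)
import os

import chromadb
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.chroma import ChromaVectorStore

# Pick the embedding model from the environment; no code changes per deploy
if os.getenv("EMBEDDING_MODEL", "local") == "local":
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# otherwise LlamaIndex falls back to OpenAI embeddings via OPENAI_API_KEY


def build_index(collection_name: str) -> VectorStoreIndex:
    """Create or reopen a persistent, collection-scoped vector index."""
    client = chromadb.PersistentClient(
        path=os.getenv("CHROMA_PERSIST_DIR", "./data/chroma")
    )
    collection = client.get_or_create_collection(collection_name)
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # LlamaIndex owns chunking and embedding; SentenceSplitter defaults
    # to 1024-token chunks with a 200-token overlap
    return VectorStoreIndex.from_documents(
        [],  # start empty; documents are inserted as they're scraped
        storage_context=storage_context,
        transformations=[SentenceSplitter()],
    )
```

Because the Chroma client is persistent, the index survives restarts without explicit save calls.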
Now we tie it all together. Scraping, indexing, and querying in one system:
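A condensed version of that system might look like this, reusing the scraper and vector store sketches above (module and field names are assumptions carried over from those sketches):

```python
# rag_system.py - orchestration layer (a sketch matching the behavior below)
import os
from datetime import datetime, timezone

from llama_index.core import Document

from scraper import get_scraper        # sketch from the scraping section
from vector_store import build_index   # sketch from above


class LlamaIndexRAGSystem:
    def __init__(self, collection: str = "default") -> None:
        self.index = build_index(collection)
        self.llm_enabled = bool(os.getenv("OPENAI_API_KEY"))

    def scrape_and_store(self, url: str, min_length: int = 200) -> None:
        """Fetch a page, validate it, and index it with citation metadata."""
        text = get_scraper().scrape(url)
        if len(text) < min_length:
            raise ValueError(f"Content too short to index: {url}")
        self.index.insert(Document(text=text, metadata={
            "source_url": url,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
        }))

    def query(self, question: str, top_k: int = 3) -> dict:
        """Full RAG when an LLM is configured; R+A assembly otherwise."""
        nodes = self.index.as_retriever(similarity_top_k=top_k).retrieve(question)
        sources = [{
            "text": n.node.get_text()[:200],   # truncated for display
            "relevance_score": n.score,        # similarity, 0 to 1
            "url": n.node.metadata.get("source_url"),
        } for n in nodes]

        if self.llm_enabled:
            try:
                engine = self.index.as_query_engine(similarity_top_k=top_k)
                answer = str(engine.query(question))
                return {"mode": "LlamaIndex RAG", "answer": answer, "sources": sources}
            except Exception:
                pass  # graceful fallback to R+A mode below

        context = "\n\n".join(n.node.get_text() for n in nodes)
        return {"mode": "R+A", "answer": context, "sources": sources}
```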
The LlamaIndexRAGSystem orchestrates the complete RAG pipeline, initializing the vector database and optionally configuring an OpenAI LLM for answer generation. It operates in two modes: full RAG mode (when an LLM is available) generates natural language answers from retrieved context, while R+A mode (fallback) simply assembles the retrieved context without generation.
Each search result includes a relevance score between 0 and 1, and source documents are truncated to 200 characters for display. The system includes a scrape_and_store method that fetches web content, validates it meets minimum length requirements, and indexes it with metadata like URL and timestamp. Most errors propagate up via exceptions, with the main graceful fallback occurring when LLM generation fails.
Now, it’s time to create the FastAPI endpoints to expose your RAG system:
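A sketch of those endpoints, building on the pieces above (route paths and field names are illustrative, and the file inlines the background-task pattern from the ingestion section so it stays self-contained):

```python
# main.py - FastAPI layer over the RAG system (a sketch)
import os
import uuid

from dotenv import load_dotenv
from fastapi import BackgroundTasks, Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from pydantic import BaseModel

from rag_system import LlamaIndexRAGSystem  # sketch from above

load_dotenv()
app = FastAPI(title="RAG API")
rag = LlamaIndexRAGSystem()
bearer = HTTPBearer()
TASKS: dict = {}  # task_id -> status; use Redis or a database in production


def check_token(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> None:
    """Bearer-token gate for every protected endpoint."""
    if creds.credentials != os.getenv("API_BEARER_TOKEN"):
        raise HTTPException(status_code=401, detail="Invalid token")


class ScrapeRequest(BaseModel):
    url: str


class QueryRequest(BaseModel):
    question: str
    top_k: int = 3


def run_scrape(task_id: str, url: str) -> None:
    try:
        rag.scrape_and_store(url)
        TASKS[task_id] = "done"
    except Exception as exc:
        TASKS[task_id] = f"failed: {exc}"


@app.post("/scrape", dependencies=[Depends(check_token)])
def scrape(req: ScrapeRequest, background_tasks: BackgroundTasks):
    """Return a task ID immediately; scraping continues in the background."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = "queued"
    background_tasks.add_task(run_scrape, task_id, req.url)
    return {"task_id": task_id, "status": "queued"}


@app.post("/query", dependencies=[Depends(check_token)])
def query(req: QueryRequest):
    return rag.query(req.question, top_k=req.top_k)


@app.get("/health")
def health():
    return {
        "status": "ok",
        "mode": "LlamaIndex RAG" if rag.llm_enabled else "R+A",
        "scraper": "available",
    }
```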
With this setup, scraping runs in the background so your API requests return immediately with a task ID instead of making you wait for large sites to finish. Bearer token authentication also keeps your API secure by blocking unauthorized users.
The system includes a health check endpoint that monitors both the RAG system and scraper status, making it easy to see if everything is running properly. All responses use structured formats through Pydantic models, ensuring consistency and automatic data validation.
Running your RAG API
Finally, you can start the server:
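Assuming the main.py sketch above:

```bash
uvicorn main:app --host 0.0.0.0 --port 8000
```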
The API runs on http://localhost:8000. You'll see LlamaIndex initialize, ChromaDB connect, and your embedding model load. Once you see "LlamaIndex RAG system started successfully," you're ready to roll.
Hit the API directly from your terminal:
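With the illustrative routes and token from the sketches above:

```bash
# Queue a scrape job; the response carries a task ID
curl -X POST http://localhost:8000/scrape \
  -H "Authorization: Bearer change-me" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/docs"}'

# Ask a question against the index
curl -X POST http://localhost:8000/query \
  -H "Authorization: Bearer change-me" \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I reset my device?"}'
```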
The /scrape endpoint returns a task ID instantly. Your scraping runs in the background while your API stays responsive. No timeouts, no blocked requests.
Integrate with your Python applications:
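For example, with the requests library (endpoint and field names follow the sketches above):

```python
import requests

API = "http://localhost:8000"
HEADERS = {"Authorization": "Bearer change-me"}  # token from your .env

# Queue a scrape and capture the task ID
task = requests.post(f"{API}/scrape", headers=HEADERS,
                     json={"url": "https://example.com/docs"}).json()
print("Queued task:", task["task_id"])

# Query once the content has been indexed
result = requests.post(f"{API}/query", headers=HEADERS,
                       json={"question": "How do I reset my device?"}).json()
print(result["answer"])
for source in result["sources"]:
    print(f'- {source["url"]} (relevance: {source["relevance_score"]:.2f})')
```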
The query responses include everything you need:
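An illustrative response shape (the values here are made up):

```json
{
  "mode": "LlamaIndex RAG",
  "answer": "To reset the device, hold the power button for ten seconds...",
  "sources": [
    {
      "text": "Press and hold the power button for ten seconds until the LED blinks...",
      "relevance_score": 0.87,
      "url": "https://example.com/docs/reset"
    }
  ]
}
```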
The "mode" field tells you whether you're running full RAG with LLM-generated answers or R+A mode with assembled context. The “relevance_score” shows how well each source matches your query.
The system uses asynchronous operations throughout, which means it can handle multiple requests simultaneously without any of them blocking each other. Long-running scraping jobs run in the background, so users receive immediate responses instead of waiting.
ChromaDB automatically saves all data to disk, ensuring your data remains safe even if the server restarts. Each project gets its own collection to keep data separate and searches focused. This production-ready system combines reliable scraping with a clean API interface, and FastAPI automatically generates documentation at http://localhost:8000/docs so you can easily test and connect it to other applications.
With the API running, you can easily scrape and organize multiple sources into separate collections:
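For instance, if the /scrape endpoint is extended with a collection field (a small addition to the sketch above), each source lands in its own ChromaDB collection:

```bash
curl -X POST http://localhost:8000/scrape \
  -H "Authorization: Bearer change-me" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/docs", "collection": "documentation"}'

curl -X POST http://localhost:8000/scrape \
  -H "Authorization: Bearer change-me" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/blog", "collection": "blog"}'
```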
To enable full RAG mode with LLM-generated answers, update your .env file:
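With the variable names assumed earlier, that means providing an LLM key (the model name here is just an example):

```bash
OPENAI_API_KEY=sk-your-key-here
LLM_MODEL=gpt-4o-mini
```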
Restart your RAG API. The health check endpoint should now show "mode": "LlamaIndex RAG", and the services status will confirm that the LLM is enabled. In RAG mode, the LLM reads the context and generates a natural language answer. If LLM generation fails for any reason (API error, timeout, rate limit), the system automatically falls back to R+A mode.
You have three main options for choosing an LLM to power your RAG system.
- The simplest approach is running Ollama locally on your own hardware, which eliminates API costs and keeps everything private, making it ideal for development and moderate production workloads.
- If you need faster responses and higher quality answers, cloud APIs like OpenAI or Anthropic work well for high-volume production, though you'll pay for each query.
- For maximum control, self-hosted options like vLLM or TGI let you run larger models on your own GPU servers with higher upfront costs but unlimited queries afterward.
Choose based on your budget, query volume, and quality requirements, and remember that your code structure remains the same across providers, so switching later is straightforward.
Production deployment considerations
Deployment reveals whether your planning was solid or full of shortcuts. There are no second chances in production.
Infrastructure needs
Production RAG systems handling 100 to 1000 daily queries need 8 to 16 CPU cores, 32 to 64GB RAM, and SSD storage. Add a GPU if you're generating embeddings at high volume. When you need more capacity, scale horizontally by adding workers rather than upgrading single machines for better fault tolerance.
Your database setup is equally critical. Use connection pooling to reuse connections efficiently, set up read replicas to distribute query load, and configure automated backups with point-in-time recovery. When queries slow down, rebuild your indexes before throwing hardware at the problem.
Security comes next. Protect your network with TLS encryption, API authentication, rate limiting, CORS policies, and load balancers. For scraping infrastructure, deploy residential proxies with credential rotation and fallback strategies. Decodo's residential proxies achieve 99.95% success rates with sub-second response times across 115M+ IPs in 195+ locations, handling the complexity for you.
Finally, prepare for failures. Automate vector database backups, keep configurations in version control, save production snapshots for rollback, and test your recovery procedures before you need them.
Monitoring and observability
You can't fix what you can't see. Start by tracking query latency at the 50th, 95th, and 99th percentiles, along with embedding generation time and throughput. These metrics help you catch bottlenecks before users notice.
Monitor data quality continuously by checking content staleness, retrieval accuracy, and answer relevance. Set up alerts that warn you when scraped data becomes outdated or quality drops.
Beyond system metrics, watch how users interact with your RAG system. Log errors, scraping failures, API rate limits, and performance issues; also track user satisfaction scores, query patterns, response times, and feedback to understand real-world performance.
Security and compliance
Start with the basics by encrypting data at rest and in transit. Implement role-based access control to limit who can access data, API endpoints, and admin functions. Keep detailed audit logs of system access, data modifications, and configuration changes for security reviews and incident investigations.
For regulatory compliance, implement data retention policies, user data deletion workflows, and consent management. Document your data sources and processing activities to meet GDPR, CCPA, or other relevant regulations.
Troubleshooting common issues
Even well-built systems fail. What separates production-ready RAG from prototypes is how gracefully your system handles failures and how quickly you can debug problems.
Pipeline failures
Your scraping infrastructure faces constant challenges as target websites change structure, rate limits trigger unexpectedly, proxies fail, and network issues interrupt requests. Handle these by:
- Categorizing errors to distinguish transient from permanent failures
- Implementing retry logic with exponential backoff
- Deploying circuit breakers that pause scraping when error rates spike
- Maintaining dead letter queues for URLs requiring manual review
When scraping fails consistently, check if targets changed their HTML structure, updated anti-bot measures, implemented new rate limits, or blocked your IPs. Monitor validation failure rates to catch encoding issues and broken content extraction before they corrupt your index.
Beyond scraping, vector databases become unavailable, LLM APIs rate limit requests, and authentication tokens expire. Implement health checks for all dependencies, automatic retry with backoff, fallback to cached data, and alerts when failures persist.
Performance problems
Common bottlenecks include vector database query latency degrading with index size, embedding generation limiting throughput, memory exhaustion from large documents, and network bandwidth constraints. Profile under realistic load before deploying.
Address memory issues by streaming large documents, configuring cache limits with LRU eviction, and restarting workers periodically. Optimize slow queries by reducing retrieved chunks, caching results, using faster embedding models, and streaming responses.
Monitoring and maintenance
Implement synthetic queries that verify functionality, check database responsiveness, and monitor data freshness. Deploy to staging first, use blue-green deployments for zero-downtime updates, and maintain rollback procedures. Track retrieval relevance scores, query success rates, and user feedback over time, catching problems during development instead of production.
Real-world RAG implementation examples
Customer support knowledge base
An AI assistant can scrape your product documentation, FAQ pages, and ticket logs to provide on-demand answers. For example, if a customer asks a support bot “How do I reset my device?”, the system retrieves the relevant manual section (scraped from your docs site) and answers precisely. Any changes to the documents, such as notes for a new firmware release, are scraped and indexed immediately.
ServiceNow's Now Assist in AI Search demonstrates this approach in production by using RAG to retrieve relevant knowledge articles and generate actionable Q&A cards for customer support queries. The system retrieves articles from the customer's knowledge base, augments queries with context from top-ranked content, and generates answers that cite sources, delivering "answers instead of links" even for features launched as recently as the previous week.
Market research automation
You can also build a RAG system that constantly scrapes news sites, competitor blogs, and social media feeds for mentions of your industry. Analysts can query it for trends or competitor moves. For instance, “What did Company X announce this week?” triggers a search over the latest scraped press releases and news articles, yielding a summary.
AlphaSense uses RAG technology with over 500 million premium business documents to deliver competitive intelligence at scale. Its Generative Search feature interprets natural language queries like an analyst and provides cited responses to minimize hallucinations.
The platform automates competitive benchmarking, tracks pricing and product changes, and monitors market trends in real-time, condensing hours of research into seconds. With hundreds of new expert interviews added weekly, it enables firms to spot demand shifts, supply chain disruptions, and emerging opportunities as they happen.
Content & insights pipeline
RAG systems let you aggregate and analyze public data (e.g., social media posts, review sites, forums). A RAG model can answer questions like “What are common complaints about Product Y?” by retrieving scraped user reviews and summarizing sentiment. Live monitoring (scraping Twitter or Reddit) lets the system alert you to shifts in public opinion.
Microsoft Copilot Studio uses RAG to monitor public websites and generate conversational responses by retrieving relevant content from specified domains and providing cited summaries. The system performs grounding checks, provenance validation, and semantic similarity analysis on retrieved content while applying content moderation to filter inappropriate material.
Knowledge sources can include public websites and internal SharePoint sites, enabling organizations to aggregate news, monitor sentiment, and synthesize insights from multiple sources. The platform reduces manual work by finding and presenting information from internal and external sources as either primary responses or fallback when predefined topics can't address queries.
Best practices and lessons learned
- Test everything before production. Run unit tests for scraping logic, integration tests for pipeline stages, end-to-end tests for queries, and load tests with realistic data volumes. Systems that work with 1,000 documents may break with 1 million.
- Simulate failures during development. Build test fixtures for different website structures, content edge cases, database outages, and API failures. Catch problems early, not in production.
- Automate quality checks with CI. Set up pipelines that test every code change, check for security issues, and deploy to staging automatically. This catches bugs immediately instead of days later.
- Review code thoroughly. Check scraping logic, error handling, logging, and documentation. Fresh eyes catch what the original developer missed.
- Document everything clearly. Write down scraping target details, pipeline transformations, deployment steps, and troubleshooting guides. Turn individual knowledge into team knowledge.
- Plan for incidents. Define severity levels and response times, create playbooks for common problems, and review every incident to prevent repeats. Learning from failures separates good teams from great ones.
- Monitor growth and scale proactively. Track data volume, query load, and resource usage trends. Add capacity before you run out, not during peak traffic.
- Review performance regularly. Schedule periodic checks of system speed, costs, data quality, and user feedback. Focus improvements where they matter most.
- Share knowledge across the team. Document architecture decisions, share incident lessons, rotate on-call duties, and hold knowledge transfer sessions. Your system is only as reliable as your team.
Conclusion and next steps
Building production-ready RAG applications is about designing for resilience, scale, and continuous freshness from day one. By integrating web scraping, you transform your application into a real-time intelligence engine that adapts at the speed of the internet.
LlamaIndex provides the backbone for scaling vector storage, retrieval, and optimization, but your data pipeline is only as strong as its weakest scraper. With Decodo’s Web Scraping API handling proxy rotation, dynamic rendering, and anti-bot challenges automatically, you can focus on building and optimizing your RAG architecture rather than firefighting scrapers.
If your next step is moving from prototype to production, the path is clear:
- Architect for scale from the start
- Keep your pipelines clean and resilient
- Optimize retrieval for performance and cost
- Leverage Decodo to scrape reliably at global scale
Production RAG is no longer about "can it work?" but "can it survive?" With the right design choices and Decodo powering your data pipelines, the answer is "yes."
Ready to build production-ready RAG?
Stop worrying about broken scrapers and stale indexes. Decodo gives you reliable, scalable web scraping infrastructure out of the box.
About the author

Zilvinas Tamulis
Technical Copywriter
A technical writer with over 4 years of experience, Žilvinas blends his studies in Multimedia & Computer Design with practical expertise in creating user manuals, guides, and technical documentation. His work includes developing web projects used by hundreds daily, drawing from hands-on experience with JavaScript, PHP, and Python.
Connect with Žilvinas via LinkedIn

