How to Save Your Scraped Data

Web scraping without proper data storage wastes your time and effort. You spend hours gathering valuable information, only to lose it when your terminal closes or your script crashes. This guide will teach you multiple storage methods, from CSV files to databases, with practical examples you can implement immediately to keep your data safe.

Dominykas Niaura

Aug 29, 2025

10 min read

TL;DR

Save scraped data to CSV with pandas.to_csv(). Use to_excel() for Excel files (requires openpyxl). Save to JSON with json.dump() for nested data. Store data in lists as you scrape, then convert to DataFrame. For databases, use sqlite3 for local storage or MongoDB for flexible schemas. Always save incrementally during long scraping sessions.

Why saving scraped data matters

When you run a Python scraping script, all collected data exists only in your computer's memory. Close the terminal or stop the script, and everything disappears. This becomes problematic when scraping large datasets that take hours to collect.

Proper data storage also enables you to resume scraping from where you left off after interruptions, analyze data across multiple scraping sessions, share results with team members or stakeholders, create backups to prevent data loss, and build automated workflows that process saved data.
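For example, one simple way to make scraping resumable is to keep a small progress file of finished URLs and skip them on restart. A minimal sketch using only the standard library (the file name progress.csv and the helper functions are illustrative, not part of any particular library):

```python
import csv
import os

def load_completed_urls(progress_file):
    """Return the set of URLs already saved, so a restarted run can skip them."""
    if not os.path.exists(progress_file):
        return set()
    with open(progress_file, newline='', encoding='utf-8') as f:
        return {row['url'] for row in csv.DictReader(f)}

def record_url(progress_file, url):
    """Append one finished URL to the progress file."""
    file_exists = os.path.exists(progress_file)
    with open(progress_file, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['url'])
        if not file_exists:
            writer.writeheader()
        writer.writerow({'url': url})

# Usage: skip anything already completed in a previous session
done = load_completed_urls('progress.csv')
for url in ['https://example.com/page1', 'https://example.com/page2']:
    if url in done:
        continue
    # ... scrape the page here ...
    record_url('progress.csv', url)
```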

Setting up your Python environment

Before diving into data storage, make sure you have Python installed and a way to run your code. You'll need either an IDE like PyCharm or VS Code, or another method to access your system's terminal. If you're new to running Python scripts from the terminal, check out our complete guide to running Python code in the terminal for step-by-step instructions.

Installing Python

  • Windows. Download Python from the official website and run the installer. Check "Add Python to PATH" during installation to enable command-line access.
  • macOS. Python comes pre-installed, but it's often an older version. Install the latest version using Homebrew (brew install python) or download from their official website.
  • Linux. Most distributions include Python by default. Update with your package manager if needed (sudo apt update && sudo apt install python3 on Ubuntu/Debian).

Verifying your installation

Open your terminal and run python --version or python3 --version. You should see output showing your Python version number.

Installing required libraries

Once Python is ready, install the libraries needed for data storage (sqlite3 ships with Python's standard library, so it doesn't need to be installed separately):

pip install pandas openpyxl pymongo

Each library serves specific storage needs:

  • Pandas – Handles data manipulation and exports to various formats
  • openpyxl – Works with Excel files (.xlsx format)
  • sqlite3 – Manages local SQL databases (included with Python's standard library)
  • pymongo – Connects to MongoDB databases

How to save scraped data to JSON files

JSON (JavaScript Object Notation) files are perfect for storing structured data with nested elements. They preserve data types and work well with APIs and web applications.

Basic JSON saving

The following example demonstrates how to save scraped data as JSON using Python's built-in json module. This approach preserves nested data structures and metadata better than CSV files, making it ideal for complex scraped content.

import json
import requests
from bs4 import BeautifulSoup
from datetime import datetime

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    # JSON handles nested data well
    book_data = {
        'name': name,
        'price': price,
        'metadata': {
            'scraped_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
            'source_url': url
        }
    }
    scraped_data.append(book_data)

# Save to JSON file
with open('books.json', 'w', encoding='utf-8') as file:
    json.dump(scraped_data, file, indent=2, ensure_ascii=False)

print("Data saved to books.json")

JSON with Pandas

Pandas also supports JSON export with different structure options compared to the built-in json module:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    scraped_data.append({'name': name, 'price': price})

# Convert to DataFrame
df = pd.DataFrame(scraped_data)

# Different orient options for various JSON structures
df.to_json('books_records.json', orient='records', indent=2)  # Array of objects
df.to_json('books_index.json', orient='index', indent=2)      # Indexed objects
df.to_json('books_values.json', orient='values', indent=2)    # Array of arrays

print("Saved JSON files with different structures:")
print("- books_records.json: Array of objects format")
print("- books_index.json: Indexed objects format")
print("- books_values.json: Array of arrays format")

The orient='records' parameter creates a clean array of objects format that's easy to read and process later. Other orient options include 'index' for numbered objects and 'values' for nested arrays, giving you flexibility in how your JSON data is structured.
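To confirm the round trip, pd.read_json can load the file back; just match the orient used when saving. A quick sketch with a made-up sample row:

```python
import pandas as pd

# Hypothetical row standing in for scraped data
df = pd.DataFrame([{'name': 'A Light in the Attic', 'price': '£51.77'}])
df.to_json('books_records.json', orient='records', indent=2)

# read_json reverses the export when given the same orient
loaded = pd.read_json('books_records.json', orient='records')
print(loaded.iloc[0]['name'])
```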

How to save scraped data to CSV in Python

CSV (Comma-Separated Values) files offer the simplest storage solution for scraped data. They're lightweight, readable, and compatible with spreadsheet applications.

CSV saving without Pandas

A dependency-free approach uses Python's built-in csv module. Here, the DictWriter class handles dictionary data automatically, writing headers and rows in the correct format:

import csv
import requests
from bs4 import BeautifulSoup

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    scraped_data.append({'name': name, 'price': price})

# Save to CSV using built-in csv module
with open('books.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['name', 'price'])
    writer.writeheader()
    writer.writerows(scraped_data)

print("Data saved to books.csv")

Basic CSV saving with Pandas

The most common approach to saving scraped data uses Pandas to convert your collected information into a structured DataFrame, then export it as a CSV file. This method handles data organization automatically and works well for most scraping projects.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    scraped_data.append({'name': name, 'price': price})

# Save scraped data to CSV
df = pd.DataFrame(scraped_data)
df.to_csv('books.csv', index=False)
print("Data saved to books.csv")

This approach converts your scraped data into a Pandas DataFrame, then exports it as a CSV file. The index=False parameter prevents Pandas from adding row numbers to your file.

Handling special characters in CSV

When scraping international websites, you might encounter special characters. Always specify UTF-8 encoding to prevent data corruption:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    scraped_data.append({'name': name, 'price': price})

# After creating your DataFrame from scraped data
df = pd.DataFrame(scraped_data)
df.to_csv('books.csv', index=False, encoding='utf-8')
print(f"Saved {len(scraped_data)} books to books.csv with UTF-8 encoding")

How to save scraped data to Excel files

Excel files provide better formatting options and support multiple sheets within a single file. This makes them ideal for organizing different types of scraped data.

Save scraped data to Excel with Pandas

The simplest way to create Excel files from scraped data uses Pandas' built-in Excel export functionality. This method automatically handles data formatting and creates a clean spreadsheet ready for analysis. Note that you'll need to install openpyxl first with pip install openpyxl (Pandas uses this library internally for Excel operations).

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    scraped_data.append({'name': name, 'price': price})

# Convert scraped data to DataFrame
df = pd.DataFrame(scraped_data)

# Save to Excel file
df.to_excel('books.xlsx', index=False, sheet_name='Books')
print("Data saved to books.xlsx")

Creating multiple sheets in one Excel file

For complex scraping projects, organize different data types into separate sheets within a single workbook. This approach keeps related data together while maintaining clear separation – for example, basic book information on one sheet, ratings on another, and availability data on a third. The script below shows how to collect different types of data during scraping and organize them into separate sheets:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Sample scraping code for books
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect books data
books_data = []
ratings_data = []
availability_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    books_data.append({'name': name, 'price': price})

    # Extract rating
    rating_class = book.find('p', class_='star-rating')
    rating = rating_class['class'][1] if rating_class else 'No rating'
    ratings_data.append({'book_name': name, 'rating': rating})

    # Extract availability
    availability = book.find('p', class_='instock availability')
    stock_status = availability.text.strip() if availability else 'Unknown'
    availability_data.append({'book_name': name, 'availability': stock_status})

# Convert to DataFrames
books_df = pd.DataFrame(books_data)
ratings_df = pd.DataFrame(ratings_data)
availability_df = pd.DataFrame(availability_data)

# Save to multiple sheets in one Excel file
with pd.ExcelWriter('complete_data.xlsx') as writer:
    books_df.to_excel(writer, sheet_name='Books', index=False)
    ratings_df.to_excel(writer, sheet_name='Ratings', index=False)
    availability_df.to_excel(writer, sheet_name='Availability', index=False)

print("Data saved to complete_data.xlsx with multiple sheets")

Excel formatting and styling

Enhance your Excel files with basic formatting to make your scraped data more readable and professional. Adding headers with bold text and background colors helps distinguish data sections, making reports easier to read for stakeholders or team members.

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    rating_class = book.find('p', class_='star-rating')
    rating = rating_class['class'][1] if rating_class else 'N/A'
    scraped_data.append({'name': name, 'price': price, 'rating': rating})

# Create formatted Excel file
wb = Workbook()
ws = wb.active
ws.title = "Scraped Books"

# Add headers with formatting
headers = ['Book Title', 'Price', 'Rating']
for col, header in enumerate(headers, 1):
    cell = ws.cell(row=1, column=col, value=header)
    cell.font = Font(bold=True)
    cell.fill = PatternFill(start_color="DDDDDD", end_color="DDDDDD", fill_type="solid")

# Add data rows
for row, book in enumerate(scraped_data, 2):
    ws.cell(row=row, column=1, value=book['name'])
    ws.cell(row=row, column=2, value=book['price'])
    ws.cell(row=row, column=3, value=book['rating'])

wb.save('formatted_books.xlsx')
print("Formatted Excel file saved as formatted_books.xlsx")

Proxy considerations for large-scale scraping

When scraping large amounts of data, websites may block your IP address. This interruption can cause data loss if you haven't implemented proper saving mechanisms. Using proxies helps maintain consistent data collection.

At Decodo, we offer residential proxies with a high success rate (99.86%), automatic rotation, a rapid response time (<0.6s), and extensive geo-targeting options (195+ worldwide locations). These features ensure your scraping projects run smoothly without interruptions that could compromise your data collection efforts.

Implementing proxies in your scraping code adds stability:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Configure proxy settings (change protocol as needed: http, https, or socks5)
proxy_url = "http://YOUR_PROXY_USERNAME:YOUR_PROXY_PASSWORD@gate.decodo.com:7000"
proxies = {'http': proxy_url, 'https': proxy_url}

# Sample scraping code with proxy
url = "https://books.toscrape.com/"
response = requests.get(url, proxies=proxies)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    scraped_data.append({'name': name, 'price': price})

# Save scraped data to CSV
df = pd.DataFrame(scraped_data)
df.to_csv('books_with_proxy.csv', index=False)
print("Data scraped using proxy and saved to books_with_proxy.csv")

Empower your web scraper with proxies

Claim your 3-day free trial of residential proxies and access any website with full features.

How to save scraped data to databases

Databases are structured storage systems that organize data in tables with rows and columns, similar to advanced spreadsheets but with powerful querying capabilities. Unlike simple files, databases let you search, filter, and combine data efficiently using SQL commands. They offer more sophisticated storage solutions for large datasets or when you need complex queries and relationships between data points.

SQLite database storage

SQLite provides a lightweight, serverless database perfect for local projects. It requires no installation or configuration beyond Python's built-in sqlite3 module, making it ideal for personal scraping projects or when you need SQL querying capabilities without database server complexity.

import sqlite3
import requests
from bs4 import BeautifulSoup

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    rating_class = book.find('p', class_='star-rating')
    rating = rating_class['class'][1] if rating_class else None
    scraped_data.append({'name': name, 'price': price, 'rating': rating})

# Create database connection
conn = sqlite3.connect('scraped_data.db')

# Create table
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS books (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        price TEXT,
        rating TEXT,
        scraped_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
''')

# Insert scraped data
for book in scraped_data:
    cursor.execute('''
        INSERT INTO books (name, price, rating)
        VALUES (?, ?, ?)
    ''', (book['name'], book['price'], book['rating']))

conn.commit()
conn.close()
print(f"Saved {len(scraped_data)} books to SQLite database")
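Once the data is in SQLite, you can filter and sort it with plain SQL instead of reloading a whole file. A sketch against an in-memory database with the same schema (the sample rows are illustrative):

```python
import sqlite3

# In-memory database mirroring the books table above
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        price TEXT,
        rating TEXT
    )
''')
cursor.executemany(
    'INSERT INTO books (name, price, rating) VALUES (?, ?, ?)',
    [('Sharp Objects', '£47.82', 'Four'), ('Soumission', '£50.10', 'One')]
)

# Filter and sort with a parameterized query
cursor.execute("SELECT name, price FROM books WHERE rating = ? ORDER BY name", ('Four',))
rows = cursor.fetchall()
conn.close()
print(rows)
```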

MongoDB storage for flexible data structures

MongoDB excels at storing unstructured or semi-structured scraped data. Unlike SQL databases that require fixed schemas, MongoDB handles varying data structures naturally – perfect for scraping different websites where product pages might have different fields, nested attributes, or missing information. Note that you'll need to install pymongo first with pip install pymongo and have MongoDB running locally (install from their download page and start the service).

Alternatively, you can use MongoDB Atlas, their cloud service, by creating a free account on their website. After setting up a cluster, you'll receive a connection string that looks like this: mongodb+srv://username:password@cluster.mongodb.net/database. Replace the MongoClient connection URL in the script below with your Atlas connection string to use the cloud database instead of a local installation.

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient
from datetime import datetime

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    rating_class = book.find('p', class_='star-rating')
    rating = rating_class['class'][1] if rating_class else None
    scraped_data.append({'name': name, 'price': price, 'rating': rating})

try:
    # Connect to MongoDB
    client = MongoClient('mongodb://localhost:27017/', serverSelectionTimeoutMS=5000)
    db = client['scraping_database']
    collection = db['books']

    # Test connection
    client.admin.command('ping')

    # Prepare data with timestamps
    for book in scraped_data:
        book['scraped_date'] = datetime.now()

    # Insert data
    result = collection.insert_many(scraped_data)
    print(f"Inserted {len(result.inserted_ids)} books into MongoDB")

    # Close connection
    client.close()
except Exception as e:
    print(f"MongoDB connection failed: {e}")
    print("Make sure MongoDB is installed and running locally")
    print("Alternative: Save to CSV instead")

    # Fallback to CSV
    import pandas as pd
    df = pd.DataFrame(scraped_data)
    df.to_csv('books_fallback.csv', index=False)
    print("Data saved to books_fallback.csv instead")

PostgreSQL for production environments

For production applications, PostgreSQL offers robust features and scalability. Unlike SQLite and (to a lesser extent) MongoDB, PostgreSQL handles concurrent access from multiple scrapers, supports advanced indexing for fast queries on millions of records, and provides built-in replication for data backup and high availability.

To use PostgreSQL, you'll need to install it locally (download from their website) or use a cloud service like Heroku Postgres or AWS RDS. You'll also need to install the psycopg2 driver with pip install psycopg2-binary (this allows Python to communicate with PostgreSQL databases). After installation, create a database and update the connection parameters in the script below with your actual host, database name, username, and password.

import requests
from bs4 import BeautifulSoup
import psycopg2  # driver used by SQLAlchemy under the hood
import pandas as pd
from sqlalchemy import create_engine

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    rating_class = book.find('p', class_='star-rating')
    rating = rating_class['class'][1] if rating_class else None
    scraped_data.append({'name': name, 'price': price, 'rating': rating})

try:
    # Database connection using SQLAlchemy engine for Pandas compatibility
    engine = create_engine('postgresql://username:password@localhost:5432/scraped_data')

    # Use Pandas for easy insertion
    df = pd.DataFrame(scraped_data)
    df.to_sql('books', engine, if_exists='append', index=False)
    print(f"Saved {len(scraped_data)} books to PostgreSQL database")
except Exception as e:
    print(f"PostgreSQL connection failed: {e}")
    print("Make sure PostgreSQL is running and credentials are correct")

    # Fallback to CSV
    df = pd.DataFrame(scraped_data)
    df.to_csv('books_fallback.csv', index=False)
    print("Data saved to books_fallback.csv instead")

Advanced data saving strategies

Beyond basic file exports, sophisticated scraping projects require robust data handling approaches. This section covers incremental saving to prevent data loss during long scraping sessions, data validation to ensure quality, and comprehensive error handling with backup mechanisms.

Incremental saving during scraping

Save data progressively to avoid losing everything if your script crashes. Incremental saving becomes crucial when scraping large websites that take hours to complete, ensuring you don't lose thousands of records due to network timeouts, website blocks, or system crashes.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

def scrape_book(book_url):
    """Scrape individual book data"""
    try:
        response = requests.get(book_url)
        soup = BeautifulSoup(response.content, 'html.parser')
        name = soup.find('h1').text.strip()
        price = soup.find('p', class_='price_color').text.strip()
        rating_class = soup.find('p', class_='star-rating')
        rating = rating_class['class'][1] if rating_class else None
        return {'name': name, 'price': price, 'rating': rating}
    except Exception as e:
        print(f"Error scraping {book_url}: {e}")
        return None

# Get list of book URLs from main page
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

book_urls = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    relative_url = book.find('h3').find('a')['href']
    # Links on the main page are relative (e.g. catalogue/...)
    full_url = f"https://books.toscrape.com/{relative_url}"
    book_urls.append(full_url)

print(f"Found {len(book_urls)} book URLs to scrape")

# Incremental saving during scraping
scraped_data = []
batch_size = 5  # Small batch for demonstration

for i, book_url in enumerate(book_urls):
    print(f"Scraping book {i+1}/{len(book_urls)}: {book_url}")

    # Scrape individual book
    book_data = scrape_book(book_url)
    if book_data:
        scraped_data.append(book_data)

    # Save every 5 items
    if (i + 1) % batch_size == 0:
        df = pd.DataFrame(scraped_data)
        df.to_csv(f'backup_batch_{i//batch_size + 1}.csv', index=False)
        print(f"Saved batch {i//batch_size + 1} with {len(scraped_data)} books")

    time.sleep(1)  # Be respectful to the server

# Save final batch
if scraped_data:
    df = pd.DataFrame(scraped_data)
    df.to_csv('final_books_data.csv', index=False)
    print(f"Final save completed: {len(scraped_data)} total books")
else:
    print("No data was successfully scraped")

Implementing data validation

Validate your data before saving to ensure quality. Data validation catches common scraping issues like missing fields, malformed formats, or unexpected data types that could cause problems during analysis.

import pandas as pd
import requests
from bs4 import BeautifulSoup

def validate_book_data(book):
    """Validate scraped book data"""
    required_fields = ['name', 'price']

    # Check required fields
    for field in required_fields:
        if field not in book or not book[field]:
            return False

    # Validate price format (books use £ symbol)
    try:
        price_value = float(book['price'].replace('£', '').replace(',', ''))
        if price_value < 0:
            return False
    except ValueError:
        return False

    return True

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    scraped_data.append({'name': name, 'price': price})

# Filter valid data before saving
valid_data = [book for book in scraped_data if validate_book_data(book)]
invalid_count = len(scraped_data) - len(valid_data)

df = pd.DataFrame(valid_data)
df.to_csv('validated_books.csv', index=False)
print(f"Saved {len(valid_data)} valid books to validated_books.csv")
print(f"Filtered out {invalid_count} invalid records")

Error handling and recovery

Implement robust error handling to preserve data during unexpected issues. The script below demonstrates multiple layers of protection to ensure your scraped data never gets lost, even when things go wrong.

import json
import logging
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import shutil

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def safe_save_data(data, filename):
    """Save data with error handling and backup"""
    try:
        # Create backup of existing file
        try:
            shutil.copy(filename, f"{filename}.backup")
            logging.info(f"Created backup: {filename}.backup")
        except FileNotFoundError:
            pass  # No existing file to backup

        # Save new data
        df = pd.DataFrame(data)
        df.to_csv(filename, index=False)
        logging.info(f"Successfully saved {len(data)} records to {filename}")

        # Save JSON backup for debugging
        json_filename = f"{filename}.json"
        with open(json_filename, 'w') as f:
            json.dump(data, f, indent=2, default=str)
        logging.info(f"JSON backup saved: {json_filename}")
    except Exception as e:
        logging.error(f"Error saving data: {e}")

        # Emergency save as JSON
        emergency_filename = f"emergency_save_{int(time.time())}.json"
        with open(emergency_filename, 'w') as f:
            json.dump(data, f, indent=2, default=str)
        logging.info(f"Emergency save completed: {emergency_filename}")

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    rating_class = book.find('p', class_='star-rating')
    rating = rating_class['class'][1] if rating_class else None
    scraped_data.append({'name': name, 'price': price, 'rating': rating})

logging.info(f"Scraped {len(scraped_data)} books from {url}")

# Safe save with error handling
safe_save_data(scraped_data, 'books.csv')

This error handling system implements several protective measures. It creates automatic backups of existing files before overwriting them, ensuring you never lose previous data. The nested try-except blocks handle different failure scenarios gracefully – if the backup creation fails, the script continues with the main save operation.

When the primary save operation fails, an emergency JSON save preserves your data with a timestamped filename. Comprehensive logging tracks all operations, making it easy to diagnose issues and verify successful saves. The default=str parameter in JSON dumps handles non-serializable objects automatically, preventing crashes from complex data types.

Data compression and optimization

For large datasets, implement compression to reduce file sizes and improve transfer speeds. Compression becomes essential when dealing with thousands of scraped records or when you need to share data files frequently.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = book.find('p', class_='price_color').text.strip()
    scraped_data.append({'name': name, 'price': price})

# Convert to DataFrame
df = pd.DataFrame(scraped_data)

# Save compressed CSV
df.to_csv('books.csv.gz', index=False, compression='gzip')
print("Saved compressed CSV: books.csv.gz")

# Save regular Excel (Excel files are already compressed internally)
df.to_excel('books.xlsx', index=False)
print("Saved Excel file: books.xlsx")

# Load compressed data to verify
df_loaded = pd.read_csv('books.csv.gz')
print(f"Loaded {len(df_loaded)} records from compressed file")
print(f"First book: {df_loaded.iloc[0]['name']}")

This compression approach offers multiple benefits for scraped data management. The gzip compression for CSV files typically reduces file sizes by 60-80%, making them faster to upload, download, or email.

Excel files with zip compression maintain full formatting while using less disk space. Pandas automatically handles decompression when reading these files, so your data loading code remains unchanged. This is particularly valuable when scraping large eCommerce sites or news archives where you might collect hundreds of thousands of records.

Automating data exports

Create automated workflows that handle data saving without manual intervention. This approach ensures consistent data collection even when you're not actively monitoring the process. Note that you'll need to install the schedule library first with pip install schedule.

The script below demonstrates a complete automation system that schedules scraping sessions, saves data with timestamps, and manages file storage automatically:

import schedule
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import glob
import os

def perform_scraping():
    """Scrape books data from the website"""
    url = "https://books.toscrape.com/"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    scraped_data = []
    books = soup.find_all('article', class_='product_pod')
    for book in books:
        name = book.find('h3').find('a')['title']
        price = book.find('p', class_='price_color').text.strip()
        scraped_data.append({'name': name, 'price': price})
    return scraped_data

def cleanup_old_files(pattern, keep=10):
    """Clean up old files, keeping only the most recent ones"""
    files = glob.glob(pattern)
    if len(files) > keep:
        # Sort by modification time, oldest first
        files.sort(key=os.path.getmtime)
        # Remove oldest files
        for file in files[:-keep]:
            os.remove(file)
            print(f"Removed old file: {file}")

def automated_scrape_and_save():
    """Automated scraping with timestamped saves"""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    print(f"Starting automated scraping at {datetime.now()}")

    # Perform scraping
    scraped_data = perform_scraping()

    # Save with timestamp
    filename = f"scraped_data_{timestamp}.csv"
    df = pd.DataFrame(scraped_data)
    df.to_csv(filename, index=False)

    # Clean up old files (keep last 10)
    cleanup_old_files("scraped_data_*.csv", keep=10)
    print(f"Automated save completed: {filename} ({len(scraped_data)} books)")

# Schedule daily runs at 2 PM (change time here: "14:00" = 2 PM, "02:00" = 2 AM)
schedule.every().day.at("14:00").do(automated_scrape_and_save)

print("Automation scheduler started. Press Ctrl+C to stop.")
print("Next scheduled run: Daily at 14:00 (2 PM)")

# For testing, also run immediately (remove this in production)
print("Running initial scrape...")
automated_scrape_and_save()

# Keep the script running and check for scheduled tasks
try:
    while True:
        schedule.run_pending()
        time.sleep(60)
except KeyboardInterrupt:
    print("\nScheduler stopped by user")

This automation script performs several key functions. It runs scraping sessions at scheduled times (2 PM daily in this example), creates timestamped files to prevent overwrites, automatically cleans up old files to save disk space, and continues running indefinitely to execute scheduled tasks. The script checks every minute for pending scheduled jobs and executes them when the time arrives.

Best practices for data preservation

Follow these guidelines to ensure your scraped data remains accessible and useful:

  • File naming conventions. Use descriptive, timestamped filenames like products_amazon_20240122.csv to track data sources and collection dates.
  • Data structure consistency. Maintain consistent column names and data types across different scraping sessions. This simplifies data analysis and merging.
  • Regular backups. Implement automated backup systems that copy your data files to multiple locations, including cloud storage services.
  • Documentation. Include metadata files that describe your data structure, scraping parameters, and collection dates.
  • Version control. Use git or similar systems to track changes in your scraping scripts and data schemas.
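To make the documentation point concrete, here's a small sketch of writing a metadata file alongside your data. The filenames, field names, and values are illustrative, not a fixed schema:

```python
import json
from datetime import datetime

# Hypothetical metadata describing one scraping run: where the data came
# from, when it was collected, and what each column contains
metadata = {
    "source_url": "https://books.toscrape.com/",
    "collected_at": datetime.now().isoformat(),
    "columns": {"name": "str", "price": "float", "rating": "int"},
    "row_count": 20,
}

# Save next to the data file so anyone opening the dataset later knows
# exactly what they're looking at
with open("scraped_data_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)

print("Metadata saved to scraped_data_metadata.json")
```

A few lines of metadata like this can save hours of guesswork when you revisit a dataset months later.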

Troubleshooting common saving issues

Even with proper setup, you might encounter issues when saving scraped data. These problems often stem from file permissions, memory limitations, or character encoding conflicts. Here are the most common issues and their solutions.

  • Permission errors. Ensure your script has write permissions in the target directory. On Unix systems, use chmod 755 to grant appropriate permissions.
  • Memory limitations. For large datasets, process data in chunks rather than loading everything into memory simultaneously.
  • Encoding problems. Always specify UTF-8 encoding when working with multilingual text data to prevent character corruption.
  • Concurrent access. When multiple scripts save to the same file, implement file locking mechanisms to prevent data corruption.
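For the memory-limitation case above, one common pattern is appending to a CSV in chunks instead of building the whole DataFrame in memory. This sketch uses generated sample rows and an illustrative chunk size:

```python
import pandas as pd

# Simulated large dataset; in a real scraper these rows would arrive
# incrementally from parsed pages
all_rows = [{"name": f"Book {i}", "price": float(i)} for i in range(250)]
chunk_size = 100

for start in range(0, len(all_rows), chunk_size):
    chunk = pd.DataFrame(all_rows[start:start + chunk_size])
    # Write mode for the first chunk creates the file with a header;
    # append mode for later chunks adds rows without repeating the header
    chunk.to_csv(
        "large_dataset.csv",
        mode="w" if start == 0 else "a",
        header=(start == 0),
        index=False,
    )

print("Wrote large_dataset.csv in chunks")
```

Only one chunk is ever held in memory at a time, so the same pattern scales to millions of rows.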

Performance optimization tips

Optimize your data saving operations for better performance by reducing memory usage, improving write speeds, and minimizing file sizes. Note that for Parquet file support, you'll need to install pyarrow first with pip install pyarrow (Parquet is a columnar storage format that compresses data much more efficiently than CSV while maintaining fast read speeds).

import pandas as pd
import requests
from bs4 import BeautifulSoup
import sqlite3

# Sample scraping code
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect data in a list
scraped_data = []
books = soup.find_all('article', class_='product_pod')
for book in books:
    name = book.find('h3').find('a')['title']
    price = float(book.find('p', class_='price_color').text.strip().replace('£', ''))
    rating_class = book.find('p', class_='star-rating')
    # Convert rating words to numbers for efficiency
    rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
    rating = rating_map.get(rating_class['class'][1], 0) if rating_class else 0
    scraped_data.append({'name': name, 'price': price, 'rating': rating})

# Convert to DataFrame
df = pd.DataFrame(scraped_data)

# Use efficient data types to reduce memory usage
df = df.astype({
    'price': 'float32',  # Smaller than default float64
    'rating': 'int8'     # Very small integers (0-5)
})
print(f"Memory usage after optimization: {df.memory_usage(deep=True).sum()} bytes")

# Batch database operations for better performance
conn = sqlite3.connect('books_optimized.db')
cursor = conn.cursor()

# Create table
cursor.execute('''CREATE TABLE IF NOT EXISTS books
                  (name TEXT, price REAL, rating INTEGER)''')

# Prepare data for batch insert
data_batch = [(row['name'], row['price'], row['rating']) for _, row in df.iterrows()]

# Batch insert (much faster than individual inserts)
cursor.executemany('INSERT INTO books VALUES (?, ?, ?)', data_batch)
conn.commit()
conn.close()

# Use compression for large files
try:
    df.to_parquet('books.parquet', compression='snappy')
    print("Saved compressed Parquet file: books.parquet")
except ImportError:
    print("Install pyarrow for Parquet support: pip install pyarrow")
    # Fallback to compressed CSV
    df.to_csv('books.csv.gz', index=False, compression='gzip')
    print("Saved compressed CSV instead: books.csv.gz")

print(f"Optimized and saved {len(scraped_data)} books with performance enhancements")

These optimizations significantly improve performance through several mechanisms. Converting data types to smaller variants (float32 instead of float64, int8 instead of int64) reduces memory usage by up to 50% without losing precision for typical scraped data.

Batching database operations with executemany() performs hundreds of inserts in a single transaction instead of individual operations, reducing database overhead dramatically. Using compressed formats like Parquet with Snappy compression can reduce file sizes by 70-90% while maintaining fast read/write speeds, making data transfers and storage much more efficient.

To sum up

Saving scraped data effectively requires choosing the right storage method for your specific needs. CSV files work well for simple datasets, while databases excel at handling complex relationships and large volumes. Remember to implement error handling, data validation, and backup mechanisms to prevent data loss.

Start with simple CSV exports for your first projects, then gradually adopt more sophisticated storage solutions as your requirements grow. The key is consistent implementation of saving mechanisms throughout your scraping workflow.

About the author

Dominykas Niaura

Technical Copywriter

Dominykas brings a unique blend of philosophical insight and technical expertise to his writing. Starting his career as a film critic and music industry copywriter, he's now an expert in making complex proxy and web scraping concepts accessible to everyone.


Connect with Dominykas via LinkedIn

All information on Decodo Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Why does my scraping script stop saving data unexpectedly?

This usually happens when websites detect automated scraping and block your IP address, causing your script to fail before completing data collection. The solution is using residential proxies that rotate your IP address automatically with each request.


Proxies prevent detection and ensure your scraping continues uninterrupted, allowing you to save complete datasets instead of partial results. Without proxies, you might only collect a fraction of your target data before getting blocked.

What's the easiest way to save scraped data in Python?

The simplest method is to use Pandas to save data as CSV files. Collect your scraped data in a list of dictionaries, convert it to a DataFrame with pd.DataFrame(data), then save with df.to_csv('filename.csv', index=False). This approach works for most beginner projects and handles data formatting automatically.
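The whole pattern fits in a few lines. The book titles and prices below are sample values for illustration:

```python
import pandas as pd

# Collect scraped records as a list of dictionaries
data = [
    {"name": "A Light in the Attic", "price": 51.77},
    {"name": "Tipping the Velvet", "price": 53.74},
]

# Convert to a DataFrame and save; index=False keeps the row index
# out of the output file
df = pd.DataFrame(data)
df.to_csv("books_simple.csv", index=False)

print(f"Saved {len(df)} rows to books_simple.csv")
```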

What's the difference between CSV, JSON, and Excel formats for saving scraped data?

Choose CSV for simplicity and universal compatibility, JSON for complex structures, or Excel when you need formatting and multiple sheets:


  • CSV files are best for simple tabular data and work with any spreadsheet application, but can't handle nested structures well.
  • JSON is perfect for complex, nested data like API responses or when you need to preserve data hierarchies, plus it's easily readable by web applications.
  • Excel files offer formatting options and multiple sheets, but require the openpyxl library for creation. While Excel files can be opened by Microsoft Excel, they're also compatible with free alternatives like LibreOffice Calc, Google Sheets, or even Pandas itself for programmatic access.
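The differences are easiest to see by saving the same record to each format. The sample record below is made up; note how the nested "tags" list survives in JSON but gets flattened to a string in CSV:

```python
import json
import pandas as pd

# One sample record with a nested field
data = [{"name": "Sharp Objects", "price": 47.82, "tags": ["fiction", "mystery"]}]
df = pd.DataFrame(data)

# CSV: universal, but the nested list is stored as its string representation
df.to_csv("books.csv", index=False)

# JSON: preserves the nested "tags" list exactly
with open("books.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)

# Excel: raises ImportError if openpyxl isn't installed
try:
    df.to_excel("books.xlsx", index=False)
    print("Saved books.csv, books.json, and books.xlsx")
except ImportError:
    print("Install openpyxl for Excel support: pip install openpyxl")
```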

What file format should I use for large amounts of scraped data?

For large datasets, use compressed formats like gzip-compressed CSV files or Parquet format with Snappy compression. These can reduce file sizes by 60-80% compared to regular CSV files. Parquet is especially good for data analysis later, while compressed CSV maintains compatibility with spreadsheet applications.
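Compressed CSV in pandas is a one-parameter change. The dataset below is generated sample data for illustration:

```python
import pandas as pd

# Illustrative dataset standing in for a large scrape
df = pd.DataFrame({
    "name": [f"Book {i}" for i in range(1000)],
    "price": [float(i) for i in range(1000)],
})

# pandas infers gzip from the .gz extension, but being explicit is clearer
df.to_csv("books_large.csv.gz", index=False, compression="gzip")

# Reading the compressed file back works the same way
restored = pd.read_csv("books_large.csv.gz")
print(f"Round-tripped {len(restored)} rows")
```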

How do I handle special characters when saving scraped data?

Always specify UTF-8 encoding when saving files to handle international characters, emojis, and special symbols correctly. Use encoding='utf-8' parameter in pandas functions like to_csv() or to_excel(). This prevents character corruption that commonly occurs when scraping international websites or content with non-English text.
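A quick round-trip demonstrates this. The title string below is a made-up example mixing accents and non-Latin characters:

```python
import pandas as pd

# International text that gets corrupted under the wrong encoding
data = [{"title": "Café Münchhausen 日本語", "price": 9.99}]
df = pd.DataFrame(data)

# Explicit UTF-8 keeps every character intact on disk
df.to_csv("international.csv", index=False, encoding="utf-8")

# Reading back with the same encoding round-trips cleanly
restored = pd.read_csv("international.csv", encoding="utf-8")
print(restored["title"][0])
```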

How do I prevent losing data if my scraping script crashes?

Implement incremental saving by saving your data every few hundred records instead of waiting until the end. Use a batch approach where you save partial results to timestamped files like data_batch_1.csv, data_batch_2.csv, etc. This way, even if your script fails halfway through, you still have most of your collected data.
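One way to sketch this batch approach, using generated records in place of a live scraping loop (the batch size and filenames are illustrative):

```python
import pandas as pd

# Simulated stream of scraped records; with batches of 100, a crash
# loses at most one unsaved partial batch
records = [{"name": f"Item {i}", "price": float(i)} for i in range(250)]
batch_size = 100
batch_num = 0
buffer = []

for record in records:
    buffer.append(record)
    if len(buffer) >= batch_size:
        batch_num += 1
        pd.DataFrame(buffer).to_csv(f"data_batch_{batch_num}.csv", index=False)
        buffer = []

# Flush whatever is left when scraping finishes
if buffer:
    batch_num += 1
    pd.DataFrame(buffer).to_csv(f"data_batch_{batch_num}.csv", index=False)

print(f"Saved {batch_num} batch files")
```

After a crash, you simply concatenate the existing batch files and resume from the last saved record.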

© 2018-2025 decodo.com. All Rights Reserved