
MechanicalSoup Python: A Complete Guide to Scraping, Forms, and Proxies


When you need to scrape 50 pages of search results behind a login wall, raw Requests + Beautiful Soup force you to track cookies and assemble form payloads by hand, while Selenium launches a full browser for pages that don't even use JavaScript. MechanicalSoup sits between those extremes. It wraps Requests and Beautiful Soup into a stateful browser that handles web scraping sessions, forms, and navigation automatically. This guide covers everything from installation to proxy-powered production scrapers.


TL;DR

  • MechanicalSoup combines Requests and Beautiful Soup into a single stateful browser that tracks cookies, headers, and navigation automatically
  • Use it for scraping static sites that involve forms, logins, or multi-step workflows where manual session management would be tedious
  • Proxy integration works through the standard requests.Session interface, so you set browser.session.proxies and every request routes through your proxy
  • When you hit JavaScript-rendered content or need thousands of concurrent requests, switch to Playwright or Scrapy, respectively

What is MechanicalSoup?

MechanicalSoup is a Python library for browser automation and web scraping that wraps the Requests library (for HTTP) and Beautiful Soup (for HTML parsing) into a single stateful interface. Its main class, StatefulBrowser, maintains cookies, session headers, and the current page URL between requests automatically. You don't need to pass cookies or headers manually from one request to the next.

The name combines Mechanize (a browser automation library from the Python 2 era whose development stalled for years) and Beautiful Soup. MechanicalSoup is its actively maintained spiritual successor, installable via pip and hosted on GitHub.

Essentially, MechanicalSoup gives you HTTP requests, HTML parsing, and a Beautiful Soup object via browser.page, all wired together with session state. Beautiful Soup is a parser and navigator for HTML and XML, and MechanicalSoup installs and uses it internally, so you get the full range of BS4 selectors (select, select_one, find, find_all) without importing or configuring Beautiful Soup separately. For a full BS4 reference, see the Beautiful Soup web scraping guide.

MechanicalSoup does not execute JavaScript; it only processes the raw HTML the server returns. Sites that rely on JavaScript to render content will return incomplete or empty pages, so those require Playwright or Selenium.


Installation and environment setup

Python 3.8+ is required, and no separate browser binary is needed.

# Create and activate a virtual environment
python -m venv mechsoup-env
source mechsoup-env/bin/activate # On Windows: mechsoup-env\Scripts\activate
# Install MechanicalSoup
pip install mechanicalsoup
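# lxml is installed as a MechanicalSoup dependency; run this only if your environment lacks it
pip install lxml

In current releases, MechanicalSoup's default soup_config is {"features": "lxml"}, so pages are parsed with lxml unless you override it. lxml is typically faster than html.parser (Python's built-in parser) and more lenient with malformed HTML. Pass soup_config={"features": "html.parser"} when portability matters and keep lxml for performance-sensitive scripts.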

Instantiating StatefulBrowser with meaningful defaults

Configure the browser correctly from the start rather than discovering these options mid-project. The following instantiation call sets the lxml parser, enables exceptions on 404s, and identifies the scraper with a custom user agent.

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser(
    soup_config={"features": "lxml"},  # Use lxml parser
    raise_on_404=True,  # Raise exception on 404 instead of silent failure
    user_agent="my-scraper/1.0"  # Identify your scraper
)

Setting custom request headers

Because StatefulBrowser wraps a requests.Session, you can set headers on the session directly. This same access pattern enables proxy configuration later (in the Proxy integration section).

browser.session.headers.update({
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml"
})

For a deeper dive into the underlying session mechanics at the Requests library level, see the Requests guide.

Core scraping workflow: Navigate, select, and extract

Let's build a working scraper against quotes.toscrape.com, a clean sandbox site with a predictable HTML structure that's ideal for testing.

Opening a page

The browser.open(url) method sends a GET request and loads the response into the stateful browser. It returns a requests.Response object, so you can check response.status_code immediately. After calling it, browser.url gives you the current URL (useful for confirming redirects landed where you expected), and browser.page gives you the Beautiful Soup object you'll use for all element selection on that page.

browser.open("https://quotes.toscrape.com/page/1/")
print(browser.url) # Current URL after any redirects
page = browser.page # Beautiful Soup object for the current page

Every time you call browser.open() or follow a link, browser.page updates to reflect the new page. You don't need to re-assign it manually.

Selecting elements

There are 3 approaches depending on what you need.

  • browser.page.select_one("css-selector") returns the first element matching the CSS selector, or None if nothing matches. You can use this when you expect exactly 1 result, like a page title or a specific form field.
  • browser.page.select("css-selector") returns all matching elements as a list. You should use this when you're collecting multiple items of the same type, like every quote on a page or every row in a table.
  • browser.page.find() and browser.page.find_all() work similarly but accept tag names and attribute dictionaries instead of CSS selectors (e.g., find("a", {"class": "tag"})). All of these return Beautiful Soup Tag objects, which you can then extract text and attributes from.

Extracting data

Once you have an element, there are a few ways to pull content out of it.

For text, element.get_text(strip=True) is the most reliable option because it grabs all nested text and strips whitespace, and element.text does the same without stripping. element.string is stricter and only returns a value when the element contains exactly 1 text node, returning None otherwise, so it's less useful for elements with nested tags.

For attributes, element.get("href") returns the attribute value or None if absent, while element["href"] raises a KeyError if the attribute doesn't exist. Prefer .get() when you're not certain the attribute is present.

To learn more about selectors and parsing strategies, check the BS4 guide. For readers new to parsing concepts, that primer covers the fundamentals, and there's also a guide on choosing the best parser.

Working example: Scraping quotes

This example puts all 3 steps together. It navigates to the quotes sandbox, selects every quote container using div.quote as the CSS selector, and extracts 3 fields from each container: the quote text (inside span.text), the author name (inside small.author), and the tag list (every a.tag link within the container). We'll extend this running example in later sections.

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser(
    soup_config={"features": "lxml"},
    raise_on_404=True,
    user_agent="my-scraper/1.0"
)
browser.open("https://quotes.toscrape.com/page/1/")
quotes = []
for container in browser.page.select("div.quote"):
    text = container.select_one("span.text").get_text(strip=True)
    author = container.select_one("small.author").get_text(strip=True)
    tags = [tag.text.strip() for tag in container.select("a.tag")]
    quotes.append({"text": text, "author": author, "tags": tags})
for q in quotes:
    print(f'{q["author"]}: {q["text"][:60]}... | Tags: {", ".join(q["tags"])}')

Notice that select_one and select are called on container rather than on browser.page. This scopes each search to the individual quote div, so the selectors only match elements inside that specific container. This pattern becomes important on pages where multiple sections share similar class names.

Form handling and multi-step workflows

Form interaction is where MechanicalSoup earns its keep. Doing this with raw Requests means inspecting the page source for hidden fields, extracting CSRF tokens, assembling the correct POST payload, and manually following redirects while preserving cookies. 

MechanicalSoup handles all of that through 3 methods: select_form(), field assignment, and submit_selected().

Selecting a form

The select_form() method takes a CSS selector, finds the matching form on the current page, and loads it into the browser's internal state for filling. If the page has multiple forms (a login form and a search bar, for example), the CSS selector lets you target the right one. Passing just "form" selects the first form on the page.

# Select the first form on the page
browser.select_form("form")
# Or target a specific form by action URL
browser.select_form('form[action="/search"]')

If the selector matches nothing, select_form() raises LinkNotFoundError. This typically means the page didn't load the expected content, either because the site returned a CAPTCHA, redirected to a login page, or the form is rendered by JavaScript (which MechanicalSoup can't execute). Wrapping the call in a try/except block lets you catch this and inspect the page before the script crashes.

from mechanicalsoup.utils import LinkNotFoundError
try:
    browser.select_form('form[action="/search"]')
except LinkNotFoundError:
    print("No matching form found on this page")

Filling fields and submitting

Once a form is selected, set fields by their HTML name attribute using browser["field_name"] = "value". MechanicalSoup looks up each field in the selected form and assigns the value. This works for text inputs, textareas, and checkboxes. For <select> dropdowns, pass the option's value attribute rather than its visible label, since MechanicalSoup matches against the underlying value.

browser["field_name"] = "value"
response = browser.submit_selected()

Calling browser.submit_selected() serializes all form fields (including any hidden inputs the site set), sends the request using the form's action URL and method (GET or POST), and follows any redirects. It returns the requests.Response object, and the browser's internal state updates to the response page automatically. The session cookies travel with the request, so if the form submission requires authentication, a prior login (covered in the Session management section) carries through.

Multi-step workflow example

To demonstrate form interaction on a testable target, this example uses httpbin's sample form rather than quotes.toscrape.com (which doesn't have a form that works with static HTML scraping). The code navigates to the form page, fills 5 fields, submits, and reads the response. httpbin echoes the submitted data back as JSON, so you can verify exactly what MechanicalSoup sent.

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser(
    soup_config={"features": "lxml"},
    raise_on_404=True,
    user_agent="my-scraper/1.0"
)
# 1. Navigate to the form page
browser.open("https://httpbin.org/forms/post")
# 2. Select the form
browser.select_form('form')
# 3. Fill the fields by their name attributes
browser["custname"] = "Jane Doe"
browser["custtel"] = "555-1234"
browser["custemail"] = "jane@example.com"
browser["size"] = "medium"  # Radio group on this form: pass the value of the option to check
browser["comments"] = "Extra cheese, please."
# 4. Submit
response = browser.submit_selected()
# 5. httpbin echoes submitted data as JSON
print(response.text)

With raw Requests, this same workflow would require inspecting the form's action and method, building the POST body as a dict, and sending it with requests.post(). On a real site with CSRF tokens and session cookies, the manual version gets significantly more involved.

Debugging forms with hidden fields

MechanicalSoup preserves hidden inputs automatically when filling and submitting, which is how it handles CSRF tokens and other server-side form state without any extra code on your part. To inspect all fields (including hidden ones) before submission, call browser.form.print_summary() (older code uses browser.get_current_form(), which MechanicalSoup 1.0 deprecated in favor of the form property). This prints the form's input, select, and textarea elements with their names, types, and current values, which is the fastest way to debug a form submission that silently fails or returns unexpected results. For multi-page form workflows, see the web scraping pagination guide.

Session and authentication management

MechanicalSoup wraps a requests.Session object, so every cookie set by a response is automatically stored and re-sent with subsequent requests to the same domain. In practice, this means you can log into a site once and then scrape any authenticated page without touching a cookie header yourself.

That single detail eliminates the most tedious part of authenticated scraping, where raw Requests would require you to extract Set-Cookie headers, store them, and attach them to every follow-up request manually. 

Login workflow

The following example logs into quotes.toscrape.com/login by selecting the form, filling credentials, and submitting. After login, all subsequent browser.open() calls on the same domain carry the session cookies automatically.

browser.open("https://quotes.toscrape.com/login")
browser.select_form('form')
browser["username"] = "admin"
browser["password"] = "admin"
browser.submit_selected()
if "logout" in browser.page.text.lower():
    print("Login successful")
    print(f"Redirected to: {browser.url}")
# This request automatically includes auth cookies
browser.open("https://quotes.toscrape.com/")

The verification step checks for the word "logout" on the response page, since most sites only show a logout link when the user is authenticated. If login fails (wrong credentials, CAPTCHA, or an unexpected redirect), browser.page will contain the login form again or an error message, and browser.url will typically remain on the login path rather than redirecting to a protected page.

After login succeeds, every subsequent browser.open() call on the same domain sends the session cookies automatically. You can navigate to any authenticated page, submit forms, or follow links, and the session stays active until the cookies expire or the server invalidates them.
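
The verification logic described above can be pulled into a small reusable check. This is only a sketch: the helper name login_succeeded, the "logout" marker, and the "/login" path are illustrative assumptions rather than anything MechanicalSoup provides, and a real site may need a different marker.

```python
from urllib.parse import urlsplit


def login_succeeded(page_text: str, current_url: str,
                    login_path: str = "/login",
                    marker: str = "logout") -> bool:
    """Heuristic login check (hypothetical helper, not a MechanicalSoup API)."""
    # Authenticated pages usually show a logout link somewhere in the text
    has_marker = marker.lower() in page_text.lower()
    # A failed login typically leaves the browser on the login path
    current_path = urlsplit(current_url).path.rstrip("/")
    still_on_login = current_path == login_path.rstrip("/")
    return has_marker and not still_on_login


# Usage after browser.submit_selected():
#   ok = login_succeeded(browser.page.get_text(), browser.url)
```

Requiring both signals (marker present and URL off the login path) avoids a false positive on login pages that happen to mention the word "logout".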

Inspecting session cookies

You can iterate over browser.session.cookies or convert the jar to a dict for a quick overview of what the session holds.

for cookie in browser.session.cookies:
    print(f"{cookie.name}: {cookie.value}")
print(dict(browser.session.cookies))

This is also useful for confirming that a site set the cookies you expected after login, or for checking whether a specific token cookie is present before attempting to access a protected endpoint. 

Persisting cookies across script runs

To reuse a session between script runs, serialize the cookie jar to a JSON file after login and restore it at the start of the next session.

import json
# Save cookies after login
cookies = dict(browser.session.cookies)
with open("cookies.json", "w") as f:
    json.dump(cookies, f)
# Restore cookies in a new session
with open("cookies.json", "r") as f:
    saved_cookies = json.load(f)
new_browser = mechanicalsoup.StatefulBrowser()
new_browser.session.cookies.update(saved_cookies)

Keep in mind that session cookies have expiration times set by the server. If a restored session stops working, the cookies have likely expired, and you'll need to log in again. 
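
Because a plain dict of cookies loses the server's expiration data, one option is to store a timestamp next to the cookies and treat old files as expired. The sketch below assumes a one-hour default lifetime; the helper names save_cookies and load_cookies and the max_age_seconds value are illustrative choices, not part of MechanicalSoup or Requests.

```python
import json
import time
from pathlib import Path


def save_cookies(cookies: dict, path: str) -> None:
    """Write the cookies together with the time they were saved."""
    payload = {"saved_at": time.time(), "cookies": cookies}
    Path(path).write_text(json.dumps(payload), encoding="utf-8")


def load_cookies(path: str, max_age_seconds: float = 3600):
    """Return saved cookies, or None if the file is missing or too old."""
    file = Path(path)
    if not file.exists():
        return None
    payload = json.loads(file.read_text(encoding="utf-8"))
    if time.time() - payload["saved_at"] > max_age_seconds:
        return None  # Treat as expired; the caller should log in again
    return payload["cookies"]


# Usage with a StatefulBrowser:
#   saved = load_cookies("cookies.json")
#   if saved is None:
#       ...log in, then save_cookies(dict(browser.session.cookies), "cookies.json")
#   else:
#       browser.session.cookies.update(saved)
```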

Manually injecting cookies

When you already have a known token or want to bootstrap from a saved session, inject cookies directly with browser.session.cookies.set().

browser.session.cookies.set("session_token", "abc123", domain="example.com")

The domain parameter scopes the cookie so it's only sent to requests matching that domain, which mirrors how browsers handle cookies natively. 

For the underlying session mechanics in detail, the Requests guide breaks it down.

Proxy integration with MechanicalSoup

This is the section every other MechanicalSoup guide skips, and it matters more than most of the library's features for anyone scraping beyond a sandbox. MechanicalSoup's session management keeps your cookies and headers in order, but it does nothing about the IP address those requests come from. 

Why proxies matter

Every HTTP request carries your origin IP. Target sites log these IPs and apply rate limits per address, typically allowing a set number of requests per minute before throttling or blocking. MechanicalSoup manages session state well, but sending a few hundred requests from the same IP will trigger those limits regardless. 

Proxies route your requests through intermediate servers with different IPs, spreading the load so no single address accumulates enough requests to get flagged. 

Configuring proxies

Because StatefulBrowser exposes the underlying requests.Session, proxy configuration uses the standard Requests format. The dict needs both "http" and "https" keys, since Requests uses them to match the URL scheme of each outgoing request. Set the proxy dict before the first browser.open() call so every subsequent request routes through it.

The following example uses Decodo residential proxies with the rotating gateway on port 7000, which assigns a new IP for every request.

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser(
    soup_config={"features": "lxml"},
    raise_on_404=True,
    user_agent="my-scraper/1.0"
)
browser.session.proxies = {
    "http": "http://YOUR_USERNAME:YOUR_PASSWORD@gate.decodo.com:7000",
    "https": "http://YOUR_USERNAME:YOUR_PASSWORD@gate.decodo.com:7000"
}
browser.open("https://quotes.toscrape.com/page/1/")

Once set, every request the browser makes (including form submissions, redirects, and follow_link() calls) goes through the proxy. You don't need to pass the proxy to each method individually. 
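
Passwords that contain characters such as @ or : will break a hand-written proxy URL, so it can help to build the dict with percent-encoded credentials. The following is a minimal sketch; build_proxies is a hypothetical helper name, and the credentials in the usage line are placeholders.

```python
from urllib.parse import quote


def build_proxies(username: str, password: str, host: str, port: int) -> dict:
    """Return a Requests-style proxies dict with percent-encoded credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = f"http://{user}:{pwd}@{host}:{port}"
    # The same proxy URL serves both schemes, as in the example above
    return {"http": proxy_url, "https": proxy_url}


# Usage:
#   browser.session.proxies = build_proxies("user", "p@ss:word", "gate.decodo.com", 7000)
```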

Keeping credentials out of source code

Store the proxy URL in a .env file and load it with python-dotenv so credentials stay out of version control. If no .env file exists or PROXY_URL is unset, the scraper runs without a proxy.

# .env
PROXY_URL=http://YOUR_USERNAME:YOUR_PASSWORD@gate.decodo.com:7000

# Python script
import os
import mechanicalsoup
from dotenv import load_dotenv
load_dotenv()  # Reads .env if present; a missing file is not an error
browser = mechanicalsoup.StatefulBrowser(
    soup_config={"features": "lxml"},
    raise_on_404=True,
    user_agent="my-scraper/1.0"
)
proxy_url = os.getenv("PROXY_URL")
if proxy_url:
    # Configure the proxy before the first request so nothing bypasses it
    browser.session.proxies = {
        "http": proxy_url,
        "https": proxy_url
    }
browser.open("https://quotes.toscrape.com/page/1/")
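
Keeping credentials out of source code is only half the job if the script later logs the proxy URL. As a sketch, a small masking helper can be applied before any logging call; mask_proxy_url is a hypothetical name, not a Requests or MechanicalSoup function.

```python
from urllib.parse import urlsplit


def mask_proxy_url(proxy_url: str) -> str:
    """Return the proxy URL with any username and password replaced by ***."""
    parts = urlsplit(proxy_url)
    if parts.username is None:
        return proxy_url  # No credentials to hide
    host = parts.hostname or ""
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme}://***:***@{host}{port}"


# Usage:
#   print(f"Using proxy {mask_proxy_url(proxy_url)}")
```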

Rotating vs. sticky sessions

Proxy providers typically offer 2 session modes, and which one you need depends on whether your scraper maintains an authentication state. Decodo controls this through the port number you connect to.

Rotating sessions (port 7000) assign a different IP to each request. This works well for unauthenticated bulk collection (scraping product pages, collecting public listings) because each request appears to come from a different user, and no single IP accumulates enough hits to trigger a block.

# Rotating: new IP per request
browser.session.proxies = {
    "http": "http://YOUR_USERNAME:YOUR_PASSWORD@gate.decodo.com:7000",
    "https": "http://YOUR_USERNAME:YOUR_PASSWORD@gate.decodo.com:7000"
}

Sticky sessions (such as port 10001, 10002, etc.) maintain the same IP for a set duration (up to 24 hours for residential proxies). These are the safer choice for authenticated scraping because many target servers associate your session cookies with the IP address that logged in. If the IP changes mid-session, such a server sees a new address presenting cookies it issued to a different address, and it will often invalidate the session or flag the request.

# Sticky: same IP for the session duration
browser.session.proxies = {
    "http": "http://user-YOUR_USERNAME-sessionduration-1440:YOUR_PASSWORD@gate.decodo.com:10001",
    "https": "http://user-YOUR_USERNAME-sessionduration-1440:YOUR_PASSWORD@gate.decodo.com:10001"
}

As a rule of thumb, use sticky sessions for anything that involves a login flow (covered in the session management section) and rotating sessions for everything else. You can also target specific countries by swapping the endpoint, for example, us.decodo.com:10000 for US-based IPs. For more on how sticky and rotating sessions work, the docs walk through configuration.
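
That rule of thumb can be encoded in one function so the choice is made in a single place. This sketch simply reuses the ports and username format from the two examples above; proxies_for is a hypothetical helper, and the exact values should be confirmed against your provider's documentation.

```python
def proxies_for(username: str, password: str, needs_login: bool,
                session_minutes: int = 1440) -> dict:
    """Pick sticky settings for login flows and rotating settings otherwise."""
    if needs_login:
        # Sticky username format and port copied from the example above
        user = f"user-{username}-sessionduration-{session_minutes}"
        port = 10001
    else:
        # Rotating gateway port copied from the example above
        user = username
        port = 7000
    url = f"http://{user}:{password}@gate.decodo.com:{port}"
    return {"http": url, "https": url}


# Usage:
#   browser.session.proxies = proxies_for("YOUR_USERNAME", "YOUR_PASSWORD", needs_login=True)
```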

Residential vs. datacenter proxies

MechanicalSoup sends standard HTTP requests without a browser fingerprint (no WebGL canvas, no font list, no screen dimensions). That means the proxy IP itself becomes the primary signal anti-bot systems use to evaluate your request. The IP's ASN (Autonomous System Number) tells the target site which network the request originates from, and datacenter ASNs (AWS, Google Cloud, DigitalOcean) are well-known and easy to filter.

Residential proxies use IPs assigned to real ISPs (Comcast, Vodafone, BT), so the ASN looks like a regular home internet connection. For targets with aggressive IP filtering, this is the difference between getting blocked on the 3rd request and running a full crawl.


Pagination handling

Most real-world scraping jobs span more than 1 page, and how you handle pagination determines whether your scraper is a one-off script or a reusable tool. 2 patterns cover the majority of cases.

URL-based pagination

When the URL pattern is predictable (e.g., /page/1/, /page/2/), you can loop through constructed URLs and stop when no results appear on the page. Some sites return a 404 for out-of-range pages, but many (including quotes.toscrape.com) return a 200 page that contains no results instead, so always include an empty-content check as the primary termination condition. Note that with raise_on_404=True, a 404 surfaces as a LinkNotFoundError rather than a response object, so the loop has to catch it.

import mechanicalsoup
import time
from mechanicalsoup.utils import LinkNotFoundError
browser = mechanicalsoup.StatefulBrowser(
    soup_config={"features": "lxml"},
    raise_on_404=True,
    user_agent="my-scraper/1.0"
)
all_quotes = []
page_num = 1
while True:
    url = f"https://quotes.toscrape.com/page/{page_num}/"
    try:
        browser.open(url)
    except LinkNotFoundError:
        # With raise_on_404=True, a 404 raises instead of returning a response
        break
    # Sites that answer 200 for out-of-range pages are caught by the empty-content check
    containers = browser.page.select("div.quote")
    if not containers:
        break
    for container in containers:
        text = container.select_one("span.text").get_text(strip=True)
        author = container.select_one("small.author").get_text(strip=True)
        tags = [tag.text.strip() for tag in container.select("a.tag")]
        all_quotes.append({"text": text, "author": author, "tags": tags})
    page_num += 1
    time.sleep(1)  # Courtesy delay between requests
print(f"Collected {len(all_quotes)} quotes across {page_num - 1} pages")
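
A while True loop depends entirely on the site eventually returning an empty page. As a defensive sketch, the same stop-on-empty logic can be wrapped with a hard page cap; collect_pages and fetch_page are hypothetical names, and the cap of 100 is an arbitrary assumption.

```python
from typing import Callable, List


def collect_pages(fetch_page: Callable[[int], List[dict]],
                  max_pages: int = 100) -> List[dict]:
    """Call fetch_page(1), fetch_page(2), ... until a page is empty or the cap is hit."""
    items: List[dict] = []
    for page_num in range(1, max_pages + 1):
        page_items = fetch_page(page_num)
        if not page_items:
            break  # Empty page: same termination condition as the loop above
        items.extend(page_items)
    return items


# Usage: fetch_page would open the URL for page_num with the browser
# and return the list of extracted quote dicts for that page.
```

Passing the fetch step in as a function keeps the loop testable without any network access.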

Link-based pagination

Some sites use unpredictable pagination URLs, or the "Next" link contains query parameters that change per page. In those cases, grab the link element and let browser.follow_link() handle the navigation. The loop terminates when the next link is absent from the page.

all_quotes = []
browser.open("https://quotes.toscrape.com/page/1/")
while True:
    for container in browser.page.select("div.quote"):
        text = container.select_one("span.text").get_text(strip=True)
        author = container.select_one("small.author").get_text(strip=True)
        tags = [tag.text.strip() for tag in container.select("a.tag")]
        all_quotes.append({"text": text, "author": author, "tags": tags})
    next_link = browser.page.select_one("li.next a")
    if next_link is None:
        break
    browser.follow_link(next_link)
    time.sleep(1)

Rate limiting and retries

Add time.sleep(1) between requests as a minimum courtesy delay. For sites that respond with 429 or 503, implement exponential backoff rather than hammering the server. The Python Requests retry guide covers the pattern in depth. For deeper pagination strategies like infinite scroll and cursor-based approaches, see the web scraping pagination guide. For saving results, see how to save scraped data.
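
As a rough illustration of exponential backoff, the delay can be computed by a small function that doubles the wait on each attempt and honors a numeric Retry-After header when the server sends one. The name backoff_delay, the 60-second cap, and the 10% jitter are assumptions for this sketch, and Retry-After values given as HTTP dates are not handled.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None and retry_after.strip().isdigit():
        # The server gave an explicit delay in seconds; respect it up to the cap
        return min(float(retry_after.strip()), cap)
    delay = min(base * (2 ** attempt), cap)
    # Add up to 10% random jitter so parallel workers don't retry in lockstep
    return delay + random.uniform(0, delay * 0.1)


# Usage after a 429 or 503 response:
#   time.sleep(backoff_delay(attempt, response.headers.get("Retry-After")))
```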

Advanced data extraction

These patterns handle page structures more complex than the flat quote containers we've been working with.

Nested structures and chained selectors

When a page element contains sub-elements, chain select calls on the parent element rather than searching from browser.page each time. This scopes the search to the relevant container and breaks less often when the site's layout changes elsewhere on the page.

for container in browser.page.select("div.quote"):
    text = container.select_one("span.text").get_text(strip=True)
    author = container.select_one("small.author").get_text(strip=True)
    tags = [t.text.strip() for t in container.select("a.tag")]

Table extraction

Many scraping targets present data in HTML tables (quotes.toscrape.com doesn't, but the pattern applies broadly). The following example iterates over table rows, skips the header, and guards against inconsistent column counts with a len(cells) check.

rows = browser.page.select("table tr")
data = []
for row in rows[1:]:
    cells = row.select("td")
    if len(cells) < 3:
        continue
    data.append({
        "col1": cells[0].get_text(strip=True),
        "col2": cells[1].get_text(strip=True),
        "col3": cells[2].get_text(strip=True),
    })

Safe attribute access

Use .get() rather than element["attr"] when the attribute may be absent, because .get() returns None while direct access raises KeyError. This matters for links (element.get("href")), images (element.get("src")), and custom data attributes (element.get("data-id")).
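
Since .get() can return None, code that post-processes the attribute has to tolerate that too. A common follow-up step is resolving a relative href against the current page URL; the sketch below does so with the standard library's urljoin, and absolute_link is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urljoin


def absolute_link(page_url: str, href: Optional[str]) -> Optional[str]:
    """Resolve an href against the page URL; return None for a missing or empty href."""
    if not href:
        return None
    return urljoin(page_url, href)


# Usage:
#   url = absolute_link(browser.url, element.get("href"))
```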

Exporting to CSV and JSON

The complete pipeline from a list of dicts to output files uses Python's csv.DictWriter for flat data and json.dump for nested structures.

import csv
import json
fieldnames = ["text", "author", "tags"]
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for q in all_quotes:
        writer.writerow({
            "text": q["text"],
            "author": q["author"],
            "tags": ", ".join(q["tags"])
        })
with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(all_quotes, f, indent=2, ensure_ascii=False)

For more options, including database storage, see how to save scraped data. For post-extraction cleanup, the data cleaning guide covers common patterns.

Error handling and debugging

A scraper that works in testing and breaks silently in production is worse than one that fails loudly on the first run. MechanicalSoup has specific failure modes worth handling explicitly.

HTTP status errors

By default, browser.open() silently accepts non-200 responses. Setting raise_on_404=True catches 404s, but other status codes still pass through. Production scripts should check status codes after every request and branch on the common failure cases.

response = browser.open(url)
if response.status_code == 429:
    print("Rate limited. Backing off.")
    time.sleep(30)
elif response.status_code == 403:
    print("Blocked. Rotate proxy or adjust headers.")
elif response.status_code >= 500:
    print(f"Server error: {response.status_code}")
elif response.status_code != 200:
    print(f"Unexpected status: {response.status_code}")

Form not found

The select_form() method raises mechanicalsoup.utils.LinkNotFoundError when the selector matches nothing. Before the call, use browser.page.select("form") to list all forms on the page for debugging. An empty list usually means the page loaded incorrectly or rendered a CAPTCHA instead of the expected content.

from mechanicalsoup.utils import LinkNotFoundError
try:
    browser.select_form('form[action="/search"]')
except LinkNotFoundError:
    forms = browser.page.select("form")
    print(f"Found {len(forms)} forms on the page")
    for i, form in enumerate(forms):
        print(f" Form {i}: action={form.get('action')}, method={form.get('method')}")

Network errors with retry

Transient failures (timeouts, connection resets, DNS errors) are inevitable at scale. Wrap browser.open() in a retry function with exponential backoff.

import requests
import time
def open_with_retry(browser, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            return browser.open(url)
        except requests.exceptions.RequestException as e:
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Network error: {e}. Retrying in {wait}s...")
            time.sleep(wait)
    # All attempts raised a network error
    raise RuntimeError(f"Failed to open {url} after {max_retries} retries")

The fastest debugging trick

When a selector returns None unexpectedly, print the raw HTML the browser received. If the output shows a CAPTCHA page, a redirect to a login page, or an empty body, the scraper has been detected or has navigated to the wrong page. This one check saves more debugging time than anything else.

print(browser.page.prettify()[:2000])

Structured logging

Set up Python logging and record each URL, status code, and content length. Small responses (under a few hundred bytes) from pages that should be large are a reliable signal that you're getting blocked or served an error page.

import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")
response = browser.open(url)
logger.info(f"URL: {url} | Status: {response.status_code} | Size: {len(response.content)} bytes")
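
The size signal can be turned into an explicit check alongside the log line. This is a heuristic sketch only: looks_blocked is a hypothetical helper, and the 500-byte threshold is an assumption you would tune per site.

```python
def looks_blocked(status_code: int, size_bytes: int, min_bytes: int = 500) -> bool:
    """Flag responses that are block-style statuses or suspiciously small."""
    if status_code in (403, 429):
        return True
    # A 200 with a tiny body often means an error or challenge page
    return status_code == 200 and size_bytes < min_bytes


# Usage:
#   if looks_blocked(response.status_code, len(response.content)):
#       logger.warning("Possible block at %s", url)
```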

Performance optimization

MechanicalSoup is lightweight by design, and a few configuration choices make a meaningful difference once you're scraping at volume.

Parser selection and connection reuse

lxml is 2-5× faster than html.parser for large pages, so set soup_config={"features": "lxml"} at instantiation and only fall back to html.parser in restricted environments. On the connection side, MechanicalSoup reuses the requests.Session connection pool automatically. Avoid creating a new StatefulBrowser() inside a loop, and instead create 1 instance and reuse it across pages.

Parallel scraping with ThreadPoolExecutor

MechanicalSoup is synchronous, but you can run multiple StatefulBrowser instances in separate threads for higher throughput. Each thread needs its own browser since StatefulBrowser is not thread-safe when shared.

from concurrent.futures import ThreadPoolExecutor
import mechanicalsoup
def scrape_page(url):
    browser = mechanicalsoup.StatefulBrowser(
        soup_config={"features": "lxml"},
        user_agent="my-scraper/1.0"
    )
    browser.open(url)
    quotes = []
    for container in browser.page.select("div.quote"):
        text = container.select_one("span.text").get_text(strip=True)
        author = container.select_one("small.author").get_text(strip=True)
        quotes.append({"text": text, "author": author})
    return quotes
urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(scrape_page, urls))
all_quotes = [q for page_quotes in results for q in page_quotes]
print(f"Scraped {len(all_quotes)} quotes across {len(urls)} pages")
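
The example above creates a fresh browser for every URL, which gives up the connection reuse recommended earlier. One way to keep both properties (one browser per thread, reused across that thread's URLs) is thread-local storage. This is a sketch under that assumption; get_thread_browser and the factory argument are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

_local = threading.local()


def get_thread_browser(factory: Callable[[], object]) -> object:
    """Return this thread's browser, creating it with factory() on first use."""
    if not hasattr(_local, "browser"):
        _local.browser = factory()
    return _local.browser


# Usage inside scrape_page(url):
#   browser = get_thread_browser(
#       lambda: mechanicalsoup.StatefulBrowser(soup_config={"features": "lxml"})
#   )
#   browser.open(url)
```

With max_workers=5, at most 5 browsers are created no matter how many URLs are processed, and no browser is ever shared between threads.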

Know the ceiling

MechanicalSoup comfortably handles sequential scraping of static pages, but throughput is bounded by your delays: with a 1-second courtesy delay, one browser fetches at most about 60 pages per minute, and a few threads push that into the hundreds. When you need thousands of concurrent requests, you've outgrown it. At that point, the right tools are an async HTTP client like httpx or a dedicated framework like Scrapy.

Full code: complete MechanicalSoup scraper with proxy support and pagination

import csv
import json
import logging
import os
import time
from urllib.parse import urljoin

import mechanicalsoup
import requests
from dotenv import load_dotenv
from mechanicalsoup.utils import LinkNotFoundError

# --- Configuration ---
load_dotenv()

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("mechanicalsoup-scraper")

BASE_URL = "https://quotes.toscrape.com"
START_PATH = "/page/1/"
OUTPUT_CSV = "quotes.csv"
OUTPUT_JSON = "quotes.json"
COOKIES_FILE = "cookies.json"
REQUEST_DELAY = 1.0
MAX_RETRIES = 3
TIMEOUT_SECONDS = 20

# Set this in .env to test proxy path:
# PROXY_URL=http://username:password@proxy-host:proxy-port
PROXY_URL = os.getenv("PROXY_URL")
if PROXY_URL and ("YOUR_USERNAME" in PROXY_URL or "YOUR_PASSWORD" in PROXY_URL):
    logger.warning("PROXY_URL contains placeholder credentials. Running without proxy.")
    PROXY_URL = None

# --- Browser Setup ---
def create_browser() -> mechanicalsoup.StatefulBrowser:
    browser = mechanicalsoup.StatefulBrowser(
        soup_config={"features": "lxml"},
        raise_on_404=True,
        user_agent="my-scraper/1.0",
    )
    browser.session.headers.update(
        {
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        }
    )
    if PROXY_URL:
        # Route both HTTP and HTTPS traffic through the same proxy endpoint.
        browser.session.proxies.update(
            {
                "http": PROXY_URL,
                "https": PROXY_URL,
            }
        )
        logger.info("Proxy configured")
    return browser

# --- Retry Logic ---
def open_with_retry(
    browser: mechanicalsoup.StatefulBrowser,
    url: str,
    max_retries: int = MAX_RETRIES,
):
    """Open url and return the response, or None if the page could not be fetched."""
    for attempt in range(max_retries):
        try:
            response = browser.open(url, timeout=TIMEOUT_SECONDS)
        except LinkNotFoundError:
            # raise_on_404=True turns a 404 into this exception; retrying will not help.
            logger.error("Page not found (404): %s", url)
            return None
        except requests.exceptions.RequestException as exc:
            wait = 2 ** attempt
            logger.warning("Network error for %s: %s. Retrying in %ss...", url, exc, wait)
            time.sleep(wait)
            continue
        status = response.status_code
        if status == 200:
            return response
        if status == 429:
            # Exponential backoff: 5 s, 10 s, 20 s, ...
            wait = (2 ** attempt) * 5
            logger.warning("Rate limited at %s. Waiting %ss...", url, wait)
            time.sleep(wait)
            continue
        if status == 403:
            logger.error("Blocked at %s. Check/rotate proxy.", url)
            return None
        logger.warning("Unexpected status %s at %s", status, url)
        return response
    logger.error("Failed to open %s after %s retries", url, max_retries)
    return None

# --- Extraction ---
def extract_quotes_from_page(browser: mechanicalsoup.StatefulBrowser):
    quotes = []
    for container in browser.page.select("div.quote"):
        text_el = container.select_one("span.text")
        author_el = container.select_one("small.author")
        # Skip malformed containers instead of crashing on a missing element.
        if not text_el or not author_el:
            continue
        quotes.append(
            {
                "text": text_el.get_text(strip=True),
                "author": author_el.get_text(strip=True),
                "tags": [tag.get_text(strip=True) for tag in container.select("a.tag")],
            }
        )
    return quotes

# --- Pagination ---
def scrape_all_quotes(browser: mechanicalsoup.StatefulBrowser):
    all_quotes = []
    current_url = urljoin(BASE_URL, START_PATH)
    page_count = 0
    while current_url:
        response = open_with_retry(browser, current_url)
        if response is None:
            logger.error("Stopping pagination due to open failure at %s", current_url)
            break
        page_count += 1
        page_quotes = extract_quotes_from_page(browser)
        all_quotes.extend(page_quotes)
        logger.info(
            "Page %s: %s quotes (total: %s)",
            page_count,
            len(page_quotes),
            len(all_quotes),
        )
        # Follow the "Next" link until the last page, which has none.
        next_link = browser.page.select_one("li.next a")
        if not next_link or not next_link.get("href"):
            break
        current_url = urljoin(BASE_URL, next_link["href"])
        time.sleep(REQUEST_DELAY)
    return all_quotes

# --- Login Example ---
def login(browser: mechanicalsoup.StatefulBrowser, username: str = "admin", password: str = "admin") -> bool:
    response = open_with_retry(browser, urljoin(BASE_URL, "/login"))
    if response is None:
        return False
    try:
        browser.select_form("form")
        browser["username"] = username
        browser["password"] = password
        browser.submit_selected()
    except LinkNotFoundError:
        logger.error("Login form not found")
        return False
    except Exception as exc:
        logger.error("Login failed with exception: %s", exc)
        return False
    # A "Logout" link on the resulting page is used as the success marker.
    if "logout" in browser.page.get_text(" ", strip=True).lower():
        logger.info("Login successful")
        return True
    logger.error("Login failed (logout marker not found)")
    return False

# --- Cookie Persistence ---
def save_cookies(browser: mechanicalsoup.StatefulBrowser, filepath: str = COOKIES_FILE):
    # Stores name/value pairs only; domain, path, and expiry are not preserved.
    cookies = requests.utils.dict_from_cookiejar(browser.session.cookies)
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(cookies, f, ensure_ascii=False, indent=2)
    logger.info("Cookies saved to %s", filepath)


def load_cookies(browser: mechanicalsoup.StatefulBrowser, filepath: str = COOKIES_FILE) -> bool:
    if not os.path.exists(filepath):
        return False
    with open(filepath, "r", encoding="utf-8") as f:
        cookies = json.load(f)
    browser.session.cookies.update(cookies)
    logger.info("Cookies loaded from %s", filepath)
    return True

# --- Export ---
def export_to_csv(quotes, filepath: str = OUTPUT_CSV):
    fieldnames = ["text", "author", "tags"]
    with open(filepath, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for q in quotes:
            writer.writerow(
                {
                    "text": q["text"],
                    "author": q["author"],
                    "tags": ", ".join(q["tags"]),  # flatten the tag list into one cell
                }
            )
    logger.info("Exported %s quotes to %s", len(quotes), filepath)


def export_to_json(quotes, filepath: str = OUTPUT_JSON):
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(quotes, f, indent=2, ensure_ascii=False)
    logger.info("Exported %s quotes to %s", len(quotes), filepath)

# --- Main ---
if __name__ == "__main__":
    browser = create_browser()
    # Optional login demo
    # login(browser, username="admin", password="admin")
    quotes = scrape_all_quotes(browser)
    if quotes:
        export_to_csv(quotes)
        export_to_json(quotes)
        print(f"Done. Collected {len(quotes)} quotes.")
    else:
        print("No quotes collected.")
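
The script above reads PROXY_URL as one ready-made string. If the credentials live in separate variables instead, a helper along these lines could assemble the URL and percent-encode the username and password, since characters such as @ or : inside a password would otherwise break URL parsing. The variable names PROXY_HOST, PROXY_PORT, PROXY_USER, and PROXY_PASS, the default port, and both function names are assumptions for illustration; they are not part of MechanicalSoup or of the script.

```python
import os
from urllib.parse import quote


def build_proxy_url(host: str, port: int, username: str = "", password: str = "") -> str:
    """Assemble an HTTP proxy URL, percent-encoding any credentials."""
    if not username:
        return f"http://{host}:{port}"
    # safe="" encodes every reserved character, including "@", ":" and "/".
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{host}:{port}"


def proxies_from_env() -> dict:
    """Build a Requests-style proxies mapping from hypothetical PROXY_* variables."""
    host = os.getenv("PROXY_HOST")
    if not host:
        return {}  # no proxy configured
    url = build_proxy_url(
        host,
        int(os.getenv("PROXY_PORT", "8080")),
        os.getenv("PROXY_USER", ""),
        os.getenv("PROXY_PASS", ""),
    )
    return {"http": url, "https": url}
```

The resulting mapping can be passed to browser.session.proxies.update(), the same call create_browser() uses.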

MechanicalSoup vs. alternative tools

MechanicalSoup vs. Requests + Beautiful Soup

This is the comparison that matters most. Using Requests + Beautiful Soup directly gives you the same parsing capability but requires manual cookie management, session header tracking, and form field assembly. MechanicalSoup automates all of that.

The tradeoff is that MechanicalSoup adds a dependency and a layer of abstraction, while Requests + BS4 is more explicit and easier to customize for unconventional HTTP patterns. If your scraper fills forms or navigates multi-step workflows, MechanicalSoup saves real development time. If you're making a single GET request and parsing the HTML, raw Requests + BS4 is less overhead.
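
To make the manual work concrete, here is a minimal sketch of the form field assembly that Requests + Beautiful Soup leave to you. The helper build_form_payload and the sample markup are hypothetical and stand in for a live login page; the sketch only builds the payload and sends nothing.

```python
from bs4 import BeautifulSoup


def build_form_payload(html: str, overrides: dict) -> dict:
    """Collect every named <input> in the first form, then apply our own values."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.select_one("form")
    if form is None:
        raise ValueError("No <form> found in the page")
    payload = {}
    for field in form.select("input[name]"):
        # Hidden fields such as CSRF tokens must be echoed back unchanged.
        payload[field["name"]] = field.get("value", "")
    payload.update(overrides)
    return payload


# Hypothetical login page markup, used here instead of a live request.
SAMPLE_HTML = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Login">
</form>
"""

payload = build_form_payload(SAMPLE_HTML, {"username": "admin", "password": "admin"})
print(payload)
```

With plain Requests you would still need to resolve the form's action URL and post this dictionary on the same Session so the cookies carry over. In MechanicalSoup, select_form, the field assignments, and submit_selected cover those steps.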

MechanicalSoup vs. Beautiful Soup (alone)

Beautiful Soup is a parser only, with no built-in HTTP client, session management, or form handling. MechanicalSoup includes Beautiful Soup internally, so the question is whether to add the stateful browser layer on top.

MechanicalSoup vs. Scrapy

Scrapy is a full crawling framework with spiders, middleware, pipelines, scheduling, and async HTTP. It's built for scale and significantly faster when crawling large sites. But it also comes with a learning curve and project structure that's overkill for a 50-line form automation script. Scrapy can submit forms through FormRequest.from_response, but each step lives in its own callback, so MechanicalSoup is usually the simpler fit for short, linear workflows that depend on form interaction. For a deeper comparison, see Scrapy vs. Beautiful Soup.

MechanicalSoup vs. Playwright / Selenium

Playwright and Selenium control real browser engines and execute JavaScript. MechanicalSoup handles only raw HTML. For JavaScript-rendered content, single-page applications, or interactions requiring real browser events, Playwright or Selenium are required.

The resource gap is worth quantifying. A headless Chromium session uses 200-500 MB of RAM per instance. A MechanicalSoup session uses a few MB. If you're scraping a static site and spinning up Chromium to do it, you're paying a 100× resource premium for a capability you don't need.

Use MechanicalSoup when JavaScript execution is unnecessary, and use Playwright when you need it. For a head-to-head comparison, see Playwright vs. Selenium.

Decision summary

| Scenario | Tool |
| --- | --- |
| Static site with forms and session management | MechanicalSoup |
| Static site, no forms, one-off scrape | Requests + Beautiful Soup |
| Large-scale structured crawl of static sites | Scrapy |
| JavaScript-rendered content or real browser interaction | Playwright |

Best practices and security considerations

Responsible scraping

Check robots.txt before scraping. Python's built-in urllib.robotparser module lets you programmatically verify whether a specific path may be fetched by your user agent.

from urllib.robotparser import RobotFileParser

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser(user_agent="my-scraper/1.0")

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # downloads and parses robots.txt

target = "https://example.com/target-path"
if rp.can_fetch("my-scraper/1.0", target):
    browser.open(target)
else:
    print("Scraping this path is disallowed by robots.txt")

Keep these concepts in mind as well:

  • Rate limiting. Implement time.sleep() between requests as a minimum. 1-2 seconds works for most sites. For sites with explicit rate limit headers (Retry-After), respect those values and use exponential backoff for retries.
  • User agent. Always set a descriptive user agent string (e.g., "MyBot/1.0: contact@example.com"). Generic or missing user agents are more likely to be flagged.
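
The rate-limiting advice above can be sketched as a small delay calculator. This is one possible approach under stated assumptions, not a fixed recipe: the function name retry_delay, the 1-second base, the 60-second cap, and the 10% jitter are all illustrative choices, and only the seconds form of Retry-After is handled.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # A Retry-After value given in seconds takes priority over our own backoff.
            return float(max(int(retry_after), 0))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    backoff = min(base * (2 ** attempt), cap)
    # Jitter spreads retries out so parallel workers don't retry in lockstep.
    return backoff + random.uniform(0, backoff * 0.1)
```

In a retry loop this might be used as time.sleep(retry_delay(attempt, response.headers.get("Retry-After"))).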

Header hygiene

Set Accept-Language and Accept headers on the session to match a real browser request pattern, because requests that lack these headers are easier to identify as automated traffic.

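browser.session.headers.update({
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Accept-Encoding is left at the Requests default: advertising "br" without a
    # Brotli decoder installed can return bodies that Requests cannot decompress.
})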

Security

  • Credentials. Store proxy passwords and site login credentials in environment variables or a .env file loaded with python-dotenv (see the Proxy integration section), and keep them out of source code.
  • Input sanitization. MechanicalSoup passes through raw HTML without sanitization or validation. If your scraper feeds extracted data into a database or template system, sanitize strings to prevent injection and avoid passing extracted HTML directly to eval() or similar functions.
  • IP exposure. MechanicalSoup sends real HTTP requests with your origin IP, and the target site logs those IPs. Use proxies (see the Proxy integration section) when anonymity or scale is required.
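
As a minimal sketch of the input sanitization point, the example below stores a scraped record with a parameterized query and escapes it only when it is written into HTML. The record, the table name, and the in-memory SQLite database are hypothetical stand-ins for your own pipeline.

```python
import html
import sqlite3

# Hypothetical scraped record containing hostile-looking text.
scraped = {"text": "Robert'); DROP TABLE quotes;--", "author": "<script>alert(1)</script>"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotes (text TEXT, author TEXT)")

# Placeholders keep scraped strings as data; never build SQL with f-strings.
conn.execute(
    "INSERT INTO quotes (text, author) VALUES (?, ?)",
    (scraped["text"], scraped["author"]),
)
conn.commit()

stored_text, stored_author = conn.execute("SELECT text, author FROM quotes").fetchone()

# Escape at output time, when the value is placed into an HTML template.
safe_author = html.escape(stored_author)
print(safe_author)  # prints "&lt;script&gt;alert(1)&lt;/script&gt;"
```

Storing the raw value and escaping on output keeps the database faithful to what was scraped while still protecting whatever renders it.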

Maintenance

Site HTML structures change, and selectors that work today may break after a redesign. Build scripts to fail loudly (raise exceptions, log clearly) rather than silently return empty data. Monitor output for unexpected empty results as an indicator of selector breakage.
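
One way to make a scraper fail loudly is a small guard around selector results. The sketch below is an assumption-laden example rather than a MechanicalSoup feature: SelectorBrokenError and require_matches are hypothetical names introduced here.

```python
import logging

logger = logging.getLogger("selector-check")


class SelectorBrokenError(RuntimeError):
    """Raised when a selector that should match returns too few elements."""


def require_matches(items, selector: str, url: str, minimum: int = 1):
    """Return items unchanged, or raise if fewer than `minimum` were found."""
    if len(items) < minimum:
        logger.error("Selector %r matched %d element(s) at %s", selector, len(items), url)
        raise SelectorBrokenError(
            f"Expected at least {minimum} match(es) for {selector!r} at {url}, got {len(items)}"
        )
    return items
```

In a scraper this could wrap a call such as require_matches(browser.page.select("div.quote"), "div.quote", browser.url), so a redesign surfaces as an exception instead of an empty CSV.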

For the detection landscape, the anti-bot systems guide provides broader context. When MechanicalSoup gets blocked despite best practices, Decodo Site Unblocker handles the unblocking layer automatically.

Final thoughts

MechanicalSoup does one thing well and knows where to stop. It's the right tool for Python scraping projects that involve forms, multi-step navigation, or session management on static sites, without the overhead of a full browser engine. The key design choice is exposing browser.session directly, giving you the full flexibility of Requests.Session for proxy configuration, custom headers, cookie management, and connection pooling. If you already know Requests, you already know how to extend MechanicalSoup. Its two clear ceilings are JavaScript rendering (switch to Playwright) and high-volume concurrent crawling (switch to Scrapy or async HTTP clients). Knowing when to reach for a different tool is as valuable as knowing MechanicalSoup itself.

Static sites done, now what?

When your target needs JS rendering, CAPTCHA solving, or proxy rotation at scale, Decodo's Web Scraping API picks up where MechanicalSoup taps out.

About the author

Justinas Tamasevicius

Director of Engineering

Justinas Tamaševičius is Director of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.

Connect with Justinas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

What is MechanicalSoup?

A Python library that wraps Requests and Beautiful Soup into a stateful browser for scraping and automating static websites.

What is the use of Beautiful Soup?

Beautiful Soup is a Python HTML and XML parser. MechanicalSoup uses it internally and exposes it via browser.page, so developers can use BS4 selectors without a separate import.

Can MechanicalSoup handle JavaScript?

No. MechanicalSoup sends HTTP requests and parses raw HTML without executing JavaScript. Use Playwright or Selenium for JavaScript-rendered sites.

How do I add a proxy to MechanicalSoup?

Set browser.session.proxies = {"http": "http://user:pass@gate.decodo.com:7000", "https": "http://user:pass@gate.decodo.com:7000"} after instantiating the StatefulBrowser.

What is the difference between MechanicalSoup and Beautiful Soup?

Beautiful Soup is a parser only, with no HTTP client, session management, or form handling. MechanicalSoup adds a stateful browser layer on top of Beautiful Soup and Requests.

© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved