MechanicalSoup Python: A Complete Guide to Scraping, Forms, and Proxies
When you need to scrape 50 pages of search results behind a login wall, raw Requests + Beautiful Soup force you to track cookies and assemble form payloads by hand, while Selenium launches a full browser for pages that don't even use JavaScript. MechanicalSoup sits between those extremes. It wraps Requests and Beautiful Soup into a stateful browser that handles web scraping sessions, forms, and navigation automatically. This guide covers everything from installation to proxy-powered production scrapers.
Justinas Tamasevicius
Last updated: Jun 03, 2026

TL;DR
- MechanicalSoup combines Requests and Beautiful Soup into a single stateful browser that tracks cookies, headers, and navigation automatically
- Use it for scraping static sites that involve forms, logins, or multi-step workflows where manual session management would be tedious
- Proxy integration works through the standard requests.Session interface, so you set browser.session.proxies and every request routes through your proxy
- When you hit JavaScript-rendered content or need thousands of concurrent requests, switch to Playwright or Scrapy, respectively
What is MechanicalSoup?
MechanicalSoup is a Python library for browser automation and web scraping that wraps the Requests library (for HTTP) and Beautiful Soup (for HTML parsing) into a single stateful interface. Its main class, StatefulBrowser, maintains cookies, session headers, and the current page URL between requests automatically. You don't need to pass cookies or headers manually from one request to the next.
The name combines Mechanize (an older browser automation library from the Python 2 era that inspired the project) and Beautiful Soup. MechanicalSoup offers a similar API, is actively maintained, and is installable via pip and hosted on GitHub.
Essentially, MechanicalSoup gives you HTTP requests, HTML parsing, and a Beautiful Soup object via browser.page, all wired together with session state. Beautiful Soup itself is a parser and navigator for HTML and XML, and because MechanicalSoup bundles it internally, you get the full range of BS4 selectors (select, select_one, find, find_all) without importing or configuring it separately. For a full BS4 reference, see the Beautiful Soup web scraping guide.
MechanicalSoup does not execute JavaScript; it only processes the raw HTML the server returns. Sites that rely on JavaScript to render content will return incomplete or empty pages, so those require Playwright or Selenium.
Related guides cover the parser options MechanicalSoup supports and take a broader look at Python HTTP clients.
Installation and environment setup
Python 3.8+ is required, and no separate browser binary is needed.
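MechanicalSoup passes {"features": "lxml"} to Beautiful Soup by default, and lxml is installed as one of its dependencies. lxml is significantly faster and more lenient with malformed HTML than html.parser (Python's built-in parser). Switch to html.parser when portability matters, for example in environments where lxml cannot be installed, and keep lxml for performance-sensitive scripts.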
Instantiating StatefulBrowser with meaningful defaults
Configure the browser correctly from the start rather than discovering these options mid-project. The following instantiation call sets the lxml parser, enables exceptions on 404s, and identifies the scraper with a custom user agent.
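The instantiation described above might look like the following sketch. The helper name make_browser and the user agent string are illustrative placeholders; soup_config, raise_on_404, and user_agent are StatefulBrowser keyword arguments. The import sits inside the function only so the options can be inspected without the library installed.

```python
# Options described above: lxml parser, exceptions on 404s, custom user agent.
BROWSER_OPTIONS = {
    "soup_config": {"features": "lxml"},              # parser handed to Beautiful Soup
    "raise_on_404": True,                              # raise LinkNotFoundError on a 404
    "user_agent": "MyBot/1.0: contact@example.com",   # placeholder identity
}


def make_browser(**overrides):
    """Build a StatefulBrowser from BROWSER_OPTIONS (hypothetical helper)."""
    import mechanicalsoup  # third-party: pip install mechanicalsoup lxml

    options = {**BROWSER_OPTIONS, **overrides}
    return mechanicalsoup.StatefulBrowser(**options)
```

Calling make_browser() returns a configured browser; passing raise_on_404=False, for instance, overrides a single default without touching the rest.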
Setting custom request headers
Because StatefulBrowser wraps a requests.Session, you can set headers on the session directly. This same access pattern enables proxy configuration later (in the Proxy integration section).
For a deeper dive into the underlying session mechanics at the Requests library level, see the Requests guide.
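As a rough illustration of setting headers on the session, the sketch below merges a small header dict into browser.session.headers. The helper name and the header values are assumptions chosen for the example, not values the library requires.

```python
# Example header values (assumptions); adjust them to your target site.
DEFAULT_HEADERS = {
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def apply_default_headers(browser, headers=DEFAULT_HEADERS):
    """Merge headers into the session so every later request sends them."""
    browser.session.headers.update(headers)
    return browser
```

Headers already present on the session, such as the user agent, are kept unless the dict overrides them.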
Core scraping workflow: Navigate, select, and extract
Let's build a working scraper against quotes.toscrape.com, a clean sandbox site with a predictable HTML structure that's ideal for testing.
Opening a page
The browser.open(url) method sends a GET request and loads the response into the stateful browser. It returns a requests.Response object, so you can check response.status_code immediately. After calling it, browser.url gives you the current URL (useful for confirming redirects landed where you expected), and browser.page gives you the Beautiful Soup object you'll use for all element selection on that page.
Every time you call browser.open() or follow a link, browser.page updates to reflect the new page. You don't need to re-assign it manually.
Selecting elements
There are 3 approaches depending on what you need.
- browser.page.select_one("css-selector") returns the first element matching the CSS selector, or None if nothing matches. You can use this when you expect exactly 1 result, like a page title or a specific form field.
- browser.page.select("css-selector") returns all matching elements as a list. You should use this when you're collecting multiple items of the same type, like every quote on a page or every row in a table.
- browser.page.find() and browser.page.find_all() work similarly but accept tag names and attribute dictionaries instead of CSS selectors (e.g., find("a", {"class": "tag"})). All of these return Beautiful Soup Tag objects, which you can then extract text and attributes from.
Extracting data
Once you have an element, there are a few ways to pull content out of it.
For text, element.get_text(strip=True) is the most reliable option because it grabs all nested text and strips whitespace, and element.text does the same without stripping. element.string is stricter and only returns a value when the element contains exactly 1 text node, returning None otherwise, so it's less useful for elements with nested tags.
For attributes, element.get("href") returns the attribute value or None if absent, while element["href"] raises a KeyError if the attribute doesn't exist. Prefer .get() when you're not certain the attribute is present.
To learn more about selectors and parsing strategies, check the BS4 guide, which also covers the fundamentals for readers new to parsing concepts. There's also a guide on choosing the best parser.
Working example: Scraping quotes
This example puts all 3 steps together. It navigates to the quotes sandbox, selects every quote container using div.quote as the CSS selector, and extracts 3 fields from each container: the quote text (inside span.text), the author name (inside small.author), and the tag list (every a.tag link within the container). We'll extend this running example in later sections.
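A minimal sketch of that example follows. The function names are illustrative; the selectors are the ones named above. The extraction logic is kept separate from the network call so it can be reused on any page object.

```python
def extract_quotes(page):
    """Pull text, author, and tags from a page such as browser.page."""
    quotes = []
    for container in page.select("div.quote"):
        # Searches are scoped to this container, not to the whole page.
        text = container.select_one("span.text")
        author = container.select_one("small.author")
        quotes.append({
            "text": text.get_text(strip=True) if text else None,
            "author": author.get_text(strip=True) if author else None,
            "tags": [tag.get_text(strip=True) for tag in container.select("a.tag")],
        })
    return quotes


def scrape_first_page(url="https://quotes.toscrape.com/"):
    """Open the sandbox site and return its quotes (needs network access)."""
    import mechanicalsoup  # third-party: pip install mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser()
    browser.open(url)
    return extract_quotes(browser.page)
```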
Notice that select_one and select are called on container rather than on browser.page. This scopes each search to the individual quote div, so the selectors only match elements inside that specific container. This pattern becomes important on pages where multiple sections share similar class names.
Form handling and multi-step workflows
Form interaction is where MechanicalSoup earns its keep. Doing this with raw Requests means inspecting the page source for hidden fields, extracting CSRF tokens, assembling the correct POST payload, and manually following redirects while preserving cookies.
MechanicalSoup handles all of that through 3 methods: select_form(), field assignment, and submit_selected().
Selecting a form
The select_form() method takes a CSS selector, finds the matching form on the current page, and loads it into the browser's internal state for filling. If the page has multiple forms (a login form and a search bar, for example), the CSS selector lets you target the right one. Passing just "form" selects the first form on the page.
If the selector matches nothing, select_form() raises LinkNotFoundError. This typically means the page didn't load the expected content, either because the site returned a CAPTCHA, redirected to a login page, or the form is rendered by JavaScript (which MechanicalSoup can't execute). Wrapping the call in a try/except block lets you catch this and inspect the page before the script crashes.
Filling fields and submitting
Once a form is selected, set fields by their HTML name attribute using browser["field_name"] = "value". MechanicalSoup looks up each field in the selected form and assigns the value. This works for text inputs, textareas, and checkboxes. For <select> dropdowns, pass the option's value attribute rather than its visible label, since MechanicalSoup matches against the underlying value.
Calling browser.submit_selected() serializes all form fields (including any hidden inputs the site set), sends the request using the form's action URL and method (GET or POST), and follows any redirects. It returns the requests.Response object, and the browser's internal state updates to the response page automatically. The session cookies travel with the request, so if the form submission requires authentication, a prior login (covered in the Session management section) carries through.
Multi-step workflow example
To demonstrate form interaction on a testable target, this example uses httpbin's sample form rather than quotes.toscrape.com (which doesn't have a form that works with static HTML scraping). The code navigates to the form page, fills 5 fields, submits, and reads the response. httpbin echoes the submitted data back as JSON, so you can verify exactly what MechanicalSoup sent.
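The workflow could be sketched as below. The URL and the field names (custname, custtel, custemail, size, comments) are assumptions based on httpbin's sample order form, and the values are placeholders; check the live form before relying on them.

```python
FORM_URL = "https://httpbin.org/forms/post"   # assumed location of the sample form

# Field names are assumptions taken from httpbin's sample form.
FORM_VALUES = {
    "custname": "Ada Lovelace",
    "custtel": "555-0100",
    "custemail": "ada@example.com",
    "size": "medium",            # radio button, matched by its value
    "comments": "Ring the bell",
}


def submit_order(browser, values=FORM_VALUES):
    """Open the form, fill the fields, submit, and return the echoed data."""
    browser.open(FORM_URL)
    browser.select_form("form")           # first form on the page
    for name, value in values.items():
        browser[name] = value             # assign by HTML name attribute
    response = browser.submit_selected()
    return response.json()["form"]        # httpbin echoes the fields as JSON
```

Pass a StatefulBrowser instance as browser; the returned dict should mirror the values that were sent.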
With raw Requests, this same workflow would require inspecting the form's action and method, building the POST body as a dict, and sending it with requests.post(). On a real site with CSRF tokens and session cookies, the manual version gets significantly more involved.
Debugging forms with hidden fields
MechanicalSoup preserves hidden inputs automatically when filling and submitting, which is how it handles CSRF tokens and other server-side form state without any extra code on your part. To inspect all fields (including hidden ones) before submission, call browser.get_current_form().print_summary(). This prints every field name, type, and current value, which is the fastest way to debug a form submission that silently fails or returns unexpected results. For multi-page form workflows, see the web scraping pagination guide.
Session and authentication management
MechanicalSoup wraps a requests.Session object, so every cookie set by a response is automatically stored and re-sent with subsequent requests to the same domain. In practice, this means you can log into a site once and then scrape any authenticated page without touching a cookie header yourself.
That single detail eliminates the most tedious part of authenticated scraping, where raw Requests would require you to extract Set-Cookie headers, store them, and attach them to every follow-up request manually.
Login workflow
The following example logs into quotes.toscrape.com/login by selecting the form, filling credentials, and submitting. After login, all subsequent browser.open() calls on the same domain carry the session cookies automatically.
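One way to write that login step is sketched below. The field names username and password are assumptions about the sandbox login form, and log_in is an illustrative helper name.

```python
LOGIN_URL = "https://quotes.toscrape.com/login"


def log_in(browser, username, password, login_url=LOGIN_URL):
    """Submit the login form and report whether a logout link appeared."""
    browser.open(login_url)
    browser.select_form("form")
    browser["username"] = username    # assumed field name
    browser["password"] = password    # assumed field name
    browser.submit_selected()         # hidden inputs are sent along automatically
    # Verification: most sites show "logout" only to authenticated users.
    return "logout" in browser.page.get_text().lower()
```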
The verification step checks for the word "logout" on the response page, since most sites only show a logout link when the user is authenticated. If login fails (wrong credentials, CAPTCHA, or an unexpected redirect), browser.page will contain the login form again or an error message, and browser.url will typically remain on the login path rather than redirecting to a protected page.
After login succeeds, every subsequent browser.open() call on the same domain sends the session cookies automatically. You can navigate to any authenticated page, submit forms, or follow links, and the session stays active until the cookies expire or the server invalidates them.
Inspecting session cookies
You can iterate over browser.session.cookies or convert the jar to a dict for a quick overview of what the session holds.
This is also useful for confirming that a site set the cookies you expected after login, or for checking whether a specific token cookie is present before attempting to access a protected endpoint.
Persisting cookies across script runs
To reuse a session between script runs, serialize the cookie jar to a JSON file after login and restore it at the start of the next session.
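A small sketch of that save-and-restore cycle is shown here. The helper names and the JSON layout are illustrative choices; the code only relies on iterating browser.session.cookies and on its set() method.

```python
import json
from pathlib import Path


def save_cookies(browser, path):
    """Write the session's cookies to a JSON file."""
    cookies = [
        {"name": c.name, "value": c.value, "domain": c.domain, "path": c.path}
        for c in browser.session.cookies
    ]
    Path(path).write_text(json.dumps(cookies, indent=2), encoding="utf-8")


def load_cookies(browser, path):
    """Restore cookies from a JSON file; return how many were loaded."""
    file = Path(path)
    if not file.exists():
        return 0                      # first run: nothing to restore
    cookies = json.loads(file.read_text(encoding="utf-8"))
    for c in cookies:
        browser.session.cookies.set(
            c["name"], c["value"], domain=c["domain"], path=c["path"]
        )
    return len(cookies)
```

This layout drops expiry metadata, so treat a restored session as a hint and fall back to a fresh login when a protected page rejects it.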
Keep in mind that session cookies have expiration times set by the server. If a restored session stops working, the cookies have likely expired, and you'll need to log in again.
Manually injecting cookies
When you already have a known token or want to bootstrap from a saved session, inject cookies directly with browser.session.cookies.set().
The domain parameter scopes the cookie so it's only sent to requests matching that domain, which mirrors how browsers handle cookies natively.
For the underlying session mechanics in detail, the Requests guide breaks it down.
Proxy integration with MechanicalSoup
This is the section every other MechanicalSoup guide skips, and it matters more than most of the library's features for anyone scraping beyond a sandbox. MechanicalSoup's session management keeps your cookies and headers in order, but it does nothing about the IP address those requests come from.
Why proxies matter
Every HTTP request carries your origin IP. Target sites log these IPs and apply rate limits per address, typically allowing a set number of requests per minute before throttling or blocking. MechanicalSoup manages session state well, but sending a few hundred requests from the same IP will trigger those limits regardless.
Proxies route your requests through intermediate servers with different IPs, spreading the load so no single address accumulates enough requests to get flagged.
Configuring proxies
Because StatefulBrowser exposes the underlying requests.Session, proxy configuration uses the standard Requests format. The dict needs both "http" and "https" keys, since Requests uses them to match the URL scheme of each outgoing request. Set the proxy dict before the first browser.open() call so every subsequent request routes through it.
The following example uses Decodo residential proxies with the rotating gateway on port 7000, which assigns a new IP for every request.
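The configuration might be sketched as follows. The host proxy.example.com and the credentials are placeholders, not a real gateway address; substitute the endpoint from your provider's dashboard. The helper names are illustrative.

```python
from urllib.parse import quote


def build_proxies(username, password, host, port):
    """Return a Requests-style proxy dict covering both URL schemes."""
    # Percent-encode credentials so characters like "@" or ":" stay unambiguous.
    credentials = f"{quote(username, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{credentials}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def use_proxy(browser, proxies):
    """Attach the proxy dict to the session before the first request."""
    browser.session.proxies.update(proxies)
    return browser
```

For example, use_proxy(browser, build_proxies("user", "pass", "proxy.example.com", 7000)) would route every later request through the placeholder gateway on port 7000.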
Once set, every request the browser makes (including form submissions, redirects, and follow_link() calls) goes through the proxy. You don't need to pass the proxy to each method individually.
Keeping credentials out of source code
Store the proxy URL in a .env file and load it with python-dotenv so credentials stay out of version control. If no .env file exists or PROXY_URL is unset, the scraper runs without a proxy.
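A hedged sketch of that loading step: the function name is illustrative, and python-dotenv is treated as optional so the code still runs when it is not installed.

```python
import os


def load_proxy_settings(env_var="PROXY_URL"):
    """Return a proxy dict from the environment, or {} when none is set."""
    try:
        from dotenv import load_dotenv  # third-party: pip install python-dotenv
        load_dotenv()                   # does nothing if there is no .env file
    except ImportError:
        pass                            # fall back to plain environment variables
    proxy_url = os.getenv(env_var)
    if not proxy_url:
        return {}                       # no proxy configured: connect directly
    return {"http": proxy_url, "https": proxy_url}
```

Calling browser.session.proxies.update(load_proxy_settings()) then applies the proxy when one is configured and leaves the session untouched otherwise.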
Rotating vs. sticky sessions
Proxy providers typically offer 2 session modes, and which one you need depends on whether your scraper maintains an authentication state. Decodo controls this through the port number you connect to.
Rotating sessions (port 7000) assign a different IP to each request. This works well for unauthenticated bulk collection (scraping product pages, collecting public listings) because each request appears to come from a different user, and no single IP accumulates enough hits to trigger a block.
Sticky sessions (such as port 10001, 10002, etc.) maintain the same IP for a set duration (up to 24 hours for residential proxies). These are the safer choice for authenticated scraping because many target servers associate your session cookies with the IP address that logged in. If the IP changes mid-session, the server sees a new address presenting cookies it issued to a different address, and it will often invalidate the session or flag the request.
As a rule of thumb, use sticky sessions for anything that involves a login flow (covered in the session management section) and rotating sessions for everything else. You can also target specific countries by swapping the endpoint, for example, us.decodo.com:10000 for US-based IPs. For more on how sticky and rotating sessions work, the docs walk through configuration.
Residential vs. datacenter proxies
MechanicalSoup sends standard HTTP requests without a browser fingerprint (no WebGL canvas, no font list, no screen dimensions). That means the proxy IP itself becomes the primary signal anti-bot systems use to evaluate your request. The IP's ASN (Autonomous System Number) tells the target site which network the request originates from, and datacenter ASNs (AWS, Google Cloud, DigitalOcean) are well-known and easy to filter.
Residential proxies use IPs assigned to real ISPs (Comcast, Vodafone, BT), so the ASN looks like a regular home internet connection. For targets with aggressive IP filtering, this is the difference between getting blocked on the 3rd request and running a full crawl.
Pagination handling
Most real-world scraping jobs span more than 1 page, and how you handle pagination determines whether your scraper is a one-off script or a reusable tool. 2 patterns cover the majority of cases.
URL-based pagination
When the URL pattern is predictable (e.g., /page/1/, /page/2/), you can loop through constructed URLs and stop when no results appear on the page. Some sites return a 404 for out-of-range pages, but many return a 200 with no results instead (quotes.toscrape.com, for example, serves a page without any quote containers), so always include an empty-content check as the primary termination condition.
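The loop could look roughly like this. The function and parameter names are illustrative; parse_page stands for any function that turns browser.page into a list of records.

```python
import time


def scrape_numbered_pages(browser, url_template, parse_page, max_pages=50, delay=1.0):
    """Walk /page/1/, /page/2/, ... until a page yields no records."""
    results = []
    for page_number in range(1, max_pages + 1):
        response = browser.open(url_template.format(page=page_number))
        # With raise_on_404=True the browser raises instead of reaching this check.
        if response.status_code == 404:
            break
        items = parse_page(browser.page)
        if not items:            # primary stop condition: 200 but nothing to extract
            break
        results.extend(items)
        time.sleep(delay)        # courtesy delay between requests
    return results
```

A call such as scrape_numbered_pages(browser, "https://quotes.toscrape.com/page/{page}/", parse_page) would stop at the first page without results; max_pages is a safety cap in case the site never returns an empty page.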
Link-based pagination with follow_link()
Some sites use unpredictable pagination URLs, or the "Next" link contains query parameters that change per page. In those cases, grab the link element and let browser.follow_link() handle the navigation. The loop terminates when the next link is absent from the page.
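A sketch of the link-following loop is given below. The selector "li.next > a" is an assumption about the sandbox markup, and the function name is illustrative.

```python
import time


def scrape_by_next_link(browser, start_url, parse_page,
                        next_selector="li.next > a", max_pages=100, delay=1.0):
    """Follow the 'Next' link until it disappears from the page."""
    results = []
    browser.open(start_url)
    for _ in range(max_pages):              # cap guards against link loops
        results.extend(parse_page(browser.page))
        next_link = browser.page.select_one(next_selector)
        if next_link is None:               # no next link: last page reached
            break
        time.sleep(delay)
        browser.follow_link(next_link)      # resolves the href against browser.url
    return results
```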
Rate limiting and retries
Add time.sleep(1) between requests as a minimum courtesy delay. For sites that respond with 429 or 503, implement exponential backoff rather than hammering the server. The Python Requests retry guide covers the pattern in depth. For deeper pagination strategies like infinite scroll and cursor-based approaches, see the web scraping pagination guide. For saving results, see how to save scraped data.
Advanced data extraction
These patterns handle page structures more complex than the flat quote containers we've been working with.
Nested structures and chained selectors
When a page element contains sub-elements, chain select calls on the parent element rather than searching from browser.page each time. This scopes the search to the relevant container and breaks less often when the site's layout changes elsewhere on the page.
Table extraction
Many scraping targets present data in HTML tables (quotes.toscrape.com doesn't, but the pattern applies broadly). The following example iterates over table rows, skips the header, and guards against inconsistent column counts with a len(cells) check.
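That pattern might be sketched as follows, assuming the first row of the table is the header. The function name and the expected_columns parameter are illustrative.

```python
def extract_table_rows(table, expected_columns):
    """Return cell text for each data row of a table element."""
    rows = []
    for row in table.select("tr")[1:]:           # [1:] skips the header row
        cells = row.select("td")
        if len(cells) != expected_columns:       # skip ragged or spacer rows
            continue
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows
```

With a real page, table would come from something like browser.page.select_one("table").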
Safe attribute access
Use .get() rather than element["attr"] when the attribute may be absent, because .get() returns None while direct access raises KeyError. This matters for links (element.get("href")), images (element.get("src")), and custom data attributes (element.get("data-id")).
Exporting to CSV and JSON
The complete pipeline from a list of dicts to output files uses Python's csv.DictWriter for flat data and json.dump for nested structures.
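As a sketch of that pipeline, assuming records shaped like the quote dicts from earlier (text, author, and a list of tags): the helper names and the "|" separator for tags are illustrative choices.

```python
import csv
import json


def save_csv(records, path, fieldnames=("text", "author", "tags")):
    """Write flat rows; the tag list is joined into one cell."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for record in records:
            row = {**record, "tags": "|".join(record.get("tags", []))}
            writer.writerow(row)


def save_json(records, path):
    """Write the records as-is, keeping nested lists intact."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)
```

CSV has no native list type, which is why the tags are flattened there while the JSON output keeps them as a list.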
For more options, including database storage, see how to save scraped data. For post-extraction cleanup, the data cleaning guide covers common patterns.
Error handling and debugging
A scraper that works in testing and breaks silently in production is worse than one that fails loudly on the first run. MechanicalSoup has specific failure modes worth handling explicitly.
HTTP status errors
By default, browser.open() silently accepts non-200 responses. Setting raise_on_404=True catches 404s, but other status codes still pass through. Production scripts should check status codes after every request and branch on the common failure cases.
Form not found
The select_form() method raises mechanicalsoup.utils.LinkNotFoundError when the selector matches nothing. Before the call, use browser.page.select("form") to list all forms on the page for debugging. An empty list usually means the page loaded incorrectly or rendered a CAPTCHA instead of the expected content.
Network errors with retry
Transient failures (timeouts, connection resets, DNS errors) are inevitable at scale. Wrap browser.open() in a retry function with exponential backoff.
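A minimal retry wrapper could look like the sketch below. The function name and defaults are illustrative; retry_on is left to the caller, and with Requests you would typically pass requests.exceptions.RequestException or a narrower subset of it.

```python
import time


def open_with_retry(browser, url, retry_on, max_attempts=4, base_delay=1.0, timeout=15):
    """Call browser.open() with exponential backoff on the given exceptions."""
    for attempt in range(1, max_attempts + 1):
        try:
            return browser.open(url, timeout=timeout)
        except retry_on:
            if attempt == max_attempts:
                raise                                    # out of attempts: fail loudly
            time.sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...
```

The timeout keyword is forwarded to Requests, so a hung connection raises an exception and triggers a retry instead of blocking forever.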
The fastest debugging trick
When a selector returns None unexpectedly, print the raw HTML the browser received. If the output shows a CAPTCHA page, a redirect to a login page, or an empty body, the scraper has been detected or has navigated to the wrong page. This one check saves more debugging time than anything else.
Structured logging
Set up Python logging and record each URL, status code, and content length. Small responses (under a few hundred bytes) from pages that should be large are a reliable signal that you're getting blocked or served an error page.
Performance optimization
MechanicalSoup is lightweight by design, and a few configuration choices make a meaningful difference once you're scraping at volume.
Parser selection and connection reuse
lxml is 2-5× faster than html.parser for large pages, so set soup_config={"features": "lxml"} at instantiation and only fall back to html.parser in restricted environments. On the connection side, MechanicalSoup reuses the requests.Session connection pool automatically. Avoid creating a new StatefulBrowser() inside a loop, and instead create 1 instance and reuse it across pages.
Parallel scraping with ThreadPoolExecutor
MechanicalSoup is synchronous, but you can run multiple StatefulBrowser instances in separate threads for higher throughput. Each thread needs its own browser since StatefulBrowser is not thread-safe when shared.
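One way to give each thread its own browser is thread-local storage, sketched below. The function names are illustrative; make_browser is any zero-argument factory, such as a function that returns a configured StatefulBrowser.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_thread_state = threading.local()


def get_thread_browser(make_browser):
    """Create one browser per worker thread and reuse it afterwards."""
    if not hasattr(_thread_state, "browser"):
        _thread_state.browser = make_browser()
    return _thread_state.browser


def scrape_in_parallel(urls, make_browser, parse_page, max_workers=4):
    """Fetch URLs concurrently; return {url: parsed result}."""
    def worker(url):
        browser = get_thread_browser(make_browser)   # never shared across threads
        browser.open(url)
        return url, parse_page(browser.page)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(worker, urls))
```

Keeping max_workers small matters here: each worker is another source of requests against the same site, so the courtesy delays discussed earlier still apply.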
Know the ceiling
MechanicalSoup handles hundreds of pages per minute comfortably for sequential scraping of static pages (with appropriate delays). When you need thousands of concurrent requests, you've outgrown it. At that point, the right tools are an async HTTP client like httpx or a dedicated framework like Scrapy.
Full code: complete MechanicalSoup scraper with proxy support and pagination
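The sketch below combines the pieces discussed in this guide into one script: proxy settings from the environment, link-based pagination, and CSV export. It is one possible arrangement under stated assumptions (the PROXY_URL variable, the sandbox selectors, and all function names are illustrative), not a definitive implementation.

```python
import csv
import os
import time

START_URL = "https://quotes.toscrape.com/"
FIELDNAMES = ("text", "author", "tags")


def proxies_from_env(env_var="PROXY_URL"):
    """Build a Requests-style proxy dict, or {} when no proxy is configured."""
    proxy_url = os.getenv(env_var)
    return {"http": proxy_url, "https": proxy_url} if proxy_url else {}


def parse_quotes(page):
    """Extract one record per quote container on the page."""
    records = []
    for container in page.select("div.quote"):
        text = container.select_one("span.text")
        author = container.select_one("small.author")
        if text is None or author is None:
            continue                      # skip containers with missing fields
        tags = [tag.get_text(strip=True) for tag in container.select("a.tag")]
        records.append({
            "text": text.get_text(strip=True),
            "author": author.get_text(strip=True),
            "tags": "|".join(tags),
        })
    return records


def crawl(browser, start_url=START_URL, max_pages=20, delay=1.0):
    """Follow 'Next' links from start_url and collect every quote."""
    records = []
    response = browser.open(start_url, timeout=15)
    if response.status_code != 200:
        return records
    for _ in range(max_pages):
        records.extend(parse_quotes(browser.page))
        next_link = browser.page.select_one("li.next > a")
        if next_link is None:
            break
        time.sleep(delay)
        browser.follow_link(next_link)
    return records


def save_csv(records, path):
    """Write the collected records to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(records)


def main(output_path="quotes.csv"):
    """Entry point; call main() to run the scraper (needs network access)."""
    import mechanicalsoup  # third-party: pip install mechanicalsoup lxml

    browser = mechanicalsoup.StatefulBrowser(
        soup_config={"features": "lxml"},
        raise_on_404=True,
        user_agent="MyBot/1.0: contact@example.com",   # placeholder identity
    )
    browser.session.proxies.update(proxies_from_env())
    records = crawl(browser)
    save_csv(records, output_path)
    return len(records)
```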
MechanicalSoup vs. alternative tools
MechanicalSoup vs. Requests + Beautiful Soup
This is the comparison that matters most. Using Requests + Beautiful Soup directly gives you the same parsing capability but requires manual cookie management, session header tracking, and form field assembly. MechanicalSoup automates all of that.
The tradeoff is that MechanicalSoup adds a dependency and a layer of abstraction, while Requests + BS4 is more explicit and easier to customize for unconventional HTTP patterns. If your scraper fills forms or navigates multi-step workflows, MechanicalSoup saves real development time. If you're making a single GET request and parsing the HTML, raw Requests + BS4 is less overhead.
MechanicalSoup vs. Beautiful Soup (alone)
Beautiful Soup is a parser only, with no built-in HTTP client, session management, or form handling. MechanicalSoup includes Beautiful Soup internally, so the question is whether to add the stateful browser layer on top.
MechanicalSoup vs. Scrapy
Scrapy is a full crawling framework with spiders, middleware, pipelines, scheduling, and async HTTP. It's built for scale and significantly faster when crawling large sites. But it also comes with a learning curve and project structure that's overkill for a 50-line form automation script. Scrapy can submit forms through FormRequest.from_response(), but it has no stateful browser object to navigate with, so MechanicalSoup is usually the simpler fit for short workflows built around form interaction. For a deeper comparison, see Scrapy vs. Beautiful Soup.
MechanicalSoup vs. Playwright / Selenium
Playwright and Selenium control real browser engines and execute JavaScript. MechanicalSoup handles only raw HTML. For JavaScript-rendered content, single-page applications, or interactions requiring real browser events, Playwright or Selenium are required.
The resource gap is worth quantifying. A headless Chromium session uses 200-500 MB of RAM per instance. A MechanicalSoup session uses a few MB. If you're scraping a static site and spinning up Chromium to do it, you're paying a 100× resource premium for a capability you don't need.
Use MechanicalSoup when JavaScript execution is unnecessary, and use Playwright when you need it. For a head-to-head comparison, see Playwright vs. Selenium.
Decision summary
| Scenario | Tool |
| --- | --- |
| Static site with forms and session management | MechanicalSoup |
| Static site, no forms, one-off scrape | Requests + Beautiful Soup |
| Large-scale structured crawl of static sites | Scrapy |
| JavaScript-rendered content or real browser interaction | Playwright |
Best practices and security considerations
Responsible scraping
Check robots.txt before scraping. Python's urllib.robotparser module lets you programmatically verify whether scraping a specific path is permitted.
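A short sketch of that check using the standard library follows. The helper name and the "MyBot" user agent are illustrative; the robots.txt content is passed in as text so the function can be tried without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="MyBot"):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/private/report"))  # False
print(is_allowed(rules, "https://example.com/public/page"))     # True
```

To check a live site instead, RobotFileParser also offers set_url() followed by read(), which downloads and parses the file in one step.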
Keep these concepts in mind as well:
- Rate limiting. Implement time.sleep() between requests as a minimum. 1-2 seconds works for most sites. For sites with explicit rate limit headers (Retry-After), respect those values and use exponential backoff for retries.
- User agent. Always set a descriptive user agent string (e.g., "MyBot/1.0: contact@example.com"). Generic or missing user agents are more likely to be flagged.
Header hygiene
Set Accept-Language and Accept headers on the session to match a real browser request pattern, because requests missing these headers are easier to identify as automated traffic.
Security
- Credentials. Store proxy passwords and site login credentials in environment variables or a .env file loaded with python-dotenv (see the Proxy integration section), and keep them out of source code.
- Input sanitization. MechanicalSoup passes through raw HTML without sanitization or validation. If your scraper feeds extracted data into a database or template system, sanitize strings to prevent injection and avoid passing extracted HTML directly to eval() or similar functions.
- IP exposure. MechanicalSoup sends real HTTP requests with your origin IP, and the target site logs those IPs. Use proxies (see the Proxy integration section) when anonymity or scale is required.
Maintenance
Site HTML structures change, and selectors that work today may break after a redesign. Build scripts to fail loudly (raise exceptions, log clearly) rather than silently return empty data. Monitor output for unexpected empty results as an indicator of selector breakage.
For the detection landscape, the anti-bot systems guide provides broader context. When MechanicalSoup gets blocked despite best practices, Decodo Site Unblocker handles the unblocking layer automatically.
Final thoughts
MechanicalSoup does one thing well and knows where to stop. It's the right tool for Python scraping projects that involve forms, multi-step navigation, or session management on static sites, without the overhead of a full browser engine. The key design choice is exposing browser.session directly, giving you the full flexibility of requests.Session for proxy configuration, custom headers, cookie management, and connection pooling. If you already know Requests, you already know how to extend MechanicalSoup. Its two clear ceilings are JavaScript rendering (switch to Playwright) and high-volume concurrent crawling (switch to Scrapy or async HTTP clients). Knowing when to reach for a different tool is as valuable as knowing MechanicalSoup itself.
About the author

Justinas Tamasevicius
Director of Engineering
Justinas Tamaševičius is Director of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.


