Beautiful Soup Web Scraping: How to Parse Scraped HTML with Python
Web scraping with Python is a powerful technique for extracting valuable data from the web, enabling automation, analysis, and integration across various domains. Using libraries like Beautiful Soup and Requests, developers can efficiently parse HTML and XML documents, transforming unstructured web data into structured formats for further use. This guide explores essential tools and techniques to navigate the vast web and extract meaningful insights effortlessly.

Zilvinas Tamulis
Mar 25, 2025
14 min read
TL;DR
- Beautiful Soup is a Python library for parsing HTML and XML documents, making web scraping easier;
- The library helps extract unstructured web data and transform it into structured formats for analysis;
- Requests can be used in conjunction with Beautiful Soup to manage HTTP requests during web scraping;
- Automation with Beautiful Soup simplifies data extraction, reducing manual effort;
- The parsed data can be saved in structured formats like CSV or JSON for further use.
What is web scraping?
Web scraping is the automated process of extracting data from websites using code or dedicated software. It allows you to collect and organize information from the vast resources available online without much effort or manual labor.
The value of web scraping lies in its ability to gather large amounts of data that can then be analyzed and converted into valuable information. Organizations use it for various purposes, such as market analysis, competitor monitoring, and customer sentiment analysis, allowing them to stay ahead of the competition. Individuals can make use of it as well, using web scraping to track the best shopping deals, get personalized recommendations, or even land their next job.
Essential tools for web scraping
Python is the leading programming language for web scraping due to its simplicity, readability, and extensive support for data extraction. While languages like JavaScript, R, and PHP offer web scraping capabilities, Python stands out for its ease of use and compatibility with various libraries designed for handling web data.
Among these libraries, Requests is essential for making HTTP requests, allowing users to retrieve web pages efficiently. It seamlessly integrates with tools like Beautiful Soup for parsing HTML and Scrapy for large-scale data extraction, making it a cornerstone of web scraping workflows. For a deeper look into how Requests simplifies web scraping and handles HTTP requests, check out this detailed guide on its features and usage.
What is data parsing?
Data parsing is the process of analyzing data to extract meaningful information or convert it into a more structured format. When speaking about web content, this data usually comes in the form of HTML documents. They are made up of many elements that hold everything together, and while they’re the building blocks of a website, we only care about the information stored in between. Through data parsing, we analyze these files to find data, clean it, and then put it into an easy-to-read format, such as a CSV or JSON file, for further analysis and use.
Parsing is an essential part of the data collection process. The cleaned data can be used for analysis and statistics, providing valuable insights for your personal or business needs. Another benefit of data parsing is that it can combine data from various sources, allowing you to create new and diverse datasets. For example, when gathering data from eCommerce websites, a parser can find, match, and average the prices of competitor products. Knowledge like this can help you make informed decisions on pricing products on your website and stay ahead in the market.
One more cool thing is that collecting and processing data can be fully automated with intuitive functions. It reduces the need for manual data entry and manipulation, saving time and reducing the risk of human error. This means you’ve got a little elf watching your competitors, analyzing them, and providing valuable insights 24 hours a day, 365 days a year, with no coffee breaks or extended vacations. Beating your competition while asleep is quite a flex, don’t you think?
What is Beautiful Soup?
If you’ve already been hooked on the idea of data parsing, you probably tried to open an HTML document, read it for information, and felt like you were trying to decipher ancient hieroglyphs. You might’ve also looked at it and seen it as a bowl of soup – a bunch of ingredients thrown together, sliced, boiled, and cooked. Together, they make up a tasty meal, but if you tried to pick out every piece of carrot (for whatever reason), you might’ve realized what a complex job that is.
Beautiful Soup comes to the rescue to make your soup, well, more beautiful. It’s a Python library commonly used to scrape and parse HTML and XML documents with tools to navigate and search the content inside them. Beautiful Soup makes it easier for developers to extract and work with data in a more structured and readable format.
Installation and setup
Beautiful Soup installation is quick and easy. First, you’ll need Python on your computer or in a virtual environment, along with the pip package manager for installing Python libraries. Since Python 3.4, pip is included by default. To check whether you have it, run pip --version in the Terminal.
Then, to install Beautiful Soup, run pip install beautifulsoup4 requests in the Terminal. As you'll also need to get data from the internet, this command installs Requests as well.
That’s it! The Python package should take a few seconds to install and will be ready for use in your web scraping project.
Beautiful Soup basics
You’ll want to test if everything works correctly first. Write a short script to extract data from the following HTML file:
Create a new text file and name it website.html. Open the file in a text editor, paste the above code, and save it.
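If you’d rather generate the file from Python, here’s a minimal sketch that writes a stand-in website.html. The markup is an assumption: only the h1 element with the main-heading class matches the output described in this section, while the title and paragraph are placeholders.

```python
from pathlib import Path

# Stand-in markup: only the <h1 class="main-heading"> line is taken from
# this guide; the title and paragraph are hypothetical placeholders.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <title>Sample page</title>
  </head>
  <body>
    <h1 class="main-heading">What is a residential proxy?</h1>
    <p class="intro">Placeholder paragraph for testing.</p>
  </body>
</html>
"""

# Write the file into the current working directory.
html_path = Path("website.html")
html_path.write_text(SAMPLE_HTML, encoding="utf-8")
print(f"Wrote {html_path.resolve()}")
```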
Next, create a Python script file in the same directory. Name it whatever you want, like your favorite type of soup, for example, miso_soup.py. Here’s the code snippet:
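A minimal sketch of such a script, assuming website.html sits in the current directory and contains an element with the main-heading class. The function name find_main_heading is only illustrative.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def find_main_heading(html):
    """Parse an HTML string and return the first element with the main-heading class."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ (with a trailing underscore) is used because class is a Python keyword.
    return soup.find(class_="main-heading")


html_file = Path("website.html")
if html_file.exists():
    print(find_main_heading(html_file.read_text(encoding="utf-8")))
else:
    print("website.html not found in the current directory")
```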
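Run the code from the Terminal, in the directory that holds both files, with python miso_soup.py (on some systems the command is python3 miso_soup.py).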
The code will read the HTML document, parse it, find the element with a main-heading class, and print <h1 class="main-heading">What is a residential proxy?</h1> in your Terminal.
If you’ve received the same result, that's fantastic! Your setup works fine, and you’re ready to move on to the next steps. However, if you run into any issues, make sure to check if:
- you’re using a recent version of Python that comes with pip, and Beautiful Soup is installed;
- the miso_soup.py and website.html files are in the same directory (folder);
- you’re running the Python Terminal command in the same directory as your files;
- there aren’t any typos, minor spelling mistakes or other errors inside the code or file names.
That’s all for the basics. Next, we’ll parse data from an actual website, where you’ll see how we can use what you’ve already learned and expand upon it.
Parse data with Beautiful Soup
To understand what it’s like to scrape and parse a website, we’ll use the ScrapeMe shop as our target. Don’t worry; it’s not a real thing – it’s an example website built to test scripts, which functions exactly as a regular online shop would. The only difference is that it sells Pokémon. It’s a perfect place to start, as real websites might have anti-scraping measures, such as CAPTCHAs or rate limitations, to prevent multiple automated requests. They can, however, be circumvented with the help of proxies, which we’ll use in our real-world example later.
To get data from the web, you'll need the Requests library that you installed together with Beautiful Soup. Import it at the start of your script with import requests.
Once more, let’s test if our setup and libraries work. We’ll take the previously written code and modify it so that it scrapes data from an actual website instead of a file:
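Here’s a sketch of what that could look like. The shop address in SHOP_URL is an assumption, so adjust it if the site lives elsewhere; the helper names fetch_html and make_soup are illustrative as well.

```python
import requests
from bs4 import BeautifulSoup

# Assumed address of the ScrapeMe test shop; adjust if it differs.
SHOP_URL = "https://scrapeme.live/shop/"


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP error codes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def make_soup(html):
    """Turn raw HTML into a searchable BeautifulSoup object."""
    return BeautifulSoup(html, "html.parser")


try:
    print(make_soup(fetch_html(SHOP_URL)))
except requests.RequestException as error:
    # Covers connection problems, timeouts, and HTTP error responses.
    print(f"Request failed: {error}")
```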
After you run the above code, you should have the entire HTML page printed in your Terminal. Let’s look at how you can modify this code to extract specific HTML elements from a page and get more relevant information.
Find elements by ID or class name
The most common task in web parsing is to find an element within the soup of HTML. Luckily, most elements have attributes that help to identify them from one another. These can be anything the developer decides them to be, but the most common are ID and class attributes.
Let’s try a simple task – list the names of all Pokémon on the main page. Inspect the website’s source using Chrome developer tools (right-click on a page → Inspect) or your web browser equivalent. A window will open that displays the HTML of the page. The first item on the list is Bulbasaur. You can try to find it mentioned in the code or use the element selector and then click the name on the site. This will jump to the HTML part where the element is located.

The HTML element you need has a class named woocommerce-loop-product__title. If you check the other products, you can see they all have the same HTML class name. Let’s write a simple loop that will go through these HTML elements and print them out:
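One possible version of that loop is sketched below. To keep it runnable offline, it parses a trimmed-down stand-in for the shop page (SAMPLE_HTML is hypothetical markup); only the class name comes from the page itself.

```python
from bs4 import BeautifulSoup

# Class name shared by the product titles on the shop page.
TITLE_CLASS = "woocommerce-loop-product__title"

# Trimmed-down stand-in for the shop page (hypothetical markup).
SAMPLE_HTML = """
<ul class="products">
  <li><h2 class="woocommerce-loop-product__title">Bulbasaur</h2></li>
  <li><h2 class="woocommerce-loop-product__title">Ivysaur</h2></li>
  <li><h2 class="woocommerce-loop-product__title">Venusaur</h2></li>
</ul>
"""


def product_names(html):
    """Return the text of every element that carries the product title class."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all(class_=TITLE_CLASS)]


for name in product_names(SAMPLE_HTML):
    print(name)
```

To run it against the live shop, pass the text of a Requests response to product_names instead of SAMPLE_HTML.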
You’ll get a list of 16 items in your Terminal. Congratulations, you’ve just parsed data from an actual website! You can play around with the script to see what other content you can extract by changing which class Beautiful Soup should try to find.
Search multiple web pages
You’ve managed to print the names of 16 Pokémon, but what if you wanna catch ‘em all? The issue that stands in our way is that the store is separated into several HTML pages, and our script isn’t smart enough to navigate them to get everything.
Let’s help it out and check how the links work. Whenever you visit a new page, the URL changes to …/shop/page/x/ where x is the page number. This is great for us, as we can modify our script to iterate through every page simply by changing the number in the URL. This is how it can be done:
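A minimal sketch of the idea follows. The full URL in PAGE_URL is an assumption built on the …/shop/page/x/ pattern, and the number of pages is left as a parameter, since you’d read it from the shop’s pagination. The fetch argument is a hypothetical hook that lets the demo run on fake pages without touching the network.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL pattern, based on the .../shop/page/x/ structure.
PAGE_URL = "https://scrapeme.live/shop/page/{page}/"
TITLE_CLASS = "woocommerce-loop-product__title"


def fetch_page(page):
    """Download one shop page by its number and return the HTML."""
    response = requests.get(PAGE_URL.format(page=page), timeout=10)
    response.raise_for_status()
    return response.text


def names_from_pages(pages, fetch=fetch_page):
    """Collect product names from every page number in pages."""
    names = []
    for page in pages:
        soup = BeautifulSoup(fetch(page), "html.parser")
        names.extend(
            tag.get_text(strip=True) for tag in soup.find_all(class_=TITLE_CLASS)
        )
    return names


# Offline demo: a dictionary of fake pages stands in for the website.
FAKE_PAGES = {
    1: '<h2 class="woocommerce-loop-product__title">Bulbasaur</h2>',
    2: '<h2 class="woocommerce-loop-product__title">Ivysaur</h2>',
}
print(names_from_pages([1, 2], fetch=FAKE_PAGES.get))
```

Against the real shop, you’d call names_from_pages(range(1, last_page + 1)) with the default fetcher, where last_page is the highest page number shown in the pagination.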
The script's premise is simple, using what we already know from the previous example with an added loop. It is, however, handy to have an idea of how it’s done, as many websites will most likely not have all the information you need on their first page. If only it were that easy!
Find all links
Another widespread usage of data parsing is finding all the links in a website, otherwise known as web crawling. This can be useful for various reasons, such as indexing, seeing what your competitors are linking to, their website structure, and many other use cases.
Remember attributes? Well, this time, instead of using them as an identifier for HTML elements, we’re going to be looking for them instead. The most common way of linking is through the href attribute. Therefore, our task is very straightforward – find all the hrefs inside the HTML web page and print their content:
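A short sketch of that task, run here on a hypothetical snippet of markup so it works offline; the same function can be fed the HTML of any fetched page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two real links and one anchor without an href.
SAMPLE_HTML = """
<nav>
  <a href="https://example.com/shop/">Shop</a>
  <a href="/cart/">Cart</a>
  <a name="top">Anchor without a link</a>
</nav>
"""


def find_links(html):
    """Return the href value of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags that actually carry an href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


for link in find_links(SAMPLE_HTML):
    print(link)
```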
Find children of an HTML element
Some websites just don’t want to be friends with your scripts. While it can be an absolute breeze to scrape and parse a website with clean, accurate class attributes, there are many cases where an HTML element will not have an attribute. This will require a different approach – we must target the parents and their children!
Don’t worry, it’s not as bad as it sounds. In an HTML page, you might’ve noticed that an element will have multiple other elements under it, such as paragraphs, tables, or lists. The one that holds them all together is called a parent, while the items under it are called children. Lucky for us, Beautiful Soup has a way of navigating through this structure, making reaching even the darkest crevices possible.
Let’s take a look at our web page in the developer tools. On the right sidebar are several links; if you inspect them, you can see that they’re structured as an unordered list without any attributes:

Say you want to get the Comments RSS link – how would you tell your script how to find it? Here’s how:
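One way to approach it is sketched below: start from the nearest ancestor that can be identified, step down to its list, and walk through the list items. The markup, the secondary id, and the link addresses are all assumptions for illustration, so check the real page in the developer tools before reusing them.

```python
from bs4 import BeautifulSoup

# Hypothetical sidebar: the list and its items have no attributes,
# so only the wrapping element can be targeted directly.
SAMPLE_HTML = """
<div id="secondary">
  <ul>
    <li><a href="/login/">Log in</a></li>
    <li><a href="/feed/">Entries RSS</a></li>
    <li><a href="/comments/feed/">Comments RSS</a></li>
  </ul>
</div>
"""


def find_child_link(html, parent_id, link_text):
    """Return the href of the list link with the given text, or None."""
    soup = BeautifulSoup(html, "html.parser")
    parent = soup.find(id=parent_id)
    if parent is None or parent.ul is None:
        return None
    # recursive=False limits the search to direct children of the list.
    for item in parent.ul.find_all("li", recursive=False):
        link = item.find("a")
        if link is not None and link.get_text(strip=True) == link_text:
            return link.get("href")
    return None


print(find_child_link(SAMPLE_HTML, "secondary", "Comments RSS"))
```

Matching on the link text rather than on a fixed position in the list keeps the script working if the site adds or reorders items.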
Scraping & parsing HTML data from a real-world website
While we’ve had fun in the playground, it’s time to enter the real world. That might sound scary, but the reality is that most actual websites function exactly like the one we used in our examples. They follow the same HTML structure, contain URLs, specific elements with identifiable attributes, etc. The key difference is that they protect themselves from scraping their data.
A website can track how many requests are made from a single IP address. Humans are usually slow, take their time, and read pages properly. A script doesn’t do that – it goes fast, gets what it needs, and moves to the next task. This behavior is easy to identify: your script makes many requests in quick succession on your behalf, which puts you at risk of having your IP blocked from accessing the website.
The solution? Proxy servers. A proxy server acts as an intermediary between you and your target, allowing you to make requests from an IP address different from your own. This lets you spread your requests across several separate IP addresses, which makes them much harder to block or trace back to a single source.
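With Requests, a proxy is passed through the proxies argument. The sketch below shows the general shape; the host, port, and credentials are hypothetical placeholders, so substitute the values from your own proxy provider.

```python
from urllib.parse import quote

import requests


def build_proxies(username, password, host, port):
    """Build the proxies mapping that Requests expects."""
    # Percent-encode credentials so characters like @ or : don't break the URL.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    proxy_url = f"http://{user}:{secret}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, proxies, timeout=10):
    """Download a page through the given proxy and return its HTML."""
    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.text


# Hypothetical placeholder values, not a real endpoint.
proxies = build_proxies("USERNAME", "PASSWORD", "proxy.example.com", 8080)
print(proxies["https"])
```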
Scraping APIs are also a great way to get data efficiently. For example, Decodo's APIs include the aforementioned proxies, as well as handle IP rotation, JavaScript rendering, and overcoming any issues that might stand in the way between you and information.
A popular target to scrape and parse data from is Amazon. It’s a vast marketplace with heaps of valuable information on products, purchases, pricing, reviews, and more data that can be incredibly valuable for any business or curious mind. Check out our Amazon Scraper API and how to scrape Amazon product data and parse product titles, prices, and reviews.
Best practices
Knowledge is power; you can never know too much about data parsing. Here are some valuable tips that you should keep in mind while writing parsing scripts:
- Many websites will employ anti-scraping and parsing measures to prevent you from sending too many requests from one IP address. Routing requests through proxies reduces this problem, as you’ll be able to make requests from multiple locations worldwide without exposing your own IP address;
- A parsed result may not always be clean – therefore, it’s helpful to apply methods such as result.text.strip(), which returns the element’s text without HTML tags or leading and trailing whitespace;
- Do you copy and paste all your parsed data from the Terminal into a notepad, pass it on to your data analysts, and wonder why they’re giving you a side-eye? Well, they can probably tell you why, and that’s because it’s always easier to work with data presented in structured CSV, XML, JSON, or similar files. Improve your code and have it write data into a new file instead of a Terminal;
- Make sure to follow the rules in a website’s robots.txt file. This file states which pages may be crawled and scraped and which should be left alone. Respect the robots!
- Some pages might not be easy to iterate in your script – they won’t have numbering in their links or follow a logical structure. Therefore, you might need to prepare an array of links and loop through them instead. Prepare a text file with all of the links and have your script read from it, make a request, parse the response, and go to the next link.
- A dynamic website can be harder to scrape, as it might not provide a full HTML response as soon as it is loaded. As more and more websites are being built with the help of JavaScript, this is becoming a more frequent occurrence. Requests doesn’t execute JavaScript, so a delay in your script won’t help on its own; you’ll need a headless browser built for handling dynamic content, which can wait for the page to finish rendering before you parse it.
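A few of these tips can be sketched together: cleaning the parsed text, saving the results to CSV and JSON files, and checking robots.txt rules. The sample markup, the file names, the my-scraper user agent, and the robots.txt rules are all hypothetical; the robots check uses Python’s built-in urllib.robotparser module.

```python
import csv
import json
from urllib.robotparser import RobotFileParser

from bs4 import BeautifulSoup


def clean_text(tag):
    """Return the tag's text without markup or surrounding whitespace."""
    return tag.text.strip()


def save_rows(rows, csv_path, json_path):
    """Write a non-empty list of dicts (same keys in each) to CSV and JSON."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical markup with messy whitespace around the names.
SAMPLE_HTML = '<h2 class="title">\n   Bulbasaur \n</h2><h2 class="title"> Ivysaur</h2>'
# Hypothetical robots.txt rules.
ROBOTS_TXT = "User-agent: *\nDisallow: /cart/\n"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [{"name": clean_text(tag)} for tag in soup.find_all("h2")]
save_rows(rows, "products.csv", "products.json")
print(rows)
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/shop/"))
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/cart/"))
```

In a real script, you’d download the site’s own robots.txt file and pass its text to is_allowed before requesting a page.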
Final thoughts
In this article, you’ve gained a basic understanding of the benefits of data parsing, learned how to use Beautiful Soup, and explored examples of how it can be used for parsing web information. We’ve only touched the basics, and there are still many things you can expand your knowledge on. If you wish to continue your parsing journey, read the official Beautiful Soup documentation to learn about it in more depth. You should also check out our fantastic proxy options to help you scrape the web without any risks or issues!
About the author

Zilvinas Tamulis
Technical Copywriter
A technical writer with over 4 years of experience, Žilvinas blends his studies in Multimedia & Computer Design with practical expertise in creating user manuals, guides, and technical documentation. His work includes developing web projects used by hundreds daily, drawing from hands-on experience with JavaScript, PHP, and Python.
All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.



