Challenges in traditional web scraping

Getting information from a web page with traditional web scraping tools like Python's Requests and Beautiful Soup used to be pretty simple: choose a target URL, send a request to download the HTML, extract the data points you want, and adjust the scraper as needed. Sounds easy enough?

It was, until website owners started deploying rigorous anti-bot systems. According to Imperva's annual Bad Bot Report (2023), up to 30% of web traffic comes from malicious bots. Consequently, protecting data from unauthorized visitors has become a critical task for websites. However, these anti-bot measures also disrupt the web scraping process: traditional scripts are harder to keep working, and collecting even public information now takes more knowledge and resources.
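The traditional workflow described at the start of this section (pick a URL, download the HTML, extract data points) can be sketched roughly as follows. This is a minimal illustration that uses only Python's standard library so it runs without extra installs; Requests and Beautiful Soup would play the same roles as `urlopen` and the parser class here. The helper names (`HeadingCollector`, `extract_headings`, `scrape`) and the choice of `<h2>` as the target element are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class HeadingCollector(HTMLParser):
    """Collect the text content of every <h2> element (hypothetical target)."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) is still part of the heading.
        if self._in_h2:
            self.headings[-1] += data


def extract_headings(html: str) -> list[str]:
    """Step 3: extract the data points from downloaded HTML."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


def scrape(url: str) -> list[str]:
    """Steps 1 and 2: request the target URL and download its HTML."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_headings(html)


# Offline demonstration on an inline page, so no network access is needed.
sample = (
    "<html><body><h2>First post</h2><p>text</p>"
    "<h2> Second <em>post</em></h2></body></html>"
)
print(extract_headings(sample))
```

Calling `scrape()` on a real URL follows the same path, and that is exactly the kind of plain, unthrottled request that the protection methods below are designed to catch.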

Main protection methods that hinder a scraping project's success

Here are a few examples of the main protection methods used by websites.

Rate limiting is a popular technique for controlling traffic flow to a website. The website owner chooses an identifier, such as an IP address, to monitor visitors. When you connect to a page without using a proxy or a VPN, the site can see your real IP address and approximate location, and it can restrict the number of requests you send to the server within a certain time frame. The limit varies from site to site: it might be, for example, 10 requests per second or 100 requests per minute.
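To make the mechanism concrete, here is a minimal sketch of how a server might enforce a per-IP limit with a sliding window. It is an illustration under stated assumptions, not how any particular website implements it: the class name `SlidingWindowLimiter`, the in-memory storage, and the example limit are all hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client key."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        # Client key (e.g. an IP address) -> timestamps of recent requests.
        self._hits = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # a real server would typically respond with HTTP 429
        hits.append(now)
        return True


# Example: 3 requests per second per IP (timestamps passed in explicitly).
limiter = SlidingWindowLimiter(limit=3, window=1.0)
results = [limiter.allow("203.0.113.7", t) for t in (0.0, 0.1, 0.2, 0.3)]
print(results)
```

In this run the fourth request from the same address is refused, while a request from a different address at the same moment would still be accepted. That is why scrapers that send everything from one IP are the first to be throttled.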

Browser fingerprinting tracks dozens of hardware and software parameters, all of which you need to account for while web scraping. For example, if you’re using an HTTP client like Requests or Axios, you’ll need to emulate headers such as the user agent so that your requests look like they come from a real browser. This requires constant adjustment and maintenance to avoid detection and blocking by websites.
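As a small illustration of header emulation, the sketch below attaches browser-like headers to a request object without sending it. It uses the standard library's `urllib.request.Request`; the same dictionary could be passed as the `headers=` argument to `requests.get`. The header values and the target URL are illustrative assumptions, and headers are only one of many signals a fingerprinting system may inspect.

```python
from urllib.request import Request

# Illustrative values only; real browsers send version-specific strings
# that change over time and must be kept up to date.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url: str) -> Request:
    """Create a request that carries browser-like headers (not sent here)."""
    return Request(url, headers=BROWSER_HEADERS)


# example.com is a placeholder target.
request = build_request("https://example.com/")
# urllib stores header names in capitalized form, hence "User-agent".
print(request.get_header("User-agent"))
```

Without this step, an HTTP client announces itself with its default user agent (for example `Python-urllib/3.x` or `python-requests/x.y.z`), which is trivial for a website to flag.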

CAPTCHAs are probably the most widely used method across different platforms. Their challenges, which are based on passive and behavioral analysis, are often too complicated for bots to handle. If you’re writing the code yourself, you’ll need to either use a CAPTCHA-solving service, which can be slow and expensive, or avoid triggering the challenge altogether, which requires web scraping expertise.