
Groovy Web Scraping: HTTP Requests, DOM Parsing, and Headless Browsers


By blending Java’s massive ecosystem with a scripting-friendly syntax, Groovy is a practical option for web scraping on the JVM. This guide shows you how to fetch pages with the Jodd HTTP client, parse HTML with Jodd Lagarto and Jerry, manage sessions, and automate browsers with Selenium. You'll also learn how to configure proxies for real-world, block-resistant scraping.


TL;DR

  • Use Jodd HTTP to send GET and POST requests, manage sessions, and retrieve page content with minimal boilerplate.
  • Parse HTML with Jodd Lagarto and Jerry to extract text, links, attributes, and paginated data using CSS selectors.
  • Use Selenium and a headless browser when websites load content through JavaScript instead of returning it in the initial HTML response.
  • Improve scraper reliability with residential proxies, request throttling, retries, structured data storage, and ongoing selector monitoring.

Prerequisites and environment setup

Before we start web scraping with Groovy, let’s prepare the tools. We need a Java runtime, the Groovy SDK, and libraries for making HTTP requests and parsing HTML.

Install Java and Groovy

First, install JDK 11 or later from Oracle (the Selenium 4.21 release used later in this guide requires Java 11+) and the Groovy SDK from the Apache Groovy website. Then, verify that both are installed correctly by opening a terminal and running the commands below, which print the installed versions:

java -version
groovy -version

Manage dependencies with Grapes

Groovy resolves dependencies through Grapes, a built-in system that downloads libraries automatically, so you don’t have to create a Maven or Gradle project. Just add the @Grab annotation at the top of a script. When the script runs, Groovy resolves and downloads the dependencies it needs.

For example, this script pulls in Jodd HTTP automatically:

@Grab('org.jodd:jodd-http:6.3.0')
import jodd.http.HttpRequest
def response = HttpRequest
    .get("https://httpbin.org/get")
    .send()
println response.statusCode()

Add Jodd HTTP and Lagarto

We’ve mentioned Jodd libraries, and you’ll need two of them for this process. The first is Jodd HTTP to send HTTP requests and handle responses, and the second is Jodd Lagarto to parse HTML and extract data from web pages.

Import both of these libraries directly with Grapes:

@Grab('org.jodd:jodd-http:6.3.0')
@Grab('org.jodd:jodd-lagarto:6.3.0')

Choosing the right scraping approach

Task                     | Recommended tool
-------------------------|----------------------------
HTTP requests            | Jodd HTTP
HTML parsing             | Jodd Lagarto + Jerry
JSON APIs                | JsonSlurper
Login-protected pages    | Jodd HTTP + cookies
JavaScript-heavy sites   | Selenium
Production scraping      | Proxies / Web Scraping API

Configure proxies

At this point, you’re nearly ready. The last step is proxy configuration for production scraping. Residential proxies spread your traffic across many real IP addresses, which makes it harder for websites to rate-limit requests coming from a single IP and helps you get around geo-restrictions. You can also choose between rotating and sticky sessions.

As a residential proxy solution, we strongly recommend Decodo, which offers 115M+ ethically-sourced IPs in 195+ locations and the best response time in the market. Additionally, Decodo’s rotating proxies give your project new IPs for every request, helping you to bypass blocks, CAPTCHAs, and geo-limits, while also integrating with any Java HTTP client.

To set up Decodo residential proxies with rotating IPs:

  1. Register or log in to the Decodo dashboard.
  2. Find residential proxies, then choose a subscription or start a 3-day free trial.
  3. Go to Proxy setup.
  4. Select a location or choose Random.
  5. Set the rotating session type and choose a protocol (HTTP(S) or SOCKS5).
  6. Choose the authentication type.
  7. Download the generated endpoint and credentials. Alternatively, copy them into your scraper, browser, or software.


Enhance your scraper with proxies

Claim your 3-day free trial of residential proxies and explore 115M+ ethically-sourced IPs, advanced geo-targeting options, a 99.86% success rate, an average response time under 0.6s, and more.

Don’t hardcode your Decodo credentials in scripts. Instead, store them in environment variables. This is how: 

export PROXY_USERNAME="YOUR_PROXY_USERNAME"
export PROXY_PASSWORD="YOUR_PROXY_PASSWORD"

After that, you can access your credentials from Groovy:

def proxyUser = System.getenv("PROXY_USERNAME")
def proxyPass = System.getenv("PROXY_PASSWORD")
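
Because System.getenv() returns null for a variable that isn't set, a missing credential tends to surface later as a confusing authentication error. A minimal sketch of a fail-fast check is shown below. The requireEnv helper and its env parameter are hypothetical names introduced for illustration; they are not part of Groovy or Jodd.

```groovy
// Hypothetical helper: returns the variable's value or fails with a clear message.
// The env parameter defaults to the real environment; passing a map makes it easy to test.
String requireEnv(String name, Map<String, String> env = System.getenv()) {
    def value = env[name]
    if (!value?.trim()) {
        throw new IllegalStateException("Environment variable ${name} is not set")
    }
    return value
}

// Demonstration with an in-memory map instead of real credentials
println requireEnv("PROXY_USERNAME", [PROXY_USERNAME: "demo-user"])

// In a scraper script you would rely on the real environment:
// def proxyUser = requireEnv("PROXY_USERNAME")
// def proxyPass = requireEnv("PROXY_PASSWORD")
```

Failing at startup keeps the error close to its cause, which is easier to debug than a rejected proxy connection several requests later.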

Organize and run your scripts

You can keep a separate script for each example, which is easier to follow while learning. Another option is to keep everything inside a single Groovy class, which works better if you’re building a reusable scraping toolkit.

Finally, to run a script, save it as scraper.groovy and then execute with:

groovy scraper.groovy

Sending GET and POST requests with Groovy and Jodd HTTP

HTTP requests are the basis of most web scraping projects. In this section, we’ll look at two request types in more detail. The first is a GET request, which retrieves data from a target page, and the second is a POST request, which sends data to a server, typically to an API or a form handler.

You can send GET and POST requests very easily in Groovy by utilizing the Jodd HTTP library and its concise API. Jodd HTTP is a lightweight alternative to larger libraries, and you can combine it really well with Groovy's standard JSON tools.

Besides these, we’ll also use httpbin.org. This is a very useful, public testing service that developers use for learning and debugging HTTP clients.

First, let’s send a GET request. The example below requests a page and prints the response body. In it, HttpRequest.get() creates an HTTP GET request to httpbin.org. Then, the send() method executes that request and returns a response. That response body actually contains the content that the server returned. Meanwhile, statusCode() and header() provide more response details. 

@Grab('org.jodd:jodd-http:6.3.0')
import jodd.http.HttpRequest
def response = HttpRequest
    .get("https://httpbin.org/get")
    .send()
println response.bodyText()

Jodd also gives access to response metadata, including status codes and headers:

println "Status: ${response.statusCode()}"
println "Content-Type: ${response.header("Content-Type")}"
println response.bodyText()

The output looks like this (abbreviated):

Status: 200
Content-Type: application/json
{
"url": "https://httpbin.org/get"
}

Note: Check the status code when scraping the website (before you try to parse the content) to confirm that the page loaded successfully.
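
One way to act on that note is to classify the status code before any parsing happens. The sketch below uses two hypothetical helpers, isSuccess and isRetryable, which are names made up for this example. Treating 429 and 5xx responses as temporary is a common convention, not a rule every site follows.

```groovy
// Hypothetical helpers that classify an HTTP status code before parsing.
boolean isSuccess(int status) {
    status >= 200 && status < 300
}

// 429 (Too Many Requests) and 5xx errors are usually temporary, so a retry may help.
boolean isRetryable(int status) {
    status == 429 || (status >= 500 && status < 600)
}

[200, 404, 429, 503].each { status ->
    println "${status}: success=${isSuccess(status)}, retryable=${isRetryable(status)}"
}
```

With Jodd HTTP, you would pass response.statusCode() to these helpers and only call bodyText() for parsing when isSuccess returns true.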

Next, let’s send a POST request.

A POST request sends data to the server as part of the request. In the example below, each form() call adds one form field, and both fields are submitted with the request, much like filling out and submitting a form in a browser. The server receives these values and, in the case of httpbin.org, includes them in its response.

@Grab('org.jodd:jodd-http:6.3.0')
import jodd.http.HttpRequest
def response = HttpRequest
    .post("https://httpbin.org/post")
    .form("search", "groovy scraping")
    .form("page", "1")
    .send()
println response.bodyText()

httpbin.org returns submitted data back to the client. This allows you to easily confirm that forms are being sent correctly before you begin working with real websites or APIs.

Finally, let’s parse a JSON response, since most APIs return JSON rather than HTML.

Groovy’s JsonSlurper converts JSON into native maps and lists, so you don’t need to search through raw text manually. You can reference individual fields directly, such as json.url or json.headers.Host, which makes it easier and faster to extract specific values from API responses and feed them into scraping workflows.

@Grab('org.jodd:jodd-http:6.3.0')
import groovy.json.JsonSlurper
import jodd.http.HttpRequest
def response = HttpRequest
    .get("https://httpbin.org/get")
    .send()
def json = new JsonSlurper().parseText(response.bodyText())
println json.url
println json.headers.Host
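
Real API responses don't always contain every field you expect. The offline sketch below parses a hard-coded sample shaped like an httpbin.org/get response and uses Groovy's safe-navigation operator (?.) to read fields that may be absent. The data.items path is a made-up example of a missing field.

```groovy
import groovy.json.JsonSlurper

// Sample payload shaped like an httpbin.org/get response (hard-coded for illustration)
def body = '''
{
  "args": {},
  "headers": { "Host": "httpbin.org" },
  "url": "https://httpbin.org/get"
}
'''

def json = new JsonSlurper().parseText(body)

// Safe navigation (?.) yields null instead of throwing when a key is absent
def host = json.headers?.Host
def missing = json.data?.items?.first()

println host      // prints: httpbin.org
println missing   // prints: null
```

Checking for null before using a value keeps one incomplete response from crashing a long scraping run.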

Meanwhile, Jodd HTTP provides method chaining, which is a significant advantage, allowing multiple request options to be joined into one readable statement:

@Grab('org.jodd:jodd-http:6.3.0')
import jodd.http.HttpRequest
def response = HttpRequest
    .post("https://httpbin.org/post")
    .form("query", "groovy")
    .header("User-Agent", "Groovy Scraper")
    .send()
println response.bodyText()

This example joins several request settings into one chain of method calls. The request sends form data and sets a custom User-Agent header, which identifies the client making the request. Method chaining improves readability by keeping related configuration in one place, which matters more as requests become more complex, for example when you add authentication headers, cookies, proxy settings, or additional request parameters.

The same workflow in plain Java is more verbose: it requires more object creation, type declarations, and exception handling, so more of your code ends up being infrastructure. Groovy removes much of that boilerplate.

Analyzing the DOM tree and extracting data with Jodd Lagarto

In this section, we’ll learn how to parse fetched HTML into a navigable DOM and also how to extract specific elements we targeted. 

Once you have finished fetching a page, you’ll move to extracting the data you need, and this is where DOM parsing comes in to save the day. The Document Object Model (DOM) is a tree representation of an HTML document that makes elements accessible through selectors.

Jodd Lagarto (one of the two libraries introduced above) includes Jerry, a jQuery-inspired API for navigating and querying HTML, sometimes described as “jQuery for Java”. We’ll use it throughout this section.

If you've worked with Java-based scraping before, you may be familiar with Jsoup. Both libraries support HTML parsing and CSS selectors, but this guide uses Jodd Lagarto and Jerry because they integrate naturally with the Jodd ecosystem already used for HTTP requests. Jsoup remains a popular alternative and is often preferred in projects that don't rely on other Jodd components.

Start by converting raw HTML into a DOM object:

@Grab('org.jodd:jodd-http:6.3.0')
@Grab('org.jodd:jodd-lagarto:6.3.0')
import jodd.http.HttpRequest
import jodd.jerry.Jerry
def html = HttpRequest
    .get("https://news.ycombinator.com/")
    .send()
    .bodyText()
def document = Jerry.of(html)

In this example, the first request retrieves the raw HTML source of the homepage. Then, it stores it in the html variable. Once it passes that HTML string to Jerry.of(), a DOM object is created that can be navigated and also queried using CSS selectors.

Here are some examples:

document.find("title").text() // tag selector
document.find("#hnmain") // ID selector
document.find(".titleline") // class selector
document.find("a[href]") // attribute selector

Use CSS selectors to locate specific elements you need based on their location and attributes within the document. It’s very convenient that most browser developer tools support CSS selectors directly. This allows you to inspect a page and test selectors before you add them to your scraper.

Instead of manually searching raw HTML, you can work directly with structured elements and their relationships.

Moving on, the text() method extracts text content, and attr() retrieves attribute values – here’s a snippet, which assumes the document object from the converting example already exists:

def firstLink = document.find(".titleline a").first()
println firstLink.text()
println firstLink.attr("href")

The selector .titleline a matches the links inside the title elements, and first() narrows the result to the first article link on the page. Then, text() returns the visible text inside the element, and attr("href") extracts the URL stored in the link's href attribute. Most data appears either as text content or in HTML attributes (e.g., links, image sources, IDs, and metadata), so these two methods cover many common scraping tasks.

If multiple elements match a selector, you can use each() to iterate over them. The example below extracts article titles and URLs from the front page:

@Grab('org.jodd:jodd-http:6.3.0')
@Grab('org.jodd:jodd-lagarto:6.3.0')
import jodd.http.HttpRequest
import jodd.jerry.Jerry
def html = HttpRequest.get("https://news.ycombinator.com/").send().bodyText()
def document = Jerry.of(html)
// The child combinator (>) skips the nested site-name link inside each title
document.find(".titleline > a").each { item, index ->
    def url = item.attr("href")
    // Relative links point back to the site itself, so prefix them with the base URL
    if (!url.startsWith("http")) {
        url = "https://news.ycombinator.com/${url}"
    }
    println "Title: ${item.text()}"
    println "Link: ${url}"
    println "---"
}

each() loops through every matching element and runs the provided closure for each one. This lets you extract repetitive data structures such as product listings, search results, articles, or table rows.
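
The loop above builds absolute URLs by prefixing the site address, which works for simple relative links. As a more general alternative, the sketch below resolves links with java.net.URI, which also handles root-relative paths and leaves absolute URLs untouched. The sample href values are made up for illustration.

```groovy
// Resolve links against the URL of the page they were found on
def base = new URI("https://news.ycombinator.com/news?p=2")

// Made-up examples of the three link styles a page can contain
def hrefs = ["item?id=1", "/newest", "https://example.com/post"]
def absolute = hrefs.collect { base.resolve(it).toString() }

absolute.each { println it }
// prints:
// https://news.ycombinator.com/item?id=1
// https://news.ycombinator.com/newest
// https://example.com/post
```

Keep in mind that the URI class rejects strings containing spaces or other illegal characters, so hrefs scraped from messy pages may need cleaning or a try/catch around the call.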

You can also extract scores:

document.find(".score").each { item, index ->
    println item.text()
}

In the HTML structure, scores sit in a separate element from the article titles, so you need a dedicated selector to collect them.

Moreover, many websites split results across multiple pages. This pagination usually shows up as a "next" or "more" link. When a scraper extracts the next-page URL, it can continue past the first page and keep gathering data automatically. This is especially important if you're scraping archives, search results, product catalogues, or similar content.

In our example, this is the selector:

def nextPage = document.find(".morelink").attr("href")
println "Next page: https://news.ycombinator.com/${nextPage}"

Also, you can place this inside a loop, and you’ll continue scraping until there’s no next page.
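
Such a loop could be structured as in the sketch below. To keep it runnable offline, page fetching is replaced with a stand-in closure over a small in-memory map. The names fakeSite, fetchPage, and collectAll are hypothetical, and in a real scraper fetchPage would call HttpRequest.get() and query the DOM with Jerry.

```groovy
// Stand-in for "fetch a page and parse it": maps a URL to its items and next-page link
def fakeSite = [
    "/news"    : [items: ["A", "B"], next: "/news?p=2"],
    "/news?p=2": [items: ["C"], next: "/news?p=3"],
    "/news?p=3": [items: ["D"], next: null]
]
def fetchPage = { String url -> fakeSite[url] }

def collectAll(String startUrl, Closure fetchPage, int maxPages = 10) {
    def results = []
    def visited = [] as Set
    def url = startUrl
    // Stop when there is no next link, a URL repeats, or the page cap is reached
    while (url && visited.add(url) && visited.size() <= maxPages) {
        def page = fetchPage(url)
        results.addAll(page.items)
        url = page.next
    }
    return results
}

println collectAll("/news", fetchPage)   // prints: [A, B, C, D]
```

The visited set and the maxPages cap are safety nets: they stop the scraper if a site links pages in a cycle or the listing turns out to be far longer than expected.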

Lastly, a brief note on CSS selectors versus XPath. CSS selectors are concise, easy to read, and simple to maintain, which is why they’re widely used for web scraping, and they are what Jerry works with. XPath is the better fit when you need to move upward through the DOM hierarchy or target elements based on complex relationships; it comes up again in the Selenium section below.

Managing authentication, session cookies, and form submissions 

In this section, we’ll cover how to handle scraping processes that need login or session persistence.

Some websites require authentication before showing you data, because the data you’re after sits behind a login wall, needs authorization, or depends on a user-specific session. Login-protected dashboards, account pages, and personalized content all rely on sessions to identify users. A session is maintained through cookies that the server returns after a successful login. From then on, the session cookies act as proof of authentication for future requests.

Here’s an example of a login via POST request with username/password form data:

@Grab('org.jodd:jodd-http:6.3.0')
import jodd.http.HttpRequest
def username = System.getenv("APP_USERNAME")
def password = System.getenv("APP_PASSWORD")
def loginResponse = HttpRequest
    .post("https://httpbin.org/post")
    .form("username", username)
    .form("password", password)
    .send()
println loginResponse.statusCode()

The username and password are loaded from environment variables where you’ve saved them. Not hardcoding them into the script keeps sensitive credentials out of the source code so that the risk of accidental exposure is lower.

For our example, the request is sent to httpbin.org, but the pattern is the same for real websites. When the request is done, you can check the response to verify that the authentication was successful.

In real authentication, the login response comes with session cookies. Jodd HTTP makes them available through cookies():

def loginResponse = HttpRequest
    .get("https://httpbin.org/cookies/set/sessionid/demo123")
    .send()
def cookies = loginResponse.cookies()

Once you have the cookies, attach them to the next request:

def authenticatedResponse = HttpRequest
    .get("https://httpbin.org/cookies")
    .cookies(cookies)
    .send()
println authenticatedResponse.bodyText()

You can examine the returned cookies to decide which values you need to keep. Just keep in mind that maintaining cookies throughout the entire session is key for authenticated scraping.

Additionally, many websites include hidden form fields, a common one being a CSRF (Cross-Site Request Forgery) token. Websites use these tokens to check that a form is being submitted from their own page rather than by a third party. CSRF tokens are usually generated per session or per request, so you can’t reuse an old value. And because the token must be sent along with the rest of the form data, a scraper has to grab it dynamically before submitting the form.

Let’s look at an example. First, we’ll retrieve the page and extract the token:

@Grab('org.jodd:jodd-http:6.3.0')
@Grab('org.jodd:jodd-lagarto:6.3.0')
import jodd.http.HttpRequest
import jodd.jerry.Jerry
def html = """
<form>
<input type="hidden" name="csrf_token" value="abc123xyz">
</form>
"""
def document = Jerry.of(html)
def token = document
    .find("input[name=csrf_token]")
    .attr("value")
println token

The HTML is embedded directly in the script for this example. In a real workflow, it would actually come from a page retrieved with HttpRequest.get().

The selector will find the hidden input field and will extract its value attribute. 

Then, include the token with the form when it’s time to submit it:

def response = HttpRequest
    .post("https://httpbin.org/post")
    .form("csrf_token", token)
    .form("message", "Hello from Groovy")
    .send()
println response.statusCode()

This request combines the extracted token with the form data expected by the server. Many modern websites validate both the session cookie and the token before accepting a submission, making both pieces of information necessary for successful automation.

When the scraper attaches the session cookies to a new request, it continues within the same authenticated session. What the website sees is requests from a user who is already logged in.

This is why session cookies matter so much for authenticated scraping. Without them, websites will likely redirect your requests to the login page or deny access to restricted content.

Here’s a typical authenticated workflow:

[Figure: Typical authenticated workflow: five steps, repeat as needed.]

The exact process varies between websites, but most authenticated scraping workflows follow this general pattern.

Note: Never hardcode usernames, passwords, API keys, or proxy credentials directly in scripts to prevent possible security issues. It’s better to store them in environment variables. You can also keep them in configuration files that aren't committed to version control.

Scraping dynamic content with Selenium and headless browsers

When static scraping with Jodd is no longer enough and you need to handle dynamic pages, you can use Selenium with a headless Chrome browser controlled from Groovy.

The Jodd-based approach works well for static websites, where the server returns all content in the initial HTML response. Dynamic websites are different: many modern sites load their data with JavaScript after the initial page load. A plain HTTP request then returns only the HTML skeleton, not the content you see in the browser and actually need.

To spot this issue, compare the page source with what the browser shows. If the source contains only empty containers, loading spinners, or placeholder elements, the content is most likely rendered by JavaScript. You’ll often find this on modern eCommerce sites, social media platforms, dashboards, and single-page applications.

To solve this, we’ll add another tool to the mix. We’ll employ a browser automation tool to access the fully rendered page. A headless browser loads the page, executes JavaScript, and renders the final DOM before extraction, solving our scraping problem. 

To use Selenium in Groovy, add the dependencies with Grapes:

@Grab('org.seleniumhq.selenium:selenium-java:4.21.0')

While Jodd HTTP communicates directly with a web server, Selenium controls a real browser through WebDriver. This lets the scraper interact with websites the same way a browser does, making it possible to access content that isn’t available through simple HTTP requests.

You'll also need Chrome installed, along with a ChromeDriver whose version matches it. Selenium 4.6 and later include Selenium Manager, which can download a matching driver automatically; if you install ChromeDriver manually, check that the versions match.

Let’s see an example that launches a headless Chrome browser:

@Grab('org.seleniumhq.selenium:selenium-java:4.21.0')
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions
def options = new ChromeOptions()
options.addArguments("--headless=new")
options.addArguments("--window-size=1920,1080")
options.addArguments("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36")
def driver = new ChromeDriver(options)

The headless option runs Chrome without opening a visible browser window, which reduces overhead. The window size and user agent make the browser look like a regular desktop session, so the scraper is more likely to receive the same content a normal user would.

When the browser is running, go to the target page and wait for the dynamic content to load:

import org.openqa.selenium.By
import org.openqa.selenium.support.ui.ExpectedConditions
import org.openqa.selenium.support.ui.WebDriverWait
import java.time.Duration
driver.get("https://quotes.toscrape.com/js/")
def wait = new WebDriverWait(driver, Duration.ofSeconds(10))
wait.until(
    ExpectedConditions.presenceOfElementLocated(
        By.cssSelector(".quote")
    )
)

Waiting for specific elements to appear is better than using fixed sleep intervals. The scraper gets to work as soon as the target element shows up, which removes unnecessary delays and makes scraping more reliable.
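
To make the idea concrete, the sketch below shows roughly what an explicit wait does: it polls a condition until it holds or a timeout expires. The waitUntil helper is hypothetical and written only for illustration. It is not a Selenium API, and in real scrapers WebDriverWait does this work for you.

```groovy
// Simplified illustration of an explicit wait: poll until the condition holds or time runs out
def waitUntil(long timeoutMs, long pollMs, Closure condition) {
    long deadline = System.currentTimeMillis() + timeoutMs
    while (true) {
        def result = condition()
        if (result) return result
        if (System.currentTimeMillis() >= deadline) {
            throw new RuntimeException("Condition not met within ${timeoutMs} ms")
        }
        Thread.sleep(pollMs)
    }
}

// Simulated "element" that becomes available on the third check
int checks = 0
def element = waitUntil(2000, 50) { ++checks >= 3 ? "quote element" : null }
println "Found '${element}' after ${checks} checks"
// prints: Found 'quote element' after 3 checks
```

The call returns after roughly 100 ms here instead of waiting out the full two-second budget, which is the advantage over a fixed Thread.sleep().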

You can then extract data using CSS selectors or XPath expressions:

driver.findElements(By.cssSelector(".quote")).each {
    println it.findElement(By.cssSelector(".text")).text
}

Because findElements() returns a list of matching elements, you can process them one by one. Additional selectors can then target specific fields within each parent element, such as titles, descriptions, prices, ratings, or timestamps.

If you’re targeting elements based on their position or parent-child relationships, XPath is quite useful:

driver.findElements(
    By.xpath("//div[contains(@class,'quote')]")
)

CSS selectors are easier to read, as mentioned above, but for this purpose XPath is the better choice because it offers more flexibility for navigating complex document structures.

Meanwhile, infinite-scroll pages load new content only when a user scrolls toward the bottom of the page, so they require additional interaction. Send keyboard events and wait for new content to load:

import org.openqa.selenium.Keys
3.times {
    driver.findElement(By.tagName("body"))
        .sendKeys(Keys.PAGE_DOWN)
    Thread.sleep(2000)
}

Automating these interactions lets the scraper load and collect data that doesn’t appear on the first page load.

The complete example below targets https://quotes.toscrape.com/js/. The page is JavaScript-rendered, so a simple HTTP client won't retrieve the content visible in the browser. When Selenium loads the page, the browser executes the JavaScript and exposes the data we’re targeting for extraction.

@Grab('org.seleniumhq.selenium:selenium-java:4.21.0')
import org.openqa.selenium.By
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions
import org.openqa.selenium.support.ui.ExpectedConditions
import org.openqa.selenium.support.ui.WebDriverWait
import java.time.Duration
def options = new ChromeOptions()
options.addArguments("--headless=new")
def driver = new ChromeDriver(options)
try {
    driver.get("https://quotes.toscrape.com/js/")
    def wait = new WebDriverWait(driver, Duration.ofSeconds(10))
    // Block until at least one quote has been rendered by JavaScript
    wait.until(
        ExpectedConditions.presenceOfElementLocated(By.cssSelector(".quote"))
    )
    driver.findElements(By.cssSelector(".quote")).each {
        def quote = it.findElement(By.cssSelector(".text")).text
        def author = it.findElement(By.cssSelector(".author")).text
        println "Quote: ${quote}"
        println "Author: ${author}"
        println "---"
    }
} finally {
    // Always close the browser, even if scraping fails
    driver.quit()
}

Example output:

Quote: “The world as we have created it is a process of our thinking...
Author: Albert Einstein
---
Quote: “It is our choices, Harry, that show what we truly are...
Author: J.K. Rowling
---

We can use this same workflow for JavaScript-rendered search results, product listings, reviews, news feeds, and more. As soon as the content loads in the browser, Selenium can interact with it using the selectors and extraction method utilized for static pages.

Saving and structuring your scraped data

This section covers how to export and store the extracted data in useful formats. After you’ve extracted your data, you need to store it in a format that can be easily analyzed, shared, or processed by other systems.

Before saving anything, do some basic validation and cleanup. Issues like missing values, duplicate records, malformed URLs, and inconsistent formatting reduce data quality and make downstream analysis unnecessarily difficult.
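
A basic cleanup pass could look like the sketch below. The sample records and the isValidUrl helper are made up for illustration, and the rules (trim whitespace, require a title, accept only absolute http(s) URLs, keep the first record per URL) are one reasonable choice and not the only one.

```groovy
// Raw records as they might come out of a scraper (sample data for illustration)
def raw = [
    [title: " Example Article ", url: "https://example.com/a"],
    [title: "Duplicate", url: "https://example.com/a"],
    [title: "", url: "https://example.com/b"],
    [title: "Bad link", url: "not a url"],
    [title: "Second Article", url: "https://example.com/c"]
]

// Hypothetical helper: accept only absolute http(s) URLs that java.net.URI can parse
boolean isValidUrl(String value) {
    if (!value) return false
    try {
        def uri = new URI(value)
        return uri.scheme in ["http", "https"] && uri.host
    } catch (URISyntaxException ignored) {
        return false
    }
}

// 1. Normalize whitespace
def normalized = raw.collect { [title: it.title?.trim(), url: it.url?.trim()] }
// 2. Drop records with a missing title or a malformed URL
def valid = normalized.findAll { it.title && isValidUrl(it.url) }
// 3. Keep the first record for each URL (false = don't modify the input list)
def cleaned = valid.unique(false) { it.url }

cleaned.each { println it }
```

Of the five sample records, only "Example Article" and "Second Article" survive: one record is a duplicate URL, one has no title, and one has a malformed URL.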

JSON is a common choice for storing data because it preserves nested structures and metadata. Groovy includes JsonOutput, making JSON export straightforward.

The following example writes scraped records to a JSON file:

import groovy.json.JsonOutput
def results = [
    [
        title: "Example Article",
        url: "https://example.com/article",
        scrapedAt: new Date().toString()
    ]
]
new File("results.json").text =
    JsonOutput.prettyPrint(
        JsonOutput.toJson(results)
    )

The resulting file holds structured data that APIs or dashboards can consume easily.

For spreadsheet analysis, CSV is a better choice than JSON. You can generate a simple CSV file with Groovy's file APIs:

def rows = [
    ["Title", "URL"],
    ["Example Article", "https://example.com/article"]
]
new File("results.csv").withWriter { writer ->
    rows.each { row ->
        // Quote each field and double any embedded quotes so the CSV stays valid
        writer.println(
            row.collect { '"' + it.toString().replace('"', '""') + '"' }
                .join(",")
        )
    }
}

CSV files work best when every record has the same structure. Define a consistent column order and write an empty value when an optional field is missing, rather than omitting the column.
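
One way to keep the column order consistent is to drive every row from a single list of column names, as in the sketch below. The columns list, the sample records, and the toCsvField closure are all illustrative names and data.

```groovy
// Fixed column order; every row gets all columns, even when a value is missing
def columns = ["title", "url", "score"]

// Sample records: the second one has no score
def records = [
    [title: 'Say "hello"', url: "https://example.com/a", score: 42],
    [title: "No score", url: "https://example.com/b"]
]

// Quote every field and double embedded quotes; null becomes an empty field
def toCsvField = { value ->
    def text = value == null ? "" : value.toString()
    '"' + text.replace('"', '""') + '"'
}

def lines = [columns.collect(toCsvField).join(",")]
records.each { record ->
    lines << columns.collect { toCsvField(record[it]) }.join(",")
}

lines.each { println it }
// prints:
// "title","url","score"
// "Say ""hello""","https://example.com/a","42"
// "No score","https://example.com/b",""
```

Because each row is built from the columns list, a record with a missing field still produces the same number of fields, so spreadsheet columns stay aligned.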

It’s also helpful to include metadata with every record for long-term projects. This will make it a lot simpler to audit and reproduce results when the time comes. Useful fields include source URL, scrape timestamp, page number, search query, and country or proxy location. Metadata will also help us troubleshoot problems and compare results across many scraping sessions.

Moreover, timestamped filenames help separate scrape runs and prevent accidental overwrites:

def timestamp = new Date().format("yyyyMMdd_HHmmss")
def filename = "results_${timestamp}.json"

Timestamped files also help with recovery: if a scraper suddenly begins grabbing incomplete or incorrect data, you can roll back to a previous dataset.

You can create a new file for each run, which makes individual runs easier to track. Alternatively, you can append records to an existing dataset, which makes aggregation simpler.
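
If you go the appending route, a line-per-record format is convenient because new records can be added without rewriting the file. The sketch below uses JSON Lines (one JSON object per line), which is a format choice made for this example. The temporary file and the appendRecord closure are illustrative.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// A temporary file keeps the sketch self-contained; a real scraper would use a fixed path
def file = File.createTempFile("results", ".jsonl")
file.deleteOnExit()

// Append one JSON object per line, so records from earlier runs are never rewritten
def appendRecord = { Map record ->
    file.append(JsonOutput.toJson(record) + "\n", "UTF-8")
}

appendRecord([title: "First run", url: "https://example.com/1"])
appendRecord([title: "Second run", url: "https://example.com/2"])

// Reading back: parse each non-empty line independently
def slurper = new JsonSlurper()
def records = file.readLines("UTF-8")
    .findAll { it.trim() }
    .collect { slurper.parseText(it) }

println records*.title   // prints: [First run, Second run]
```

A side benefit is that a crash in the middle of a run damages at most the last line, while all earlier records remain readable.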

That said, for large projects it’s often more practical to store the results in a database. Groovy connects to databases such as MySQL, PostgreSQL, and SQLite through JDBC, for example with its groovy.sql.Sql class.

You’ll notice that databases are very beneficial if you’re scraping the same source over and over again. They’ll enable you to track any changes over time, compare various records, identify trends, and more. You won’t need to work with large collections of individual files.

Overall, when the data has been properly cleaned, validated, and stored, it reduces workload, and you can integrate it into a variety of workflows.

Avoiding blocks and best practices for production scraping

When web scraping, you’re bound to run into challenges that will mess with your scraper’s reliability. Just because your scraper worked once doesn’t mean it’s ready for production. Websites are constantly monitoring all the traffic coming to them, and they may restrict or block automated requests that seem suspicious at any time.

Some of the most common blocking methods you’ll face are CAPTCHAs and IP bans. Rate limiting restricts how often a single IP can access a site, and fingerprinting analyzes browser characteristics to detect automation tools.

Any solution you decide upon should include residential proxies, because they distribute requests across many IP addresses to lower the chances of triggering the defenses we mentioned. We recommend Decodo residential proxies. You can route Jodd HTTP traffic through a proxy gateway and integrate it into Selenium browser options.
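
Routing Jodd HTTP through a proxy gateway might look like the sketch below. It relies on Jodd's ProxyInfo and SocketHttpConnectionProvider classes. The PROXY_HOST and PROXY_PORT variables, the proxy.example.com:8080 fallback, and the buildProxyProvider helper are placeholders introduced here, so substitute the endpoint from your provider's dashboard and check the calls against the Jodd version you use.

```groovy
@Grab('org.jodd:jodd-http:6.3.0')
import jodd.http.HttpRequest
import jodd.http.ProxyInfo
import jodd.http.net.SocketHttpConnectionProvider

// Placeholder gateway (assumption): replace with your provider's endpoint and port
def proxyHost = System.getenv("PROXY_HOST") ?: "proxy.example.com"
def proxyPort = (System.getenv("PROXY_PORT") ?: "8080").toInteger()

// Hypothetical helper: builds a connection provider that tunnels requests through an HTTP proxy
def buildProxyProvider(String host, int port, String user, String pass) {
    def provider = new SocketHttpConnectionProvider()
    provider.useProxy(ProxyInfo.httpProxy(host, port, user, pass))
    return provider
}

def provider = buildProxyProvider(
    proxyHost,
    proxyPort,
    System.getenv("PROXY_USERNAME"),
    System.getenv("PROXY_PASSWORD")
)

// Uncomment once real proxy details are configured:
// def response = HttpRequest
//     .get("https://httpbin.org/ip")
//     .withConnectionProvider(provider)
//     .send()
// println response.bodyText()
```

Requesting https://httpbin.org/ip is a quick way to confirm the setup, because the response shows the IP address the target server sees.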


Remember that rotating user agents can also help reduce repetitive request patterns:

def userAgents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
]
def randomUserAgent = userAgents[new Random().nextInt(userAgents.size())]

Request timing matters as much as IP rotation. Sending requests at perfectly regular intervals often looks automated. Introduce randomized delays between requests:

sleep(2000 + new Random().nextInt(3000))

Temporary failures are also common. Retry logic with exponential backoff keeps one failed request from stopping the whole run:

def delay = 1000
int maxAttempts = 3
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
        // request logic
        break
    } catch (Exception e) {
        // Give up after the last attempt instead of hiding the error
        if (attempt == maxAttempts) throw e
        sleep(delay)
        delay *= 2
    }
}
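
The same idea can be packaged as a reusable helper. The sketch below is one possible shape and not a standard API: withRetry is a hypothetical name, the random jitter is an addition that helps avoid evenly spaced retries, and the failing "request" is simulated so the example runs offline.

```groovy
// Hypothetical helper: run an action, retrying failures with exponential backoff plus jitter
def withRetry(int maxAttempts, long baseDelayMs, Closure action) {
    def random = new Random()
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return action(attempt)
        } catch (Exception e) {
            // Out of attempts: surface the last error instead of hiding it
            if (attempt == maxAttempts) throw e
            // Backoff doubles each time: 1x, 2x, 4x, ... of the base delay
            long backoff = baseDelayMs * (1L << (attempt - 1))
            Thread.sleep(backoff + random.nextInt((int) baseDelayMs))
        }
    }
}

// Simulated request that fails twice before succeeding
int calls = 0
def result = withRetry(3, 100) { attempt ->
    calls++
    if (attempt < 3) throw new IOException("temporary failure on attempt ${attempt}")
    "ok"
}
println "Result: ${result} after ${calls} calls"
// prints: Result: ok after 3 calls
```

In a scraper, the closure body would contain the HttpRequest call, and you might throw an exception for retryable status codes so they go through the same backoff path.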

It's also worth checking a website's robots.txt file and using reasonable request intervals. This reduces server load and helps avoid unnecessary blocks.
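
A rough robots.txt check can be sketched in a few lines. The version below is deliberately simplified: it only reads Disallow rules for the wildcard user agent and ignores Allow rules, wildcards, and Crawl-delay, so treat it as an illustration and not a complete parser. The sample file content and both helper names are made up.

```groovy
// Sample robots.txt content (hard-coded; a real scraper would download it from the site root)
def robotsTxt = '''
User-agent: *
Disallow: /private/
Disallow: /search

User-agent: SomeBot
Disallow: /
'''

// Collect Disallow prefixes that apply to all user agents ("*").
// Simplified on purpose: assumes one User-agent line per group.
List<String> disallowedFor(String text) {
    def rules = []
    def applies = false
    text.readLines().each { line ->
        def clean = line.replaceFirst(/#.*/, "").trim()   // strip comments
        if (!clean.contains(":")) return                  // skip blank or malformed lines
        def (field, value) = clean.split(":", 2)*.trim()
        if (field.equalsIgnoreCase("User-agent")) {
            applies = (value == "*")
        } else if (applies && field.equalsIgnoreCase("Disallow") && value) {
            rules << value
        }
    }
    return rules
}

// A path is allowed when it doesn't start with any disallowed prefix
boolean isAllowed(String path, List<String> disallowed) {
    !disallowed.any { path.startsWith(it) }
}

def rules = disallowedFor(robotsTxt)
println rules                              // prints: [/private/, /search]
println isAllowed("/news", rules)          // prints: true
println isAllowed("/private/data", rules)  // prints: false
```

Running the check once per site and caching the rules avoids fetching robots.txt before every request.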

Another common issue is broken CSS selectors and XPath expressions, which usually happens because websites update their layouts. Monitor the extraction results, and consider adding alerts that warn you when expected fields suddenly disappear.
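
A simple form of such monitoring is to measure how often each expected field comes back empty after a run. In the sketch below, the missingRates helper, the sample records, and the 50% threshold are all illustrative assumptions; tune the threshold to what is normal for your source.

```groovy
// Fraction of records in which each expected field is missing or blank
Map<String, Double> missingRates(List<Map> records, List<String> fields) {
    fields.collectEntries { field ->
        def missing = records.count { !it[field]?.toString()?.trim() }
        // An empty result set counts as fully missing
        [(field): (records ? missing / (double) records.size() : 1.0d)]
    }
}

// Sample scrape result: two of three records lack a score
def records = [
    [title: "A", url: "https://example.com/a", score: "10 points"],
    [title: "B", url: "https://example.com/b", score: null],
    [title: "C", url: "https://example.com/c", score: ""]
]

def rates = missingRates(records, ["title", "url", "score"])

// Threshold is an arbitrary example value
def alerts = rates.findAll { field, rate -> rate > 0.5d }
alerts.each { field, rate ->
    println "WARNING: '${field}' missing in ${Math.round(rate * 100)}% of records"
}
// prints: WARNING: 'score' missing in 67% of records
```

A sudden jump in the missing rate for one field usually points to a changed selector, while a jump across all fields more often means the page itself failed to load or the request was blocked.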

At some point, it can make sense to hand most of this work over to a managed service so you can focus on the extracted data. As proxy rotation, retries, and anti-blocking logic become difficult to maintain, especially for large-scale projects, a Web Scraping API is often the more practical solution.

Use Decodo's Web Scraping API for sites with strong anti-bot technology and for browser rendering when you need a fast, all-in-one solution that manages parsing, browser rendering, anti-bot bypass, and proxy rotation. Alternatively, consider Decodo Site Unblocker that comes with a proven success rate for scraping targets with aggressive anti-bot protections, as it’s purpose-built for bypassing Cloudflare, DataDome, Akamai, and similar protections.

Final thoughts

Groovy is a practical language for web scraping on the JVM. You can use it to send simple HTTP requests with Jodd HTTP, parse HTML with Jodd Lagarto and Jerry, manage authenticated sessions, and render JavaScript-heavy pages with Selenium. Combined with proper data storage, proxy management, and scraping APIs, it lets you build reliable scrapers for both small automation tasks and larger data collection projects.

About the author

Kipras Kalzanauskas

Senior Account Manager

Kipras is a strategic account expert with a strong background in sales, IT support, and data-driven solutions. Born and raised in Vilnius, he studied history at Vilnius University before spending time in the Lithuanian Military. For the past 3.5 years, he has been a key player at Decodo, working with Fortune 500 companies in eCommerce and Market Intelligence.

Connect with Kipras on LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Is Groovy better than Python for web scraping?

Groovy and Python suit different scenarios. Groovy is generally the more practical choice if your infrastructure already runs on the JVM: it gives you direct access to Java libraries and combines a concise scripting syntax with Java's extensive tooling. That said, Python has a larger scraping ecosystem with many widely used libraries, including Requests, Beautiful Soup, Scrapy, and Playwright.

Can Groovy be used for large-scale web scraping projects?

Yes, you can use Groovy for large-scale web scraping projects. It integrates well with the systems such operations need beyond data extraction itself, including proxy management, monitoring, scheduling, storage, and retries. That said, as workloads grow and projects become more complex, it’s a good idea to combine Groovy scripts with other tools and web scraping APIs to reduce maintenance overhead.

What are the best Groovy libraries for web scraping?

The best Groovy library depends on your target website. Use Jodd HTTP to send HTTP requests and handle responses, Jodd Lagarto and Jerry to parse HTML and query it with CSS selectors, and Selenium to scrape JavaScript-heavy websites that need browser automation. Groovy also ships with JsonSlurper and JsonOutput, which are handy when working with JSON APIs. Finally, because Groovy runs Java libraries, you can add other popular tools when you need them, including Playwright, Apache HttpClient, and JDBC.

© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved