Prerequisites for scraping with Java

Before you can create a simple web scraper with Java, make sure your setup is ready for the job:

Use a current Java LTS release – download Java 17, 21, or newer, and install it.

Use an IDE like IntelliJ IDEA (Community Edition is free) – it simplifies development and integrates with build tools.

Use automated build tools like Maven to manage dependencies like Jsoup or Selenium. Use the installation guide for getting Maven on your system.

Understand the basics of HTML so you can locate elements to scrape.

Understand CSS selectors and XPath – essential for targeting specific elements on a page.
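If XPath is new to you, here is a minimal sketch of what targeting elements with it can look like. It relies only on the XML tools bundled with the JDK, and it assumes the markup is well-formed, which real-world HTML often isn't. The class name, the sample markup, and the selectText helper are all made up for illustration.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class XPathSketch {

    // Hypothetical, well-formed sample markup (an assumption for this sketch).
    static final String SAMPLE = """
            <html><body>
              <ul id="books">
                <li class="title">Dune</li>
                <li class="title">Emma</li>
                <li class="note">Out of stock</li>
              </ul>
            </body></html>
            """;

    // Evaluates an XPath expression and returns the trimmed text of each matching node.
    static List<String> selectText(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Parsing and XPath errors are rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("Could not evaluate XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // XPath: every <li> with class "title" inside the element with id "books".
        // The equivalent CSS selector would be: #books li.title
        System.out.println(selectText(SAMPLE, "//ul[@id='books']/li[@class='title']"));
        // prints [Dune, Emma]
    }
}
```

The expression skips the "note" item because the class predicate doesn't match it. That is the kind of precise targeting you'll rely on when scraping.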

For more in-depth setup instructions and some of the Java code samples discussed in this blog post, visit this GitHub repository.

Overview of Java web scraping libraries

Java gives you more than one way to scrape data from the web, and the right tool depends on what kind of content you're dealing with. Some pages serve static HTML, while others rely on JavaScript to dynamically load data. Let's take a brief look at when to use each library.

Before you try any of the examples in this guide, make sure your dependencies are set up. You don't need to download these libraries manually. If you're using Maven, add the dependency blocks to your pom.xml, and Maven pulls everything from the Maven Central Repository into your project. Gradle works the same way through its own build file. Once the dependencies are declared, your IDE resolves the imports, and the snippets below compile without extra configuration. This setup also makes it easy to update versions or check the official documentation when an API changes.

Jsoup

Jsoup is lightweight and reads HTML almost like a browser would. You can fetch a page with just a few lines of code and then use familiar CSS selectors to extract what you need.

Use Jsoup when your target pages include all their data in the initial HTML response. Jsoup doesn't execute JavaScript, so it can't scrape pages that rely heavily on it to generate their content.
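As a rough illustration, the sketch below parses a hard-coded HTML string rather than fetching a live page, so it runs without network access. The markup, class names, and the extractProducts helper are assumptions made up for this example. It also assumes the Jsoup dependency is already on your classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class StaticHtmlSketch {

    // Hypothetical markup standing in for a static page's initial HTML response.
    static final String SAMPLE_HTML = """
            <html><body>
              <div class="product"><h2 class="name">Laptop</h2><span class="price">$999</span></div>
              <div class="product"><h2 class="name">Mouse</h2><span class="price">$25</span></div>
            </body></html>
            """;

    // Returns one "name: price" string per element matching div.product.
    static List<String> extractProducts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element product : doc.select("div.product")) {
            // select() is scoped to the current product element.
            String name = product.select("h2.name").text();
            String price = product.select("span.price").text();
            rows.add(name + ": " + price);
        }
        return rows;
    }

    public static void main(String[] args) {
        extractProducts(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

To scrape a real page, you would typically swap Jsoup.parse(html) for Jsoup.connect(url).get(), which fetches and parses in one call.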

HtmlUnit

HtmlUnit is a headless browser written in Java that simulates user interactions like clicking. It's slower than Jsoup but can handle dynamic JavaScript content without running a full browser window.

It's handy for testing and simple scraping tasks where you need to wait for JavaScript to run.

Selenium

If you need even more control over dynamic websites, Selenium is the next step. It controls a real browser (Chrome, Firefox, or Edge) through WebDriver. This allows you to scrape content that appears only after user-like actions. You can click buttons, log in, or scroll through pages that keep loading content as you go.

The trade-off is speed. Selenium is powerful, but because every session drives a full browser, it's resource-intensive.

Apache HttpClient/HttpComponents

For projects focused on HTTP performance, you might prefer Apache HttpClient (HttpComponents). It's a tool for sending requests, managing headers, handling cookies, and controlling sessions.

In production-grade systems, developers often pair HttpClient with Jsoup for clean separation between fetching and parsing.
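The sketch below shows one way that separation might look. To keep it short, it swaps in the JDK's built-in java.net.http.HttpClient for the fetching side instead of Apache HttpClient, and it assumes Jsoup is on the classpath for parsing. The class name, method names, and User-Agent string are hypothetical.

```java
import org.jsoup.Jsoup;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Function;

public class FetchParseSketch {

    // Fetching: turns a URL into raw HTML. Uses the JDK client as a stand-in
    // for Apache HttpClient (an assumption made to keep the sketch dependency-light).
    static String fetchWithJdkClient(String url) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "example-scraper/0.1") // hypothetical value
                .GET()
                .build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching " + url, e);
        }
    }

    // Parsing: knows nothing about HTTP, only about HTML.
    static String parseTitle(String html) {
        return Jsoup.parse(html).title();
    }

    // The fetcher is passed in, so it can be replaced without touching the parser.
    static String scrapeTitle(Function<String, String> fetcher, String url) {
        return parseTitle(fetcher.apply(url));
    }

    public static void main(String[] args) {
        // A canned fetcher keeps this runnable offline.
        Function<String, String> canned =
                url -> "<html><head><title>Hello</title></head><body></body></html>";
        System.out.println(scrapeTitle(canned, "https://example.com")); // prints Hello

        // For a live request, pass the real fetcher instead:
        // scrapeTitle(FetchParseSketch::fetchWithJdkClient, "https://example.com");
    }
}
```

Because the parser only ever sees a string, you can test it against saved HTML and change HTTP clients later without rewriting the extraction logic.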

Playwright

Playwright for Java brings a high-level API for scraping JavaScript-heavy websites faster and more reliably than Selenium in some cases. Microsoft maintains it, and it's becoming a strong contender for automation and scraping alike.

Step-by-step guide: basic web scraping with Java

Let's walk through the process of building a simple web scraper in Java, with each step explained so you can drop it straight into your project.

We'll assume you've already met all the prerequisites and created a Java project in IntelliJ IDEA using Maven as the build tool.

Add dependencies

Before you start building the scraper, add the libraries you'll use. For parsing HTML, you'll use Jsoup. For exporting results to JSON, you'll use Gson. And for exporting to CSV, you'll use OpenCSV. Add the following to your pom.xml file: