Date: 2024-07-02

Use HtmlAgilityPack when the data is visible in "view source." It's the fastest option and doesn't spin up a browser. Perfect for scraping blogs, product listings with server-side rendering, or any site built before 2015. If you're coming from Python's Beautiful Soup, this is your equivalent.

Use Selenium when content loads after the page renders – think infinite scroll, lazy-loaded images, or data fetched via API calls. It's battle-tested and has extensive documentation. Yes, it's slower than parsing raw HTML, but it actually works on modern websites. Check out our guide on Selenium Scraping With Node.js to see how the concepts translate across languages.

Use PuppeteerSharp if you want Selenium's capabilities with a cleaner API. It's the C# port of Google's Puppeteer library. Good choice if you're already familiar with headless Chrome workflows or need advanced browser control like request interception.

The following sections show how to use HtmlAgilityPack for static content and Selenium for dynamic content. That doesn't make them the best choice for every project: other libraries, such as ScrapySharp, offer different features that may suit your particular needs better.

Installing HtmlAgilityPack via NuGet

To install HtmlAgilityPack, from your project directory, run:

dotnet add package HtmlAgilityPack

You'll see output confirming the package was added. The command downloads HtmlAgilityPack and automatically updates your project file.

To verify the installation, open the WebScraper.csproj file in VS Code. You should see a new <ItemGroup> section that looks like this:

<ItemGroup>
  <PackageReference Include="HtmlAgilityPack" Version="1.11.61" />
</ItemGroup>

Adding CsvHelper for CSV export

Scraping data is pointless if you can't export it somewhere useful. You could manually write CSV formatting logic – concatenating strings, escaping commas, dealing with newlines – but why waste time reinventing the wheel when CsvHelper exists?

CsvHelper is the de facto standard for CSV operations in C#. It handles encoding, culture-specific formatting, and edge cases (like fields containing commas or quotes) automatically. You define a class, pass it a list of objects, and it generates a properly formatted CSV. No surprises, no bugs at 2 AM because someone's company name had a comma in it.

But why CSV, you might ask? Because it's the universal data format. Excel opens it, Google Sheets imports it, Pandas reads it, and databases ingest it. For your first scraping project, CSV is the path of least resistance. You're not dealing with JSON schema validation, database connections, or API rate limits – just rows and columns that anyone can understand.

Once your scraper works, you can always swap CSV for JSON, SQL, or whatever your pipeline needs. But start simple.

Run this in your project directory:

dotnet add package CsvHelper

That's it. CsvHelper is now in your project alongside HtmlAgilityPack. If you want to check out other NuGet packages or explore different versions, browse the official NuGet gallery.
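To make the "define a class, pass it a list of objects" workflow concrete, here is a minimal sketch. The Quote class and the sample rows are hypothetical and exist only for this illustration; a real scraper would fill the list from parsed HTML. The sketch writes to a StringWriter so the result can be printed and inspected.

```csharp
using System.Globalization;
using CsvHelper;

// Hypothetical sample data standing in for scraped results.
var quotes = new List<Quote>
{
    new Quote { Text = "I have not failed. I've just found 10,000 ways that won't work.", Author = "Thomas A. Edison" },
    new Quote { Text = "A day without sunshine is like, you know, night.", Author = "Steve Martin" },
};

// Write into memory so the CSV text can be printed directly.
using var writer = new StringWriter();
using var csv = new CsvWriter(writer, CultureInfo.InvariantCulture);

csv.WriteRecords(quotes); // header row is derived from the property names
csv.Flush();              // push buffered rows into the underlying writer

var csvText = writer.ToString();
Console.Write(csvText);

// Hypothetical model class; each public property becomes a CSV column.
public class Quote
{
    public string Text { get; set; } = "";
    public string Author { get; set; } = "";
}
```

Both sample quotes contain commas, so CsvHelper wraps those fields in double quotes. To write to disk instead, swapping the StringWriter for a StreamWriter pointed at a file path should be enough.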

Now you've got the tools to scrape and export. Time to write actual code.

Building a static web scraper with HtmlAgilityPack

For this example project, let's scrape quotes from quotes.toscrape.com – a practice site designed for precisely this purpose. This site displays quotes with authors and tags. The HTML is server-rendered, which means all the content is already in the page source when it loads. Perfect for HtmlAgilityPack.

Loading HTML with HtmlWeb.Load()

HtmlAgilityPack provides two ways to fetch web pages: synchronous and asynchronous. For most scraping tasks, especially when you're just learning, synchronous is simpler.

Synchronous loading blocks your program until the page loads completely. Open Program.cs and write the following code:

using HtmlAgilityPack;

var web = new HtmlWeb();
var doc = web.Load("https://quotes.toscrape.com/");

Console.WriteLine("Page loaded successfully!");
Console.WriteLine($"Title: {doc.DocumentNode.SelectSingleNode("//title").InnerText}");

Save the file and run it with this command in your terminal:

dotnet run

You'll see the page title printed in the terminal. It's a simple task, but it confirms that the library works and lays the groundwork for the scraping that follows.

Using XPath with SelectNodes() and SelectSingleNode()

XPath is a query language for navigating HTML/XML structures. It's like SQL for documents – a bit cryptic at first, but incredibly powerful once you understand the syntax.

Basic XPath patterns:

// Select ALL matching elements
var quoteNodes = doc.DocumentNode.SelectNodes("//div[@class='quote']");

// Select the FIRST matching element
var firstQuote = doc.DocumentNode.SelectSingleNode("//div[@class='quote']");

The "//" tells the XPath engine to search anywhere in the document. The [@class='quote'] predicate keeps only elements whose class attribute is exactly "quote". To find the right element and class names, use Inspect Element in your browser.
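The class filter above is an exact string comparison, which matters once an element carries more than one class. The sketch below runs offline against a hypothetical inline HTML snippet (not the real site) and compares the exact match with a token-style match. It also shows that SelectNodes() returns null, rather than an empty list, when nothing matches.

```csharp
using HtmlAgilityPack;

// Hypothetical markup that mimics a quote listing; the first div has two classes.
var html = @"
<div class='quote featured'><span class='text'>First</span></div>
<div class='quote'><span class='text'>Second</span></div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Exact match: the class attribute must equal 'quote' and nothing else.
var exact = doc.DocumentNode.SelectNodes("//div[@class='quote']");

// Token match: 'quote' may be one of several space-separated classes.
var tokenMatch = doc.DocumentNode.SelectNodes(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' quote ')]");

// No element matches this query, so the result is null.
var missing = doc.DocumentNode.SelectNodes("//div[@class='does-not-exist']");

Console.WriteLine($"Exact matches: {exact?.Count ?? 0}");
Console.WriteLine($"Token matches: {tokenMatch?.Count ?? 0}");
Console.WriteLine($"Missing is null: {missing is null}");
```

Because of that null result, it is worth checking the return value before looping over it.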

Let's extract actual data:

using HtmlAgilityPack;

var web = new HtmlWeb();
var doc = web.Load("https://quotes.toscrape.com/");

// Select all quote containers
var quoteNodes = doc.DocumentNode.SelectNodes("//div[@class='quote']");

// SelectNodes returns null when nothing matches, so guard before looping
if (quoteNodes is null)
{
    Console.WriteLine("No quotes found.");
    return;
}

foreach (var quoteNode in quoteNodes)
{
    // Extract nested elements using relative XPath (starts with .)
    var text = quoteNode.SelectSingleNode(".//span[@class='text']").InnerText;
    var author = quoteNode.SelectSingleNode(".//small[@class='author']").InnerText;

    Console.WriteLine($"Quote: {text}");
    Console.WriteLine($"Author: {author}");
    Console.WriteLine("---");
}

The script heads to the website, finds the required information through the defined XPaths, and prints the quotes and author names.

Cleaning HTML entities with HtmlEntity.DeEntitize()

If you ran the code above, you probably noticed that the text in your terminal looks a little bit odd:

Quote: "I have not failed. I&#39;ve just found 10,000 ways that won&#39;t work."

Those "&#39;" sequences are HTML entities – encoded representations of special characters. Browsers decode them automatically, but when you extract InnerText, you get the raw encoded version.

To fix this issue, you must decode them before outputting:

using HtmlAgilityPack;

var web = new HtmlWeb();
var doc = web.Load("https://quotes.toscrape.com/");

var quoteNodes = doc.DocumentNode.SelectNodes("//div[@class='quote']");

// SelectNodes returns null when nothing matches, so guard before looping
if (quoteNodes is null)
{
    Console.WriteLine("No quotes found.");
    return;
}

foreach (var quoteNode in quoteNodes)
{
    var text = quoteNode.SelectSingleNode(".//span[@class='text']").InnerText;
    var author = quoteNode.SelectSingleNode(".//small[@class='author']").InnerText;

    // Decode HTML entities to readable text
    text = HtmlEntity.DeEntitize(text);

    Console.WriteLine($"Quote: {text}");
    Console.WriteLine($"Author: {author}");
    Console.WriteLine("---");
}

Much cleaner. Always run DeEntitize() before writing to CSV or JSON – your data analysts will thank you.

Now let's tackle the more complex problem: JavaScript-rendered pages.

Scraping JavaScript-rendered pages with Selenium

HtmlAgilityPack works perfectly until you encounter a site where "view source" shows only empty div containers.

This is where Selenium saves you. It's not just a scraping and parsing library – it's a browser automation framework. Selenium launches an actual Chrome (or Firefox) instance, navigates to the page, waits for JavaScript to execute, and then lets you extract data from the fully rendered DOM.

How Selenium works: WebDriver architecture

Selenium uses a WebDriver protocol to control the browser. Think of it as a remote control:

1. Your C# code sends commands to the WebDriver (e.g., "navigate to this URL," "click this button").
2. WebDriver translates those commands into browser-specific instructions.
3. Chrome (via ChromeDriver) executes the instructions and sends back results.
4. Your code receives the data and continues.

This round trip makes Selenium slower than plain HTTP requests, since you're driving a full browser. Still, when a page relies heavily on JavaScript to generate content, a real browser engine is often the only practical option.

Responsible automation

Automated browsers can send requests faster than humans, and hammering a server with 1000 concurrent Selenium instances will get you IP-banned instantly. Add delays between requests (Thread.Sleep() or better yet, use exponential backoff). Respect robots.txt. If a site explicitly blocks automation, don't try to circumvent it – use a service like Decodo's Web Scraping API that handles rate limits and proxies correctly.
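As a rough illustration of exponential backoff, the sketch below doubles the pause after each failed attempt and caps it at a maximum. BackoffDelay and GetWithBackoffAsync are hypothetical helper names, and the 1-second base, 30-second cap, and 4 attempts are assumed values, not recommendations from any library. The final loop only prints the delay schedule, so no requests are sent.

```csharp
using System.Net.Http;

// Delay for a retry attempt: baseDelay * 2^attempt, capped at maxDelay.
static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
{
    var ms = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
    return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
}

// Hypothetical helper: retries a GET request with growing pauses between attempts.
static async Task<string?> GetWithBackoffAsync(HttpClient client, string url, int maxAttempts = 4)
{
    for (var attempt = 0; attempt < maxAttempts; attempt++)
    {
        try
        {
            using var response = await client.GetAsync(url);
            if (response.IsSuccessStatusCode)
                return await response.Content.ReadAsStringAsync();
        }
        catch (HttpRequestException)
        {
            // Network error: fall through and retry after the delay.
        }

        // Skip the pause after the final attempt.
        if (attempt < maxAttempts - 1)
            await Task.Delay(BackoffDelay(attempt, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30)));
    }

    return null; // every attempt failed
}

// Print the delay schedule without touching the network.
for (var attempt = 0; attempt < 6; attempt++)
{
    var delay = BackoffDelay(attempt, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30));
    Console.WriteLine($"Attempt {attempt}: wait {delay.TotalSeconds}s");
}
```

The same BackoffDelay function could pace Selenium page loads as well; adding a small random jitter to each delay is a common refinement when several scrapers run at once.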

Also, check out our ChatGPT web scraping guide if you're experimenting with AI-assisted scraping workflows.

Now let's build a scraper for quotes.toscrape.com/js – the JavaScript-rendered version of the site you scraped earlier.

Installing Selenium.WebDriver and ChromeDriver

You need two packages: the Selenium library itself and the ChromeDriver binary that controls Chrome. Run these commands in your project directory:

dotnet add package Selenium.WebDriver
dotnet add package Selenium.WebDriver.ChromeDriver

Launching Chrome in headless mode

Headless mode runs Chrome without a visible window. No GUI means less memory usage and faster execution. Here's the basic script to write in Program.cs:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Configure Chrome options
var options = new ChromeOptions();
options.AddArgument("--headless");    // Run without GUI
options.AddArgument("--disable-gpu"); // Disable GPU acceleration (recommended for headless)
options.AddArgument("--no-sandbox");  // Bypass OS security model (needed in some environments)

// Launch Chrome with these options
var driver = new ChromeDriver(options);

try
{
    driver.Navigate().GoToUrl("https://quotes.toscrape.com/js/");
    Console.WriteLine($"Page title: {driver.Title}");
}
finally
{
    driver.Quit(); // ALWAYS close the browser
}

Run this with:

dotnet run

You won't see a browser window open, but you should see this in your terminal:

Page title: Quotes to Scrape

If you bump into an issue where the driver isn't found after running the script, check that chromedriver.exe (or chromedriver) exists in the output folder. Some antivirus software flags it – add an exception if needed.

Why headless matters:

Speed. No rendering overhead for UI elements you'll never see.

Server environments. Many CI/CD servers don't have displays.

Resource efficiency. Lower memory usage when running multiple scrapers.

If you're debugging and want to see what Selenium is doing, just remove the --headless argument. Chrome will open visibly, and you can watch it navigate and interact with the page.

Extracting elements with driver.FindElements()

Once the page loads and JavaScript executes, you can extract data just like with HtmlAgilityPack – but with Selenium's API instead.

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
options.AddArgument("--headless");
options.AddArgument("--disable-gpu");

var driver = new ChromeDriver(options);

try
{
    driver.Navigate().GoToUrl("https://quotes.toscrape.com/js/");

    // Wait for JavaScript to load content (important!)
    Thread.Sleep(2000); // Simple wait

    // Find all quote containers
    var quoteElements = driver.FindElements(By.CssSelector("div.quote"));
    Console.WriteLine($"Found {quoteElements.Count} quotes\n");

    foreach (var quoteElement in quoteElements)
    {
        // Extract text from nested elements
        var text = quoteElement.FindElement(By.CssSelector("span.text")).Text;
        var author = quoteElement.FindElement(By.CssSelector("small.author")).Text;

        // Extract tags (multiple elements)
        var tagElements = quoteElement.FindElements(By.CssSelector("a.tag"));
        var tags = tagElements.Select(t => t.Text).ToList();

        Console.WriteLine($"Quote: {text}");
        Console.WriteLine($"Author: {author}");
        Console.WriteLine($"Tags: {string.Join(", ", tags)}");
        Console.WriteLine("---");
    }
}
finally
{
    driver.Quit(); // Clean up browser process
}

The script launches a browser, navigates to the page, waits for elements to load dynamically, then extracts the quotes, author names, and tags, and finally, prints them in the terminal.
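A fixed Thread.Sleep(2000) either wastes time or, on a slow connection, is too short. One possible alternative is an explicit wait that polls until the content appears. The sketch below uses Selenium's WebDriverWait from the OpenQA.Selenium.Support.UI namespace; the 10-second timeout is an assumed value, and if that namespace does not resolve with your installed version, adding the Selenium.Support package is the likely fix (treat that as an assumption to verify).

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

var options = new ChromeOptions();
options.AddArgument("--headless");

var driver = new ChromeDriver(options);

try
{
    driver.Navigate().GoToUrl("https://quotes.toscrape.com/js/");

    // Poll for up to 10 seconds (assumed timeout) until at least one quote is in the DOM.
    // Until() throws WebDriverTimeoutException if the condition never becomes true.
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
    wait.Until(d => d.FindElements(By.CssSelector("div.quote")).Count > 0);

    var quoteElements = driver.FindElements(By.CssSelector("div.quote"));
    Console.WriteLine($"Found {quoteElements.Count} quotes");
}
finally
{
    driver.Quit(); // Clean up browser process
}
```

The wait returns as soon as the condition holds, so fast pages are not slowed down by a fixed pause.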