
How to Scrape Websites with PowerShell: A Complete Guide

PowerShell is already where many Windows admins, DevOps teams, and automation-minded developers handle repetitive work. That makes web scraping a natural next step when you need product prices, uptime signals, public data for reports, or quick checks from the terminal. PowerShell works well here because output is pipeable, objects are native, and CSV and JSON exports are built in. In this guide, you'll build a scraper that fetches pages, parses HTML, handles pagination and errors, uses proxies when needed, and exports structured data.

TL;DR

  • Use PowerShell 7 for new scraping projects. It runs on Windows, macOS, and Linux, and its web cmdlets use the modern .NET HTTP stack
  • Use Invoke-WebRequest when you need the full response object, headers, status code, and HTML content
  • Use Invoke-RestMethod when the target is a JSON or XML API, and you want PowerShell to deserialize the response automatically
  • Use PSParseHTML for CSS selector-based HTML parsing in PowerShell 7
  • Use Selenium only when the page needs a browser-rendered DOM. It works, but the PowerShell module has maintenance and driver-version friction.
  • Add retries, explicit timeouts, realistic headers, delays, and proxy support before you run a scraper repeatedly

Setting up PowerShell for web scraping

Before writing scraping code, get the runtime and modules right.

PowerShell Core 7 vs. Windows PowerShell 5

PowerShell 7 runs on Windows, macOS, and Linux, and it uses the newer .NET networking stack behind its web cmdlets. That matters when your scraper needs current TLS behavior, consistent HTTP handling, and the option to run the same job on a laptop, server, or CI runner.

PowerShell 7 also installs side by side with Windows PowerShell 5.1, so you can keep old automation scripts untouched while running new scraping scripts with pwsh.exe instead of powershell.exe.

Check your version:

pwsh --version

If the command isn't found, install PowerShell 7. On Windows clients, use winget:

winget install --id Microsoft.PowerShell --source winget

You can also download the MSI installer from the PowerShell GitHub releases page if winget isn't available on your machine.

The main scraping difference between the two versions is HTML parsing. Windows PowerShell 5.1 used Internet Explorer components for .ParsedHtml in Invoke-WebRequest. PowerShell 7 removed that dependency. Instead, fetch HTML with PowerShell's web cmdlets and parse it with a dedicated module such as PSParseHTML.

Fixing execution policy on Windows

Execution policy controls when PowerShell loads scripts and configuration files. It's a safety feature that only applies to Windows.

If Windows blocks your local .ps1 script, set the policy for your current user:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

RemoteSigned allows scripts you write locally to run and requires downloaded scripts to be signed or unblocked. You usually don't need admin rights with -Scope CurrentUser. You don't need to do this on Linux and macOS.
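To see what's currently in effect at each scope before changing anything, run:

Get-ExecutionPolicy -List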

Installing PSParseHTML

PSParseHTML gives you ConvertFrom-Html and related HTML utilities. The current module supports both AgilityPack and AngleSharp parsing engines. Use -Engine AngleSharp when you want browser-style DOM methods such as QuerySelector() and QuerySelectorAll(), which makes selectors feel familiar if you already inspect pages in DevTools.

Install PSParseHTML from the PowerShell Gallery:

Install-Module PSParseHTML -Scope CurrentUser
Import-Module PSParseHTML

PowerHTML is a practical alternative when you prefer an HtmlAgilityPack-based parser or XPath-heavy workflows. If you're choosing between them, PSParseHTML is usually more convenient for CSS selectors, while PowerHTML is worth considering when your team already thinks in XPath.
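If you go the PowerHTML route, the workflow is similar but XPath-first. Here's a minimal sketch, assuming PowerHTML's ConvertFrom-Html returns an HtmlAgilityPack node as it does in current releases (note that both modules export a ConvertFrom-Html command, so avoid importing them in the same session):

Install-Module PowerHTML -Scope CurrentUser
Import-Module PowerHTML

# Parse raw HTML into an HtmlAgilityPack node and query it with XPath
$doc = ConvertFrom-Html -Content $response.Content

# SelectNodes returns $null when nothing matches, so guard before looping
$links = $doc.SelectNodes('//span[@class="titleline"]/a')
if ($null -ne $links) {
    foreach ($link in $links) {
        $link.InnerText
        $link.GetAttributeValue("href", "")
    }
}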

Project structure

Start with a small project structure:

powershell-scraper/
  scraper.ps1
  models.ps1
  export.ps1
  output/

Put request and parsing logic in scraper.ps1, reusable object shapes in models.ps1, and output functions in export.ps1. You can merge them while learning, but separate files keep recurring jobs easier to maintain.

Don't hardcode proxy credentials, API keys, or account details in scripts. Instead, pull them from environment variables, set in the session or loaded from a local .env file:

$env:PROXY_USER = "your-user"
$env:PROXY_PASSWORD = "your-password"

Then read them inside the script. That keeps secrets out of Git history and pasted code snippets.
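PowerShell doesn't load .env files natively, so a small loader is needed. A minimal sketch, assuming one KEY=value pair per line:

# Minimal .env loader: one KEY=value per line, blank lines and # comments skipped
Get-Content ".env" | ForEach-Object {
    if ($_ -match '^\s*([^#=][^=]*)=(.*)$') {
        [Environment]::SetEnvironmentVariable($Matches[1].Trim(), $Matches[2].Trim())
    }
}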

Fetching and parsing HTML content

PowerShell gives you 2 built-in web cmdlets: Invoke-WebRequest and Invoke-RestMethod. They overlap, but they aren't interchangeable once you care about response metadata.

Invoke-WebRequest vs. Invoke-RestMethod: When to use which

Invoke-WebRequest sends HTTP(S) requests and returns a response object with properties such as StatusCode, Headers, Content, Links, and Images. Use it when you're scraping HTML or debugging how a server responds.

Invoke-RestMethod also sends HTTP(S) requests, but it's built for REST services that return structured data. For JSON and XML, PowerShell deserializes the response into objects for you.
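For example, Hacker News also publishes a free JSON API; with Invoke-RestMethod the response arrives as ready-to-use objects:

# Top story IDs arrive as an array of integers, no JSON parsing required
$ids = Invoke-RestMethod -Uri "https://hacker-news.firebaseio.com/v0/topstories.json"

# Fetch the first story and read its fields as object properties
$story = Invoke-RestMethod -Uri "https://hacker-news.firebaseio.com/v0/item/$($ids[0]).json"
$story.title
$story.score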

| Target type | Recommended cmdlet | Why |
| --- | --- | --- |
| Static HTML page | Invoke-WebRequest | You get status, headers, raw content, and link metadata. |
| JSON API | Invoke-RestMethod | JSON is automatically converted into PowerShell objects. |
| XML API | Invoke-RestMethod | XML is parsed into usable nodes. |
| Raw HTML fetch only | Either | Invoke-WebRequest is still better when you need diagnostics. |
| Troubleshooting blocks | Invoke-WebRequest | Headers and status codes are easier to inspect. |

Fetching a page with Invoke-WebRequest

Use Hacker News as the demo target because it serves stable static HTML and exposes obvious repeating story rows.

$url = "https://news.ycombinator.com/news"
$userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0 Safari/537.36"
$response = Invoke-WebRequest `
    -Uri $url `
    -UserAgent $userAgent `
    -UseBasicParsing
$response | Select-Object StatusCode, Headers, Links, Images

This should return status code 200 and a condensed preview of the response headers, confirming that the request worked and that the response object carries inspectable metadata:

StatusCode Headers
---------- -------
200 {[Server, System.String[]], [Date, System.String[]], [Transfer-Encoding, System.String[]], [Conn...

Configuring a custom user agent matters because PowerShell's default user agent identifies the client as PowerShell. That's fine for normal automation, but it's a weak fingerprint for scraping. Set a realistic user agent and keep your request volume sane.

Parsing HTML with PSParseHTML

Fetching the page gives you a string in $response.Content. Parsing turns that raw HTML into a DOM, a tree representation of an HTML page that lets you select elements by tag, class, ID, or relationship.

Convert the response content into a DOM:

Import-Module PSParseHTML
$document = ConvertFrom-Html -Content $response.Content -Engine AngleSharp

Use QuerySelector() for a single element. It returns the first match or $null:

$firstTitle = $document.QuerySelector(".titleline > a")
$firstTitle.TextContent.Trim()
$firstTitle.GetAttribute("href")

On the other hand, use QuerySelectorAll() for multiple elements. It returns an AngleSharp collection. Use its .Length property when you need the number of matches. Always null-check before reading .TextContent or attributes because sites change markup without asking your scraper for permission:

$storyRows = $document.QuerySelectorAll("tr.athing")
foreach ($row in $storyRows) {
    $titleLink = $row.QuerySelector(".titleline > a")
    if ($null -ne $titleLink) {
        $titleLink.TextContent.Trim()
    }
}

For JavaScript-rendered pages, this pattern won't be enough because Invoke-WebRequest doesn't execute JavaScript.

Selecting and extracting specific data

After fetching and parsing the page, the next step is turning that HTML into usable data. This is where you identify the right selectors, extract the fields you care about, and shape the results into structured PowerShell objects.

Inspecting the target with browser DevTools

Open the target page in a browser, right-click the data you want, and choose Inspect. Look for the smallest stable container that wraps a single complete item.

On Hacker News, each story row is a table row with the class athing. The title link sits under .titleline > a. The points, author, age, and comments live in the next row under .subtext.

Before writing PowerShell, test selectors in the browser console:

document.querySelector(".titleline > a").textContent
document.querySelectorAll("tr.athing").length

If a selector fails in the browser, it won't magically work in PowerShell. Fix it before you code around it.

Extracting multiple data points from a listing

This example extracts the title, URL, points, author, comment count, and scrape timestamp from Hacker News:

$baseUrl = "https://news.ycombinator.com/news"
$userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0 Safari/537.36"
$response = Invoke-WebRequest -Uri $baseUrl -UserAgent $userAgent -UseBasicParsing
$document = ConvertFrom-Html -Content $response.Content -Engine AngleSharp
$stories = foreach ($row in $document.QuerySelectorAll("tr.athing")) {
    $titleLink = $row.QuerySelector(".titleline > a")
    if ($null -eq $titleLink) {
        continue
    }
    $subtext = $row.NextElementSibling.QuerySelector(".subtext")
    $scoreText = $subtext.QuerySelector(".score")?.TextContent
    $author = $subtext.QuerySelector(".hnuser")?.TextContent
    $subtextLinks = $subtext.QuerySelectorAll("a")
    $commentsText = if ($subtextLinks.Length -gt 0) {
        $subtextLinks[$subtextLinks.Length - 1].TextContent
    } else {
        $null
    }
    $points = $null
    if ($scoreText -match "(\d+)") {
        $points = [int]$Matches[1]
    }
    $commentCount = 0
    if ($commentsText -match "(\d+)") {
        $commentCount = [int]$Matches[1]
    }
    $href = $titleLink.GetAttribute("href")
    $absoluteUrl = [Uri]::new([Uri]$baseUrl, $href).AbsoluteUri
    [PSCustomObject]@{
        Title        = $titleLink.TextContent.Trim()
        Url          = $absoluteUrl
        Points       = $points
        Author       = $author
        CommentCount = $commentCount
        ScrapedAt    = [datetime]::UtcNow
    }
}
$stories | Format-Table Title, Points, Author, CommentCount

There are 2 important details worth mentioning here:

  • Optional fields must be null-checked. Some listings won't have points, authors, prices, ratings, or comments.
  • Numeric values are cast to integers. The raw page says "994 points." Your output should store 994 as a number, not as a sentence fragment.

Modeling data with PSCustomObject

PSCustomObject is the standard lightweight shape for structured PowerShell output. You define explicit field names, cast values where useful, and let PowerShell handle the rest.

[PSCustomObject]@{
    Title        = [string]$title
    Url          = [string]$url
    Points       = [Nullable[int]]$points
    Author       = [string]$author
    CommentCount = [int]$commentCount
    ScrapedAt    = [datetime]::UtcNow
}

Typed fields make exports cleaner. CSV columns stay predictable. JSON consumers get stable property names. Your downstream code can compare integers without parsing strings again.

For small pages, collecting objects with foreach is fine. For larger runs, use a generic list so appending stays efficient:

$results = [System.Collections.Generic.List[object]]::new()
$results.Add($storyObject)

Avoid $array += $item inside large loops. It looks clean, but PowerShell creates a new array each time.
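You can verify the difference with Measure-Command; on a few thousand items the gap is already obvious:

# Array += copies the whole array on every append
Measure-Command {
    $arr = @()
    foreach ($i in 1..10000) { $arr += $i }
}

# A generic list appends in place
Measure-Command {
    $list = [System.Collections.Generic.List[int]]::new()
    foreach ($i in 1..10000) { $list.Add($i) }
}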

Handling pagination and multiple items

There are 2 common pagination patterns. Some sites expose predictable URLs, such as ?p=1 and ?p=2. Others expose a "More" or "Next" link, and your scraper needs to follow it until it disappears.

URL-pattern pagination

When the URL pattern is obvious, a for loop is usually enough:

$allResults = [System.Collections.Generic.List[object]]::new()
$userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0 Safari/537.36"
for ($page = 1; $page -le 5; $page++) {
    $url = "https://news.ycombinator.com/news?p=$page"
    Write-Host "Scraping page $page - $url"
    $response = Invoke-WebRequest -Uri $url -UserAgent $userAgent -UseBasicParsing
    $document = ConvertFrom-Html -Content $response.Content -Engine AngleSharp
    $rows = $document.QuerySelectorAll("tr.athing")
    if ($rows.Length -eq 0) {
        break
    }
    foreach ($row in $rows) {
        $titleLink = $row.QuerySelector(".titleline > a")
        if ($null -ne $titleLink) {
            $allResults.Add([PSCustomObject]@{
                Title     = $titleLink.TextContent.Trim()
                Url       = [Uri]::new([Uri]$url, $titleLink.GetAttribute("href")).AbsoluteUri
                ScrapedAt = [datetime]::UtcNow
            })
        }
    }
    Start-Sleep -Seconds 2
}

The stop condition depends on the site. Use an empty result set, a "no results" message, or a known page count. Don't loop until failure if the page gives you a cleaner signal.

The delay matters too. Start-Sleep -Seconds 2 isn't a magic number. It's a reminder that a scraper shouldn't hammer a server just because a loop can run quickly.
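A small refinement worth considering: randomize the pause so the request cadence isn't perfectly mechanical.

# Sleep between 1.5 and 4 seconds instead of a fixed interval
Start-Sleep -Milliseconds (Get-Random -Minimum 1500 -Maximum 4000)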

Link-following pagination

When the next URL isn't predictable, follow the link the site gives you instead. Hacker News exposes a "More" link with the class morelink:

$currentUrl = [Uri]"https://news.ycombinator.com/news"
$allResults = [System.Collections.Generic.List[object]]::new()
$maxPages = 5
$page = 0
while ($null -ne $currentUrl -and $page -lt $maxPages) {
    $page++
    Write-Host "Scraping $currentUrl"
    $response = Invoke-WebRequest `
        -Uri $currentUrl `
        -UserAgent $userAgent `
        -UseBasicParsing
    $document = ConvertFrom-Html -Content $response.Content -Engine AngleSharp
    foreach ($row in $document.QuerySelectorAll("tr.athing")) {
        $titleLink = $row.QuerySelector(".titleline > a")
        if ($null -eq $titleLink) {
            continue
        }
        $allResults.Add([PSCustomObject]@{
            Title     = $titleLink.TextContent.Trim()
            Url       = [Uri]::new($currentUrl, $titleLink.GetAttribute("href")).AbsoluteUri
            ScrapedAt = [datetime]::UtcNow
        })
    }
    $nextLink = $document.QuerySelector("a.morelink")
    if ($null -eq $nextLink) {
        $currentUrl = $null
    }
    else {
        $currentUrl = [Uri]::new($currentUrl, $nextLink.GetAttribute("href"))
    }
    Start-Sleep -Seconds 2
}

This approach follows the site instead of assuming how the site works. The example caps the run at 5 pages, so a copied script doesn't crawl indefinitely; remove $maxPages when you intentionally want to continue until the next-page link disappears.

Collecting results across pages

For multi-page runs, use a list and log progress:

$results = [System.Collections.Generic.List[object]]::new()
Write-Host "Page $page returned $($rows.Length) rows."
Write-Host "Total collected so far: $($results.Count)"

That basic visibility saves time. If a selector breaks on page 7, you want the terminal to tell you where the drop happened.

Scraping JavaScript-rendered content

Invoke-WebRequest sends HTTP requests and receives HTTP responses. It doesn't run JavaScript, wait for client-side data fetching, click buttons, or build the final rendered DOM.

Why Invoke-WebRequest fails on JS-heavy pages

Many modern pages return a thin HTML shell first. JavaScript then calls APIs with fetch() or XHR, receives data, and renders the UI in the browser.

When you run this:

$response = Invoke-WebRequest -Uri "https://quotes.toscrape.com/js/" -UseBasicParsing
$response.Content | Select-String 'class="quote"'

PowerShell only sees the raw server response. If the browser shows quotes but $response.Content doesn't contain rendered quote elements, the target is probably JavaScript-rendered.

Confirm that before changing code. Compare the raw HTML from PowerShell with the rendered DOM in DevTools. If the data appears only after JavaScript runs, use a browser automation tool or a managed scraping service.
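A quick way to run that check from the terminal is a small helper; the function name and the .quote marker below are just illustrative:

function Test-IsClientRendered {
    param([string]$Url, [string]$Marker)
    $raw = (Invoke-WebRequest -Uri $Url -UseBasicParsing).Content
    # If the marker you see in DevTools is missing from the raw response,
    # the element is probably built by JavaScript after the page loads
    return ($raw -notmatch [regex]::Escape($Marker))
}

Test-IsClientRendered -Url "https://quotes.toscrape.com/js/" -Marker 'class="quote"'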

Using the Selenium PowerShell module

Selenium controls a real browser through WebDriver. Install the Selenium PowerShell module:

Install-Module Selenium -Scope CurrentUser -AllowPrerelease
Import-Module Selenium

Note that newer Selenium 4 builds for PowerShell can still be a little rough, and browser-driver compatibility can be an issue. If the browser updates and the bundled driver doesn't match, replace the driver manually or point the module at a matching driver directory.

A simple headless Firefox flow against a JavaScript-rendered demo page looks like this:

$driver = Start-SeDriver -Browser Firefox -State Headless
try {
    Set-SeUrl -Url "https://quotes.toscrape.com/js/"
    $null = Get-SeElement `
        -By CssSelector `
        -Value ".quote" `
        -Timeout 15
    $quotes = Get-SeElement `
        -By CssSelector `
        -Value ".quote"
    foreach ($quote in $quotes) {
        $text = (Get-SeElement -Element $quote -By CssSelector -Value ".text" -Single).Text
        $author = (Get-SeElement -Element $quote -By CssSelector -Value ".author" -Single).Text
        [PSCustomObject]@{
            Text   = $text
            Author = $author
        }
    }
}
finally {
    Stop-SeDriver -Driver $driver
}

Use explicit waits such as Get-SeElement -Timeout 15 before selecting dynamic content. Blind sleeps are easy to write, but annoying to debug.

When to use a managed alternative

Local Selenium is fine for a small dynamic scrape or a debugging task. It becomes expensive when you need concurrency, browser lifecycle management, driver updates, JavaScript rendering, retries, and proxy rotation at the same time.

For those targets, Decodo's Web Scraping API can handle JavaScript rendering server-side, so your PowerShell script sends a request and receives rendered or structured output without running a browser locally.

If your Decodo setup exposes a proxy-style endpoint, you can keep the same PowerShell -Proxy pattern shown later. If you're using the API endpoint directly, call it with Invoke-RestMethod and keep browser infrastructure out of your script.
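The exact endpoint, parameters, and auth scheme depend on your plan, so treat this as a shape rather than a contract; the URL, body fields, and token variable below are placeholders:

# Hypothetical scraping-API call - endpoint, body fields, and auth are placeholders
$body = @{ url = "https://quotes.toscrape.com/js/" } | ConvertTo-Json
$result = Invoke-RestMethod `
    -Uri "https://scraper-api.example.com/v2/scrape" `
    -Method Post `
    -ContentType "application/json" `
    -Headers @{ Authorization = "Basic $env:SCRAPER_API_TOKEN" } `
    -Body $body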

PowerShell hits, proxies rotate

Your script handles the logic. Decodo handles the part where websites pretend you don't exist. Residential proxies, anti-bot bypass, one API.

Error handling and troubleshooting

Timeouts, HTTP errors, empty responses, and broken selectors are just a normal part of scraping, so your script needs to handle them without falling apart.

Common errors and what causes them

HTTP 4xx and 5xx responses are the first category. Invoke-WebRequest throws a terminating error for non-success HTTP responses unless you use -SkipHttpErrorCheck. Catch those errors and inspect the response status code:

try {
    $response = Invoke-WebRequest -Uri $url -ErrorAction Stop
}
catch {
    $statusCode = $null
    if ($_.Exception.Response) {
        $statusCode = $_.Exception.Response.StatusCode.value__
    }
    Write-Warning "Request failed with status $statusCode for $url - $($_.Exception.Message)"
}

SSL and TLS errors usually come from self-signed certificates, corporate inspection proxies, old servers, or legacy Windows PowerShell 5.1 environments with outdated TLS defaults. In PowerShell 7, -SkipCertificateCheck skips certificate validation. Use it only against known test hosts with self-signed certificates.

In Windows PowerShell 5.1, older scripts sometimes change [System.Net.ServicePointManager]::SecurityProtocol or [System.Net.ServicePointManager]::ServerCertificateValidationCallback to get past TLS or certificate failures. That's an outdated workaround: a global certificate-validation bypass affects every web request in the session, which is far too broad for production scraping.

Timeouts should be explicit. In PowerShell 7, use -ConnectionTimeoutSeconds and -OperationTimeoutSeconds instead of waiting indefinitely:

$response = Invoke-WebRequest `
    -Uri $url `
    -ConnectionTimeoutSeconds 10 `
    -OperationTimeoutSeconds 30 `
    -ErrorAction Stop

Older scripts may use -TimeoutSec, especially when written for Windows PowerShell 5.1. That single timeout is still useful when you maintain older scripts, but new PowerShell 7 scripts should separate connection timeout from operation timeout so a slow server doesn't freeze the whole run.

An empty response body with a 200 status usually means one of 3 things:

  • The page renders with JavaScript
  • The server returned different content to your client
  • Your request lacks required headers or session cookies

Selector failures are the most common parsing issue. If $document.QuerySelector(".price") returns $null, the element doesn't exist in the parsed HTML. Null-check it before calling .TextContent:

$priceNode = $document.QuerySelector(".price")
$price = if ($null -ne $priceNode) { $priceNode.TextContent.Trim() } else { $null }

Building a retry wrapper

Retry transient network errors, timeouts, and temporary server errors. Don't retry bad selectors 10 times and pretend something useful happened.

This wrapper accepts a script block, retries with exponential backoff, and rethrows the final error:

function Invoke-WithRetry {
    param(
        [Parameter(Mandatory = $true)]
        [scriptblock]$Operation,
        [Parameter(Mandatory = $true)]
        [string]$Url,
        [int]$MaxAttempts = 3,
        [int]$InitialDelaySeconds = 2
    )
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            return & $Operation
        }
        catch {
            $message = $_.Exception.Message
            Write-Warning "Attempt $attempt failed for $Url - $message"
            if ($attempt -eq $MaxAttempts) {
                throw
            }
            $delay = $InitialDelaySeconds * [math]::Pow(2, $attempt - 1)
            Start-Sleep -Seconds $delay
        }
    }
}
$response = Invoke-WithRetry -Url $url -Operation {
    Invoke-WebRequest `
        -Uri $url `
        -UserAgent $userAgent `
        -ConnectionTimeoutSeconds 10 `
        -OperationTimeoutSeconds 30 `
        -ErrorAction Stop
}

This is the same retry idea you'd use in other languages.

Structured error logging

Don't let failed URLs vanish into terminal scrollback. Save them:

$errorLog = [System.Collections.Generic.List[object]]::new()
try {
    $response = Invoke-WithRetry -Url $url -Operation {
        Invoke-WebRequest -Uri $url -UserAgent $userAgent -ErrorAction Stop
    }
}
catch {
    $statusCode = $null
    if ($_.Exception.Response) {
        $statusCode = $_.Exception.Response.StatusCode.value__
    }
    $errorLog.Add([PSCustomObject]@{
        Url        = $url
        StatusCode = $statusCode
        Error      = $_.Exception.Message
        FailedAt   = [datetime]::UtcNow
    })
}
if ($errorLog.Count -gt 0) {
    $errorLog | Export-Csv -Path ".\error_log.csv" -NoTypeInformation -Encoding UTF8
}

Use -ErrorAction SilentlyContinue only for optional fields where failure is expected.

Using proxies and avoiding blocks

Your requests carry an IP address, a user agent, headers, timing patterns, and session behavior. Anti-bot systems look at those signals together to detect automated behavior.

Why scraping without a proxy gets you blocked

PowerShell's default request tells the server it came from PowerShell, and your real IP sends every request. Run a tight loop against the same URL and you'll quickly hit rate limiting, IP bans, and CAPTCHAs.

Good scraping behavior starts before proxies:

  • Set a realistic user agent
  • Add delays
  • Cache where you can (a minimal cache sketch follows this list)
  • Avoid repeated requests for the same page in a short window
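The caching piece can be as small as a hashtable keyed by URL, so one run never fetches the same page twice; a minimal sketch:

$script:pageCache = @{}

function Get-PageCached {
    param([string]$Url)
    # Serve repeat requests from memory instead of hitting the server again
    if (-not $script:pageCache.ContainsKey($Url)) {
        $script:pageCache[$Url] = (Invoke-WebRequest -Uri $Url -UseBasicParsing).Content
    }
    return $script:pageCache[$Url]
}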

Proxies become necessary when IP reputation, geography, or volume becomes part of the problem.

Passing a proxy with Invoke-WebRequest and Invoke-RestMethod

Both Invoke-WebRequest and Invoke-RestMethod support -Proxy and -ProxyCredential. Use -Proxy for the proxy endpoint and -ProxyCredential for the username and password.

Use environment variables for credentials:

$proxyUri = $env:PROXY_URI
$proxyUser = $env:PROXY_USER
$proxyPassword = ConvertTo-SecureString $env:PROXY_PASSWORD -AsPlainText -Force
$proxyCredential = [PSCredential]::new($proxyUser, $proxyPassword)
$response = Invoke-WebRequest `
    -Uri "https://news.ycombinator.com/news" `
    -UserAgent $userAgent `
    -Proxy $proxyUri `
    -ProxyCredential $proxyCredential `
    -UseBasicParsing

The same pattern works with APIs:

$data = Invoke-RestMethod `
    -Uri "https://api.example.com/items" `
    -Proxy $proxyUri `
    -ProxyCredential $proxyCredential

Realistic headers still matter:

$headers = @{
    "Accept"          = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
    "Accept-Language" = "en-US,en;q=0.9"
}
$response = Invoke-WebRequest `
    -Uri $url `
    -Headers $headers `
    -UserAgent $userAgent `
    -Proxy $proxyUri `
    -ProxyCredential $proxyCredential

Just keep headers coherent. A strange mix of browser, language, platform, and request behavior can look worse than a simple default client.

Proxy type decision framework:

| Proxy type | Best fit | Trade-off |
| --- | --- | --- |
| Datacenter proxies | Public APIs, low-friction sites, fast checks | Cheaper and fast, but easier to flag by IP range |
| Residential proxies | Protected targets, eCommerce pages, search pages, region-sensitive content | More natural IP reputation, usually higher cost |
| ISP proxies | High-volume jobs where a stable residential-looking identity matters | Faster than many residential pools, less flexible than rotating pools |

Rotating IPs per request

With a rotating proxy endpoint, you don't need to manage a local pool of IPs. You send each request through the same proxy endpoint, and the provider can assign a fresh exit IP based on the endpoint or session configuration.

Geo-targeting works the same way. The country, state, city, or session detail usually lives in the proxy username, password, endpoint, or provider dashboard, not in your scraping loop. Keep those values in environment variables so you can switch from a US session to a German session without editing the script.

Your PowerShell code stays simple:

foreach ($url in $urls) {
    $response = Invoke-WebRequest `
        -Uri $url `
        -UserAgent $userAgent `
        -Proxy $proxyUri `
        -ProxyCredential $proxyCredential `
        -ErrorAction Stop
    Start-Sleep -Milliseconds 1500
}

For session-based rotation, reuse the same session value when you need a stable identity for a short workflow. Drop or change that session value when each request can use a fresh exit IP. The PowerShell code stays the same, and the proxy configuration controls the rotation behavior.
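Username formats vary by provider, so copy the exact pattern from your dashboard; purely as an illustration, a sticky session often looks like extra fields appended to the proxy username:

# Illustrative only - the "-session-<id>" username suffix is provider-specific
$sessionId = [guid]::NewGuid().ToString("N").Substring(0, 8)
$stickyUser = "$($env:PROXY_USER)-session-$sessionId"
$proxyCredential = [PSCredential]::new(
    $stickyUser,
    (ConvertTo-SecureString $env:PROXY_PASSWORD -AsPlainText -Force)
)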

Exporting and structuring scraped data

PowerShell makes it easy to save the output in a format that other tools can read because your scraped items are already objects.

Exporting to CSV

Use CSV for flat, tabular data:

$results | Export-Csv `
    -Path ".\output\results.csv" `
    -NoTypeInformation `
    -Encoding UTF8

Always use -NoTypeInformation; without it, older PowerShell versions add a type-metadata row that spreadsheet and BI tools don't need. PowerShell 7 already omits that row by default, but keeping the switch makes the script behave the same on Windows PowerShell 5.1.

For recurring jobs, append to an existing file:

$results | Export-Csv `
    -Path ".\output\results.csv" `
    -NoTypeInformation `
    -Encoding UTF8 `
    -Append

UTF-8 is the safe default for international text, product names, author names, and symbols scraped from pages.

Exporting to JSON

Use JSON when records have nested structure or need to feed an API, pipeline, or document database:

$results | ConvertTo-Json -Depth 3 | Out-File ".\output\results.json" -Encoding UTF8

For versioned run history, add a timestamp:

$timestamp = Get-Date -Format "yyyyMMdd_HHmmss"
$jsonPath = ".\output\results_$timestamp.json"
$results | ConvertTo-Json -Depth 3 | Out-File $jsonPath -Encoding UTF8

For production-style exports, wrap the data with metadata:

$export = [PSCustomObject]@{
    SourceUrl  = "https://news.ycombinator.com/news"
    ScrapedAt  = [datetime]::UtcNow
    TotalItems = $results.Count
    Results    = $results
}
$export | ConvertTo-Json -Depth 5 | Out-File $jsonPath -Encoding UTF8

That wrapper makes the file self-describing. You can inspect it 3 months later and still know where it came from and when it was created.
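Reading it back is symmetric, so downstream scripts get the metadata and the items in one object:

$import = Get-Content $jsonPath -Raw | ConvertFrom-Json
$import.SourceUrl
$import.TotalItems
$import.Results | Select-Object -First 5 Title, Points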

Console summary

End each run with a short summary:

Write-Host ""
Write-Host "Scrape complete."
Write-Host "Items scraped: $($results.Count)"
Write-Host "Errors: $($errorLog.Count)"
if ($results.Count -gt 0) {
    Write-Host "First item: $($results[0].Title)"
    Write-Host "Last item: $($results[$results.Count - 1].Title)"
}
$results | Select-Object -First 10 Title, Points, Author | Format-Table

This is a form of operational feedback. A scraper that silently outputs an empty file looks successful until someone checks the data.

Final thoughts

PowerShell can handle the full scraping pipeline: setup, fetch, parse, select, paginate, retry, proxy, and export. It's a great choice for static HTML and simple APIs, because the shell already treats structured data as objects.

Use PowerShell 7 for anything you expect to maintain. Windows PowerShell 5.1 can still fetch pages, but its old HTML parsing behavior belongs to a different era. In PowerShell 7, pair Invoke-WebRequest with PSParseHTML and keep the parsing model explicit.

If JavaScript rendering, browser maintenance, proxy rotation, and anti-bot handling become the project, a managed scraping layer can be cheaper than maintaining your own scraping code.

403 Forbidden? Shocking

Decodo's Web Scraping API returns actual data instead of error codes. Rendering, CAPTCHAs, proxy rotation, all handled before your script parses a thing.

About the author

Justinas Tamasevicius

Director of Engineering

Justinas Tamaševičius is Director of Engineering with over two decades of expertise in software development. What started as a self-taught passion during his school years has evolved into a distinguished career spanning backend engineering, system architecture, and infrastructure development.


Connect with Justinas via LinkedIn.

All information on Decodo Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Can PowerShell scrape JavaScript-rendered pages?

Not by itself. Invoke-WebRequest only gets the raw HTTP response, so content rendered after JavaScript runs won't appear. For those pages, use Selenium with a headless browser or a scraping API that renders the page server-side.

What is the difference between Invoke-WebRequest and Invoke-RestMethod for scraping?

Invoke-WebRequest returns the full response object, so it's better for HTML scraping and debugging. Invoke-RestMethod returns parsed content and auto-converts JSON or XML into PowerShell objects, so it's the better fit for APIs.

Do I need PowerShell 7 or can I use Windows PowerShell 5?

Windows PowerShell 5.1 can scrape, but PowerShell 7 is the better baseline. It runs cross-platform, installs side by side with 5.1, and avoids the old Internet Explorer-based HTML parsing path.

How do I avoid getting blocked while scraping with PowerShell?

Use a realistic user agent, coherent headers, delays between requests, and retries with backoff. For IP-based blocking, use proxies through -Proxy and -ProxyCredential. For harder targets, use rotating residential proxies or a managed scraping solution.

Is web scraping with PowerShell legal?

That depends on the site, the data, your jurisdiction, and how you use the results. For commercial use, review the site's terms, confirm what data you're collecting, and get legal advice if the risk is unclear.
