Back to blog

How To Scrape Emails From a Website: Python Tutorial

Scraping emails from a website is essential for lead generation, partner research, and CRM enrichment. However, to reliably scrape emails from a website, you need to handle multiple formats, including mailto links, plain-text addresses, obfuscated strings, and JavaScript-rendered content. This guide shows how to safely build a Python email scraper and scale it into a multi-page crawling workflow.

TL;DR

  • There are 3 main approaches to email scraping: handling mailto links, regex, and obfuscation normalization. 
  • Check legalities such as Terms of Service, robots.txt, GDPR, and the CAN-SPAM Act before email scraping.
  • Scrape a single webpage and crawl an entire domain for emails.
  • Resolve scraping blocks, such as rotating residential proxies, Playwright for JS targets, and third-party scraping APIs.
  • Create a pipeline that saves, cleans, and deduplicates the output using best practices, including normalization, syntax revalidation, and noise filtering.
  • Automate your email scraper workflows using tools such as cron, Task Scheduler, and n8n. 
  • Build a custom scraper for niche domains with specialized crawling logic, but use a dedicated email finder for large lists with built-in CRM integrations. 

Use cases and benefits of email scraping

  • B2B lead generation and outreach. A developer who recently built a new product needs investors. However, instead of purchasing outdated bulk email lists for B2B outreach and lead generation, email scraping delivers higher-quality data by directly finding publicly listed contact addresses on potential companies' websites.
  • CRM enrichment. Whether you use Zoho, HubSpot, or any other CRM tool, you can supplement existing contact databases by email scraping, especially those with missing email fields resulting from false positives from bulk email crawling. 
  • Academic and journalistic research. If the target outreach audience for your software solution is largely academic and editorial, access their readily available contact information on their publishing or institution's website. 
  • Competitive intelligence. Email scraping reveals competitive insights for website building. Web scraping competitors' email addresses exposes the layout of their websites and the roles they publicly advertise. This is handy information for developers when building or updating websites.

Note that domain-scoped targeted scraping, which involves scraping specific URLs of a site, produces better results – lower false-positive rate, higher deliverability, and less irrelevant data to filter than broad crawling, which focuses on retrieving all the pages of a particular site with lots of irrelevant data. 

Before writing a line of code, consider how web scraping legalities shape your engineering decisions; this will save you from being susceptible to unprecedented lawsuits in web scraping. Here's a practical compliance checklist. 

  • GDPR (EU). General Data Protection Regulation (GDPR) is the toughest privacy and security law that sets out the implications of collecting contact data from the web, the lawful requirements before processing the data, and the distinction between the careful usage of personal and B2B company contact data.
  • Terms of service (ToS) and robots.txt. These pages contain the specific scraping permissions or authentication requirements for a target site before the scraper runs. Even though the web data is publicly available, developers risk legal violations by skipping ToS or robots.txt checks. The ToS can be found in the footer section of websites, while the robots.txt file can be found by adding /robots.txt to the end of the website's root domain. For example, https://decodo.com/robots.txt if your target site uses a "www" subdomain. You can programmatically include robot.txt and ToS links in your scraping script to guide the scraper before scraping commences.  
  • CAN-SPAM Act (US). This prevents unintended post-scraping actions, such as spammy commercial messages, by setting rules for sending bulk emails and clarifying opt-out requirements for disinterested leads during outreach. 
  • Data Downstream Use. Scrape emails for targeted outreach, not blatantly harvesting emails with no legitimate use, but rather to spam contacts unnecessarily. 

For a more robust knowledge on the legal coverage of web scraping, "Is web scraping legal?" is a helpful piece to help you. Once you're done with the safety compliance checks, you are ready to scrape.

Technical methods for email extraction

The appropriate methods for email extraction in your scraper depend on how emails are rendered on your target websites. Here are the 3 main approaches to displaying contacts on websites. 

  • mailto links. This is the most obvious approach on websites. It involves embedding a link in a Contact us statement or mail icon using an anchor tag (<a>) whose href starts with mailto, for example, <a href="mailto:contact@example.com">Contact</a>. If you encounter a more complex selector structure for emails, "How To Choose The Right Selector For Web Scraping" is a helpful article with great tips. 
  • regex on raw HTML. If your target website decides not to take the common approach but rather to render emails in plain-text paragraphs (<p>), footers, and div content, as is common on academic institution websites, use regex as a fallback for efficient scraping. In your Python code, use regex to bypass false positives, such as image filenames containing @, version strings like 2.0@stable, and to cover other filtering strategies helpful for parsing and retrieving emails rendered as plain text.
  • Obfuscated email formats. This is when your target website has a mission to trick scrapers and thus decides to hide emails. It achieves this through rendering emails on web pages in non-obvious formats, for example, info[at]domain[dot]com instead of info@domain.com. They could go further by implementing HTML entity encoding, which renders emails in ASCII or Unicode characters; for example, 'e&#64;mail&#46;com'. Another level is using CSS-reversed text, such as moc.liame@drowssap, to display email addresses in reversed format on websites. So, when scraping at scale, normalize outputs to retrieve usable data. 
  • JavaScript-rendered emails. When target websites render their emails via JavaScript, your email scraper returns 'none'. Successfully scrape with a headless browser like Playwright for a single-page, one-time use, or a managed scraping API like Decodo's Web Scraping API if you are scraping at scale or across multiple pages with such a configuration.

Note. JavaScript-based email rendering or high email obfuscation might imply that the website doesn't want the data scraped, so it's worth double-checking their ToS or robot.txt before commencing. 

Once everything is done, let's get started building our email scraper. 

Setting up your Python environment

The first step in building an email scraper is to set up a Python development environment tailored towards efficiency and scalability. 

Python environment

Create a project folder

mkdir email_scraper
cd email_scraper

Install Python 3.9+ and core dependencies like: 

  • Requests. For making HTTP requests.
  • Beautiful Soup 4 (bs4). Acts as your scraping driver for easily scrapable target websites, but use Playwright if you are scraping JavaScript-heavy sites. If you are new to Beautiful Soup, this article on Beautiful Soup web scraping covers the essentials. 
  • lxml. For parsing HTML or XML pages. Choose lxml over Python's built-in HTML parser because it's faster, has a low memory footprint during parsing, and excels at repairing broken HTML. 
  • Python-dotenv. For loading secrets (API keys, Decodo's proxy credentials) stored in your .env file.
pip install requests beautifulsoup4 lxml python-dotenv

Create and activate a virtual environment 

py -m venv .venv
venv\Scripts\activate

To retrieve your proxy credentials, first sign up for a Decodo account. In the dashboard, select Residential, and start a free trial.

Once your free trial is activated, click Residential → Proxy setup page on your dashboard. Then click Authentication, where you will be asked to add a user or choose your whitelisted IP address. Choose 'Add User', copy your proxy credentials into your .env file, and you are good to go. Ensure the fields of the copied proxies match those below.

Include these fields in your .env file.

PROXY_USERNAME='YOUR_PROXY_USERNAME'
PROXY_PASSWORD='YOUR_PROXY_PASSWORD'
PROXY_HOST='YOUR_PROXY_HOST'
PROXY_PORT='YOUR_PROXY_PORT'

Residential proxies are often preferred for continuous scraping because datacenter proxies, offered by cloud providers, are more easily identified and blocked by anti-scraping systems. 

Suggested file layout

If you are scraping at scale, here's a recommended project structure for scalability and modularity in your email_scraper folder. 

email-scraper/
├── config.py # Stores configuration and loads proxy credentials
├── extractor.py # Handles email extraction logic
├── crawler.py # Manages link-following and crawl queue
├── main.py # Entry point that runs the scraper and handles output
├── .env # Stores proxy credentials (not committed)
├── .gitignore # Excludes sensitive files like .env from version control
├── requirements.txt # Project dependencies
└── README.md # Project documentation

Scale your scraping

Bypass all email scraping blocks with Decodo's elite residential proxy network.

Extracting emails from a single page

This section is a practical guide to extracting email addresses from a single webpage using 3 technical layers: mailto links, parsing raw HTML with regex, and handling obfuscated email addresses in our code. Fixing all layers in a single script makes your scraper more efficient by saving you time from adding other layers if one fails. We will use Decodo's About page URL for this tutorial. Since we have installed the dependencies, let's write our script.

1. Fetch the page HTML using requests. Set a realistic User-Agent to prevent the default Requests library user agent, which is highly recognizable to anti-bots. Then define a timeout and handle HTTP errors.

import requests
URL = "https://decodo.com/about"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
def fetch_html(url):
try:
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
return response.text
except requests.exceptions.RequestException as e:
print(f"Request failed: {e}")
return ""

2. Layer 1. Extract mailto links. This code selects all anchor tags with href containing 'mailto' –  <a href="mailto:...">, removes query strings, and normalizes emails to lowercase.

from bs4 import BeautifulSoup
def extract_mailto(html):
soup = BeautifulSoup(html, "lxml")
emails = set()
for tag in soup.select('a[href^="mailto:"]'):
href = tag.get("href", "")
email = href.replace("mailto:", "").split("?")[0]
emails.add(email.lower())
return emails

3. Layer 2. Extract emails with regex to capture emails rendered as plain text in the HTML. 

import re
EMAIL_REGEX = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
def extract_with_regex(html):
matches = re.findall(EMAIL_REGEX, html)
return set(email.lower() for email in matches)

Regex breakdown:

  • [a-zA-Z0-9._%+-]+. The username part.
  • @. The separator.
  • [a-zA-Z0-9.-]+. The domain name.
  • \.[a-zA-Z]{2,} . The top-level domain (e.g., .com, .org).

Remember to keep an eye out for common false positives, such as image file names like logo@2x.png. Depending on the site, you may need to apply additional filters.

4. Layer 3. Handle obfuscated emails. Normalize formats like john[at]company[dot]com or sales(at)example(dot)io before applying regex again. This improves coverage of hidden emails.

def normalize_obfuscation(text):
text = text.lower()
replacements = {
r"\s*\[at\]\s*": "@",
r"\s*\(at\)\s*": "@",
r"\s*\[dot\]\s*": ".",
r"\s*\(dot\)\s*": ".",
}
for pattern, replacement in replacements.items():
text = re.sub(pattern, replacement, text)
return text
def extract_obfuscated(html):
normalized = normalize_obfuscation(html)
return extract_with_regex(normalized)

5. Combining results. Each method captures different cases. Combine results into a single set for deduplication – to avoid duplication.

def extract_all_emails(html):
emails = set()
emails.update(extract_mailto(html))
emails.update(extract_with_regex(html))
emails.update(extract_obfuscated(html))
return emails

6. Add Decodo's rotating residential proxy to your email scraper. This sets the stage for proxy rotation to prevent IP flagging by routing the request via multiple IP addresses

7. Load Decodo proxy credentials from your .env file into your script. This keeps your credentials safe. 

from dotenv import load_dotenv
import os
load_dotenv()
PROXY_USERNAME = os.getenv("PROXY_USERNAME")
PROXY_PASSWORD = os.getenv("PROXY_PASSWORD")
PROXY_HOST = os.getenv("PROXY_HOST")
PROXY_PORT = os.getenv("PROXY_PORT")

8. Build your proxy URL and use proxies in requests. Also, add the predefined timeout and headers in requests.

proxy_url = f"http://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_HOST}:{PROXY_PORT}"
proxies = {
"http": proxy_url,
"https": proxy_url,
}
response = requests.get(url, headers=HEADERS, proxies=proxies, timeout=10)

9. Verify your proxy setup. Confirm it’s working by comparing your proxy IP with your original IP.

def verify_proxy():
try:
# Step 1: Get real IP (no proxy)
real_ip = requests.get(
"https://api.ipify.org?format=json",
timeout=10
).json()["ip"]
# Step 2: Get proxy IP
proxy_ip = requests.get(
"https://api.ipify.org?format=json",
proxies=proxies,
timeout=10
).json()["ip"]
print(f"[INFO] Real IP: {real_ip}")
print(f"[INFO] Proxy IP: {proxy_ip}")
# Step 3: Compare
if real_ip != proxy_ip:
print("[SUCCESS] Proxy is working correctly ✅")
else:
print("[WARNING] Proxy is NOT working ❌")
except Exception as e:
print("[ERROR] Proxy test failed:", e)

10. Put it together. Here's the full working single-page email scraper script with proxies.

# import requests
from bs4 import BeautifulSoup
import re
import requests
import os
from dotenv import load_dotenv
import requests
# Load environment variables
load_dotenv()
# Proxy credentials
PROXY_USERNAME = os.getenv("PROXY_USERNAME")
PROXY_PASSWORD = os.getenv("PROXY_PASSWORD")
PROXY_HOST = os.getenv("PROXY_HOST")
PROXY_PORT = os.getenv("PROXY_PORT")
# Build proxy URL
proxy_url = f"http://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_HOST}:{PROXY_PORT}"
proxies = {
"http": proxy_url,
"https": proxy_url,
}
# Target URL
URL = "https://decodo.com/about#gref"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
EMAIL_REGEX = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
def fetch_html(url):
try:
response = requests.get(
url,
headers=HEADERS,
proxies=proxies,
timeout=10
)
response.raise_for_status()
return response.text
except requests.exceptions.RequestException as e:
print(f"[ERROR] Request failed: {e}")
return ""
def extract_mailto(html):
soup = BeautifulSoup(html, "lxml")
emails = set()
for tag in soup.select('a[href^="mailto:"]'):
href = tag.get("href", "")
email = href.replace("mailto:", "").split("?")[0]
emails.add(email.lower())
return emails
def extract_with_regex(html):
matches = re.findall(EMAIL_REGEX, html)
return set(email.lower() for email in matches)
def normalize_obfuscation(text):
text = text.lower()
replacements = {
r"\s*\[at\]\s*": "@",
r"\s*\(at\)\s*": "@",
r"\s+at\s+": "@",
r"\s*\[dot\]\s*": ".",
r"\s*\(dot\)\s*": ".",
r"\s+dot\s+": ".",
}
for pattern, replacement in replacements.items():
text = re.sub(pattern, replacement, text)
return text
def extract_obfuscated(html):
normalized = normalize_obfuscation(html)
return extract_with_regex(normalized)
def extract_all_emails(html):
emails = set()
emails.update(extract_mailto(html))
emails.update(extract_with_regex(html))
emails.update(extract_obfuscated(html))
return emails
def verify_proxy():
try:
real_ip = requests.get(
"https://api.ipify.org?format=json",
timeout=10
).json()["ip"]
proxy_ip = requests.get(
"https://api.ipify.org?format=json",
proxies=proxies,
timeout=10
).json()["ip"]
print(f"[INFO] Real IP: {real_ip}")
print(f"[INFO] Proxy IP: {proxy_ip}")
if real_ip != proxy_ip:
print("[SUCCESS] Proxy is working correctly ✅")
else:
print("[WARNING] Proxy is NOT working ❌")
except Exception as e:
print("[ERROR] Proxy test failed:", e)
def main():
print("[INFO] Verifying proxy...")
verify_proxy()
print(f"\n[INFO] Fetching: {URL}")
html = fetch_html(URL)
if not html:
print("[ERROR] No HTML retrieved.")
return
emails = extract_all_emails(html)
print(f"\n[INFO] Found {len(emails)} unique email(s):\n")
for email in sorted(emails):
print(email)
if __name__ == "__main__":
main()

Run the script

python email_extractor.py

The output is every email it finds listed on Decodo's About page

For more context on Python web scraping, read Python Web Scraping: In-Depth Guide

Crawling an entire domain for emails

Let's take our email scraper a step further by expanding it from a single-page scraper into a domain crawler that follows internal links and collects emails across all reachable pages. We'll build a simple domain crawler using the seed URL - https://decodo.com. This crawler discovers and processes multiple pages across Decodo's website. 

1. Configuring Decodo's residential proxies at the session level. Route every request through a proxy without repeating the configuration. Using Decodo's Residential Proxies at the session level ensures requests automatically rotate IPs, reducing the risk of blocking during sustained crawling.

import requests
import os
from dotenv import load_dotenv
load_dotenv()
PROXY = f"http://{os.getenv('PROXY_USERNAME')}:{os.getenv('PROXY_PASSWORD')}@{os.getenv('PROXY_HOST')}:{os.getenv('PROXY_PORT')}"
session = requests.Session()
session.headers.update({
"User-Agent": "Mozilla/5.0"
})
session.proxies.update({
"http": PROXY,
"https": PROXY,
})

2. Domain scoping. Avoid crawling irrelevant or restricted resources by limiting the crawler to the same domain and excluding non-HTML files.

from urllib.parse import urlparse
START_URL = "https://decodo.com"
BASE_DOMAIN = urlparse(START_URL).netloc
def is_valid_url(url):
parsed = urlparse(url)
# Stay within the Decodo domain
if parsed.netloc != BASE_DOMAIN:
return False
# Skip non-HTML resources
if url.endswith((".pdf", ".jpg", ".png", ".js", ".css")):
return False
return True

3. Visited URL tracking. Prevent processing the same page multiple times by maintaining a set of visited URLs while crawling.

visited = set()
if url in visited:
continue
visited.add(url)

4. Page prioritization. Prioritize pages that are more likely to contain email addresses before exploring the rest of the site.

PRIORITY_PATHS = ["/contact", "/about", "/team", "/people"]
def prioritize_links(links):
priority = []
normal = []
for link in links:
if any(path in link for path in PRIORITY_PATHS):
priority.append(link)
else:
normal.append(link)
return priority + normal

5. Extract internal links with Beautiful Soup. The find_all() function finds the internal links. 

from bs4 import BeautifulSoup
from urllib.parse import urljoin
def extract_links(html, base_url):
soup = BeautifulSoup(html, "lxml")
links = set()
for tag in soup.find_all("a", href=True):
full_url = urljoin(base_url, tag["href"])
if is_valid_url(full_url):
links.add(full_url)
return links

6. Process pages iteratively using a queue. Continue the crawl loop until the queue is empty or a crawl limit is reached.

from collections import deque
queue = deque([START_URL])
while queue:
url = queue.popleft()
if url in visited:
continue
print(f"[INFO] Crawling: {url}")
visited.add(url)
html = session.get(url, timeout=10).text
# Extract emails (reuse earlier logic)
emails = extract_emails(html)
# Extract and prioritize links
links = extract_links(html, url)
links = prioritize_links(links)
for link in links:
if link not in visited:
queue.append(link)

7. Randomize delays between requests (5 to 20 seconds)

import time
import random
time.sleep(random.uniform(2, 5))

8. Put it together. Here's the complete script.

import requests
from bs4 import BeautifulSoup
import re
import os
import time
import random
from dotenv import load_dotenv
from urllib.parse import urljoin, urlparse
from collections import deque
# Load environment variables
load_dotenv()
# Proxy configuration (Decodo)
PROXY = f"http://{os.getenv('PROXY_USERNAME')}:{os.getenv('PROXY_PASSWORD')}@{os.getenv('PROXY_HOST')}:{os.getenv('PROXY_PORT')}"
# Session setup
session = requests.Session()
session.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
})
session.proxies.update({
"http": PROXY,
"https": PROXY,
})
# Target: Decodo
START_URL = "https://decodo.com"
BASE_DOMAIN = urlparse(START_URL).netloc
# Email regex
EMAIL_REGEX = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
# Priority pages
PRIORITY_PATHS = ["/about", "/contact", "/company", "/team", "/careers"]
def fetch_html(url):
try:
response = session.get(url, timeout=10)
response.raise_for_status()
return response.text
except:
return ""
def extract_emails(html):
return set(re.findall(EMAIL_REGEX, html))
def is_valid_url(url):
parsed = urlparse(url)
# Stay within the Decodo domain
if parsed.netloc != BASE_DOMAIN:
return False
# Skip non-HTML files
if url.endswith((".pdf", ".jpg", ".png", ".js", ".css")):
return False
return True
def extract_links(html, base_url):
soup = BeautifulSoup(html, "lxml")
links = set()
for tag in soup.find_all("a", href=True):
full_url = urljoin(base_url, tag["href"])
if is_valid_url(full_url):
links.add(full_url)
return links
def prioritize_links(links):
priority = []
normal = []
for link in links:
if any(path in link for path in PRIORITY_PATHS):
priority.append(link)
else:
normal.append(link)
return priority + normal
def crawl(max_pages=20):
queue = deque([START_URL])
visited = set()
all_emails = set()
while queue and len(visited) < max_pages:
url = queue.popleft()
if url in visited:
continue
print(f"[INFO] Crawling: {url}")
visited.add(url)
html = fetch_html(url)
if not html:
continue
# Extract emails
emails = extract_emails(html)
if emails:
print(f"[INFO] Found {len(emails)} email(s) on this page")
all_emails.update(emails)
# Extract and prioritize links
links = extract_links(html, url)
links = prioritize_links(links)
for link in links:
if link not in visited:
queue.append(link)
# Delay to avoid detection
time.sleep(random.uniform(2, 5))
return all_emails
def main():
print("[INFO] Starting crawl on Decodo...\n")
emails = crawl()
print(f"\n[INFO] Total unique emails found: {len(emails)}\n")
for email in sorted(emails):
print(email)
if __name__ == "__main__":
main()

Then, run the script.

python email_crawler.py

The result is the combination of all emails found across all pages.

Remember that residential proxies are often preferred for continuous scraping because datacenter proxies, offered by cloud providers, are more easily identified and blocked by anti-scraping systems. 

For a deeper look at crawling architecture, read Getting started with web crawler Python development.

Handling JavaScript-rendered emails and anti-bot measures

Identifying the JavaScript rendering gap

If your scraper returns no emails, but emails are visible on the browser, it means that JavaScript injects the address after page load. Confirm this by viewing the page source - view-source or DevTools inspector. Press Ctrl+U to view the page source; this renders the HTML returned from the server. If the email is missing here but visible on the page, then it is injected later by JavaScript.  

Press F12 on Windows or Cmd+Option+I on Mac to open DevTools. This shows the rendered DOM – the final web structure after JavaScript has loaded. If DevTools shows the emails but view-source doesn't, then it's injected by JavaScript after the page loads. Your scraper only sees what the server sends, so it comes back empty. The gap is a signal to switch to a headless browser for scraping, such as Playwright.

Using Playwright in headless mode

Playwright launches a headless Chromium browser, executes JavaScript the way a real browser does, and gives you the fully rendered HTML. It loads the page, waits for the email element to appear, calls page.content() to retrieve the rendered HTML, and runs the same email-extraction functions used in the single-page section. Here's a simple setup:

from playwright.async_api import async_playwright
import asyncio
async def get_rendered_html(url):
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url)
await page.wait_for_selector("a[href^='mailto:']", timeout=5000)
html = await page.content()
await browser.close()
return html

For a deeper walkthrough, read Playwright web scraping: a practical tutorial.

Basic stealth configuration for Playwright

It's important to note that anti-bots easily detect headless browsers. Here are some steps to reduce Playwright's detection signal:

  • Set a realistic User-Agent string that matches a current desktop browser.
  • Configure viewport dimensions to a standard resolution (e.g., 1920×1080).
  • Remove the navigator.webdriver flag that explicitly identifies automated sessions.
await page.set_extra_http_headers({"User-Agent": "Mozilla/5.0 ..."})
await page.set_viewport_size({"width": 1920, "height": 1080})
await page.add_init_script("delete Object.getOwnPropertyDescriptor(navigator, 'webdriver')")

How to notice anti-bot systems

Here's how you can tell if anti-bots have detected your scraper:

  • Sites return 403s on IP blocks or show empty pages after too many requests from a single address
  • Experiencing CAPTCHA on contact forms when bot-like behavior is detected
  • Honeypot links. These are links on web pages invisible to humans but visible to scrapers as a trap to detect scrapers, since only scrapers can locate such links, which are often irrelevant. 
  • Rate limiting that reduces request intervals to a nearly unusable pace.

For a more detailed breakdown of methods, check Anti-Scraping Methods and How to Outsmart Them.

Rotating residential proxies vs. site unblockers 

Some target websites go beyond IP addresses; they check TLS (Transport Layer Security) handshake metadata, browser fingerprinting, or CAPTCHA-backed protection, and even with Playwright and rotating IP addresses alone, they can't be bypassed. That's where Decodo's Site Unblocker comes in. It's the right escalation for special targets that specifically identify and block automated browser environments, not the starting point for every scrape.

Saving, deduplicating, and validating results

Raw email-scraping output from a multi-page crawl can be messy, with duplicates, false positives, and noise from system addresses. This section outlines a pipeline that cleans the scraped data and sends it to your CRM, AI mailing tool, or spreadsheet.

  • Normalization. Before deduplication, standardize every address by lowercasing, stripping surrounding whitespace, and unnecessary punctuation that regex sometimes captures at string boundaries. Stripping it at first contact, before output, prevents duplicates caused by formatting differences.
email = email.lower().strip().strip(".,;:")
  • Deduplicate across the full crawl. Use a Python set function – set(), not just per page but across the entire crawl loop, so that the same address on the contact and about page collapses into a single record.
  • Syntax re-validation. Run the set output through a stricter regex to discard false positives, image filenames like image@2x.png or version strings like 3.0@beta that the initial extraction sometimes retrieves.
  • Noise filtering. Skip role and system accounts while scraping unless your use case specifically needs them.
SKIP_PREFIXES = ("noreply@", "no-reply@", "privacy@", "legal@", "admin@", "support@")
filtered = [e for e in valid_emails if not e.startswith(SKIP_PREFIXES)]
  • CSV output. Write the results to CSV with header row fields, namely emailsource_url, and scraped_at, and ensure each row corresponds to one email address, making it ready for direct import into CRM, LLM mailing tools, or spreadsheets. For expanded CSV, Excel, and database patterns, see How to save scraped data to CSV, Excel, and databases in Python.
  • JSON output for pipeline integration. Wrap results in a metadata object. It ensures traceability and facilitates efficient parsing in automated workflows. 
import json
{
"metadata": {
"seed_url": "https://example.com/contact",
"crawl_timestamp": "2026-04-09T10:00:00Z",
"domain": "example.com",
"total_emails_found": 3
},
"results": {
"emails": [
"contact@example.com",
"support@example.com",
"info@example.com"
]
}
}

Automation and integration options

Although a single scraper run meets our testing requirements, the scraper requires automated scheduling operations to handle recurring tasks, including monitoring the contact page, entering CRM data, and managing the domain list.

  • Scheduling with cron (Linux/macOS) or Task Scheduler (Windows). Scheduling email scraping for periodic runs on an already defined domain list is useful for tracking how contact pages change over time and reduces load on the target website. Achieve this on Linux/macOS by adding a cron entry to run the scraper on a defined interval.
0 8 * * * /usr/bin/python3 /path/to/email_scraper.py
#This runs 08:00 daily

On Windows, use Task Scheduler with a .bat file containing a script that points to your scraper:

@echo off
cd /d "C:\Users\NEW USER\Desktop\Work\Schedule Scraper\"
python scraper.py
echo scraper executed at %date% %time% >> scraper.log

The setup is covered in detail in this Python web scraping scheduling guide.

  • n8n integration. If your team uses low-code options for scraping, like n8n, the scraper can run as a workflow node. From there, you can:
     
    • Route new addresses to a CRM via webhook.
    • Send a Slack notification when new emails are found.
    • Filter results before export based on domain or prefix rules.

Read the complete guide to building n8n web scraping workflows for the full walkthrough of the integration.

  • Google Sheets as an output destination. For teams that prefer a spreadsheet-based workflow, write results directly to Google Sheets using the gspread library or the Sheets API:
import gspread
gc = gspread.service_account(filename="credentials.json")
sh = gc.open("Email Results")
ws = sh.sheet1
ws.append_rows([[e["email"], e["source_url"], e["scraped_at"]] for e in results])

For more information on routing to a spreadsheet-based workflow, read 'Google Sheets web scraping: an ultimate guide'.

  • Replacing the proxy layer with the Web Scraping API. Managing rotating proxies and headless browser infrastructure adds maintenance overhead that compounds as scale increases. Decodo's Web Scraping API removes that layer: pass a URL, get back clean, rendered HTML, and run the same extraction functions. No proxy pool to maintain and no browser infrastructure to host. It's the right move for keeping scraping workflows simple when teams scraping at scale do not want to manage rotating proxies, headless browsers, or retry logic themselves. 

Email scraping tools: When a dedicated solution makes sense

Building your own scraper isn't always the right call. The decision comes down to what you're targeting and what you need from the data.

When DIY scraping wins

  • The domains you're targeting are niche or custom, not covered by commercial databases, and you're cost-sensitive at a moderate scale
  • You need full control over data handling and storage
  • Your use case requires domain-specific crawling logic

When a dedicated tool wins

  • You need LinkedIn emails and other gated professional directories. LinkedIn requires authentication and explicitly prohibits automated collection; web scraping won't work there. You can retrieve only with dedicated tools. 
  • You need large-scale, verified data with built-in bounce rate reduction
  • Compliance features like GDPR (EU) are already built into the email finder tool
  • The tool already has built-in CRM integrations that eliminate the need for a custom output pipeline

Key criteria when evaluating tools

  • Data freshness. This refers to how often dedicated email finder tools update their databases, whether weekly, biweekly, or monthly.  
  • Verification rates. Read the accuracy percentage claims for the tools and double-check them against customer reviews. 
  • CRM integrations and API access. Top tools offer CRM integrations with APIs to update their databases in real time. 
  • Pricing models (pay-as-you-go vs. subscription). Know which tool works best with your budget. Subscription works for large-scale continuous retrieval, while pay-as-you-go works for one-off or minimal use cases. 
  • GDPR and other ethical compliance postures. Choose tools that comply with the legal and ethical requirements of email scraping. 

These factors will separate the tools that fit your use case from those that don't.

The hybrid approach

Many top-performing teams don't choose one option. They use a combination of Do It Yourself (DIY) scraping with rotating proxies and other scraping mechanisms for publicly crawlable web sources, and then verified email databases with email-finder tools like Hunter.io, Apollo.io, and Snov.io for gated or dataset-ready sources. This combination covers more ground than a single approach.

In case you are thinking about which email finder tools to use, here's an overview of email finder tools for various use cases.

Tool

Core Strength

Key Features

Accuracy in domain searches and robust email verification

Domain searches, email verification, and a Chrome extension.

All-in-one sales intelligence and engagement platform

Massive contact database, CRM integrations, and outreach automation.

Versatile, all-in-one cold outreach and lead generation.

Email finding, CRM features, cold outreach, and email deliverability management.

Final thoughts

Effective email extraction combines multiple techniques, including parsing mailto links, applying regex for plain-text addresses, normalizing obfuscated formats, and handling JavaScript-rendered content with Playwright. A reliable crawler builds on this by starting from a seed, prioritizing high-value pages, restricting traversal to the same domain, and maintaining deduplicated results. Always remember to pair technical implementation with compliance. Check robots.txt before crawling, introduce request delays, and evaluate how collected data will be used downstream.

Extract emails at scale

Decodo's Web Scraping API pulls structured data from any site without running into blocks, CAPTCHAs, or wasting IPs.

About the author

Lukas Mikelionis

Senior Account Manager

Lukas is a seasoned enterprise sales professional with extensive experience in the SaaS industry. Throughout his career, he has built strong relationships with Fortune 500 technology companies, developing a deep understanding of complex enterprise needs and strategic account management.


Connect with Lukas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may belinked therein.

Frequently asked questions

Is it legal to scrape email addresses from websites?

Generally, yes, with conditions. Scraping publicly visible emails is lawful in most jurisdictions, but downstream use determines the compliance risk. Targeted B2B outreach falls into a different category than bulk unsolicited emails. Check the site's robots.txt and Terms of Service (ToS) before scraping, respect GDPR's lawful basis requirement for EU data subjects, and follow CAN-SPAM rules for commercial email in the US. The collection is usually legal; sending spam as a result isn't.

What's the difference between email scraping and email harvesting?

The terms are often used interchangeably but carry different weights. Email scraping describes the technical act of extracting addresses from web pages at the neutral, tool-level. Email harvesting implies bulk, indiscriminate collection, often for spam, and is the term used in most anti-spam legislation. Targeted scraping with a documented legitimate purpose sits in a different compliance category. This article covers domain-scoped, responsible scraping, not bulk list building.

How do I scrape emails loaded by JavaScript that don't appear in the page source?

If an email appears in the browser but not in the page source, the solution is to scrape with a headless browser; Playwright is the recommended option here. It executes JavaScript, returns the fully rendered DOM (Document Object Model), and your existing mailto and regex extraction functions work on that output without changes. Refer to the 'Handling JavaScript-rendered emails and anti-bot measures' section in this article for a Playwright snippet you can use to solve this.

Can I scrape email addresses from LinkedIn?

No. LinkedIn requires users to log in before they can view profiles and uses its Terms of Service to forbid automated data extraction. The legal case of hiQ Labs against LinkedIn established that scraping public websites isn't a violation, but LinkedIn's protected content requires authentication to access. Dedicated email finder tools like Hunter.io, Apollo.io, or Snov.io use official integrations or proprietary databases, making them the optimal solution for LinkedIn data retrieval. Check the 'Email scraping tools' section of this article to know which tool works best for your use case.

How do I verify that scraped email addresses are actually valid?

Syntax validation via Python uses regex, which identifies obvious false positives but cannot verify whether an address truly exists or receives mail. The 3 options for checking email validity, from least complete to most complete, include MX record lookup, which verifies the existence of a mail server, SMTP verification, which tests mailbox access without sending mail, and specialized services like ZeroBounce, NeverBounce, or MillionVerifier, which perform checks and return a confidence score. The noise-filtering process, which removes the noreply@ and admin@ prefixes, addresses email quality issues before the aforementioned external confirmation tools become necessary.

What's the best Python library for scraping emails from a website?

There's no single best answer; it depends on the target website. For static HTML, the requests library and Beautiful Soup are fast and sufficient. For JavaScript-rendered pages, Playwright is the recommended headless browser option because of its clean async API and reliable element waiting. While for large-scale domain crawls, Scrapy provides a full crawl framework with built-in queuing and middleware support. This article uses the requests library and Beautiful Soup as the starting point, with Playwright introduced for JS-rendered targets.

Beautiful Soup Web Scraping: How to Parse Scraped HTML with Python

Web scraping with Python is a powerful technique for extracting valuable data from the web, enabling automation, analysis, and integration across various domains. Using libraries like Beautiful Soup and Requests, developers can efficiently parse HTML and XML documents, transforming unstructured web data into structured formats for further use. This guide explores essential tools and techniques to navigate the vast web and extract meaningful insights effortlessly.

Python Web Crawlers: Guide to Building, Scaling, and Maintaining Crawlers

TL;DR: A web crawler is a program that systematically navigates the web by following links from page to page. Python is the go-to language for building crawlers thanks to libraries like Requests, Beautiful Soup, and Scrapy. This guide covers everything from your first 50-line crawler to a production-grade Scrapy setup with proxy integration, JavaScript rendering, and distributed architecture. If you've ever had to collect data from hundreds or thousands of pages and done it manually, this is for you.

Is Web Scraping Legal? Guide to Laws, Cases & Compliance

Web scraping extracts data from websites using automated tools. It's become a standard practice for businesses gathering competitive intelligence, training AI models, and building data-driven products. But the big question remains – is web scraping legal? The answer depends on what you scrape, how you scrape it, where the data comes from, and what you do with it next.

© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved