
Web Scraping in Dify: A No-Code Guide


Dify is an open-source platform for building LLM apps and AI workflows visually. It gives teams a drag-and-drop canvas for chaining LLMs, tools, and APIs into complete AI workflows. Modern AI apps need fresh, structured web data, but most teams don't want to write and maintain Python scrapers. In this article, you'll learn how to build a no‑code Dify workflow and switch to a managed Web Scraping API when basic plugins aren't enough.


TL;DR: web scraping in Dify in 5 steps

  1. Create a blank Dify Workflow, add a start node, and define a target_url variable so the workflow can accept a page URL at runtime.
  2. Add an HTTP Request node, set it to POST and point it to Decodo's Web Scraping API endpoint: https://scraper-api.decodo.com/v2/scrape.
  3. Configure the request with your Basic Auth token and a JSON payload that sends the target_url to the Web Scraping API.
  4. Add an LLM node to extract structured fields from the scraped response, such as event name, date, location, and source URL.
  5. Add an Output node, run the workflow in Dify's trace view, then publish it as an API or reusable tool for other AI workflows.

What is Dify?

Dify is an open‑source, user-friendly LLM app development platform (LLMOps) for building chatbots, agentic workflows, and visual automations on top of any major LLM provider. It gives teams the core pieces for building, publishing, testing, and monitoring LLM apps in one workspace.

Dify is also model-agnostic, so teams can connect proprietary models like Claude or use open-source alternatives without rebuilding the workflow.

For web scraping, these four Dify building blocks are the most relevant:

  • Workflows. Visual, node-based automations that pass data from one step to another. This is where you can connect a URL input, a scraper node, an LLM node, and a final output.
  • Model-agnostic LLM nodes. Model steps that can use different providers depending on cost, speed, or accuracy needs. This helps teams test extraction, summarization, and classification prompts without locking the workflow to one model.
  • Tools. HTTP requests, plugins, custom APIs, workflow tools, and MCP tools that let LLM apps call external services.
  • Knowledge bases. Vector-indexed datasets that scraped content can feed into for retrieval-augmented generation, also known as RAG.

Instead of building a scraper in Python and then separately sending the result to an LLM, Dify keeps the full flow on the same canvas. A scraper node feeds an LLM node, and an Output node returns the structured result.

Workflow showing USER INPUT -> SCRAPER NODE -> EXTRACT STRUCTURED D... (gpt-4o-mini) -> OUTPUT on dark UI

If you've used n8n for web scraping workflows, the canvas will feel familiar. However, n8n is a general‑purpose automation tool that connects to hundreds of SaaS apps, while Dify is purpose-built for LLM orchestration. Its nodes, variables, and debug traces are designed around prompt inputs, model outputs, and token costs.

If you want to compare Dify with a more code-first setup, check out our guide to end-to-end AI workflows with LangChain and Web Scraping API.

Why use web scraping in Dify?

Teams looking into web scraping in Dify usually know the basics of scraping already, so what matters is what Dify adds around it. Compared to a standalone script, Dify gives teams a more structured, flexible, and reusable way to work with scraped data.

Key benefits include:

  • Faster iteration. You can change prompts, output schemas, and node logic visually.
  • Easier collaboration. Developers can configure the scraping tool, while PMs, analysts, or content teams can adjust the URL input, prompt, or output schema without writing code.
  • Easier debugging. Dify's run history shows scraper calls, LLM outputs, token usage, and failed nodes in one trace. That's easier to inspect than a traditional coding pipeline.
  • Built-in scheduling and API exposure. A scraper-plus-LLM workflow can run on a schedule or be published as an API endpoint for other apps to call.
  • Cost transparency. Each run makes it easier to track both scraping usage and LLM token spend in the same workflow view.

If you want to compare this with another AI automation setup, see our guide on AI agent orchestration with n8n and MCP.

Web scraping plugins available in Dify

Dify's Plugin Marketplace offers several plugins that can handle the scraping layer of your workflow. The best option depends on how clean the target page is, how strong its anti-bot protections are, and how much control you need.

Let's go over a few popular plugins:

  • Firecrawl plugin. Converts any URL into clean Markdown. A strong default for content-heavy pages and RAG ingestion, but with less control over headers and proxies.
  • Jina Reader. The fastest setup, with a free tier for quick testing. It works well for public docs and blogs, but not for harder targets. A useful trick is to prefix a URL with https://r.jina.ai/ to get LLM-ready Markdown. For example, instead of https://www.python.org/events/python-events/, use https://r.jina.ai/https://www.python.org/events/python-events/ as your target URL.
  • ScrapeGraph plugin. A more LLM-driven option that can turn URLs into Markdown or structured JSON. For a closer look, read our ScrapeGraph AI review.
  • Custom tool via HTTP Request node. This is the most flexible route. Dify's HTTP Request node lets you call any scraping API directly, including Decodo's Web Scraping API when you need JavaScript rendering, geo-targeting, session control, or automatic proxy rotation.

As a rule of thumb, start with Jina Reader for public content. The moment you hit a 403, a CAPTCHA, or a JavaScript-rendered target, switch to a managed Web Scraping API as a custom tool. For another no-code angle, see no-code web scraper with Playwright MCP.
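
The Jina Reader prefix trick is easy to sketch in Python if you want to generate target URLs outside Dify. The helper name jina_reader_url is illustrative and not part of any Jina or Dify library. It only builds the prefixed URL described in the list above.

```python
JINA_READER_PREFIX = "https://r.jina.ai/"


def jina_reader_url(target_url: str) -> str:
    """Return the Jina Reader version of a page URL (hypothetical helper)."""
    target_url = target_url.strip()
    # Require a full URL so the prefixed result stays unambiguous.
    if not target_url.startswith(("http://", "https://")):
        raise ValueError("target_url must start with http:// or https://")
    return JINA_READER_PREFIX + target_url


print(jina_reader_url("https://www.python.org/events/python-events/"))
```

You can paste the printed URL into the workflow's target_url input to test a public page before switching to a managed scraper.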

How to set up web scraping in Dify: step-by-step

Time to build a simple Dify workflow that connects to Decodo's Web Scraping API and passes the scraped page content into an LLM.

Upcoming Events panel listing PyCon Italia 2026, Python Leiden User Group, GeoPython 2026, PyCon Singapore 2026, SciPy 2026 on python.org events page

The goal is to scrape the public Python events page. The workflow will fetch the page through Decodo, pass the result into an LLM node, and return structured event data.

You'll build a four-stage workflow:

Dify Scraping Workflow showing User Input -> Decodo Web Scraping API -> LLM Extraction -> Structured Output

Each node has a clear job. The User Input node collects the target URL, the HTTP Request node sends it to Decodo, the LLM node extracts structured fields, and the Output node returns the final result.

This AI scraping process is visual from start to finish. You'll connect the nodes in a simple flow and won't have to write a scraper from scratch.

Requirements

To follow along, you'll need the following:

  • Dify account. A free account is enough for testing.
  • Decodo account. Get a Web Scraping API subscription or free plan and retrieve your basic authentication token from the dashboard.
Web Scraping API dashboard showing Basic authentication token and code playground with Python snippet and Send request button
If you don't have these yet, create the accounts first and keep your Decodo token ready. If you're self-hosting Dify, make sure your version supports workflows, the Plugin Marketplace, and HTTP Request nodes.


Step 1: Create a new workflow

Log in to your Dify workspace and create a new app. Choose Workflow instead of Chatflow.

Dify workspace showing Workflow tab selected, 'Create from Blank' highlighted, 'Web scraping in Dify demo' card visible

A Workflow is the better option here because it gives you a predictable input-output pipeline. That makes it easier to expose the scraper as an API later, reuse it in another Dify app, or trigger it from an external automation tool.

Name it something simple, such as Web scraping in Dify demo.

Create from Blank modal showing 'Workflow' selected and 'Chatflow' option, app name and description fields on dark UI

Step 2: Configure the Start node

Open the Start node, which appears as User Input on the canvas, and add a new input variable.

USER INPUT node selected, showing SETTINGS > INPUT FIELD with userinput.files (LEGACY) in dark workflow editor

In the right panel, add a new input field named target_url, with https://www.python.org/events/python-events/ as the default value.

This lets the workflow accept a target page at runtime. You can keep the Python events page as the default URL for testing, then replace it later with any supported target.

Step 3: Add the HTTP Request node

Click the plus icon after the User Input node and add an HTTP Request node. Rename it to Decodo Web Scraping API.

HTTP Request node highlighted in nodes panel, showing 'HTTP Request' tooltip and 'USER INPUT' node with target_url field.

Set the method to POST and use this endpoint: https://scraper-api.decodo.com/v2/scrape. This node is where Dify calls Decodo.

HTTP Request panel showing POST https://scraper-api.decodo.com/v2/scrape in a Decodo web-scraping flow with User Input node

Step 4: Configure the Decodo request

Then configure the request by adding the required headers:

  • Accept: application/json
  • Content-Type: application/json
  • Authorization: Basic YOUR_DECODO_AUTH_TOKEN
HTTP Request UI — POST https://scraper-api.decodo.com/v2/scrape; headers Accept:application/json; Authorization:Basic...

In the HTTP Request node, add the JSON payload that Dify will send to Decodo. Use this structure:

{ "url": "{{target_url}}", "proxy_pool": "premium", "headless": "html" }

Here's what each field does:

  • url tells Decodo which page to scrape. Bind this to the Start node's target_url variable.
  • proxy_pool sets the proxy pool used for the request.
  • headless, when set to html, tells Decodo to render the page and return the resulting HTML.
BODY panel showing raw selected and RAW TEXT JSON { "url":"https://www.python.org/events/python-events/", "proxy_pool": "premium", "headless": "html" }
Run the workflow in Preview mode. After the run finishes, open the HTTP Request node output. You should see a JSON response from Decodo containing the scraped page result.
HTTP Request node showing 'LAST RUN' and 'OUTPUT' panels with tokenization warning, 'Test Run' button in web-scraping UI

At this stage, don't worry if the output looks too long or messy. Raw HTML often does. The next step is where the LLM turns that page content into structured data.

If the request fails, check these things first:

  • The Decodo API token is valid.
  • The authorization header starts with Basic.
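
If you want to rule out Dify as the source of a failure, you can rebuild the same request outside the workflow. The sketch below uses only Python's standard library and mirrors the endpoint, headers, and payload from this step. The function name build_decodo_request is illustrative, and YOUR_DECODO_AUTH_TOKEN is a placeholder for the token from your dashboard.

```python
import json
import urllib.request

DECODO_ENDPOINT = "https://scraper-api.decodo.com/v2/scrape"


def build_decodo_request(target_url: str, auth_token: str) -> urllib.request.Request:
    """Build the same POST request the HTTP Request node sends (illustrative helper)."""
    payload = {"url": target_url, "proxy_pool": "premium", "headless": "html"}
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Basic {auth_token}",
    }
    return urllib.request.Request(
        DECODO_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_decodo_request(
    "https://www.python.org/events/python-events/",
    "YOUR_DECODO_AUTH_TOKEN",  # placeholder, replace with your own token
)
print(request.get_method(), request.full_url)

# To send it for real (needs a valid token and network access), uncomment:
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(response.status)
```

If this request succeeds but the Dify node fails, the problem is most likely in the node's headers or body rather than in your token.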

Step 5: Add the LLM extraction node

Click the plus icon after the HTTP Request node and add an LLM node. Rename it to Extract structured data and choose your model (GPT‑4o or Claude Sonnet work well).

No need to worry about adding an API key for now as Dify provides new accounts with free LLM credits.

Extract structured data panel showing gpt-4o-mini selected with context and SYSTEM prompt fields in dark UI

Set the LLM context to the response body from the HTTP Request node. This may appear as Decodo Web Scraping API → body.

body highlighted in DECODO WEB SCRAPING API context panel, showing CONTEXT header, status_code Number and files Array[File]

Then, add a prompt that tells the model exactly what to extract. Use a prompt like this:

"
Extract Python event data from the HTML inside results[0].content.
Focus only on the ul element with the class list-recent-events menu.
Return only valid JSON.
For each event, include:
name
date
location
event_url
Use https://www.python.org as the base URL for relative links.
If a field is missing, return null.
Content:
{{context}}
"

Pointing the model at results[0].content and the events list keeps it away from the page navigation, scripts, styles, and footer. Asking for JSON only also makes the output easier to map in the Output node and easier to consume from an API.
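
If the raw response is too large for your model, one option is to trim it before the LLM step. The sketch below is a possible Code node body, assuming the response body is a JSON string whose results[0].content field holds the page HTML, as the prompt above implies. The variable names body and html are hypothetical, and the main function follows the shape of Dify's Python Code node template.

```python
import json


def main(body: str) -> dict:
    """Return only the page HTML from the scraper response body."""
    try:
        data = json.loads(body)
        html = data["results"][0]["content"] or ""
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        # Unexpected response shape: return an empty string so a later
        # node can detect the failure instead of crashing here.
        html = ""
    return {"html": html}


sample = json.dumps({"results": [{"content": "<ul class='menu'></ul>"}]})
print(main(sample))
```

With a node like this in place, you would point the LLM context at the Code node's html output instead of the full response body.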

Step 6: Add the Output node

Click the plus icon after the LLM node and add an Output node.

Output panel showing "Extract structured data" and 'text' highlighted in a variables dropdown inside a dark settings UI

Create one output variable:

  • Variable name: result
  • Value: the LLM node response (Extract structured data → text)

This makes the final JSON available in the test run result and, later, through the workflow API.

Step 7: Run a test

Click Test Run and use the default URL. When the run is complete, open the Tracing tab and check each node in order.

Test Run (04:32:51 PM) showing TRACING with USER INPUT, DECODO WEB SCRAPING API, EXTRACT STRUCTURED DATA, OUTPUT in dark UI
  • User Input should show the target_url value.
  • HTTP Request should return a successful response from Decodo, usually with a 200 status code.
  • LLM node should receive the scraped page content from the HTTP Request node.
  • Output should return the final structured JSON.

The expected output should look similar to this:

{
  "events": [
    {
      "name": "PyCon Italia 2026",
      "date": "27 May 2026 - 30 May 2026",
      "location": "Bologna, Italy",
      "event_url": "https://www.python.org/events/python-events/2036/"
    },
    {
      "name": "Python Leiden User Group",
      "date": "28 May 2026 5:15pm UTC - 8pm UTC",
      "location": "Leiden, The Netherlands",
      "event_url": "https://www.python.org/events/python-events/2175/"
    }
  ]
}

If the result is empty or inaccurate, start debugging from the HTTP Request node. If the response is missing, check the API URL, headers, authorization token, and JSON body.

Step 8: Publish the workflow

Once the workflow works, click Publish. You can publish it as an API or use it as a tool inside another Dify app.

Publish Update button highlighted in modal reading 'CURRENT DRAFT UNPUBLISHED' and 'Auto-Saved · a few seconds ago'

This turns your Dify setup into a reusable scraping workflow that other apps can call with a different target_url.
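
Once published, the workflow can be called like any other HTTP API. The sketch below builds such a call with Python's standard library. It assumes the Dify Cloud base URL and the workflow run endpoint shown on the app's API Access page, so check both against your own instance, especially if you self-host. The API key and the user value are placeholders.

```python
import json
import urllib.request

# Assumption: Dify Cloud. A self-hosted instance uses its own base URL.
DIFY_BASE_URL = "https://api.dify.ai/v1"


def build_workflow_run_request(api_key: str, target_url: str) -> urllib.request.Request:
    """Build a blocking run request for the published workflow (illustrative helper)."""
    body = {
        "inputs": {"target_url": target_url},  # matches the Start node variable
        "response_mode": "blocking",
        "user": "scraper-demo",  # hypothetical caller identifier
    }
    return urllib.request.Request(
        f"{DIFY_BASE_URL}/workflows/run",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


run_request = build_workflow_run_request(
    "YOUR_DIFY_APP_API_KEY",  # placeholder
    "https://www.python.org/events/python-events/",
)
print(run_request.get_method(), run_request.full_url)
```

Sending the request with urllib.request.urlopen should return the workflow's outputs, including the result variable defined in the Output node.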

If you're comparing Dify with another visual workflow tool, check out Decodo's guide to building n8n web scraping workflows.

Managing API credentials and security in Dify

When you move a scraping workflow into production, how you store, scope, and rotate credentials matters as much as the flow itself. Dify encrypts all secrets at rest, but you still control where those secrets live, who can see them, and what happens if a key leaks.

Dify can store plugin credentials at the workspace or workflow level. Workspace-level credentials are useful for shared internal tools because multiple workflows can reuse the same setup. Workflow-level credentials are better when a workflow is isolated.

Here are some best practices to help keep your setup safer:

Choose the right credential scope

Use workspace-level credentials for internal workflows and workflow-level credentials for isolated projects.

Rotate keys regularly

Replace scraping API keys every 60 to 90 days, and always rotate them after a team member with access leaves.

Avoid hardcoding secrets

Don't paste API keys into prompts, Code nodes, notes, or request bodies that may appear in traces or workflow exports. Use credential fields or secured request headers instead.

Check the trust model

On Dify Cloud, secrets live on Dify's infrastructure. On self-hosted Dify, you control the environment, but you're also responsible for backups, access control, and secret storage.

Building no-code and low-code web scraping workflows in Dify

Dify gives you a few ways to take action inside your web scraping workflows, especially when scraped output needs cleaning before it reaches the LLM or when the workflow needs to branch at some point.

Let's have a look at some of them:

Low-code workflow

Sometimes the scraper output needs shaping before the LLM sees it, such as deduplicating URLs, filtering empty entries, or restructuring JSON.

PYTHON3 editor showing def main(arg1: str, arg2: str): return {"result": arg1 + arg2} in Code panel with INPUT VARIABLES arg1 arg2

In that case, drop a Code node between the scraper and LLM. Dify supports Python and Node.js for this.
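
As a rough sketch, a Python Code node that deduplicates URLs and filters empty entries could look like this. The input name urls is hypothetical, and the main function with a result key follows the default Code node template shown in the screenshot above.

```python
def main(urls: list) -> dict:
    """Drop empty entries and duplicate URLs while keeping the original order."""
    seen = set()
    cleaned = []
    for url in urls:
        # Skip anything that is not a string, such as null values.
        if not isinstance(url, str):
            continue
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            cleaned.append(url)
    return {"result": cleaned}


print(main(["https://a.example/", "", "https://a.example/", " https://b.example/ "]))
```

Keeping the original order matters when the LLM prompt refers to items by position, which is why the sketch avoids converting the list to a set directly.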

Conditional logic with IF/ELSE

Use an IF/ELSE node to check for common failure signals, such as an empty response, a failed HTTP status, or blank Markdown.

IF/ELSE node connected to Extract Structure node, right panel shows IF/ELSE settings 'Add Condition', '+ ELIF', 'ELSE'

If the first scraper fails, route the workflow to a fallback plugin or a managed Web Scraping API.
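
The IF/ELSE node is configured visually, but the checks it performs are simple enough to sketch in Python. The function below is an assumption about how you might compute a single flag in a Code node and branch on it. The input names status_code and body are hypothetical, and the flag is returned as 1 or 0 so it can be compared as a plain number.

```python
def main(status_code: int, body: str) -> dict:
    """Flag a scrape as failed when the status is not 2xx or the body is blank."""
    status_ok = 200 <= status_code < 300
    has_content = bool(body and body.strip())
    return {"scrape_ok": 1 if (status_ok and has_content) else 0}


print(main(200, "<html>...</html>"))
print(main(403, "Access denied"))
```

An IF/ELSE condition on scrape_ok could then send failed runs down the fallback branch.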

Iteration node for paginated targets

Use an Iteration node when the target spans multiple pages. For example, you can loop through a list of category URLs, scrape each one, and collect the results into a single output.

Iteration panel showing INPUT and OUTPUT VARIABLES, ERROR RESPONSE METHOD: Terminated in dark workflow editor
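
An Iteration node needs a list to loop over. One way to produce it is a small Code node that builds the page URLs up front. This is a sketch under assumptions: the page query parameter is hypothetical and depends on how your target site paginates, and the input names base_url and last_page are illustrative.

```python
from urllib.parse import urlencode


def main(base_url: str, last_page: int) -> dict:
    """Build one URL per page for an Iteration node to loop over."""
    urls = [
        f"{base_url}?{urlencode({'page': page})}"  # "page" is an assumed parameter name
        for page in range(1, last_page + 1)
    ]
    return {"urls": urls}


print(main("https://example.com/jobs", 3))
```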

If RAG is your end goal, see our guide on how to build production-ready RAG with LlamaIndex.

Real-world use cases for web scraping in Dify

Conference and event intelligence

Teams can scrape public event pages, such as the Python events page, to track upcoming conferences, meetups, and technical events. A Dify workflow can run weekly, scrape the event page, and pass the result into an LLM node to extract event titles, dates, locations, topics, and links.

The output can be returned as structured JSON or written into a Knowledge Base, where an internal assistant can answer questions like, “Which Python or infrastructure events are coming up next month?”

USER INPUT target_url → DECODO WEB SCRAPING POST https://scraper-api.decodo.com/v2/scrape → EXTRACT EVENT INTEL gpt-4o-mini

Job market signal tracking

For hiring and market research, Dify can monitor a public job board weekly. The workflow can scrape new listings, then use an LLM to classify each role by stack, seniority, location, and role type.

Python header displaying '34 jobs on the Python Job Board' with job listings and 'Submit a Job' sidebar
This makes it easier to spot hiring patterns without manually checking the page every week. For example, the output can summarize whether Python roles are leaning more toward backend engineering, machine learning, DevOps, or data roles.
SCHEDULE TRIGGER set for May 14, 2026, 9:00 AM → DECODO WEB SCRAPIN... (POST) → CLASSIFY JOBS gpt-4o-mini → OUTPUT

Local regulation tracking

Dify can also monitor public meeting pages, scrape the latest meeting minutes, and ask the LLM to flag items related to zoning, public hearings, permits, or community development.

Full Board/Public Hearing Minutes listing meeting links on NYC Bronx Community Board 10 website
In this setup, the final output can be sent to Slack so the right team sees relevant updates without checking the page manually. The Slack message can include a summary, the category of the update, and the source link.
SCHEDULE TRIGGER fires Sunday, May 17, 2020, 9:00 AM → DECODO WEB SCRAPIN... POST https://scraper-api.decodo.com/v2/scrape → PUBLIC HEARING MINU... gpt-4o-mini → SLACK WEBHOOK_URL - Authorization required
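
The Slack step in this setup boils down to posting a small JSON body to an incoming webhook. Here is a minimal sketch of how that body could be assembled from the summary, category, and source link. The function name and the message layout are illustrative, and the example values are made up for demonstration.

```python
import json


def build_slack_message(summary: str, category: str, source_url: str) -> str:
    """Return a JSON body for a Slack incoming webhook (illustrative helper)."""
    # Slack incoming webhooks accept a JSON object with a "text" field.
    text = f"*{category}*\n{summary}\nSource: {source_url}"
    return json.dumps({"text": text})


print(build_slack_message(
    "New public hearing scheduled.",      # made-up example summary
    "Zoning",                             # made-up example category
    "https://example.com/minutes",        # placeholder source link
))
```

In Dify, the same body could be sent from an HTTP Request node that points at your webhook URL.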

Troubleshooting web scraping in Dify

Most scraping failures in Dify come down to a handful of root causes. The fastest way to debug is to open the run trace and check where the workflow first breaks.

SCRAPER NODE showing "GET @User input [x]target_url" and "RETRY 3 TIMES ON FAILURE", connected to CODE; Test Run TRACING shows "TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'"

Let's look at some common troubleshooting tips:

Empty Markdown output

The target may be a JavaScript-rendered single-page app, so the scraper is only seeing empty HTML. Switch to a tool that supports headless rendering, like a scraper plugin or Decodo Web Scraping API.

HTTP 403, 429, or Cloudflare challenge page

The target's bot detection is likely reacting to your IP address, missing headers, or request pattern. To fix this, rotate User-Agents and IPs, or route the request through a Web Scraping API with proxy rotation so your requests look like regular browser traffic.

Plugin authentication errors

Check that the API key is stored in the plugin credential field or secure header, not inside a prompt. Test the same request with curl outside Dify, and confirm that everything works as expected.

Malformed JSON from the LLM node

Tighten the prompt with an explicit schema, lower the model temperature, and tell the model to return valid JSON only. If the output still breaks, use a Parameter Extractor node or a Code node to validate the result before passing it forward.
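
For the Code node route, a validation step could look like the sketch below. It assumes the model returns either a list of events or an object with an events list, and it checks for the four keys requested in the Step 5 prompt. The input name llm_text and the valid flag are hypothetical.

```python
import json

REQUIRED_KEYS = ("name", "date", "location", "event_url")
FENCE = "`" * 3  # Markdown code fence that models sometimes wrap around JSON


def main(llm_text: str) -> dict:
    """Parse the LLM reply and keep only events that have every expected key."""
    text = llm_text.strip()
    if text.startswith(FENCE):
        # Remove the surrounding fence and an optional "json" language tag.
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {"valid": 0, "events": []}
    events = data.get("events", []) if isinstance(data, dict) else data
    if not isinstance(events, list):
        return {"valid": 0, "events": []}
    clean = [
        event for event in events
        if isinstance(event, dict) and all(key in event for key in REQUIRED_KEYS)
    ]
    return {"valid": 1 if clean else 0, "events": clean}


print(main("not json"))
```

A downstream IF/ELSE node could then check the valid flag and retry the LLM step or stop the run when it is 0.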

Workflow timeout

Long scrapes can exceed Dify's per-node timeout. Split the job into smaller batches with an Iteration node, reduce the target page size, or move long-running collections to a managed API that can handle heavier scraping tasks.
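
Splitting a job into batches is mostly a matter of chunking the URL list before it reaches the loop. The plain Python sketch below shows the idea. The function name chunk_urls and the batch size are illustrative, and how you feed the batches into Dify depends on your workflow.

```python
def chunk_urls(urls: list, batch_size: int) -> list:
    """Split a URL list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


pages = [f"https://example.com/page/{n}" for n in range(1, 6)]  # placeholder URLs
for batch in chunk_urls(pages, 2):
    print(batch)
```

Smaller batches keep each run short, at the cost of more runs to schedule and monitor.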

Where to get help

Start with Dify GitHub Discussions, the Dify Discord, the plugin's official support channel from its Dify Marketplace page, and your scraping provider's docs. To learn more about blocked requests and rate limits, check out our guide on how to bypass web scraping prevention.

Final thoughts

In this article, you learned how to use Dify to build a no-code web scraping workflow. Dify is a strong LLM-focused alternative to general automation tools like n8n, especially for teams building AI workflows with live web data. If that sounds like your setup, try the Decodo Web Scraping API today and unlock a more reliable way to collect production-ready web data.



About the author

Lukas Mikelionis

Senior Account Manager

Lukas is a seasoned enterprise sales professional with extensive experience in the SaaS industry. Throughout his career, he has built strong relationships with Fortune 500 technology companies, developing a deep understanding of complex enterprise needs and strategic account management.

Connect with Lukas via LinkedIn.

All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

Is web scraping in Dify free?

Dify itself is open‑source and free to use. Scraping plugins charge per request or credit. Jina Reader offers a free tier good enough for testing.

Can I scrape JavaScript-rendered sites in Dify?

Not with vanilla HTTP‑request plugins. If the page needs JavaScript to load its content, use a scraper plugin or Web Scraping API that supports headless rendering. This is where a managed option like Decodo's Web Scraping API is useful.

How do I schedule a Dify scraping workflow?

Use Dify's built‑in scheduler (Workflow → Schedule) or trigger the published API endpoint from cron, n8n, or GitHub Actions.

Is it legal to scrape inside Dify?

Same rules as scraping anywhere else: respect robots.txt, terms of service, and applicable data protection law.


© 2018-2026 decodo.com (formerly smartproxy.com). All Rights Reserved