Web Scraping Agent
AI agent that crawls web pages and extracts structured data.

Code

Create web-scraping-agent.py with the code below:
"""Web Scraping AI Agent (Local & Cloud SDK)

Crawls web pages, extracts structured data, cleans and formats outputs,
and prepares datasets for analysis or integration.

Features:
- ScrapeGraph AI for intelligent structured extraction
- Mem0 for persistent memory (dedup, extraction profiles)
- OpenRouter (openai/gpt-oss-120b) for synthesis and formatting
- Local run mode + Bindu Cloud SDK deployment

Usage:
    python web-scraping-agent.py

Environment:
    Requires SCRAPEGRAPH_API_KEY, MEM0_API_KEY, OPENROUTER_API_KEY in .env file
"""

import os
from dotenv import load_dotenv

load_dotenv()

from bindu.penguin.bindufy import bindufy
from agno.agent import Agent
from agno.models.openrouter import OpenRouter
from agno.tools.scrapegraph import ScrapeGraphTools
from agno.tools.mem0 import Mem0Tools

# Initialize the web scraping agent
agent = Agent(
    instructions=(
        "You are a web scraping assistant. Given a URL and an optional extraction prompt, "
        "use ScrapeGraph to extract structured data from the page. Clean and format the output "
        "into JSON. Use memory to avoid re-scraping URLs you have already processed and to "
        "remember extraction preferences ."
    ),
    model=OpenRouter(
        id="openai/gpt-oss-120b",
        api_key=os.getenv("OPENROUTER_API_KEY"),
    ),
    tools=[
        ScrapeGraphTools(api_key=os.getenv("SCRAPEGRAPH_API_KEY")),
        Mem0Tools(api_key=os.getenv("MEM0_API_KEY")),
    ],
)

# Agent configuration for Bindu
config = {
    "author": "bindu.builder@getbindu.com",
    "name": "web_scraping_agent",
    "description": (
        "AI-enabled web scraping agent that collects, structures, and processes "
        "data from websites for analysis and automation."
    ),
    "deployment": {
        "url": "http://localhost:3773",
        "expose": True,
        "cors_origins": ["http://localhost:5173"],
    },
    "skills": ["skills/web-scraping-skill"],
}

def handler(messages: list[dict[str, str]]):
    """
    Process incoming messages and return agent response.

    Args:
        messages: List of message dictionaries containing conversation history

    Returns:
        Extracted and structured data from the requested web page
    """
    if messages:
        latest = (
            messages[-1].get("content", "")
            if isinstance(messages[-1], dict)
            else str(messages[-1])
        )
        result = agent.run(input=latest)
        if hasattr(result, "content"):
            return result.content
        elif hasattr(result, "response"):
            return result.response
        return str(result)
    return "Please provide a URL and an extraction prompt."

if __name__ == "__main__":
    # Bindu-fy the agent — converts it to a discoverable, interoperable Bindu agent
    bindufy(config, handler)
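
Before deploying, you can exercise the handler directly. A minimal smoke test, assuming valid keys in .env; run it in place of the bindufy() call, since the handler lives in the same file:

# Temporary smoke test: call the handler the way Bindu would.
reply = handler(
    [{"role": "user", "content": "Scrape https://example.com and return the page title as JSON"}]
)
print(reply)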

Skill Configuration

Create skills/web-scraping-skill/skill.yaml:
# Web Scraping Skill
# AI-enabled web scraping that crawls pages and extracts structured data

id: web-scraping-skill
name: web-scraping-skill
version: 1.0.0
author: bindu.builder@getbindu.com

description: |
  AI-enabled web scraping skill that crawls web pages, extracts structured data,
  cleans and formats outputs, and prepares datasets for analysis or integration.

tags:
  - web-scraping
  - data-processing
  - extraction
  - crawler
  - scrape
  - structured-data

input_modes:
  - application/json

output_modes:
  - application/json

examples:
  - "Extract product listings from this e-commerce site: https://example.com/products"
  - "Scrape blog titles and publish dates from https://example.com/blog"
  - "Get all article headlines from this news page"

capabilities_detail:
  web_scraping:
    supported: true
    description: "Crawl and extract structured data from any public web page"
  data_processing:
    supported: true
    description: "Clean, normalize, and format extracted content into structured JSON"
  memory:
    supported: true
    description: "Remember previously scraped URLs and extraction profiles via Mem0"
  deduplication:
    supported: true
    description: "Avoid re-scraping already processed URLs"

assessment:
  keywords:
    - scrape
    - crawl
    - extract
    - web
    - website
    - product listings
    - blog titles
    - data collection
    - html
    - structured data

  specializations:
    - domain: e_commerce_extraction
      confidence_boost: 0.3
    - domain: content_aggregation
      confidence_boost: 0.2

  anti_patterns:
    - "pdf extraction"
    - "database query"
    - "audio transcription"
    - "image generation"

How It Works

Web Scraping
  • ScrapeGraphTools: AI-powered structured data extraction
  • Intelligent web page crawling and parsing
  • JSON output formatting and cleaning
  • Custom extraction prompt support
Memory Management
  • Mem0Tools: Persistent memory for deduplication
  • Extraction profile storage and retrieval
  • Avoids re-scraping previously processed URLs (sketched below)
  • Remembers user extraction preferences
Data Processing
  • OpenRouter with GPT-OSS-120b for synthesis
  • Advanced data structuring and formatting
  • Content cleaning and normalization
  • JSON output preparation
Agent Capabilities
  • Web scraping assistant with AI extraction
  • Structured data output in JSON format
  • Memory-based optimization and caching
  • Multi-format data preparation
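
The deduplication flow under Memory Management can be pictured as a memory lookup before each crawl. A minimal sketch, assuming the Mem0 platform client (mem0.MemoryClient) with its search/add methods and a hypothetical user_id; inside the agent, Mem0Tools makes these calls on the model's behalf:

import os
from mem0 import MemoryClient

mem = MemoryClient(api_key=os.getenv("MEM0_API_KEY"))
url = "https://example.com/products"

# Look for a memory of this URL before crawling it again.
hits = mem.search(f"scraped {url}", user_id="web_scraping_agent")
if hits:
    print("URL already processed; reuse the stored extraction.")
else:
    # ... run the ScrapeGraph extraction here, then record the URL.
    mem.add(
        [{"role": "assistant", "content": f"Scraped {url}."}],
        user_id="web_scraping_agent",
    )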

Dependencies

uv init
uv add bindu agno python-dotenv scrapegraphai mem0ai

Environment Setup

Create .env file:
SCRAPEGRAPH_API_KEY=your_scrapegraph_api_key
MEM0_API_KEY=your_mem0_api_key
OPENROUTER_API_KEY=your_openrouter_api_key_here
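
To confirm the keys are visible to the agent process, a quick sanity check using the same python-dotenv loading as the agent file:

import os
from dotenv import load_dotenv

load_dotenv()
for key in ("SCRAPEGRAPH_API_KEY", "MEM0_API_KEY", "OPENROUTER_API_KEY"):
    if not os.getenv(key):
        raise SystemExit(f"{key} is missing from .env")
print("All keys loaded.")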

Run

uv run web-scraping-agent.py
Examples:
  • “Extract product information from https://example-shop.com including prices, names, and descriptions”
  • “Scrape news headlines from multiple news websites”
  • “Get pricing data from e-commerce product pages”

Example API Calls

Send a scraping request with message/send:

{
  "jsonrpc": "2.0",
  "method": "message/send",
  "params": {
    "message": {
      "role": "user",
      "kind": "message",
      "messageId": "9f11c870-5616-49ad-b187-d93cbb100001",
      "contextId": "9f11c870-5616-49ad-b187-d93cbb100002",
      "taskId": "9f11c870-5616-49ad-b187-d93cbb100003",
      "parts": [
        {
          "kind": "text",
          "text": "Extract product information from https://example-shop.com including prices, names, and descriptions"
        }
      ]
    },
     "skillId": "web-scraping-skill",
    "configuration": {
      "acceptedOutputModes": ["application/json"]
    }
  },
  "id": "9f11c870-5616-49ad-b187-d93cbb100003"
}

Poll for the task result with tasks/get (note the shared taskId):
{
  "jsonrpc": "2.0",
  "method": "tasks/get",
  "params": {
    "taskId": "9f11c870-5616-49ad-b187-d93cbb100003"
  },
  "id": "9f11c870-5616-49ad-b187-d93cbb100004"
}
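
For reference, here are the same two calls from Python. This is a minimal client sketch, assuming the agent accepts JSON-RPC 2.0 over plain HTTP POST at its deployment URL; the exact endpoint path may differ in your Bindu version:

import requests

BASE_URL = "http://localhost:3773"  # deployment.url from the agent config above

# message/send: submit the scraping request as a task.
send_payload = {
    "jsonrpc": "2.0",
    "method": "message/send",
    "params": {
        "message": {
            "role": "user",
            "kind": "message",
            "messageId": "9f11c870-5616-49ad-b187-d93cbb100001",
            "contextId": "9f11c870-5616-49ad-b187-d93cbb100002",
            "taskId": "9f11c870-5616-49ad-b187-d93cbb100003",
            "parts": [
                {
                    "kind": "text",
                    "text": "Extract product information from https://example-shop.com",
                }
            ],
        },
        "skillId": "web-scraping-skill",
        "configuration": {"acceptedOutputModes": ["application/json"]},
    },
    "id": "9f11c870-5616-49ad-b187-d93cbb100003",
}
print(requests.post(BASE_URL, json=send_payload, timeout=120).json())

# tasks/get: poll the task created by the message/send call.
get_payload = {
    "jsonrpc": "2.0",
    "method": "tasks/get",
    "params": {"taskId": "9f11c870-5616-49ad-b187-d93cbb100003"},
    "id": "9f11c870-5616-49ad-b187-d93cbb100004",
}
print(requests.post(BASE_URL, json=get_payload, timeout=30).json())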

Frontend Setup

# Clone the Bindu repository
git clone https://github.com/GetBindu/Bindu

# Navigate to frontend directory
cd Bindu/frontend

# Install dependencies
npm install

# Start frontend development server
npm run dev
Open http://localhost:5173 and chat with the web scraping agent.