# 2.9 Web Scraping AI Agent

> AI agent that crawls web pages and extracts structured data

## Code

Create `web-scraping-agent.py` with the code below, or save it directly from your editor.

```python theme={null}
"""Web Scraping AI Agent (Local & Cloud SDK)

Crawls web pages, extracts structured data, cleans and formats outputs,
and prepares datasets for analysis or integration.

Features:
- ScrapeGraph AI for intelligent structured extraction
- Mem0 for persistent memory (dedup, extraction profiles)
- OpenRouter (openai/gpt-oss-120b) for synthesis and formatting
- Local run mode + Bindu Cloud SDK deployment

Usage:
    python web-scraping-agent.py

Environment:
    Requires SCRAPEGRAPH_API_KEY, MEM0_API_KEY, OPENROUTER_API_KEY in .env file
"""

import os
from dotenv import load_dotenv

load_dotenv()

from bindu.penguin.bindufy import bindufy
from agno.agent import Agent
from agno.models.openrouter import OpenRouter
from agno.tools.scrapegraph import ScrapeGraphTools
from agno.tools.mem0 import Mem0Tools

# Initialize the web scraping agent
agent = Agent(
    instructions=(
        "You are a web scraping assistant. Given a URL and an optional extraction prompt, "
        "use ScrapeGraph to extract structured data from the page. Clean and format the output "
        "into JSON. Use memory to avoid re-scraping URLs you have already processed and to "
        "remember extraction preferences."
    ),
    model=OpenRouter(
        id="openai/gpt-oss-120b",
        api_key=os.getenv("OPENROUTER_API_KEY"),
    ),
    tools=[
        ScrapeGraphTools(api_key=os.getenv("SCRAPEGRAPH_API_KEY")),
        Mem0Tools(api_key=os.getenv("MEM0_API_KEY")),
    ],
)

# Agent configuration for Bindu
config = {
    "author": "bindu.builder@getbindu.com",
    "name": "web_scraping_agent",
    "description": (
        "AI-enabled web scraping agent that collects, structures, and processes "
        "data from websites for analysis and automation."
    ),
    "deployment": {
        "url": "http://localhost:3773",
        "expose": True,
        "cors_origins": ["http://localhost:5173"],
    },
    "skills": ["skills/web-scraping-skill"],
}

def handler(messages: list[dict[str, str]]):
    """
    Process incoming messages and return agent response.

    Args:
        messages: List of message dictionaries containing conversation history

    Returns:
        Extracted and structured data from the requested web page
    """
    if messages:
        latest = (
            messages[-1].get("content", "")
            if isinstance(messages[-1], dict)
            else str(messages[-1])
        )
        result = agent.run(input=latest)
        if hasattr(result, "content"):
            return result.content
        elif hasattr(result, "response"):
            return result.response
        return str(result)
    return "Please provide a URL and an extraction prompt."

if __name__ == "__main__":
    # Bindu-fy the agent — converts it to a discoverable, interoperable Bindu agent
    bindufy(config, handler)
```

## Skill Configuration

Create `skills/web-scraping-skill/skill.yaml`:

```yaml theme={null}
# Web Scraping Skill
# AI-enabled web scraping that crawls pages and extracts structured data

id: web-scraping-skill
name: web-scraping-skill
version: 1.0.0
author: bindu.builder@getbindu.com

description: |
  AI-enabled web scraping skill that crawls web pages, extracts structured data,
  cleans and formats outputs, and prepares datasets for analysis or integration.

tags:
  - web-scraping
  - data-processing
  - extraction
  - crawler
  - scrape
  - structured-data

input_modes:
  - application/json

output_modes:
  - application/json

examples:
  - "Extract product listings from this e-commerce site: https://example.com/products"
  - "Scrape blog titles and publish dates from https://example.com/blog"
  - "Get all article headlines from this news page"

capabilities_detail:
  web_scraping:
    supported: true
    description: "Crawl and extract structured data from any public web page"
  data_processing:
    supported: true
    description: "Clean, normalize, and format extracted content into structured JSON"
  memory:
    supported: true
    description: "Remember previously scraped URLs and extraction profiles via Mem0"
  deduplication:
    supported: true
    description: "Avoid re-scraping already processed URLs"

assessment:
  keywords:
    - scrape
    - crawl
    - extract
    - web
    - website
    - product listings
    - blog titles
    - data collection
    - html
    - structured data

  specializations:
    - domain: e_commerce_extraction
      confidence_boost: 0.3
    - domain: content_aggregation
      confidence_boost: 0.2

  anti_patterns:
    - "pdf extraction"
    - "database query"
    - "audio transcription"
    - "image generation"
```

## How It Works

**Web Scraping**

* `ScrapeGraphTools`: AI-powered structured data extraction
* Intelligent web page crawling and parsing
* JSON output formatting and cleaning
* Custom extraction prompt support

**Memory Management**

* `Mem0Tools`: Persistent memory for deduplication
* Extraction profile storage and retrieval
* Avoids re-scraping previously processed URLs
* Remembers user extraction preferences
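In the agent itself, deduplication is left to the model calling `Mem0Tools`. As a rough illustration of the underlying idea only, the sketch below normalizes URLs and tracks them in an in-memory set. The names `normalize_url` and `SeenUrls` are hypothetical and are not part of Mem0, Agno, or Bindu.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class SeenUrls:
    """Hypothetical in-memory stand-in for a persistent memory store."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_scrape(self, url: str) -> bool:
        """Return True the first time a URL is seen, False afterwards."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


seen = SeenUrls()
print(seen.should_scrape("https://example.com/products"))   # first visit
print(seen.should_scrape("https://EXAMPLE.com/products/"))  # same page, skipped
```

Normalizing before the lookup keeps trivially different spellings of the same address from triggering a second scrape. A persistent store would replace the set in a real deployment.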

**Data Processing**

* `OpenRouter` with the `openai/gpt-oss-120b` model for synthesis
* Advanced data structuring and formatting
* Content cleaning and normalization
* JSON output preparation
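The cleaning and normalization in this agent is performed by the model, guided by its instructions. If you wanted a deterministic post-processing pass on the returned records, it might look like the sketch below. The helpers `clean_record` and `to_json` and the sample field names are assumptions made for illustration.

```python
import json


def clean_record(record: dict) -> dict:
    """Collapse whitespace, drop empty fields, and snake_case the keys."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = " ".join(value.split())
        if value in ("", None):
            continue
        cleaned[key.strip().lower().replace(" ", "_")] = value
    return cleaned


def to_json(records: list[dict]) -> str:
    """Clean every record, remove exact duplicates, and serialize to JSON."""
    seen: set[str] = set()
    unique = []
    for record in map(clean_record, records):
        fingerprint = json.dumps(record, sort_keys=True)
        if record and fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return json.dumps(unique, indent=2, ensure_ascii=False)


rows = [
    {"Name": "  Blue   Mug ", "Price": "12.50", "Notes": ""},
    {"Name": "Blue Mug", "Price": "12.50"},
]
print(to_json(rows))
```

Both sample rows reduce to the same cleaned record, so only one entry is emitted.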

**Agent Capabilities**

* Web scraping assistant with AI extraction
* Structured data output in JSON format
* Memory-based optimization and caching
* Multi-format data preparation

## Dependencies

```bash theme={null}
uv init
uv add bindu agno python-dotenv scrapegraph-py mem0ai
```

## Environment Setup

Create a `.env` file:

```bash theme={null}
SCRAPEGRAPH_API_KEY=your_scrapegraph_api_key
MEM0_API_KEY=your_mem0_api_key
OPENROUTER_API_KEY=your_openrouter_api_key_here
```
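A missing key typically surfaces only when the agent makes its first tool or model call. A small pre-flight check, sketched below, can report the problem earlier. The helper `missing_keys` is a hypothetical name; the three variable names are the ones listed above.

```python
import os

REQUIRED_KEYS = ("SCRAPEGRAPH_API_KEY", "MEM0_API_KEY", "OPENROUTER_API_KEY")


def missing_keys(env=None) -> list[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not (env.get(key) or "").strip()]


# Demonstration with an explicit mapping instead of the real environment.
print(missing_keys({"MEM0_API_KEY": "example-value"}))
```

In the agent script, you could call `missing_keys()` right after `load_dotenv()` and exit with a clear message if the returned list is not empty.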

## Run

```bash theme={null}
uv run web-scraping-agent.py
```

**Examples:**

* "Extract product information from [https://example-shop.com](https://example-shop.com) including prices, names, and descriptions"
* "Scrape news headlines from multiple news websites"
* "Get pricing data from e-commerce product pages"

## Example API Calls

<AccordionGroup>
  <Accordion title="Message Send Request">
    ```json theme={null}
    {
      "jsonrpc": "2.0",
      "method": "message/send",
      "params": {
        "message": {
          "role": "user",
          "kind": "message",
          "messageId": "9f11c870-5616-49ad-b187-d93cbb100001",
          "contextId": "9f11c870-5616-49ad-b187-d93cbb100002",
          "taskId": "9f11c870-5616-49ad-b187-d93cbb100003",
          "parts": [
            {
              "kind": "text",
              "text": "Extract product information from https://example-shop.com including prices, names, and descriptions"
            }
          ]
        },
        "skillId": "web-scraping-skill",
        "configuration": {
          "acceptedOutputModes": ["application/json"]
        }
      },
      "id": "9f11c870-5616-49ad-b187-d93cbb100003"
    }
    ```
  </Accordion>

  <Accordion title="Task Get Request">
    ```json theme={null}
    {
      "jsonrpc": "2.0",
      "method": "tasks/get",
      "params": {
        "taskId": "9f11c870-5616-49ad-b187-d93cbb100003"
      },
      "id": "9f11c870-5616-49ad-b187-d93cbb100004"
    }
    ```
  </Accordion>
</AccordionGroup>
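The two requests above can also be built and sent from Python. The following is a minimal sketch using only the standard library. It assumes the agent accepts JSON-RPC POSTs at the deployment URL from the config (`http://localhost:3773`) without authentication; the exact endpoint path and any auth headers may differ in your setup. The helper names are hypothetical.

```python
import json
import urllib.request
import uuid

# Assumption: JSON-RPC requests are posted to the deployment URL itself.
AGENT_URL = "http://localhost:3773"


def build_message_send(text: str, skill_id: str = "web-scraping-skill") -> dict:
    """Build a message/send payload shaped like the request shown above."""
    return {
        "jsonrpc": "2.0",
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "kind": "message",
                "messageId": str(uuid.uuid4()),
                "contextId": str(uuid.uuid4()),
                "taskId": str(uuid.uuid4()),
                "parts": [{"kind": "text", "text": text}],
            },
            "skillId": skill_id,
            "configuration": {"acceptedOutputModes": ["application/json"]},
        },
        "id": str(uuid.uuid4()),
    }


def build_tasks_get(task_id: str) -> dict:
    """Build a tasks/get payload for a previously submitted task."""
    return {
        "jsonrpc": "2.0",
        "method": "tasks/get",
        "params": {"taskId": task_id},
        "id": str(uuid.uuid4()),
    }


def post(payload: dict, url: str = AGENT_URL) -> dict:
    """POST a JSON-RPC payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the agent running, a typical flow would be `sent = build_message_send("Scrape blog titles from https://example.com/blog")`, then `post(sent)`, then `post(build_tasks_get(sent["params"]["message"]["taskId"]))` to check the task.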

## Frontend Setup

```bash theme={null}
# Clone the Bindu repository
git clone https://github.com/GetBindu/Bindu

# Navigate to the frontend directory inside the cloned repository
cd Bindu/frontend

# Install dependencies
npm install

# Start frontend development server
npm run dev
```

Open [http://localhost:5173](http://localhost:5173) and chat with the web scraping agent.
