Observability Overview

Bindu provides comprehensive observability through OpenTelemetry, enabling you to monitor, trace, and analyze your agent’s performance and behavior in real-time.

What is Observability?

Observability gives you deep insights into your agent’s internal state through:
  • Traces - Follow requests through the entire system
  • Metrics - Track performance and resource usage
  • Logs - Capture detailed execution information
  • Events - Monitor state transitions and key moments

Why Observability Matters

Performance Analysis

Identify bottlenecks and optimize agent response times

Error Debugging

Quickly diagnose and fix issues with detailed traces

State Visibility

Track task state transitions and agent behavior

Production Monitoring

Monitor agent health and performance in production

Architecture

Bindu implements distributed tracing across the entire task execution lifecycle:

Trace Flow

  1. TaskManager - Creates root span for operations
  2. Scheduler - Propagates trace context to workers
  3. Worker - Restores context and creates child spans
  4. Agent - Tracks execution time and state changes
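
The four steps above can be sketched in miniature. This is a stdlib-only illustration of how trace context survives an async boundary, not Bindu's implementation; the function names and the shape of the context dict are assumptions:

```python
import asyncio
import contextvars
import uuid

# Simplified model of the flow above, using only the stdlib rather than
# the real OpenTelemetry SDK: the TaskManager opens a root span, the
# Scheduler injects the trace context into the task payload, and the
# Worker restores it on the other side of the async boundary.
current_trace = contextvars.ContextVar("current_trace", default=None)

def start_root_span(operation):
    ctx = {"trace_id": uuid.uuid4().hex, "operation": operation}
    current_trace.set(ctx)
    return ctx

def inject(payload):
    # Scheduler side: serialize the trace context into the message.
    payload["traceparent"] = current_trace.get()["trace_id"]
    return payload

async def worker(payload):
    # Worker side: restore the context, then "execute" as a child span.
    current_trace.set({"trace_id": payload["traceparent"],
                       "operation": "agent.execute"})
    return current_trace.get()["trace_id"]

root = start_root_span("task_manager.send_message")
payload = inject({"task_id": "task-456"})
child_trace_id = asyncio.run(worker(payload))
# The worker's child span carries the same trace_id as the root span.
```

The real SDK does the same thing with W3C `traceparent` headers and the OpenTelemetry propagation API; the point is only that the context travels inside the task payload, not through shared memory.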

Key Features

Distributed Tracing

Track requests across the entire system:
  • βœ… End-to-end visibility - From API request to agent response
  • βœ… Span propagation - Maintains context across async boundaries
  • βœ… Parent-child relationships - Clear hierarchy of operations
  • βœ… Timing information - Precise duration of each operation

Rich Attributes

Comprehensive metadata on every span:
  • bindu.operation - Operation name (e.g., β€œsend_message”)
  • bindu.task_id - Task UUID
  • bindu.context_id - Conversation context
  • bindu.agent.name - Agent identifier
  • bindu.agent.did - Agent DID
  • bindu.agent.execution_time - Agent processing time

State Transition Events

Timeline markers for key moments:
  • Task state changes (working β†’ completed)
  • Error occurrences with stack traces
  • Input/auth requirements
  • Custom agent events

Performance Metrics

Automatic metric collection:
  • bindu_tasks_total - Counter of tasks processed
  • bindu_task_duration_seconds - Histogram of durations
  • bindu_active_tasks - Current active tasks
  • bindu_contexts_total - Contexts managed
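
To make the relationship between these four metrics concrete, here is a toy recorder using plain dicts instead of the OpenTelemetry metrics API. The metric names mirror the list above; the recording logic is an illustrative assumption, not Bindu's code:

```python
from collections import defaultdict

# Toy stand-in for the metrics listed above: a counter, a gauge, and a
# histogram, keyed by the same names the document uses.
class TaskMetrics:
    def __init__(self):
        self.counters = defaultdict(int)     # bindu_tasks_total, bindu_contexts_total
        self.gauges = defaultdict(int)       # bindu_active_tasks
        self.histograms = defaultdict(list)  # bindu_task_duration_seconds

    def task_started(self):
        self.gauges["bindu_active_tasks"] += 1

    def task_finished(self, duration_s):
        self.gauges["bindu_active_tasks"] -= 1
        self.counters["bindu_tasks_total"] += 1
        self.histograms["bindu_task_duration_seconds"].append(duration_s)

m = TaskMetrics()
m.task_started()
m.task_finished(0.2)
# counters: one task total; gauges: back to zero active tasks.
```

In production these would be OpenTelemetry instruments exported over OTLP; the counter/gauge/histogram split is the part worth noticing.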

Supported Backends

Bindu works with any OpenTelemetry-compatible backend:

Open Source

  • Jaeger - Distributed tracing platform
  • Grafana Tempo - High-scale distributed tracing
  • Zipkin - Distributed tracing system
  • SigNoz - Full-stack observability platform

Commercial

  • Honeycomb - Observability for production systems
  • New Relic - Full-stack observability
  • Datadog - Monitoring and analytics
  • Lightstep - Observability for microservices

Quick Start

1. Start Jaeger

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

2. Configure Agent

Add to your agent config:
{
  "name": "my-agent",
  "telemetry": true,
  "oltp": {
    "endpoint": "http://localhost:4318/v1/traces",
    "service_name": "bindu-agent"
  }
}

3. Run Agent

python your_agent.py

4. View Traces

Open http://localhost:16686 and select your service

Configuration Options

{
  "telemetry": true,
  "oltp": {
    "endpoint": "http://localhost:4318/v1/traces",
    "service_name": "bindu-agent",
    "service_version": "1.0.0",
    "deployment_environment": "production"
  }
}

Environment Variables

# OTLP endpoint
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318/v1/traces"

# Service identification
export OTEL_SERVICE_NAME="bindu-agent"
export OTEL_SERVICE_VERSION="1.0.0"
export DEPLOYMENT_ENV="production"

# Batch processing (recommended)
export OTEL_USE_BATCH_PROCESSOR="true"
export OTEL_BSP_SCHEDULE_DELAY="5000"
export OTEL_BSP_MAX_EXPORT_BATCH_SIZE="512"
Agent config takes precedence over environment variables.
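
A minimal sketch of that precedence rule, inferred from the sentence above rather than from Bindu's source (the `resolve_endpoint` helper is hypothetical):

```python
import os

# Agent config wins; the environment variable is only a fallback.
def resolve_endpoint(agent_config):
    oltp = agent_config.get("oltp", {})
    return oltp.get("endpoint") or os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")

os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://env-host:4318/v1/traces"

cfg = {"telemetry": True, "oltp": {"endpoint": "http://localhost:4318/v1/traces"}}
print(resolve_endpoint(cfg))                   # the config value wins
print(resolve_endpoint({"telemetry": True}))   # falls back to the env var
```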

Example Trace

Here’s what a complete trace looks like:
task_manager.send_message (250ms)
β”œβ”€ bindu.operation: "send_message"
β”œβ”€ bindu.request_id: "req-123"
β”œβ”€ bindu.task_id: "task-456"
└─ run task (220ms)
   └─ agent.execute (200ms)
      β”œβ”€ bindu.agent.name: "my-agent"
      β”œβ”€ bindu.agent.did: "did:bindu:user:agent:uuid"
      β”œβ”€ bindu.agent.execution_time: 0.200
      └─ Events:
         └─ task.state_changed
            β”œβ”€ from_state: "working"
            └─ to_state: "completed"
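
The durations in the example nest: each span's "self time" is its own duration minus its children's. A quick sketch using the numbers above (the span table is transcribed from the example; the helper is illustrative):

```python
# (duration_ms, child span names) for each span in the example trace.
spans = {
    "task_manager.send_message": (250, ["run task"]),
    "run task": (220, ["agent.execute"]),
    "agent.execute": (200, []),
}

def self_time(name):
    # Time spent in this span itself, excluding its children.
    total, children = spans[name]
    return total - sum(spans[child][0] for child in children)

print(self_time("task_manager.send_message"))  # 30 ms in the TaskManager itself
print(self_time("agent.execute"))              # 200 ms of pure agent work
```

Reading self time rather than raw duration is what turns a trace like this into a bottleneck report: the 250 ms request is dominated by the 200 ms of agent execution, not by the orchestration layers.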

Trace Attributes

Attribute                    Description        Example
bindu.operation              Operation type     send_message
bindu.task_id                Task identifier    task-456
bindu.agent.name             Agent name         my-agent
bindu.agent.execution_time   Processing time    0.200
bindu.success                Success flag       true

Observability Best Practices

1. Consistent Naming

Use clear, consistent span names:
# Good
"task_manager.send_message"
"agent.execute"
"scheduler.enqueue"

# Avoid
"send_msg"
"exec"
"queue"

2. Rich Attributes

Add meaningful context to spans:
span.set_attribute("bindu.agent.name", agent_name)
span.set_attribute("bindu.task_id", task_id)
span.set_attribute("bindu.message_count", len(messages))

3. Span Events

Use events for timeline markers:
span.add_event("task.state_changed", {
    "from_state": "working",
    "to_state": "completed"
})

4. Error Handling

Always record errors with context:
from opentelemetry.trace import Status, StatusCode

try:
    result = await agent.execute()
except Exception as e:
    span.record_exception(e)
    span.set_status(Status(StatusCode.ERROR))
    raise

5. Sampling Strategy

Configure sampling for high-volume production:
# Sample 10% of traces
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.1"
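
The key property of `parentbased_traceidratio` is that the keep/drop decision is a deterministic function of the trace ID, so every service in a trace makes the same choice. Here is that idea in miniature, using a SHA-256 hash as a simplification of the SDK's actual threshold math:

```python
import hashlib

# Deterministic ratio sampler: hash the trace ID to a 64-bit integer and
# keep the trace if it falls below ratio * 2**64. Same ID, same decision,
# on every service that sees the trace.
def should_sample(trace_id, ratio):
    h = int(hashlib.sha256(trace_id.encode()).hexdigest()[:16], 16)
    return h < ratio * (1 << 64)

sampled = sum(should_sample(f"trace-{i}", 0.1) for i in range(10_000))
print(sampled)  # roughly 10% of 10,000 synthetic trace IDs
```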

Performance Tuning

Batch Processor

Optimize for production workloads:
# High-volume production
export OTEL_BSP_MAX_QUEUE_SIZE="4096"
export OTEL_BSP_SCHEDULE_DELAY="10000"
export OTEL_BSP_MAX_EXPORT_BATCH_SIZE="1024"
export OTEL_BSP_EXPORT_TIMEOUT="60000"

Development vs Production

Development:
# Immediate export for debugging
export OTEL_USE_BATCH_PROCESSOR="false"
Production:
# Batched export for efficiency
export OTEL_USE_BATCH_PROCESSOR="true"
export OTEL_BSP_SCHEDULE_DELAY="5000"

Troubleshooting

Traces not appearing? Check:
  • Jaeger is running: docker ps | grep jaeger
  • Endpoint is correct: echo $OTEL_EXPORTER_OTLP_ENDPOINT
  • Agent logs show observability initialization
  • Test endpoint: curl http://localhost:4318/v1/traces

Traces delayed?
Cause: BatchSpanProcessor batches spans before sending (default: 5s).
Solution: Set OTEL_USE_BATCH_PROCESSOR="false" for development.

Wrong or missing service name?
Solution: Set OTEL_SERVICE_NAME or configure service_name in the agent config.

High memory usage?
Cause: Queue size too large or export delays.
Solution: Tune the batch processor parameters:
export OTEL_BSP_MAX_QUEUE_SIZE="2048"
export OTEL_BSP_SCHEDULE_DELAY="5000"
