Distributed Tracing

Bindu implements comprehensive distributed tracing using OpenTelemetry to provide end-to-end visibility across the entire task execution lifecycle.

Overview

Distributed tracing tracks requests as they flow through your agent system:
TaskManager → Scheduler → Worker → Agent
Each component creates spans that together form a complete trace, showing:
  • ⏱️ Timing - How long each operation takes
  • 🔗 Relationships - Parent-child span hierarchy
  • 📊 Attributes - Rich metadata about operations
  • 📝 Events - Timeline markers for key moments
  • Errors - Exception details and stack traces

Architecture


Key Components

1. TaskManager Tracing

File: task_telemetry.py
Decorator: @trace_task_operation(operation_name)
Creates root spans for all task operations:
@trace_task_operation("send_message")
async def send_message(self, params: dict):
    # Automatically traced with:
    # - Span name: "task_manager.send_message"
    # - Timing information
    # - Success/error status
    # - Request parameters
    pass
Captured Attributes:
  • bindu.operation - Operation name (e.g., “send_message”)
  • bindu.request_id - JSON-RPC request ID
  • bindu.task_id - Task UUID
  • bindu.context_id - Context UUID
  • bindu.component - “task_manager”
  • bindu.success - Boolean success flag
  • bindu.error_type - Exception class (on error)
  • bindu.error_message - Error description (on error)

2. Scheduler Span Propagation

Files: base.py, memory_scheduler.py
Challenge: Async boundaries break automatic context propagation.
Solution: Pass the span explicitly with each queued operation.
class _TaskOperation(TypedDict):
    operation: str              # "run", "cancel", etc.
    params: dict                # Task parameters
    _current_span: Span         # ⭐ Preserves trace context
How it works:
  1. Scheduler captures active span: get_current_span()
  2. Stores span in _TaskOperation._current_span
  3. Sends operation to worker queue
  4. Worker restores span to continue trace
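
The steps above can be illustrated without any tracing library. The following is a minimal sketch, assuming that the "current span" behaves like a context-local value (here a plain `contextvars.ContextVar` standing in for the real span slot). It shows why a worker task that was started earlier does not see the scheduler's active span, while a reference carried inside the queue item survives. The string span value and the `worker`/`main` functions are illustrative, not Bindu's actual implementation.

```python
import asyncio
import contextvars

# Stand-in for the context-local "current span" slot (assumption for illustration).
current_span = contextvars.ContextVar("current_span", default=None)


async def worker(queue: asyncio.Queue, seen: list) -> None:
    operation = await queue.get()
    # The worker task copied its context when it was created, before the
    # scheduler side activated a span, so the ambient value is still None.
    seen.append(("ambient", current_span.get()))
    # The span carried inside the queue item is still available.
    seen.append(("explicit", operation["_current_span"]))
    queue.task_done()


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    seen: list = []
    worker_task = asyncio.create_task(worker(queue, seen))

    # Scheduler side: a span becomes active, then the operation is enqueued
    # together with an explicit reference to that span.
    current_span.set("span-task_manager.send_message")
    await queue.put({
        "operation": "run",
        "params": {},
        "_current_span": current_span.get(),
    })

    await queue.join()
    await worker_task
    return seen


result = asyncio.run(main())
print(result)
# [('ambient', None), ('explicit', 'span-task_manager.send_message')]
```

The ambient lookup returns nothing in the worker, which is the broken-trace case; the explicit field is what lets the worker reattach to the original trace.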

3. Worker Tracing

File: base.py
Restores the parent span and creates child spans:
# Restore parent span from TaskManager
with use_span(task_operation["_current_span"]):
    # Create child span for worker operation
    with tracer.start_as_current_span(f"{operation} task"):
        await handler(params)
Maintains:
  • Trace continuity across async boundaries
  • Parent-child span relationships
  • Complete trace hierarchy

4. Agent Execution Tracing

File: manifest_worker.py
Tracks agent-level execution:
with tracer.start_as_current_span("agent.execute") as span:
    span.set_attribute("bindu.agent.name", agent_name)
    span.set_attribute("bindu.agent.did", agent_did)
    span.set_attribute("bindu.agent.message_count", len(messages))
    
    start_time = time.time()
    result = await agent.execute()
    execution_time = time.time() - start_time
    
    span.set_attribute("bindu.agent.execution_time", execution_time)
Captured Attributes:
  • bindu.agent.name - Agent name from manifest
  • bindu.agent.did - Agent DID identifier
  • bindu.agent.message_count - Number of messages
  • bindu.agent.execution_time - Processing time (seconds)
  • bindu.component - “agent_execution”
Span Events:
span.add_event("task.state_changed", {
    "from_state": "working",
    "to_state": "completed"
})

Complete Trace Example

Scenario: User sends a message

task_manager.send_message (250ms)

├─ Attributes:
│  ├─ bindu.operation: "send_message"
│  ├─ bindu.request_id: "req-abc123"
│  ├─ bindu.task_id: "task-def456"
│  ├─ bindu.context_id: "ctx-ghi789"
│  ├─ bindu.component: "task_manager"
│  └─ bindu.success: true

└─ run task (220ms)

   ├─ Attributes:
   │  ├─ bindu.task_id: "task-def456"
   │  └─ bindu.operation: "run"

   └─ agent.execute (200ms)

      ├─ Attributes:
      │  ├─ bindu.agent.name: "my-agent"
      │  ├─ bindu.agent.did: "did:bindu:user:agent:uuid"
      │  ├─ bindu.agent.message_count: 3
      │  ├─ bindu.agent.execution_time: 0.200
      │  └─ bindu.component: "agent_execution"

      └─ Events:
         ├─ task.state_changed (t=0ms)
         │  ├─ from_state: "pending"
         │  └─ to_state: "working"

         └─ task.state_changed (t=200ms)
            ├─ from_state: "working"
            └─ to_state: "completed"
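
One way to read a trace like this is to compute each span's self time, that is, its duration minus the time spent in its direct children. The sketch below does this for the durations shown above. `SpanNode` and `self_times` are hypothetical helpers for illustration, and the calculation assumes child spans run sequentially rather than in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class SpanNode:
    name: str
    duration_ms: int
    children: list["SpanNode"] = field(default_factory=list)


def self_times(span: SpanNode) -> dict[str, int]:
    """Return the time spent in each span, excluding its direct children."""
    own = span.duration_ms - sum(child.duration_ms for child in span.children)
    result = {span.name: own}
    for child in span.children:
        result.update(self_times(child))
    return result


# Durations taken from the example trace above.
trace_root = SpanNode("task_manager.send_message", 250, [
    SpanNode("run task", 220, [SpanNode("agent.execute", 200)]),
])

print(self_times(trace_root))
# {'task_manager.send_message': 30, 'run task': 20, 'agent.execute': 200}
```

In this example the agent accounts for 200 ms of the 250 ms total, with 30 ms and 20 ms of overhead in the task manager and worker layers.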

Configuration

{
  "name": "my-agent",
  "telemetry": true,
  "oltp": {
    "endpoint": "http://localhost:4318/v1/traces",
    "service_name": "bindu-agent",
    "service_version": "1.0.0",
    "deployment_environment": "production"
  }
}

Environment Variables

# OTLP endpoint
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318/v1/traces"

# Service identification
export OTEL_SERVICE_NAME="bindu-agent"
export OTEL_SERVICE_VERSION="1.0.0"
export DEPLOYMENT_ENV="production"

# Resource attributes
export OTEL_RESOURCE_ATTRIBUTES="team=ai-platform,region=us-west"

# Batch processing (recommended)
export OTEL_USE_BATCH_PROCESSOR="true"
export OTEL_BSP_SCHEDULE_DELAY="5000"
export OTEL_BSP_MAX_EXPORT_BATCH_SIZE="512"
Agent config parameters take precedence over environment variables.
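
The precedence rule can be sketched as a small lookup. This is an illustration only: `resolve_otlp_endpoint` is a hypothetical helper, not part of Bindu, and it assumes the config layout and variable name shown in the examples above.

```python
import os
from typing import Mapping, Optional


def resolve_otlp_endpoint(
    agent_config: Mapping,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Prefer the agent config value; fall back to the environment variable."""
    env = os.environ if env is None else env
    # "oltp" is the key used in the configuration example above.
    configured = agent_config.get("oltp", {}).get("endpoint")
    if configured:
        return configured
    return env.get("OTEL_EXPORTER_OTLP_ENDPOINT")


env = {"OTEL_EXPORTER_OTLP_ENDPOINT": "http://env-host:4318/v1/traces"}
config = {"oltp": {"endpoint": "http://config-host:4318/v1/traces"}}

print(resolve_otlp_endpoint(config, env))  # http://config-host:4318/v1/traces
print(resolve_otlp_endpoint({}, env))      # http://env-host:4318/v1/traces
```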

Span Attributes Reference

Standard Attributes

| Attribute | Type | Description | Example |
|---|---|---|---|
| bindu.operation | string | Operation name | send_message |
| bindu.task_id | string | Task UUID | task-abc123 |
| bindu.context_id | string | Context UUID | ctx-def456 |
| bindu.request_id | string | Request ID | req-ghi789 |
| bindu.component | string | Component name | task_manager |
| bindu.success | boolean | Success flag | true |

Agent Attributes

| Attribute | Type | Description | Example |
|---|---|---|---|
| bindu.agent.name | string | Agent name | my-agent |
| bindu.agent.did | string | Agent DID | did:bindu:... |
| bindu.agent.message_count | int | Message count | 3 |
| bindu.agent.execution_time | float | Execution time (s) | 0.200 |

Error Attributes

| Attribute | Type | Description | Example |
|---|---|---|---|
| bindu.error_type | string | Exception class | ValueError |
| bindu.error_message | string | Error message | Invalid input |
| exception.type | string | Exception type | ValueError |
| exception.message | string | Exception message | Invalid input |
| exception.stacktrace | string | Stack trace | Traceback... |
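
To make the shape of these values concrete, the sketch below derives all five attributes from a caught exception using only the standard library. `error_attributes` is a hypothetical helper for illustration; it is not how Bindu or OpenTelemetry populate the fields internally.

```python
import traceback


def error_attributes(exc: BaseException) -> dict[str, str]:
    """Build the error attributes listed above from an exception instance."""
    name = type(exc).__name__
    message = str(exc)
    return {
        "bindu.error_type": name,
        "bindu.error_message": message,
        "exception.type": name,
        "exception.message": message,
        "exception.stacktrace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


try:
    raise ValueError("Invalid input")
except ValueError as err:
    attrs = error_attributes(err)

print(attrs["bindu.error_type"], "-", attrs["bindu.error_message"])
# ValueError - Invalid input
```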

Span Events

Events are timeline markers within a span:

Task State Changes

span.add_event("task.state_changed", {
    "from_state": "working",
    "to_state": "completed",
    "timestamp": "2025-01-01T12:00:00Z"
})
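
When a task passes through several states, one event per transition keeps the timeline readable. The following sketch builds the event payloads from an ordered list of states. `state_change_events` is a hypothetical helper; each resulting pair could be passed to `span.add_event(name, attributes)`.

```python
def state_change_events(states: list[str]) -> list[tuple[str, dict[str, str]]]:
    """Pair consecutive states into task.state_changed event payloads."""
    return [
        ("task.state_changed", {"from_state": prev, "to_state": nxt})
        for prev, nxt in zip(states, states[1:])
    ]


events = state_change_events(["pending", "working", "completed"])
for name, attributes in events:
    print(name, attributes)
# task.state_changed {'from_state': 'pending', 'to_state': 'working'}
# task.state_changed {'from_state': 'working', 'to_state': 'completed'}
```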

Custom Events

span.add_event("llm.call_started", {
    "model": "gpt-4",
    "tokens": 150
})

span.add_event("llm.call_completed", {
    "duration_ms": 1500,
    "tokens_generated": 75
})

Best Practices

1. Consistent Naming

Use clear, hierarchical span names:
# Good
"task_manager.send_message"
"agent.execute"
"scheduler.enqueue"
"worker.process_task"

# Avoid
"send_msg"
"exec"
"queue"

2. Rich Attributes

Add meaningful context:
span.set_attribute("bindu.agent.name", agent_name)
span.set_attribute("bindu.task_id", task_id)
span.set_attribute("bindu.agent.message_count", len(messages))
span.set_attribute("bindu.model", "gpt-4")

3. Error Handling

Always record exceptions:
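from opentelemetry.trace import Status, StatusCode

try:
    result = await agent.execute()
except Exception as e:
    # Attach the exception details to the span, mark it as failed, then re-raise
    span.record_exception(e)
    span.set_status(Status(StatusCode.ERROR, str(e)))
    span.set_attribute("bindu.error_type", type(e).__name__)
    span.set_attribute("bindu.error_message", str(e))
    raise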

4. Span Events vs Attributes

Use Attributes for:
  • Static metadata (IDs, names)
  • Final results (execution time, success)
  • Configuration values
Use Events for:
  • State transitions
  • Timeline markers
  • Multiple occurrences
  • Detailed logs

5. Sampling Strategy

For high-volume production:
# Sample 10% of traces
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.1"

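# Alternative: sample every trace (set one sampler, not both).
# Head-based samplers decide when a trace starts, so they cannot select
# only the traces that later end in an error.
export OTEL_TRACES_SAMPLER="parentbased_always_on"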
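
The idea behind ratio-based sampling can be sketched in a few lines: the decision is derived from the trace ID, so every service that sees the same trace reaches the same verdict. The code below is a simplified illustration of that idea, not the SDK's exact implementation; `should_sample` is a hypothetical function and the random trace IDs are generated only for the demonstration.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # compare only the low 64 bits of the trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when its ID falls below a threshold proportional to ratio."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(trace_id, 0.1) for trace_id in trace_ids)

print(f"kept {kept / len(trace_ids):.1%} of traces")
```

With a ratio of 0.1 the kept share comes out close to 10%. A parent-based sampler applies this decision only at the root span; child spans follow their parent's decision.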

Performance Considerations

Batch Processing

Development:
# Immediate export for debugging
export OTEL_USE_BATCH_PROCESSOR="false"
Production:
# Batched export for efficiency
export OTEL_USE_BATCH_PROCESSOR="true"
export OTEL_BSP_MAX_QUEUE_SIZE="2048"
export OTEL_BSP_SCHEDULE_DELAY="5000"
export OTEL_BSP_MAX_EXPORT_BATCH_SIZE="512"
export OTEL_BSP_EXPORT_TIMEOUT="30000"

Tuning Guidelines

| Workload | Queue Size | Delay (ms) | Batch Size |
|---|---|---|---|
| Low volume | 1024 | 5000 | 256 |
| Medium volume | 2048 | 5000 | 512 |
| High volume | 4096 | 10000 | 1024 |
| Very high volume | 8192 | 15000 | 2048 |
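
As a convenience, the table can be turned into ready-to-paste export lines. The sketch below assumes the three columns map to `OTEL_BSP_MAX_QUEUE_SIZE`, `OTEL_BSP_SCHEDULE_DELAY`, and `OTEL_BSP_MAX_EXPORT_BATCH_SIZE`, as in the production example above. `BSP_PRESETS` and `bsp_env` are hypothetical names, not part of Bindu.

```python
# (queue size, schedule delay in ms, export batch size) per workload tier
BSP_PRESETS = {
    "low": (1024, 5000, 256),
    "medium": (2048, 5000, 512),
    "high": (4096, 10000, 1024),
    "very_high": (8192, 15000, 2048),
}


def bsp_env(workload: str) -> dict[str, str]:
    """Return batch processor environment variables for a workload tier."""
    queue_size, delay_ms, batch_size = BSP_PRESETS[workload]
    return {
        "OTEL_USE_BATCH_PROCESSOR": "true",
        "OTEL_BSP_MAX_QUEUE_SIZE": str(queue_size),
        "OTEL_BSP_SCHEDULE_DELAY": str(delay_ms),
        "OTEL_BSP_MAX_EXPORT_BATCH_SIZE": str(batch_size),
    }


for key, value in bsp_env("high").items():
    print(f'export {key}="{value}"')
```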

Troubleshooting

Check:
  1. Observability initialization in logs:
[INFO] Initializing observability...
[INFO] Configured OTLP exporter endpoint=...
  2. OTLP endpoint is reachable:
curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'
  3. Environment variables:
echo $OTEL_EXPORTER_OTLP_ENDPOINT
echo $OTEL_SERVICE_NAME
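
The environment-variable check can also be scripted, for example as a startup preflight. This is a small sketch: `missing_otel_vars` is a hypothetical helper, and it only inspects the two variables echoed above. It does not account for values supplied through the agent config instead.

```python
import os
from typing import Mapping, Optional

CHECKED_VARS = ("OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_SERVICE_NAME")


def missing_otel_vars(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the checked variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in CHECKED_VARS if not env.get(name, "").strip()]


print(missing_otel_vars({"OTEL_SERVICE_NAME": "bindu-agent"}))
# ['OTEL_EXPORTER_OTLP_ENDPOINT']
```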
Cause: Span context not properly propagated
Solution: Ensure _current_span is passed through async boundaries:
task_operation["_current_span"] = get_current_span()
Cause: Agent execution not creating spans
Solution: Verify manifest_worker.py has agent execution tracing:
with tracer.start_as_current_span("agent.execute"):
    # agent execution
Cause: Queue size too large or export delays
Solution: Tune batch processor:
export OTEL_BSP_MAX_QUEUE_SIZE="2048"
export OTEL_BSP_SCHEDULE_DELAY="5000"

Integration Examples

Custom Agent Tracing

Add custom spans in your agent:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

class MyAgent:
    async def execute(self, messages):
        with tracer.start_as_current_span("my_agent.process") as span:
            span.set_attribute("message_count", len(messages))
            
            # LLM call
            with tracer.start_as_current_span("llm.call") as llm_span:
                llm_span.set_attribute("model", "gpt-4")
                response = await self.llm.generate(messages)
                llm_span.set_attribute("tokens", response.tokens)
            
            return response

External Service Tracing

Trace calls to external services:
from opentelemetry.trace import Status, StatusCode

url = "https://api.example.com"

with tracer.start_as_current_span("external.api_call") as span:
    span.set_attribute("http.method", "POST")
    span.set_attribute("http.url", url)

    try:
        # http_client and payload are assumed to be provided by the caller
        response = await http_client.post(url, data=payload)
        span.set_attribute("http.status_code", response.status_code)
    except Exception as e:
        # Record the failure on the span before propagating it
        span.record_exception(e)
        span.set_status(Status(StatusCode.ERROR))
        raise
