Distributed Tracing
Bindu implements comprehensive distributed tracing using OpenTelemetry to provide end-to-end visibility across the entire task execution lifecycle.
Overview
Distributed tracing tracks requests as they flow through your agent system: TaskManager → Scheduler → Worker → Agent
Each component creates spans that form a complete trace, showing:
- ⏱️ Timing - How long each operation takes
- 🔗 Relationships - Parent-child span hierarchy
- 📊 Attributes - Rich metadata about operations
- 📝 Events - Timeline markers for key moments
- ❌ Errors - Exception details and stack traces
Architecture
Key Components
1. TaskManager Tracing
File: task_telemetry.py
Decorator: @trace_task_operation(operation_name)
Creates root spans for all task operations and sets the following attributes:
- bindu.operation - Operation name (e.g., “send_message”)
- bindu.request_id - JSON-RPC request ID
- bindu.task_id - Task UUID
- bindu.context_id - Context UUID
- bindu.component - “task_manager”
- bindu.success - Boolean success flag
- bindu.error_type - Exception class (on error)
- bindu.error_message - Error description (on error)
2. Scheduler Span Propagation
Files: base.py, memory_scheduler.py
Challenge: Async boundaries break automatic context propagation
Solution: Explicit span passing
- Scheduler captures active span: get_current_span()
- Stores span in _TaskOperation._current_span
- Sends operation to worker queue
- Worker restores span to continue trace
3. Worker Tracing
File: base.py
Restores the parent span and creates child spans under it. This preserves:
- Trace continuity across async boundaries
- Parent-child span relationships
- Complete trace hierarchy
4. Agent Execution Tracing
File: manifest_worker.py
Tracks agent-level execution and sets the following attributes:
- bindu.agent.name - Agent name from manifest
- bindu.agent.did - Agent DID identifier
- bindu.agent.message_count - Number of messages
- bindu.agent.execution_time - Processing time (seconds)
- bindu.component - “agent_execution”
Complete Trace Example
Scenario: User sends a message
Configuration
Agent Config (Recommended)
Environment Variables
Agent config parameters take precedence over environment variables.
Span Attributes Reference
Standard Attributes
| Attribute | Type | Description | Example |
|---|---|---|---|
| bindu.operation | string | Operation name | send_message |
| bindu.task_id | string | Task UUID | task-abc123 |
| bindu.context_id | string | Context UUID | ctx-def456 |
| bindu.request_id | string | Request ID | req-ghi789 |
| bindu.component | string | Component name | task_manager |
| bindu.success | boolean | Success flag | true |
Agent Attributes
| Attribute | Type | Description | Example |
|---|---|---|---|
| bindu.agent.name | string | Agent name | my-agent |
| bindu.agent.did | string | Agent DID | did:bindu:... |
| bindu.agent.message_count | int | Message count | 3 |
| bindu.agent.execution_time | float | Execution time (s) | 0.200 |
Error Attributes
| Attribute | Type | Description | Example |
|---|---|---|---|
| bindu.error_type | string | Exception class | ValueError |
| bindu.error_message | string | Error message | Invalid input |
| exception.type | string | Exception type | ValueError |
| exception.message | string | Exception message | Invalid input |
| exception.stacktrace | string | Stack trace | Traceback... |
Span Events
Events are timeline markers within a span.
Task State Changes
Custom Events
Best Practices
1. Consistent Naming
Use clear, hierarchical span names.
2. Rich Attributes
Add meaningful context.
3. Error Handling
Always record exceptions.
4. Span Events vs Attributes
Use Attributes for:
- Static metadata (IDs, names)
- Final results (execution time, success)
- Configuration values
Use Events for:
- State transitions
- Timeline markers
- Multiple occurrences
- Detailed logs
5. Sampling Strategy
For high-volume production, sample a fraction of traces rather than recording every request.
Performance Considerations
Batch Processing
Development:
Tuning Guidelines
| Workload | Queue Size | Delay (ms) | Batch Size |
|---|---|---|---|
| Low volume | 1024 | 5000 | 256 |
| Medium volume | 2048 | 5000 | 512 |
| High volume | 4096 | 10000 | 1024 |
| Very high volume | 8192 | 15000 | 2048 |
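The table can be turned into settings with a small helper. The sketch below maps each row to the standard OpenTelemetry batch span processor environment variables (`OTEL_BSP_MAX_QUEUE_SIZE`, `OTEL_BSP_SCHEDULE_DELAY`, `OTEL_BSP_MAX_EXPORT_BATCH_SIZE`). That Bindu's processor picks up these variables, rather than its own configuration keys, is an assumption; the workload labels and the helper name are hypothetical.

```python
import os

# (queue size, delay in ms, batch size), copied from the tuning table.
TUNING = {
    "low": (1024, 5000, 256),
    "medium": (2048, 5000, 512),
    "high": (4096, 10000, 1024),
    "very_high": (8192, 15000, 2048),
}


def batch_processor_env(workload):
    """Return OpenTelemetry batch processor settings for a workload label."""
    queue_size, delay_ms, batch_size = TUNING[workload]
    return {
        "OTEL_BSP_MAX_QUEUE_SIZE": str(queue_size),
        "OTEL_BSP_SCHEDULE_DELAY": str(delay_ms),
        "OTEL_BSP_MAX_EXPORT_BATCH_SIZE": str(batch_size),
    }


# Apply before the tracer provider is created so the SDK reads the values.
os.environ.update(batch_processor_env("high"))
for key in sorted(batch_processor_env("high")):
    print(f"{key}={os.environ[key]}")
```

In every row the batch size stays below the queue size, which leaves room for new spans to queue while a batch is being exported.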
Troubleshooting
Spans not appearing
Check:
- Observability initialization appears in the logs
- OTLP endpoint is reachable
- Environment variables are set as expected
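The last two checks can be scripted with the standard library alone. This sketch assumes the endpoint is given by the standard `OTEL_EXPORTER_OTLP_ENDPOINT` variable and falls back to `http://localhost:4317`, the default OTLP/gRPC port; your deployment may use a different variable or port. It only tests that a TCP connection opens, not that the collector accepts spans.

```python
import os
import socket
from urllib.parse import urlparse


def endpoint_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the endpoint's host and port opens."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    # 4317 is the default OTLP/gRPC port; OTLP/HTTP normally uses 4318.
    port = parsed.port or 4317
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
print(f"{endpoint} reachable: {endpoint_reachable(endpoint)}")

# List every OpenTelemetry-related variable visible to this process.
for key, value in sorted(os.environ.items()):
    if key.startswith("OTEL_"):
        print(f"{key}={value}")
```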
Broken trace hierarchy
Cause: Span context not properly propagated
Solution: Ensure _current_span is passed through async boundaries.
Missing agent spans
Cause: Agent execution not creating spans
Solution: Verify manifest_worker.py has agent execution tracing.
High memory usage
Cause: Queue size too large or export delays
Solution: Tune the batch processor using the values in the Tuning Guidelines table.
Integration Examples
Custom Agent Tracing
Add custom spans inside your agent to trace its internal steps.
External Service Tracing
Wrap calls to external services in their own spans.
Next Steps
- Jaeger Setup - Set up Jaeger for trace visualization
- Metrics - Learn about metrics collection
- Overview - Back to observability overview
- GitHub - View implementation details
Resources
- OpenTelemetry Tracing - Official tracing documentation
- Semantic Conventions - Standard attribute names
- Python API - OpenTelemetry Python docs
- Best Practices - Manual instrumentation guide