Retry Mechanism
Bindu includes a built-in, Tenacity-based retry mechanism that handles transient failures across workers, storage, schedulers, and API calls. This helps your agents remain resilient in production environments.
Why Use Retry Mechanism?
1. Automatic Failure Recovery
Handle transient failures without manual intervention:
- Network timeouts - Retry when connections drop
- Database connection errors - Recover from temporary DB issues
- Redis connection failures - Handle queue service interruptions
- API rate limits - Automatically back off and retry
- Temporary service unavailability - Wait and retry when services recover
2. Exponential Backoff with Jitter
Smart retry strategy prevents system overload:
- Exponential backoff - Wait longer between each retry attempt
- Random jitter - Distribute retry attempts over time
- Prevents thundering herd - Avoid overwhelming recovering services
- Configurable timing - Adjust wait times per operation type
3. Comprehensive Coverage
Retry logic applied across all critical operations:
- Worker operations - Task execution and cancellation
- Storage operations - Database reads and writes
- Scheduler operations - Task queuing and distribution
- API calls - External service integration
- Application startup - Resilient initialization
4. Observability
Full visibility into retry behavior:
- Retry attempt logging - Track when retries occur
- Failure tracking - Monitor what's failing and why
- Performance metrics - Measure retry impact
- Debug information - Detailed context for troubleshooting
5. Flexible Configuration
Customize retry behavior per environment:
- Environment variables - Configure via .env files
- Per-operation overrides - Fine-tune specific operations
- Different strategies - Separate settings for workers, storage, schedulers
- Easy tuning - Adjust without code changes
When to Use Retry Mechanism
✅ Automatically enabled for:
- Production deployments with external dependencies
- Distributed systems with network calls
- Database and Redis operations
- Worker task processing
- API integrations
❌ Less suitable for:
- Pure in-memory operations (though still available)
- Operations that should fail fast
- Non-idempotent operations without proper handling
Architecture
Bindu's retry mechanism uses Tenacity for robust retry logic.
How It Works
- Operation Execution: Decorated function is called
- Failure Detection: Exception is caught by retry decorator
- Retry Decision: Tenacity determines if retry should occur
- Backoff Calculation: Exponential backoff with jitter applied
- Wait Period: Sleep for calculated duration
- Retry Attempt: Function is called again
- Success or Fail: Either succeeds or exhausts retry attempts
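The flow above can be sketched in plain Python. This is a standard-library illustration of the loop, not Bindu's actual implementation (which delegates to Tenacity); the function name `call_with_retry`, its parameters, and the jitter range are assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    min_wait: float = 1.0,
    max_wait: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Hypothetical helper: call func, retrying on any Exception."""
    for attempt in range(max_attempts):
        try:
            # Operation Execution / Retry Attempt: call the wrapped function.
            return func()
        except Exception:
            # Failure Detection + Retry Decision: give up once attempts run out.
            if attempt == max_attempts - 1:
                raise
            # Backoff Calculation: exponential growth plus jitter, capped.
            jitter = random.uniform(0.0, min_wait)  # assumed jitter range
            delay = min(max_wait, min_wait * (2 ** attempt) + jitter)
            # Wait Period: sleep before the next attempt.
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Passing a custom `sleep` callable makes the loop easy to exercise in tests without real delays.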
Configuration
Environment Variables
Configure retry behavior via a .env file.
Default Settings
If not configured, Bindu uses these defaults:

| Operation Type | Max Attempts | Min Wait | Max Wait |
|---|---|---|---|
| Worker | 3 | 1.0s | 10.0s |
| Storage | 5 | 0.5s | 5.0s |
| Scheduler | 3 | 1.0s | 8.0s |
| API | 4 | 1.0s | 15.0s |
Configuration Parameters
- max_attempts: Maximum number of retry attempts before giving up
- min_wait: Minimum wait time between retries (seconds)
- max_wait: Maximum wait time between retries (seconds)
The wait time before each retry is calculated as min_wait * (2 ^ attempt) + random_jitter, capped at max_wait.
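As a worked example of the wait-time formula, the sketch below evaluates it for the Worker and Storage defaults from the table above. It assumes `attempt` starts at 0 for the first retry (the source does not say) and leaves jitter at zero so the numbers are reproducible; `backoff_delay` is a hypothetical helper, not a Bindu or Tenacity function.

```python
import random


def backoff_delay(attempt: int, min_wait: float, max_wait: float, jitter: float = 0.0) -> float:
    """min_wait * 2**attempt + jitter, capped at max_wait."""
    return min(max_wait, min_wait * (2 ** attempt) + jitter)


# Worker defaults: min_wait=1.0s, max_wait=10.0s (jitter omitted).
worker_delays = [backoff_delay(n, 1.0, 10.0) for n in range(5)]
print(worker_delays)  # [1.0, 2.0, 4.0, 8.0, 10.0]

# Storage defaults: min_wait=0.5s, max_wait=5.0s (jitter omitted).
storage_delays = [backoff_delay(n, 0.5, 5.0) for n in range(5)]
print(storage_delays)  # [0.5, 1.0, 2.0, 4.0, 5.0]

# With jitter the result varies per call but never exceeds max_wait.
jittered = backoff_delay(2, 1.0, 10.0, jitter=random.uniform(0.0, 1.0))
```

The cap is what keeps later retries from waiting arbitrarily long: once the exponential term passes max_wait, every further delay equals max_wait.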
Retry Decorators
1. Worker Operations
For task execution and worker operations.
2. Storage Operations
For database and storage operations.
3. Scheduler Operations
For task scheduling and queue operations.
4. API Operations
For external API calls.
Ad-hoc Retry
For one-off retry logic without decorators.
Retryable Exceptions
By default, retries occur on:
- ConnectionError - Network connection failures
- TimeoutError - Operation timeouts
- asyncio.TimeoutError - Async operation timeouts
- Generic Exception - Catch-all (can be refined)
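One way to express this classification is a small predicate that separates the listed transient errors from everything else. The sketch below is illustrative only: `TRANSIENT_ERRORS`, `is_retryable`, and the `catch_all` flag are hypothetical names, not part of Bindu's API.

```python
import asyncio

# The transient error types listed above. On Python 3.11+ asyncio.TimeoutError
# is an alias of the built-in TimeoutError, so listing both is harmless.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError, asyncio.TimeoutError)


def is_retryable(exc: BaseException, catch_all: bool = False) -> bool:
    """Return True if exc should trigger a retry.

    With catch_all=True, any Exception subclass is retried, which mirrors the
    generic catch-all; BaseException-only types such as KeyboardInterrupt
    are never retried.
    """
    if isinstance(exc, TRANSIENT_ERRORS):
        return True
    return catch_all and isinstance(exc, Exception)
```

Keeping the catch-all off by default means programming errors such as ValueError fail fast instead of being retried.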
Custom Exception Handling
You can customize which exceptions trigger retries.
Applied Retry Logic
Worker Operations
File: bindu/server/workers/manifest_worker.py
PostgreSQL Storage
File: bindu/server/storage/postgres_storage.py
All database operations use execute_with_retry() via _retry_on_connection_error().
Redis Scheduler
File: bindu/server/scheduler/redis_scheduler.py
In-Memory Storage
File: bindu/server/storage/memory_storage.py
Application Initialization
File: bindu/server/applications.py
Best Practices
1. Use Appropriate Retry Settings
Match retry settings to operation characteristics.
2. Override Defaults When Needed
Customize retry behavior for specific operations.
3. Monitor Retry Attempts
Watch logs for retry patterns.
4. Ensure Idempotency
Make operations safe to retry.
5. Handle Non-Retryable Errors
Distinguish between transient and permanent failures.
6. Set Reasonable Timeouts
Combine retries with timeouts.
7. Log Retry Context
Add context to retry logs.
Monitoring & Observability
Retry Logs
Retry attempts are automatically logged.
Key Metrics to Monitor
- Retry Rate: Percentage of operations requiring retries
- Retry Success Rate: Percentage of retries that eventually succeed
- Average Retry Attempts: Mean number of attempts per operation
- Retry Duration: Total time spent in retry loops
- Failure Types: Which exceptions trigger most retries
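To make the first three metrics concrete, the sketch below derives them from a list of per-operation records. The `OperationRecord` shape and the `summarize` function are assumptions for illustration; Bindu's own logging or metrics format may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OperationRecord:
    attempts: int    # total calls made, including the first one
    succeeded: bool  # final outcome after all attempts


def summarize(records: List[OperationRecord]) -> Dict[str, float]:
    """Compute retry rate, retry success rate, and average attempts."""
    total = len(records)
    retried = [r for r in records if r.attempts > 1]
    return {
        # Share of operations that needed at least one retry.
        "retry_rate": len(retried) / total if total else 0.0,
        # Share of retried operations that eventually succeeded.
        "retry_success_rate": (
            sum(r.succeeded for r in retried) / len(retried) if retried else 0.0
        ),
        # Mean number of attempts per operation.
        "average_attempts": sum(r.attempts for r in records) / total if total else 0.0,
    }


records = [
    OperationRecord(1, True),
    OperationRecord(3, True),
    OperationRecord(3, False),
    OperationRecord(1, True),
]
print(summarize(records))
# {'retry_rate': 0.5, 'retry_success_rate': 0.5, 'average_attempts': 2.0}
```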
Integration with Sentry
Retry failures are automatically captured by Sentry.
Troubleshooting
Too Many Retries
Symptom: Operations taking too long due to excessive retries
Solutions:
- Reduce max_attempts
- Decrease max_wait time
- Fix underlying issue causing failures
- Add circuit breaker pattern
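The circuit breaker pattern mentioned in the last item can be sketched as follows. Bindu lists a circuit breaker under future enhancements, so this is not an existing Bindu feature; the `CircuitBreaker` and `CircuitOpenError` names, the thresholds, and the injectable clock are all assumptions for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a retried operation in a breaker like this lets callers fail fast while a dependency is down, instead of every caller exhausting its full retry schedule.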
Retries Not Working
Symptom: Operations failing without retry attempts
Solutions:
- Check decorator is applied: @retry_worker_operation()
- Verify exception is retryable
- Check retry settings are loaded
- Review logs for retry messages
Thundering Herd
Symptom: All instances retry simultaneously, overwhelming service
Solutions:
- Jitter is applied automatically; increase max_wait to spread retries over a longer window
- Stagger instance startup times
- Add circuit breaker
- Use rate limiting
Non-Idempotent Operations
Symptom: Retries cause duplicate operations or data corruption
Solutions:
- Make operations idempotent
- Use idempotency keys
- Check state before retry
- Reduce retry attempts
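As an illustration of idempotency keys, the sketch below shows a write that is safe to retry because the result of the first successful call is stored under a caller-supplied key. `PaymentLedger` and its methods are hypothetical and use an in-memory dict; a real system would persist the keys alongside the data.

```python
from typing import Dict, List


class PaymentLedger:
    """Hypothetical in-memory store that deduplicates writes by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.charges: List[int] = []

    def charge(self, idempotency_key: str, amount: int) -> str:
        # A retried call with the same key returns the stored receipt
        # instead of charging a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append(amount)
        receipt = f"receipt-{len(self.charges)}"
        self._results[idempotency_key] = receipt
        return receipt


ledger = PaymentLedger()
first = ledger.charge("order-42", 500)
second = ledger.charge("order-42", 500)  # simulated retry of the same request
print(first == second, ledger.charges)  # True [500]
```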
Testing
Unit Tests
Test retry behavior.
Integration Tests
Test retry with real dependencies.
Performance Considerations
Retry Overhead
Each retry adds latency:
- 1 retry: ~1-2s additional latency
- 3 retries: ~5-10s additional latency
- 5 retries: ~15-30s additional latency
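These ranges can be roughly reproduced by summing the backoff waits. The sketch below assumes the Worker defaults (min_wait=1.0s, max_wait=10.0s), a 0-based attempt index, and no jitter, and it counts only waiting time, not the time spent in the failed calls themselves; `total_backoff` is a hypothetical helper.

```python
def total_backoff(retries: int, min_wait: float = 1.0, max_wait: float = 10.0) -> float:
    """Sum of the capped exponential waits before each retry, ignoring jitter."""
    return sum(min(max_wait, min_wait * (2 ** n)) for n in range(retries))


for retries in (1, 3, 5):
    print(retries, total_backoff(retries))
# 1 1.0
# 3 7.0
# 5 25.0
```

Under those assumptions the totals fall inside the ranges listed above; jitter and slow-failing calls push real latency toward the upper end.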
Memory Impact
Minimal memory overhead:
- Retry state: ~100 bytes per operation
- Logging: ~500 bytes per retry attempt
- Total: Negligible for most applications
CPU Impact
Negligible CPU overhead:
- Backoff calculation: ~0.01ms
- Jitter generation: ~0.001ms
- Logging: ~0.1ms
Future Enhancements
1. Circuit Breaker
Prevent retry storms when a service is down.
2. Retry Metrics
Export retry metrics to observability platforms.
3. Adaptive Backoff
Adjust backoff based on system load.
4. Retry Budget
Limit total retry time across all operations.
Comparison with Alternatives
| Feature | Tenacity | Backoff | Retry | Custom |
|---|---|---|---|---|
| Async support | ✅ Native | ⚠️ Limited | ❌ No | ⚠️ Manual |
| Exponential backoff | ✅ Built-in | ✅ Built-in | ✅ Built-in | ⚠️ Manual |
| Jitter | ✅ Built-in | ✅ Built-in | ❌ No | ⚠️ Manual |
| Decorators | ✅ Yes | ✅ Yes | ✅ Yes | ⚠️ Manual |
| Configurability | ✅ Extensive | ⚠️ Moderate | ⚠️ Basic | ✅ Full |
| Logging | ✅ Integrated | ⚠️ Basic | ❌ No | ⚠️ Manual |
| Best for | Production | Simple cases | Legacy code | Special needs |
Conclusion
Bindu's retry mechanism provides robust, automatic failure recovery for production deployments. It offers:
- ✅ Automatic retry: Handle transient failures without code changes
- ✅ Smart backoff: Exponential backoff with jitter prevents overload
- ✅ Full coverage: Workers, storage, schedulers, and APIs
- ✅ Observability: Integrated logging and monitoring
- ✅ Flexibility: Configurable per environment and operation
- ✅ Battle-tested: Built on Tenacity, used by thousands of projects