Things fail in production for ordinary reasons. A database connection drops. Redis goes away for a moment. An API times out. Most of those failures are transient. If the system treats every one of them as final, tasks fail for no good reason and recovering services get hammered the instant they come back. Bindu wraps every brittle boundary (workers, storage, scheduler, outbound HTTP) in a small set of Tenacity decorators that retry transient errors with exponential backoff and reraise everything else immediately.
Why Retry Matters
| Without retry | With Bindu retry |
|---|---|
| Temporary failures surface as immediate task failures | Transient errors recover automatically before users see them |
| Recovering services get a thundering herd | Per-attempt backoff (with jitter on three of four families) spreads load |
| Worker, storage, scheduler, and HTTP each need custom handling | Four named decorators wrap the same Tenacity machinery |
| Logic bugs and transient errors are retried indistinguishably | Only a narrow allowlist of transient exceptions is retried |
| Tuning behaviour requires code changes | RETRY__* env vars override every default |
How Bindu Retry Works
All four decorators are thin wrappers around a single factory, create_retry_decorator(operation_type, ...), defined in bindu/utils/retry.py. The factory:
- Looks up the family’s defaults on app_settings.retry (or honours your override).
- Picks a wait strategy: wait_random_exponential (jitter) or wait_exponential (no jitter).
- Builds an AsyncRetrying loop that retries only on TRANSIENT_EXCEPTIONS.
- Logs at WARNING before each sleep (via before_sleep_log) and at INFO after each attempt (via after_log).
- Reraises the original exception once attempts are exhausted (reraise=True).
Only transient exceptions are retried. Application errors like ValueError or KeyError raise on the first attempt because they are not in the retry list.
The Lifecycle: Fail, Wait, Try Again
What Counts As Transient
The allowlist lives in bindu/utils/retry.py as TRANSIENT_EXCEPTIONS.
A second list, HTTP_RETRYABLE_EXCEPTIONS, extends this with HTTPConnectionError, HTTPTimeoutError, and HTTPServerError (5xx). It is defined for HTTP callers, but the four headline decorators currently all use TRANSIENT_EXCEPTIONS.
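A minimal sketch of how an allowlist like this classifies exceptions. It assumes the tuple contains at least `ConnectionError`, `TimeoutError`, and `OSError` (the three base classes this page names); the real tuple in bindu/utils/retry.py may hold more entries. `ALLOWLIST_SKETCH`, `is_transient`, and `BrokerUnavailable` are illustrative names, not Bindu identifiers.

```python
# Stand-in for the allowlist; assumed contents, not copied from Bindu.
ALLOWLIST_SKETCH = (ConnectionError, TimeoutError, OSError)


def is_transient(exc: BaseException) -> bool:
    """Return True when the exception matches the allowlist, subclasses included."""
    return isinstance(exc, ALLOWLIST_SKETCH)


class BrokerUnavailable(ConnectionError):
    """Hypothetical custom error that opts in to retry by inheritance."""


checks = {
    "ConnectionResetError": is_transient(ConnectionResetError()),
    "BrokerUnavailable": is_transient(BrokerUnavailable()),
    "FileNotFoundError": is_transient(FileNotFoundError()),  # an OSError subclass
    "ValueError": is_transient(ValueError()),
    "KeyError": is_transient(KeyError()),
}
```

In Python 3, `ConnectionError` and `TimeoutError` are themselves subclasses of `OSError`, so any allowlist that includes `OSError` also matches errors such as `FileNotFoundError`. That is worth keeping in mind when deciding which exceptions a wrapped function lets escape.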
Subclasses count: any custom exception that inherits from ConnectionError, TimeoutError, or OSError is retried automatically.
Backoff: Plain vs. Jittered
Bindu picks between two Tenacity wait strategies per family:
- wait_exponential doubles the wait each attempt, clamped to [min_wait, max_wait]. Deterministic. Used for storage.
- wait_random_exponential samples uniformly in [0, min(max_wait, multiplier * 2^attempt)]. Spreads retries to avoid thundering herds. Used for worker, scheduler, and api.
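The two formulas can be compared with a small stdlib-only sketch. It follows the expressions as written on this page, not Tenacity's source: the attempt index starting at 0 and a multiplier of 1.0 are assumptions, and `plain_wait` / `jittered_wait` are illustrative names.

```python
import random
from typing import Optional


def plain_wait(attempt: int, multiplier: float = 1.0,
               min_wait: float = 0.5, max_wait: float = 5.0) -> float:
    """Deterministic backoff: double each attempt, clamp to [min_wait, max_wait]."""
    return max(min_wait, min(max_wait, multiplier * 2 ** attempt))


def jittered_wait(attempt: int, multiplier: float = 1.0, max_wait: float = 10.0,
                  rng: Optional[random.Random] = None) -> float:
    """Jittered backoff: uniform sample in [0, min(max_wait, multiplier * 2^attempt)]."""
    if rng is None:
        rng = random.Random()
    return rng.uniform(0.0, min(max_wait, multiplier * 2 ** attempt))


# Storage-style schedule: identical on every run.
plain_schedule = [plain_wait(n) for n in range(5)]  # [1.0, 2.0, 4.0, 5.0, 5.0]

# Worker-style schedule: each value is random but never exceeds its cap.
jitter_schedule = [jittered_wait(n, rng=random.Random(7)) for n in range(5)]
```

Because every jittered client draws a different value below the same cap, a fleet of workers that failed at the same instant does not retry at the same instant.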
The Four Decorator Families
retry_worker_operation
Wraps ManifestWorker task execution. Default 3 attempts, 1.0–10.0 s, jittered. Used in bindu/server/workers/manifest_worker.py on run_task and cancel_task.
retry_storage_operation
Wraps storage CRUD on the in-memory backend. Default 5 attempts, 0.5–5.0 s, no jitter. Used in bindu/server/storage/memory_storage.py on load_task, submit_task, update_task.
retry_scheduler_operation
Wraps scheduler enqueue calls. Default 3 attempts, 1.0–8.0 s, jittered. Used in bindu/server/scheduler/memory_scheduler.py and bindu/server/scheduler/redis_scheduler.py on run_task, cancel_task, pause_task, resume_task.
retry_api_call
Wraps outbound HTTP. Default 4 attempts, 1.0–15.0 s, jittered. Used via create_retry_decorator("api") on the HTTP client in bindu/utils/http/client.py (get, post, put, delete, request) and on push delivery in bindu/utils/notifications.py (_post_with_retry).
All four delegate to the create_retry_decorator(operation_type, ...) factory. They exist as named convenience wrappers for grep-ability and for backward compatibility: calling create_retry_decorator("api") is exactly equivalent to retry_api_call().
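The relationship between one factory and several named wrappers can be sketched with the standard library. This is not Bindu's code: `make_retry_decorator`, `retry_api`, `retry_storage`, and the `retry_config` attribute are hypothetical, the allowlist contents are assumed, and the sketch uses plain exponential waits for every family (jitter is omitted for brevity).

```python
import asyncio
import functools

# Family defaults as stated on this page: (max_attempts, min_wait, max_wait).
FAMILY_DEFAULTS = {
    "worker": (3, 1.0, 10.0),
    "storage": (5, 0.5, 5.0),
    "scheduler": (3, 1.0, 8.0),
    "api": (4, 1.0, 15.0),
}
TRANSIENT = (ConnectionError, TimeoutError, OSError)  # assumed allowlist


def make_retry_decorator(family, max_attempts=None, min_wait=None, max_wait=None):
    """Hypothetical stand-in for a per-family retry decorator factory."""
    d_attempts, d_min, d_max = FAMILY_DEFAULTS[family]
    attempts = max_attempts or d_attempts
    low = min_wait or d_min
    high = max_wait or d_max

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except TRANSIENT:
                    if attempt == attempts:
                        raise  # budget exhausted: surface the original error
                    await asyncio.sleep(min(high, low * 2 ** (attempt - 1)))

        wrapper.retry_config = (attempts, low, high)  # exposed for inspection only
        return wrapper

    return decorator


# Named convenience wrappers: the same factory with the family pre-bound.
retry_api = functools.partial(make_retry_decorator, "api")
retry_storage = functools.partial(make_retry_decorator, "storage")


@retry_api()
async def fetch_a():
    return "ok"


@make_retry_decorator("api")
async def fetch_b():
    return "ok"
```

Both `fetch_a` and `fetch_b` end up with the same budget, which is the sense in which a named wrapper and a direct factory call are interchangeable.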
Why four decorators, not one?
The split is operational, not technical:
- Storage retries should be many and fast: a flaky local connection deserves five 0.5–5 s pokes, not three 10 s sulks. Storage runs in-process, so jitter buys you nothing.
- API retries should be fewer and longer — remote services need room to breathe, and jitter prevents pods from synchronising.
- Worker retries cover task execution and should be conservative; retrying agent logic too aggressively masks real bugs.
- Scheduler retries cover broker hand-off, where the failure mode is “Redis briefly unavailable” — short attempts, modest wait.
Defaults and Configuration
Family Defaults
Defined in RetrySettings (bindu/settings.py):
| Family | max_attempts | min_wait | max_wait | Jitter |
|---|---|---|---|---|
| worker | 3 | 1.0 s | 10.0 s | yes |
| storage | 5 | 0.5 s | 5.0 s | no |
| scheduler | 3 | 1.0 s | 8.0 s | yes |
| api | 4 | 1.0 s | 15.0 s | yes |
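A quick way to read this table is to ask how long a single call can spend sleeping before it gives up. With N attempts there are at most N - 1 waits, and every wait is capped at max_wait, so (N - 1) * max_wait is an upper bound for both strategies. The sketch below only does that arithmetic; `max_total_sleep` is an illustrative helper, and the bound excludes the time the attempts themselves take.

```python
# (max_attempts, min_wait, max_wait) per family, taken from the table above.
FAMILY_DEFAULTS = {
    "worker": (3, 1.0, 10.0),
    "storage": (5, 0.5, 5.0),
    "scheduler": (3, 1.0, 8.0),
    "api": (4, 1.0, 15.0),
}


def max_total_sleep(max_attempts: int, max_wait: float) -> float:
    """Upper bound on sleep time: one wait between attempts, each capped at max_wait."""
    return (max_attempts - 1) * max_wait


bounds = {
    family: max_total_sleep(attempts, high)
    for family, (attempts, _low, high) in FAMILY_DEFAULTS.items()
}
# bounds == {"worker": 20.0, "storage": 20.0, "scheduler": 16.0, "api": 45.0}
```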
Environment Variables
RetrySettings lives under the top-level Settings model, which uses env_nested_delimiter="__", so each variable is named RETRY__<field>.
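The naming rule and the override precedence described in this section can be sketched together. This is a simplified stdlib model, not how pydantic-settings or Bindu resolve values internally; the field name `storage_max_attempts` (and therefore `RETRY__STORAGE_MAX_ATTEMPTS`) is an assumption made for illustration, so check RetrySettings for the real field names.

```python
import os

# Hypothetical field name and default, for illustration only.
DEFAULTS = {"storage_max_attempts": 5}


def from_env(field, environ):
    """Read RETRY__<FIELD> from the given mapping, or return None when unset."""
    raw = environ.get("RETRY__" + field.upper())
    return int(raw) if raw is not None else None


def resolve(field, override=None, environ=os.environ):
    """Per-call override first, then env var, then the built-in default.

    The `or` chain treats 0 and None alike, so a zero override falls through.
    """
    return override or from_env(field, environ) or DEFAULTS[field]


env = {"RETRY__STORAGE_MAX_ATTEMPTS": "8"}
results = (
    resolve("storage_max_attempts", override=10, environ=env),  # override wins
    resolve("storage_max_attempts", environ=env),               # env var wins
    resolve("storage_max_attempts", environ={}),                # default applies
    resolve("storage_max_attempts", override=0, environ=env),   # 0 falls back
)
# results == (10, 8, 5, 8)
```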
Per-call overrides on the decorator (@retry_storage_operation(max_attempts=10)) win over env vars, which win over the defaults baked into RetrySettings. The or fallback inside create_retry_decorator means an override of 0 or None falls back to settings, so pass a real positive value.
Decorator Reference
retry_worker_operation()
Family: worker · Jitter: yes · Defaults: 3 attempts, 1.0–10.0 s
Wraps task execution on ManifestWorker. Failures during manifest.run(...) only retry when they bubble up as ConnectionError/TimeoutError/OSError. Agent-side ValueError or RuntimeError is not retried: the worker catches it, marks the task failed, and reraises.
The real call sites are in bindu/server/workers/manifest_worker.py. cancel_task deliberately caps at 2 attempts: a cancel that fails twice is not going to start working on attempt three.
retry_storage_operation()
Family: storage · Jitter: no (wait_exponential) · Defaults: 5 attempts, 0.5–5.0 s
Wraps storage CRUD on InMemoryStorage. The implementation overrides per-call to a tighter budget tuned for in-process memory, defined in bindu/server/storage/memory_storage.py (lines 41–44) and applied at lines 71, 103, and 242. The same decorator covers submit_task and update_task.
The Postgres storage backend does not use @retry_storage_operation. It calls execute_with_retry directly via its own _retry_on_connection_error helper (bindu/server/storage/postgres_storage.py line 243), keyed off storage.postgres_max_retries and storage.postgres_retry_delay from StorageSettings. So the RETRY__STORAGE_* env vars affect the in-memory backend and any code that uses the decorator directly; they do not retune Postgres.
retry_scheduler_operation()
Family: scheduler · Jitter: yes · Defaults: 3 attempts, 1.0–8.0 s
Wraps the four enqueue operations on both scheduler backends.
bindu/server/scheduler/redis_scheduler.py applies it at lines 114, 124, 134, and 144.
bindu/server/scheduler/memory_scheduler.py (lines 73, 83, 93, 103) overrides the defaults for its anyio stream with a tight 0.1–1.0 s window across 3 attempts.
retry_api_call()
Family: api · Jitter: yes · Defaults: 4 attempts, 1.0–15.0 s
The headline name. Internally, Bindu’s HTTP client and push notifier reach for the factory directly so they can mix in extra parameters.
bindu/utils/http/client.py calls the factory at lines 195, 219, 245, 271, and 291.
bindu/utils/notifications.py (line 125) uses a tighter override for push delivery. Push delivery additionally short-circuits the retry for 4xx (except 429) inside the wrapped body; the decorator only sees the exceptions you let escape.
Inside an Attempt
Invoke the wrapped function
AsyncRetrying enters its loop with stop=stop_after_attempt(N), wait=<exponential strategy>, retry=retry_if_exception_type(TRANSIENT_EXCEPTIONS), reraise=True. A debug log records the attempt number.
On success: stop
The with attempt: block records success; the async for exits and the wrapper returns the value.
On non-transient exception: reraise now
Anything outside TRANSIENT_EXCEPTIONS (e.g. ValueError) skips the retry-decision path and propagates immediately. There is no backoff and no further attempt.
On transient exception: log and sleep
before_sleep_log(logger, WARNING) writes a warning. The wait strategy computes the next sleep from the cap min(max_wait, multiplier * 2^attempt): storage uses the cap itself (deterministic), while every other family samples uniformly between 0 and the cap. after_log(logger, INFO) records the attempt outcome.
Examples
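The four steps under Inside an Attempt can be approximated with the standard library alone. The sketch below is not Tenacity's implementation and not Bindu's wrapper: `run_with_retry`, the `sleep` parameter, and the logger name are hypothetical, the allowlist contents are assumed, and the attempt index is taken to start at 1.

```python
import asyncio
import logging
import random

logger = logging.getLogger("retry_sketch")
TRANSIENT = (ConnectionError, TimeoutError, OSError)  # assumed allowlist


async def run_with_retry(func, max_attempts=3, min_wait=1.0, max_wait=10.0,
                         jitter=True, sleep=asyncio.sleep):
    """Stdlib approximation of the attempt loop described above."""
    for attempt in range(1, max_attempts + 1):
        logger.debug("attempt %d of %d", attempt, max_attempts)
        # Only allowlisted exceptions are caught; anything else propagates at once.
        try:
            result = await func()
        except TRANSIENT as exc:
            logger.info("attempt %d failed: %r", attempt, exc)
            if attempt == max_attempts:
                raise  # exhausted: reraise the original exception
            cap = min(max_wait, min_wait * 2 ** (attempt - 1))
            delay = random.uniform(0.0, cap) if jitter else cap
            logger.warning("retrying in %.2fs after %r", delay, exc)
            await sleep(delay)
        else:
            return result  # success ends the loop


calls = {"n": 0}
delays = []


async def flaky():
    """Fail twice with a transient error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection dropped")
    return "done"


async def record_sleep(seconds):
    delays.append(seconds)  # record instead of sleeping so the demo is instant


outcome = asyncio.run(run_with_retry(flaky, jitter=False, sleep=record_sleep))
# outcome == "done", calls["n"] == 3, delays == [1.0, 2.0]
```

Injecting `sleep` is a design choice for the sketch: it keeps the demonstration fast and makes the computed waits observable.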
Custom decorator usage
Ad-hoc retry (no decorator)
execute_with_retry is what applications.py uses to retry storage and scheduler construction at startup, and what postgres_storage.py uses for every query. It uses wait_random_exponential (jitter) and the same TRANSIENT_EXCEPTIONS allowlist.
Env-var overrides for a noisy network
Make sure operations are idempotent
Anything wrapped by a retry decorator should be safe to run twice. Set-like operations are naturally idempotent.
Distinguish transient from logic errors
Sample log output
A storage call that fails twice then succeeds (logger bindu.utils.retry):
The before_sleep_log / after_log lines come from Tenacity directly; the Executing ... operation line comes from the wrapper inside create_retry_decorator.
Troubleshooting
Retries are taking too long. Lower max_attempts and/or max_wait, either per call on the decorator or through the RETRY__* env vars.
An exception is not being retried. It is probably not in TRANSIENT_EXCEPTIONS. Either subclass ConnectionError/TimeoutError/OSError in your own exception, or wrap it before raising.
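Both options can be sketched in a few lines. Everything named here is hypothetical (`VendorThrottled`, `UpstreamBusy`, `call_vendor`, `call_vendor_retryable`), and the allowlist contents are assumed from the classes this page mentions.

```python
TRANSIENT = (ConnectionError, TimeoutError, OSError)  # assumed allowlist


class VendorThrottled(Exception):
    """Hypothetical third-party error that inherits from no allowlisted class."""


class UpstreamBusy(ConnectionError):
    """Option 1: your own exception subclasses an allowlisted base."""


def call_vendor(fail: bool) -> str:
    """Hypothetical vendor call that raises its own exception type."""
    if fail:
        raise VendorThrottled("slow down")
    return "ok"


def call_vendor_retryable(fail: bool) -> str:
    """Option 2: translate the vendor error at the boundary, keeping the cause."""
    try:
        return call_vendor(fail)
    except VendorThrottled as exc:
        raise ConnectionError(f"vendor throttled: {exc}") from exc
```

Using `raise ... from exc` keeps the original error attached as `__cause__`, so the traceback still shows what the vendor library actually reported.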
Logic errors are being retried. Application code that raises ConnectionError for a non-transient problem will be retried N times before failing, which is exactly what you don’t want. Keep TRANSIENT_EXCEPTIONS narrow and raise application errors as ValueError/RuntimeError so they fail fast.
Postgres retries don’t respond to RETRY__STORAGE_*. Correct — Postgres uses storage.postgres_max_retries / storage.postgres_retry_delay from StorageSettings, not RetrySettings.
Testing
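One way to test retry behaviour is to wrap a function that fails a fixed number of times and count the calls. The sketch below does not import Bindu: `retry_sketch` is a hypothetical zero-wait stand-in for a retry decorator, the allowlist contents are assumed, and the test functions are plain functions that a runner such as pytest could collect.

```python
import asyncio
import functools

TRANSIENT = (ConnectionError, TimeoutError, OSError)  # assumed allowlist


def retry_sketch(max_attempts=3):
    """Hypothetical zero-wait stand-in for a retry decorator."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except TRANSIENT:
                    if attempt == max_attempts:
                        raise
        return wrapper
    return decorator


def make_flaky(failures, exc_type):
    """Build a retried coroutine that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    @retry_sketch(max_attempts=3)
    async def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise exc_type("boom")
        return "ok"

    return operation, state


def test_transient_error_recovers():
    operation, state = make_flaky(2, ConnectionError)
    assert asyncio.run(operation()) == "ok"
    assert state["calls"] == 3


def test_exhausted_budget_reraises_original():
    operation, state = make_flaky(5, TimeoutError)
    try:
        asyncio.run(operation())
    except TimeoutError:
        pass
    else:
        raise AssertionError("expected TimeoutError")
    assert state["calls"] == 3


def test_logic_error_fails_fast():
    operation, state = make_flaky(1, ValueError)
    try:
        asyncio.run(operation())
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    assert state["calls"] == 1
```

The same three cases (recover, exhaust, fail fast) are the ones worth asserting against the real decorators; keeping the configured waits small is one way to stop such tests from slowing the suite.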
Related
- Storage: backend that exposes the in-memory @retry_storage_operation calls and the Postgres _retry_on_connection_error helper.
- Scheduler: Redis and in-memory schedulers whose enqueue paths are retry-wrapped.
- Notifications: push delivery uses @create_retry_decorator("api", ...) with its own tighter budget.
- Observability: every retry attempt is logged via the bindu.utils.retry logger and surfaced through your existing log pipeline.