A process can be “running” and still broken. Postgres is down, Redis dropped the connection, the worker pool deadlocked: the agent is still binding port :3773 and answering TCP, but the next real request fails.
So /health in Bindu does more than return 200 OK. It walks the runtime — storage, scheduler, task manager — and returns 503 Service Unavailable when any of them is missing or not running. /metrics exposes Prometheus text-format counters, gauges, histograms, and summaries for request traffic, latency, task state, errors, and per-agent task completion.
You do not configure either. Both endpoints are registered on BinduApplication startup, both are on the auth allowlist, and both are exempt from the startup gate so Kubernetes can probe the pod while storage and the scheduler are still coming up. Point your load balancer at /health and your scrapers at /metrics — you have visibility from request one.
Liveness, readiness, metrics
Bindu does not split liveness and readiness across multiple paths. There is one health endpoint and it makes the strict choice for you: 200 means every dependency is ready, 503 means at least one is not.

| Question | Endpoint | Status code | Use for |
|---|---|---|---|
| Is the process alive at all? | /health reachable | TCP-level | Liveness probe |
| Is the agent ready to take work? | /health returns 200 | 200 vs 503 | Readiness probe, load balancer health checks |
| How has it been behaving over time? | /metrics | 200 | Prometheus scrape, Grafana dashboards |
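When any dependency is not ready, /health returns HTTP 503 with "status": "degraded", "ready": false, "health": "degraded".

Readable
/health returns a single JSON document with runtime, application, and system blocks.

Scrapeable
/metrics is Prometheus text format version=0.0.4, so any scraper that reads the Prometheus text exposition format can consume it.

Probe-safe
Both endpoints are on the auth allowlist and exempt from the startup gate, so probes get an answer while the app is still booting.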
Request flow
The metrics middleware runs on every request except /metrics itself (skipped to avoid feedback loops), sanitizes path parameters (UUIDs and numeric IDs are rewritten to :id to keep cardinality bounded), and feeds the global PrometheusMetrics singleton.
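The path sanitization step can be sketched as follows. This is a minimal illustration, not Bindu's code: the function name `sanitize_path` and both regular expressions are assumptions, since the source only states that UUIDs and numeric IDs are rewritten to `:id`.

```python
import re

# Assumed patterns: a canonical 8-4-4-4-12 hex UUID and an all-digit segment.
UUID_SEGMENT = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)
NUMERIC_SEGMENT = re.compile(r"^\d+$")


def sanitize_path(path: str) -> str:
    """Replace UUID and all-digit path segments with the placeholder ':id'."""
    cleaned = [
        ":id" if UUID_SEGMENT.match(seg) or NUMERIC_SEGMENT.match(seg) else seg
        for seg in path.split("/")
    ]
    return "/".join(cleaned)
```

Without this step every task UUID would create its own label value, and the number of time series would grow with traffic instead of with the number of routes.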
/health
Read the fields
| Field | Type | Meaning |
|---|---|---|
| status | string | "ok" when strict_ready=true, else "degraded" |
| ready | bool | Mirrors runtime.strict_ready; use this for readiness probes |
| health | string | "healthy" or "degraded"; same trigger as status |
| uptime_seconds | float | Seconds since the process imported health.py (monotonic clock) |
| version | string | Value of bindu.__version__ |
| runtime.storage_backend | string or null | Class name of app._storage (e.g. PostgresStorage, InMemoryStorage) |
| runtime.scheduler_backend | string or null | Class name of app._scheduler (e.g. RedisScheduler, MemoryScheduler) |
| runtime.task_manager_running | bool | True only if app.task_manager.is_running |
| runtime.strict_ready | bool | True when all of storage, scheduler, task manager are present and running |
| application.penguin_id | string | UUID identifying this server process |
| application.agent_did | string or null | Agent DID from the manifest, if present |
| system.python_version | string | sys.version.split()[0] |
| system.platform | string | platform.system(), e.g. Linux, Darwin |
| system.environment | string | Value of the $ENV environment variable; defaults to "development" |
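The relationship between the runtime block and the top-level fields can be modelled in a few lines. This is a sketch of the rules in the table above, not the actual handler: the function name `derive_health` is hypothetical, and "present" is assumed to mean "not null".

```python
from typing import Any, Dict, Optional


def derive_health(
    storage_backend: Optional[str],
    scheduler_backend: Optional[str],
    task_manager_running: bool,
) -> Dict[str, Any]:
    """Model how strict_ready drives status, ready, health, and the HTTP code."""
    # Assumption: a backend counts as present when its class name is not None.
    strict_ready = (
        storage_backend is not None
        and scheduler_backend is not None
        and task_manager_running
    )
    return {
        "http_status": 200 if strict_ready else 503,
        "status": "ok" if strict_ready else "degraded",
        "ready": strict_ready,
        "health": "healthy" if strict_ready else "degraded",
        "runtime": {
            "storage_backend": storage_backend,
            "scheduler_backend": scheduler_backend,
            "task_manager_running": task_manager_running,
            "strict_ready": strict_ready,
        },
    }
```

Because status, ready, and health all share one trigger, a probe only needs to check one of them, or just the HTTP status code.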
Wire it to your orchestrator
/health returns 200 only when runtime.strict_ready is true, so a single config covers both liveness and readiness.

While storage and the scheduler are still coming up, /health returns 503 without raising. The startup gate in BinduApplication.__call__ whitelists /health, /healthz, and /metrics, so probes get a proper response code instead of a 500.

Only /health is registered as a route. The /healthz path is whitelisted in the startup gate (so it cannot 500 mid-boot) but no handler exists for it: a request to /healthz returns 404 once the app is running. Use /health everywhere.

/metrics
The response is served as text/plain; version=0.0.4; charset=utf-8 with Cache-Control: no-cache. Before generating output, the endpoint refreshes the per-agent active-task gauge by counting tasks in submitted, working, and input-required states from storage.
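For ad hoc checks it can help to read the text format directly. The parser below is a deliberately small sketch: it assumes one sample per line with no trailing timestamp, and the sample payload (including the label names `method`, `endpoint`, and `status`) is illustrative rather than captured from a real agent.

```python
from typing import Dict

# Illustrative payload; label names are assumptions, not confirmed output.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/tasks/:id",status="200"} 12
# TYPE http_requests_in_flight gauge
http_requests_in_flight 0
"""


def parse_samples(text: str) -> Dict[str, float]:
    """Map each sample's 'name{labels}' key to its float value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Split on the last space so label values containing spaces survive.
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples
```

For anything beyond a quick look, a real Prometheus client library is the safer choice, since the full format also allows timestamps and escaped label values.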
Metric reference
http_requests_total — counter
Recorded by MetricsMiddleware, labelled by method, sanitized endpoint, and status code.

Path parameters are normalized: UUIDs (matched by [0-9a-f]{8}-…) and runs of digits both collapse to /:id. That keeps cardinality bounded even with high-traffic task endpoints like /tasks/<uuid>.
http_request_duration_seconds — histogram
Buckets: 0.1, 0.5, 1.0, +Inf seconds. Exposes _bucket, _sum, _count.

The histogram is global, not per-endpoint: there are no method or path labels on the buckets. If you need per-route p95, build it from request count buckets in your aggregator, or alert on the global p95.
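To see what a global p95 from these buckets can and cannot tell you, here is a small estimator that uses linear interpolation inside the matching bucket, the same general approach PromQL's histogram_quantile takes for classic histograms. The function name and the example counts are made up for illustration.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in +Inf: the best answer is the last finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for the 0.1, 0.5, 1.0, +Inf buckets.
EXAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
```

With the example counts, `estimate_quantile(0.95, EXAMPLE_BUCKETS)` lands at about 0.81 seconds, while the p99 is capped at 1.0 because the slowest requests sit in the +Inf bucket. With only three finite bounds, any latency above one second is invisible to quantile estimates.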
agent_tasks_active — gauge
Number of tasks in submitted, working, or input-required state for the agent. Updated lazily: only when /metrics is scraped, by querying storage.count_tasks(status=…) for each active state. Only emitted when at least one agent has reported.
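The lazy refresh amounts to summing one count per active state at scrape time. The sketch below uses a hypothetical synchronous `TaskStorage` protocol and an in-memory stand-in; the real storage interface may be asynchronous and take different arguments.

```python
from typing import Dict, Protocol

ACTIVE_STATES = ("submitted", "working", "input-required")


class TaskStorage(Protocol):
    """Hypothetical storage interface, assumed for this example."""

    def count_tasks(self, status: str) -> int: ...


def count_active_tasks(storage: TaskStorage) -> int:
    """Sum the task counts for every state that is considered active."""
    return sum(storage.count_tasks(status=state) for state in ACTIVE_STATES)


class InMemoryCounts:
    """Stand-in storage backed by a plain dict, for demonstration only."""

    def __init__(self, counts: Dict[str, int]) -> None:
        self._counts = counts

    def count_tasks(self, status: str) -> int:
        return self._counts.get(status, 0)
```

One consequence of refreshing only on scrape: the gauge is a snapshot taken once per scrape interval, so short bursts between scrapes will not show up.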
agent_tasks_completed_total — counter
Labelled by (agent_id, status). Status values come from the task manager and are typically success, failed, or canceled. Only emitted when at least one task has completed.
task_duration_seconds — histogram
Labelled by agent_id and status. Buckets: 1, 5, 10, 30, 60, +Inf seconds. Exposes _bucket, _sum, _count. Only emitted when at least one task has finished.
agent_errors_total — counter
Labelled by agent_id and error_type (e.g. timeout, validation, execution). Only emitted when at least one error has been recorded.
http_request_size_bytes / http_response_size_bytes — summary
Exposes _sum and _count only (no quantiles). Sizes are read from the Content-Length header, so chunked responses without a length register as zero.
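The effect of reading sizes from Content-Length is easy to see in a toy model. The class below is illustrative only; its name and the assumption that header keys arrive lowercased are not taken from Bindu.

```python
from typing import Mapping


class SizeSummary:
    """Toy summary that tracks only a running sum and an observation count."""

    def __init__(self) -> None:
        self.total_bytes = 0
        self.count = 0

    def observe(self, headers: Mapping[str, str]) -> None:
        # Assumption: header names are already lowercased by the framework.
        raw = headers.get("content-length", "").strip()
        # A missing or malformed length is recorded as zero bytes.
        self.total_bytes += int(raw) if raw.isdigit() else 0
        self.count += 1

    def mean(self) -> float:
        return self.total_bytes / self.count if self.count else 0.0
```

Every message still increments the count, so a large share of chunked responses pulls the average (sum divided by count) below the true payload size.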
http_requests_in_flight — gauge
Incremented before call_next, decremented in the finally block, so it always reflects current concurrency even on exceptions.

Always emitted, even at zero, which is useful for “agent is alive but idle” dashboards.

Prometheus scrape config
The metrics middleware skips /metrics itself, so scrape calls do not inflate http_requests_total. Your dashboards measure user traffic, not Prometheus polling.

Use cases
Kubernetes liveness + readiness
Use /health for both probes. The endpoint returns 503 while storage or scheduler is initializing, so Kubernetes does not route traffic to a half-booted pod, and it returns 200 only once task_manager.is_running is true.
Load-balancer health check
Point the health check at GET /health. Anything other than 200 should drain the instance. status: "degraded" is your signal that the process is up but should not receive traffic.
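The same rule (only 200 counts as healthy) can be applied by a deploy script or smoke test that waits for an agent to come up. This is a minimal sketch using only the standard library; the helper names, the timeouts, and the localhost URL in the usage note are assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout_s: float = 2.0) -> int:
    """Return the HTTP status code for a GET, including non-2xx codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the code is the answer we want.
        return err.code


def wait_until_ready(
    fetch_status: Callable[[], int],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until fetch_status() returns 200 or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if fetch_status() == 200:
                return True
        except OSError:
            pass  # connection refused or timed out: not listening yet
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_ready(lambda: http_status("http://localhost:3773/health"))`. A 503 and a refused connection are treated the same way: keep waiting, never route traffic.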
Prometheus + Grafana monitoring
Scrape /metrics every 15s. Build dashboards on rate(http_requests_total), histogram_quantile(0.95, …) for p95 latency, agent_tasks_active for concurrency, and agent_errors_total for fault rates.
Debugging a stuck or degraded agent
Request /health and inspect the runtime block. A null storage_backend or scheduler_backend means initialization never completed; task_manager_running: false means the loop exited or never started.
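Those checks are simple enough to script against a saved /health response. The helper below is a hypothetical triage function that follows the interpretation given above; the finding messages are illustrative wording.

```python
from typing import Any, Dict, List


def diagnose(health: Dict[str, Any]) -> List[str]:
    """Translate the runtime block of a /health document into findings."""
    runtime = health.get("runtime", {})
    findings: List[str] = []
    if runtime.get("storage_backend") is None:
        findings.append("storage: initialization never completed")
    if runtime.get("scheduler_backend") is None:
        findings.append("scheduler: initialization never completed")
    if not runtime.get("task_manager_running", False):
        findings.append("task manager: loop exited or never started")
    return findings
```

An empty list means every runtime check passed, which should coincide with `ready: true` in the same document.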
Sentry + APM cohabitation
The Sentry integration keeps /health, /healthz, and /metrics out of transaction traces by default (SentrySettings.filter_transactions), so probe and scrape traffic does not eat your Sentry quota or pollute performance dashboards.

Best practices
Probe readiness, not just the port
A TCP check on :3773 will pass while storage is still connecting. Use HTTP /health and treat any non-200 as not-ready.

Scrape every 15s
A 15s interval is frequent enough for agent_tasks_active to track real concurrency. Below 10s mostly buys noise.

Alert on degraded, not down
If /health is unreachable you are already on fire. Alert on status="degraded" or health="degraded" responses for an early warning.

Keep both paths public
/health and /metrics are in the auth allowlist by default. Do not put them
behind your OAuth gateway — probes and scrapers do not carry tokens.