
Health API

Monitor system health and configure heartbeat schedules.

Health check (web)

GET /api/health
No authentication required. Returns system health metrics including CPU and memory usage.
The backend service exposes its own health check at GET /health (without the /api prefix). The web and backend health endpoints are independent — the web endpoint reports on the web application process while the backend endpoint reports on the API service. See backend health check below for details.

Response

{
  "status": "ok",
  "health": "healthy",
  "timestamp": "2026-03-19T00:00:00Z",
  "cpu": {
    "usage": 15.3,
    "cores": 4
  },
  "memory": {
    "usage": 42.1,
    "total": 8589934592,
    "used": 3617054720,
    "free": 4972879872
  },
  "uptime": 86400
}
The health field reflects overall system status:
| Value | Condition |
| --- | --- |
| `healthy` | CPU and memory usage both at or below 70% |
| `degraded` | CPU or memory usage above 70% but at or below 85% |
| `unhealthy` | CPU or memory usage above 85% |
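The thresholds above can be sketched as a small classifier. This is a minimal Python sketch of the documented rules, not the server's actual implementation; the function name is illustrative.

```python
def classify_health(cpu_pct: float, mem_pct: float) -> str:
    """Classify overall health per the documented thresholds.

    The worse of CPU and memory usage determines the result:
    <= 70% is healthy, <= 85% is degraded, above 85% is unhealthy.
    """
    worst = max(cpu_pct, mem_pct)
    if worst <= 70:
        return "healthy"
    if worst <= 85:
        return "degraded"
    return "unhealthy"
```

Note that a single resource crossing a threshold is enough: 10% CPU with 86% memory is still `unhealthy`.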

Degraded and unhealthy responses

When the system is degraded or unhealthy, the endpoint still returns HTTP 200 with the health field set to degraded or unhealthy. The status field remains ok.
{
  "status": "ok",
  "health": "unhealthy",
  "timestamp": "2026-03-19T00:00:00Z",
  "cpu": { "usage": 92.5, "cores": 4 },
  "memory": { "usage": 88.0, "total": 8589934592, "used": 7558529024, "free": 1031405568 },
  "uptime": 86400
}

Error response

An HTTP 500 is returned only when an unexpected error occurs while collecting health metrics, not for degraded or unhealthy status:
{
  "status": "error",
  "health": "unhealthy",
  "timestamp": "2026-03-19T00:00:00Z"
}
| Code | Description |
| --- | --- |
| 200 | Health check succeeded. Check the `health` field for `healthy`, `degraded`, or `unhealthy`. |
| 500 | Unexpected error collecting health metrics. |
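Because the endpoint returns HTTP 200 even when the system is degraded or unhealthy, monitoring clients must inspect the `health` field rather than the status code alone. A minimal sketch of that check (the function name is illustrative, not part of the API):

```python
def needs_attention(status_code: int, body: dict) -> bool:
    """Decide whether a health response should trigger an alert.

    HTTP 200 does not mean healthy: the health field carries the
    real signal. Only a non-200 response (e.g. 500) means the
    check itself failed.
    """
    if status_code != 200:
        return True
    return body.get("health") in ("degraded", "unhealthy")
```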

Backend health check

GET /health
No authentication required. Returns backend service status including Render API availability. This endpoint is served by the backend API service (without the /api prefix).
The backend API continues to serve non-provisioning endpoints (health, metrics, auth, AI, registration) even when the Render API is not reachable. Agent provisioning and lifecycle operations are disabled until the Render API becomes available.

Response

{
  "status": "ok",
  "timestamp": "2026-03-19T00:00:00Z",
  "docker": "available",
  "provisioning": "enabled",
  "provider": "render"
}
| Field | Type | Description |
| --- | --- | --- |
| `status` | string | Always `ok` when the backend is running |
| `timestamp` | string | ISO 8601 timestamp of the health check |
| `docker` | string | Provisioning infrastructure availability: `available` when the Render API is reachable, `unavailable` otherwise. Despite the field name, this checks the Render API — the name is a legacy artifact from when agents ran as local Docker containers. |
| `provisioning` | string | Agent provisioning capability: `enabled` when the Render API is available, `disabled` otherwise. |
| `provider` | string | Infrastructure provider. Always `render`. |

Response when Render API is unavailable

When the Render API is not reachable, the health endpoint still returns HTTP 200 but reports degraded capabilities:
{
  "status": "ok",
  "timestamp": "2026-03-19T00:00:00Z",
  "docker": "unavailable",
  "provisioning": "disabled",
  "provider": "render"
}
When provisioning is disabled, any request to a provisioning-dependent endpoint (such as deploying, starting, stopping, or restarting an agent) returns a 500 error. Non-provisioning endpoints continue to operate normally.

Get heartbeat settings

GET /api/heartbeat?agentId=agent_123
Requires session authentication. Returns the heartbeat configuration for a specific agent.

Query parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `agentId` | string | Yes | The agent to retrieve heartbeat settings for |

Response

{
  "heartbeat": {
    "frequency": "3h",
    "enabled": true,
    "lastHeartbeat": "2026-03-19T00:00:00Z",
    "nextHeartbeat": "2026-03-19T03:00:00Z"
  }
}
When no settings have been saved for the agent, the endpoint returns defaults:
{
  "heartbeat": {
    "frequency": "3h",
    "enabled": true,
    "lastHeartbeat": null,
    "nextHeartbeat": null
  }
}
| Field | Type | Description |
| --- | --- | --- |
| `heartbeat.frequency` | string | Heartbeat interval (for example, `3h`, `30m`, `1d`) |
| `heartbeat.enabled` | boolean | Whether heartbeats are enabled |
| `heartbeat.lastHeartbeat` | string \| null | ISO 8601 timestamp of the last heartbeat, or `null` if never set |
| `heartbeat.nextHeartbeat` | string \| null | ISO 8601 timestamp of the next scheduled heartbeat, or `null` if never set |

Errors

| Code | Description |
| --- | --- |
| 400 | `agentId required` — the `agentId` query parameter is missing |
| 401 | Unauthorized |
| 500 | Failed to fetch heartbeat settings |

Update heartbeat settings

POST /api/heartbeat
Requires session authentication. Updates heartbeat settings for a specific agent. The agent must belong to the authenticated user.

Request body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `agentId` | string | Yes | The agent to update heartbeat settings for |
| `frequency` | string | No | Heartbeat interval (for example, `3h`, `30m`, `1d`). Defaults to `3h`. Supported units: `m` (minutes), `h` (hours), `d` (days). |
| `enabled` | boolean | No | Enable or disable heartbeats. Defaults to `true`. |
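The frequency format and the next-heartbeat calculation can be sketched as follows. This is an illustrative Python sketch of the documented behavior (interval = value plus a unit of `m`, `h`, or `d`; `nextHeartbeat` = current time plus the frequency), not the server's code.

```python
from datetime import datetime, timedelta, timezone

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}

def parse_frequency(freq: str) -> timedelta:
    """Parse an interval string like '3h', '30m', or '1d'."""
    value, unit = int(freq[:-1]), freq[-1]
    if unit not in _UNITS:
        raise ValueError(f"unsupported unit: {unit!r}")
    return timedelta(**{_UNITS[unit]: value})

def next_heartbeat(now: datetime, freq: str) -> datetime:
    """nextHeartbeat is the current time plus the configured frequency."""
    return now + parse_frequency(freq)
```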

Response

{
  "heartbeat": {
    "frequency": "3h",
    "enabled": true,
    "lastHeartbeat": "2026-03-19T00:00:00Z",
    "nextHeartbeat": "2026-03-19T03:00:00Z",
    "lastUpdated": "2026-03-19T00:00:00Z"
  }
}
| Field | Type | Description |
| --- | --- | --- |
| `heartbeat.frequency` | string | Configured heartbeat interval |
| `heartbeat.enabled` | boolean | Whether heartbeats are enabled |
| `heartbeat.lastHeartbeat` | string | ISO 8601 timestamp when the settings were saved |
| `heartbeat.nextHeartbeat` | string | ISO 8601 timestamp of the next scheduled heartbeat, calculated from the current time plus the frequency |
| `heartbeat.lastUpdated` | string | ISO 8601 timestamp of the last settings update |

Errors

| Code | Description |
| --- | --- |
| 400 | `agentId required` — the `agentId` field is missing from the request body |
| 401 | Unauthorized |
| 404 | Agent not found — the agent does not exist or is not owned by the authenticated user |
| 500 | Heartbeat update failed |

Delete heartbeat settings

DELETE /api/heartbeat
Requires session authentication. Resets heartbeat configuration for a specific agent by removing saved settings.

Request body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `agentId` | string | Yes | The agent to reset heartbeat settings for |

Response

{
  "success": true
}

Errors

| Code | Description |
| --- | --- |
| 400 | `agentId required` — the `agentId` field is missing from the request body |
| 401 | Unauthorized |
| 500 | Heartbeat reset failed |

Container health checks

Agent services run the official OpenClaw image, which exposes built-in health endpoints on port 18789. The backend uses these to determine service readiness during provisioning and ongoing monitoring.

Built-in health endpoints

The OpenClaw image (ghcr.io/openclaw/openclaw:2026.3.22) provides two health endpoints on each agent service:
| Endpoint | Purpose | Description |
| --- | --- | --- |
| `GET /healthz` | Liveness | Returns 200 when the gateway process is running. Used by the health check to detect crashed or hung services. |
| `GET /readyz` | Readiness | Returns 200 when the gateway is ready to accept requests. Use this to verify the service has completed startup before routing traffic. |
Both endpoints are unauthenticated and bind to the service’s internal port (18789).

/healthz response

{
  "ok": true,
  "status": "live"
}
| Field | Type | Description |
| --- | --- | --- |
| `ok` | boolean | `true` when the gateway process is running |
| `status` | string | Always `live` when the endpoint responds |

/readyz response

{
  "ready": true,
  "failing": [],
  "uptimeMs": 68163
}
| Field | Type | Description |
| --- | --- | --- |
| `ready` | boolean | `true` when the gateway is ready to accept requests |
| `failing` | array | List of failing readiness checks. Empty when all checks pass. |
| `uptimeMs` | number | Gateway uptime in milliseconds since startup |
The backend also probes /health on port 18789 for application-level health checks. The /healthz and /readyz endpoints are provided by the OpenClaw image itself and are available on all agent services.
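The liveness and readiness payloads answer different questions, which a client should keep separate. A minimal sketch of how a caller might interpret each response (function names are illustrative):

```python
def is_live(healthz: dict) -> bool:
    """/healthz: the gateway process is up at all."""
    return healthz.get("ok") is True

def is_ready(readyz: dict) -> bool:
    """/readyz: safe to route traffic only when the ready flag is
    set and no readiness checks are failing."""
    return readyz.get("ready") is True and not readyz.get("failing")
```

A service can be live but not yet ready during startup, so deployment tooling should gate traffic on `is_ready`, not `is_live`.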

Container health statuses

| Status | Condition |
| --- | --- |
| `healthy` | Service is running and the internal health endpoint responds successfully |
| `starting` | Service is running but the health endpoint is not yet responding after all retries |
| `stopped` | Service has exited |
| `suspended` | Service has been suspended (paused to save resources) |
| `unhealthy` | Service is in an unexpected state or cannot be inspected |

Health check behavior

  • The backend probes each agent’s /healthz endpoint to determine service health. The health check uses a 5-second timeout per request.
  • The waitForHealthy function polls service health every 2 seconds, with a default overall timeout of 60 seconds.
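The polling behavior of `waitForHealthy` can be sketched as a generic poll-until-healthy loop. This is an assumed reconstruction from the documented parameters (2-second interval, 60-second overall timeout), with the probe, sleep, and clock injected so the loop is testable; it is not the backend's actual code.

```python
import time

def wait_for_healthy(probe, timeout=60.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll probe() every `interval` seconds until it returns True
    or `timeout` seconds elapse. Returns True on success, False on
    overall timeout."""
    deadline = clock() + timeout
    while clock() < deadline:
        if probe():
            return True
        sleep(interval)
    return False
```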

Watchdog monitoring

The backend runs a per-agent watchdog that continuously monitors agent health, detects crash loops, and performs automatic recovery. The watchdog operates internally and does not expose dedicated API endpoints. Status information is surfaced through the existing agent status and lifecycle endpoints.

Health check cycle

The watchdog probes each agent’s gateway at GET /healthz on the agent’s internal port. Health checks run on a configurable interval (default: every 2 minutes). When the gateway reports unhealthy, the watchdog transitions the agent to a degraded state and increases the check frequency to every 5 seconds.
| Parameter | Default | Environment variable |
| --- | --- | --- |
| Health check interval | 120 seconds | `WATCHDOG_CHECK_INTERVAL` |
| Degraded check interval | 5 seconds | `WATCHDOG_DEGRADED_CHECK_INTERVAL` |
| Startup failure threshold | 3 consecutive failures | `WATCHDOG_STARTUP_FAILURE_THRESHOLD` |
| Max repair attempts | 2 | `WATCHDOG_MAX_REPAIR_ATTEMPTS` |
| Crash loop window | 5 minutes | `WATCHDOG_CRASH_LOOP_WINDOW` |
| Crash loop threshold | 3 crashes in window | `WATCHDOG_CRASH_LOOP_THRESHOLD` |
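Loading this configuration might look like the sketch below. It assumes each variable holds a plain integer and that the interval and window variables are expressed in seconds; the source does not specify the units the variables accept, so treat both the function name and the unit convention as illustrative.

```python
import os

def watchdog_config(env=os.environ):
    """Read watchdog tuning from the environment, falling back to
    the documented defaults (intervals and window in seconds)."""
    def get_int(name, default):
        return int(env.get(name, default))
    return {
        "check_interval": get_int("WATCHDOG_CHECK_INTERVAL", 120),
        "degraded_check_interval": get_int("WATCHDOG_DEGRADED_CHECK_INTERVAL", 5),
        "startup_failure_threshold": get_int("WATCHDOG_STARTUP_FAILURE_THRESHOLD", 3),
        "max_repair_attempts": get_int("WATCHDOG_MAX_REPAIR_ATTEMPTS", 2),
        "crash_loop_window": get_int("WATCHDOG_CRASH_LOOP_WINDOW", 300),
        "crash_loop_threshold": get_int("WATCHDOG_CRASH_LOOP_THRESHOLD", 3),
    }
```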

Lifecycle states

The watchdog tracks the following lifecycle states for each agent:
| State | Description |
| --- | --- |
| `stopped` | Agent is not running |
| `starting` | Agent service has started; waiting for the first successful health check |
| `running` | Agent is healthy and serving requests |
| `degraded` | Health checks are failing after a previous healthy state |
| `crash_loop` | Multiple crashes detected within the crash loop window |
| `repairing` | Auto-repair is in progress |

Auto-repair

When the watchdog detects an unhealthy agent, it can automatically attempt recovery. Auto-repair is enabled by default and can be disabled by setting the WATCHDOG_AUTO_REPAIR environment variable to false. The repair sequence is:
  1. Kill the agent gateway process
  2. Wait 5 seconds
  3. Restart the gateway
  4. Wait 30 seconds (startup grace period)
  5. Verify health
If the repair fails, the watchdog retries up to the configured maximum (default: 2 attempts). After exhausting all repair attempts, the agent transitions to the crash_loop state.
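The five-step sequence with retries can be sketched as a loop. The gateway operations are injected as callables here so the control flow is testable; the function name and signature are illustrative, not the watchdog's real interface.

```python
def attempt_repair(kill, restart, check_health, sleep,
                   max_attempts=2, restart_delay=5, grace_period=30):
    """Run the documented repair sequence up to max_attempts times:
    kill -> wait 5s -> restart -> 30s startup grace -> verify.
    Returns 'running' on success, 'crash_loop' once all attempts
    are exhausted."""
    for _ in range(max_attempts):
        kill()                # 1. Kill the agent gateway process
        sleep(restart_delay)  # 2. Wait before restarting
        restart()             # 3. Restart the gateway
        sleep(grace_period)   # 4. Startup grace period
        if check_health():    # 5. Verify health
            return "running"
    return "crash_loop"
```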

Crash loop detection

The watchdog tracks crash timestamps within a sliding window (default: 5 minutes). When the number of crashes in the window reaches the threshold (default: 3), the agent enters the crash_loop state. This prevents infinite restart loops for agents with persistent failures.
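The sliding-window logic can be sketched as follows, using the documented defaults (300-second window, threshold of 3). The class name is illustrative; timestamps are plain seconds to keep the sketch self-contained.

```python
from collections import deque

class CrashLoopDetector:
    """Track crash timestamps in a sliding window and report a
    crash loop once the count in the window reaches the threshold."""

    def __init__(self, window=300.0, threshold=3):
        self.window = window
        self.threshold = threshold
        self.crashes = deque()

    def record_crash(self, now: float) -> bool:
        """Record a crash at time `now` (seconds); return True when
        the agent should enter the crash_loop state."""
        self.crashes.append(now)
        # Drop crashes that have aged out of the window.
        while self.crashes and now - self.crashes[0] > self.window:
            self.crashes.popleft()
        return len(self.crashes) >= self.threshold
```

Because old crashes age out of the window, an agent that crashes only occasionally never trips the detector, while three crashes within five minutes do.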

Notifications

The watchdog sends notifications for critical events (degraded, crash loop, repair attempts) through configured channels:
  • Telegram — when TELEGRAM_BOT_TOKEN and TELEGRAM_ADMIN_CHAT_ID are set
  • Discord — when DISCORD_WEBHOOK_URL is set