Crash recovery¶
Anvil 1.0.0 is crash-safe by design: an API restart mid-benchmark cannot silently lose work or record a broken run as success.
What happens on restart¶
The FastAPI lifespan runs reconcile_on_startup() before the queue
worker takes its first job. This function:
-
Finds every run with
status IN (preflight, running)and marks themfailedwith:API restarted while this run was in progress; partial state is unrecoverable. Re-queue the run to try again.
-
Finds every run with
status = queuedand re-pushes them into the in-memoryasyncio.Queue. This catches runs that were committed to Postgres but never reachedqueue.put()because the API process died in between. -
Logs
anvil_reconciledwithrequeued_countandrequeued_idsso operators can see what happened.
Reconciliation is idempotent — running it twice produces the same result. Failures during reconciliation are logged and do not block startup.
Runner disconnect handling¶
Anvil's runner RPC enforces an explicit terminal-event contract. Every successful run ends with exactly one of:
run_completerun_failedrun_aborted
If the runner's unix-domain socket closes (EOF) or the backend's
read times out (3600 s default) before one of those events
arrives, RunnerClient.run_benchmark() raises
RunnerStreamTruncated. The orchestrator's worker catches the
exception and marks the run failed with a clear message.
This is the guard against a truncated fio that would otherwise fall
through to the smart_after / status=complete path and quietly
publish broken numbers.
Postgres outage mid-run¶
If the database becomes unreachable while a run is in progress:
- The in-flight fio keeps running inside the runner container
- Attempts to persist
phase_sample/smart_sampleevents fail with a logged error - If the error propagates out of
_execute_run, the worker's except-handler calls_safe_mark_failed(), which also handles DB errors and does not re-raise. The worker loop survives.
When the DB comes back, you'll have a failed run plus whatever
metrics landed before the outage. Re-queue to re-run.
Bootstrap admin¶
On every boot, Anvil ensures at least one active admin exists:
- If any
role=admin, is_active=trueuser is in the table → no-op - Else if a user named
adminexists (even if disabled or in a different role) → promote that row - Else → create a fresh
adminrow
The password is always the first 16 characters of
ANVIL_BEARER_TOKEN. Rotate it immediately after first login.
Force-repair paths¶
There's no CLI for manual repair yet; everything goes through the
API. If you need to force a stuck run into failed:
```bash
Manually via psql (Postgres admin only)¶
UPDATE runs SET status='failed', error_message='force-reset by operator', finished_at=NOW() WHERE id = '01K...' AND status IN ('preflight', 'running'); ```
After the update, call POST /api/runs/{id}/abort to broadcast a
run_aborted WebSocket event to any watching UI.