docs(06): add phase verification report — gaps found

SC-3 (failed-job CLI surface) and SC-4 (web binary enqueue) not yet implemented.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Arthur Belleville 2026-05-15 16:49:20 +02:00
parent 5e8dd4e4e4
commit 7a54755618
No known key found for this signature in database

View file

@ -0,0 +1,141 @@
---
phase: 06-background-worker
verified: 2026-05-15T17:00:00Z
status: gaps_found
score: 3/5 roadmap success criteria verified
overrides_applied: 0
gaps:
- truth: "A failing job is retried with backoff and visible via a simple CLI surface (backend list-failed-jobs or admin route)"
status: failed
reason: "No list-failed-jobs CLI command exists. No admin route exists. Failed jobs log via SlogErrorHandler and persist in the river_job table, but the ROADMAP SC-3 explicitly names a CLI surface or admin route — neither is implemented. README documents the log pattern and river_job table as the visibility mechanism, which is not the same as the CLI/admin surface the success criterion requires."
artifacts:
- path: "backend/cmd/worker/main.go"
issue: "SlogErrorHandler is registered and logs job errors correctly, but there is no list-failed-jobs subcommand"
- path: "backend/README.md"
issue: "Documents log output and river_job table as the observation mechanism — not a CLI surface"
missing:
- "Either a `backend list-failed-jobs` CLI subcommand (e.g. go run ./cmd/worker list-failed-jobs) or an admin HTTP route that queries river_job WHERE state = 'discarded'"
- truth: "Web binary can enqueue a job; worker picks it up within a few seconds"
status: failed
reason: "The web binary (cmd/web) has zero river dependency or integration. No HTTP handler, hook, or internal call in cmd/web or internal/web enqueues a job. The worker runs periodic jobs only. SC-4 requires the web binary to enqueue and the worker to pick it up — this wiring does not exist."
artifacts:
- path: "backend/cmd/web/main.go"
issue: "No river client, no InsertTx, no job insertion"
- path: "backend/internal/web/"
issue: "grep for 'river' returns zero results across all internal/web files"
missing:
- "A river.Client (or river.Client insert-only usage) wired into cmd/web or a handler that enqueues at least one job type"
- "At minimum: a river.Client constructed from the same pool in cmd/web, with one handler calling client.Insert()"
human_verification:
- test: "Run just worker and observe two heartbeat log lines ~60 seconds apart"
expected: "worker heartbeat log appears within seconds, second heartbeat ~60s later, graceful shutdown on Ctrl+C produces shutting down then shutdown complete"
why_human: "Scheduler live-run behavior cannot be verified statically; Plan 03 checkpoint claim covers this but verifier cannot re-run the live environment"
---
# Phase 6: Background Worker Verification Report
**Phase Goal:** Implement a background worker binary using River for job queuing — heartbeat job, orphan file cleanup job, error handler, graceful shutdown, and developer-runnable via `just worker`.
**Verified:** 2026-05-15T17:00:00Z
**Status:** gaps_found
**Re-verification:** No — initial verification
## Goal Achievement
### Observable Truths (Roadmap Success Criteria)
| # | Truth | Status | Evidence |
|---|-------|--------|----------|
| SC-1 | `cmd/worker` starts, connects to Postgres, registers handlers; structured logging and graceful shutdown work | VERIFIED | `cmd/worker/main.go` implements full startup sequence: db.NewPool, rivermigrate, signal.NotifyContext, river.Client with slog.Default(), StopAndCancel(10s); `go build ./...` exits 0; live verification in Plan 03 checkpoint |
| SC-2 | At least one real job runs on a schedule and is observable in logs | VERIFIED | HeartbeatWorker (PeriodicInterval 1min, RunOnStart:true) and OrphanCleanupWorker (1hr) registered; heartbeat logs `"worker heartbeat"` with job_id and attempt; Plan 03 checkpoint confirms two heartbeat log lines ~60s apart in live environment |
| SC-3 | A failing job is retried with backoff and visible via a simple CLI surface (`backend list-failed-jobs` or admin route) | FAILED | SlogErrorHandler logs errors correctly. No `list-failed-jobs` CLI command exists. No admin route exists. README documents the `river_job` table and log output as the observation mechanism, but the ROADMAP SC explicitly requires a CLI surface or admin route. |
| SC-4 | Web binary can enqueue a job; worker picks it up within a few seconds | FAILED | `cmd/web` has zero river dependency. `grep "river" backend/cmd/web/main.go` returns nothing. `grep -r "river" backend/internal/web/` returns nothing. Web-to-worker job enqueue wiring is entirely absent. |
| SC-5 | README documents how to run the worker locally alongside the web binary | VERIFIED | README has "Running the Worker" section with `just worker` command, expected log sequence, single-worker constraint (D-08), and failed-job retry observation instructions |
**Score:** 3/5 roadmap success criteria verified
### Required Artifacts
| Artifact | Expected | Status | Details |
|----------|----------|--------|---------|
| `backend/go.mod` | river v0.37.0 + riverpgxv5 | VERIFIED | Lines 42-46: `github.com/riverqueue/river v0.37.0 // indirect`, `riverpgxv5 v0.37.0 // indirect` present. Marked `// indirect` because no main package directly imports them yet — river deps show as indirect until go.sum is recomputed with direct imports from cmd/worker. Build succeeds regardless. |
| `backend/internal/jobs/heartbeat.go` | HeartbeatArgs + HeartbeatWorker | VERIFIED | HeartbeatArgs with Kind()="heartbeat"; HeartbeatWorker embeds river.WorkerDefaults[HeartbeatArgs]; Work() logs slog.Info("worker heartbeat", "job_id", "attempt") and returns nil |
| `backend/internal/jobs/orphan_cleanup.go` | OrphanCleanupArgs + OrphanCleanupWorker + constructor | VERIFIED | OrphanCleanupArgs Kind()="orphan_file_cleanup"; OrphanCleanupWorker with pool, store, fileQuerier; NewOrphanCleanupWorker wires sqlc.New(pool) as querier; Work() does S3-delete-before-DB-delete loop with error counting |
| `backend/internal/jobs/error_handler.go` | SlogErrorHandler implementing river.ErrorHandler | VERIFIED | SlogErrorHandler with HandleError (logs job_id, job_kind, attempt, max_attempts, err) and HandlePanic (logs job_id, job_kind, panic, trace); both return nil |
| `backend/internal/db/sqlc/files.sql.go` | ListOrphanFiles + ListOrphanFilesRow{ID, TabloID, S3Key} | VERIFIED | Generated: `type ListOrphanFilesRow struct { ID uuid.UUID; TabloID uuid.UUID; S3Key string }` and `func (q *Queries) ListOrphanFiles(ctx context.Context) ([]ListOrphanFilesRow, error)` at lines 132-140 |
| `backend/cmd/worker/main.go` | Full river wiring — rivermigrate + periodic jobs + graceful shutdown | VERIFIED | All required elements present: rivermigrate.New at startup, signal.NotifyContext AFTER startup I/O, river.AddWorker for both workers, river.NewClient with slog.Default() Logger, two PeriodicJobs (heartbeat RunOnStart:true, orphan cleanup no RunOnStart), StopAndCancel(10s), explicit pool.Close() |
| `backend/justfile` | worker: target with MinIO dev defaults | VERIFIED | `worker: db-up` target at line 116 with DATABASE_URL, S3_ENDPOINT, S3_BUCKET, S3_REGION, S3_ACCESS_KEY, S3_SECRET_KEY, S3_USE_PATH_STYLE env vars on separate continuation lines; runs `go run ./cmd/worker` |
| `backend/README.md` | Worker local dev instructions | VERIFIED | "Running the Worker" section with `just worker`, single-worker constraint, expected log sequence, failed job observation |
### Key Link Verification
| From | To | Via | Status | Details |
|------|----|-----|--------|---------|
| `cmd/worker/main.go` | `internal/jobs/` | `jobs.HeartbeatWorker`, `jobs.NewOrphanCleanupWorker`, `jobs.SlogErrorHandler` | WIRED | All three imported and used at lines 93-94, 101 |
| `cmd/worker/main.go` | `rivermigrate` | `rivermigrate.New(riverpgxv5.New(pool), nil).Migrate(ctx, DirectionUp, nil)` | WIRED | Lines 56-68; result ranges over res.Versions with slog.Info |
| `cmd/worker/main.go` | `river.Client` | `river.NewClient(riverpgxv5.New(pool), &river.Config{...})` | WIRED | Line 97-116; all config fields present |
| `orphan_cleanup.go` | `backend/internal/db/sqlc/files.sql.go` | `q.ListOrphanFiles(ctx)` via fileQuerier interface | WIRED | Interface defined at lines 22-25; production wiring via sqlc.New(pool) in constructor |
| `orphan_cleanup.go` | `backend/internal/files/store.go` | `w.store.Delete(ctx, f.S3Key)` | WIRED | Line 63; FileStorer.Delete called before DB delete |
### Data-Flow Trace (Level 4)
| Artifact | Data Variable | Source | Produces Real Data | Status |
|----------|---------------|--------|--------------------|--------|
| `orphan_cleanup.go` Work() | `orphans []ListOrphanFilesRow` | `q.ListOrphanFiles(ctx)` via sqlc query against `tablo_files WHERE NOT EXISTS (SELECT 1 FROM tablos WHERE id = tf.tablo_id)` | Yes — live DB query with NOT EXISTS subquery | FLOWING |
| `heartbeat.go` Work() | `job.ID`, `job.Attempt` | `*river.Job[HeartbeatArgs]` injected by river runtime | Yes — populated by river from `river_job` table row | FLOWING |
### Behavioral Spot-Checks
| Behavior | Command | Result | Status |
|----------|---------|--------|--------|
| All jobs unit tests pass | `cd backend && go test ./internal/jobs/... -v -count=1` | 7/7 PASS (TestSlogErrorHandler_HandleError, TestSlogErrorHandler_HandlePanic, TestHeartbeatWorker, TestHeartbeatArgs_Kind, TestOrphanCleanupWorker_NoOrphans, TestOrphanCleanupWorker_DeletesOrphan, TestOrphanCleanupArgs_Kind) | PASS |
| Full module builds | `cd backend && go build ./...` | exit 0 | PASS |
| worker target exists in justfile | `grep "^worker:" backend/justfile` | line 116: `worker: db-up` | PASS |
| rivermigrate.New present | `grep "rivermigrate.New" backend/cmd/worker/main.go` | line 56 | PASS |
| Web binary has no river wiring | `grep -r "river" backend/cmd/web/ backend/internal/web/` | 0 results | FAIL (SC-4 gap confirmed) |
| list-failed-jobs CLI absent | `grep -r "list.failed\|list-failed" backend/` | 0 results | FAIL (SC-3 gap confirmed) |
### Probe Execution
Step 7c: SKIPPED — no `scripts/*/tests/probe-*.sh` files exist for this phase. The phase declared Plan 03 as a human-verify checkpoint (autonomous: false).
### Requirements Coverage
| Requirement | Source Plan | Description | Status | Evidence |
|-------------|------------|-------------|--------|---------|
| WORK-01 | 06-01, 06-02 | `cmd/worker` binary connects to Postgres and runs a job queue | PARTIALLY SATISFIED | Binary connects to Postgres and runs river periodic jobs. However SC-4 (web binary enqueues a job) is part of the WORK-01 observable proof and is absent. |
| WORK-02 | 06-01, 06-02 | At least one real job runs end-to-end | SATISFIED | HeartbeatWorker and OrphanCleanupWorker both run on schedules; live verification confirmed in Plan 03 checkpoint |
| WORK-03 | 06-01, 06-02 | Worker has structured logging and graceful shutdown matching web binary | SATISFIED | slog.Default() used as Logger; StopAndCancel(10s) + explicit pool.Close(); signal.NotifyContext for SIGINT/SIGTERM |
| WORK-04 | 06-01 | Failed jobs retried with backoff and visible in a simple admin/CLI surface | PARTIALLY SATISFIED | River's built-in retry with exponential backoff (attempts^4) is active. SlogErrorHandler logs each failure. The `river_job` table persists discarded jobs. No CLI command or admin route provides the "visible in a simple admin/CLI surface" guarantee the requirement names. |
**Note on traceability discrepancy:** REQUIREMENTS.md line 123 still shows `WORK-01..04 | Phase 6 | Pending` despite all four requirements being marked `[x]` in the requirements list (lines 58-61). The traceability table was not updated after completion.
### Anti-Patterns Found
| File | Line | Pattern | Severity | Impact |
|------|------|---------|----------|--------|
| `backend/go.mod` | 42-46 | river deps marked `// indirect` | Info | Build succeeds; indirect marks will self-correct when `go mod tidy` is run after cmd/worker imports are confirmed. Not a blocker. |
No TBD, FIXME, XXX, placeholder, or empty-return stubs found in any phase-modified files.
### Human Verification Required
#### 1. Live Worker Run
**Test:** Run `cd backend && just worker` against local Postgres + MinIO
**Expected:** `"worker ready"` within 5s, `"worker heartbeat"` within 10s (RunOnStart:true), second `"worker heartbeat"` ~60s later, Ctrl+C produces `"shutting down"` then `"shutdown complete"` with clean exit
**Why human:** Scheduler live-run behavior, actual Postgres + MinIO connectivity, and signal handling require a running process
### Gaps Summary
Two roadmap success criteria are not met:
**SC-3 (WORK-04 admin/CLI surface):** The ROADMAP explicitly names "a simple CLI surface (`backend list-failed-jobs` or admin route)". The implementation provides SlogErrorHandler that logs errors to structured slog output and river retries automatically, but there is no `list-failed-jobs` subcommand in the worker binary and no admin HTTP route. Visibility of failed jobs requires querying the `river_job` table directly or reading logs — not a CLI surface. This gap is well-understood: SlogErrorHandler provides observability, but the SC contract asks for an addressable surface (CLI command or HTTP route).
**SC-4 (web binary enqueues a job):** The ROADMAP requires "Web binary can enqueue a job; worker picks it up within a few seconds." The web binary (`cmd/web`) has zero river dependency. This is a material gap — it means the two-binary architecture's job-dispatch path is unproven. All phase work focused on the worker side; the web-side enqueue path was never wired.
These two gaps share a root cause: the plans focused on WORK-01 through WORK-03 (worker binary, periodic jobs, graceful shutdown) and treated WORK-04 as satisfied by logging alone, while SC-3 and SC-4 both require active wiring that extends beyond the worker binary itself.
---
_Verified: 2026-05-15T17:00:00Z_
_Verifier: Claude (gsd-verifier)_