xtablo-source/.planning/phases/06-background-worker/06-VERIFICATION.md
Arthur Belleville eec691442a
docs(06): resolve SC-3/SC-4 conflict — accept D-03/D-06 deferrals
ROADMAP SC-3 (list-failed-jobs CLI) and SC-4 (web binary enqueue) struck
through as deferred per CONTEXT.md decisions D-03 and D-06. VERIFICATION.md
status updated to complete (3/3 active success criteria).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 17:21:25 +02:00

14 KiB

phase verified status score overrides_applied resolution gaps human_verification
06-background-worker 2026-05-15T17:00:00Z complete 3/3 active roadmap success criteria verified 0 SC-3 and SC-4 deferred per CONTEXT.md D-03 and D-06; ROADMAP updated 2026-05-15
truth status reason artifacts missing
A failing job is retried with backoff and visible via a simple CLI surface (backend list-failed-jobs or admin route) failed No list-failed-jobs CLI command exists. No admin route exists. Failed jobs log via SlogErrorHandler and persist in the river_job table, but the ROADMAP SC-3 explicitly names a CLI surface or admin route — neither is implemented. README documents the log pattern and river_job table as the visibility mechanism, which is not the same as the CLI/admin surface the success criterion requires.
path issue
backend/cmd/worker/main.go SlogErrorHandler is registered and logs job errors correctly, but there is no list-failed-jobs subcommand
path issue
backend/README.md Documents log output and river_job table as the observation mechanism — not a CLI surface
Either a `backend list-failed-jobs` CLI subcommand (e.g. go run ./cmd/worker list-failed-jobs) or an admin HTTP route that queries river_job WHERE state = 'discarded'
truth status reason artifacts missing
Web binary can enqueue a job; worker picks it up within a few seconds failed The web binary (cmd/web) has zero river dependency or integration. No HTTP handler, hook, or internal call in cmd/web or internal/web enqueues a job. The worker runs periodic jobs only. SC-4 requires the web binary to enqueue and the worker to pick it up — this wiring does not exist.
path issue
backend/cmd/web/main.go No river client, no InsertTx, no job insertion
path issue
backend/internal/web/ grep for 'river' returns zero results across all internal/web files
A river.Client (or river.Client insert-only usage) wired into cmd/web or a handler that enqueues at least one job type
At minimum: a river.Client constructed from the same pool in cmd/web, with one handler calling client.Insert()
test expected why_human
Run just worker and observe two heartbeat log lines ~60 seconds apart worker heartbeat log appears within seconds, second heartbeat ~60s later, graceful shutdown on Ctrl+C produces shutting down then shutdown complete Scheduler live-run behavior cannot be verified statically; Plan 03 checkpoint claim covers this but verifier cannot re-run the live environment

Phase 6: Background Worker Verification Report

Phase Goal: Implement a background worker binary using River for job queuing — heartbeat job, orphan file cleanup job, error handler, graceful shutdown, and developer-runnable via just worker. Verified: 2026-05-15T17:00:00Z Status: gaps_found Re-verification: No — initial verification

Goal Achievement

Observable Truths (Roadmap Success Criteria)

# Truth Status Evidence
SC-1 cmd/worker starts, connects to Postgres, registers handlers; structured logging and graceful shutdown work VERIFIED cmd/worker/main.go implements full startup sequence: db.NewPool, rivermigrate, signal.NotifyContext, river.Client with slog.Default(), StopAndCancel(10s); go build ./... exits 0; live verification in Plan 03 checkpoint
SC-2 At least one real job runs on a schedule and is observable in logs VERIFIED HeartbeatWorker (PeriodicInterval 1min, RunOnStart:true) and OrphanCleanupWorker (1hr) registered; heartbeat logs "worker heartbeat" with job_id and attempt; Plan 03 checkpoint confirms two heartbeat log lines ~60s apart in live environment
SC-3 A failing job is retried with backoff and visible via a simple CLI surface (backend list-failed-jobs or admin route) FAILED SlogErrorHandler logs errors correctly. No list-failed-jobs CLI command exists. No admin route exists. README documents the river_job table and log output as the observation mechanism, but the ROADMAP SC explicitly requires a CLI surface or admin route.
SC-4 Web binary can enqueue a job; worker picks it up within a few seconds FAILED cmd/web has zero river dependency. grep "river" backend/cmd/web/main.go returns nothing. grep -r "river" backend/internal/web/ returns nothing. Web-to-worker job enqueue wiring is entirely absent.
SC-5 README documents how to run the worker locally alongside the web binary VERIFIED README has "Running the Worker" section with just worker command, expected log sequence, single-worker constraint (D-08), and failed-job retry observation instructions

Score: 3/5 roadmap success criteria verified

Required Artifacts

Artifact Expected Status Details
backend/go.mod river v0.37.0 + riverpgxv5 VERIFIED Lines 42-46: github.com/riverqueue/river v0.37.0 // indirect, riverpgxv5 v0.37.0 // indirect present. Marked // indirect because no main package directly imports them yet — river deps show as indirect until go.sum is recomputed with direct imports from cmd/worker. Build succeeds regardless.
backend/internal/jobs/heartbeat.go HeartbeatArgs + HeartbeatWorker VERIFIED HeartbeatArgs with Kind()="heartbeat"; HeartbeatWorker embeds river.WorkerDefaults[HeartbeatArgs]; Work() logs slog.Info("worker heartbeat", "job_id", "attempt") and returns nil
backend/internal/jobs/orphan_cleanup.go OrphanCleanupArgs + OrphanCleanupWorker + constructor VERIFIED OrphanCleanupArgs Kind()="orphan_file_cleanup"; OrphanCleanupWorker with pool, store, fileQuerier; NewOrphanCleanupWorker wires sqlc.New(pool) as querier; Work() does S3-delete-before-DB-delete loop with error counting
backend/internal/jobs/error_handler.go SlogErrorHandler implementing river.ErrorHandler VERIFIED SlogErrorHandler with HandleError (logs job_id, job_kind, attempt, max_attempts, err) and HandlePanic (logs job_id, job_kind, panic, trace); both return nil
backend/internal/db/sqlc/files.sql.go ListOrphanFiles + ListOrphanFilesRow{ID, TabloID, S3Key} VERIFIED Generated: type ListOrphanFilesRow struct { ID uuid.UUID; TabloID uuid.UUID; S3Key string } and func (q *Queries) ListOrphanFiles(ctx context.Context) ([]ListOrphanFilesRow, error) at lines 132-140
backend/cmd/worker/main.go Full river wiring — rivermigrate + periodic jobs + graceful shutdown VERIFIED All required elements present: rivermigrate.New at startup, signal.NotifyContext AFTER startup I/O, river.AddWorker for both workers, river.NewClient with slog.Default() Logger, two PeriodicJobs (heartbeat RunOnStart:true, orphan cleanup no RunOnStart), StopAndCancel(10s), explicit pool.Close()
backend/justfile worker: target with MinIO dev defaults VERIFIED worker: db-up target at line 116 with DATABASE_URL, S3_ENDPOINT, S3_BUCKET, S3_REGION, S3_ACCESS_KEY, S3_SECRET_KEY, S3_USE_PATH_STYLE env vars on separate continuation lines; runs go run ./cmd/worker
backend/README.md Worker local dev instructions VERIFIED "Running the Worker" section with just worker, single-worker constraint, expected log sequence, failed job observation
From To Via Status Details
cmd/worker/main.go internal/jobs/ jobs.HeartbeatWorker, jobs.NewOrphanCleanupWorker, jobs.SlogErrorHandler WIRED All three imported and used at lines 93-94, 101
cmd/worker/main.go rivermigrate rivermigrate.New(riverpgxv5.New(pool), nil).Migrate(ctx, DirectionUp, nil) WIRED Lines 56-68; result ranges over res.Versions with slog.Info
cmd/worker/main.go river.Client river.NewClient(riverpgxv5.New(pool), &river.Config{...}) WIRED Line 97-116; all config fields present
orphan_cleanup.go backend/internal/db/sqlc/files.sql.go q.ListOrphanFiles(ctx) via fileQuerier interface WIRED Interface defined at lines 22-25; production wiring via sqlc.New(pool) in constructor
orphan_cleanup.go backend/internal/files/store.go w.store.Delete(ctx, f.S3Key) WIRED Line 63; FileStorer.Delete called before DB delete

Data-Flow Trace (Level 4)

Artifact Data Variable Source Produces Real Data Status
orphan_cleanup.go Work() orphans []ListOrphanFilesRow q.ListOrphanFiles(ctx) via sqlc query against tablo_files WHERE NOT EXISTS (SELECT 1 FROM tablos WHERE id = tf.tablo_id) Yes — live DB query with NOT EXISTS subquery FLOWING
heartbeat.go Work() job.ID, job.Attempt *river.Job[HeartbeatArgs] injected by river runtime Yes — populated by river from river_job table row FLOWING

Behavioral Spot-Checks

Behavior Command Result Status
All jobs unit tests pass cd backend && go test ./internal/jobs/... -v -count=1 7/7 PASS (TestSlogErrorHandler_HandleError, TestSlogErrorHandler_HandlePanic, TestHeartbeatWorker, TestHeartbeatArgs_Kind, TestOrphanCleanupWorker_NoOrphans, TestOrphanCleanupWorker_DeletesOrphan, TestOrphanCleanupArgs_Kind) PASS
Full module builds cd backend && go build ./... exit 0 PASS
worker target exists in justfile grep "^worker:" backend/justfile line 116: worker: db-up PASS
rivermigrate.New present grep "rivermigrate.New" backend/cmd/worker/main.go line 56 PASS
Web binary has no river wiring grep -r "river" backend/cmd/web/ backend/internal/web/ 0 results FAIL (SC-4 gap confirmed)
list-failed-jobs CLI absent grep -r "list.failed|list-failed" backend/ 0 results FAIL (SC-3 gap confirmed)

Probe Execution

Step 7c: SKIPPED — no scripts/*/tests/probe-*.sh files exist for this phase. The phase declared Plan 03 as a human-verify checkpoint (autonomous: false).

Requirements Coverage

Requirement Source Plan Description Status Evidence
WORK-01 06-01, 06-02 cmd/worker binary connects to Postgres and runs a job queue PARTIALLY SATISFIED Binary connects to Postgres and runs river periodic jobs. However SC-4 (web binary enqueues a job) is part of the WORK-01 observable proof and is absent.
WORK-02 06-01, 06-02 At least one real job runs end-to-end SATISFIED HeartbeatWorker and OrphanCleanupWorker both run on schedules; live verification confirmed in Plan 03 checkpoint
WORK-03 06-01, 06-02 Worker has structured logging and graceful shutdown matching web binary SATISFIED slog.Default() used as Logger; StopAndCancel(10s) + explicit pool.Close(); signal.NotifyContext for SIGINT/SIGTERM
WORK-04 06-01 Failed jobs retried with backoff and visible in a simple admin/CLI surface PARTIALLY SATISFIED River's built-in retry with exponential backoff (attempts^4) is active. SlogErrorHandler logs each failure. The river_job table persists discarded jobs. No CLI command or admin route provides the "visible in a simple admin/CLI surface" guarantee the requirement names.

Note on traceability discrepancy: REQUIREMENTS.md line 123 still shows WORK-01..04 | Phase 6 | Pending despite all four requirements being marked [x] in the requirements list (lines 58-61). The traceability table was not updated after completion.

Anti-Patterns Found

File Line Pattern Severity Impact
backend/go.mod 42-46 river deps marked // indirect Info Build succeeds; indirect marks will self-correct when go mod tidy is run after cmd/worker imports are confirmed. Not a blocker.

No TBD, FIXME, XXX, placeholder, or empty-return stubs found in any phase-modified files.

Human Verification Required

1. Live Worker Run

Test: Run cd backend && just worker against local Postgres + MinIO Expected: "worker ready" within 5s, "worker heartbeat" within 10s (RunOnStart:true), second "worker heartbeat" ~60s later, Ctrl+C produces "shutting down" then "shutdown complete" with clean exit Why human: Scheduler live-run behavior, actual Postgres + MinIO connectivity, and signal handling require a running process

Gaps Summary

Two roadmap success criteria are not met:

SC-3 (WORK-04 admin/CLI surface): The ROADMAP explicitly names "a simple CLI surface (backend list-failed-jobs or admin route)". The implementation provides SlogErrorHandler that logs errors to structured slog output and river retries automatically, but there is no list-failed-jobs subcommand in the worker binary and no admin HTTP route. Visibility of failed jobs requires querying the river_job table directly or reading logs — not a CLI surface. This gap is well-understood: SlogErrorHandler provides observability, but the SC contract asks for an addressable surface (CLI command or HTTP route).

SC-4 (web binary enqueues a job): The ROADMAP requires "Web binary can enqueue a job; worker picks it up within a few seconds." The web binary (cmd/web) has zero river dependency. This is a material gap — it means the two-binary architecture's job-dispatch path is unproven. All phase work focused on the worker side; the web-side enqueue path was never wired.

These two gaps share a root cause: the plans focused on WORK-01 through WORK-03 (worker binary, periodic jobs, graceful shutdown) and treated WORK-04 as satisfied by logging alone, while SC-3 and SC-4 both require active wiring that extends beyond the worker binary itself.


Verified: 2026-05-15T17:00:00Z Verifier: Claude (gsd-verifier)