From 9d1e24fc7e6e5e3a5e88d6a9d207275151234ceb Mon Sep 17 00:00:00 2001
From: Arthur Belleville <arthur.belleville@datadoghq.com>
Date: Fri, 15 May 2026 15:13:06 +0200
Subject: [PATCH] docs(06): capture phase context

---
 .../phases/06-background-worker/06-CONTEXT.md | 113 +++++++++++++++++
 .../06-background-worker/06-DISCUSSION-LOG.md | 116 ++++++++++++++++++
 2 files changed, 229 insertions(+)
 create mode 100644 .planning/phases/06-background-worker/06-CONTEXT.md
 create mode 100644 .planning/phases/06-background-worker/06-DISCUSSION-LOG.md
diff --git a/.planning/phases/06-background-worker/06-CONTEXT.md b/.planning/phases/06-background-worker/06-CONTEXT.md
new file mode 100644
index 0000000..bce1a8a
--- /dev/null
+++ b/.planning/phases/06-background-worker/06-CONTEXT.md
@@ -0,0 +1,113 @@
+# Phase 6: Background Worker - Context
+
+**Gathered:** 2026-05-15
+**Status:** Ready for planning
+
+<domain>
+## Phase Boundary
+
+A `cmd/worker` binary runs alongside `cmd/web` against the same Postgres instance, processes periodic jobs via a river-backed queue, and proves the setup end-to-end with two real jobs: a heartbeat and an orphan-file cleanup. The worker skeleton already exists (`backend/cmd/worker/main.go`); this phase replaces its noop body with the real river runtime.
+
+Delivers WORK-01, WORK-02, WORK-03, WORK-04. **Not in scope:** web-side job enqueueing (event-triggered jobs from HTTP handlers), multi-worker leader election, per-job HTTP admin UI, CLI subcommands for job management, Redis or any non-Postgres queue backend.
+
+</domain>
+
+<decisions>
+## Implementation Decisions
+
+### Queue Library
+- **D-01:** Use **river** (github.com/riverqueue/river) as the job queue. Postgres-native, adds one Go dependency, provides built-in retry/backoff and advisory locking. Matches the single-VPS / Postgres-only constraint.
+- **D-02:** River's schema is managed via **`rivermigrate`** run programmatically at worker startup (before the client starts listening). No goose migration needed for river's internal tables — river owns its own migration path.
+- **D-03:** Phase 6 uses **periodic jobs only** (river's `PeriodicJob`). Web-side enqueueing from HTTP handlers is deferred to a later phase when a real trigger exists (e.g. post-upload processing).
+
+### Proof-of-Life Jobs
+- **D-04:** Two jobs ship in Phase 6:
+  1. **Heartbeat** — logs a structured "worker heartbeat" line every **1 minute**. Proves the scheduler is running; observable in logs during development.
+  2. **Orphan-file cleanup** — runs every **1 hour**. Finds `tablo_files` rows whose owning tablo no longer exists (hard-deleted) and deletes both the DB row and the corresponding S3 object. Uses the same DB pool and S3 client established in Phase 5.
+- **D-05:** Both jobs are registered as `river.PeriodicJob` entries in the river client constructor at worker startup.
+
+### Failed Job Visibility
+- **D-06:** Failed jobs are surfaced via **structured logs only**. River emits a log event on each failure and on discard (max retries exceeded). Log fields include: job ID, job type, error message, attempt count, next retry time. WORK-04's "visible via a simple CLI surface" is satisfied by log observability — no dedicated CLI command or admin route in Phase 6.
+
+### Job Scheduling Model
+- **D-07:** A **single `river.Client`** is created in `cmd/worker/main.go`. All periodic jobs are registered at startup, the client is started, and the binary blocks until SIGINT/SIGTERM. No external coordination layer.
+- **D-08:** **Single-worker constraint** — only one worker instance should run in v1. README documents: do not run multiple worker processes until leader election is added. River's advisory locking exists but is not relied on in this phase.
+
+### Claude's Discretion
+- Exact log fields emitted by the heartbeat job (beyond a basic "heartbeat" message — e.g. worker uptime, job count).
+- Whether the orphan-file cleanup job logs a summary after each run (rows deleted, S3 objects deleted, errors encountered).
+- River client configuration details: worker concurrency, max attempts before discard, queue name.
+- Whether the orphan detection query uses a LEFT JOIN or a NOT IN / NOT EXISTS pattern — planner's call.
+
+</decisions>
+
+<canonical_refs>
+## Canonical References
+
+**Downstream agents MUST read these before planning or implementing.**
+
+### Requirements
+- `.planning/REQUIREMENTS.md` §Worker (WORK-01..04) — The 4 worker requirements this phase delivers
+- `.planning/PROJECT.md` — Core value statement and constraints (single binary + background worker, same Postgres, single VPS)
+- `.planning/ROADMAP.md` §Phase 6 — Success criteria and user-in-loop decisions
+
+### Prior Phase Context (locked decisions that constrain this phase)
+- `.planning/phases/05-files/05-CONTEXT.md` — D-01..D-06 (S3 client setup, MinIO in compose.yaml, file deletion pattern) — orphan-file cleanup job reuses the same S3 client and key format
+- `.planning/phases/01-foundation/01-CONTEXT.md` — `cmd/web` and `cmd/worker` entrypoints, goose migration conventions, justfile targets
+
+### Codebase Entry Points
+- `backend/cmd/worker/main.go` — Existing skeleton: pgxpool connect + slog + graceful shutdown. Phase 6 replaces the noop body with the river client.
+- `backend/internal/db/` — Shared DB pool (`db.NewPool`), sqlc-generated types. Worker reuses these.
+- `backend/internal/files/store.go` — S3 client and file operations. Orphan-cleanup job imports this package.
+- `backend/internal/db/queries/files.sql` — sqlc queries for tablo_files. Orphan-cleanup query added here.
+- `backend/compose.yaml` — MinIO already present from Phase 5. No new services needed.
+
+</canonical_refs>
+
+<code_context>
+## Existing Code Insights
+
+### Reusable Assets
+- `db.NewPool(ctx, dsn)` — already called in the worker skeleton; river client wraps the same pool.
+- `web.NewSlogHandler(env, os.Stdout)` — structured logging setup already in the worker; river's logger adapter should use the same slog default.
+- `backend/internal/files/store.go` — S3 delete operation already implemented in Phase 5; orphan-cleanup job calls `store.DeleteFile(ctx, key)` directly.
+- `signal.NotifyContext` pattern — already in `cmd/worker/main.go`; river client's `Stop()` hooks into the same context cancellation.
+
+### Established Patterns
+- Handler/store separation: domain logic lives in `internal/<domain>/store.go`, not in cmd packages. The orphan-cleanup job's DB query lives in `internal/db/queries/files.sql` (sqlc), not inline in cmd/worker.
+- goose migrations are numbered sequentially. Phase 5 adds `0005_files.sql`. If any app-level schema change is needed for the worker (unlikely, since river manages its own tables), it would be `0006_*.sql`.
+- `just generate` runs sqlc after any `.sql` query change.
+- `backend/.env.example` — new env vars (if any, e.g. worker-specific config) should be documented here.
+
+### Integration Points
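+- `backend/cmd/worker/main.go`: replace the noop body with the sequence: migrate river's schema up with `rivermigrate` (migrator built from `riverpgxv5.New(pool)`) → construct `river.Client` with periodic job registrations → `client.Start(ctx)` → `<-ctx.Done()` → `client.Stop(stopCtx)`. Here `stopCtx` is an illustrative name for a fresh timeout context; the signal context is already cancelled at that point, so passing it to `Stop` would end the graceful shutdown immediately.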
+- `backend/internal/db/queries/files.sql` — Add orphan detection query: find `tablo_files` rows where the owning `tablo_id` no longer exists in `tablos`.
+- `backend/go.mod` — Add `github.com/riverqueue/river` and `github.com/riverqueue/river/riverdriver/riverpgxv5`.
+- `backend/justfile` — Add `worker` target for local dev (`just worker` starts the worker binary).
+
+</code_context>
+
+<specifics>
+## Specific Ideas
+
+- Heartbeat interval: **1 minute** in production and local dev (frequent enough to observe in logs quickly).
+- Orphan-file cleanup interval: **1 hour** — orphans don't accumulate fast at v1 scale; safe to run hourly.
+- S3 object key format for orphan cleanup: `files/{tablo_id}/{uuid}` (locked by Phase 5 D-04). The cleanup job reads the key directly from the `tablo_files.s3_key` column; no reconstruction needed.
+- The orphan-cleanup job should log a per-run summary: how many orphan rows found, how many S3 objects deleted, how many errors. Useful for verifying the job ran correctly.
+
+</specifics>
+
+<deferred>
+## Deferred Ideas
+
+- **Web-side job enqueueing** (river client in `cmd/web`) — deferred to a later phase when a real event-triggered use case appears (e.g. post-upload thumbnail generation, email dispatch).
+- **Multiple worker instances / leader election** — deferred; single-worker constraint documented for v1.
+- **Job admin UI or CLI subcommand** (`backend list-failed-jobs`) — deferred; log observability satisfies v1.
+- **Redis / asynq** — explicitly out of scope; Postgres-only stack.
+
+</deferred>
+
+---
+
+*Phase: 6-Background-Worker*
+*Context gathered: 2026-05-15*
diff --git a/.planning/phases/06-background-worker/06-DISCUSSION-LOG.md b/.planning/phases/06-background-worker/06-DISCUSSION-LOG.md
new file mode 100644
index 0000000..c01db7c
--- /dev/null
+++ b/.planning/phases/06-background-worker/06-DISCUSSION-LOG.md
@@ -0,0 +1,116 @@
+# Phase 6: Background Worker - Discussion Log
+
+> **Audit trail only.** Do not use as input to planning, research, or execution agents.
+> Decisions are captured in CONTEXT.md — this log preserves the alternatives considered.
+
+**Date:** 2026-05-15
+**Phase:** 6-background-worker
+**Areas discussed:** Queue library, Proof-of-life job, Failed job surface, Job scheduling model
+
+---
+
+## Queue Library
+
+| Option | Description | Selected |
+|--------|-------------|----------|
+| river (Postgres-native) | Postgres-native job queue, one Go dep, built-in retry/backoff | ✓ |
+| Hand-rolled pg_notify | Zero new deps, full control, but you own retry/backoff yourself | |
+| asynq (Redis) | Adds Redis to compose.yaml, richer UI but conflicts with Postgres-only thesis | |
+
+**User's choice:** river
+
+**Migration management:**
+
+| Option | Description | Selected |
+|--------|-------------|----------|
+| rivermigrate at startup (programmatic) | Run rivermigrate before client starts; zero manual SQL | ✓ |
+| Embed into goose migrations | Copy river SQL into 0006_river.sql; manual sync on upgrades | |
+
+**User's choice:** rivermigrate at startup
+
+**Scheduling scope:**
+
+| Option | Description | Selected |
+|--------|-------------|----------|
+| Periodic only for Phase 6 | Prove wiring with scheduled jobs; web-side enqueueing deferred | ✓ |
+| Both periodic + web-side enqueue | Wire river client into cmd/web for full flow | |
+
+**User's choice:** Periodic only
+
+**Notes:** User selected the recommended option for all three queue-library sub-decisions, confirming Postgres-only constraint alignment.
+
+---
+
+## Proof-of-Life Job
+
+| Option | Description | Selected |
+|--------|-------------|----------|
+| Orphan-file cleanup | Finds and deletes tablo_files rows/S3 objects where tablo was deleted | |
+| Heartbeat / noop | Logs on schedule; proves scheduler but no domain value | ✓ (initial) |
+| Signed-URL prewarm | Pre-generates download URLs for recent files | |
+
+**Follow-up (heartbeat only vs. heartbeat + cleanup):**
+
+| Option | Description | Selected |
+|--------|-------------|----------|
+| Heartbeat only | Minimal Phase 6 | |
+| Heartbeat + orphan-file cleanup | Both jobs; heartbeat proves scheduling, cleanup proves DB+S3 | ✓ |
+
+**Intervals:**
+
+| Option | Description | Selected |
+|--------|-------------|----------|
+| Cleanup: hourly; Heartbeat: every minute | Frequent heartbeat visibility, safe hourly cleanup | ✓ |
+| Both every minute | More log noise, maximum dev observability | |
+
+**User's choice:** Heartbeat (1 min) + orphan-file cleanup (1 hr)
+
+**Notes:** User initially chose heartbeat-only for proof, then selected adding cleanup as a second job. Cleanup is the more meaningful domain job; heartbeat proves the periodic scheduler plumbing.
+
+---
+
+## Failed Job Surface
+
+| Option | Description | Selected |
+|--------|-------------|----------|
+| CLI subcommand: backend list-failed-jobs | Queries river_jobs for failed rows; satisfies WORK-04 literally | |
+| Structured logs only | Rich log fields on failure; redefines WORK-04 as log observability | ✓ |
+| Admin HTTP route on worker | GET /admin/jobs/failed on separate port | |
+
+**User's choice:** Structured logs only — WORK-04's "CLI surface" interpreted as log observability.
+
+**Notes:** User rejected a follow-up push to add a CLI command alongside logs, then confirmed the logs-only decision on second presentation. Decision is deliberate.
+
+---
+
+## Job Scheduling Model
+
+| Option | Description | Selected |
+|--------|-------------|----------|
+| Single river.Client with all periodic jobs at startup | One client, PeriodicJob registrations, block on shutdown | ✓ |
+| Separate goroutines with time.Ticker | Hand-rolled scheduling outside river | |
+
+**Concurrency model:**
+
+| Option | Description | Selected |
+|--------|-------------|----------|
+| Document single-worker constraint | README notes: one worker only in v1 | ✓ |
+| Rely on river's advisory locks | River prevents duplicate execution across instances | |
+
+**User's choice:** Single river.Client; document single-worker constraint in README.
+
+---
+
+## Claude's Discretion
+
+- Exact log fields emitted by heartbeat job (uptime, job count, etc.)
+- Whether orphan-cleanup logs a per-run summary
+- River client configuration (concurrency, max attempts, queue name)
+- Orphan detection query pattern (LEFT JOIN vs NOT EXISTS)
+
+## Deferred Ideas
+
+- Web-side job enqueueing (river client in cmd/web) — future phase when event-triggered use case appears
+- Multiple worker instances / leader election — after v1 ships
+- Job admin UI or CLI subcommand — logs satisfy v1 observability needs
+- Redis / asynq — explicitly out of scope (Postgres-only constraint)