docs(06): capture phase context

Arthur Belleville 2026-05-15 15:13:06 +02:00
parent 8215f53356
commit 9d1e24fc7e
2 changed files with 229 additions and 0 deletions


@@ -0,0 +1,113 @@
# Phase 6: Background Worker - Context
**Gathered:** 2026-05-15
**Status:** Ready for planning
<domain>
## Phase Boundary
A `cmd/worker` binary runs alongside `cmd/web` against the same Postgres instance, processes periodic jobs via a river-backed queue, and proves end-to-end with two real jobs: a heartbeat and an orphan-file cleanup. The worker skeleton already exists (`backend/cmd/worker/main.go`) — this phase replaces its noop body with the real river runtime.
Delivers WORK-01, WORK-02, WORK-03, WORK-04. **Not in scope:** web-side job enqueueing (event-triggered jobs from HTTP handlers), multi-worker leader election, per-job HTTP admin UI, CLI subcommands for job management, Redis or any non-Postgres queue backend.
</domain>
<decisions>
## Implementation Decisions
### Queue Library
- **D-01:** Use **river** (github.com/riverqueue/river) as the job queue. Postgres-native, adds one Go dependency, provides built-in retry/backoff and advisory locking. Matches the single-VPS / Postgres-only constraint.
- **D-02:** River's schema is managed via **`rivermigrate`** run programmatically at worker startup (before the client starts listening). No goose migration needed for river's internal tables — river owns its own migration path.
- **D-03:** Phase 6 uses **periodic jobs only** (river's `PeriodicJob`). Web-side enqueueing from HTTP handlers is deferred to a later phase when a real trigger exists (e.g. post-upload processing).
### Proof-of-Life Jobs
- **D-04:** Two jobs ship in Phase 6:
1. **Heartbeat** — logs a structured "worker heartbeat" line every **1 minute**. Proves the scheduler is running; observable in logs during development.
2. **Orphan-file cleanup** — runs every **1 hour**. Finds `tablo_files` rows whose owning tablo no longer exists (hard-deleted) and deletes both the DB row and the corresponding S3 object. Uses the same DB pool and S3 client established in Phase 5.
- **D-05:** Both jobs are registered as `river.PeriodicJob` entries in the river client constructor at worker startup.
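As a rough illustration of the heartbeat in D-04, the sketch below builds the structured fields for one heartbeat line using only the standard library. The river registration is left out. The helper names (`heartbeatAttrs`, `logHeartbeat`, `sampleHeartbeatAttrs`) and the field names are assumptions made for this example, not decisions from this document.

```go
package main

import (
	"context"
	"log/slog"
	"time"
)

// heartbeatInterval mirrors D-04: the heartbeat fires once per minute.
const heartbeatInterval = time.Minute

// heartbeatAttrs builds the structured fields for one heartbeat line.
// The field names are illustrative, not fixed by the plan.
func heartbeatAttrs(startedAt, now time.Time) []slog.Attr {
	return []slog.Attr{
		slog.String("job", "heartbeat"),
		slog.Int64("uptime_seconds", int64(now.Sub(startedAt)/time.Second)),
		slog.Int64("interval_seconds", int64(heartbeatInterval/time.Second)),
	}
}

// logHeartbeat writes the heartbeat line through the given slog logger.
func logHeartbeat(ctx context.Context, logger *slog.Logger, startedAt, now time.Time) {
	logger.LogAttrs(ctx, slog.LevelInfo, "worker heartbeat", heartbeatAttrs(startedAt, now)...)
}

// sampleHeartbeatAttrs returns the fields for a worker that has been up for 90 seconds.
func sampleHeartbeatAttrs() []slog.Attr {
	startedAt := time.Date(2026, 5, 15, 12, 0, 0, 0, time.UTC)
	return heartbeatAttrs(startedAt, startedAt.Add(90*time.Second))
}
```

In the worker, the job body would presumably call `logHeartbeat(ctx, slog.Default(), startedAt, time.Now())`, with `startedAt` captured once at process start.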
### Failed Job Visibility
- **D-06:** Failed jobs are surfaced via **structured logs only**. River emits a log event on each failure and on discard (max retries exceeded). Log fields include: job ID, job type, error message, attempt count, next retry time. WORK-04's "visible via a simple CLI surface" is satisfied by log observability — no dedicated CLI command or admin route in Phase 6.
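The log fields listed in D-06 could be assembled along the following lines. This is a standard-library sketch only: `failedJob` is a hypothetical stand-in for whatever job metadata river exposes, the field names are illustrative, and the `discarded` flag is an assumption added here to distinguish a final failure from a retryable one.

```go
package main

import (
	"errors"
	"log/slog"
	"time"
)

// failedJob carries the values D-06 asks to log. It is a hypothetical
// stand-in for river's job metadata, not a river type.
type failedJob struct {
	ID        int64
	Kind      string
	Attempt   int
	NextRetry time.Time // zero when no retry is scheduled
}

// failureAttrs maps a failed job onto the log fields listed in D-06.
func failureAttrs(job failedJob, err error) []slog.Attr {
	attrs := []slog.Attr{
		slog.Int64("job_id", job.ID),
		slog.String("job_kind", job.Kind),
		slog.String("error", err.Error()),
		slog.Int("attempt", job.Attempt),
	}
	if job.NextRetry.IsZero() {
		// No retry is scheduled: treat the job as discarded.
		return append(attrs, slog.Bool("discarded", true))
	}
	return append(attrs, slog.Time("next_retry_at", job.NextRetry))
}

// sampleFailureAttrs shows the fields for a second failed attempt that will be retried.
func sampleFailureAttrs() []slog.Attr {
	job := failedJob{
		ID: 42, Kind: "orphan_file_cleanup", Attempt: 2,
		NextRetry: time.Date(2026, 5, 15, 13, 0, 0, 0, time.UTC),
	}
	return failureAttrs(job, errors.New("s3: access denied"))
}

// sampleDiscardAttrs shows the fields for a job with no further retry.
func sampleDiscardAttrs() []slog.Attr {
	job := failedJob{ID: 43, Kind: "orphan_file_cleanup", Attempt: 5}
	return failureAttrs(job, errors.New("s3: access denied"))
}
```

A caller would pass the result to `logger.LogAttrs(ctx, slog.LevelError, "job failed", attrs...)`. Where that call lives (river's own logger output or a custom hook) is left to planning.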
### Job Scheduling Model
- **D-07:** A **single `river.Client`** is created in `cmd/worker/main.go`. All periodic jobs are registered at startup, the client is started, and the binary blocks until it receives SIGINT or SIGTERM. No external coordination layer.
- **D-08:** **Single-worker constraint** — only one worker instance should run in v1. README documents: do not run multiple worker processes until leader election is added. River's advisory locking exists but is not relied on in this phase.
### Claude's Discretion
- Exact log fields emitted by the heartbeat job (beyond a basic "heartbeat" message — e.g. worker uptime, job count).
- Whether the orphan-file cleanup job logs a summary after each run (rows deleted, S3 objects deleted, errors encountered).
- River client configuration details: worker concurrency, max attempts before discard, queue name.
- Whether the orphan detection query uses a LEFT JOIN or a NOT IN / NOT EXISTS pattern — planner's call.
</decisions>
<canonical_refs>
## Canonical References
**Downstream agents MUST read these before planning or implementing.**
### Requirements
- `.planning/REQUIREMENTS.md` §Worker (WORK-01..04) — The 4 worker requirements this phase delivers
- `.planning/PROJECT.md` — Core value statement and constraints (single binary + background worker, same Postgres, single VPS)
- `.planning/ROADMAP.md` §Phase 6 — Success criteria and user-in-loop decisions
### Prior Phase Context (locked decisions that constrain this phase)
- `.planning/phases/05-files/05-CONTEXT.md` — D-01..D-06 (S3 client setup, MinIO in compose.yaml, file deletion pattern) — orphan-file cleanup job reuses the same S3 client and key format
- `.planning/phases/01-foundation/01-CONTEXT.md` — `cmd/web` and `cmd/worker` entrypoints, goose migration conventions, justfile targets
### Codebase Entry Points
- `backend/cmd/worker/main.go` — Existing skeleton: pgxpool connect + slog + graceful shutdown. Phase 6 replaces the noop body with the river client.
- `backend/internal/db/` — Shared DB pool (`db.NewPool`), sqlc-generated types. Worker reuses these.
- `backend/internal/files/store.go` — S3 client and file operations. Orphan-cleanup job imports this package.
- `backend/internal/db/queries/files.sql` — sqlc queries for tablo_files. Orphan-cleanup query added here.
- `backend/compose.yaml` — MinIO already present from Phase 5. No new services needed.
</canonical_refs>
<code_context>
## Existing Code Insights
### Reusable Assets
- `db.NewPool(ctx, dsn)` — already called in the worker skeleton; river client wraps the same pool.
- `web.NewSlogHandler(env, os.Stdout)` — structured logging setup already in the worker; river's logger adapter should use the same slog default.
- `backend/internal/files/store.go` — S3 delete operation already implemented in Phase 5; orphan-cleanup job calls `store.DeleteFile(ctx, key)` directly.
- `signal.NotifyContext` pattern — already in `cmd/worker/main.go`; river client's `Stop()` hooks into the same context cancellation.
### Established Patterns
- Handler/store separation: domain logic lives in `internal/<domain>/store.go`, not in cmd packages. The orphan-cleanup job's DB query lives in `internal/db/queries/files.sql` (sqlc), not inline in cmd/worker.
- goose migrations are numbered sequentially. Phase 5 adds `0005_files.sql`. If the worker needs an app-level schema change (unlikely, since river manages its own tables), it would be `0006_*.sql`.
- `just generate` runs sqlc after any `.sql` query change.
- `backend/.env.example` — new env vars (if any, e.g. worker-specific config) should be documented here.
### Integration Points
- `backend/cmd/worker/main.go` — Replace noop body with: `rivermigrate.New(pool).Migrate(ctx)` → construct `river.Client` with periodic job registrations → `client.Start(ctx)` → `<-ctx.Done()` → `client.Stop(ctx)`.
- `backend/internal/db/queries/files.sql` — Add orphan detection query: find `tablo_files` rows where the owning `tablo_id` no longer exists in `tablos`.
- `backend/go.mod` — Add `github.com/riverqueue/river` and `github.com/riverqueue/river/riverdriver/riverpgxv5`.
- `backend/justfile` — Add `worker` target for local dev (`just worker` starts the worker binary).
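The start, wait, stop sequence from the first integration point can be sketched without river by hiding the client behind a small interface. Everything named here (`jobClient`, `runUntilDone`, `recordingClient`, `demoShutdown`) is hypothetical. One detail the sketch makes explicit is that the context is already cancelled when shutdown begins, so `Stop` is given a fresh context with a timeout. How river itself treats cancellation of the context passed to `Start` should be checked against its documentation during planning.

```go
package main

import (
	"context"
	"time"
)

// jobClient is a hypothetical interface covering the two calls named in the
// integration point; it is not river's actual API surface.
type jobClient interface {
	Start(ctx context.Context) error
	Stop(ctx context.Context) error
}

// runUntilDone starts the client, blocks until ctx is cancelled, then stops it.
// Stop receives a fresh context with a timeout because ctx is already
// cancelled by the time shutdown begins.
func runUntilDone(ctx context.Context, c jobClient, stopTimeout time.Duration) error {
	if err := c.Start(ctx); err != nil {
		return err
	}
	<-ctx.Done()

	stopCtx, cancel := context.WithTimeout(context.Background(), stopTimeout)
	defer cancel()
	return c.Stop(stopCtx)
}

// recordingClient is a test double that records how it was called.
type recordingClient struct {
	started         bool
	stopCtxWasAlive bool
}

func (r *recordingClient) Start(ctx context.Context) error {
	r.started = true
	return nil
}

func (r *recordingClient) Stop(ctx context.Context) error {
	r.stopCtxWasAlive = ctx.Err() == nil
	return nil
}

// demoShutdown simulates a signal by cancelling the context up front and
// reports whether Start ran and whether Stop saw a usable context.
func demoShutdown() (started, stopCtxWasAlive bool, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // stands in for SIGINT/SIGTERM delivered via signal.NotifyContext
	c := &recordingClient{}
	err = runUntilDone(ctx, c, 5*time.Second)
	return c.started, c.stopCtxWasAlive, err
}
```

In `cmd/worker/main.go` the cancelled context would come from the existing `signal.NotifyContext` setup rather than a manual `cancel()`.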
</code_context>
<specifics>
## Specific Ideas
- Heartbeat interval: **1 minute** in production and local dev (frequent enough to observe in logs quickly).
- Orphan-file cleanup interval: **1 hour** — orphans don't accumulate fast at v1 scale; safe to run hourly.
- S3 object key format for orphan cleanup: `files/{tablo_id}/{uuid}` (locked by Phase 5 D-04). The cleanup job reads the key directly from the `tablo_files.s3_key` column, so no reconstruction is needed.
- The orphan-cleanup job should log a per-run summary: how many orphan rows found, how many S3 objects deleted, how many errors. Useful for verifying the job ran correctly.
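A minimal sketch of the cleanup loop and its per-run summary follows, under several assumptions: `orphanFile`, `fileStore`, `cleanupSummary`, and `DeleteRow` are hypothetical names, the real row type would come from sqlc, and deleting the S3 object before the DB row is one possible ordering rather than a decision recorded here. The query in the leading comment is likewise only one candidate shape.

```go
package main

import (
	"context"
	"errors"
)

// One possible shape for the sqlc query. The id columns are assumptions;
// only tablo_files.s3_key and tablo_id appear in this document.
//
//   -- name: ListOrphanTabloFiles :many
//   SELECT f.id, f.s3_key FROM tablo_files f
//   WHERE NOT EXISTS (SELECT 1 FROM tablos t WHERE t.id = f.tablo_id);

// orphanFile is a hypothetical row shape; the real type would come from sqlc.
type orphanFile struct {
	ID    int64
	S3Key string
}

// fileStore is the subset of behaviour the job needs. DeleteFile follows the
// call named in this document; DeleteRow is an assumed helper.
type fileStore interface {
	DeleteFile(ctx context.Context, key string) error
	DeleteRow(ctx context.Context, id int64) error
}

// cleanupSummary is the per-run report the job would log.
type cleanupSummary struct {
	Found, ObjectsDeleted, RowsDeleted, Errors int
}

// cleanupOrphans deletes the S3 object first and the DB row second, so a
// failed object delete leaves the row in place for the next hourly run.
func cleanupOrphans(ctx context.Context, orphans []orphanFile, store fileStore) cleanupSummary {
	s := cleanupSummary{Found: len(orphans)}
	for _, f := range orphans {
		if err := store.DeleteFile(ctx, f.S3Key); err != nil {
			s.Errors++
			continue
		}
		s.ObjectsDeleted++
		if err := store.DeleteRow(ctx, f.ID); err != nil {
			s.Errors++
			continue
		}
		s.RowsDeleted++
	}
	return s
}

// memStore is an in-memory fake that fails object deletes for one chosen key.
type memStore struct{ failKey string }

func (m memStore) DeleteFile(_ context.Context, key string) error {
	if key == m.failKey {
		return errors.New("simulated S3 failure")
	}
	return nil
}

func (m memStore) DeleteRow(_ context.Context, _ int64) error { return nil }

// demoCleanup runs the loop over three orphans, one of which fails in S3.
func demoCleanup() cleanupSummary {
	orphans := []orphanFile{
		{ID: 1, S3Key: "files/10/a"},
		{ID: 2, S3Key: "files/10/b"},
		{ID: 3, S3Key: "files/11/c"},
	}
	return cleanupOrphans(context.Background(), orphans, memStore{failKey: "files/10/b"})
}
```

With this ordering, a row whose object delete fails is counted as an error and picked up again on the next run, which keeps the summary counts easy to reconcile against the logs.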
</specifics>
<deferred>
## Deferred Ideas
- **Web-side job enqueueing** (river client in `cmd/web`) — deferred to a later phase when a real event-triggered use case appears (e.g. post-upload thumbnail generation, email dispatch).
- **Multiple worker instances / leader election** — deferred; single-worker constraint documented for v1.
- **Job admin UI or CLI subcommand** (`backend list-failed-jobs`) — deferred; log observability satisfies v1.
- **Redis / asynq** — explicitly out of scope; Postgres-only stack.
</deferred>
---
*Phase: 6-Background-Worker*
*Context gathered: 2026-05-15*


@@ -0,0 +1,116 @@
# Phase 6: Background Worker - Discussion Log
> **Audit trail only.** Do not use as input to planning, research, or execution agents.
> Decisions are captured in CONTEXT.md — this log preserves the alternatives considered.
**Date:** 2026-05-15
**Phase:** 6-background-worker
**Areas discussed:** Queue library, Proof-of-life job, Failed job surface, Job scheduling model
---
## Queue Library
| Option | Description | Selected |
|--------|-------------|----------|
| river (Postgres-native) | Postgres-native job queue, one Go dep, built-in retry/backoff | ✓ |
| Hand-rolled pg_notify | Zero new deps, full control, but owns retry/backoff yourself | |
| asynq (Redis) | Adds Redis to compose.yaml, richer UI but conflicts with Postgres-only thesis | |
**User's choice:** river
**Migration management:**
| Option | Description | Selected |
|--------|-------------|----------|
| rivermigrate at startup (programmatic) | Run rivermigrate before client starts; zero manual SQL | ✓ |
| Embed into goose migrations | Copy river SQL into 0006_river.sql; manual sync on upgrades | |
**User's choice:** rivermigrate at startup
**Scheduling scope:**
| Option | Description | Selected |
|--------|-------------|----------|
| Periodic only for Phase 6 | Prove wiring with scheduled jobs; web-side enqueueing deferred | ✓ |
| Both periodic + web-side enqueue | Wire river client into cmd/web for full flow | |
**User's choice:** Periodic only
**Notes:** User selected the recommended option for all three queue-library sub-decisions, confirming Postgres-only constraint alignment.
---
## Proof-of-Life Job
| Option | Description | Selected |
|--------|-------------|----------|
| Orphan-file cleanup | Finds and deletes tablo_files rows/S3 objects where tablo was deleted | |
| Heartbeat / noop | Logs on schedule; proves scheduler but no domain value | ✓ (initial) |
| Signed-URL prewarm | Pre-generates download URLs for recent files | |
**Follow-up (heartbeat only vs. heartbeat + cleanup):**
| Option | Description | Selected |
|--------|-------------|----------|
| Heartbeat only | Minimal Phase 6 | |
| Heartbeat + orphan-file cleanup | Both jobs; heartbeat proves scheduling, cleanup proves DB+S3 | ✓ |
**Intervals:**
| Option | Description | Selected |
|--------|-------------|----------|
| Cleanup: hourly; Heartbeat: every minute | Frequent heartbeat visibility, safe hourly cleanup | ✓ |
| Both every minute | More log noise, maximum dev observability | |
**User's choice:** Heartbeat (1 min) + orphan-file cleanup (1 hr)
**Notes:** User initially chose heartbeat-only as the proof of life, then opted to add cleanup as a second job. Cleanup is the more meaningful domain job; heartbeat proves the periodic scheduler plumbing.
---
## Failed Job Surface
| Option | Description | Selected |
|--------|-------------|----------|
| CLI subcommand: backend list-failed-jobs | Queries river_jobs for failed rows; satisfies WORK-04 literally | |
| Structured logs only | Rich log fields on failure; redefines WORK-04 as log observability | ✓ |
| Admin HTTP route on worker | GET /admin/jobs/failed on separate port | |
**User's choice:** Structured logs only — WORK-04's "CLI surface" interpreted as log observability.
**Notes:** User rejected a follow-up push to add a CLI command alongside logs, then confirmed the logs-only decision on second presentation. Decision is deliberate.
---
## Job Scheduling Model
| Option | Description | Selected |
|--------|-------------|----------|
| Single river.Client with all periodic jobs at startup | One client, PeriodicJob registrations, block on shutdown | ✓ |
| Separate goroutines with time.Ticker | Hand-rolled scheduling outside river | |
**Concurrency model:**
| Option | Description | Selected |
|--------|-------------|----------|
| Document single-worker constraint | README notes: one worker only in v1 | ✓ |
| Rely on river's advisory locks | River prevents duplicate execution across instances | |
**User's choice:** Single river.Client; document single-worker constraint in README.
---
## Claude's Discretion
- Exact log fields emitted by heartbeat job (uptime, job count, etc.)
- Whether orphan-cleanup logs a per-run summary
- River client configuration (concurrency, max attempts, queue name)
- Orphan detection query pattern (LEFT JOIN vs NOT EXISTS)
## Deferred Ideas
- Web-side job enqueueing (river client in cmd/web) — future phase when event-triggered use case appears
- Multiple worker instances / leader election — after v1 ships
- Job admin UI or CLI subcommand — logs satisfy v1 observability needs
- Redis / asynq — explicitly out of scope (Postgres-only constraint)