docs(07-03): extend README with Deploy, Rollback, and Incident Runbook sections

- Deploy section: prerequisites, first-time setup, deploying new versions (DEPLOY-05)
- First-time setup documents DATABASE_URL internal URL, SESSION_SECRET generation,
  full S3/R2 var list, chmod 600 .env.prod reminder (T-07-10), TLS staging note
- Rollback section: image tag redeployment + break-glass schema rollback via goose CLI
- Incident Runbook: /readyz 503, Caddy TLS rate limits, log viewing, distroless debug
  (ephemeral busybox container technique for shell-less runtime image, RESEARCH Pitfall 7)
2026-05-15 18:25:03 +02:00 · 2026-05-15 18:25:03 +02:00 · f261fb39b8
commit f261fb39b8
parent 273f0632be
1 changed files with 328 additions and 0 deletions
--- a/backend/README.md
+++ b/backend/README.md
@@ -237,3 +237,331 @@ Run `just db-down` first if you also want to drop the Postgres container.
 - File uploads + R2/S3 → Phase 5
 - Real worker jobs → Phase 6
 - Production deploy, Dockerfile, `/readyz` → Phase 7
+
+## Deploy
+
+The production host is a Hetzner VM running plain Docker Compose (D-01, D-02). No
+Kubernetes or managed orchestration is needed — `docker compose up -d` on the VM is
+the entire deployment mechanism. Postgres runs inside the compose stack (D-03); there
+is no external managed database.
+
+### Prerequisites
+
+Install on the production VM before first deploy:
+
+- **Docker** ≥ 24 with the **Docker Compose** plugin (`docker compose` — not the
+  standalone `docker-compose` binary)
+- **git** (optional — useful for pulling the repo directly onto the VM)
+
+No other runtimes are needed. Go, Node, and all build tooling run in the Dockerfile's
+multi-stage build and are not required on the VM.
+
+### First-time setup
+
+Run all commands on the VM via SSH unless noted otherwise.
+
+1. **SSH to the VM.**
+
+   ```
+   ssh user@<vm-ip>
+   ```
+
+2. **Copy the `backend/` directory to the VM** (or clone the repo).
+
+   ```
+   # Option A — rsync from local machine:
+   rsync -av --exclude '.git' backend/ user@<vm-ip>:~/xtablo/
+
+   # Option B — clone the repo directly on the VM:
+   git clone <repo-url> ~/xtablo && cd ~/xtablo/backend
+   ```
+
+3. **Create `.env.prod`** by copying `.env.example` and filling in real values.
+
+   ```
+   cp .env.example .env.prod
+   chmod 600 .env.prod      # restrict read access — file contains secrets (T-07-10)
+   ```
+
+   Mandatory variables to set in `.env.prod`:
+
+   | Variable | Value |
+   |---|---|
+   | `DATABASE_URL` | `postgres://xtablo:<POSTGRES_PASSWORD>@postgres:5432/xtablo?sslmode=disable` (internal compose network — hostname is `postgres`) |
+   | `POSTGRES_PASSWORD` | Strong random password (also used by the postgres service). Generate with: `openssl rand -hex 24` |
+   | `POSTGRES_USER` | `xtablo` (or your custom user; must match `DATABASE_URL`) |
+   | `POSTGRES_DB` | `xtablo` (or your custom db; must match `DATABASE_URL`) |
+   | `SESSION_SECRET` | 32 random bytes hex-encoded. Generate with: `openssl rand -hex 32` |
+   | `S3_ENDPOINT` | R2 endpoint URL: `https://<account-id>.r2.cloudflarestorage.com` |
+   | `S3_BUCKET` | R2 bucket name |
+   | `S3_ACCESS_KEY` | R2 API token key ID |
+   | `S3_SECRET_KEY` | R2 API token secret |
+   | `S3_USE_PATH_STYLE` | `false` for Cloudflare R2 (virtual-hosted-style URLs) |
+   | `S3_REGION` | `auto` or `us-east-1` (R2 accepts both) |
+   | `MAX_UPLOAD_SIZE_MB` | `25` (or your preferred limit) |
+   | `ENV` | `production` (activates JSON slog handler) |
+   | `PORT` | `8080` |
+   | `DOMAIN` | `app.yourdomain.com` (Caddy reads this for TLS) |
+
+   Do **not** include `TEST_DATABASE_URL` in `.env.prod` — it is a dev/test-only
+   variable and is not used by the runtime binaries.
+
+4. **Build the Docker image** (from inside `backend/` — either locally or on the VM).
+
+   ```
+   # From inside backend/
+   docker build -f Dockerfile -t ghcr.io/yourusername/xtablo:v0.1.0 .
+   ```
+
+   If building locally, push to a registry and pull on the VM:
+
+   ```
+   docker push ghcr.io/yourusername/xtablo:v0.1.0
+   # On the VM:
+   docker pull ghcr.io/yourusername/xtablo:v0.1.0
+   ```
+
+5. **Set image coordinates as environment variables** (used by `docker-compose.prod.yaml`).
+
+   ```
+   export IMAGE=ghcr.io/yourusername/xtablo
+   export TAG=v0.1.0
+   ```
+
+6. **Start the stack.**
+
+   ```
+   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
+   ```
+
+   The postgres service must pass its healthcheck before web and worker start.
+   Migrations run automatically at web startup via `goose.Up()` (D-10).
+
+7. **Verify the deployment.**
+
+   ```
+   curl https://app.yourdomain.com/healthz   # → {"status":"ok"}
+   curl https://app.yourdomain.com/readyz    # → {"status":"ok","db":"ok"}
+   ```
+
+   If the domain is not yet configured, use the VM's public IP temporarily with
+   HTTP (Caddy will not yet have a certificate):
+
+   ```
+   curl http://<vm-ip>:80/healthz
+   ```
+
+8. **Let's Encrypt staging (for initial TLS testing).**
+
+   To avoid hitting Let's Encrypt production rate limits (5 duplicate certificates
+   per week for the same exact set of hostnames) during initial setup, uncomment the
+   staging global block in `deploy/Caddyfile`:
+
+   ```
+   {
+     acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
+   }
+   ```
+
+   Restart Caddy after editing (`docker compose -f docker-compose.prod.yaml restart caddy`),
+   verify TLS works (browsers will show a staging cert warning — that is expected),
+   then remove the global block and clear the `caddy_data` volume to issue a real
+   production certificate.
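+
+Before step 6, it can help to confirm that `.env.prod` defines every mandatory
+variable from the table in step 3. The following is a minimal sketch, not part of
+the repository: `check_env` is a hypothetical helper name, and the variable list is
+copied from the table above.
+
+```shell
+# Hypothetical helper: report mandatory variables that are missing or empty
+# in an env file. Returns non-zero if any are missing.
+check_env() {
+  env_file="$1"
+  missing=0
+  for var in DATABASE_URL POSTGRES_PASSWORD POSTGRES_USER POSTGRES_DB \
+    SESSION_SECRET S3_ENDPOINT S3_BUCKET S3_ACCESS_KEY S3_SECRET_KEY \
+    S3_USE_PATH_STYLE S3_REGION MAX_UPLOAD_SIZE_MB ENV PORT DOMAIN; do
+    # Match lines of the form VAR=<at least one character>.
+    if ! grep -Eq "^${var}=.+" "$env_file"; then
+      echo "missing or empty: ${var}" >&2
+      missing=1
+    fi
+  done
+  return "$missing"
+}
+```
+
+Usage: `check_env .env.prod && echo "env file looks complete"`.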
+
+### Deploying a new version
+
+1. **Build and tag the new image** (same as first-time, with a new tag):
+
+   ```
+   docker build -f Dockerfile -t ghcr.io/yourusername/xtablo:v0.2.0 .
+   docker push ghcr.io/yourusername/xtablo:v0.2.0   # if using a registry
+   ```
+
+2. **On the VM** — update `TAG` in `.env.prod`:
+
+   ```
+   # Edit .env.prod:
+   TAG=v0.2.0
+   ```
+
+   Or export it in the current shell without editing the file (Compose gives shell
+   variables precedence over values from `--env-file`):
+
+   ```
+   export TAG=v0.2.0
+   ```
+
+3. **Pull and recreate only the changed services:**
+
+   ```
+   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
+   ```
+
+   Compose recreates only the web and worker containers (their image tag changed).
+   Postgres and Caddy are unaffected. Migrations run automatically at web startup
+   (D-10) — `goose.Up()` is idempotent and skips already-applied migrations.
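+
+After `up -d` returns, the web container may still be applying migrations, so a
+short polling loop on `/readyz` can confirm that the new version is serving. A
+minimal sketch: `wait_ready` is a hypothetical helper, not part of the repository,
+and the example URL reuses the placeholder domain from the verification step.
+
+```shell
+# Hypothetical helper: poll a URL until it returns HTTP success or attempts run out.
+# Usage: wait_ready <url> [attempts] [delay-seconds]
+wait_ready() {
+  url="$1"
+  attempts="${2:-30}"
+  delay="${3:-2}"
+  i=0
+  while [ "$i" -lt "$attempts" ]; do
+    # -f makes curl exit non-zero on HTTP errors such as 503.
+    if curl -fsS -o /dev/null "$url"; then
+      return 0
+    fi
+    i=$((i + 1))
+    sleep "$delay"
+  done
+  return 1
+}
+
+# Example: wait_ready https://app.yourdomain.com/readyz
+```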
+
+## Rollback
+
+Rollback means redeploying the previous image tag (D-11). No special tooling is
+required — it is the same as deploying a new version, but with an older tag.
+
+1. **On the VM** — set `TAG` to the previous tag in `.env.prod` (or inline):
+
+   ```
+   export TAG=v0.1.0
+   ```
+
+2. **Redeploy:**
+
+   ```
+   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
+   ```
+
+   Compose recreates web and worker with the old image. The rollback is complete.
+
+### Schema rollback (break-glass)
+
+`goose.Up()` only applies pending migrations, so starting a previous binary never runs
+`goose down` and the schema stays at the newer version. In most cases this is fine:
+the old binary ignores columns it does not know about.
+
+If a migration introduced a schema change that is **incompatible** with the old binary
+(e.g. a NOT NULL column without a default that the old binary does not supply), run a
+manual goose down as a break-glass step:
+
+1. Connect to Postgres inside the container:
+
+   ```
+   docker exec -it <postgres-container-name> psql -U xtablo -d xtablo
+   ```
+
+   (Find the container name with `docker compose -f docker-compose.prod.yaml ps`.)
+
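+2. The production image is distroless, so the `goose` CLI is not inside the runtime
+   container. Either install the goose CLI on the VM (this needs a Go toolchain,
+   which the VM does not otherwise require) or run goose from an ephemeral
+   container on the compose network:
+
+   ```
+   # Install goose CLI on the VM:
+   go install github.com/pressly/goose/v3/cmd/goose@latest
+   # DATABASE_URL must be reachable from the host: the `postgres` hostname only
+   # resolves inside the compose network.
+   goose -dir ./migrations postgres "$DATABASE_URL" down
+   ```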
+
+   Or use an ephemeral container on the same compose network:
+
+   ```
+   docker run --rm --network <compose-network> \
+     -e GOOSE_DRIVER=postgres \
+     -e GOOSE_DBSTRING="postgres://xtablo:<password>@postgres:5432/xtablo?sslmode=disable" \
+     -v "$(pwd)/migrations":/migrations \
+     ghcr.io/kukymbr/goose-docker:latest \
+     goose -dir /migrations down
+   ```
+
+   After reverting the migration, the old binary will start cleanly.
+
+## Incident Runbook
+
+### /readyz returns 503
+
+`/readyz` pings Postgres. A 503 means the web container cannot reach the database.
+
+1. Check container status:
+
+   ```
+   docker compose -f docker-compose.prod.yaml ps
+   ```
+
+2. If `postgres` is down or unhealthy, restart it:
+
+   ```
+   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d postgres
+   ```
+
+   Then restart web and worker (they will wait for postgres to be healthy):
+
+   ```
+   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
+   ```
+
+3. Check web logs for the actual error:
+
+   ```
+   docker compose -f docker-compose.prod.yaml logs web --tail=50
+   ```
+
+   All application logs are JSON when `ENV=production` is set. Look for
+   `"level":"ERROR"` lines with a `"msg":"db ping failed"` or similar.
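+
+To narrow the web logs down to error entries, one option is a plain-text filter on
+the JSON `level` field. This is a sketch that assumes slog's compact JSON output
+(no spaces around the colon); `errors_only` is a hypothetical helper name.
+
+```shell
+# Hypothetical helper: keep only log lines whose JSON level is ERROR.
+errors_only() {
+  grep -F '"level":"ERROR"'
+}
+
+# Example against the web service (requires the running stack):
+# docker compose -f docker-compose.prod.yaml logs web --no-log-prefix | errors_only
+```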
+
+### Caddy TLS certificate errors
+
+1. Check caddy logs:
+
+   ```
+   docker compose -f docker-compose.prod.yaml logs caddy --tail=50
+   ```
+
+2. If you see "too many certificates already issued for" (Let's Encrypt rate limit,
+   RESEARCH Pitfall 4):
+   - Caddy hit the 5 duplicate certificates per week limit for the domain.
+   - Confirm the `caddy_data` named volume exists and is mounted — if the volume was
+     accidentally deleted, Caddy cannot reuse the cached certificate and must
+     re-issue on every restart, quickly exhausting the rate limit.
+   - Recovery options:
+     - Wait up to 1 week for the rate limit window to reset.
+     - Switch to the Let's Encrypt staging endpoint temporarily (see
+       "Let's Encrypt staging" in the First-time setup section above).
+     - Restore from a `caddy_data` volume backup if available.
+
+3. If the `caddy_data` volume was lost:
+
+   ```
+   # Verify the volume still exists:
+   docker volume ls | grep caddy_data
+
+   # If missing, the volume must be recreated (certificates will be re-issued):
+   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d caddy
+   ```
+
+### Checking logs
+
+Follow logs for any service:
+
+```
+docker compose -f docker-compose.prod.yaml logs web --tail=100 --follow
+docker compose -f docker-compose.prod.yaml logs worker --tail=100 --follow
+docker compose -f docker-compose.prod.yaml logs caddy --tail=100 --follow
+docker compose -f docker-compose.prod.yaml logs postgres --tail=50
+```
+
+All application logs are JSON in production (`ENV=production` activates the slog
+JSON handler). Pipe through `jq` for readable output:
+
+```
+docker compose -f docker-compose.prod.yaml logs web --follow --no-log-prefix | jq .
+```
+
+### Debugging the distroless container
+
+The runtime image (`gcr.io/distroless/static-debian12:nonroot`) has **no shell**
+(RESEARCH Pitfall 7). You cannot `docker exec -it <web-container> sh`.
+
+To debug network or filesystem issues, attach an ephemeral busybox container to the
+same network:
+
+```
+# Print the web container ID:
+docker compose -f docker-compose.prod.yaml ps -q web
+
+# Attach busybox to the web container's network namespace:
+docker run --rm -it --network container:<web-container-id> busybox sh
+```
+
+From the busybox shell you can run `wget`, `nc`, `ping`, etc. to diagnose
+connectivity. To inspect the compose network directly (e.g. reach `postgres:5432`):
+
+```
+docker run --rm -it \
+  --network $(docker inspect <web-container-id> --format '{{range .NetworkSettings.Networks}}{{.NetworkID}}{{end}}') \
+  busybox sh
+```