diff --git a/backend/README.md b/backend/README.md
index d016ecb..572ae54 100644
--- a/backend/README.md
+++ b/backend/README.md
@@ -237,3 +237,331 @@ Run `just db-down` first if you also want to drop the Postgres container.
 - File uploads + R2/S3 → Phase 5
 - Real worker jobs → Phase 6
 - Production deploy, Dockerfile, `/readyz` → Phase 7
+
+## Deploy
+
+The production host is a Hetzner VM running plain Docker Compose (D-01, D-02). No
+Kubernetes or managed orchestration is needed: `docker compose up -d` on the VM is
+the entire deployment mechanism. Postgres runs inside the compose stack (D-03); there
+is no external managed database.
+
+### Prerequisites
+
+Install on the production VM before the first deploy:
+
+- **Docker** ≥ 24 with the **Docker Compose** plugin (`docker compose`, not the
+  standalone `docker-compose` binary)
+- **git** (optional; useful for pulling the repo directly onto the VM)
+
+No other runtimes are needed. Go, Node, and all build tooling run in the Dockerfile's
+multi-stage build and are not required on the VM.
+
+### First-time setup
+
+Run all commands on the VM via SSH unless noted otherwise.
+
+1. **SSH to the VM.**
+
+   ```
+   ssh user@
+   ```
+
+2. **Copy the `backend/` directory to the VM** (or clone the repo).
+
+   ```
+   # Option A: rsync from the local machine
+   rsync -av --exclude '.git' backend/ user@:~/xtablo/
+
+   # Option B: clone the repo directly on the VM
+   git clone ~/xtablo && cd ~/xtablo/backend
+   ```
+
+3. **Create `.env.prod`** by copying `.env.example` and filling in real values.
+
+   ```
+   cp .env.example .env.prod
+   chmod 600 .env.prod   # restrict read access; the file contains secrets (T-07-10)
+   ```
+
+   Mandatory variables to set in `.env.prod`:
+
+   | Variable | Value |
+   |---|---|
+   | `DATABASE_URL` | `postgres://xtablo:@postgres:5432/xtablo?sslmode=disable` (internal compose network; hostname is `postgres`) |
+   | `POSTGRES_PASSWORD` | Strong random password (also used by the postgres service). Example: `openssl rand -hex 24` |
+   | `POSTGRES_USER` | `xtablo` (or your custom user; must match `DATABASE_URL`) |
+   | `POSTGRES_DB` | `xtablo` (or your custom db; must match `DATABASE_URL`) |
+   | `SESSION_SECRET` | 32 random bytes hex-encoded. Generate with: `openssl rand -hex 32` |
+   | `S3_ENDPOINT` | R2 endpoint URL: `https://.r2.cloudflarestorage.com` |
+   | `S3_BUCKET` | R2 bucket name |
+   | `S3_ACCESS_KEY` | R2 API token key ID |
+   | `S3_SECRET_KEY` | R2 API token secret |
+   | `S3_USE_PATH_STYLE` | `false` for Cloudflare R2 (virtual-hosted-style URLs) |
+   | `S3_REGION` | `auto` or `us-east-1` (R2 accepts both) |
+   | `MAX_UPLOAD_SIZE_MB` | `25` (or your preferred limit) |
+   | `ENV` | `production` (activates JSON slog handler) |
+   | `PORT` | `8080` |
+   | `DOMAIN` | `app.yourdomain.com` (Caddy reads this for TLS) |
+
+   Do **not** include `TEST_DATABASE_URL` in `.env.prod`; it is a dev/test-only
+   variable and is not used by the runtime binaries.
+
+4. **Build the Docker image** (from inside `backend/`, either locally or on the VM).
+
+   ```
+   # From inside backend/
+   docker build -f Dockerfile -t ghcr.io/yourusername/xtablo:v0.1.0 .
+   ```
+
+   If building locally, push to a registry and pull on the VM:
+
+   ```
+   docker push ghcr.io/yourusername/xtablo:v0.1.0
+   # On the VM:
+   docker pull ghcr.io/yourusername/xtablo:v0.1.0
+   ```
+
+5. **Set image coordinates as environment variables** (used by `docker-compose.prod.yaml`).
+
+   ```
+   export IMAGE=ghcr.io/yourusername/xtablo
+   export TAG=v0.1.0
+   ```
+
+6. **Start the stack.**
+
+   ```
+   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
+   ```
+
+   The postgres service must pass its healthcheck before web and worker start.
+   Migrations run automatically at web startup via `goose.Up()` (D-10).
+
+7. **Verify the deployment.**
+
+   ```
+   curl https://app.yourdomain.com/healthz   # → {"status":"ok"}
+   curl https://app.yourdomain.com/readyz    # → {"status":"ok","db":"ok"}
+   ```
+
+   If the domain is not yet configured, use the VM's public IP temporarily with
+   HTTP (Caddy will not yet have a certificate):
+
+   ```
+   curl http://:80/healthz
+   ```
+
+8. **Let's Encrypt staging (for initial TLS testing).**
+
+   To avoid hitting Let's Encrypt production rate limits (5 duplicate certificates
+   per week per domain) during initial setup, uncomment the staging global block in
+   `deploy/Caddyfile`:
+
+   ```
+   {
+       acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
+   }
+   ```
+
+   Restart Caddy after editing (`docker compose -f docker-compose.prod.yaml restart caddy`),
+   verify TLS works (browsers will show a staging cert warning; that is expected),
+   then remove the global block and clear the `caddy_data` volume to issue a real
+   production certificate.
+
+### Deploying a new version
+
+1. **Build and tag the new image** (same as first-time, with a new tag):
+
+   ```
+   docker build -f Dockerfile -t ghcr.io/yourusername/xtablo:v0.2.0 .
+   docker push ghcr.io/yourusername/xtablo:v0.2.0   # if using a registry
+   ```
+
+2. **On the VM**, update `TAG` in `.env.prod`:
+
+   ```
+   # Edit .env.prod:
+   TAG=v0.2.0
+   ```
+
+   Or pass it inline without editing the file:
+
+   ```
+   export TAG=v0.2.0
+   ```
+
+3. **Pull and recreate only the changed services:**
+
+   ```
+   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
+   ```
+
+   Compose recreates only the web and worker containers (their image tag changed).
+   Postgres and Caddy are unaffected. Migrations run automatically at web startup
+   (D-10); `goose.Up()` is idempotent and skips already-applied migrations.
+
+## Rollback
+
+Rollback means redeploying the previous image tag (D-11). No special tooling is
+required; it is the same as deploying a new version, but with an older tag.
+
+1. **On the VM**, set `TAG` to the previous tag in `.env.prod` (or inline):
+
+   ```
+   export TAG=v0.1.0
+   ```
+
+2. **Redeploy:**
+
+   ```
+   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
+   ```
+
+   Compose recreates web and worker with the old image. The rollback is complete.
+
+### Schema rollback (break-glass)
+
+`goose.Up()` only applies pending migrations, so rolling back to a previous binary does
+not automatically run `goose down`. In most cases this is fine: the old binary ignores
+columns it does not know about.
+
+If a migration introduced a schema change that is **incompatible** with the old binary
+(e.g. a NOT NULL column without a default that the old binary does not supply), run a
+manual goose down as a break-glass step:
+
+1. Connect to Postgres inside the container:
+
+   ```
+   docker exec -it psql -U xtablo -d xtablo
+   ```
+
+   (Find the container name with `docker compose -f docker-compose.prod.yaml ps`.)
+
+2. The production image is distroless, so the `goose` CLI is not inside the runtime
+   container. Install the goose CLI separately on the VM or use the goose Docker
+   image against the internal network:
+
+   ```
+   # Install goose CLI on the VM (requires a Go toolchain):
+   go install github.com/pressly/goose/v3/cmd/goose@latest
+   goose -dir ./migrations postgres "$DATABASE_URL" down
+   ```
+
+   Or use an ephemeral container on the same compose network:
+
+   ```
+   docker run --rm --network \
+     -e GOOSE_DRIVER=postgres \
+     -e GOOSE_DBSTRING="postgres://xtablo:@postgres:5432/xtablo?sslmode=disable" \
+     -v "$(pwd)/migrations:/migrations" \
+     ghcr.io/kukymbr/goose-docker:latest \
+     goose -dir /migrations down
+   ```
+
+   After reverting the migration, the old binary will start cleanly.
+
+## Incident Runbook
+
+### /readyz returns 503
+
+`/readyz` pings Postgres. A 503 means the web container cannot reach the database.
+
+1. Check container status:
+
+   ```
+   docker compose -f docker-compose.prod.yaml ps
+   ```
+
+2. If `postgres` is down or unhealthy, restart it:
+
+   ```
+   docker compose -f docker-compose.prod.yaml up -d postgres
+   ```
+
+   Then restart web and worker (they will wait for postgres to be healthy):
+
+   ```
+   docker compose -f docker-compose.prod.yaml up -d
+   ```
+
+3. Check web logs for the actual error:
+
+   ```
+   docker compose -f docker-compose.prod.yaml logs web --tail=50
+   ```
+
+   All application logs are JSON when `ENV=production` is set. Look for
+   `"level":"ERROR"` lines with a `"msg":"db ping failed"` or similar.
+
+### Caddy TLS certificate errors
+
+1. Check caddy logs:
+
+   ```
+   docker compose -f docker-compose.prod.yaml logs caddy --tail=50
+   ```
+
+2. If you see "too many certificates already issued for" (Let's Encrypt rate limit,
+   RESEARCH Pitfall 4):
+   - Caddy hit the 5 duplicate certificates per week limit for the domain.
+   - Confirm the `caddy_data` named volume exists and is mounted. If the volume was
+     accidentally deleted, Caddy cannot reuse the cached certificate and must
+     re-issue on every restart, quickly exhausting the rate limit.
+   - Recovery options:
+     - Wait up to 1 week for the rate limit window to reset.
+     - Switch to the Let's Encrypt staging endpoint temporarily (see
+       "Let's Encrypt staging" in the First-time setup section above).
+     - Restore from a `caddy_data` volume backup if available.
+
+3. If the `caddy_data` volume was lost:
+
+   ```
+   # Verify the volume still exists:
+   docker volume ls | grep caddy_data
+
+   # If missing, the volume must be recreated (certificates will be re-issued):
+   docker compose -f docker-compose.prod.yaml up -d caddy
+   ```
+
+### Checking logs
+
+Follow logs for any service:
+
+```
+docker compose -f docker-compose.prod.yaml logs web --tail=100 --follow
+docker compose -f docker-compose.prod.yaml logs worker --tail=100 --follow
+docker compose -f docker-compose.prod.yaml logs caddy --tail=100 --follow
+docker compose -f docker-compose.prod.yaml logs postgres --tail=50
+```
+
+All application logs are JSON in production (`ENV=production` activates the slog
+JSON handler). Pipe through `jq` for readable output:
+
+```
+docker compose -f docker-compose.prod.yaml logs web --follow --no-log-prefix | jq .
+```
+
+### Debugging the distroless container
+
+The runtime image (`gcr.io/distroless/static-debian12:nonroot`) has **no shell**
+(RESEARCH Pitfall 7). You cannot `docker exec -it sh`.
+
+To debug network or filesystem issues, attach an ephemeral busybox container to the
+same network:
+
+```
+# Find the web container ID:
+docker compose -f docker-compose.prod.yaml ps
+
+# Attach busybox to the web container's network namespace:
+docker run --rm -it --network container: busybox sh
+```
+
+From the busybox shell you can run `wget`, `nc`, `ping`, etc. to diagnose
+connectivity. To inspect the compose network directly (e.g. reach `postgres:5432`):
+
+```
+docker run --rm -it \
+  --network $(docker inspect --format '{{range .NetworkSettings.Networks}}{{.NetworkID}}{{end}}') \
+  busybox sh
+```
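
The runbook above suggests looking for `"level":"ERROR"` lines in the web logs. On a VM where `jq` is not installed, a plain `grep` filter is one way to do that. The following is a minimal sketch, assuming the slog JSON handler writes compact `"level":"ERROR"` pairs with no whitespace, as in the pattern quoted in the runbook. `filter_level` is a hypothetical helper name and is not part of the repo.

```shell
# Hypothetical helper (not in the repo): keep only JSON log lines at one level.
# Reads log lines on stdin; the level defaults to ERROR.
# Assumes compact JSON output, i.e. "level":"ERROR" with no spaces.
filter_level() {
  # -F treats the pattern as a literal string rather than a regex.
  # "|| true" keeps the exit status 0 when nothing matches, which is
  # convenient in scripts that run under `set -e`.
  grep -F "\"level\":\"${1:-ERROR}\"" || true
}

# Possible usage on the VM (commented out so this block runs without Docker):
# docker compose -f docker-compose.prod.yaml logs web --no-log-prefix --tail=200 \
#   | filter_level ERROR
```

Passing a different level, for example `filter_level WARN`, narrows the output to that level instead. Because it is a literal string match, it would miss lines if the log format ever gained whitespace around the colon; `jq` remains the more robust choice when it is available.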