docs(07-03): extend README with Deploy, Rollback, and Incident Runbook sections

- Deploy section: prerequisites, first-time setup, deploying new versions (DEPLOY-05)
- First-time setup documents DATABASE_URL internal URL, SESSION_SECRET generation,
  full S3/R2 var list, chmod 600 .env.prod reminder (T-07-10), TLS staging note
- Rollback section: image tag redeployment + break-glass schema rollback via goose CLI
- Incident Runbook: /readyz 503, Caddy TLS rate limits, log viewing, distroless debug
  (ephemeral busybox container technique for shell-less runtime image, RESEARCH Pitfall 7)
Arthur Belleville 2026-05-15 18:25:03 +02:00
parent 273f0632be
commit f261fb39b8


@@ -237,3 +237,331 @@ Run `just db-down` first if you also want to drop the Postgres container.
- File uploads + R2/S3 → Phase 5
- Real worker jobs → Phase 6
- Production deploy, Dockerfile, `/readyz` → Phase 7
## Deploy
The production host is a Hetzner VM running plain Docker Compose (D-01, D-02). No
Kubernetes or managed orchestration is needed — `docker compose up -d` on the VM is
the entire deployment mechanism. Postgres runs inside the compose stack (D-03); there
is no external managed database.
### Prerequisites
Install on the production VM before first deploy:
- **Docker** ≥ 24 with the **Docker Compose** plugin (`docker compose` — not the
standalone `docker-compose` binary)
- **git** (optional — useful for pulling the repo directly onto the VM)
No other runtimes are needed. Go, Node, and all build tooling run in the Dockerfile's
multi-stage build and are not required on the VM.
### First-time setup
Run all commands on the VM via SSH unless noted otherwise.
1. **SSH to the VM.**
```
ssh user@<vm-ip>
```
2. **Copy the `backend/` directory to the VM** (or clone the repo).
```
# Option A — rsync from local machine:
rsync -av --exclude '.git' backend/ user@<vm-ip>:~/xtablo/
# Option B — clone the repo directly on the VM:
git clone <repo-url> ~/xtablo && cd ~/xtablo/backend
```
3. **Create `.env.prod`** by copying `.env.example` and filling in real values.
```
cp .env.example .env.prod
chmod 600 .env.prod # restrict read access — file contains secrets (T-07-10)
```
Mandatory variables to set in `.env.prod`:
| Variable | Value |
|---|---|
| `DATABASE_URL` | `postgres://xtablo:<POSTGRES_PASSWORD>@postgres:5432/xtablo?sslmode=disable` (internal compose network — hostname is `postgres`) |
| `POSTGRES_PASSWORD` | Strong random password, also used by the postgres service; must match the password embedded in `DATABASE_URL`. Generate with: `openssl rand -hex 24` |
| `POSTGRES_USER` | `xtablo` (or your custom user; must match `DATABASE_URL`) |
| `POSTGRES_DB` | `xtablo` (or your custom db; must match `DATABASE_URL`) |
| `SESSION_SECRET` | 32 random bytes hex-encoded. Generate with: `openssl rand -hex 32` |
| `S3_ENDPOINT` | R2 endpoint URL: `https://<account-id>.r2.cloudflarestorage.com` |
| `S3_BUCKET` | R2 bucket name |
| `S3_ACCESS_KEY` | R2 API token key ID |
| `S3_SECRET_KEY` | R2 API token secret |
| `S3_USE_PATH_STYLE` | `false` for Cloudflare R2 (virtual-hosted-style URLs) |
| `S3_REGION` | `auto` or `us-east-1` (R2 accepts both) |
| `MAX_UPLOAD_SIZE_MB` | `25` (or your preferred limit) |
| `ENV` | `production` (activates JSON slog handler) |
| `PORT` | `8080` |
| `DOMAIN` | `app.yourdomain.com` (Caddy reads this for TLS) |
Do **not** include `TEST_DATABASE_URL` in `.env.prod` — it is a dev/test-only
variable and is not used by the runtime binaries.
4. **Build the Docker image** (from inside `backend/` — either locally or on the VM).
```
# From inside backend/
docker build -f Dockerfile -t ghcr.io/yourusername/xtablo:v0.1.0 .
```
If building locally, push to a registry and pull on the VM:
```
docker push ghcr.io/yourusername/xtablo:v0.1.0
# On the VM:
docker pull ghcr.io/yourusername/xtablo:v0.1.0
```
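5. **Set image coordinates as environment variables** (used by `docker-compose.prod.yaml`). Exported values last only for the current shell session; add `IMAGE` and `TAG` to `.env.prod` to persist them across SSH sessions.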
```
export IMAGE=ghcr.io/yourusername/xtablo
export TAG=v0.1.0
```
6. **Start the stack.**
```
docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
```
The postgres service must pass its healthcheck before web and worker start.
Migrations run automatically at web startup via `goose.Up()` (D-10).
7. **Verify the deployment.**
```
curl https://app.yourdomain.com/healthz # → {"status":"ok"}
curl https://app.yourdomain.com/readyz # → {"status":"ok","db":"ok"}
```
If the domain is not yet configured, use the VM's public IP temporarily with
HTTP (Caddy will not yet have a certificate):
```
curl http://<vm-ip>:80/healthz
```
8. **Let's Encrypt staging (for initial TLS testing).**
To avoid hitting Let's Encrypt production rate limits (5 duplicate certificates
per week per domain) during initial setup, uncomment the staging global block in
`deploy/Caddyfile`:
```
{
acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
}
```
Restart Caddy after editing (`docker compose -f docker-compose.prod.yaml restart caddy`),
verify TLS works (browsers will show a staging cert warning — that is expected),
then remove the global block and clear the `caddy_data` volume to issue a real
production certificate.
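Before starting the stack for the first time, it can help to confirm that `.env.prod` is complete and locked down. The following is a minimal sketch, not part of the repository: `check_env_file` is a hypothetical helper, its variable list is copied from the table in step 3, and it assumes simple `KEY=value` lines.

```shell
# check_env_file FILE: report unset mandatory variables and loose permissions.
# Hypothetical helper; the variable list mirrors the table in step 3.
check_env_file() {
  local file="$1" status=0 var mode
  for var in DATABASE_URL POSTGRES_PASSWORD POSTGRES_USER POSTGRES_DB \
             SESSION_SECRET S3_ENDPOINT S3_BUCKET S3_ACCESS_KEY S3_SECRET_KEY \
             S3_USE_PATH_STYLE S3_REGION MAX_UPLOAD_SIZE_MB ENV PORT DOMAIN; do
    # A variable counts as set only if its line carries a non-empty value.
    if ! grep -Eq "^${var}=.+" "$file"; then
      echo "missing or empty: ${var}"
      status=1
    fi
  done
  # TEST_DATABASE_URL is dev/test only and should not reach production.
  if grep -q '^TEST_DATABASE_URL=' "$file"; then
    echo "remove TEST_DATABASE_URL (dev/test only)"
    status=1
  fi
  # stat flags differ between GNU (-c) and BSD/macOS (-f).
  mode=$(stat -c '%a' "$file" 2>/dev/null || stat -f '%Lp' "$file")
  if [ "$mode" != "600" ]; then
    echo "permissions are ${mode}, expected 600"
    status=1
  fi
  return "$status"
}
```

Run it as `check_env_file .env.prod && echo "env file looks complete"` from the directory that holds the file.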
### Deploying a new version
1. **Build and tag the new image** (same as first-time, with a new tag):
```
docker build -f Dockerfile -t ghcr.io/yourusername/xtablo:v0.2.0 .
docker push ghcr.io/yourusername/xtablo:v0.2.0 # if using a registry
```
2. **On the VM** — update `TAG` in `.env.prod`:
```
# Edit .env.prod:
TAG=v0.2.0
```
Or export it in the current shell without editing the file (a shell value takes precedence over the one in `.env.prod`):
```
export TAG=v0.2.0
```
3. **Pull and recreate only the changed services:**
```
docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
```
Compose recreates only the web and worker containers (their image tag changed).
Postgres and Caddy are unaffected. Migrations run automatically at web startup
(D-10) — `goose.Up()` is idempotent and skips already-applied migrations.
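`docker compose up -d` returns once the containers are started, which can be before migrations finish and the web service is ready. A minimal sketch of a post-deploy check follows, assuming `curl` is available where it runs; `retry` and `probe` are hypothetical helper names, not part of the repository.

```shell
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds or attempts run out.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# probe URL: succeed only on a successful response (curl -f fails on HTTP >= 400).
probe() {
  curl -fsS --max-time 5 -o /dev/null "$1"
}
```

For example, `retry 30 2 probe https://app.yourdomain.com/readyz && echo "ready"` polls for up to about a minute and exits non-zero if `/readyz` never succeeds, which is a reasonable trigger for the rollback procedure.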
## Rollback
Rollback means redeploying the previous image tag (D-11). No special tooling is
required — it is the same as deploying a new version, but with an older tag.
1. **On the VM** — set `TAG` to the previous tag in `.env.prod` (or inline):
```
export TAG=v0.1.0
```
2. **Redeploy:**
```
docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
```
Compose recreates web and worker with the old image. The rollback is complete.
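Editing `.env.prod` by hand during an incident is easy to get wrong. The sketch below shows one way to switch the tag in a single command; `set_tag` is a hypothetical helper, and it assumes `TAG` is stored as a plain `TAG=value` line.

```shell
# set_tag FILE TAG: replace the TAG= line in FILE, or append one if absent.
set_tag() {
  local file="$1" tag="$2" tmp
  tmp=$(mktemp)
  # Copy every line except an existing TAG= assignment, then append the new one.
  grep -v '^TAG=' "$file" > "$tmp" || true
  printf 'TAG=%s\n' "$tag" >> "$tmp"
  # Write back through the original path so the file keeps its permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

After `set_tag .env.prod v0.1.0`, run the redeploy command from step 2. If `TAG` is also exported in the current shell, run `unset TAG` first, because the shell value overrides the one in the env file.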
### Schema rollback (break-glass)
`goose.Up()` only applies pending migrations and never reverts applied ones, so rolling
back to a previous binary does not run `goose down`. In most cases this is fine: the old
binary ignores columns it does not know about.
If a migration introduced a schema change that is **incompatible** with the old binary
(e.g. a NOT NULL column without a default that the old binary does not supply), run a
manual goose down as a break-glass step:
1. Connect to Postgres inside the container:
```
docker exec -it <postgres-container-name> psql -U xtablo -d xtablo
```
(Find the container name with `docker compose -f docker-compose.prod.yaml ps`.)
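2. The production image is distroless, so the `goose` CLI is not inside the runtime
container. Either install the goose CLI on the VM (this needs a Go toolchain, which the
VM does not otherwise require) or run goose from a Docker image on the internal network:
```
# Install goose CLI on the VM (requires Go):
go install github.com/pressly/goose/v3/cmd/goose@latest
# DATABASE_URL must be set in this shell and reachable from the host; the
# compose hostname `postgres` resolves only inside the compose network.
goose -dir ./migrations postgres "$DATABASE_URL" down
```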
Or use an ephemeral container on the same compose network:
```
docker run --rm --network <compose-network> \
-e GOOSE_DRIVER=postgres \
-e GOOSE_DBSTRING="postgres://xtablo:<password>@postgres:5432/xtablo?sslmode=disable" \
-v "$(pwd)/migrations:/migrations" \
ghcr.io/kukymbr/goose-docker:latest \
goose -dir /migrations down
```
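`goose down` reverts only the most recent migration, so repeat it once for each migration
the old binary cannot handle. After that, the old binary will start cleanly.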
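Retyping the connection string for the ephemeral container invites typos in the password. A small sketch that reads it from `.env.prod` instead follows; `env_value` is a hypothetical helper and assumes simple, unquoted `KEY=value` lines.

```shell
# env_value FILE KEY: print the value of KEY from a KEY=value env file.
# If KEY appears more than once, the last assignment wins.
env_value() {
  local file="$1" key="$2"
  # cut -f2- keeps any '=' characters that appear inside the value itself.
  grep -E "^${key}=" "$file" | tail -n 1 | cut -d= -f2-
}
```

It can feed the container invocation directly, for example `-e GOOSE_DBSTRING="$(env_value .env.prod DATABASE_URL)"`. Running the goose `status` command with the same connection string before and after `down` shows which migrations are currently applied.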
## Incident Runbook
### /readyz returns 503
`/readyz` pings Postgres. A 503 means the web container cannot reach the database.
1. Check container status:
```
docker compose -f docker-compose.prod.yaml ps
```
2. If `postgres` is down or unhealthy, restart it:
```
docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d postgres
```
Then restart web and worker (they will wait for postgres to be healthy):
```
docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
```
3. Check web logs for the actual error:
```
docker compose -f docker-compose.prod.yaml logs web --tail=50
```
All application logs are JSON when `ENV=production` is set. Look for
`"level":"ERROR"` lines with a `"msg":"db ping failed"` or similar.
### Caddy TLS certificate errors
1. Check caddy logs:
```
docker compose -f docker-compose.prod.yaml logs caddy --tail=50
```
2. If you see "too many certificates already issued for" (Let's Encrypt rate limit,
RESEARCH Pitfall 4):
- Caddy hit the 5 duplicate certificates per week limit for the domain.
- Confirm the `caddy_data` named volume exists and is mounted — if the volume was
accidentally deleted, Caddy cannot reuse the cached certificate and must
re-issue on every restart, quickly exhausting the rate limit.
- Recovery options:
- Wait up to 1 week for the rate limit window to reset.
- Switch to the Let's Encrypt staging endpoint temporarily (see
"Let's Encrypt staging" in the First-time setup section above).
- Restore from a `caddy_data` volume backup if available.
3. If the `caddy_data` volume was lost:
```
# Verify the volume still exists:
docker volume ls | grep caddy_data
# If missing, the volume must be recreated (certificates will be re-issued):
docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d caddy
```
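After switching between the staging and production endpoints, it is worth confirming which CA issued the certificate Caddy is actually serving. The sketch below assumes `openssl` is installed and that Let's Encrypt staging issuers carry the marker `(STAGING)` in their name; both function names are hypothetical.

```shell
# cert_issuer HOST: print the issuer line of the certificate served on HOST:443.
cert_issuer() {
  echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
    | openssl x509 -noout -issuer
}

# is_staging_issuer ISSUER: succeed if the issuer string looks like a
# Let's Encrypt staging CA (assumption: the name contains "(STAGING)").
is_staging_issuer() {
  case "$1" in
    *"(STAGING)"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_staging_issuer "$(cert_issuer app.yourdomain.com)" && echo "still on staging"` flags a deployment where the staging block was left in `deploy/Caddyfile`.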
### Checking logs
Follow logs for any service:
```
docker compose -f docker-compose.prod.yaml logs web --tail=100 --follow
docker compose -f docker-compose.prod.yaml logs worker --tail=100 --follow
docker compose -f docker-compose.prod.yaml logs caddy --tail=100 --follow
docker compose -f docker-compose.prod.yaml logs postgres --tail=50
```
All application logs are JSON in production (`ENV=production` activates the slog
JSON handler). Pipe through `jq` for readable output:
```
docker compose -f docker-compose.prod.yaml logs web --follow --no-log-prefix | jq .
```
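Plain `jq .` stops with a parse error at the first line that is not valid JSON, which can happen if a process writes plain text to the same stream. A more tolerant sketch that also narrows the output to errors follows; `errors_only` is a hypothetical name, and the `level` field is the one described above.

```shell
# errors_only: read log lines on stdin and print only JSON objects whose
# level is ERROR. -R reads each line as a raw string, and fromjson? silently
# skips lines that do not parse as JSON.
errors_only() {
  jq -R -c 'fromjson? | select(type == "object" and .level == "ERROR")'
}
```

Append it to the log command, for example `docker compose -f docker-compose.prod.yaml logs web --no-log-prefix | errors_only`.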
### Debugging the distroless container
The runtime image (`gcr.io/distroless/static-debian12:nonroot`) has **no shell**
(RESEARCH Pitfall 7). You cannot `docker exec -it <web-container> sh`.
To debug network or filesystem issues, attach an ephemeral busybox container to the
same network:
```
# Find the web container ID:
docker compose -f docker-compose.prod.yaml ps
# Attach busybox to the web container's network namespace:
docker run --rm -it --network container:<web-container-id> busybox sh
```
From the busybox shell you can run `wget`, `nc`, `ping`, etc. to diagnose
connectivity. To join the compose network as a separate container instead (e.g. to reach `postgres:5432`):
```
docker run --rm -it \
--network "$(docker inspect <web-container-id> --format '{{range .NetworkSettings.Networks}}{{.NetworkID}}{{end}}')" \
busybox sh
```