docs(07-03): extend README with Deploy, Rollback, and Incident Runbook sections

- Deploy section: prerequisites, first-time setup, deploying new versions (DEPLOY-05)
- First-time setup documents DATABASE_URL internal URL, SESSION_SECRET generation, full S3/R2 var list, chmod 600 .env.prod reminder (T-07-10), TLS staging note
- Rollback section: image tag redeployment + break-glass schema rollback via goose CLI
- Incident Runbook: /readyz 503, Caddy TLS rate limits, log viewing, distroless debug (ephemeral busybox container technique for shell-less runtime image, RESEARCH Pitfall 7)

parent 273f0632be
commit f261fb39b8
1 changed file with 328 additions and 0 deletions

@@ -237,3 +237,331 @@ Run `just db-down` first if you also want to drop the Postgres container.
- File uploads + R2/S3 → Phase 5
- Real worker jobs → Phase 6
- Production deploy, Dockerfile, `/readyz` → Phase 7

## Deploy

The production host is a Hetzner VM running plain Docker Compose (D-01, D-02). No
Kubernetes or managed orchestration is needed: `docker compose up -d` on the VM is
the entire deployment mechanism. Postgres runs inside the compose stack (D-03); there
is no external managed database.

### Prerequisites

Install on the production VM before first deploy:

- **Docker** ≥ 24 with the **Docker Compose** plugin (`docker compose`, not the
  standalone `docker-compose` binary)
- **git** (optional; useful for pulling the repo directly onto the VM)

No other runtimes are needed. Go, Node, and all build tooling run in the Dockerfile's
multi-stage build and are not required on the VM.

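The prerequisites above can be checked before the first deploy. The following is a minimal sketch, not part of the repository: `version_ge` and `check_prereqs` are hypothetical helper names, and the comparison assumes a `sort` that supports `-V` (GNU coreutils on a typical Linux VM).

```shell
#!/usr/bin/env bash
# Sketch: verify the Docker prerequisites listed above.
# version_ge and check_prereqs are hypothetical helper names.

# Succeeds when dotted version $1 is >= dotted version $2 (relies on `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

check_prereqs() {
  local v
  # Ask the Docker daemon for its version; fails if Docker is not running.
  v="$(docker version --format '{{.Server.Version}}')" || return 1
  version_ge "$v" 24 || { echo "Docker $v is older than 24" >&2; return 1; }
  # The plugin form (`docker compose`) must work, not only `docker-compose`.
  docker compose version >/dev/null || { echo "Compose plugin missing" >&2; return 1; }
  echo "ok: Docker $v with Compose plugin"
}
```

Source the file on the VM and run `check_prereqs`; a non-zero exit status means one of the prerequisites is not met.
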
### First-time setup

Run all commands on the VM via SSH unless noted otherwise.

1. **SSH to the VM.**

   ```
   ssh user@<vm-ip>
   ```

2. **Copy the `backend/` directory to the VM** (or clone the repo).

   ```
   # Option A: rsync from the local machine
   rsync -av --exclude '.git' backend/ user@<vm-ip>:~/xtablo/

   # Option B: clone the repo directly on the VM
   git clone <repo-url> ~/xtablo && cd ~/xtablo/backend
   ```

3. **Create `.env.prod`** by copying `.env.example` and filling in real values.

   ```
   cp .env.example .env.prod
   chmod 600 .env.prod   # restrict read access: the file contains secrets (T-07-10)
   ```

   Mandatory variables to set in `.env.prod`:

   | Variable | Value |
   |---|---|
   | `DATABASE_URL` | `postgres://xtablo:<POSTGRES_PASSWORD>@postgres:5432/xtablo?sslmode=disable` (internal compose network; the hostname is `postgres`) |
   | `POSTGRES_PASSWORD` | Strong random password (also used by the postgres service). Generate with: `openssl rand -hex 24` (hex output needs no URL-escaping inside `DATABASE_URL`) |
   | `POSTGRES_USER` | `xtablo` (or your custom user; must match `DATABASE_URL`) |
   | `POSTGRES_DB` | `xtablo` (or your custom db; must match `DATABASE_URL`) |
   | `SESSION_SECRET` | 32 random bytes, hex-encoded. Generate with: `openssl rand -hex 32` |
   | `S3_ENDPOINT` | R2 endpoint URL: `https://<account-id>.r2.cloudflarestorage.com` |
   | `S3_BUCKET` | R2 bucket name |
   | `S3_ACCESS_KEY` | R2 API token key ID |
   | `S3_SECRET_KEY` | R2 API token secret |
   | `S3_USE_PATH_STYLE` | `false` for Cloudflare R2 (virtual-hosted-style URLs) |
   | `S3_REGION` | `auto` or `us-east-1` (R2 accepts both) |
   | `MAX_UPLOAD_SIZE_MB` | `25` (or your preferred limit) |
   | `ENV` | `production` (activates JSON slog handler) |
   | `PORT` | `8080` |
   | `DOMAIN` | `app.yourdomain.com` (Caddy reads this for TLS) |

   Do **not** include `TEST_DATABASE_URL` in `.env.prod`. It is a dev/test-only
   variable and is not used by the runtime binaries.

4. **Build the Docker image** (from inside `backend/`, either locally or on the VM).

   ```
   # From inside backend/
   docker build -f Dockerfile -t ghcr.io/yourusername/xtablo:v0.1.0 .
   ```

   If building locally, push to a registry and pull on the VM:

   ```
   docker push ghcr.io/yourusername/xtablo:v0.1.0
   # On the VM:
   docker pull ghcr.io/yourusername/xtablo:v0.1.0
   ```

5. **Set image coordinates as environment variables** (used by `docker-compose.prod.yaml`).
   The exports last only for the current shell session.

   ```
   export IMAGE=ghcr.io/yourusername/xtablo
   export TAG=v0.1.0
   ```

6. **Start the stack.**

   ```
   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
   ```

   The postgres service must pass its healthcheck before web and worker start.
   Migrations run automatically at web startup via `goose.Up()` (D-10).

7. **Verify the deployment.**

   ```
   curl https://app.yourdomain.com/healthz   # → {"status":"ok"}
   curl https://app.yourdomain.com/readyz    # → {"status":"ok","db":"ok"}
   ```

   If the domain is not yet configured, use the VM's public IP temporarily with
   HTTP (Caddy will not yet have a certificate):

   ```
   curl http://<vm-ip>:80/healthz
   ```

8. **Let's Encrypt staging (for initial TLS testing).**

   To avoid hitting Let's Encrypt production rate limits (5 duplicate certificates
   per week for the same set of hostnames) during initial setup, uncomment the
   staging global block in `deploy/Caddyfile`:

   ```
   {
       acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
   }
   ```

   Restart Caddy after editing (`docker compose -f docker-compose.prod.yaml restart caddy`),
   verify TLS works (browsers will show a staging cert warning; that is expected),
   then remove the global block and clear the `caddy_data` volume to issue a real
   production certificate.

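Before starting the stack, the mandatory variables from step 3 can be checked mechanically. This is a minimal sketch under stated assumptions: `check_env_file` is a hypothetical helper, and it only tests that each name from the table appears as `NAME=value` at the start of a line, not that the value is valid.

```shell
#!/usr/bin/env bash
# Sketch: check that an env file defines every mandatory variable from the
# table in step 3. check_env_file is a hypothetical helper name.
check_env_file() {
  local file="$1" failed=0 var
  for var in DATABASE_URL POSTGRES_PASSWORD POSTGRES_USER POSTGRES_DB \
             SESSION_SECRET S3_ENDPOINT S3_BUCKET S3_ACCESS_KEY S3_SECRET_KEY \
             S3_USE_PATH_STYLE S3_REGION MAX_UPLOAD_SIZE_MB ENV PORT DOMAIN; do
    # A variable counts as set when a line starts with NAME= plus at least one character.
    if ! grep -Eq "^${var}=.+" "$file"; then
      echo "missing or empty: $var" >&2
      failed=1
    fi
  done
  # The dev/test-only variable must stay out of the production file.
  if grep -q '^TEST_DATABASE_URL=' "$file"; then
    echo "TEST_DATABASE_URL must not be set in production" >&2
    failed=1
  fi
  return "$failed"
}
```

For example, `check_env_file .env.prod && echo "env file looks complete"` prints the confirmation only when nothing is missing.
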
### Deploying a new version

1. **Build and tag the new image** (same as first-time, with a new tag):

   ```
   docker build -f Dockerfile -t ghcr.io/yourusername/xtablo:v0.2.0 .
   docker push ghcr.io/yourusername/xtablo:v0.2.0   # if using a registry
   ```

2. **On the VM**, update `TAG` in `.env.prod`:

   ```
   # Edit .env.prod:
   TAG=v0.2.0
   ```

   Or export it in the current shell without editing the file (Compose gives shell
   variables precedence over `--env-file` values during interpolation):

   ```
   export TAG=v0.2.0
   ```

3. **Pull and recreate only the changed services:**

   ```
   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
   ```

   Compose recreates only the web and worker containers (their image tag changed).
   Postgres and Caddy are unaffected. Migrations run automatically at web startup
   (D-10); `goose.Up()` is idempotent and skips already-applied migrations.

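Because migrations run at web startup, `/readyz` may take a moment to answer after `up -d` returns. A readiness poll can follow a deploy; the sketch below is one way to do it, assuming `curl` is available where it runs. `wait_ready` is a hypothetical helper, and the attempt count and delay are illustrative defaults.

```shell
#!/usr/bin/env bash
# Sketch: poll a readiness URL until it answers 200 or the attempts run out.
# wait_ready is a hypothetical helper; defaults are illustrative.
wait_ready() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i=1 code=000
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress, -o discards the body, -w prints only the status code.
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url")" || code=000
    if [ "$code" = 200 ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts (last status: $code)" >&2
  return 1
}
```

For example, `wait_ready https://app.yourdomain.com/readyz` waits up to roughly a minute with the defaults and returns non-zero if the service never reports ready.
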
## Rollback

Rollback means redeploying the previous image tag (D-11). No special tooling is
required: it is the same as deploying a new version, but with an older tag.

1. **On the VM**, set `TAG` to the previous tag in `.env.prod` (or export it in the shell):

   ```
   export TAG=v0.1.0
   ```

2. **Redeploy:**

   ```
   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
   ```

Compose recreates web and worker with the old image. The rollback is complete.

### Schema rollback (break-glass)

`goose.Up()` only applies pending migrations, so rolling back to a previous binary
does not automatically run `goose down`. In most cases this is fine: the old binary
ignores columns it does not know about.

If a migration introduced a schema change that is **incompatible** with the old binary
(e.g. a NOT NULL column without a default that the old binary does not supply), run a
manual `goose down` as a break-glass step. Each `goose down` reverts only the most
recently applied migration, so repeat it if more than one must be reverted.

1. Connect to Postgres inside the container to check which migrations are applied
   (goose records them in the `goose_db_version` table by default):

   ```
   docker exec -it <postgres-container-name> psql -U xtablo -d xtablo
   ```

   (Find the container name with `docker compose -f docker-compose.prod.yaml ps`.)

2. The production image is distroless, so the `goose` CLI is not inside the runtime
   container. Install the goose CLI separately on the VM or use the goose Docker
   image against the internal network:

   ```
   # Install goose CLI on the VM (requires a Go toolchain on the VM):
   go install github.com/pressly/goose/v3/cmd/goose@latest
   # The `postgres` hostname resolves only inside the compose network, so the
   # URL passed here must be one that is reachable from the VM host.
   goose -dir ./migrations postgres "$DATABASE_URL" down
   ```

   Or use an ephemeral container on the same compose network:

   ```
   docker run --rm --network <compose-network> \
     -e GOOSE_DRIVER=postgres \
     -e GOOSE_DBSTRING="postgres://xtablo:<password>@postgres:5432/xtablo?sslmode=disable" \
     -v "$(pwd)/migrations:/migrations" \
     ghcr.io/kukymbr/goose-docker:latest \
     goose -dir /migrations down
   ```

After reverting the migration, the old binary will start cleanly.

## Incident Runbook

### /readyz returns 503

`/readyz` pings Postgres. A 503 means the web container cannot reach the database.

1. Check container status:

   ```
   docker compose -f docker-compose.prod.yaml ps
   ```

2. If `postgres` is down or unhealthy, restart it:

   ```
   docker compose -f docker-compose.prod.yaml up -d postgres
   ```

   Then restart web and worker (they will wait for postgres to be healthy):

   ```
   docker compose -f docker-compose.prod.yaml up -d
   ```

3. Check web logs for the actual error:

   ```
   docker compose -f docker-compose.prod.yaml logs web --tail=50
   ```

   All application logs are JSON when `ENV=production` is set. Look for
   `"level":"ERROR"` lines with a `"msg":"db ping failed"` or similar.

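When the log tail is noisy, the ERROR lines mentioned in step 3 can be isolated with a plain text filter. This is a small sketch, assuming the compact `"level":"ERROR"` encoding quoted above (no space after the colon); `only_errors` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: keep only ERROR-level lines from JSON logs read on stdin.
# only_errors is a hypothetical helper; it matches the literal text
# "level":"ERROR" and does not parse the JSON.
only_errors() {
  grep -F '"level":"ERROR"'
}
```

A possible invocation is `docker compose -f docker-compose.prod.yaml logs web --no-log-prefix --tail=200 | only_errors`. The function exits non-zero when no ERROR line is present, which is `grep`'s normal behavior.
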
### Caddy TLS certificate errors

1. Check caddy logs:

   ```
   docker compose -f docker-compose.prod.yaml logs caddy --tail=50
   ```

2. If you see "too many certificates already issued for" (Let's Encrypt rate limit,
   RESEARCH Pitfall 4):
   - Caddy hit the limit of 5 duplicate certificates per week for the domain.
   - Confirm the `caddy_data` named volume exists and is mounted. If the volume was
     accidentally deleted, Caddy cannot reuse the cached certificate and must
     re-issue on every restart, quickly exhausting the rate limit.
   - Recovery options:
     - Wait up to 1 week for the rate limit window to reset.
     - Switch to the Let's Encrypt staging endpoint temporarily (see
       "Let's Encrypt staging" in the First-time setup section above).
     - Restore from a `caddy_data` volume backup if available.

3. If the `caddy_data` volume was lost:

   ```
   # Verify the volume still exists:
   docker volume ls | grep caddy_data

   # If missing, the volume must be recreated (certificates will be re-issued):
   docker compose -f docker-compose.prod.yaml up -d caddy
   ```

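Restoring from a backup presupposes that one was taken. One common way to archive a named volume is sketched below; it is not taken from the repository. `backup_volume` is a hypothetical helper, and the real volume name is an assumption to verify with `docker volume ls`, since Compose usually prefixes it with the project name.

```shell
#!/usr/bin/env bash
# Sketch: archive a named Docker volume into a tarball in a host directory.
# backup_volume is a hypothetical helper. Pass an absolute host path as the
# destination, because it is used as a bind mount.
backup_volume() {
  local volume="$1" dest_dir="$2" archive
  archive="${volume}-$(date +%Y%m%d-%H%M%S).tar.gz"
  # Mount the volume read-only and write the archive into the bind-mounted directory.
  docker run --rm \
    -v "${volume}:/data:ro" \
    -v "${dest_dir}:/backup" \
    busybox tar czf "/backup/${archive}" -C /data . \
    && echo "${dest_dir}/${archive}"
}
```

For example, `backup_volume <project>_caddy_data /root/backups` prints the path of the archive it wrote, where `<project>` stands for whatever prefix `docker volume ls` shows.
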
### Checking logs

Follow logs for any service:

```
docker compose -f docker-compose.prod.yaml logs web --tail=100 --follow
docker compose -f docker-compose.prod.yaml logs worker --tail=100 --follow
docker compose -f docker-compose.prod.yaml logs caddy --tail=100 --follow
docker compose -f docker-compose.prod.yaml logs postgres --tail=50
```

All application logs are JSON in production (`ENV=production` activates the slog
JSON handler). Pipe through `jq` for readable output:

```
docker compose -f docker-compose.prod.yaml logs web --follow --no-log-prefix | jq .
```

### Debugging the distroless container

The runtime image (`gcr.io/distroless/static-debian12:nonroot`) has **no shell**
(RESEARCH Pitfall 7). You cannot `docker exec -it <web-container> sh`.

To debug network issues, attach an ephemeral busybox container to the web
container's network namespace. Only the network is shared; the web container's
filesystem is not visible from busybox.

```
# Find the web container ID:
docker compose -f docker-compose.prod.yaml ps

# Attach busybox to the web container's network namespace:
docker run --rm -it --network container:<web-container-id> busybox sh
```

From the busybox shell you can run `wget`, `nc`, `ping`, etc. to diagnose
connectivity. To join the compose network as a separate container instead
(e.g. to reach `postgres:5432`):

```
docker run --rm -it \
  --network "$(docker inspect <web-container-id> --format '{{range .NetworkSettings.Networks}}{{.NetworkID}}{{end}}')" \
  busybox sh
```

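Inside the busybox shell, a small wrapper around `wget` makes repeated probes easier to read. This is a sketch only: `probe_http` is a hypothetical helper, and the port in the usage example assumes the `PORT=8080` value from the first-time setup was kept.

```shell
# Sketch for use inside the busybox shell (plain sh).
# probe_http is a hypothetical helper name.
probe_http() {
  # -q: quiet, -T 5: five-second timeout, -O -: print the response body to stdout.
  if wget -q -T 5 -O - "$1"; then
    echo
    echo "OK $1"
  else
    echo "FAIL $1" >&2
    return 1
  fi
}
```

When busybox shares the web container's network namespace, `probe_http http://127.0.0.1:8080/readyz` talks to the application directly and bypasses Caddy, which helps separate application problems from proxy or TLS problems.
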