# Xtablo backend

Go + HTMX + Postgres. Phase 1: Walking Skeleton.

This README is the contract for FOUND-05: a developer with the prerequisites below
should be able to clone the repo, follow the Quickstart, and see the HTMX-driven
page within ~5 minutes.

## Prerequisites

Install these on your dev machine before starting:

- **Go** ≥ 1.22 (this project's `go.mod` declares 1.26)
- **just** — task runner (`brew install just` on macOS, `cargo install just`, or see
  <https://github.com/casey/just>)
- **podman** with `podman compose` (preferred per D-11) **or** **docker** with
  `docker compose`
- **curl**
- **git**

You do **not** need to install `goose`, `templ`, `sqlc`, `air`, the Tailwind CLI, or
`htmx.min.js` — `just bootstrap` installs the Go tools into `$GOBIN` and
bootstrap-downloads the Tailwind binary and HTMX script into local, gitignored
paths.

## Quickstart

Clone-to-running-page in ~5 minutes. Run from inside `backend/`.

```
cd backend
cp .env.example .env       # adjust DATABASE_URL if Postgres is not on localhost:5432
just bootstrap             # installs goose/templ/sqlc/air; bootstrap-downloads tailwindcss + htmx.min.js
just db-up                 # starts postgres via podman compose (see fallback below)
just migrate up            # applies migrations from ./migrations
just dev                   # terminal 1: brings up db, runs generate, then air on :8080

# in a SECOND terminal:
just styles-watch          # rebuilds static/tailwind.css on .templ / .go changes

# open http://localhost:8080
```

The page should render with a "Fetch server time" button. Clicking it swaps an
ISO-8601 timestamp into the page via HTMX. If the page shows "No time fetched
yet." and nothing happens on click, see Troubleshooting.
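
For orientation, the round-trip behind the button can be sketched in plain Go. This is a minimal illustration, not the project's handler: the real fragment is rendered by templ, and the names `timeFragment`, `demoTimeHandler`, `probe`, and the `server-time` element id are assumptions made for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// timeFragment renders the HTML fragment that HTMX swaps into the page.
// The markup and element id are hypothetical; the project uses a templ fragment.
func timeFragment(now time.Time) string {
	return fmt.Sprintf(`<span id="server-time">%s</span>`, now.UTC().Format(time.RFC3339))
}

// demoTimeHandler answers the hx-get request with the fragment only, not a full page.
func demoTimeHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	fmt.Fprint(w, timeFragment(time.Now()))
}

// probe calls the handler in-process and returns the status code and body.
func probe() (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/demo/time", nil)
	req.Header.Set("HX-Request", "true") // header HTMX adds to its requests
	rec := httptest.NewRecorder()
	demoTimeHandler(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe()
	fmt.Println(code, body)
}
```

The handler returns only the fragment, which is what lets HTMX swap the timestamp into the existing page without a full reload.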

`bootstrap` is the slowest step (Go tool installs + two HTTP downloads). It only
needs to run once per clone.

## docker compose fallback

`compose.yaml` is portable across podman and docker — the service definition is
identical. If you don't have podman:

- Read `podman compose` as `docker compose` throughout this README.
- The `just db-up` / `just db-down` recipes call `podman compose` directly. Run
  `docker compose up -d postgres` / `docker compose down` instead, and continue
  with the rest of the Quickstart unchanged.

(Decision D-11.)

## Project layout

```
backend/
  cmd/
    web/main.go            # HTTP server entry point
    worker/main.go         # background worker — river periodic jobs (Phase 6)
  internal/
    db/                    # pgxpool wiring + sqlc-generated queries
    web/                   # chi router, handlers, middleware, design-system
      ui/                  # custom templ component library (Button, Card, Badge)
    session/               # placeholder — Phase 2
    tablos/                # placeholder — Phase 3
    tasks/                 # placeholder — Phase 4
    files/                 # placeholder — Phase 5
  migrations/              # goose .sql migrations
  templates/               # .templ files (layout, index, fragments)
  static/
    htmx.min.js            # bootstrap-downloaded by `just bootstrap`; gitignored; no runtime CDN
    tailwind.css           # generated by the Tailwind standalone CLI
  bin/                     # gitignored — tailwindcss CLI binary, etc.
  .air.toml                # air live-reload config
  .env.example             # committed; copy to .env
  compose.yaml             # local Postgres
  go.mod / go.sum
  justfile                 # task runner recipes — the source of truth for commands
  sqlc.yaml
  tailwind.input.css
  README.md
```

HTMX is served from `/static/htmx.min.js` at runtime — no CDN. The justfile's
bootstrap-time `unpkg.com` URL is the single authoritative version pin (D-10).

## Environment variables

`backend/.env` is gitignored; `backend/.env.example` is committed and lists the
keys consumed by `cmd/web` and `cmd/worker`. Local Just recipes load
`backend/.env` automatically, so `just dev` will pick up provider credentials
such as `GOOGLE_CLIENT_ID`.

| Variable                 | Description                                                              | Default                                                          |
| ------------------------ | ------------------------------------------------------------------------ | ---------------------------------------------------------------- |
| `DATABASE_URL`           | Postgres DSN used by the web + worker binaries and by `just migrate`     | `postgres://xtablo:xtablo@localhost:5432/xtablo?sslmode=disable` |
| `PORT`                   | HTTP port for `cmd/web`                                                  | `8080`                                                           |
| `ENV`                    | `development` enables slog's text handler; `production` switches to JSON | `development`                                                    |
| `GOOGLE_CLIENT_ID`       | Google OAuth client ID                                                   | blank                                                            |
| `GOOGLE_CLIENT_SECRET`   | Google OAuth client secret                                               | blank                                                            |
| `GOOGLE_REDIRECT_URL`    | Google callback URL, usually `/auth/google/callback`                     | `http://localhost:8080/auth/google/callback`                     |

Google config is optional in local development. When it is missing, the login
and signup pages keep the Google button visible but disabled with a
not-configured label. No real provider secrets should be committed to
`.env.example`. Apple sign-in is disabled in the current product surface.
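
As a rough sketch of how the table's defaults might be applied at startup, the snippet below reads each key and falls back to the documented default. The `Config` struct, `loadConfig`, `orDefault`, and `GoogleEnabled` are hypothetical names, and treating Google as "configured" only when both the client ID and secret are set is an assumption, not something the project source confirms.

```go
package main

import (
	"fmt"
	"os"
)

// Config mirrors the variables in the table above (hypothetical type).
type Config struct {
	DatabaseURL       string
	Port              string
	Env               string
	GoogleClientID    string
	GoogleSecret      string
	GoogleRedirectURL string
}

// orDefault returns value unless it is empty, in which case it returns fallback.
func orDefault(value, fallback string) string {
	if value == "" {
		return fallback
	}
	return value
}

// loadConfig reads every key through lookup, so it can be exercised
// without touching the real process environment.
func loadConfig(lookup func(string) string) Config {
	return Config{
		DatabaseURL:       orDefault(lookup("DATABASE_URL"), "postgres://xtablo:xtablo@localhost:5432/xtablo?sslmode=disable"),
		Port:              orDefault(lookup("PORT"), "8080"),
		Env:               orDefault(lookup("ENV"), "development"),
		GoogleClientID:    lookup("GOOGLE_CLIENT_ID"),
		GoogleSecret:      lookup("GOOGLE_CLIENT_SECRET"),
		GoogleRedirectURL: orDefault(lookup("GOOGLE_REDIRECT_URL"), "http://localhost:8080/auth/google/callback"),
	}
}

// GoogleEnabled reports whether both OAuth credentials are present (assumed rule).
func (c Config) GoogleEnabled() bool {
	return c.GoogleClientID != "" && c.GoogleSecret != ""
}

func main() {
	cfg := loadConfig(os.Getenv)
	fmt.Printf("port=%s env=%s google=%t\n", cfg.Port, cfg.Env, cfg.GoogleEnabled())
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the defaulting logic testable with a plain closure.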

## Common commands

Every command in this table is a recipe in `backend/justfile`.

| Recipe                                          | What it does                                                                 | When to use                                              |
| ----------------------------------------------- | ---------------------------------------------------------------------------- | -------------------------------------------------------- |
| `just bootstrap`                                | Installs Go CLI tools (`goose`, `templ`, `sqlc`, `air`); bootstrap-downloads `bin/tailwindcss` and `static/htmx.min.js` | Once per clone; re-run after deleting `bin/` or `static/htmx.min.js` |
| `just db-up`                                    | Starts the local Postgres container                                          | Before `just migrate up` / `just dev` if not already running |
| `just db-down`                                  | Stops the local Postgres container                                           | When you're done for the day                             |
| `just migrate up` / `migrate down` / `migrate status` | Applies / reverts / inspects goose migrations against `DATABASE_URL`     | After `just db-up`, or any time you change `migrations/` |
| `just generate`                                 | One-shot: `templ generate`, `sqlc generate`, Tailwind compile to `static/tailwind.css` | After editing `.templ`, query SQL, or `tailwind.input.css` |
| `just styles-watch`                             | Tailwind standalone CLI in `--watch` mode                                    | In a second terminal alongside `just dev` (D-14)         |
| `just dev`                                      | Loads `backend/.env`, brings up Postgres, runs `just generate`, then runs `air` for Go live-reload on `:8080` | Main dev loop, terminal 1                                |
| `just test`                                     | `templ generate` then `go test ./...`                                        | Before committing                                        |
| `just lint`                                     | `go vet ./...` and `gofmt -l` check                                          | Before committing                                        |
| `just build`                                    | Generates assets, then builds `bin/web` and `bin/worker`                     | Producing release binaries locally                       |
| `just clean`                                    | Removes `bin/`, `tmp/`, `static/htmx.min.js`, `static/tailwind.css`, and `*_templ.go` files | Reset to a fresh-clone state without dropping the Postgres volume |

## Running the Worker

`cmd/worker` is the background job processor. It runs river periodic jobs against
the same Postgres as `cmd/web`. Start it with:

```
just worker
```

The recipe runs `just db-up` first as a dependency. MinIO must also be running,
because the orphan-file cleanup job uses it; if it is not, the worker exits on
startup with "file store init failed".

### What to expect

- Structured logs appear immediately at startup.
- A `"worker ready"` log line appears within a few seconds after `rivermigrate`
  and S3 init complete.
- A `"worker heartbeat"` log line appears almost immediately: the heartbeat job
  is configured with `RunOnStart: true`, so it fires on the first scheduler tick,
  within seconds of startup.
- Subsequent heartbeat logs appear every ~1 minute.
- The orphan-file cleanup job runs every hour (no `RunOnStart` — first run is
  ~1 hour after startup).

### Single-worker constraint

**Run only one worker process at a time (v1).** River uses advisory locks for
leader election, and concurrent `rivermigrate` runs are unsafe. Do not run multiple
worker instances against the same database in this version.

### Graceful shutdown

Send SIGINT (Ctrl+C) and observe:

```
{"level":"INFO","msg":"shutting down"}
{"level":"INFO","msg":"shutdown complete"}
```

The worker calls `riverClient.StopAndCancel` with a 10-second timeout, which
cancels in-flight job contexts and waits for goroutines to exit before closing
the pool.
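
The bounded-shutdown pattern described above can be sketched as follows. The `stopper` interface matches the shape of river's `StopAndCancel(ctx) error`, but `fakeClient`, `shutdown`, and `stopsWithin` are hypothetical stand-ins so the example runs without a database. In a real worker the call would typically follow a `signal.NotifyContext` wait on `os.Interrupt`; that wiring is left out here so the program terminates on its own.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stopper is the single method this sketch needs from the job client.
type stopper interface {
	StopAndCancel(ctx context.Context) error
}

// fakeClient stands in for the river client; it needs `busy` to wind down.
type fakeClient struct{ busy time.Duration }

func (f fakeClient) StopAndCancel(ctx context.Context) error {
	select {
	case <-time.After(f.busy): // in-flight jobs finished
		return nil
	case <-ctx.Done(): // the shutdown budget ran out first
		return ctx.Err()
	}
}

// shutdown gives the client at most `timeout` to stop and reports
// whether it finished inside that budget.
func shutdown(c stopper, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return c.StopAndCancel(ctx) == nil
}

// stopsWithin is a convenience wrapper taking milliseconds.
func stopsWithin(busyMs, timeoutMs int) bool {
	client := fakeClient{busy: time.Duration(busyMs) * time.Millisecond}
	return shutdown(client, time.Duration(timeoutMs)*time.Millisecond)
}

func main() {
	fmt.Println("quick job, 10 s budget:", stopsWithin(5, 10_000))
	fmt.Println("stuck job, 50 ms budget:", stopsWithin(1_000, 50))
}
```

Only after this call returns, whether cleanly or on timeout, would the connection pool be closed, so in-flight jobs never see a closed pool while they still hold a live context.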

### Observing failed job retries

River logs each failure via the `SlogErrorHandler`. A failed job produces a log
line like:

```
{"level":"ERROR","msg":"job error","job_id":42,"job_kind":"heartbeat","attempt":1,"max_attempts":25,"err":"..."}
```

River makes up to 25 attempts with exponential backoff between them (`attempt^4` seconds + jitter).
After 25 failed attempts the job is moved to the discarded state in `river_job`.
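
To get a feel for what that schedule means in practice, the arithmetic can be worked through in a few lines. This is only an illustration of the formula, reading the base delay as `attempt^4` seconds and ignoring jitter; `retryDelaySeconds` and `totalRetrySeconds` are names invented for the example, not River APIs.

```go
package main

import "fmt"

const maxAttempts = 25 // attempt limit stated above

// retryDelaySeconds returns the base wait after the given failed attempt
// (attempt^4 seconds), without the random jitter.
func retryDelaySeconds(attempt int) int {
	return attempt * attempt * attempt * attempt
}

// totalRetrySeconds sums the base waits between the first and the last attempt.
// With N attempts there are N-1 waits.
func totalRetrySeconds(attempts int) int {
	total := 0
	for a := 1; a < attempts; a++ {
		total += retryDelaySeconds(a)
	}
	return total
}

func main() {
	for _, a := range []int{1, 2, 3, 10, 24} {
		fmt.Printf("after attempt %2d: wait ~%d s\n", a, retryDelaySeconds(a))
	}
	total := totalRetrySeconds(maxAttempts)
	fmt.Printf("all %d attempts span ~%.1f days\n", maxAttempts, float64(total)/86400)
}
```

The first few waits are short (1 s, 16 s, 81 s), but the last one is close to four days, so a job that keeps failing stays in the retry cycle for roughly three weeks before it is discarded.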

## Troubleshooting

The three issues most likely to trip you up on a fresh clone:

- **"Fresh clone fails to build with `undefined: templates.Index`"** — Templ
  generates `*_templ.go` files from `.templ` sources, and those generated files
  are not committed. Run `just generate` (or `just dev`, which calls it) before
  invoking `go build` directly. (Pitfall 1.)

- **"First request to `/healthz` returns 503 right after `just db-up`"** — The
  Postgres container needs ~5–10 seconds to become healthy after `podman compose
  up -d` returns. Check `podman compose ps` (or `docker compose ps`) for the
  `healthy` status, or just wait and retry. Subsequent calls succeed. The 503
  during warm-up is correct behavior, not a bug. (Pitfall 2.)

- **"Tailwind classes used in `.templ` files don't appear in the compiled CSS"** —
  Tailwind v4 only scans content paths declared via `@source` in
  `tailwind.input.css`. Confirm the file contains `@source
  "../templates/**/*.templ";` (and equivalent globs for `internal/web/**/*.go`).
  Re-run `just styles-watch` so the watcher picks up the config change.
  (Pitfall 3.)

If something else is wrong and you want a clean slate without dropping the
Postgres volume:

```
just clean              # removes bin/, tmp/, static/htmx.min.js, static/tailwind.css, *_templ.go
just bootstrap          # re-download tools and assets
just dev                # back to a working state
```

Run `just db-down` first if you also want to drop the Postgres container.

## What Phase 1 ships (and doesn't)

**Ships:**

- Project scaffold (`go.mod`, justfile, `.air.toml`, `tailwind.input.css`,
  `sqlc.yaml`, `compose.yaml`)
- Local Postgres via `compose.yaml` (`pg_isready` healthcheck)
- goose migration pipeline (`migrations/0001_init.sql` is a no-op bootstrap)
- chi router with `/`, `/healthz`, `/demo/time`, `/static/*`
- slog-based structured logging with RequestID middleware
- Graceful HTTP shutdown
- pgxpool wiring exercised by `/healthz`
- templ + HTMX demo (root page + `hx-get` round-trip to a templ fragment)
- Custom templ design-system package at `internal/web/ui/` (Button, Card, Badge)
- Live-reload dev loop (`just dev` + `just styles-watch`)
- `cmd/worker` skeleton (boot, log, idle, shutdown)

**Does not ship — deferred:**

- Authentication, sessions, users → Phase 2
- Tablos CRUD → Phase 3
- Tasks / kanban → Phase 4
- File uploads + R2/S3 → Phase 5
- Real worker jobs → Phase 6
- Production deploy, Dockerfile, `/readyz` → Phase 7

## Deploy

The production host is a Hetzner VM running plain Docker Compose (D-01, D-02). No
Kubernetes or managed orchestration is needed — `docker compose up -d` on the VM is
the entire deployment mechanism. Postgres runs inside the compose stack (D-03); there
is no external managed database.

### Prerequisites

Install on the production VM before first deploy:

- **Docker** ≥ 24 with the **Docker Compose** plugin (`docker compose` — not the
  standalone `docker-compose` binary)
- **git** (optional — useful for pulling the repo directly onto the VM)

No other runtimes are needed. Go, Node, and all build tooling run in the Dockerfile's
multi-stage build and are not required on the VM.

### First-time setup

Run all commands on the VM via SSH unless noted otherwise.

1. **SSH to the VM.**

   ```
   ssh user@<vm-ip>
   ```

2. **Copy the `backend/` directory to the VM** (or clone the repo).

   ```
   # Option A — rsync from local machine:
   rsync -av --exclude '.git' backend/ user@<vm-ip>:~/xtablo/

   # Option B — clone the repo directly on the VM:
   git clone <repo-url> ~/xtablo && cd ~/xtablo/backend
   ```

3. **Create `.env.prod`** by copying `.env.example` and filling in real values.

   ```
   cp .env.example .env.prod
   chmod 600 .env.prod      # restrict read access — file contains secrets (T-07-10)
   ```

   Mandatory variables to set in `.env.prod`:

   | Variable | Value |
   |---|---|
   | `DATABASE_URL` | `postgres://xtablo:<POSTGRES_PASSWORD>@postgres:5432/xtablo?sslmode=disable` (internal compose network — hostname is `postgres`) |
   | `POSTGRES_PASSWORD` | Strong random password (also used by the postgres service). Generate with: `openssl rand -hex 24` |
   | `POSTGRES_USER` | `xtablo` (or your custom user; must match `DATABASE_URL`) |
   | `POSTGRES_DB` | `xtablo` (or your custom db; must match `DATABASE_URL`) |
   | `SESSION_SECRET` | 32 random bytes hex-encoded. Generate with: `openssl rand -hex 32` |
   | `S3_ENDPOINT` | R2 endpoint URL: `https://<account-id>.r2.cloudflarestorage.com` |
   | `S3_BUCKET` | R2 bucket name |
   | `S3_ACCESS_KEY` | R2 API token key ID |
   | `S3_SECRET_KEY` | R2 API token secret |
   | `S3_USE_PATH_STYLE` | `false` for Cloudflare R2 (virtual-hosted-style URLs) |
   | `S3_REGION` | `auto` or `us-east-1` (R2 accepts both) |
   | `MAX_UPLOAD_SIZE_MB` | `25` (or your preferred limit) |
   | `ENV` | `production` (activates JSON slog handler) |
   | `PORT` | `8080` |
   | `DOMAIN` | `app.yourdomain.com` (Caddy reads this for TLS) |

   Do **not** include `TEST_DATABASE_URL` in `.env.prod` — it is a dev/test-only
   variable and is not used by the runtime binaries.

4. **Build the Docker image** (from inside `backend/` — either locally or on the VM).

   ```
   # From inside backend/
   docker build -f Dockerfile -t ghcr.io/yourusername/xtablo:v0.1.0 .
   ```

   If building locally, push to a registry and pull on the VM:

   ```
   docker push ghcr.io/yourusername/xtablo:v0.1.0
   # On the VM:
   docker pull ghcr.io/yourusername/xtablo:v0.1.0
   ```

5. **Set image coordinates as environment variables** (used by `docker-compose.prod.yaml`).

   ```
   export IMAGE=ghcr.io/yourusername/xtablo
   export TAG=v0.1.0
   ```

6. **Start the stack.**

   ```
   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
   ```

   The postgres service must pass its healthcheck before web and worker start.
   Migrations run automatically at web startup via `goose.Up()` (D-10).

7. **Verify the deployment.**

   ```
   curl https://app.yourdomain.com/healthz   # → {"status":"ok"}
   curl https://app.yourdomain.com/readyz    # → {"status":"ok","db":"ok"}
   ```

   If the domain is not yet configured, use the VM's public IP temporarily with
   HTTP (Caddy will not yet have a certificate):

   ```
   curl http://<vm-ip>:80/healthz
   ```

8. **Let's Encrypt staging (for initial TLS testing).**

   To avoid hitting Let's Encrypt production rate limits (5 duplicate certificates
   per week for the same set of hostnames) during initial setup, uncomment the
   staging global block in `deploy/Caddyfile`:

   ```
   {
     acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
   }
   ```

   Restart Caddy after editing (`docker compose -f docker-compose.prod.yaml restart caddy`),
   verify TLS works (browsers will show a staging cert warning — that is expected),
   then remove the global block and clear the `caddy_data` volume to issue a real
   production certificate.
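
The mandatory variables from step 3 lend themselves to a quick pre-flight check before the stack is started. The sketch below is one possible way to do that, under stated assumptions: it parses only plain `KEY=VALUE` lines (no quoting or `export` prefixes), and `parseEnv`, `missingVars`, and the `required` slice are hypothetical helpers, not part of the repository.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// required lists the mandatory keys from the table in step 3.
var required = []string{
	"DATABASE_URL", "POSTGRES_PASSWORD", "POSTGRES_USER", "POSTGRES_DB",
	"SESSION_SECRET", "S3_ENDPOINT", "S3_BUCKET", "S3_ACCESS_KEY",
	"S3_SECRET_KEY", "S3_USE_PATH_STYLE", "S3_REGION", "MAX_UPLOAD_SIZE_MB",
	"ENV", "PORT", "DOMAIN",
}

// parseEnv reads KEY=VALUE lines, skipping blank lines and # comments.
func parseEnv(content string) map[string]string {
	vars := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if key, value, ok := strings.Cut(line, "="); ok {
			vars[strings.TrimSpace(key)] = strings.TrimSpace(value)
		}
	}
	return vars
}

// missingVars returns the required keys that are absent or empty.
func missingVars(content string) []string {
	vars := parseEnv(content)
	var missing []string
	for _, key := range required {
		if vars[key] == "" {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	// Inline sample instead of reading .env.prod, so the sketch runs anywhere.
	sample := "ENV=production\nPORT=8080\n# DOMAIN is not set yet\n"
	fmt.Println("missing:", missingVars(sample))
}
```

In practice the content would come from reading `.env.prod`, and a non-empty result would be a reason to stop before running `docker compose up`.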

### Deploying a new version

1. **Build and tag the new image** (same as first-time, with a new tag):

   ```
   docker build -f Dockerfile -t ghcr.io/yourusername/xtablo:v0.2.0 .
   docker push ghcr.io/yourusername/xtablo:v0.2.0   # if using a registry
   ```

2. **On the VM** — update `TAG` in `.env.prod`:

   ```
   # Edit .env.prod:
   TAG=v0.2.0
   ```

   Or export it in your shell instead of editing the file (Compose gives shell
   environment variables precedence over `--env-file` values):

   ```
   export TAG=v0.2.0
   ```

3. **Pull and recreate only the changed services:**

   ```
   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
   ```

   Compose recreates only the web and worker containers (their image tag changed).
   Postgres and Caddy are unaffected. Migrations run automatically at web startup
   (D-10) — `goose.Up()` is idempotent and skips already-applied migrations.

## Rollback

Rollback means redeploying the previous image tag (D-11). No special tooling is
required — it is the same as deploying a new version, but with an older tag.

1. **On the VM** — set `TAG` to the previous tag in `.env.prod` (or inline):

   ```
   export TAG=v0.1.0
   ```

2. **Redeploy:**

   ```
   docker compose -f docker-compose.prod.yaml --env-file .env.prod up -d
   ```

   Compose recreates web and worker with the old image. The rollback is complete.

### Schema rollback (break-glass)

`goose.Up()` only moves the schema forward, so rolling back to a previous binary does
not automatically run `goose down`. In most cases this is fine: the old binary ignores
columns it does not know about.

If a migration introduced a schema change that is **incompatible** with the old binary
(e.g. a NOT NULL column without a default that the old binary does not supply), run a
manual goose down as a break-glass step:

1. Connect to Postgres inside the container:

   ```
   docker exec -it <postgres-container-name> psql -U xtablo -d xtablo
   ```

   (Find the container name with `docker compose -f docker-compose.prod.yaml ps`.)

2. The production image is distroless — the `goose` CLI is not inside the runtime
   container. Install the goose CLI separately on the VM or use the goose Docker
   image against the internal network:

   ```
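   # Install goose CLI on the VM (requires a Go toolchain on the VM):
   go install github.com/pressly/goose/v3/cmd/goose@latest
   # DATABASE_URL must be reachable from the host; the `postgres` hostname
   # only resolves inside the compose network.
   goose -dir ./migrations postgres "$DATABASE_URL" down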
   ```

   Or use an ephemeral container on the same compose network:

   ```
   docker run --rm --network <compose-network> \
     -e GOOSE_DRIVER=postgres \
     -e GOOSE_DBSTRING="postgres://xtablo:<password>@postgres:5432/xtablo?sslmode=disable" \
     -v "$(pwd)/migrations:/migrations" \
     ghcr.io/kukymbr/goose-docker:latest \
     goose -dir /migrations down
   ```

   After reverting the migration, the old binary will start cleanly.

## Incident Runbook

### /readyz returns 503

`/readyz` pings Postgres. A 503 means the web container cannot reach the database.

1. Check container status:

   ```
   docker compose -f docker-compose.prod.yaml ps
   ```

2. If `postgres` is down or unhealthy, restart it:

   ```
   docker compose -f docker-compose.prod.yaml up -d postgres
   ```

   Then restart web and worker (they will wait for postgres to be healthy):

   ```
   docker compose -f docker-compose.prod.yaml up -d
   ```

3. Check web logs for the actual error:

   ```
   docker compose -f docker-compose.prod.yaml logs web --tail=50
   ```

   All application logs are JSON when `ENV=production` is set. Look for
   `"level":"ERROR"` lines with a `"msg":"db ping failed"` or similar.
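
For context when reading those logs, a readiness check of this kind usually looks something like the sketch below. It is an approximation, not the project's handler: the `pinger` interface matches the `Ping(ctx) error` method on `*pgxpool.Pool`, while `fakeDB`, `statusFor`, the 2-second timeout, and the JSON body returned on failure are assumptions (the README only documents the success payload).

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinger is the one database method the readiness check needs.
type pinger interface {
	Ping(ctx context.Context) error
}

// fakeDB stands in for the connection pool; err is what Ping returns.
type fakeDB struct{ err error }

func (f fakeDB) Ping(context.Context) error { return f.err }

// readyz returns 200 when the database answers a ping and 503 otherwise.
func readyz(db pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Bound the ping so a hung database cannot hang the probe (assumed 2 s).
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		status, body := http.StatusOK, map[string]string{"status": "ok", "db": "ok"}
		if err := db.Ping(ctx); err != nil {
			// Failure payload is hypothetical.
			status, body = http.StatusServiceUnavailable, map[string]string{"status": "unavailable", "db": "down"}
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(status)
		_ = json.NewEncoder(w).Encode(body)
	}
}

// statusFor runs the handler in-process against a healthy or failing fake.
func statusFor(dbUp bool) int {
	var db fakeDB
	if !dbUp {
		db.err = errors.New("connection refused")
	}
	rec := httptest.NewRecorder()
	readyz(db)(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(true), statusFor(false))
}
```

The point of the sketch is that a 503 from `/readyz` reflects the ping result alone, which is why the steps above start with the postgres container rather than the web container.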

### Caddy TLS certificate errors

1. Check caddy logs:

   ```
   docker compose -f docker-compose.prod.yaml logs caddy --tail=50
   ```

2. If you see "too many certificates already issued for" (Let's Encrypt rate limit,
   RESEARCH Pitfall 4):
   - Caddy hit the limit of 5 duplicate certificates per week for the domain.
   - Confirm the `caddy_data` named volume exists and is mounted — if the volume was
     accidentally deleted, Caddy cannot reuse the cached certificate and must
     re-issue on every restart, quickly exhausting the rate limit.
   - Recovery options:
     - Wait up to 1 week for the rate limit window to reset.
     - Switch to the Let's Encrypt staging endpoint temporarily (see
       "Let's Encrypt staging" in the First-time setup section above).
     - Restore from a `caddy_data` volume backup if available.

3. If the `caddy_data` volume was lost:

   ```
   # Verify the volume still exists:
   docker volume ls | grep caddy_data

   # If missing, the volume must be recreated (certificates will be re-issued):
   docker compose -f docker-compose.prod.yaml up -d caddy
   ```

### Checking logs

Follow logs for any service:

```
docker compose -f docker-compose.prod.yaml logs web --tail=100 --follow
docker compose -f docker-compose.prod.yaml logs worker --tail=100 --follow
docker compose -f docker-compose.prod.yaml logs caddy --tail=100 --follow
docker compose -f docker-compose.prod.yaml logs postgres --tail=50
```

All application logs are JSON in production (`ENV=production` activates the slog
JSON handler). Pipe through `jq` for readable output:

```
docker compose -f docker-compose.prod.yaml logs web --follow --no-log-prefix | jq .
```
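
The `ENV` switch behind those JSON logs can be sketched with the standard `log/slog` package. This is a minimal illustration assuming the switch is a plain string comparison; `newLogger` and `sample` are hypothetical names, and the project's actual handler options are not shown in this README.

```go
package main

import (
	"bytes"
	"io"
	"log/slog"
	"os"
)

// newLogger picks the slog handler from the ENV value:
// JSON for "production", human-readable text for anything else.
func newLogger(env string, w io.Writer) *slog.Logger {
	if env == "production" {
		return slog.New(slog.NewJSONHandler(w, nil))
	}
	return slog.New(slog.NewTextHandler(w, nil))
}

// sample logs one line with the handler chosen for env and returns the raw output.
func sample(env string) string {
	var buf bytes.Buffer
	newLogger(env, &buf).Info("worker ready")
	return buf.String()
}

func main() {
	os.Stdout.WriteString(sample("development")) // key=value text line
	os.Stdout.WriteString(sample("production"))  // single JSON object per line
}
```

Because the production handler emits one JSON object per line, tools such as `jq` can parse the stream directly, while the development output stays readable in a terminal.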

### Debugging the distroless container

The runtime image (`gcr.io/distroless/static-debian12:nonroot`) has **no shell**
(RESEARCH Pitfall 7). You cannot `docker exec -it <web-container> sh`.

To debug network or filesystem issues, attach an ephemeral busybox container to the
same network:

```
# Find the web container ID:
docker compose -f docker-compose.prod.yaml ps

# Attach busybox to the web container's network namespace:
docker run --rm -it --network container:<web-container-id> busybox sh
```

From the busybox shell you can run `wget`, `nc`, `ping`, etc. to diagnose
connectivity. To inspect the compose network directly (e.g. reach `postgres:5432`):

```
docker run --rm -it \
  --network "$(docker inspect <web-container-id> --format '{{range .NetworkSettings.Networks}}{{.NetworkID}}{{end}}')" \
  busybox sh
```