xtablo-source/.planning/phases/07-deploy-v1/07-REVIEW.md
2026-05-15 18:29:46 +02:00

16 KiB
Raw Blame History

phase reviewed depth files_reviewed files_reviewed_list findings status
07-deploy-v1 2026-05-15T00:00:00Z standard 14
backend/.env.example
backend/cmd/web/main.go
backend/docker-compose.prod.yaml
backend/embed.go
backend/internal/db/migrate.go
backend/internal/web/csrf_test.go
backend/internal/web/handlers_auth_test.go
backend/internal/web/handlers_files_test.go
backend/internal/web/handlers_tablos_test.go
backend/internal/web/handlers_tasks_test.go
backend/internal/web/handlers_test.go
backend/internal/web/handlers.go
backend/internal/web/router.go
backend/README.md
critical warning info total
3 5 3 11
issues_found

Phase 07: Code Review Report

Reviewed: 2026-05-15T00:00:00Z Depth: standard Files Reviewed: 14 Status: issues_found

Summary

This review covers the Phase 7 (deploy-v1) implementation: the HTTP server entrypoint, production compose stack, embedded assets, database migrations, CSRF middleware, auth/tablo/task/file handlers, and the router. The code is well-structured with clear security intent and good test coverage. Three blockers and five warnings were identified, primarily around S3 credential exposure in .env.example, a missing NetworkMode / network-isolation gap in docker-compose.prod.yaml, a file-upload race condition, and several quality issues in the test suite and startup logic.


Critical Issues

CR-01: .env.example Contains Default S3 Credentials That May Ship to Production

File: backend/.env.example:39-40 Issue: S3_ACCESS_KEY=minioadmin and S3_SECRET_KEY=minioadmin are committed with explicit values — not placeholders like <your-key>. Operators who copy this file for production and miss those lines will start with the public default MinIO credentials against their R2 bucket. The README's first-time-setup table (line 298-299) says to set these but does not repeat that the example file ships with insecure defaults. The corresponding DATABASE_URL and SESSION_SECRET are intentionally left blank or with obvious-placeholder DSNs, making the S3 lines the odd ones out.

Fix:

# Replace concrete default values with angle-bracket placeholders:
S3_ACCESS_KEY=<your-r2-access-key-id>
S3_SECRET_KEY=<your-r2-secret-access-key>

CR-02: File-Upload Race Condition — S3 Orphan After DB Insert Failure

File: backend/internal/web/handlers_files.go:194-210 Issue: When InsertTabloFile fails (line 198), the handler attempts a best-effort S3 cleanup:

cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if delErr := deps.Files.Delete(cleanupCtx, s3Key); delErr != nil { ... }

Two bugs compound here:

  1. defer cancel() is inside the if err != nil branch of the upload loop, but this is a handler function (not a loop). Still, calling defer cancel() after context.WithTimeout is a Go anti-pattern in a hot path — cancel will only run when the surrounding function returns, not right after Delete completes, leaking the timer for up to the full handler lifetime rather than the 10-second window. The correct idiom is defer cancel() immediately after context.WithTimeout, but here cancel should be called explicitly after Delete returns (or use a dedicated cleanup goroutine).
  2. If the process dies between the Upload success and the InsertTabloFile failure (e.g., OOM-kill, SIGKILL), cleanup never runs. This is acknowledged as a Phase 6 concern, but the defer cancel() placement still results in the cancel function never being called before the function returns — which is actually the same call site, so this is benign at runtime. The real correctness issue is that this cleanup is best-effort only with no retry, meaning any transient S3 error during cleanup leaves a permanent orphan with no reconciliation path until Phase 6.

The primary BLOCKER is the defer cancel() placement: it is inside the if err != nil block, so cancel() is called when the function exits, but cleanupCtx is already expired by then (10 seconds is plenty for the synchronous Delete call). In practice this compiles and runs correctly, but represents a code-quality footgun and signals that the author may not have noticed the defer is scoped to function return, not the block.

More critically: the handler returns http.StatusInternalServerError to the user AFTER logging the DB error, but before the defer cancel() fires. This means the context passed to Delete has not been cancelled yet (good), so the cleanup call does execute within the handler's lifetime. However, if Delete itself is slow (>10 seconds), the context expires mid-cleanup, producing a partial-cleanup log error even though the operation was in-flight. This is a real production risk on a flaky S3 connection.

Fix:

// Replace the deferred cancel with explicit cancel after Delete:
cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
if delErr := deps.Files.Delete(cleanupCtx, s3Key); delErr != nil {
    slog.Default().Error("files upload: S3 cleanup after DB failure", "s3_key", s3Key, "err", delErr)
}
cancel() // call immediately after Delete, not via defer
http.Error(w, "internal server error", http.StatusInternalServerError)
return

CR-03: docker-compose.prod.yaml Worker Service Has No depends_on Health Gate for Web

File: backend/docker-compose.prod.yaml:56-64 Issue: The worker service depends only on postgres being healthy. It does not depend on the web service completing startup (which runs migrations at boot via goose.Up). If web and worker both start concurrently and migrations are still in flight, worker may connect to a database that is mid-migration — missing tables, functions, or constraints. The healthcheck on postgres only ensures the Postgres process is ready to accept connections, not that migrations have completed.

worker:
  depends_on:
    postgres:
      condition: service_healthy

Fix: Either make worker depend on web being healthy (requires a /readyz-style healthcheck for the web service) or — simpler — run migrations in a separate init container / step before both services start:

worker:
  depends_on:
    postgres:
      condition: service_healthy
    web:
      condition: service_healthy  # web exposes /readyz once migrations complete

Add a healthcheck to the web service:

web:
  healthcheck:
    test: ["CMD", "wget", "-qO-", "http://localhost:8080/readyz"]
    interval: 10s
    timeout: 5s
    retries: 10
    start_period: 30s

Warnings

WR-01: NewRouter Panics on Missing static/ Sub-Directory — Not a Graceful Startup Failure

File: backend/internal/web/router.go:127-130 Issue: fs.Sub(staticFS, "static") will panic if the embedded FS does not contain a static/ top-level entry. The panic is caught by chi's Recoverer middleware in practice, but only after the server is already running — it fires on the first request, not at startup. This means /healthz and /readyz will also panic if the static FS is malformed, since the panic happens at router construction time (NewRouter returns before the panic is triggered? — No: fs.Sub is called during NewRouter, so the panic happens at server startup in cmd/web/main.go:125). The panic at startup is non-fatal-looking to the operator; it bypasses the srv.Shutdown and pool.Close() cleanup paths.

Fix: Convert the panic to an error return from NewRouter, propagated up to main:

sub, err := fs.Sub(staticFS, "static")
if err != nil {
    return nil, fmt.Errorf("router: failed to sub static FS: %w", err)
}

WR-02: Rate Limiter Allow Does Not Increment Counter on Validation Errors (Wrong-Format Inputs Bypass the Rate Gate)

File: backend/internal/web/handlers_auth.go:195-209 Issue: In LoginPostHandler, validation of email and password format (steps 23) runs BEFORE the rate-limit check (step 5). An attacker sending a syntactically invalid email or a password shorter than 12 characters will never reach the deps.Limiter.Allow(key) gate. This means the rate limiter only fires on valid-looking credentials, not on brute-force spam with malformed payloads. An attacker who wants to enumerate valid emails via timing differences could send malformed payloads at high speed without consuming rate-limit tokens.

The rate-limit check comment says "Rate-limit check BEFORE user lookup and argon2 work" but the actual sequencing allows validation errors to short-circuit before the rate gate.

Fix: Move the rate-limit check to before the validation block, using a key derivable from the raw (pre-validation) email:

// Rate limit first — even on invalid input, to prevent enumeration via timing
ip := clientIP(r)
key := strings.ToLower(strings.TrimSpace(r.PostFormValue("email"))) + ":" + ip
if deps.Limiter != nil && !deps.Limiter.Allow(key) {
    // return 429 immediately
}
// Then validate format

WR-03: migrate.go Opens a Second database/sql Connection Pool That Is Never Tuned

File: backend/internal/db/migrate.go:27-29 Issue: sql.Open("pgx/v5", dsn) creates a new database/sql connection pool with default settings (up to MaxOpenConns = 0 = unlimited). This pool is used only for migrations and is closed via defer sqlDB.Close(). However, the DSN contains the password. If connConfig.ConnString() returns a URL with credentials (which pgx/v5's ConnString() does for password-based auth), the DSN is handled correctly — but the second pool opening is unnecessary work. More importantly, sql.Open with MaxOpenConns = 0 could theoretically open many connections during a long migration. This is low-risk in practice (goose runs sequentially) but is an unnecessary resource-management gap.

Additionally, the return value of pool.Config().ConnConfig.ConnString() is not documented as guaranteed to include the password in all configurations (e.g., with PasswordFunc set). If the pool was initialized with a connection string that uses cert-based auth or a password callback, ConnString() may return an incomplete DSN.

Fix:

sqlDB, err := sql.Open("pgx/v5", dsn)
if err != nil {
    return fmt.Errorf("migrate: open sql.DB: %w", err)
}
sqlDB.SetMaxOpenConns(2) // goose runs sequentially — no need for a large pool
defer sqlDB.Close()

WR-04: TestIndex_UnauthRedirects and TestDemoTime_Fragment Use Outdated NewRouter Signature

File: backend/internal/web/handlers_test.go:95-96, 110-111 Issue: Two tests call NewRouter with 8 positional arguments:

router := NewRouter(stubPinger{}, os.DirFS("./static"), AuthDeps{}, TablosDeps{}, TasksDeps{}, FilesDeps{}, testCSRFKey, "dev")

But router.go declares NewRouter with trustedOrigins ...string as the final variadic parameter. The tests omit trustedOrigins, which means CSRF in test mode does not include "localhost" as a trusted origin — unlike newTestRouter (line 37) and newTestRouterWithLimiter (line 44) which explicitly pass "localhost". This discrepancy means requests in TestIndex_UnauthRedirects that go through CSRF middleware may behave differently from those in newTestRouter. For GET requests this is fine (CSRF only blocks state-changing methods), but the inconsistency is a maintenance footgun: a future POST test using this helper would silently fail with 403 instead of the expected status.

Fix:

router := NewRouter(stubPinger{}, os.DirFS("./static"), AuthDeps{}, TablosDeps{}, TasksDeps{}, FilesDeps{}, testCSRFKey, "dev", "localhost")

WR-05: FileUploadHandler Does Not Validate or Sanitize filename Beyond Trimming Whitespace

File: backend/internal/web/handlers_files.go:178-182 Issue: The filename is stored in the database as-is (after trim) and will be displayed back to users in the file list. While the S3 key uses a UUID and path traversal to the filesystem is not possible, the raw filename is rendered in the template. If the template does not HTML-escape the filename (which templ normally does automatically), this is safe. However, filenames like <script>alert(1)</script>.pdf would be stored verbatim. More concretely, filenames containing path separators (../../etc/passwd) are stored in the database — again, safe since the S3 key ignores it, but the filename is returned to the browser in the file list and could confuse a display that uses it as a download attribute.

Fix:

// Sanitize: strip path components, limit length
filename = filepath.Base(filename) // strip any directory components
if len(filename) > 255 {
    filename = filename[:255]
}
if filename == "" || filename == "." {
    http.Error(w, "bad request: invalid filename", http.StatusBadRequest)
    return
}

Info

IN-01: goose.SetBaseFS(nil) in testdb_test.go Is a Global Mutation Inside a Mutex — Fragile

File: backend/internal/web/testdb_test.go:93 Issue: goose.SetBaseFS(nil) is called inside webGooseMu.Lock() to reset the global goose base FS before applying on-disk migrations. This relies on goroutine exclusion (the mutex) but goose.SetBaseFS is a package-level global. If any goroutine outside this mutex calls goose.SetBaseFS concurrently (e.g., RunMigrations in a parallel test subprocess that somehow shares the process), the state could be corrupted. The mutex name (webGooseMu) is package-local, so it only serializes callers within the web test package.

This is the documented "verbatim copy" approach from the auth package and is probably acceptable, but it is worth noting the global state mutation.


IN-02: handlers_test.go Has TestRouter_CSRFMountedAfterResolveSession in csrf_test.go That Reads router.go Source from Disk

File: backend/internal/web/csrf_test.go:352-371 Issue: TestRouter_CSRFMountedAfterResolveSession calls os.ReadFile("router.go") to assert middleware ordering by scanning source text. This is a "meta-test" that tests the source file rather than the runtime behavior. It is inherently brittle: it breaks if the file is renamed, if the string literals change, or if the logic is refactored without changing the string identifiers. The same ordering property is already implicitly tested by TestCSRF_LoginMissingToken (which would fail if CSRF ran before ResolveSession with the wrong cookie). The source-scanning approach adds maintenance cost without adding meaningful signal.


IN-03: TabloTasksTabHandler in handlers_files.go Does Not Guard for Nil deps.Files

File: backend/internal/web/handlers_files.go:105-127 Issue: TabloTasksTabHandler lives in handlers_files.go and accepts FilesDeps, but unlike every other handler in that file, it does NOT check deps.Files == nil at the top. This is intentional (the tasks tab does not use S3), but it is conceptually confusing: the handler accepts a FilesDeps parameter that it never uses for file storage, and the nil guard pattern established by all sibling handlers is absent. When a reader scans the file they expect to see the nil guard and its absence is surprising.

Additionally, the tasks tab route is registered in the router as GET /tablos/{id}/tasks, but the handler passes task data to templates.TabloDetailPage(user, ..., tasks, nil, "tasks") — passing nil for the file list is fine but is not symmetrical with the files tab which passes fileList for the third list argument. If a future template change adds a nil-dereference on this parameter, the asymmetry will be the root cause.


Reviewed: 2026-05-15T00:00:00Z Reviewer: Claude (gsd-code-reviewer) Depth: standard