diff --git a/docs/superpowers/specs/2026-04-11-self-hosted-chat-design.md b/docs/superpowers/specs/2026-04-11-self-hosted-chat-design.md new file mode 100644 index 0000000..78f3b68 --- /dev/null +++ b/docs/superpowers/specs/2026-04-11-self-hosted-chat-design.md @@ -0,0 +1,283 @@ +# Self-Hosted Chat: Stream Chat Replacement + +**Date**: 2026-04-11 +**Status**: Approved + +## Motivation + +Replace Stream Chat (stream-chat + stream-chat-react) with a self-hosted solution to achieve: + +- **Cost control** — eliminate Stream's per-MAU pricing +- **Data ownership** — messages stored in our own Postgres +- **Vendor independence** — remove dependency on Stream's backend + +## Architecture + +Three layers: + +1. **Cloudflare Durable Objects** — one DO per tablo channel, managing WebSocket connections and real-time message broadcast. Uses the Hibernatable WebSocket API (idle rooms cost nothing). +2. **Cloudflare Worker** — authenticates WebSocket upgrades (validates Supabase JWTs), routes connections to the correct DO, and serves REST endpoints for message history and channel metadata. +3. **Supabase Postgres** — stores messages, read state. Channels and membership are implicit (tablos and tablo/org membership). + +This is a separate Cloudflare Worker from the main app, with its own DO bindings and environment variables. + +## Data Flow + +### Sending a message + +``` +User sends WS message + → DO receives + → assigns server timestamp + message ID + → broadcasts to all connected WS clients in the room + → writes to Postgres (async, non-blocking via Supabase REST API) +``` + +### Loading history (reconnect / initial load) + +``` +Client connects + → Worker authenticates JWT + → Worker checks tablo membership + → routes to DO + → DO accepts WebSocket +Client requests history + → Worker queries Postgres + → returns paginated messages +``` + +## Data Model + +### New tables + +```sql +-- Messages +CREATE TABLE messages ( + id uuid PRIMARY KEY DEFAULT gen_random_uuid(), + channel_id uuid NOT NULL REFERENCES tablos(id) ON DELETE CASCADE, + user_id uuid NOT NULL REFERENCES auth.users(id), + text text NOT NULL, + created_at timestamptz NOT NULL DEFAULT now(), + updated_at timestamptz, + deleted_at timestamptz +); + +CREATE INDEX idx_messages_channel_created ON messages(channel_id, created_at DESC); +``` + +```sql +-- Read state (last read position per user per channel) +CREATE TABLE channel_read_state ( + user_id uuid NOT NULL REFERENCES auth.users(id), + channel_id uuid NOT NULL REFERENCES tablos(id) ON DELETE CASCADE, + last_read_at timestamptz NOT NULL DEFAULT now(), + PRIMARY KEY (user_id, channel_id) +); +``` + +### No new tables needed for + +- **Channels** — tablos are the channels (tablo ID = channel ID) +- **Membership** — existing tablo/org membership tables +- **Users** — existing Supabase auth + user tables + +### Unread count query + +```sql +SELECT COUNT(*) FROM messages +WHERE channel_id = :channel_id + AND created_at > ( + SELECT last_read_at FROM channel_read_state + WHERE user_id = :user_id AND channel_id = :channel_id + ) + AND deleted_at IS NULL; +``` + +### Soft deletes + +Messages use `deleted_at` (not hard deletion), consistent with the tablo soft-delete pattern. + +## Durable Objects — Real-time Layer + +One DO per tablo channel, identified by tablo ID. + +### WebSocket lifecycle + +- **Connect**: Worker passes authenticated user ID as a tag via `acceptWebSocket(ws, [userId])`. DO tracks connected users. +- **Message**: DO parses incoming message, assigns server timestamp and ID, broadcasts to all other connected sockets via `getWebSockets()`. +- **Close**: DO cleans up. If no connections remain, it hibernates (zero cost). + +### WebSocket message types + +**Client → Server:** + +| Type | Payload | Persisted | +|------|---------|-----------| +| `message.send` | `{ text, clientId }` | Yes | +| `typing.start` | `{}` | No | +| `typing.stop` | `{}` | No | +| `presence.ping` | `{}` | No | + +**Server → Client:** + +| Type | Payload | +|------|---------| +| `message.new` | `{ id, userId, text, createdAt, clientId }` | +| `typing` | `{ userId, isTyping }` | +| `presence.update` | `{ userId, status }` | +| `error` | `{ code, message }` | + +### What the DO does NOT do + +- No message history queries (Worker + Postgres) +- No channel membership management (existing API) +- No authentication (Worker validates JWT before routing) + +### Presence + +The DO knows who's connected via `getWebSockets()`. On connect/disconnect, it broadcasts `presence.update` to the room. + +### Typing indicators + +Purely ephemeral. Client sends `typing.start`, DO broadcasts to others, never persisted. + +## Cloudflare Worker — API & Routing + +### Endpoints + +``` +GET /chat/ws/:channelId # WebSocket upgrade → routed to DO +GET /chat/channels/:channelId/messages?before=&limit=50 # Paginated history +POST /chat/channels/:channelId/read # Mark channel as read +GET /chat/unread # Unread counts for current user +``` + +### Authentication + +Every request (including WebSocket upgrade) includes the Supabase JWT in the `Authorization` header. The Worker validates the JWT using Supabase's public JWT secret (standard JWT verification, no Supabase SDK needed). + +### Membership check + +Before routing a WebSocket connection to a DO, the Worker verifies the user is a member of the tablo by querying existing membership data via the Supabase REST API. + +### Postgres access + +Both the Worker (for history/unread queries) and the DO (for message persistence) use Supabase's PostgREST API with the service role key. The DO calls PostgREST directly — it does not route writes through the Worker. No direct Postgres connection needed at this scale. + +## Frontend + +### Chat client hook — `useChat` + +```typescript +const { + messages, // Message[] — history + live messages merged + sendMessage, // (text: string) => void + isConnected, // boolean + typingUsers, // string[] — user IDs currently typing + onlineUsers, // string[] — user IDs currently connected + loadMoreMessages, // () => void — pagination trigger + hasMoreMessages, // boolean + markAsRead, // () => void +} = useChat(channelId) +``` + +Internally: + +1. Opens WebSocket to `/chat/ws/:channelId` with JWT +2. Fetches initial message history via REST +3. Merges incoming WebSocket messages with history (dedup via `clientId`) +4. Sends typing events with debounce (start on keypress, stop after 2s idle) +5. Calls `markAsRead` when channel is visible/focused + +### Unread counts — `useChatUnread` + +Polls `GET /chat/unread` on interval (every 30s) or on window focus. Replaces `useTabloDiscussionUnread`. + +### UI components — chatscope + +Using `@chatscope/chat-ui-kit-react`, styled to match existing Radix/Tailwind design system: + +| chatscope component | Replaces | +|---------------------|----------| +| `ConversationList` + `Conversation` | Stream `ChannelList` + `ChannelPreview` | +| `ChatContainer` + `MessageList` + `Message` | Stream `Channel` + `MessageList` | +| `MessageInput` | Stream `MessageInput` | +| `TypingIndicator` | New (wasn't explicitly used before) | +| `Avatar` + `StatusIndicator` | `ChannelBadge` online dot | + +### Styling + +Override chatscope's default CSS theme to match existing Tailwind/Radix design system (colors, fonts, border radius, spacing). + +## Migration — What Changes in Existing Code + +### Removed from `apps/api` + +- Stream server client initialization in `middleware.ts` +- `stream-chat` dependency +- Stream token generation in `GET /api/v1/users/me` (`user.ts:64`) +- `STREAM_CHAT_API_KEY` and `STREAM_CHAT_API_SECRET` config variables +- All `streamServerClient.channel()` calls in `tablo.ts` +- All `streamServerClient.upsertUser()` calls in `user.ts`, `helpers.ts`, `tablo.ts` +- All `channel.addMembers()` / `channel.removeMembers()` calls + +### Replaced with Postgres operations + +- **Tablo creation**: No extra chat setup needed. The tablo IS the channel. +- **Add/remove member**: Already handled by tablo/org membership. No chat-specific call. +- **Update channel name**: Not needed. Channel name = tablo name, read from `tablos` table. +- **Delete channel**: `ON DELETE CASCADE` on `messages.channel_id` handles cleanup. +- **Upsert Stream user**: Removed entirely. No user sync needed. + +### Removed from `apps/main` + +- `stream-chat` and `stream-chat-react` dependencies +- `VITE_STREAM_CHAT_API_KEY` from all env files +- `ChatProvider.tsx` +- `streamToken` from user data flow +- `ChannelPreview.tsx`, `CustomChannelHeader.tsx`, `ChannelBadge.tsx` (replaced by chatscope equivalents) +- `useChannelFromUrl`, `useTabloDiscussionUnread` hooks (replaced by `useChat`, `useChatUnread`) + +### Unchanged + +- Tablo CRUD, membership management, auth — all stay in existing API +- Chat page routes (`/chat`, `/chat/:channelId`) — same URLs, new implementation + +## Error Handling & Edge Cases + +### WebSocket disconnection / reconnect + +- `useChat` implements automatic reconnection with exponential backoff +- On reconnect, fetches messages with `?after=` to fill the gap +- Messages sent while disconnected are queued locally and sent on reconnect (or shown as "failed to send" after timeout) + +### Message ordering + +- DO assigns server timestamps, which are authoritative +- Optimistic messages use `clientId` for dedup — replaced by server echo on arrival +- History from Postgres is ordered by `created_at DESC` + +### Membership enforcement + +- Worker checks tablo membership before allowing WebSocket connection +- If a user is removed while connected, they are evicted on next reconnect (acceptable at small scale) + +### Postgres write failures + +- DO retries up to 3 times +- Messages already delivered via WebSocket — user experience unaffected +- Persistent failure is logged as an operational alert + +### Deploys + +- DO eviction on deploy closes all WebSockets +- Clients reconnect and fetch the gap from Postgres via standard reconnect flow + +## Scale Considerations + +Designed for small scale (under 100 concurrent users, occasional messages). At this scale: + +- A single DO per room handles all connections easily +- Supabase PostgREST is sufficient (no connection pooling needed) +- Polling for unread counts (every 30s) is fine +- No need for message queues, caches, or read replicas