docs: add self-hosted chat design spec (Stream Chat replacement)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Arthur Belleville 2026-04-11 11:40:20 +02:00
parent 1268a268c1
commit 973d745753
No known key found for this signature in database

View file

@ -0,0 +1,283 @@
# Self-Hosted Chat: Stream Chat Replacement
**Date**: 2026-04-11
**Status**: Approved
## Motivation
Replace Stream Chat (stream-chat + stream-chat-react) with a self-hosted solution to achieve:
- **Cost control** — eliminate Stream's per-MAU pricing
- **Data ownership** — messages stored in our own Postgres
- **Vendor independence** — remove dependency on Stream's backend
## Architecture
Three layers:
1. **Cloudflare Durable Objects** — one DO per tablo channel, managing WebSocket connections and real-time message broadcast. Uses the Hibernatable WebSocket API (idle rooms cost nothing).
2. **Cloudflare Worker** — authenticates WebSocket upgrades (validates Supabase JWTs), routes connections to the correct DO, and serves REST endpoints for message history and channel metadata.
3. **Supabase Postgres** — stores messages, read state. Channels and membership are implicit (tablos and tablo/org membership).
This is a separate Cloudflare Worker from the main app, with its own DO bindings and environment variables.
## Data Flow
### Sending a message
```
User sends WS message
→ DO receives
→ assigns server timestamp + message ID
→ broadcasts to all connected WS clients in the room
→ writes to Postgres (async, non-blocking via Supabase REST API)
```
### Loading history (reconnect / initial load)
```
Client connects
→ Worker authenticates JWT
→ Worker checks tablo membership
→ routes to DO
→ DO accepts WebSocket
Client requests history
→ Worker queries Postgres
→ returns paginated messages
```
## Data Model
### New tables
```sql
-- Messages
CREATE TABLE messages (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
channel_id uuid NOT NULL REFERENCES tablos(id) ON DELETE CASCADE,
user_id uuid NOT NULL REFERENCES auth.users(id),
text text NOT NULL,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz,
deleted_at timestamptz
);
CREATE INDEX idx_messages_channel_created ON messages(channel_id, created_at DESC);
```
```sql
-- Read state (last read position per user per channel)
CREATE TABLE channel_read_state (
user_id uuid NOT NULL REFERENCES auth.users(id),
channel_id uuid NOT NULL REFERENCES tablos(id) ON DELETE CASCADE,
last_read_at timestamptz NOT NULL DEFAULT now(),
PRIMARY KEY (user_id, channel_id)
);
```
### No new tables needed for
- **Channels** — tablos are the channels (tablo ID = channel ID)
- **Membership** — existing tablo/org membership tables
- **Users** — existing Supabase auth + user tables
### Unread count query
```sql
SELECT COUNT(*) FROM messages
WHERE channel_id = :channel_id
AND created_at > (
SELECT last_read_at FROM channel_read_state
WHERE user_id = :user_id AND channel_id = :channel_id
)
AND deleted_at IS NULL;
```
### Soft deletes
Messages use `deleted_at` (not hard deletion), consistent with the tablo soft-delete pattern.
## Durable Objects — Real-time Layer
One DO per tablo channel, identified by tablo ID.
### WebSocket lifecycle
- **Connect**: Worker passes authenticated user ID as a tag via `acceptWebSocket(ws, [userId])`. DO tracks connected users.
- **Message**: DO parses incoming message, assigns server timestamp and ID, broadcasts to all other connected sockets via `getWebSockets()`.
- **Close**: DO cleans up. If no connections remain, it hibernates (zero cost).
### WebSocket message types
**Client → Server:**
| Type | Payload | Persisted |
|------|---------|-----------|
| `message.send` | `{ text, clientId }` | Yes |
| `typing.start` | `{}` | No |
| `typing.stop` | `{}` | No |
| `presence.ping` | `{}` | No |
**Server → Client:**
| Type | Payload |
|------|---------|
| `message.new` | `{ id, userId, text, createdAt, clientId }` |
| `typing` | `{ userId, isTyping }` |
| `presence.update` | `{ userId, status }` |
| `error` | `{ code, message }` |
### What the DO does NOT do
- No message history queries (Worker + Postgres)
- No channel membership management (existing API)
- No authentication (Worker validates JWT before routing)
### Presence
The DO knows who's connected via `getWebSockets()`. On connect/disconnect, it broadcasts `presence.update` to the room.
### Typing indicators
Purely ephemeral. Client sends `typing.start`, DO broadcasts to others, never persisted.
## Cloudflare Worker — API & Routing
### Endpoints
```
GET /chat/ws/:channelId # WebSocket upgrade → routed to DO
GET /chat/channels/:channelId/messages?before=<ts>&limit=50 # Paginated history
POST /chat/channels/:channelId/read # Mark channel as read
GET /chat/unread # Unread counts for current user
```
### Authentication
Every request (including WebSocket upgrade) includes the Supabase JWT in the `Authorization` header. The Worker validates the JWT using Supabase's public JWT secret (standard JWT verification, no Supabase SDK needed).
### Membership check
Before routing a WebSocket connection to a DO, the Worker verifies the user is a member of the tablo by querying existing membership data via the Supabase REST API.
### Postgres access
Both the Worker (for history/unread queries) and the DO (for message persistence) use Supabase's PostgREST API with the service role key. The DO calls PostgREST directly — it does not route writes through the Worker. No direct Postgres connection needed at this scale.
## Frontend
### Chat client hook — `useChat`
```typescript
const {
messages, // Message[] — history + live messages merged
sendMessage, // (text: string) => void
isConnected, // boolean
typingUsers, // string[] — user IDs currently typing
onlineUsers, // string[] — user IDs currently connected
loadMoreMessages, // () => void — pagination trigger
hasMoreMessages, // boolean
markAsRead, // () => void
} = useChat(channelId)
```
Internally:
1. Opens WebSocket to `/chat/ws/:channelId` with JWT
2. Fetches initial message history via REST
3. Merges incoming WebSocket messages with history (dedup via `clientId`)
4. Sends typing events with debounce (start on keypress, stop after 2s idle)
5. Calls `markAsRead` when channel is visible/focused
### Unread counts — `useChatUnread`
Polls `GET /chat/unread` on interval (every 30s) or on window focus. Replaces `useTabloDiscussionUnread`.
### UI components — chatscope
Using `@chatscope/chat-ui-kit-react`, styled to match existing Radix/Tailwind design system:
| chatscope component | Replaces |
|---------------------|----------|
| `ConversationList` + `Conversation` | Stream `ChannelList` + `ChannelPreview` |
| `ChatContainer` + `MessageList` + `Message` | Stream `Channel` + `MessageList` |
| `MessageInput` | Stream `MessageInput` |
| `TypingIndicator` | New (wasn't explicitly used before) |
| `Avatar` + `StatusIndicator` | `ChannelBadge` online dot |
### Styling
Override chatscope's default CSS theme to match existing Tailwind/Radix design system (colors, fonts, border radius, spacing).
## Migration — What Changes in Existing Code
### Removed from `apps/api`
- Stream server client initialization in `middleware.ts`
- `stream-chat` dependency
- Stream token generation in `GET /api/v1/users/me` (`user.ts:64`)
- `STREAM_CHAT_API_KEY` and `STREAM_CHAT_API_SECRET` config variables
- All `streamServerClient.channel()` calls in `tablo.ts`
- All `streamServerClient.upsertUser()` calls in `user.ts`, `helpers.ts`, `tablo.ts`
- All `channel.addMembers()` / `channel.removeMembers()` calls
### Replaced with Postgres operations
- **Tablo creation**: No extra chat setup needed. The tablo IS the channel.
- **Add/remove member**: Already handled by tablo/org membership. No chat-specific call.
- **Update channel name**: Not needed. Channel name = tablo name, read from `tablos` table.
- **Delete channel**: `ON DELETE CASCADE` on `messages.channel_id` handles cleanup.
- **Upsert Stream user**: Removed entirely. No user sync needed.
### Removed from `apps/main`
- `stream-chat` and `stream-chat-react` dependencies
- `VITE_STREAM_CHAT_API_KEY` from all env files
- `ChatProvider.tsx`
- `streamToken` from user data flow
- `ChannelPreview.tsx`, `CustomChannelHeader.tsx`, `ChannelBadge.tsx` (replaced by chatscope equivalents)
- `useChannelFromUrl`, `useTabloDiscussionUnread` hooks (replaced by `useChat`, `useChatUnread`)
### Unchanged
- Tablo CRUD, membership management, auth — all stay in existing API
- Chat page routes (`/chat`, `/chat/:channelId`) — same URLs, new implementation
## Error Handling & Edge Cases
### WebSocket disconnection / reconnect
- `useChat` implements automatic reconnection with exponential backoff
- On reconnect, fetches messages with `?after=<lastMessageTimestamp>` to fill the gap
- Messages sent while disconnected are queued locally and sent on reconnect (or shown as "failed to send" after timeout)
### Message ordering
- DO assigns server timestamps, which are authoritative
- Optimistic messages use `clientId` for dedup — replaced by server echo on arrival
- History from Postgres is ordered by `created_at DESC`
### Membership enforcement
- Worker checks tablo membership before allowing WebSocket connection
- If a user is removed while connected, they are evicted on next reconnect (acceptable at small scale)
### Postgres write failures
- DO retries up to 3 times
- Messages already delivered via WebSocket — user experience unaffected
- Persistent failure is logged as an operational alert
### Deploys
- DO eviction on deploy closes all WebSockets
- Clients reconnect and fetch the gap from Postgres via standard reconnect flow
## Scale Considerations
Designed for small scale (under 100 concurrent users, occasional messages). At this scale:
- A single DO per room handles all connections easily
- Supabase PostgREST is sufficient (no connection pooling needed)
- Polling for unread counts (every 30s) is fine
- No need for message queues, caches, or read replicas