xtablo-source/docs/superpowers/specs/2026-04-11-self-hosted-chat-design.md
Arthur Belleville 973d745753
docs: add self-hosted chat design spec (Stream Chat replacement)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 11:40:20 +02:00

9.8 KiB

Self-Hosted Chat: Stream Chat Replacement

Date: 2026-04-11 Status: Approved

Motivation

Replace Stream Chat (stream-chat + stream-chat-react) with a self-hosted solution to achieve:

  • Cost control — eliminate Stream's per-MAU pricing
  • Data ownership — messages stored in our own Postgres
  • Vendor independence — remove dependency on Stream's backend

Architecture

Three layers:

  1. Cloudflare Durable Objects — one DO per tablo channel, managing WebSocket connections and real-time message broadcast. Uses the Hibernatable WebSocket API (idle rooms cost nothing).
  2. Cloudflare Worker — authenticates WebSocket upgrades (validates Supabase JWTs), routes connections to the correct DO, and serves REST endpoints for message history and channel metadata.
  3. Supabase Postgres — stores messages, read state. Channels and membership are implicit (tablos and tablo/org membership).

This is a separate Cloudflare Worker from the main app, with its own DO bindings and environment variables.

Data Flow

Sending a message

User sends WS message
  → DO receives
  → assigns server timestamp + message ID
  → broadcasts to all connected WS clients in the room
  → writes to Postgres (async, non-blocking via Supabase REST API)

Loading history (reconnect / initial load)

Client connects
  → Worker authenticates JWT
  → Worker checks tablo membership
  → routes to DO
  → DO accepts WebSocket
Client requests history
  → Worker queries Postgres
  → returns paginated messages

Data Model

New tables

-- Messages
CREATE TABLE messages (
  id            uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  channel_id    uuid NOT NULL REFERENCES tablos(id) ON DELETE CASCADE,
  user_id       uuid NOT NULL REFERENCES auth.users(id),
  text          text NOT NULL,
  created_at    timestamptz NOT NULL DEFAULT now(),
  updated_at    timestamptz,
  deleted_at    timestamptz
);

CREATE INDEX idx_messages_channel_created ON messages(channel_id, created_at DESC);
-- Read state (last read position per user per channel)
CREATE TABLE channel_read_state (
  user_id       uuid NOT NULL REFERENCES auth.users(id),
  channel_id    uuid NOT NULL REFERENCES tablos(id) ON DELETE CASCADE,
  last_read_at  timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (user_id, channel_id)
);

No new tables needed for

  • Channels — tablos are the channels (tablo ID = channel ID)
  • Membership — existing tablo/org membership tables
  • Users — existing Supabase auth + user tables

Unread count query

SELECT COUNT(*) FROM messages
WHERE channel_id = :channel_id
  AND created_at > (
    SELECT last_read_at FROM channel_read_state
    WHERE user_id = :user_id AND channel_id = :channel_id
  )
  AND deleted_at IS NULL;

Soft deletes

Messages use deleted_at (not hard deletion), consistent with the tablo soft-delete pattern.

Durable Objects — Real-time Layer

One DO per tablo channel, identified by tablo ID.

WebSocket lifecycle

  • Connect: Worker passes authenticated user ID as a tag via acceptWebSocket(ws, [userId]). DO tracks connected users.
  • Message: DO parses incoming message, assigns server timestamp and ID, broadcasts to all other connected sockets via getWebSockets().
  • Close: DO cleans up. If no connections remain, it hibernates (zero cost).

WebSocket message types

Client → Server:

Type Payload Persisted
message.send { text, clientId } Yes
typing.start {} No
typing.stop {} No
presence.ping {} No

Server → Client:

Type Payload
message.new { id, userId, text, createdAt, clientId }
typing { userId, isTyping }
presence.update { userId, status }
error { code, message }

What the DO does NOT do

  • No message history queries (Worker + Postgres)
  • No channel membership management (existing API)
  • No authentication (Worker validates JWT before routing)

Presence

The DO knows who's connected via getWebSockets(). On connect/disconnect, it broadcasts presence.update to the room.

Typing indicators

Purely ephemeral. Client sends typing.start, DO broadcasts to others, never persisted.

Cloudflare Worker — API & Routing

Endpoints

GET  /chat/ws/:channelId                              # WebSocket upgrade → routed to DO
GET  /chat/channels/:channelId/messages?before=<ts>&limit=50  # Paginated history
POST /chat/channels/:channelId/read                    # Mark channel as read
GET  /chat/unread                                      # Unread counts for current user

Authentication

Every request (including WebSocket upgrade) includes the Supabase JWT in the Authorization header. The Worker validates the JWT using Supabase's public JWT secret (standard JWT verification, no Supabase SDK needed).

Membership check

Before routing a WebSocket connection to a DO, the Worker verifies the user is a member of the tablo by querying existing membership data via the Supabase REST API.

Postgres access

Both the Worker (for history/unread queries) and the DO (for message persistence) use Supabase's PostgREST API with the service role key. The DO calls PostgREST directly — it does not route writes through the Worker. No direct Postgres connection needed at this scale.

Frontend

Chat client hook — useChat

const {
  messages,          // Message[] — history + live messages merged
  sendMessage,       // (text: string) => void
  isConnected,       // boolean
  typingUsers,       // string[] — user IDs currently typing
  onlineUsers,       // string[] — user IDs currently connected
  loadMoreMessages,  // () => void — pagination trigger
  hasMoreMessages,   // boolean
  markAsRead,        // () => void
} = useChat(channelId)

Internally:

  1. Opens WebSocket to /chat/ws/:channelId with JWT
  2. Fetches initial message history via REST
  3. Merges incoming WebSocket messages with history (dedup via clientId)
  4. Sends typing events with debounce (start on keypress, stop after 2s idle)
  5. Calls markAsRead when channel is visible/focused

Unread counts — useChatUnread

Polls GET /chat/unread on interval (every 30s) or on window focus. Replaces useTabloDiscussionUnread.

UI components — chatscope

Using @chatscope/chat-ui-kit-react, styled to match existing Radix/Tailwind design system:

chatscope component Replaces
ConversationList + Conversation Stream ChannelList + ChannelPreview
ChatContainer + MessageList + Message Stream Channel + MessageList
MessageInput Stream MessageInput
TypingIndicator New (wasn't explicitly used before)
Avatar + StatusIndicator ChannelBadge online dot

Styling

Override chatscope's default CSS theme to match existing Tailwind/Radix design system (colors, fonts, border radius, spacing).

Migration — What Changes in Existing Code

Removed from apps/api

  • Stream server client initialization in middleware.ts
  • stream-chat dependency
  • Stream token generation in GET /api/v1/users/me (user.ts:64)
  • STREAM_CHAT_API_KEY and STREAM_CHAT_API_SECRET config variables
  • All streamServerClient.channel() calls in tablo.ts
  • All streamServerClient.upsertUser() calls in user.ts, helpers.ts, tablo.ts
  • All channel.addMembers() / channel.removeMembers() calls

Replaced with Postgres operations

  • Tablo creation: No extra chat setup needed. The tablo IS the channel.
  • Add/remove member: Already handled by tablo/org membership. No chat-specific call.
  • Update channel name: Not needed. Channel name = tablo name, read from tablos table.
  • Delete channel: ON DELETE CASCADE on messages.channel_id handles cleanup.
  • Upsert Stream user: Removed entirely. No user sync needed.

Removed from apps/main

  • stream-chat and stream-chat-react dependencies
  • VITE_STREAM_CHAT_API_KEY from all env files
  • ChatProvider.tsx
  • streamToken from user data flow
  • ChannelPreview.tsx, CustomChannelHeader.tsx, ChannelBadge.tsx (replaced by chatscope equivalents)
  • useChannelFromUrl, useTabloDiscussionUnread hooks (replaced by useChat, useChatUnread)

Unchanged

  • Tablo CRUD, membership management, auth — all stay in existing API
  • Chat page routes (/chat, /chat/:channelId) — same URLs, new implementation

Error Handling & Edge Cases

WebSocket disconnection / reconnect

  • useChat implements automatic reconnection with exponential backoff
  • On reconnect, fetches messages with ?after=<lastMessageTimestamp> to fill the gap
  • Messages sent while disconnected are queued locally and sent on reconnect (or shown as "failed to send" after timeout)

Message ordering

  • DO assigns server timestamps, which are authoritative
  • Optimistic messages use clientId for dedup — replaced by server echo on arrival
  • History from Postgres is ordered by created_at DESC

Membership enforcement

  • Worker checks tablo membership before allowing WebSocket connection
  • If a user is removed while connected, they are evicted on next reconnect (acceptable at small scale)

Postgres write failures

  • DO retries up to 3 times
  • Messages already delivered via WebSocket — user experience unaffected
  • Persistent failure is logged as an operational alert

Deploys

  • DO eviction on deploy closes all WebSockets
  • Clients reconnect and fetch the gap from Postgres via standard reconnect flow

Scale Considerations

Designed for small scale (under 100 concurrent users, occasional messages). At this scale:

  • A single DO per room handles all connections easily
  • Supabase PostgREST is sufficient (no connection pooling needed)
  • Polling for unread counts (every 30s) is fine
  • No need for message queues, caches, or read replicas