In the first real load test of chat and calls we hit a reproducible failure: a message from one user never reached the other. The cause was that the WebSocket registry was per-process, and nginx's round-robin could land two users on two different replicas of the same web container. A signal published on one replica went into that replica's local registry, found no socket for the recipient, and was silently dropped.
The fix is app/realtime/signaling_bus.py: every
publish_to_user(user_id, payload) does local-first
delivery plus a Redis PUBLISH to
tacitusmail:signal:{user_id}. Every web process runs
a background PSUBSCRIBE loop that forwards matching
messages into its own local registry. A per-process
_origin tag prevents double-delivery when the
publisher and subscriber share a replica.
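A minimal sketch of the pattern above, not the actual contents of signaling_bus.py. To stay runnable without a Redis server it swaps the real PUBLISH/PSUBSCRIBE pair for a hypothetical in-memory FakeBroker; the class names, register(), and the JSON envelope layout are assumptions made for illustration. Only publish_to_user, the channel prefix, and the _origin tag come from the description above.

```python
import json
import uuid
from collections import defaultdict

CHANNEL_PREFIX = "tacitusmail:signal:"


class FakeBroker:
    """Hypothetical in-memory stand-in for Redis pub/sub.

    Like a pattern subscription on the channel prefix, every
    subscriber receives every published message.
    """

    def __init__(self):
        self.subscribers = []

    def publish(self, channel, message):
        for callback in list(self.subscribers):
            callback(channel, message)


class SignalingBus:
    """One instance per web process (assumed structure)."""

    def __init__(self, broker):
        self.broker = broker
        self.origin = uuid.uuid4().hex      # per-process _origin tag
        self.local = defaultdict(list)      # user_id -> send callables
        broker.subscribers.append(self._on_message)

    def register(self, user_id, send):
        # Called when a user's WebSocket connects to this process.
        self.local[user_id].append(send)

    def _deliver_local(self, user_id, payload):
        for send in self.local.get(user_id, []):
            send(payload)

    def publish_to_user(self, user_id, payload):
        # Local-first delivery, then fan out to the other replicas.
        self._deliver_local(user_id, payload)
        envelope = json.dumps({"_origin": self.origin, "payload": payload})
        self.broker.publish(f"{CHANNEL_PREFIX}{user_id}", envelope)

    def _on_message(self, channel, message):
        envelope = json.loads(message)
        if envelope["_origin"] == self.origin:
            return  # already delivered locally by publish_to_user
        user_id = channel[len(CHANNEL_PREFIX):]
        self._deliver_local(user_id, envelope["payload"])


# Two replicas sharing one broker; bob is connected to replica B only.
broker = FakeBroker()
replica_a, replica_b = SignalingBus(broker), SignalingBus(broker)
bob_inbox, alice_inbox = [], []
replica_b.register("bob", bob_inbox.append)
replica_a.register("alice", alice_inbox.append)

replica_a.publish_to_user("bob", {"type": "offer"})     # crosses replicas
replica_a.publish_to_user("alice", {"type": "answer"})  # same replica
```

In this sketch the origin check is what keeps alice from receiving her signal twice: replica A delivers it locally, then sees its own envelope come back from the broker and skips it.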
On iOS the same commit rewired the call signalling end-to-end:
CallManager now plugs its signal sender into the
live ChatStore WebSocket instead of a dead
RealtimeSession that was never initialised. The
CallSignal Codable wrapper was the other root cause;
it looked for a nested payload field that the server
never emits. It is replaced with raw-dict dispatch via a new
handleIncomingSignalRaw.