1. Requirements

whatsapp-req.excalidraw

WIP on the sections below. 👷🏻‍♂️

whatsapp-hdl.excalidraw


2. Core Entities

User
 └── userId

Client (Device)
 ├── clientId
 └── userId

Chat
 ├── chatId
 └── metadata

ChatParticipant
 ├── chatId
 └── userId

Message
 ├── messageId
 ├── chatId
 ├── senderId
 ├── content
 └── timestamp

Inbox
 ├── recipientId/clientId
 └── messageId

Why Client Entity?

User
 ├── Mobile
 ├── Laptop
 └── Tablet

A user may have multiple devices.


3. API Design

Use:

WebSockets

Reason:

Bidirectional
Low Latency
Persistent Connection

Client → Server Commands

Create Chat

{
  "participants": [],
  "name": ""
}

Response

{
  "chatId": ""
}

Send Message

{
  "chatId": "",
  "message": "",
  "attachments": []
}

Response

{
  "status": "SUCCESS",
  "messageId": ""
}

Upload Attachment

{
  "body": "...",
  "hash": "..."
}

Modify Participants

{
  "chatId": "",
  "userId": "",
  "operation": "ADD | REMOVE"
}

ACK Message

{
  "messageId": ""
}

Used to confirm delivery.


Server → Client Commands

New Message

{
  "chatId": "",
  "senderId": "",
  "message": ""
}

Chat Updated

{
  "chatId": "",
  "participants": []
}
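
The payloads above do not show how the server tells one command from another. One common approach, assumed here purely for illustration, is to wrap each payload in an envelope with a `type` field. The envelope shape, the `type` names, and `parse_command` are hypothetical and not part of the original design.

```python
import json

# Hypothetical command names mapped to the payload fields listed above.
REQUIRED_FIELDS = {
    "createChat": {"participants", "name"},
    "sendMessage": {"chatId", "message", "attachments"},
    "modifyParticipants": {"chatId", "userId", "operation"},
    "ackMessage": {"messageId"},
}


def parse_command(frame: str) -> tuple[str, dict]:
    """Decode one WebSocket text frame and check its payload fields."""
    envelope = json.loads(frame)
    kind = envelope["type"]
    payload = envelope["payload"]
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown command: {kind}")
    missing = REQUIRED_FIELDS[kind] - set(payload)
    if missing:
        raise ValueError(f"{kind} is missing fields: {sorted(missing)}")
    return kind, payload
```

Validating at the edge keeps malformed frames out of the chat service; a real server would likely also reply with an error frame instead of raising.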

4. High Level Design (HLD)

Step 1 - Create Chat

Components

Client
   │
WebSocket
   │
Chat Service
   │
DynamoDB

Chat Table

PK = chatId
chatId
name
metadata

ChatParticipant Table

PK = chatId
SK = participantId

Query:

Get participants of chat

GSI

PK = participantId
SK = chatId

Query:

Get all chats for user
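
The two access patterns above can be sketched with plain dictionaries standing in for the ChatParticipant table and its GSI. This is an in-memory model of the key design only, not DynamoDB client code; the function names are assumptions.

```python
from collections import defaultdict

# Base table: PK = chatId, SK = participantId.
participants_by_chat: dict[str, set[str]] = defaultdict(set)
# GSI: PK = participantId, SK = chatId.
chats_by_participant: dict[str, set[str]] = defaultdict(set)


def add_participant(chat_id: str, participant_id: str) -> None:
    """Write one item. A real GSI is maintained by the database itself."""
    participants_by_chat[chat_id].add(participant_id)
    chats_by_participant[participant_id].add(chat_id)


def get_participants(chat_id: str) -> list[str]:
    """Base-table query: all items with PK = chatId."""
    return sorted(participants_by_chat.get(chat_id, set()))


def get_chats(participant_id: str) -> list[str]:
    """GSI query: all items with PK = participantId."""
    return sorted(chats_by_participant.get(participant_id, set()))


add_participant("chat-1", "alice")
add_participant("chat-1", "bob")
add_participant("chat-2", "alice")
```

Each query touches a single partition key, which is the point of adding the inverted index instead of scanning the base table.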

Step 2 - Send / Receive Messages

Single Server Version

Client
   │
WebSocket
   │
Chat Server

In-Memory Connection Map

unordered_map<
    userId,
    websocketConnection
>

Flow

Send Message
      │
Find Participants
      │
Find WebSocket
      │
Push Message

Works only when everyone is online.
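
A minimal sketch of the single-server flow, assuming a plain callable stands in for each WebSocket connection. `connections` and `send_message` are illustrative names; the return value makes the limitation visible by listing the participants who had no socket.

```python
from typing import Callable

# userId -> callable that pushes a payload down that user's WebSocket.
connections: dict[str, Callable[[dict], None]] = {}


def send_message(sender_id: str, chat_id: str, text: str,
                 participants: list[str]) -> list[str]:
    """Push to every connected participant; return those who were offline."""
    payload = {"chatId": chat_id, "senderId": sender_id, "message": text}
    offline = []
    for user_id in participants:
        if user_id == sender_id:
            continue
        push = connections.get(user_id)
        if push is None:
            offline.append(user_id)  # no socket: the message is lost here
        else:
            push(payload)
    return offline
```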


Step 3 - Offline Messages

Need persistence.


Message Table

messageId
chatId
senderId
content
timestamp

Inbox Table

recipientId
messageId

Purpose:

Track undelivered messages

Send Flow

Sender
   │
Send Message
   │
Chat Service
   │
Write Message
   │
Create Inbox Entry
   │
Deliver Message

ACK Flow

Client Receives Message
        │
        ACK
        │
Delete Inbox Entry

Reconnect Flow

Client Connects
      │
Read Inbox
      │
Read Messages
      │
Deliver
      │
ACK
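
The send, ACK, and reconnect flows can be sketched together, with dictionaries standing in for the Message and Inbox tables. The function names are assumptions; the key property is that an inbox entry survives until the client ACKs it.

```python
messages: dict[str, dict] = {}    # Message table: messageId -> message
inbox: dict[str, list[str]] = {}  # Inbox table: recipientId -> messageIds


def send(message_id: str, chat_id: str, sender_id: str, content: str,
         recipients: list[str]) -> None:
    """Persist the message first, then add one inbox entry per recipient."""
    messages[message_id] = {
        "messageId": message_id,
        "chatId": chat_id,
        "senderId": sender_id,
        "content": content,
    }
    for recipient_id in recipients:
        inbox.setdefault(recipient_id, []).append(message_id)


def ack(recipient_id: str, message_id: str) -> None:
    """Delete the inbox entry once the client confirms delivery."""
    pending = inbox.get(recipient_id, [])
    if message_id in pending:
        pending.remove(message_id)


def on_reconnect(recipient_id: str) -> list[dict]:
    """Return every undelivered message; entries stay until ACKed."""
    return [messages[mid] for mid in inbox.get(recipient_id, [])]
```

Because `ack` ignores unknown entries, a duplicate ACK after a retry is harmless.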

Step 4 - Media Attachments

Bad

Client
  │
Video
  │
Database

Better

Client
  │
Video
  │
Chat Server
  │
S3

Best

Client
    │
Request Upload URL
    │
Chat Server
    │
Pre-Signed URL
    │
S3

Flow

1. Request a pre-signed upload URL from the chat server
2. Upload the file directly to S3
3. Receive the URL of the stored object
4. Send that URL inside the message
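
The idea behind a pre-signed URL can be sketched with an HMAC: the chat server signs the object key plus an expiry, and the storage side checks the signature without calling the chat server. This illustrates the concept only and is not the AWS signing algorithm; `SECRET`, the `storage.example` host, and both function names are hypothetical.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical key shared by signer and verifier


def _signature(key: str, expires_at: int) -> str:
    msg = f"{key}:{expires_at}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def sign_upload_url(key: str, now: int, ttl_seconds: int = 300) -> str:
    """Chat server side: issue a URL that is valid until now + ttl."""
    expires_at = now + ttl_seconds
    sig = _signature(key, expires_at)
    return f"https://storage.example/{key}?expires={expires_at}&sig={sig}"


def verify_upload_url(key: str, expires_at: int, sig: str, now: int) -> bool:
    """Storage side: accept only an unexpired, untampered URL."""
    if now > expires_at:
        return False
    return hmac.compare_digest(sig, _signature(key, expires_at))
```

The large file never passes through the chat server, which only hands out a short-lived, narrowly scoped permission.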

5. Deep Dives

Deep Dive 1 - Scaling Chat Servers

Load balancer

“The important requirement is maintaining long-lived TCP connections for WebSockets. A Layer-4 load balancer forwards the TCP connection to a chat server and keeps that connection pinned to the same server for its lifetime. We also don’t need any Layer-7 features like path-based routing or HTTP inspection, so a Network Load Balancer (NLB) is simpler and has lower overhead. Although modern Layer-7 load balancers support WebSockets, a Layer-4 load balancer is sufficient and generally a better fit for this architecture.”

Problem

User A → Server 1

User B → Server 2

Server 1 cannot directly access B’s websocket.


Solution A - Consistent Hashing

hash(userId)
      │
      ▼
Chat Server

Pros

Predictable Routing

Cons

Complex Rebalancing
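
A minimal consistent-hash ring, as a sketch of how `hash(userId)` could pick a chat server. The virtual-node count, the choice of SHA-256, and the `HashRing` class are assumptions for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable hash (Python's built-in hash() is randomized per process)."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, servers: list[str], vnodes: int = 100) -> None:
        # Each server owns several points ("virtual nodes") on the ring.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, user_id: str) -> str:
        """Walk clockwise to the first point at or after hash(userId)."""
        index = bisect.bisect_left(self._points, _hash(user_id))
        return self._ring[index % len(self._ring)][1]
```

Adding a server only moves the users whose ring segment the new server takes over, but those users still have to drop their WebSocket and reconnect to the new owner, which is where the rebalancing complexity comes from.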

Solution B - Redis Pub/Sub (Preferred)

Subscription

user123
user456
user789

Each server subscribes to connected users.

Flow

Server 1
   │
Publish(userB)
   │
Redis
   │
Server 2
   │
WebSocket
   │
User B
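
The routing flow above can be sketched with an in-process broker standing in for Redis. `Broker` and `ChatServer` are hypothetical classes, not a Redis client; they only model one channel per connected user and fire-and-forget publish.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[str, str], None]  # (channel, message)


class Broker:
    """In-process stand-in for Redis Pub/Sub: fire-and-forget fan-out."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        """Deliver to current subscribers; return how many received it."""
        handlers = self._subscribers.get(channel, [])
        for handler in handlers:
            handler(channel, message)
        return len(handlers)


class ChatServer:
    def __init__(self, name: str, broker: Broker) -> None:
        self.name = name
        self.broker = broker
        self.delivered: list[tuple[str, str]] = []  # (userId, message)

    def on_connect(self, user_id: str) -> None:
        # Subscribe to the channel named after the connected user.
        self.broker.subscribe(user_id, self._push_to_socket)

    def _push_to_socket(self, channel: str, message: str) -> None:
        # Stands in for writing to the user's WebSocket.
        self.delivered.append((channel, message))

    def route(self, recipient_id: str, message: str) -> int:
        return self.broker.publish(recipient_id, message)
```

A publish to a channel with no subscriber returns 0 and the message is simply dropped, so the sending server never needs to know which server holds the recipient's socket.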

Why Not Kafka?

Need:

Topic per User

Not feasible for billions of users: every Kafka topic carries partition and metadata overhead on the brokers. Redis channels are lightweight.


Deep Dive 2 - Redis Reliability

Redis Pub/Sub provides:

At Most Once Delivery

Message may be lost.


Why Still Safe?

Redis = Fast Path
 
Inbox Table = Reliable Path

Message already exists in DB.


Deep Dive 3 - WebSocket Failure

Bad

Rely on TCP Timeout

May take minutes.


Better

ACK Timeout

Message Sent
      │
No ACK
      │
Retry

Best

Heartbeats

PING
PONG

every few seconds.
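
A minimal sketch of server-side heartbeat tracking, assuming the server records the time of each PONG and periodically reaps silent connections. The timeout value and the `HeartbeatMonitor` class are assumptions; time is passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Marks a connection dead when no PONG arrives within the timeout."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_pong: dict[str, float] = {}

    def on_connect(self, client_id: str, now: float) -> None:
        self._last_pong[client_id] = now

    def on_pong(self, client_id: str, now: float) -> None:
        if client_id in self._last_pong:
            self._last_pong[client_id] = now

    def reap(self, now: float) -> list[str]:
        """Drop and return every client whose last PONG is too old."""
        dead = [cid for cid, seen in self._last_pong.items()
                if now - seen > self.timeout_seconds]
        for cid in dead:
            del self._last_pong[cid]
        return dead
```

With a 10-second timeout a dead connection is noticed within seconds rather than the minutes a TCP timeout can take.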


Deep Dive 4 - Lost Redis Messages

Solution 1

Polling

Check Inbox Every N Seconds

Solution 2

Sequence Numbers

101
102
104

Missing:

103

Fetch from DB.


Best

Heartbeat
    +
Sequence Numbers
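
A sketch of gap detection on the client. One way to read the combination above, assumed here, is that each heartbeat carries the server's latest sequence number, so a lost message at the tail is also noticed. `find_gaps` and its parameters are hypothetical names.

```python
def find_gaps(last_contiguous: int, received: set[int],
              server_latest: int) -> list[int]:
    """Return the sequence numbers the client still has to fetch from the DB.

    last_contiguous: highest number below which nothing is missing.
    received:        numbers seen after last_contiguous.
    server_latest:   latest number reported by the server (assumed to
                     arrive with the heartbeat).
    """
    return [seq for seq in range(last_contiguous + 1, server_latest + 1)
            if seq not in received]
```

For the example above, `find_gaps(100, {101, 102, 104}, 104)` reports 103 as missing; if the heartbeat reported 105 instead, 105 would be flagged too even though no later message arrived to reveal the gap.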

Deep Dive 5 - Multi Device Support

Client Table

clientId
userId

Inbox Change

Before

recipientId

After

recipientClientId

Delivery

User
 ├── Mobile
 ├── Laptop
 └── Tablet

Send to every device.
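
A sketch of the inbox change: entries are keyed by `recipientClientId`, so each device tracks and ACKs its own copy. The sample data and the function names are assumptions.

```python
# Client table, grouped by userId for lookup.
clients_by_user: dict[str, list[str]] = {
    "alice": ["alice-mobile", "alice-laptop", "alice-tablet"],
    "bob": ["bob-mobile"],
}

# Inbox keyed by recipientClientId rather than recipientId.
inbox: dict[str, list[str]] = {}


def enqueue(message_id: str, recipient_user_ids: list[str]) -> None:
    """Create one inbox entry per device of every recipient."""
    for user_id in recipient_user_ids:
        for client_id in clients_by_user.get(user_id, []):
            inbox.setdefault(client_id, []).append(message_id)


def ack(client_id: str, message_id: str) -> None:
    """An ACK clears the entry for that device only."""
    pending = inbox.get(client_id, [])
    if message_id in pending:
        pending.remove(message_id)
```

An ACK from the phone leaves the laptop's entry in place, so the laptop still receives the message the next time it connects.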


Deep Dive 6 - Message Ordering

Without extra coordination, a distributed system cannot guarantee a single, perfect message order across servers.


Solution

All servers sync their clocks via:

NTP

On Ingestion

The receiving chat server adds a server timestamp to each message.


Client

ORDER BY timestamp

Users may occasionally see messages reorder.

Acceptable tradeoff.
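
A small sketch of the tradeoff, assuming two chat servers whose clocks differ by a few milliseconds. `stamp`, `ordered`, and the tie-break on `messageId` are assumptions for illustration.

```python
def stamp(message: dict, server_clock_ms: int) -> dict:
    """Ingestion: attach the receiving server's clock reading."""
    return {**message, "timestamp": server_clock_ms}


def ordered(messages: list[dict]) -> list[dict]:
    """Client side: ORDER BY timestamp, ties broken by messageId."""
    return sorted(messages, key=lambda m: (m["timestamp"], m["messageId"]))


# Sent first, ingested by a server whose clock reads 1000 ms.
first = stamp({"messageId": "m1", "content": "question"}, 1000)
# Sent later, but ingested by a server whose clock lags and reads 995 ms.
second = stamp({"messageId": "m2", "content": "answer"}, 995)
```

Here `ordered([first, second])` places `m2` before `m1`: the skew between the two server clocks is enough to flip the display order.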


Deep Dive 7 - Presence / Last Seen

Presence Table

userId
status
lastSeen

Connected

ONLINE

Disconnected

lastSeen = disconnect time

Real-Time Updates

Reuse:

Redis Pub/Sub

for online/offline notifications.
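
A sketch of the presence updates, with a dictionary standing in for the Presence table. The `OFFLINE` status value and the function names are assumptions (the notes only name `ONLINE`), and a user with several devices would need a per-device connection count, which is left out here.

```python
from typing import Optional

presence: dict[str, dict] = {}  # Presence table: userId -> row


def on_connect(user_id: str) -> dict:
    """Mark the user online; the returned row is the event to publish."""
    previous: Optional[int] = presence.get(user_id, {}).get("lastSeen")
    presence[user_id] = {"userId": user_id, "status": "ONLINE",
                         "lastSeen": previous}
    return presence[user_id]


def on_disconnect(user_id: str, now: int) -> dict:
    """Mark the user offline and record lastSeen = disconnect time."""
    presence[user_id] = {"userId": user_id, "status": "OFFLINE",
                         "lastSeen": now}
    return presence[user_id]
```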


Key Interview Takeaways

WebSockets
      ↓
Chat + Participant Tables
      ↓
Message + Inbox Tables
      ↓
ACK Mechanism
      ↓
S3 + Pre-Signed URLs
      ↓
Multiple Chat Servers
      ↓
Redis Pub/Sub
      ↓
Heartbeats
      ↓
Sequence Numbers
      ↓
Multi Device Support
      ↓
Presence / Last Seen