1. Requirements

whatsapp-req.excalidraw

WIP on below.. 👷🏻‍♂️

whatsapp-hdl.excalidraw


2. Core Entities

User
 └── userId
 
Client (Device)
 ├── clientId
 └── userId
 
Chat
 ├── chatId
 └── metadata
 
ChatParticipant
 ├── chatId
 └── userId
 
Message
 ├── messageId
 ├── chatId
 ├── senderId
 ├── content
 └── timestamp
 
Inbox
 ├── recipientId/clientId
 └── messageId

Why Client Entity?

User
 ├── Mobile
 ├── Laptop
 └── Tablet

A user may have multiple devices.
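
The entities above can be sketched as plain records. This is only an illustration: the snake_case field names, the float timestamp, and the helper clients_of are assumptions, not part of the original notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Client:
    client_id: str
    user_id: str  # owner of this device


@dataclass(frozen=True)
class Message:
    message_id: str
    chat_id: str
    sender_id: str
    content: str
    timestamp: float  # assumed: seconds since epoch


@dataclass(frozen=True)
class InboxEntry:
    recipient_client_id: str  # one entry per device, not per user
    message_id: str


def clients_of(user_id: str, clients: list[Client]) -> list[Client]:
    """Return every registered device that belongs to the user."""
    return [c for c in clients if c.user_id == user_id]


devices = [
    Client("mobile-1", "alice"),
    Client("laptop-1", "alice"),
    Client("mobile-2", "bob"),
]
alice_devices = clients_of("alice", devices)
```

Keeping Client separate from User is what lets the Inbox later track delivery per device.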


3. API Design

Use:

WebSockets

Reason:

Bidirectional
Low Latency
Persistent Connection

Client → Server Commands

Create Chat

{
  "participants": [],
  "name": ""
}

Response

{
  "chatId": ""
}

Send Message

{
  "chatId": "",
  "message": "",
  "attachments": []
}

Response

{
  "status": "SUCCESS",
  "messageId": ""
}

Upload Attachment

{
  "body": "...",
  "hash": "..."
}

Modify Participants

{
  "chatId": "",
  "userId": "",
  "operation": "ADD | REMOVE"
}

ACK Message

{
  "messageId": ""
}

Used to confirm delivery.


Server → Client Commands

New Message

{
  "chatId": "",
  "senderId": "",
  "message": ""
}

Chat Updated

{
  "chatId": "",
  "participants": []
}
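
Because every command travels over one WebSocket, each frame needs some way to say which command it carries. A minimal sketch, assuming a hypothetical envelope with "type" and "payload" fields and hypothetical command names ("sendMessage", "ack"); the notes do not define these, and the ACK response shape is also assumed:

```python
import json
import uuid


def handle_send_message(payload: dict) -> dict:
    # A real handler would persist the message first; here we only mint an id.
    return {"status": "SUCCESS", "messageId": str(uuid.uuid4())}


def handle_ack(payload: dict) -> dict:
    # Assumed response shape; the notes only define the ACK request.
    return {"status": "SUCCESS"}


HANDLERS = {"sendMessage": handle_send_message, "ack": handle_ack}


def dispatch(frame: str) -> str:
    """Decode one text frame, route it by its type, and encode the response."""
    envelope = json.loads(frame)
    handler = HANDLERS.get(envelope.get("type"))
    if handler is None:
        return json.dumps({"status": "ERROR", "reason": "unknown command"})
    return json.dumps(handler(envelope.get("payload", {})))


reply = json.loads(dispatch(json.dumps({
    "type": "sendMessage",
    "payload": {"chatId": "c1", "message": "hi", "attachments": []},
})))
```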

4. High Level Design (HLD)

Step 1 — Create Chat

Components

Client

WebSocket

Chat Service

DynamoDB

Chat Table

PK = chatId
chatId
name
metadata

ChatParticipant Table

PK = chatId
SK = participantId

Query:

Get participants of chat

GSI

PK = participantId
SK = chatId

Query:

Get all chats for user
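
The two access patterns can be modelled in memory to see why the GSI exists. This is a sketch, not DynamoDB code: the class and method names are hypothetical, and the index is updated synchronously here, whereas a real GSI is maintained by DynamoDB and is eventually consistent.

```python
from collections import defaultdict


class ChatParticipantTable:
    def __init__(self):
        # Base table: PK = chatId, SK = participantId
        self._by_chat = defaultdict(set)
        # GSI: PK = participantId, SK = chatId
        self._by_participant = defaultdict(set)

    def add(self, chat_id: str, participant_id: str) -> None:
        self._by_chat[chat_id].add(participant_id)
        self._by_participant[participant_id].add(chat_id)

    def remove(self, chat_id: str, participant_id: str) -> None:
        self._by_chat[chat_id].discard(participant_id)
        self._by_participant[participant_id].discard(chat_id)

    def participants_of(self, chat_id: str) -> list[str]:
        """Query on the base table: get participants of a chat."""
        return sorted(self._by_chat.get(chat_id, ()))

    def chats_of(self, participant_id: str) -> list[str]:
        """Query on the GSI: get all chats for a user."""
        return sorted(self._by_participant.get(participant_id, ()))


table = ChatParticipantTable()
table.add("chat-1", "alice")
table.add("chat-1", "bob")
table.add("chat-2", "alice")
```

Without the second index, "all chats for a user" would need a scan over every chat.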

Step 2 — Send / Receive Messages

Single Server Version

Client

WebSocket

Chat Server

In-Memory Connection Map

unordered_map<
    userId,
    websocketConnection
>

Flow

Send Message

Find Participants

Find WebSocket

Push Message

Works only when every recipient is online and connected to this server.
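
The single-server flow can be sketched as follows. FakeConnection stands in for a real WebSocket, and all class and method names are hypothetical. The sketch returns the recipients it could not reach, which makes the limitation visible.

```python
class FakeConnection:
    """Stand-in for a WebSocket connection; records what was pushed."""

    def __init__(self):
        self.sent = []

    def send(self, frame: dict) -> None:
        self.sent.append(frame)


class ChatServer:
    def __init__(self, participants_by_chat: dict):
        self.participants_by_chat = participants_by_chat
        self.connections = {}  # in-memory connection map: userId -> connection

    def connect(self, user_id: str, connection: FakeConnection) -> None:
        self.connections[user_id] = connection

    def send_message(self, chat_id: str, sender_id: str, message: str):
        delivered, missed = [], []
        for user_id in self.participants_by_chat[chat_id]:  # find participants
            if user_id == sender_id:
                continue
            connection = self.connections.get(user_id)  # find WebSocket
            if connection is None:
                missed.append(user_id)  # offline: nothing to push to
                continue
            connection.send({"chatId": chat_id, "senderId": sender_id, "message": message})
            delivered.append(user_id)
        return delivered, missed


server = ChatServer({"chat-1": ["alice", "bob", "carol"]})
bob_connection = FakeConnection()
server.connect("alice", FakeConnection())
server.connect("bob", bob_connection)
delivered, missed = server.send_message("chat-1", "alice", "hi")
```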


Step 3 — Offline Messages

Need persistence.


Message Table

messageId
chatId
senderId
content
timestamp

Inbox Table

recipientId
messageId

Purpose:

Track undelivered messages

Send Flow

Sender

Send Message

Chat Service

Write Message

Create Inbox Entry

Deliver Message

ACK Flow

Client Receives Message

ACK

Delete Inbox Entry

Reconnect Flow

Client Connects

Read Inbox

Read Messages

Deliver

ACK
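
The send, ACK, and reconnect flows fit together as in the sketch below. Dictionaries stand in for the Message and Inbox tables and a list stands in for a WebSocket; the class, its method names, and the generated message ids are all hypothetical.

```python
import itertools


class ChatService:
    def __init__(self):
        self.messages = {}  # Message table: messageId -> message
        self.inbox = {}     # Inbox table: recipientId -> undelivered messageIds
        self.online = {}    # recipientId -> list standing in for a WebSocket
        self._ids = itertools.count(1)

    def send(self, chat_id, sender_id, recipients, content):
        message_id = f"m{next(self._ids)}"
        self.messages[message_id] = {
            "messageId": message_id, "chatId": chat_id,
            "senderId": sender_id, "content": content,
        }
        for recipient_id in recipients:
            # Create the inbox entry before attempting delivery.
            self.inbox.setdefault(recipient_id, []).append(message_id)
            self._deliver(recipient_id, message_id)
        return message_id

    def _deliver(self, recipient_id, message_id):
        socket = self.online.get(recipient_id)
        if socket is not None:
            socket.append(self.messages[message_id])

    def ack(self, recipient_id, message_id):
        pending = self.inbox.get(recipient_id, [])
        if message_id in pending:
            pending.remove(message_id)  # delete inbox entry

    def reconnect(self, recipient_id):
        socket = []
        self.online[recipient_id] = socket
        for message_id in list(self.inbox.get(recipient_id, [])):
            self._deliver(recipient_id, message_id)  # read inbox, then deliver
        return socket


service = ChatService()
first_id = service.send("chat-1", "alice", ["bob"], "hi")  # bob is offline
pending_before = list(service.inbox["bob"])
bob_socket = service.reconnect("bob")
service.ack("bob", first_id)
```

An entry leaves the Inbox only on ACK, so a message pushed to a socket that dies mid-flight is delivered again on the next reconnect.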

Step 4 — Media Attachments

Bad

Client

Video

Database

Better

Client

Video

Chat Server

S3

Best

Client

Request Upload URL

Chat Server

Pre-Signed URL

S3

Flow

1. Get URL
2. Upload directly to S3
3. Receive URL
4. Send URL inside message
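
The idea behind a pre-signed URL is that the server signs "this key, until this time" and the storage layer checks the signature without calling back to the server. The sketch below shows that idea with a plain HMAC. It is not S3's actual signing scheme, and SECRET, BASE_URL, and both function names are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"                   # hypothetical signing key
BASE_URL = "https://uploads.example.com"  # hypothetical storage endpoint


def _sign(key: str, expires_at: int) -> str:
    payload = f"{key}:{expires_at}".encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def presign_upload(key: str, now: int, ttl_seconds: int = 300) -> str:
    """Step 1: the chat server hands out a short-lived upload URL."""
    expires_at = now + ttl_seconds
    query = urlencode({"expires": expires_at, "signature": _sign(key, expires_at)})
    return f"{BASE_URL}/{key}?{query}"


def verify_upload(url: str, now: int) -> bool:
    """Step 2: the storage side checks expiry and signature before accepting."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    key = parsed.path.lstrip("/")
    expected = _sign(key, expires_at)
    return now <= expires_at and hmac.compare_digest(signature.encode(), expected.encode())


url = presign_upload("chat-1/video.mp4", now=1_000)
```

The large upload never passes through the chat server; only the resulting URL travels inside the message.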

Final HLD

                   L4 Load Balancer

        ┌──────────────────┼──────────────────┐
        │                  │                  │
   Chat Server 1     Chat Server 2      Chat Server 3
        │                  │                  │
        └──────────────┬───┴───┬──────────────┘

                  Redis PubSub

                 DynamoDB

      ┌────────────────┴──────────────┐
      │                               │
 Chat Tables                 Message Tables
      │                               │
      └──────────────┬────────────────┘

                    S3

5. Deep Dives

Deep Dive 1 — Scaling Chat Servers

Problem

User A → Server 1
 
User B → Server 2

Server 1 holds no connection to User B, so it cannot push to B's WebSocket directly.


Solution A — Consistent Hashing

hash(userId)


Chat Server

Pros

Predictable Routing

Cons

Complex Rebalancing
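
A minimal hash-ring sketch, assuming SHA-256 for placement and 100 virtual nodes per server (both are illustrative choices; the class and method names are hypothetical):

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers, vnodes: int = 100):
        # Each server owns many points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, user_id: str) -> str:
        """Walk clockwise from hash(userId) to the next server point."""
        index = bisect.bisect_right(self._points, _hash(user_id)) % len(self._ring)
        return self._ring[index][1]


users = [f"user{i}" for i in range(1000)]
three = HashRing(["server-1", "server-2", "server-3"])
four = HashRing(["server-1", "server-2", "server-3", "server-4"])
moved = [u for u in users if three.server_for(u) != four.server_for(u)]
```

Adding server-4 moves only the users it takes over. Each of those users still has to drop its WebSocket and reconnect to the new server, which is where the rebalancing complexity comes from.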

Solution B — Redis Pub/Sub (Preferred)

Subscription

user123
user456
user789

Each server subscribes to the channels of the users connected to it.


Flow

Server 1

Publish(userB)

Redis

Server 2

WebSocket

User B
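
The flow above can be sketched with an in-process stand-in for Redis. InMemoryPubSub is not a Redis client; it only mimics fire-and-forget publish, and every name here is hypothetical. Like Redis PUBLISH, its publish returns the number of subscribers that received the message.

```python
from collections import defaultdict


class InMemoryPubSub:
    """Stand-in for Redis Pub/Sub: no storage, no retries."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # channel -> callbacks

    def subscribe(self, channel: str, callback) -> None:
        self._subscribers[channel].append(callback)

    def publish(self, channel: str, message: dict) -> int:
        callbacks = list(self._subscribers.get(channel, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)  # 0 means nobody was listening: message dropped


class ChatServer:
    def __init__(self, name: str, bus: InMemoryPubSub):
        self.name = name
        self.bus = bus
        self.sockets = {}  # userId -> list standing in for a WebSocket

    def connect(self, user_id: str) -> None:
        self.sockets[user_id] = []
        # Subscribe to the channel named after the connected user.
        self.bus.subscribe(user_id, lambda msg, uid=user_id: self.sockets[uid].append(msg))

    def send_to(self, user_id: str, message: dict) -> int:
        return self.bus.publish(user_id, message)


bus = InMemoryPubSub()
server_1 = ChatServer("server-1", bus)
server_2 = ChatServer("server-2", bus)
server_2.connect("userB")
receivers = server_1.send_to("userB", {"chatId": "c1", "senderId": "userA", "message": "hi"})
dropped = server_1.send_to("userC", {"chatId": "c1", "senderId": "userA", "message": "hi"})
```

Server 1 never learns which server holds User B; it only publishes to B's channel.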

Why Not Kafka?

Need:

Topic per User

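Not feasible for billions of users: every Kafka topic carries broker-side metadata and partition overhead.

Redis channels are lightweight: a channel exists only while it has subscribers.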


Deep Dive 2 — Redis Reliability

Redis Pub/Sub provides:

At Most Once Delivery

A message published while no server is subscribed to the channel is dropped. Redis does not store or retry it.


Why Still Safe?

Redis = Fast Path
 
Inbox Table = Reliable Path

The message and its Inbox entry are written to the DB before publishing, so a dropped publish is recovered from the Inbox.


Deep Dive 3 — WebSocket Failure

Bad

Rely on TCP Timeout

A dead connection may go unnoticed for minutes.


Better

ACK Timeout

Message Sent

No ACK

Retry

Best

Heartbeats

PING
PONG

every few seconds.
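
A heartbeat check can be sketched as a table of last-PONG times that is swept periodically. The class name, its methods, and the 10-second timeout are assumptions; time is passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self._last_pong = {}  # connectionId -> time of the last PONG

    def on_connect(self, connection_id: str, now: float) -> None:
        self._last_pong[connection_id] = now

    def on_pong(self, connection_id: str, now: float) -> None:
        if connection_id in self._last_pong:
            self._last_pong[connection_id] = now

    def reap(self, now: float) -> list[str]:
        """Return and forget connections that have not answered in time."""
        dead = [
            connection_id
            for connection_id, seen in self._last_pong.items()
            if now - seen > self.timeout_seconds
        ]
        for connection_id in dead:
            del self._last_pong[connection_id]
        return dead


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.on_connect("conn-a", now=0)
monitor.on_connect("conn-b", now=0)
monitor.on_pong("conn-a", now=8)  # conn-b never answers
dead_at_12 = monitor.reap(now=12)
```

The server would close each reaped connection and leave its pending messages in the Inbox for the next reconnect.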


Deep Dive 4 — Lost Redis Messages

Solution 1

Polling

Check Inbox Every N Seconds

Solution 2

Sequence Numbers

101
102
104

Missing:

103

Fetch from DB.


Best

Heartbeat
    +
Sequence Numbers
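
Gap detection on the client can be sketched in a few lines. It assumes each recipient sees a contiguous sequence of numbers, which the notes imply but do not spell out; the function name is hypothetical.

```python
def missing_sequences(last_seen: int, received: list[int]) -> list[int]:
    """Return the numbers skipped between last_seen and the highest received."""
    if not received:
        return []
    got = set(received)
    return [n for n in range(last_seen + 1, max(received) + 1) if n not in got]


gaps = missing_sequences(last_seen=100, received=[101, 102, 104])
```

The client would then fetch the messages listed in gaps from the DB.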

Deep Dive 5 — Multi Device Support

Client Table

clientId
userId

Inbox Change

Before

recipientId

After

recipientClientId

Delivery

User
 ├── Mobile
 ├── Laptop
 └── Tablet

Send to every device.
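
Fan-out to devices amounts to writing one Inbox entry per client rather than per user. A minimal sketch, assuming a hypothetical clients_by_user lookup built from the Client table:

```python
def fan_out(message_id: str, recipient_user_ids: list[str], clients_by_user: dict) -> list[dict]:
    """Build one inbox entry per device of every recipient."""
    return [
        {"recipientClientId": client_id, "messageId": message_id}
        for user_id in recipient_user_ids
        for client_id in clients_by_user.get(user_id, [])
    ]


clients_by_user = {
    "alice": ["alice-mobile", "alice-laptop", "alice-tablet"],
    "bob": ["bob-mobile"],
}
entries = fan_out("m1", ["alice", "bob"], clients_by_user)
```

Each device then ACKs on its own, so a laptop that was offline still receives the message after the phone has already acknowledged it.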


Deep Dive 6 — Message Ordering

Distributed systems cannot cheaply guarantee perfect global ordering: server clocks drift and messages travel over different paths.


Solution

All servers sync their clocks via:

NTP

On Ingestion

Server timestamp is added to the message.


Client

ORDER BY timestamp

Users may occasionally see messages reorder when a late arrival carries an earlier timestamp.

Acceptable tradeoff.
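
Client-side ordering can be sketched as a timeline kept sorted by server timestamp. Breaking ties by messageId and ignoring duplicate deliveries are assumptions added for the sketch; the class name is hypothetical.

```python
import bisect


class Timeline:
    def __init__(self):
        self._keys = []      # sorted (timestamp, messageId) pairs
        self._messages = {}  # messageId -> message

    def insert(self, message: dict) -> None:
        message_id = message["messageId"]
        if message_id in self._messages:
            return  # duplicate delivery: already shown
        bisect.insort(self._keys, (message["timestamp"], message_id))
        self._messages[message_id] = message

    def ordered(self) -> list[dict]:
        return [self._messages[message_id] for _, message_id in self._keys]


timeline = Timeline()
timeline.insert({"messageId": "m2", "timestamp": 2.0, "content": "second"})
timeline.insert({"messageId": "m1", "timestamp": 1.0, "content": "first"})  # arrives late
timeline.insert({"messageId": "m3", "timestamp": 2.0, "content": "tie"})
timeline.insert({"messageId": "m2", "timestamp": 2.0, "content": "second"})  # redelivery
ordered_ids = [m["messageId"] for m in timeline.ordered()]
```

The late message m1 is slotted in before m2, which is the reordering the user may notice.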


Deep Dive 7 — Presence / Last Seen

Presence Table

userId
status
lastSeen

Connected

ONLINE

Disconnected

lastSeen = disconnect time

Real-Time Updates

Reuse:

Redis Pub/Sub

for online/offline notifications.
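
A presence tracker can be sketched as below. The "OFFLINE" status value and the per-user connection count (so that a user with several devices stays ONLINE until the last one disconnects) are assumptions; the notes only name ONLINE and lastSeen. Each method returns the event that would be published.

```python
class PresenceTable:
    def __init__(self):
        self._rows = {}  # userId -> {"status", "lastSeen"}
        self._open = {}  # userId -> number of open connections

    def on_connect(self, user_id: str) -> dict:
        self._open[user_id] = self._open.get(user_id, 0) + 1
        self._rows[user_id] = {"status": "ONLINE", "lastSeen": None}
        return dict(self._rows[user_id], userId=user_id)

    def on_disconnect(self, user_id: str, now: float) -> dict:
        remaining = max(self._open.get(user_id, 0) - 1, 0)
        self._open[user_id] = remaining
        if remaining == 0:
            # Last device gone: record the disconnect time as lastSeen.
            self._rows[user_id] = {"status": "OFFLINE", "lastSeen": now}
        return dict(self._rows[user_id], userId=user_id)


presence = PresenceTable()
presence.on_connect("alice")  # mobile
presence.on_connect("alice")  # laptop
after_first = presence.on_disconnect("alice", now=50)
after_second = presence.on_disconnect("alice", now=60)
```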


Key Interview Takeaways

WebSockets

Chat + Participant Tables

Message + Inbox Tables

ACK Mechanism

S3 + Pre-Signed URLs

Multiple Chat Servers

Redis Pub/Sub

Heartbeats

Sequence Numbers

Multi Device Support

Presence / Last Seen