HLD: Design a Notification System

Design a multi-channel notification system โ€” push, email, SMS โ€” with priority queues, fan-out, delivery guarantees, and user preferences.

must hard โฑ 28 min hldnotificationspushemailsmsfan-outqueuesidempotency
Mastery:
Why interviewers ask this
Notifications test multi-channel architecture, delivery guarantees, idempotency, rate limiting, and handling unreliable third-party services.

Step 1: Requirements

Functional:

  • Channels: push notification (iOS + Android), email, SMS
  • Triggered by events: order shipped, friend request, sale, security alert
  • User preferences: opt-out per channel, per notification type
  • Priority: critical (security alerts โ†’ always send), high, low (marketing โ†’ can delay)
  • Scheduled notifications (send at 9am in userโ€™s timezone)
  • Delivery tracking: sent, delivered, opened

Non-functional:

  • 10M notifications/day โ†’ ~120/sec average, 1000/sec peak
  • Critical notifications within 5 seconds
  • At-least-once delivery (with deduplication)
  • Third-party providers may fail โ†’ retry with backoff

Step 2: High-level architecture

Event Sources (Order Service, Auth Service, Marketing Service)
  โ”‚
  โ–ผ
[ Notification Service (API) ]
  โ”œโ”€ Validate request
  โ”œโ”€ Check user preferences (opt-out, do-not-disturb hours)
  โ”œโ”€ Enqueue to priority queue
  โ””โ”€ Return accepted (202)

[ Priority Queues (Kafka) ]
  โ”œโ”€ "notifications.critical"  โ† security, auth
  โ”œโ”€ "notifications.transactional"  โ† order updates, delivery
  โ””โ”€ "notifications.marketing"  โ† promotions, newsletters

[ Channel Workers ]
  โ”œโ”€ Push Worker โ†’ APNs / FCM
  โ”œโ”€ Email Worker โ†’ SES / SendGrid
  โ””โ”€ SMS Worker โ†’ Twilio / SNS

[ Delivery Log DB ]
  โ””โ”€ Records every send attempt + status

Step 3: Data model

-- Notification templates
CREATE TABLE notification_templates (
  id            UUID PRIMARY KEY,
  type          VARCHAR(50),    -- 'order_shipped', 'friend_request', etc.
  channel       VARCHAR(10),    -- 'push', 'email', 'sms'
  subject       TEXT,           -- email subject / push title
  body_template TEXT,           -- Handlebars template: "Hi {{name}}, your order..."
  priority      VARCHAR(10)     -- 'critical', 'high', 'low'
);

-- User notification preferences
CREATE TABLE notification_preferences (
  user_id           UUID,
  notification_type VARCHAR(50),
  channel           VARCHAR(10),
  enabled           BOOLEAN DEFAULT TRUE,
  PRIMARY KEY (user_id, notification_type, channel)
);

-- User device tokens (for push)
CREATE TABLE device_tokens (
  user_id       UUID,
  token         TEXT,
  platform      VARCHAR(10),  -- 'ios', 'android'
  app_version   TEXT,
  last_active   TIMESTAMPTZ,
  PRIMARY KEY (user_id, token)
);

-- Delivery log (append-only, partitioned by date)
CREATE TABLE notification_log (
  id              UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  idempotency_key VARCHAR(128) UNIQUE,  -- prevents duplicate sends
  user_id         UUID,
  channel         VARCHAR(10),
  template_id     UUID,
  payload         JSONB,
  status          VARCHAR(20),  -- 'queued', 'sent', 'failed', 'delivered', 'opened'
  provider_ref    TEXT,         -- APNs message ID, SES message ID
  sent_at         TIMESTAMPTZ,
  created_at      TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY RANGE (created_at);

Step 4: Notification service (API layer)

class NotificationService:
    def send(self, request: NotificationRequest) -> str:
        # 1. Idempotency check
        if self.log.exists(request.idempotency_key):
            return "already_queued"
        
        # 2. Load user preferences
        prefs = self.prefs.get(request.user_id, request.type)
        if not prefs.is_channel_enabled(request.channel):
            return "user_opted_out"
        
        # 3. Do-not-disturb check (for non-critical)
        if request.priority != 'critical':
            user_tz = self.users.get_timezone(request.user_id)
            if is_do_not_disturb(user_tz):
                return self._schedule_for_morning(request)
        
        # 4. Render template
        template = self.templates.get(request.type, request.channel)
        rendered = template.render(request.data)
        
        # 5. Enqueue
        topic = f"notifications.{request.priority}"
        self.kafka.produce(topic, {
            "idempotency_key": request.idempotency_key,
            "user_id": request.user_id,
            "channel": request.channel,
            "rendered": rendered,
        })
        
        # 6. Log as queued
        self.log.insert(request.idempotency_key, status='queued')
        return "queued"

Step 5: Channel workers

Push notification worker (APNs / FCM)

class PushWorker:
    def process(self, message: dict):
        user_id = message["user_id"]
        
        # Get all active device tokens for user
        tokens = self.db.get_device_tokens(user_id)
        
        for token in tokens:
            try:
                if token.platform == 'ios':
                    response = self.apns.send(
                        device_token=token.token,
                        title=message["rendered"]["title"],
                        body=message["rendered"]["body"],
                        badge=1,
                        priority=10 if message["priority"] == 'critical' else 5,
                    )
                else:
                    response = self.fcm.send(
                        token=token.token,
                        notification={
                            "title": message["rendered"]["title"],
                            "body": message["rendered"]["body"],
                        },
                        android={"priority": "high"},
                    )
                
                self.log.update(message["idempotency_key"], status='sent', provider_ref=response.id)
                
            except TokenExpiredError:
                # Token is invalid โ€” remove it
                self.db.delete_device_token(user_id, token.token)
                
            except ProviderRateLimitError as e:
                # Re-queue with exponential backoff
                self.kafka.produce_delayed(message, delay_seconds=e.retry_after)
                
            except Exception:
                self.log.update(message["idempotency_key"], status='failed')
                raise  # Let Kafka consumer retry

Email worker (SES / SendGrid)

class EmailWorker:
    def process(self, message: dict):
        user = self.db.get_user(message["user_id"])
        
        # SES with unsubscribe header (required by CAN-SPAM / GDPR)
        self.ses.send_email(
            to=user.email,
            subject=message["rendered"]["subject"],
            html_body=message["rendered"]["html"],
            text_body=message["rendered"]["text"],  # plain text fallback
            headers={
                "List-Unsubscribe": f"<https://app.example.com/unsubscribe/{user.id}>",
                "List-Unsubscribe-Post": "List-Unsubscribe=One-Click",
            }
        )

Step 6: At-least-once delivery + idempotency

Problem: the worker may crash after sending the notification but before acking the Kafka message. On restart, it will re-process and send the notification again (duplicate).

Solution: idempotency key + deduplication table:

-- Before sending, check if already sent
SELECT 1 FROM notification_log 
WHERE idempotency_key = $1 AND status = 'sent';

-- If not, send and update status atomically
-- The UNIQUE constraint on idempotency_key prevents double-inserts

Idempotency key format: {event_type}:{user_id}:{event_id} โ€” same event always produces the same key.

On third-party provider duplicate: APNs and FCM both support idempotency keys (APNs: apns-collapse-id, FCM: collapseKey) โ€” duplicate messages to the same device are collapsed.


Step 7: Fan-out for broadcast notifications

Marketing campaign: send to 10M users simultaneously.

Naรฏve approach: loop over all user IDs and call /send for each โ†’ slow.

Scalable approach:

Marketing API
  โ””โ”€ Creates one "broadcast" job: { segment: 'all_us_users', template: '...', send_at: '...' }

Broadcast Scheduler (cron job at send_at)
  โ””โ”€ Queries user segments in batches of 10K
  โ””โ”€ Produces 10K user IDs to Kafka per batch
  โ””โ”€ Workers consume and enqueue per-user notifications

Rate limiting: max 1 email/user/day per campaign type
              respect Do-Not-Disturb windows
              skip users who unsubscribed in the last 24h

Batch size of 10K = 1,000 batches for 10M users. With 50 workers each consuming 1 batch/second, full fan-out in ~20 seconds.


Step 8: Failure scenarios

FailureHandling
APNs / FCM is downRetry with exponential backoff (1s, 2s, 4sโ€ฆ max 5 min); dead-letter queue after N retries
Invalid tokenDelete token from DB; donโ€™t retry
User unsubscribedCheck at dequeue time, not just at enqueue
DB unavailableQueue stays intact; workers pause; no data loss
Notification sent but status not updatedIdempotency key ensures re-attempt is a no-op

Say it out loud
โ€œThree priority queues in Kafka โ€” critical, transactional, marketing. Channel workers (push, email, SMS) consume independently so a slow email provider doesnโ€™t block push. Each notification has an idempotency key โ€” the worker checks before sending and updates status after. If the worker crashes between send and ack, it re-reads the message, checks idempotency key, sees it was already sent, and skips. For broadcast campaigns, a scheduler pages through user segments in batches and produces to Kafka โ€” actual sending is fully async and rate-limited.โ€

Likely follow-up questions
  • How do you ensure a notification is delivered exactly once?
  • How do you handle APNs/FCM failures?
  • How would you implement a user preference center?
  • How do you rate-limit notifications per user?
  • How do you handle millions of users for a broadcast notification?

References