Learn these components once and you can assemble most systems. Click through the blueprint, then read the โwhen to reach for itโ notes โ each block lists its job, the main alternative, and the cost.
Load balancer (L4 vs L7)
The moment you have more than one app server, you need a load balancer to spread traffic, health-check dead nodes, and enable zero-downtime deploys.
- L4 (transport layer) โ routes by IP/port, forwards TCP/UDP packets without inspecting the payload. Very fast, protocol-agnostic, canโt make content-aware decisions. Good for raw throughput.
- L7 (application layer) โ understands HTTP: can route by path/host/header (
/api/*โ service A,/img/*โ service B), terminate TLS, do sticky sessions, retries, and compression. More CPU per request but far more capable. Most modern setups use L7 (NGINX, Envoy, ALB).
Balancing algorithms: round-robin (simple, assumes uniform requests), least-connections (better when request durations vary), weighted (heterogeneous machines), consistent hashing / IP-hash (keeps a client mapped to the same backend โ useful for cache locality, but couples you to instances).
Reverse proxy & API gateway
A reverse proxy sits in front of your servers and handles cross-cutting concerns: TLS termination, compression, static caching, request buffering. An API gateway is a specialized reverse proxy for an API/microservices fleet that centralizes:
- AuthN/AuthZ (validate tokens once at the edge)
- Rate limiting & quotas
- Request routing / aggregation (fan one client call out to several services)
- Protocol translation (REST โ gRPC), logging, metrics
The tradeoff: itโs a single choke point and a potential SPOF, so it must be replicated and kept thin. Donโt put business logic in the gateway โ that recreates a monolith at the edge.
Stateless app tier
Keep no per-user state in a serverโs memory. Put sessions/state in a shared store (Redis) so any node can serve any request, letting you add/remove nodes freely. This is the precondition for horizontal scaling โ a stateful server canโt be load-balanced or autoscaled cleanly. If you ever say โscale horizontally,โ the implicit prerequisite is โthe tier is stateless.โ
Caching
When reads dominate and the same data is read repeatedly, a cache cuts latency and offloads the DB. (Full depth โ strategies, eviction, the three big failure modes, CDN internals, Redis structures โ is in the caching & CDN lesson.) The essentials:
- Where: browser โ CDN โ app-level (Redis/Memcached) โ DB buffer pool. Cache as close to the user as the dataโs freshness allows.
- Default strategy: cache-aside โ on read, check cache; on miss, read DB and populate with a TTL. The hard part is invalidation on writes.
- Eviction: LRU (default), LFU, TTL.
Donโt cache write-heavy or rarely-reread data โ low hit rate makes the cache pure overhead.
CDN
The moment you serve static assets (JS, images, video) to users across regions. A CDN caches at edge PoPs near users, cutting round-trip latency from ~100ms to ~10ms and offloading your origin. Version asset URLs (app.a1b2.js) so a deploy doesnโt serve stale files, and set long Cache-Control on immutable assets. (Push vs pull, cache keys, and dynamic-content edge caching are covered in the caching lesson.)
Message queues vs pub/sub vs streams
These get conflated constantly; knowing the distinction is a senior signal.
| Work queue (SQS, RabbitMQ) | Pub/Sub (SNS, Google Pub/Sub) | Log / stream (Kafka, Kinesis) | |
|---|---|---|---|
| Delivery model | One consumer per message (competing consumers) | Fan-out: every subscriber gets a copy | Append-only log; consumers track their own offset |
| Message after read | Deleted/acked | Delivered to all subscribers | Retained (by time/size); replayable |
| Ordering | Usually best-effort | Best-effort | Strict per-partition order |
| Use it for | Offload slow/spiky jobs (email, image resize) | Notify many independent services of an event | Event sourcing, analytics pipelines, replay, multiple readers of the same firehose |
Kafka deep-ish: a topic is split into partitions; each partition is an ordered, immutable log. Producers pick a partition (often by key, so all events for a user land in order). Consumers in a consumer group each own some partitions โ parallelism is bounded by partition count. Consumers commit an offset, so they can replay from any point, and slow/new consumers donโt lose data. That retention + replay is why Kafka is the backbone of streaming and event-driven systems, where a transient SQS queue canโt help you.
All three give at-least-once delivery in practice, so design consumers to be idempotent (processing the same message twice is safe). Exactly-once is expensive and usually faked via idempotency keys.
Databases (overview)
Replicas first, shards later โ both are levers, not defaults. (Full depth โ SQL vs NoSQL families, indexing, isolation, replication topologies โ is in the databases lesson.)
- Read replicas scale reads: writes โ primary, reads โ replicas. Cost: replication lag (a read right after a write may be stale).
- Sharding/partitioning scales writes and storage when one machine canโt hold the load. Split by a shard key chosen from your access patterns. Cost: cross-shard queries/transactions get hard; a bad key creates hot shards. Shard late โ itโs a big jump in complexity.
Object storage (S3)
For large blobs โ uploads, images, video, backups, logs. Store the object in S3 and only the key/URL in the DB. Itโs cheap, durable (eleven nines of durability), and effectively infinite, but itโs not a database: no queries, higher latency than a local disk, eventual-ish listing. Pair it with a CDN for delivery and use pre-signed URLs so clients upload/download directly without proxying bytes through your app tier.
Search index
A relational LIKE '%term%' canโt do full-text search at scale (no ranking, no fuzzy matching, full scans). Reach for a search index (Elasticsearch / OpenSearch) when you need full-text search, relevance ranking, faceting, or autocomplete. Itโs a secondary store: you keep the source of truth in your primary DB and stream changes into the index (via CDC or an outbox/queue). The tradeoff is a second system to operate and eventual consistency between DB and index โ never make Elasticsearch your system of record.
Rate limiter
Protects services from abuse and overload by capping requests per client per window (return 429 when exceeded). The standard algorithm is token bucket (allows bursts, bounds the average). In a fleet, the counter must be shared state in Redis so the limit holds across all nodes. (Full treatment โ algorithms and distributed concerns โ is in classics.) Usually lives at the API gateway.
Service discovery
In a dynamic fleet (containers, autoscaling), instances come and go with changing IPs. Service discovery lets a service find healthy instances of another without hardcoded addresses:
- Client-side (e.g. Consul/Eureka): the client queries a registry and picks an instance.
- Server-side (e.g. Kubernetes Services, a load balancer with a registry): clients hit a stable virtual address; the platform routes to a healthy pod.
A registry plus health checks keeps traffic off dead nodes. In a Kubernetes world this is largely built in (Services + DNS), which is why you often donโt draw it explicitly โ but mentioning it shows you know how the boxes actually find each other.
Putting it together
โโโโโโโโโ CDN (static, edge) โโโโโโโโโ
Client โโโโค โ
โโโบ Load Balancer / API Gateway โ
(TLS, auth, rate limit, route) โ
โ
โผ
Stateless App Tier โโโบ Cache (Redis)
โ โ miss
โโโโโโโโโโโโโโโโโโโโ
โผ
Primary DB โโโบ Read Replicas / Shards
โ
โโโบ Queue/Kafka โโบ Workers โโบ (email, index, analytics)
โโโบ Object Storage (S3) โโโ CDN delivers blobs
Interview questions & model answers
Q: Why must the app tier be stateless to scale horizontally? โIf a server holds session state in memory, requests must be pinned to that server (sticky sessions), so the load balancer canโt freely distribute load, autoscaling loses state when a node dies, and rolling deploys drop sessions. Moving state to a shared store (Redis) lets any node serve any request โ thatโs what makes adding nodes a pure capacity gain.โ
Q: L4 vs L7 load balancing โ when does the difference matter? โL4 routes by IP/port without reading the payload โ fastest, protocol-agnostic. L7 understands HTTP, so it can route by path/host/header, terminate TLS, retry, and do content-aware balancing. I use L7 whenever I need path-based routing, TLS termination at the edge, or smarter health/retry logic โ which is almost always for HTTP services. L4 when I need raw throughput or non-HTTP protocols.โ
Q: Queue vs pub/sub vs Kafka? โWork queue for distributing jobs to one of N workers โ message is consumed once. Pub/sub for fanning an event out to many independent subscribers. Kafka when I need an ordered, retained, replayable log that multiple consumer groups can read independently โ event sourcing, analytics, or replay after a bug. The deciding questions are: one consumer or many? and do I need replay and ordering?โ
Q: Why idempotent consumers? โQueues deliver at-least-once โ a consumer can crash after processing but before acking, so the message is redelivered. If processing isnโt idempotent, that double-charges a card or sends two emails. I make handlers idempotent with a dedup key or upsert, so reprocessing is a no-op.โ
Q: When does object storage beat the database? โFor large binary blobs โ images, video, backups. The DB stays small and fast holding only the key/metadata, while S3 handles cheap, durable, infinite blob storage. I serve via CDN and let clients upload directly with pre-signed URLs so bytes never flow through my app tier. The DB is for queryable structured data; S3 is for bytes you fetch by key.โ
Q: Where do you put a rate limiter and why there? โAt the API gateway / edge, so abusive traffic is rejected before it consumes app or DB resources. The counter lives in Redis as shared state so the limit is enforced fleet-wide, not per instance. Token bucket so legitimate bursts are allowed while the average stays bounded.โ
Q: You said โadd Elasticsearch.โ Is it your database? โNo โ itโs a secondary index for full-text search and ranking. The source of truth stays in my primary DB; I stream changes into Elasticsearch via CDC or an outbox so they stay in sync, accepting eventual consistency between them. Making it the system of record would risk data loss and reindex pain.โ
Common mistakes / what weak candidates do
- Adding every component reflexively (โLB, Redis, Kafka, Elasticsearchโ) without a requirement justifying each.
- Confusing pub/sub with a work queue, or calling Kafka โa queueโ without mentioning retention/replay/partitions.
- Putting business logic in the API gateway, recreating a distributed monolith.
- Forgetting consumers must be idempotent under at-least-once delivery.
- Sharding too early (or sharding before maxing out replicas + caching).
- Treating Elasticsearch/Redis as the source of truth instead of derived stores.
- Ignoring statefulness โ proposing horizontal scaling while leaving sessions in app memory.
- Storing blobs in the database (BLOB columns) instead of object storage + a key.