The building blocks of a scalable service

Load balancers, gateways, caching, CDN, queues vs pub/sub vs streams, databases, object storage, search, rate limiters, service discovery โ€” each with when to reach for it and the tradeoff.

must hard โฑ 34 min scalingcachequeuekafkashardingload-balancingcdngateway
Mastery:
Why interviewers ask this
Most backend design answers are an assembly of the same dozen components. Knowing each one's job, the alternatives, and the tradeoff lets you design almost anything.

Learn these components once and you can assemble most systems. Click through the blueprint, then read the โ€œwhen to reach for itโ€ notes โ€” each block lists its job, the main alternative, and the cost.

Anatomy of a scalable web service
The building blocks behind almost every backend system-design answer. Click any box for its job and the tradeoff it represents โ€” assemble these and you can design most things.
Click a component to see its role and the decision behind it.

Load balancer (L4 vs L7)

The moment you have more than one app server, you need a load balancer to spread traffic, health-check dead nodes, and enable zero-downtime deploys.

  • L4 (transport layer) โ€” routes by IP/port, forwards TCP/UDP packets without inspecting the payload. Very fast, protocol-agnostic, canโ€™t make content-aware decisions. Good for raw throughput.
  • L7 (application layer) โ€” understands HTTP: can route by path/host/header (/api/* โ†’ service A, /img/* โ†’ service B), terminate TLS, do sticky sessions, retries, and compression. More CPU per request but far more capable. Most modern setups use L7 (NGINX, Envoy, ALB).

Balancing algorithms: round-robin (simple, assumes uniform requests), least-connections (better when request durations vary), weighted (heterogeneous machines), consistent hashing / IP-hash (keeps a client mapped to the same backend โ€” useful for cache locality, but couples you to instances).

Rule of thumb
Prefer least-connections over round-robin when request latencies are uneven (some endpoints are slow), and prefer stateless services over sticky sessions so the LB is free to send any request anywhere.

Reverse proxy & API gateway

A reverse proxy sits in front of your servers and handles cross-cutting concerns: TLS termination, compression, static caching, request buffering. An API gateway is a specialized reverse proxy for an API/microservices fleet that centralizes:

  • AuthN/AuthZ (validate tokens once at the edge)
  • Rate limiting & quotas
  • Request routing / aggregation (fan one client call out to several services)
  • Protocol translation (REST โ†” gRPC), logging, metrics

The tradeoff: itโ€™s a single choke point and a potential SPOF, so it must be replicated and kept thin. Donโ€™t put business logic in the gateway โ€” that recreates a monolith at the edge.

Stateless app tier

Keep no per-user state in a serverโ€™s memory. Put sessions/state in a shared store (Redis) so any node can serve any request, letting you add/remove nodes freely. This is the precondition for horizontal scaling โ€” a stateful server canโ€™t be load-balanced or autoscaled cleanly. If you ever say โ€œscale horizontally,โ€ the implicit prerequisite is โ€œthe tier is stateless.โ€

Caching

When reads dominate and the same data is read repeatedly, a cache cuts latency and offloads the DB. (Full depth โ€” strategies, eviction, the three big failure modes, CDN internals, Redis structures โ€” is in the caching & CDN lesson.) The essentials:

  • Where: browser โ†’ CDN โ†’ app-level (Redis/Memcached) โ†’ DB buffer pool. Cache as close to the user as the dataโ€™s freshness allows.
  • Default strategy: cache-aside โ€” on read, check cache; on miss, read DB and populate with a TTL. The hard part is invalidation on writes.
  • Eviction: LRU (default), LFU, TTL.

Donโ€™t cache write-heavy or rarely-reread data โ€” low hit rate makes the cache pure overhead.

CDN

The moment you serve static assets (JS, images, video) to users across regions. A CDN caches at edge PoPs near users, cutting round-trip latency from ~100ms to ~10ms and offloading your origin. Version asset URLs (app.a1b2.js) so a deploy doesnโ€™t serve stale files, and set long Cache-Control on immutable assets. (Push vs pull, cache keys, and dynamic-content edge caching are covered in the caching lesson.)

Message queues vs pub/sub vs streams

These get conflated constantly; knowing the distinction is a senior signal.

Work queue (SQS, RabbitMQ)Pub/Sub (SNS, Google Pub/Sub)Log / stream (Kafka, Kinesis)
Delivery modelOne consumer per message (competing consumers)Fan-out: every subscriber gets a copyAppend-only log; consumers track their own offset
Message after readDeleted/ackedDelivered to all subscribersRetained (by time/size); replayable
OrderingUsually best-effortBest-effortStrict per-partition order
Use it forOffload slow/spiky jobs (email, image resize)Notify many independent services of an eventEvent sourcing, analytics pipelines, replay, multiple readers of the same firehose

Kafka deep-ish: a topic is split into partitions; each partition is an ordered, immutable log. Producers pick a partition (often by key, so all events for a user land in order). Consumers in a consumer group each own some partitions โ€” parallelism is bounded by partition count. Consumers commit an offset, so they can replay from any point, and slow/new consumers donโ€™t lose data. That retention + replay is why Kafka is the backbone of streaming and event-driven systems, where a transient SQS queue canโ€™t help you.

Rule of thumb
Need to distribute work to one of N workers? โ†’ queue. Need to notify many services of an event? โ†’ pub/sub. Need replay, ordering, and multiple independent consumers of the same stream? โ†’ Kafka.

All three give at-least-once delivery in practice, so design consumers to be idempotent (processing the same message twice is safe). Exactly-once is expensive and usually faked via idempotency keys.

Databases (overview)

Replicas first, shards later โ€” both are levers, not defaults. (Full depth โ€” SQL vs NoSQL families, indexing, isolation, replication topologies โ€” is in the databases lesson.)

  • Read replicas scale reads: writes โ†’ primary, reads โ†’ replicas. Cost: replication lag (a read right after a write may be stale).
  • Sharding/partitioning scales writes and storage when one machine canโ€™t hold the load. Split by a shard key chosen from your access patterns. Cost: cross-shard queries/transactions get hard; a bad key creates hot shards. Shard late โ€” itโ€™s a big jump in complexity.

Object storage (S3)

For large blobs โ€” uploads, images, video, backups, logs. Store the object in S3 and only the key/URL in the DB. Itโ€™s cheap, durable (eleven nines of durability), and effectively infinite, but itโ€™s not a database: no queries, higher latency than a local disk, eventual-ish listing. Pair it with a CDN for delivery and use pre-signed URLs so clients upload/download directly without proxying bytes through your app tier.

Search index

A relational LIKE '%term%' canโ€™t do full-text search at scale (no ranking, no fuzzy matching, full scans). Reach for a search index (Elasticsearch / OpenSearch) when you need full-text search, relevance ranking, faceting, or autocomplete. Itโ€™s a secondary store: you keep the source of truth in your primary DB and stream changes into the index (via CDC or an outbox/queue). The tradeoff is a second system to operate and eventual consistency between DB and index โ€” never make Elasticsearch your system of record.

Rate limiter

Protects services from abuse and overload by capping requests per client per window (return 429 when exceeded). The standard algorithm is token bucket (allows bursts, bounds the average). In a fleet, the counter must be shared state in Redis so the limit holds across all nodes. (Full treatment โ€” algorithms and distributed concerns โ€” is in classics.) Usually lives at the API gateway.

Service discovery

In a dynamic fleet (containers, autoscaling), instances come and go with changing IPs. Service discovery lets a service find healthy instances of another without hardcoded addresses:

  • Client-side (e.g. Consul/Eureka): the client queries a registry and picks an instance.
  • Server-side (e.g. Kubernetes Services, a load balancer with a registry): clients hit a stable virtual address; the platform routes to a healthy pod.

A registry plus health checks keeps traffic off dead nodes. In a Kubernetes world this is largely built in (Services + DNS), which is why you often donโ€™t draw it explicitly โ€” but mentioning it shows you know how the boxes actually find each other.

Putting it together

            โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ CDN (static, edge) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  Client โ”€โ”€โ”€โ”ค                                     โ”‚
            โ””โ”€โ–บ Load Balancer / API Gateway       โ”‚
                  (TLS, auth, rate limit, route)  โ”‚
                        โ”‚                          
                        โ–ผ                          
                Stateless App Tier  โ”€โ”€โ–บ Cache (Redis)
                        โ”‚                  โ”‚ miss
                        โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                        โ–ผ
              Primary DB โ”€โ”€โ–บ Read Replicas / Shards
                        โ”‚
                        โ”œโ”€โ–บ Queue/Kafka โ”€โ–บ Workers โ”€โ–บ (email, index, analytics)
                        โ””โ”€โ–บ Object Storage (S3) โ—„โ”€โ”€ CDN delivers blobs

Interview questions & model answers

Q: Why must the app tier be stateless to scale horizontally? โ€œIf a server holds session state in memory, requests must be pinned to that server (sticky sessions), so the load balancer canโ€™t freely distribute load, autoscaling loses state when a node dies, and rolling deploys drop sessions. Moving state to a shared store (Redis) lets any node serve any request โ€” thatโ€™s what makes adding nodes a pure capacity gain.โ€

Q: L4 vs L7 load balancing โ€” when does the difference matter? โ€œL4 routes by IP/port without reading the payload โ€” fastest, protocol-agnostic. L7 understands HTTP, so it can route by path/host/header, terminate TLS, retry, and do content-aware balancing. I use L7 whenever I need path-based routing, TLS termination at the edge, or smarter health/retry logic โ€” which is almost always for HTTP services. L4 when I need raw throughput or non-HTTP protocols.โ€

Q: Queue vs pub/sub vs Kafka? โ€œWork queue for distributing jobs to one of N workers โ€” message is consumed once. Pub/sub for fanning an event out to many independent subscribers. Kafka when I need an ordered, retained, replayable log that multiple consumer groups can read independently โ€” event sourcing, analytics, or replay after a bug. The deciding questions are: one consumer or many? and do I need replay and ordering?โ€

Q: Why idempotent consumers? โ€œQueues deliver at-least-once โ€” a consumer can crash after processing but before acking, so the message is redelivered. If processing isnโ€™t idempotent, that double-charges a card or sends two emails. I make handlers idempotent with a dedup key or upsert, so reprocessing is a no-op.โ€

Q: When does object storage beat the database? โ€œFor large binary blobs โ€” images, video, backups. The DB stays small and fast holding only the key/metadata, while S3 handles cheap, durable, infinite blob storage. I serve via CDN and let clients upload directly with pre-signed URLs so bytes never flow through my app tier. The DB is for queryable structured data; S3 is for bytes you fetch by key.โ€

Q: Where do you put a rate limiter and why there? โ€œAt the API gateway / edge, so abusive traffic is rejected before it consumes app or DB resources. The counter lives in Redis as shared state so the limit is enforced fleet-wide, not per instance. Token bucket so legitimate bursts are allowed while the average stays bounded.โ€

Q: You said โ€˜add Elasticsearch.โ€™ Is it your database? โ€œNo โ€” itโ€™s a secondary index for full-text search and ranking. The source of truth stays in my primary DB; I stream changes into Elasticsearch via CDC or an outbox so they stay in sync, accepting eventual consistency between them. Making it the system of record would risk data loss and reindex pain.โ€

Common mistakes / what weak candidates do

  • Adding every component reflexively (โ€œLB, Redis, Kafka, Elasticsearchโ€) without a requirement justifying each.
  • Confusing pub/sub with a work queue, or calling Kafka โ€œa queueโ€ without mentioning retention/replay/partitions.
  • Putting business logic in the API gateway, recreating a distributed monolith.
  • Forgetting consumers must be idempotent under at-least-once delivery.
  • Sharding too early (or sharding before maxing out replicas + caching).
  • Treating Elasticsearch/Redis as the source of truth instead of derived stores.
  • Ignoring statefulness โ€” proposing horizontal scaling while leaving sessions in app memory.
  • Storing blobs in the database (BLOB columns) instead of object storage + a key.

How to use this in a round
Start with client โ†’ LB โ†’ stateless app โ†’ DB. Then, driven by the non-functional requirements, add: a cache (read-heavy), replicas/shards (scale), a queue or Kafka (async/spiky/event-driven work), a CDN (global static), object storage (blobs), a search index (full-text), a rate limiter (abuse). Justify each addition by the requirement that forced it, and name the cost โ€” thatโ€™s the difference between assembling a system and listing buzzwords.

Likely follow-up questions
  • Why must app servers be stateless to scale horizontally?
  • L4 vs L7 load balancing โ€” when does it matter?
  • Queue vs pub/sub vs a log like Kafka โ€” when each?
  • When do you reach for object storage vs the database?
  • How does a CDN decide what to cache and for how long?

References