This is one of those interview questions that looks simple on the surface — "design a key-value store" — and then quietly murders you with its depth. I've seen engineers at senior and staff level stumble on this not because they lacked knowledge, but because they didn't connect the dots between storage internals, distributed systems theory, and the very specific constraints baked into the problem.

Let's walk through it properly. Not as a checklist, but as a story.

Start by Respecting the Constraints

Most candidates treat constraints as footnotes. They say "we'll shard by key" and move on. But three constraints here are load-bearing walls, not decorations.

Append-only writes mean you can never modify a file in place. Every update is a new write at the end of a log. Strong consistency means reads must always reflect the latest committed write, even during failures and even from a different server. And values up to 2GB mean you cannot buffer the full value in memory — not on writes, not on reads.

These three together eliminate a huge swath of obvious designs. No in-place hash tables. No loading everything into Redis. No hand-waving about eventual consistency.

The Write Path: Log-Structured Everything

The append-only constraint is actually a gift. Sequential writes to disk are dramatically faster than random writes. This is the core insight behind Log-Structured Merge Trees (LSM-trees), used by RocksDB, Cassandra, and LevelDB.

When a write comes in, it lands first in a Write-Ahead Log (WAL) — an append-only file on disk that guarantees durability even if the server crashes a millisecond later. Simultaneously, it goes into a memtable, an in-memory sorted structure that absorbs writes at memory speed.

When the memtable fills up, it gets flushed to disk as an immutable SSTable (Sorted String Table). SSTables are sorted, indexed, and never modified. Deletes become tombstone entries — a special marker that says "this key is gone." They only get cleaned up during compaction.

Compaction is the tax you pay for sequential write performance. For a read-heavy workload I'd use leveled compaction, which keeps fewer SSTables per level and reduces read amplification. For write-heavy, tiered compaction wins because it causes less write amplification during ingest. Given the problem asks us to handle both, I'd default to leveled with tunable parameters.

Handling the 2GB problem: value separation

If you stuff 2GB values directly into your SSTables, compaction becomes catastrophic. You're re-reading and re-writing massive blobs every time you merge levels. This is called write amplification.

The fix is value separation, from the WiscKey paper. Keep only the key and a small pointer (file ID + byte offset) in the SSTable, and store the actual blob in a separate append-only value log. Now compaction only touches tiny metadata. The giant blob stays put until you explicitly garbage collect it.

For reads and writes of large values you always stream in chunks — 4 to 16MB each. Never buffer the whole thing. Each chunk gets a checksum, and you maintain a chunk manifest per key listing all the chunks and their offsets. This enables resumable uploads, range reads, and crash recovery without replaying from the beginning.

The Read Path: Making Consistency Real

Strong consistency means linearizability: if I write a value and you read it immediately after from a different server, you must see my write. No stale data.

The standard approach is leader-based replication with a consensus protocol like Raft. Each partition has a single leader that processes all writes. A write is only acknowledged to the client after a majority of replicas have persisted it. Reads go to the leader by default, which guarantees you're reading the latest committed state.

The tricky part is failover. If the leader dies and a new one is elected, a stale ex-leader might still think it's in charge and serve old data. You prevent this with fencing tokens — an epoch number that increments with every leader election. Any replica operating with an old epoch gets rejected. This is non-negotiable for true linearizability.

For read-heavy workloads, forcing every read through the leader creates a bottleneck. The alternative is lease-based follower reads: the leader grants a timed lease to followers guaranteeing it won't commit anything new without them. Followers can safely serve reads within that window. This scales reads horizontally without sacrificing consistency.

The read path itself works like this: check the memtable first, then immutable memtables, then scan SSTables from newest to oldest. Bloom filters let you skip SSTables that definitely don't contain the key — without them, a read for a missing key would force you to scan every SSTable on every level.

Conditional Operations

Real systems need more than put and get. Multiple services might try to update the same record simultaneously. Without coordination, you get lost updates.

The answer is compare-and-set operations via ETags or version numbers. Every value has a version. A conditional write says: "set this value, but only if the current version is still X." If someone else wrote in between, the operation fails and the caller retries.

With a leader-based design this is straightforward — the leader serializes all writes. You implement CAS as a single Raft log entry that includes both the precondition check and the new value. If the check fails, the entry is rejected without touching state.

Failure Recovery

Crash recovery: replay the WAL from the last checkpoint to rebuild the memtable. The key optimization is periodic checkpointing — snapshotting the in-memory index to disk so you only replay the tail of the log. At 100TB scale, replaying from the beginning of time is unacceptable.

After a full node failure, SSTables on disk still have their own sparse indexes. You rebuild the global index by scanning SSTable metadata, not the data itself. For the value log, you rebuild GC metadata lazily in the background while already serving reads from the immutable files.

Torn writes — chunks partially written before a crash — are caught by checksums. A chunk that doesn't match gets discarded and the client retries.

Multi-Region and CAP

At Airbnb's scale you're running across multiple regions. The commit path for strong consistency requires quorum acknowledgment, and if your quorum spans regions, every write pays a cross-region round-trip.

The standard mitigation: place the majority of replicas in a single primary region with one replica remote. You get fast commits most of the time (two of three replicas in the same datacenter) while maintaining cross-region durability.

During a network partition between regions you must choose: block writes until the partition heals (CP), or allow writes in each region and reconcile later (AP). For a system spec'd with strong consistency, you choose CP. Document this trade-off explicitly and design your SLAs around it.

What a Good Answer Actually Shows

The interviewer isn't expecting you to design RocksDB from scratch in 45 minutes. They want to see how the pieces connect.

Writes are fast because of the WAL and memtable pipeline. Large values stay out of the compaction path because of value separation. Reads are always consistent because they go through a leader or a lease-holding follower. The system survives failures because every durable write has been checksummed, replicated to a quorum, and can be replayed from a checkpoint.

The deepest question in this problem is really about engineering taste: where do you put the complexity? LSM-trees put it in background compaction to keep foreground writes fast. Value separation puts it in GC to keep compaction cheap. Leader replication puts it in failover to keep reads simple. Good systems design is about moving complexity to where it's cheapest to pay for it.

That's the answer they're looking for. Not a perfect system — a thoughtful one.

Enjoyed the article? Follow me on X.com for more interesting content, and check out my website at anuraggoel.in !😊🚀