Apache Kafka is a distributed event streaming platform designed for high-throughput, low-latency, and durable log storage. Kafka acts as a commit log where producers append records and consumers read them in order.

Kafka is fundamentally a storage system optimized for sequential writes. Instead of deleting data after consumption, it retains logs for a configured time or size. This enables replay and decouples producers from consumers.

Scenario: In an e-commerce system, Kafka stores all order events. Downstream systems (billing, analytics, fraud detection) consume these events independently at their own pace.

Topics, Partitions & Offsets

  • Topics: Logical categories of messages.
  • Partitions: Logs within topics, ordered and immutable.
  • Offsets: Sequential IDs for each record within a partition.

Partitions are the unit of parallelism: within a consumer group, each partition is read by at most one consumer, so more partitions allow more concurrent consumers. Offsets are tracked per consumer group, enabling fault-tolerant parallel consumption.

Scenario: A payments topic with 12 partitions allows up to 12 consumers in a group to process events concurrently, scaling throughput roughly linearly; a 13th consumer would sit idle.
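
As a minimal sketch of the admin side (broker address and topic name are illustrative), creating such a topic with the Java AdminClient could look like this:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreatePaymentsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address for your cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions -> up to 12 consumers in one group can read in parallel.
            // Replication factor 3 is a common durability default.
            NewTopic payments = new NewTopic("payments", 12, (short) 3);
            admin.createTopics(List.of(payments)).all().get();
        }
    }
}
```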

Storage Internals

Kafka stores messages as log segments on disk:

  • Append-Only: Producers append records sequentially.
  • Log Segments: Files split by size/time (log.segment.bytes defaults to 1 GiB). Each partition's data is stored as .log files with accompanying .index and .timeindex files.
  • Index Files: Map offsets to physical byte positions within a segment.
  • Compaction: Retains the latest record per key, discarding older ones.
  • Tiered Storage (3.6+, KIP-405): Moves cold segments to object storage such as S3/GCS while hot data stays on local disks.

Sequential writes plus heavy use of the OS page cache let Kafka sustain throughput that many databases cannot match for append-heavy workloads. Compaction is critical for topics like user profiles, where only the latest value per key matters.

Scenario: A user settings topic uses log compaction, ensuring only the latest preference survives, keeping storage efficient.
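
A hedged sketch of declaring that compacted topic (names and partition counts are illustrative; cleanup.policy=compact is the setting that enables compaction):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic settings = new NewTopic("user-settings", 6, (short) 3)
                    // Keep only the latest record per key instead of deleting by age.
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(settings)).all().get();
        }
    }
}
```

Compaction is per key, so every record on such a topic must carry a key (here, a user ID): the key identifies which older records a new write supersedes.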

Producers

Producers send records to brokers:

  • Partitioner: Chooses the target partition (round-robin, key hashing, or custom).
  • Batching: Groups messages into larger requests for efficiency.
  • ACKs: Configurable durability (acks=0 fire-and-forget, acks=1 leader only, acks=all all in-sync replicas).

Choosing acks=all with replication factor > 1 (plus min.insync.replicas ≥ 2 on the topic) ensures durability. acks=0 maximizes throughput but risks silent data loss.

Scenario: A metrics pipeline uses acks=0 for ultra-low latency. A financial trading system uses acks=all with idempotence enabled for strict durability.
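
A sketch of the strict-durability configuration from the trading example (topic name, key, and broker address are placeholders):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for all in-sync replicas; pair with min.insync.replicas >= 2 on the topic.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Broker-side deduplication of retried sends.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition -> ordering preserved per account.
            producer.send(new ProducerRecord<>("trades", "account-42", "BUY 100 AAPL"));
            producer.flush();
        }
    }
}
```

A metrics pipeline would instead set acks=0 and drop the idempotence flag, trading durability for latency.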

Pitfalls

  • Hot partitions (from skewed keys) → broker imbalance.
  • Compression saves network but may saturate CPU.

Consumers

Consumers read sequentially:

  • Consumer Groups: Coordinate parallel processing.
  • Rebalancing: Redistributes partitions when group members change.
  • Offset Management: Committed to Kafka or external stores.
  • Consumer lag ≠ data loss; often a sign of downstream bottlenecks.
  • Large-scale deployments face thundering herd issues during rebalances.

Rebalancing pauses consumption temporarily. Kafka 2.4+ introduced incremental cooperative rebalancing (KIP-429) to reduce that disruption.

Scenario: In a fraud detection system, if a consumer crashes, its partitions are reassigned to healthy instances, ensuring no events are skipped.
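
Below is a minimal consumer-group sketch under assumed names (group fraud-detection, topic payments). It commits offsets only after processing, so a crashed instance's partitions resume from the last committed offset, and it opts into the cooperative assignor mentioned above:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class FraudConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fraud-detection");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit explicitly, after processing, so events are not skipped on a crash.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Incremental (cooperative) rebalancing: only reassigned partitions pause.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // at-least-once: commit only after processing
            }
        }
    }
}
```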

Replication & Fault Tolerance

Kafka ensures durability with replication:

  • Replication Factor: Number of copies per partition.
  • Leader & Followers: One broker is leader, others replicate.
  • ISR (In-Sync Replicas): Followers fully caught up with leader.
  • Unclean leader election favors availability but risks durability.
  • Re-replication floods network unless throttled.
  • Cross-region replication (MirrorMaker 2.0) is asynchronous: you accept eventual consistency rather than the high latency of synchronous cross-region writes.

Producers can require that writes be acknowledged only once committed to all in-sync replicas (acks=all). Leader elections occur when a leader fails; unclean leader election risks losing records the new leader never replicated.

Scenario: A 3-replica topic ensures durability. If one broker fails, another replica takes over without losing data.

Controller & Metadata Management

  • Controller Broker: Manages partition leaders and replicas.
  • Zookeeper (Legacy): Stored metadata and coordinated leader elections.
  • KRaft Mode (Kafka 2.8+): Replaces Zookeeper with an internal Raft quorum, simplifying operations; production-ready since 3.3, with Zookeeper support removed in Kafka 4.0.

Metadata is cached in brokers and updated via controller notifications. Modern deployments use KRaft for simplified operations.

Scenario: A cluster upgrade moves from Zookeeper to KRaft, reducing operational complexity by eliminating external coordination service.

Networking & Protocols

Kafka uses a custom TCP protocol:

  • Request/Response: Producers/consumers communicate with brokers.
  • Zero-Copy Transfers: Kafka uses sendfile() to stream data directly from disk to network.
  • Compression: Batch compression (gzip, snappy, lz4, zstd).
  • TLS/SASL: Encrypt traffic and authenticate clients; the extra CPU cost can be offset with hardware acceleration.

Zero-copy is a key reason Kafka achieves high throughput: data moves from the page cache to the socket without passing through user space. (Enabling TLS bypasses this path, since data must be encrypted before it reaches the network.)

Scenario: A video analytics pipeline compresses logs with lz4, balancing CPU usage and throughput.
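
As a sketch, the lz4 choice from the scenario is a single producer setting; the batching values here are illustrative, and matter because whole batches are what gets compressed:

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class CompressionConfig {
    // Fragment: merge with the serializer/acks settings from the producer example above.
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Batches are compressed as a unit, so larger batches compress better.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");     // wait up to 20 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536"); // 64 KB batches (illustrative)
        return props;
    }
}
```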

Transactions & Exactly-Once Semantics

Kafka provides exactly-once semantics (EOS):

  • Idempotent Producers: Prevent duplicate writes.
  • Transactions: Commit multiple writes atomically.
  • Consumer Isolation: Read only committed messages.

EOS requires coordination between producer IDs, the transaction coordinator, and consumer offsets. Within a Kafka read-process-write pipeline, it guarantees each record's effects are committed exactly once, even across retries.
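
A hedged sketch of such a read-process-write loop (topic names and the uppercase transform are stand-ins; the producer is assumed to be constructed with transactional.id set, and the consumer with isolation.level=read_committed and auto-commit disabled):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExactlyOncePipeline {
    public static void run(KafkaConsumer<String, String> consumer,
                           KafkaProducer<String, String> producer) {
        producer.initTransactions();
        consumer.subscribe(List.of("orders"));

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) continue;

            producer.beginTransaction();
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (ConsumerRecord<String, String> record : records) {
                producer.send(new ProducerRecord<>("orders-enriched",
                        record.key(), record.value().toUpperCase()));
                offsets.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
            }
            // Input offsets commit atomically with the output records:
            // either both become visible, or neither does.
            producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
            producer.commitTransaction();
            // (Production code would catch exceptions and call abortTransaction().)
        }
    }
}
```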

Operational Challenges & Scaling Patterns

Capacity Planning

  • Partition sizing must balance throughput vs metadata overhead.
  • Retention policies determine storage costs (petabytes for week-long retention at high event rates).

Fault-Tolerance Edge Cases

  • Zombie Producers: Must be fenced (brokers reject writes from stale producer epochs).
  • Lagging Consumers: Cause storage blowups if retention mismatches demand.

Optimization Strategies

  • Tiered Storage: Reduces cost.
  • Cluster Sharding: Domain-level isolation.
  • Hybrid Topics: Combine compaction with time-based deletion (cleanup.policy=compact,delete); see the sketch below.
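
As a sketch of the hybrid case (topic name and retention value are illustrative), both policies can be combined at creation time:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateHybridTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic sessions = new NewTopic("sessions", 12, (short) 3)
                    .configs(Map.of(
                            "cleanup.policy", "compact,delete", // latest per key AND age-based deletion
                            "retention.ms", "604800000"));      // 7 days (illustrative)
            admin.createTopics(List.of(sessions)).all().get();
        }
    }
}
```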

Scenario: Uber shards clusters by business vertical (Mobility, Eats, Freight) and deploys autoscaling consumers tied to lag metrics.

Closing Thoughts

To a developer, Kafka looks like a queue. Operating it at scale means:

  • Designing partition & replication strategies at scale.
  • Managing trade-offs between latency, durability, and cost.
  • Running hundreds of brokers across regions reliably.
  • Debugging ISR shrinkage, controller failovers, and consumer lag during real outages.

Kafka is not just a pub/sub system. It is the nervous system of modern data architectures.
