Produce a message. Consume a message. Add a few configs.

And for a while, that's enough.

But over time, a different set of questions starts showing up.

Why does Kafka scale the way it does? What really happens during a rebalance? Why does ordering suddenly break? What exactly does "exactly-once" even mean?

These are not API-level questions.

These are architecture questions.

And in my experience, this is where the real gap exists.

Not because Kafka is hard, but because most learning stops at usage.


Cluster 1: Core Internals (Foundation)

These questions make people realize Kafka is more than a messaging tool.

  1. Why is Kafka so fast? (sequential I/O, zero copy, batching)
  2. What exactly happens when a producer sends a message?
  3. How does Kafka store data on disk? (log segments, index files)
  4. Why does Kafka rely on pull instead of push?
  5. What is ISR (in-sync replicas) and why does it matter?
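A toy model helps anchor questions 1–3. The sketch below is a deliberate simplification (not Kafka's actual code): an append-only segment with a sparse offset index, showing why sequential writes plus a small index keep both appends and offset lookups cheap.

```python
# Toy model of a Kafka-style log segment: sequential appends plus a
# sparse offset index (the idea behind Kafka's .log and .index files).
from bisect import bisect_right

class ToySegment:
    def __init__(self, base_offset, index_interval=2):
        self.base_offset = base_offset
        self.records = []          # append-only: sequential I/O
        self.index = []            # sparse (offset, position) pairs
        self.index_interval = index_interval

    def append(self, value):
        offset = self.base_offset + len(self.records)
        if len(self.records) % self.index_interval == 0:
            self.index.append((offset, len(self.records)))
        self.records.append(value)
        return offset

    def read(self, offset):
        # Binary-search the sparse index, then scan forward —
        # the same two-step lookup the index file enables.
        i = bisect_right(self.index, (offset, float("inf"))) - 1
        _, pos = self.index[i]
        while self.base_offset + pos <= offset:
            if self.base_offset + pos == offset:
                return self.records[pos]
            pos += 1

seg = ToySegment(base_offset=100)
for msg in ["a", "b", "c", "d", "e"]:
    seg.append(msg)
print(seg.read(103))  # "d"
```

The index stays sparse on purpose: Kafka indexes every few kilobytes, not every record, trading a short scan for a much smaller index.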

Cluster 2: Replication & Fault Tolerance

This is where real system design starts.

  1. How does Kafka handle broker failure?
  2. Leader election — what actually happens during failover?
  3. What happens if leader and follower go out of sync?
  4. How does Kafka ensure durability? (acks, min.insync.replicas)
  5. What happens during network partition?
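Questions 3 and 4 come down to two rules, sketched here as a simplified toy (not broker code): a record is "committed" only once every in-sync replica has it, and with `acks=all` the leader refuses writes when the ISR shrinks below `min.insync.replicas`.

```python
# Toy sketch of the high watermark and min.insync.replicas interaction.

def high_watermark(replica_leo, isr):
    # Consumers only see records below the high watermark, which is
    # the minimum log-end-offset among replicas still in the ISR.
    return min(replica_leo[r] for r in isr)

def can_ack_all(isr, min_insync_replicas):
    # With acks=all, the leader rejects produce requests
    # once the ISR is smaller than min.insync.replicas.
    return len(isr) >= min_insync_replicas

leo = {"broker1": 120, "broker2": 120, "broker3": 95}  # broker3 lagging
isr = {"broker1", "broker2"}                           # broker3 dropped out

print(high_watermark(leo, isr))     # 120: readable up to here
print(can_ack_all(isr, 2))          # True: still durable enough
print(can_ack_all({"broker1"}, 2))  # False: writes would be rejected
```

Notice that dropping a lagging replica from the ISR *advances* the high watermark; durability and availability trade off directly here.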

Cluster 3: Consumer Mechanics (Very Underrated)

Most developers are weak here, which makes depth in this area a strong differentiator.

  1. How does consumer group rebalancing actually work?
  2. What triggers a rebalance and why is it expensive?
  3. What happens during a rebalance step-by-step?
  4. How does offset commit really work internally?
  5. Why do consumers sometimes reprocess messages?
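Questions 1–3 and 5 are easier to reason about with a toy assignor in hand. The sketch below mimics the idea behind range-style assignment (a simplification, not Kafka's `RangeAssignor`): partitions are split across sorted consumers, so adding one consumer moves partitions away from the others, and in-flight work on moved partitions can be reprocessed.

```python
# Toy range-style partition assignment for a consumer group.

def assign(partitions, consumers):
    consumers = sorted(consumers)
    n, k = len(partitions), len(consumers)
    per, extra = divmod(n, k)
    out, start = {}, 0
    for i, c in enumerate(consumers):
        count = per + (1 if i < extra else 0)   # first `extra` get one more
        out[c] = partitions[start:start + count]
        start += count
    return out

parts = list(range(6))
print(assign(parts, ["c1", "c2"]))
# {'c1': [0, 1, 2], 'c2': [3, 4, 5]}
print(assign(parts, ["c1", "c2", "c3"]))
# {'c1': [0, 1], 'c2': [2, 3], 'c3': [4, 5]}
# c2 lost partitions 4 and 5: any uncommitted work there is
# redelivered to c3 after the rebalance — hence reprocessing.
```

This is also why rebalances are expensive: every moved partition means paused consumption, state rebuilt elsewhere, and possible duplicate processing between the last committed offset and the interruption point.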

Cluster 4: Delivery Semantics (Interview Favorite)

This is where people usually give half answers.

  1. At-most-once vs at-least-once vs exactly-once
  2. Is Kafka really "exactly-once"? What does it actually mean?
  3. How do idempotent producers work internally?
  4. Transactions in Kafka — when do they matter?
  5. What are the trade-offs of exactly-once?
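For question 3, the core trick is small: the broker remembers the last sequence number it accepted per producer and partition, so a retried batch is recognized and dropped rather than written twice. The sketch below is a toy illustration of that idea, not Kafka's actual implementation.

```python
# Toy sketch of idempotent-producer deduplication on the broker side.

class ToyPartitionLog:
    def __init__(self):
        self.records = []
        self.last_seq = {}   # producer_id -> last accepted sequence

    def append(self, producer_id, seq, value):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"        # retry of an already-written batch
        if seq > last + 1:
            return "out_of_sequence"  # gap detected: batch rejected
        self.last_seq[producer_id] = seq
        self.records.append(value)
        return "ok"

log = ToyPartitionLog()
print(log.append("p1", 0, "a"))  # ok
print(log.append("p1", 1, "b"))  # ok
print(log.append("p1", 1, "b"))  # duplicate (retry after a lost ack)
print(log.append("p1", 3, "d"))  # out_of_sequence
print(log.records)               # ['a', 'b'] — no duplicate write
```

This is exactly the scope of "exactly-once" for an idempotent producer: no duplicates *within one partition, within one producer session*. Anything across partitions or sessions needs transactions on top.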

Cluster 5: Scaling & Performance

Very practical and highly relatable.

  1. How do you decide number of partitions?
  2. What happens when partitions increase later?
  3. How does Kafka handle high throughput?
  4. What causes consumer lag and how do you debug it?
  5. How to tune Kafka for performance (producer + broker + consumer)
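For question 1, a common back-of-the-envelope heuristic (a rule of thumb, not an official formula) is to measure per-partition throughput on both sides and take whichever side needs more partitions to hit the target:

```python
# Partition-count sizing heuristic: cover the target throughput on
# both the produce and consume side, measured per partition.
import math

def suggest_partitions(target_mb_s, producer_mb_s_per_partition,
                       consumer_mb_s_per_partition):
    need_produce = math.ceil(target_mb_s / producer_mb_s_per_partition)
    need_consume = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(need_produce, need_consume)

# Target 500 MB/s; each partition sustains ~50 MB/s produce, ~25 MB/s consume
print(suggest_partitions(500, 50, 25))  # 20 — consumers are the bottleneck
```

The numbers here are illustrative. The point of the exercise is the shape of the reasoning: partition count is bounded by the slower side, and since keys hash to partitions, increasing the count later breaks key-to-partition mapping — which connects directly to question 2.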

Cluster 6: Real-world System Design Thinking

This is where your experience will shine.

  1. Designing retry architecture (you already started this)
  2. Handling ordering vs scaling trade-offs
  3. How to design DLQ (DLT) properly
  4. Schema evolution — how to avoid breaking consumers
  5. When NOT to use Kafka
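The retry and DLQ items (1 and 3) share one routing decision, sketched below as a minimal toy. The names (`handle`, `MAX_ATTEMPTS`, the list-based "topics") are illustrative, not a real library API: track attempts on the record, re-publish to a retry topic while attempts remain, and park it in the DLQ with the error once they run out.

```python
# Minimal sketch of retry-then-DLQ routing for a consumed record.

MAX_ATTEMPTS = 3

def handle(record, process, retry_out, dlq_out):
    attempts = record.get("attempts", 0) + 1
    try:
        process(record["value"])
        return "processed"
    except Exception as exc:
        routed = {**record, "attempts": attempts, "error": str(exc)}
        if attempts >= MAX_ATTEMPTS:
            dlq_out.append(routed)   # park for inspection/replay
            return "dlq"
        retry_out.append(routed)     # re-publish to a retry topic
        return "retry"

retries, dlq = [], []

def flaky(value):
    raise ValueError("downstream unavailable")

rec = {"value": "order-42"}
for _ in range(MAX_ATTEMPTS):
    print(handle(rec, flaky, retries, dlq))  # retry, retry, dlq
    if retries:
        rec = retries.pop()

print(len(dlq), dlq[0]["attempts"])  # 1 3
```

Carrying the attempt count and last error on the record itself (typically as headers in real Kafka setups) is what makes the DLQ debuggable later, instead of a pile of anonymous failures.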

Conclusion

Kafka is often introduced as a messaging system.

But in practice, it behaves very differently.

It's a distributed log. A coordination system. A backbone for data movement.

And because of that, surface-level understanding doesn't hold for long.

At some point, you have to go deeper.

The goal of these questions is not memorization.

It's to help you build a mental model that stays stable even when systems become complex.

If you're preparing for interviews, this is the level where conversations become meaningful.

If you're working on real systems, this is the level where problems start making sense.

