Kafka Deep Dive

Level: Intermediate
Pre-reading: 04 · Event-Driven Architecture · 04.01 · Event Types and Patterns


Kafka Architecture Overview

Apache Kafka is a distributed event streaming platform. It's designed for high throughput, durability, and scalability.

graph TD
    subgraph Kafka Cluster
        B1[Broker 1]
        B2[Broker 2]
        B3[Broker 3]
    end
    P1[Producer 1] --> B1
    P2[Producer 2] --> B2
    C1[Consumer Group A] --> B1
    C1 --> B2
    C2[Consumer Group B] --> B2
    C2 --> B3
    ZK[Zookeeper / KRaft] --> B1
    ZK --> B2
    ZK --> B3

Core Concepts

Topics and Partitions

A topic is a named stream of events. Topics are split into partitions — ordered, immutable logs.

graph TD
    T["Topic: orders"]
    T --> P0[Partition 0]
    T --> P1[Partition 1]
    T --> P2[Partition 2]
    P0 --> M0A[Offset 0]
    P0 --> M0B[Offset 1]
    P0 --> M0C[Offset 2]
    P1 --> M1A[Offset 0]
    P1 --> M1B[Offset 1]

| Concept | Description |
| --- | --- |
| Topic | Logical channel; a named stream of events |
| Partition | Physical unit; an ordered log; the unit of parallelism |
| Offset | Position of a message within a partition; monotonically increasing |
| Replication Factor | Number of copies of each partition across brokers |

Partition Key

The partition key determines which partition a message goes to. Messages with the same key always go to the same partition (ordering guaranteed).

// Same orderId always goes to same partition
producer.send(new ProducerRecord<>(
    "orders",           // topic
    order.getId(),      // key (partition key)
    orderEvent          // value
));

| Key Strategy | Use Case |
| --- | --- |
| Entity ID | Order ID, Customer ID; events for the same entity stay ordered |
| Random/null | Load distribution; no ordering needed |
| Region/tenant | Shard by geography or tenant |
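Under the hood, the default partitioner hashes the key bytes and takes the result modulo the partition count. The sketch below shows the idea with a plain byte-array hash (Kafka's real DefaultPartitioner uses murmur2, but the stable key-to-partition mapping works the same way):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartitionSketch {

    // Simplified stand-in for Kafka's default partitioner: hash the key
    // bytes, clear the sign bit, and take the result modulo the partition
    // count. Same key in, same partition out -- every time.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("order-42", 6);
        int p2 = partitionFor("order-42", 6);
        System.out.println("order-42 -> partition " + p1);
        if (p1 != p2) throw new AssertionError("same key must map to same partition");
    }
}
```

Note the consequence: the mapping depends on numPartitions, which is why adding partitions later can break per-key ordering.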

Producers

Producers write messages to topics.

Producer Configuration

| Config | Purpose | Recommendation |
| --- | --- | --- |
| acks | Acknowledgment level | all for durability |
| retries | Retry count | High (e.g., 10) |
| enable.idempotence | Prevent duplicates | true |
| batch.size | Batch messages | Tune for throughput |
| linger.ms | Wait to batch | 5–10 ms for throughput |

Acknowledgment Levels

| acks | Behavior | Durability | Latency |
| --- | --- | --- | --- |
| 0 | Fire and forget | None | Lowest |
| 1 | Leader acknowledged | Medium | Medium |
| all | All in-sync replicas acknowledged | Highest | Highest |

Properties props = new Properties();
props.put("bootstrap.servers", "kafka:9092");
props.put("acks", "all");
props.put("enable.idempotence", "true");
props.put("retries", 10);
props.put("key.serializer", StringSerializer.class);
props.put("value.serializer", KafkaAvroSerializer.class);

KafkaProducer<String, OrderEvent> producer = new KafkaProducer<>(props);

Consumers and Consumer Groups

Consumer Groups

A consumer group is a set of consumers that cooperatively consume a topic. Each partition is consumed by exactly one consumer in the group.

graph TD
    subgraph "Topic: orders - 4 partitions"
        P0[Partition 0]
        P1[Partition 1]
        P2[Partition 2]
        P3[Partition 3]
    end
    subgraph Consumer Group A - 2 consumers
        C1[Consumer 1]
        C2[Consumer 2]
    end
    P0 --> C1
    P1 --> C1
    P2 --> C2
    P3 --> C2

Consumer Configuration

| Config | Purpose | Recommendation |
| --- | --- | --- |
| group.id | Consumer group identifier | Unique per logical consumer |
| auto.offset.reset | Where to start for a new group | earliest or latest |
| enable.auto.commit | Auto-commit offsets | false for reliability |
| max.poll.records | Records per poll | Tune for processing time |
| session.timeout.ms | Heartbeat timeout | 10–30 seconds |
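Wiring these settings together might look like the sketch below. It only builds a plain java.util.Properties object so the fragment stands alone; the deserializer class names are the standard ones from kafka-clients, and constructing the actual KafkaConsumer is omitted:

```java
import java.util.Properties;

public class ConsumerConfigSketch {

    // Builds the consumer configuration recommended above: manual offset
    // commits, an explicit group id, and a bounded poll batch size.
    static Properties reliableConsumerProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        props.put("auto.offset.reset", "earliest");   // new groups start from the beginning
        props.put("enable.auto.commit", "false");     // commit manually after processing
        props.put("max.poll.records", "100");         // keep poll batches small
        props.put("session.timeout.ms", "15000");     // within the 10-30 s window
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = reliableConsumerProps("kafka:9092", "order-processor");
        System.out.println(props.getProperty("enable.auto.commit"));
    }
}
```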

Offset Management

sequenceDiagram
    participant C as Consumer
    participant K as Kafka
    participant DB as Application DB

    C->>K: Poll messages
    K->>C: Messages (offset 100-109)
    C->>DB: Process and save
    C->>K: Commit offset 110
    Note over C,K: If crash before commit, reprocess from 100

Manual commit for reliability:

while (true) {
    ConsumerRecords<String, OrderEvent> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, OrderEvent> record : records) {
        processEvent(record.value());  // Process
    }
    consumer.commitSync();  // Commit after processing
}
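The crash-before-commit behavior can be simulated without a broker. The sketch below uses in-memory stand-ins (a List as the partition log, a long as the group's committed offset, not the Kafka API): a batch is polled, the consumer "crashes" before committing, and the next poll replays the same records, which is exactly at-least-once:

```java
import java.util.ArrayList;
import java.util.List;

public class ReplaySketch {

    // In-memory stand-in for one partition's log plus the group's
    // committed offset (the next offset to read).
    static final List<String> log = new ArrayList<>();
    static long committedOffset = 0;

    // Poll up to max records starting at the committed offset.
    static List<String> poll(int max) {
        int from = (int) committedOffset;
        int to = Math.min(log.size(), from + max);
        return new ArrayList<>(log.subList(from, to));
    }

    static void commit(long offset) {
        committedOffset = offset;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) log.add("event-" + i);

        List<String> batch = poll(5);       // events 0-4
        // ... process the batch, then crash BEFORE committing ...
        List<String> replayed = poll(5);    // events 0-4 again: at-least-once
        System.out.println(batch.equals(replayed));

        commit(5);                          // process, then commit
        List<String> next = poll(5);        // events 5-9, no replay
        System.out.println(next.get(0));
    }
}
```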

Delivery Guarantees

| Guarantee | Description | Configuration |
| --- | --- | --- |
| At most once | May lose messages; no duplicates | Auto-commit; acks=0 |
| At least once | May duplicate; no loss | Manual commit; acks=all; idempotent consumer |
| Exactly once | No loss; no duplicates | Transactions + idempotent processing |

Achieving Exactly-Once

// Producer side: enable transactions
props.put("enable.idempotence", "true");
props.put("transactional.id", "order-processor");

// Consumer side: read committed only
props.put("isolation.level", "read_committed");

Exactly-once is expensive

True exactly-once requires transactions and adds latency. Often, at-least-once with idempotent consumers is simpler and sufficient.

Idempotent Consumer Pattern

public void processEvent(OrderEvent event) {
    // Check if already processed
    if (processedEventRepository.exists(event.getEventId())) {
        return;  // Skip duplicate
    }

    // Process
    orderService.processOrder(event);

    // Mark as processed. Note: the check, the processing, and this save
    // should run in one database transaction, or a crash between them
    // can still let a duplicate slip through.
    processedEventRepository.save(event.getEventId());
}
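The pattern is easy to exercise with an in-memory stand-in for the processed-event store (a HashSet here in place of a real repository table): delivering the same event twice must produce exactly one side effect.

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotencySketch {

    // Stand-ins for processedEventRepository and the real work.
    static final Set<String> processedEventIds = new HashSet<>();
    static int sideEffects = 0;

    static void processEvent(String eventId) {
        // Set.add returns false if the id was already present, giving an
        // atomic check-and-mark in a single step for this in-memory case.
        if (!processedEventIds.add(eventId)) {
            return;  // duplicate delivery: skip
        }
        sideEffects++;  // the real processing would happen here
    }

    public static void main(String[] args) {
        processEvent("evt-1");
        processEvent("evt-1");  // redelivered under at-least-once
        processEvent("evt-2");
        System.out.println(sideEffects);  // 2, not 3
    }
}
```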

Replication and Fault Tolerance

ISR — In-Sync Replicas

| Concept | Description |
| --- | --- |
| Leader | Handles reads and writes for the partition |
| Follower | Replicates from the leader |
| ISR | Replicas fully caught up to the leader |
| min.insync.replicas | Minimum ISR count required for acks=all writes |

graph LR
    subgraph Partition 0
        L[Leader - Broker 1]
        F1[Follower - Broker 2]
        F2[Follower - Broker 3]
    end
    L --> F1
    L --> F2

# Topic configuration
replication.factor=3
min.insync.replicas=2

# Producer
acks=all

With this configuration: 3 replicas, and a write succeeds only when at least 2 replicas (including the leader) acknowledge it. The partition survives 1 broker failure without losing acknowledged writes.
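The arithmetic generalizes: with replication factor RF and min.insync.replicas = M, acknowledged acks=all writes survive RF − M broker failures. A one-line helper makes the trade-off explicit:

```java
public class DurabilityMath {

    // Number of broker failures an acknowledged (acks=all) write survives:
    // replicas beyond the minimum in-sync count are the safety margin.
    static int failuresTolerated(int replicationFactor, int minInsyncReplicas) {
        return replicationFactor - minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(failuresTolerated(3, 2));  // 1, as in the config above
        System.out.println(failuresTolerated(5, 3));  // 2
    }
}
```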


Schema Registry

Centralized schema management for Avro, Protobuf, or JSON Schema.

graph LR
    P[Producer] -->|Register schema| SR[Schema Registry]
    P -->|Send with schema ID| K[Kafka]
    K --> C[Consumer]
    C -->|Fetch schema| SR

Benefits

| Benefit | Description |
| --- | --- |
| Schema validation | Reject messages that don't conform |
| Compatibility checks | Prevent breaking changes |
| Schema evolution | Track schema versions over time |
| Efficient serialization | Schema ID travels in the message, not the full schema |
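The "efficient serialization" row refers to the Confluent wire format: each serialized message starts with a magic byte (0) and a 4-byte schema ID, followed by the record bytes. The framing can be sketched with java.nio alone (the Avro payload here is just a placeholder byte array):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class WireFormatSketch {

    // Frame a payload in the Confluent wire format:
    // [magic byte 0][4-byte big-endian schema id][serialized record].
    static byte[] frame(int schemaId, byte[] avroPayload) {
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + avroPayload.length);
        buf.put((byte) 0);        // magic byte
        buf.putInt(schemaId);     // schema registry id
        buf.put(avroPayload);     // record bytes -- the schema itself is NOT included
        return buf.array();
    }

    // Consumers read the id back and fetch the schema from the registry.
    static int schemaIdOf(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        buf.get();                // skip magic byte
        return buf.getInt();
    }

    public static void main(String[] args) {
        byte[] payload = "order-record-bytes".getBytes(StandardCharsets.UTF_8);
        byte[] framed = frame(42, payload);
        System.out.println(framed.length);       // 5 header bytes + payload length
        System.out.println(schemaIdOf(framed));  // 42
    }
}
```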

Compatibility Modes

| Mode | Can Add Fields | Can Remove Fields |
| --- | --- | --- |
| BACKWARD | With defaults only | Any |
| FORWARD | Any | With defaults only |
| FULL | With defaults only | With defaults only |
| NONE | Any | Any |
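For example, a BACKWARD-compatible change to a hypothetical OrderEvent Avro schema adds a new field with a default, so a consumer on the new schema can still read records written with the old one:

```json
{
  "type": "record",
  "name": "OrderEvent",
  "fields": [
    { "name": "orderId", "type": "string" },
    { "name": "amount", "type": "double" },
    { "name": "couponCode", "type": ["null", "string"], "default": null }
  ]
}
```

Old records lack couponCode, so the new reader fills in the default (null); removing the field later would also be safe for BACKWARD readers, who simply ignore it in old data.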

Dead Letter Queue (DLQ)

Handle messages that fail processing after max retries.

graph LR
    T[orders topic] --> C[Consumer]
    C -->|Process| App[Application]
    C -->|Retry exhausted| DLQ[orders.dlq topic]
    DLQ --> Alert[Alerting]
    DLQ --> Reprocess[Manual reprocess]

public void processWithRetry(ConsumerRecord<String, OrderEvent> record) {
    int retries = 0;
    while (retries < MAX_RETRIES) {
        try {
            process(record.value());
            return;
        } catch (RetryableException e) {
            retries++;
            backoff(retries);
        }
    }
    // Send to DLQ
    dlqProducer.send(new ProducerRecord<>("orders.dlq", record.key(), record.value()));
}
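The retry loop above assumes helpers like MAX_RETRIES, backoff, and dlqProducer exist elsewhere. Its control flow can be exercised standalone with in-memory stand-ins (a list as the DLQ topic, a pluggable handler in place of real processing):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class DlqSketch {

    static final int MAX_RETRIES = 3;
    // Stand-in for the orders.dlq topic.
    static final List<String> dlq = new ArrayList<>();

    // Retry the handler up to MAX_RETRIES times; route to the DLQ
    // only once retries are exhausted.
    static void processWithRetry(String record, Consumer<String> handler) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                handler.accept(record);
                return;  // success: nothing goes to the DLQ
            } catch (RuntimeException e) {
                // real code would back off here (e.g., sleep attempt * 100 ms)
            }
        }
        dlq.add(record);  // retries exhausted: dead-letter it
    }

    public static void main(String[] args) {
        processWithRetry("good-order", r -> { /* succeeds */ });
        processWithRetry("poison-order", r -> { throw new RuntimeException("parse error"); });
        System.out.println(dlq);  // only the poison message lands in the DLQ
    }
}
```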

Kafka vs Alternatives

| Aspect | Kafka | RabbitMQ | AWS SQS |
| --- | --- | --- | --- |
| Model | Log-based | Queue-based | Queue-based |
| Ordering | Per partition | Per queue (with caveats) | FIFO queues only |
| Retention | Time/size-based | Until consumed | 14 days max |
| Replay | Yes | No | No |
| Throughput | Very high | High | Moderate |
| Managed | Confluent, MSK | CloudAMQP | Native AWS |

Kafka Best Practices

| Area | Recommendation |
| --- | --- |
| Partitions | Start with number of partitions = expected consumers |
| Keys | Use entity ID for ordering; null for load distribution |
| Serialization | Avro + Schema Registry for contracts |
| Retention | Set based on replay needs |
| Monitoring | Watch consumer lag and under-replicated partitions |
| Idempotency | Design idempotent consumers |

How many partitions should a topic have?

Start with number of partitions = expected maximum number of consumers in the group. More partitions mean more parallelism but also more overhead (file handles, leader elections, longer rebalances). You can increase the partition count later, but because keys are hashed modulo the partition count, messages with the same key may then land on a different partition, breaking per-key ordering across the change. Typical: 6–12 for moderate load; 50+ for high-throughput topics.

What happens when a Kafka consumer crashes?

The consumer group rebalances. Partitions assigned to the crashed consumer are redistributed to remaining consumers. Uncommitted offsets are replayed (at-least-once). If using auto-commit, some messages may be lost (at-most-once). Design consumers to be idempotent.

How does Kafka achieve exactly-once semantics?

Kafka combines four mechanisms: (1) Idempotent producer: each message carries a sequence number, and the broker deduplicates retries. (2) Transactions: atomic writes across partitions, including the consumer offset commit. (3) Read committed: consumers only see messages from committed transactions. (4) Idempotent consumer: application-level deduplication for end-to-end exactly-once.