Kafka as a Resilience Layer

When the fraud detection service is unavailable, the payment service has two options: fail the payment immediately, or defer the fraud check and process it later. Kafka transforms the second option from a vague aspiration into a concrete implementation. The unprocessed payment event sits in a Kafka partition, durable and ordered, waiting for the fraud detection service to recover.

The Failure Mode

Synchronous fraud detection is on the critical path. The payment service calls the fraud detection service, waits for a response, and proceeds based on the score. When fraud detection is unavailable, the circuit breaker opens and the fallback permits the payment (Chapter 3). This is acceptable for low-risk payments but not for high-value transactions.

The alternative: publish the payment event to a Kafka topic. A consumer reads the event, calls fraud detection, and writes the result back. If fraud detection is unavailable, the consumer does not commit the offset, so the event stays in the partition ahead of the group's committed position. When fraud detection recovers, the consumer resumes from that position and works through the backlog.
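The effect of withholding the commit can be illustrated with a small in-memory model. This is only a sketch: PartitionModel, drain, and backlog are hypothetical names, the list stands in for a partition, and none of it uses the Kafka client API. It shows the one property that matters here, which is that the committed position advances only past events whose fraud check succeeded.

```java
import java.util.List;
import java.util.function.Predicate;

/** In-memory stand-in for one partition plus a committed offset (not the Kafka API). */
final class PartitionModel {
    private final List<String> log;   // events in partition order
    private int committedOffset = 0;  // next offset the consumer group will read

    PartitionModel(List<String> log) {
        this.log = List.copyOf(log);
    }

    /**
     * Processes events from the committed offset onward and stops at the
     * first failed fraud check. Only successfully checked events advance
     * the committed offset. Returns how many events this pass processed.
     */
    int drain(Predicate<String> fraudCheck) {
        int processed = 0;
        while (committedOffset < log.size()
                && fraudCheck.test(log.get(committedOffset))) {
            committedOffset++;  // "commit" only after the check succeeded
            processed++;
        }
        return processed;
    }

    /** Number of events still waiting for a fraud check. */
    int backlog() {
        return log.size() - committedOffset;
    }
}
```

While the predicate fails, drain returns 0 and the backlog stays intact; once it succeeds, the same events are processed in their original order.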

Kafka Retry and Dead Letter Topology

The diagram shows the flow: payment-events topic feeds the fraud consumer. On failure, events move to payment-events-retry-1 (1-minute delay), then payment-events-retry-2 (5-minute delay), and after all retries are exhausted, to payment-events-dlq (dead letter queue). Each retry topic has its own consumer that re-attempts the fraud check after the configured delay.

From Scratch: The Retry Topic Pattern

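// SCRATCH - Manual retry topic implementation
import java.time.Duration;
import java.time.Instant;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.header.Headers;

public class RetryTopicConsumer {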

    private final KafkaConsumer<String, PaymentEvent> consumer;
    private final KafkaProducer<String, PaymentEvent> producer;
    private final FraudDetectionClient fraudClient;
    private final String currentTopic;
    private final String nextRetryTopic;  // null if this is the last retry
    private final String dlqTopic;
    private final int maxRetryCount;

    public void consume() {
        while (true) {
            ConsumerRecords<String, PaymentEvent> records =
                    consumer.poll(Duration.ofMillis(100));

            for (ConsumerRecord<String, PaymentEvent> record : records) {
                int retryCount = getRetryCount(record.headers());

                try {
                    FraudScore score = fraudClient.score(
                            record.value());
                    handleFraudResult(record.value(), score);
                    // Success: nothing to forward
                } catch (Exception e) {
                    if (retryCount >= maxRetryCount || nextRetryTopic == null) {
                        // Retries exhausted, or this is the last retry topic
                        sendToDlq(record, e);
                    } else {
                        // Send to next retry topic with incremented count
                        sendToRetry(record, retryCount + 1);
                    }
                }
            }

            // send() is asynchronous. Flush before committing so that an offset
            // is never committed while a forwarded record is still sitting in
            // the producer's buffer.
            producer.flush();
            consumer.commitSync();
        }
    }

    private void sendToRetry(ConsumerRecord<String, PaymentEvent> record,
                              int newRetryCount) {
        ProducerRecord<String, PaymentEvent> retryRecord =
                new ProducerRecord<>(nextRetryTopic,
                        record.key(), record.value());
        retryRecord.headers()
                .add("retry-count",
                        Integer.toString(newRetryCount).getBytes())
                .add("original-topic",
                        record.topic().getBytes())
                .add("failure-timestamp",
                        Instant.now().toString().getBytes());
        producer.send(retryRecord);
    }

    private void sendToDlq(ConsumerRecord<String, PaymentEvent> record,
                            Exception error) {
        ProducerRecord<String, PaymentEvent> dlqRecord =
                new ProducerRecord<>(dlqTopic,
                        record.key(), record.value());
        dlqRecord.headers()
                .add("error-message",
                        // getMessage() may be null; String.valueOf avoids an NPE
                        String.valueOf(error.getMessage()).getBytes())
                .add("error-class",
                        error.getClass().getName().getBytes())
                .add("original-topic",
                        record.topic().getBytes())
                .add("retry-count",
                        Integer.toString(getRetryCount(
                                record.headers())).getBytes());
        producer.send(dlqRecord);
    }

    private int getRetryCount(Headers headers) {
        Header header = headers.lastHeader("retry-count");
        if (header == null) return 0;
        return Integer.parseInt(new String(header.value()));
    }
}

What this reveals:

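Retry delay is topic-based, not timer-based. Each retry topic has its own delay, enforced by the consumer of that topic. The consumer reads a message, compares the current time with the failure timestamp (or Kafka’s message timestamp) plus the topic's delay, and if the delay has not yet elapsed, pauses the partition until it has. This avoids Thread.sleep() in the consumer, which would block the poll loop and can trigger a rebalance once max.poll.interval.ms is exceeded. The scratch consumer above leaves this check out for brevity.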
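The delay check itself is plain time arithmetic. The following is a minimal sketch under some assumptions: RetryDelay and its methods are hypothetical helpers, and the failure-timestamp header is assumed to hold an ISO-8601 instant, as written by sendToRetry above.

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: decides whether a retry message is due yet. */
final class RetryDelay {
    private RetryDelay() {}

    /** Parses the failure-timestamp header value (assumed ISO-8601, UTF-8). */
    static Instant parseFailureTimestamp(byte[] headerValue) {
        return Instant.parse(new String(headerValue, StandardCharsets.UTF_8));
    }

    /**
     * Returns how long the consumer should still wait before re-attempting
     * the message, or Duration.ZERO if the delay has already elapsed.
     */
    static Duration remaining(Instant failureTimestamp,
                              Duration retryDelay,
                              Instant now) {
        Instant dueAt = failureTimestamp.plus(retryDelay);
        return now.isBefore(dueAt)
                ? Duration.between(now, dueAt)
                : Duration.ZERO;
    }
}
```

In a real consumer, a non-zero result would typically lead to KafkaConsumer.pause() on the record's partition and a seek() back to the record's offset, followed by resume() once the remaining time has passed. That wiring is left out here.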

The DLQ preserves the full error context. Headers carry the error message, error class, original topic, and retry count. An operator investigating a DLQ message can reconstruct the entire failure history without accessing application logs.

Commit semantics determine at-least-once guarantees. The consumer commits offsets only after every record in the batch has been either processed or forwarded to a retry or DLQ topic. If the consumer crashes between processing and committing, the batch is reprocessed on restart. This is at-least-once delivery, so every step must tolerate repeats. The fraud detection call is naturally idempotent because scoring is a read operation. handleFraudResult, which acts on the score, needs the same property.
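Because a redelivered message runs the whole handler again, whatever records the fraud result has to ignore duplicates. One way to get that property is to key the write on the payment ID. The sketch below is an assumption about how this could look, not the platform's actual storage: IdempotentResultStore is a hypothetical class, and an in-memory map stands in for a table with a unique key on the payment ID.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical store that records each payment's fraud score at most once. */
final class IdempotentResultStore {
    private final Map<String, Double> scoresByPaymentId =
            new ConcurrentHashMap<>();

    /**
     * Stores the score if none exists for this payment.
     * Returns true if this call stored it, false if it was a redelivery.
     */
    boolean recordOnce(String paymentId, double score) {
        return scoresByPaymentId.putIfAbsent(paymentId, score) == null;
    }

    /** Returns the stored score, or null if the payment has not been scored. */
    Double scoreFor(String paymentId) {
        return scoresByPaymentId.get(paymentId);
    }
}
```

A handler built on this would trigger downstream effects only when recordOnce returns true, so a second delivery of the same event changes nothing.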

Production Implementation with Spring Kafka

Spring Kafka provides built-in retry topic support:

// PRODUCTION - Spring Kafka retry topic configuration
@Configuration
@EnableKafka
public class KafkaRetryConfig {

    @Bean
    public RetryTopicConfiguration retryTopicConfig(
            KafkaTemplate<String, PaymentEvent> template) {

        return RetryTopicConfigurationBuilder
                .newInstance()
                .maxAttempts(6)        // initial delivery + 5 retries
                .fixedBackOff(60_000)  // 1 minute between retries
                .retryTopicSuffix("-retry")
                .dltSuffix("-dlq")
                .includeTopic("payment-events")
                .dltHandlerMethod("deadLetterHandler", "handle")
                .create(template);
    }

    @Bean
    public DeadLetterHandler deadLetterHandler(
            MeterRegistry meterRegistry,
            PaymentEventRepository repository) {
        return new DeadLetterHandler(meterRegistry, repository);
    }
}

// PRODUCTION - Consumer with automatic retry
@Component
public class FraudCheckConsumer {

    private final FraudDetectionClient fraudClient;
    private final PaymentResultPublisher resultPublisher;

    @KafkaListener(topics = "payment-events",
                   groupId = "fraud-check-consumer")
    public void onPaymentEvent(PaymentEvent event) {
        // Spring Kafka handles retry automatically:
        // - Exception thrown here -> message sent to retry topic
        // - After max retries -> message sent to DLQ
        FraudScore score = fraudClient.score(event);
        resultPublisher.publishResult(event.paymentId(), score);
    }
}

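// Registered as the "deadLetterHandler" bean in KafkaRetryConfig, so it is
// not annotated with @Component (that would define the same bean twice).
public class DeadLetterHandler {

    private static final Logger log =
            LoggerFactory.getLogger(DeadLetterHandler.class);

    private final MeterRegistry meterRegistry;
    private final PaymentEventRepository repository;

    public DeadLetterHandler(MeterRegistry meterRegistry,
                             PaymentEventRepository repository) {
        this.meterRegistry = meterRegistry;
        this.repository = repository;
    }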

    @DltHandler
    public void handle(PaymentEvent event,
                       @Header(KafkaHeaders.RECEIVED_TOPIC) String topic,
                       @Header(KafkaHeaders.EXCEPTION_MESSAGE) String error) {
        meterRegistry.counter("kafka.dlq.received",
                "original_topic", "payment-events")
                .increment();

        // Persist for manual review
        repository.saveDlqEvent(event, topic, error);

        log.error("Payment event exhausted retries: paymentId={}, error={}",
                event.paymentId(), error);
    }
}

Implementing Retry Delay

Fixed backoff retries every 60 seconds. For the transaction platform, escalating delays are more appropriate:

// PRODUCTION - Escalating retry delays
@Bean
public RetryTopicConfiguration escalatingRetryConfig(
        KafkaTemplate<String, PaymentEvent> template) {

    return RetryTopicConfigurationBuilder
            .newInstance()
            .exponentialBackoff(60_000, 3.0, 900_000)
            // First retry: 1 minute
            // Second retry: 3 minutes
            // Third retry: 9 minutes
            // Fourth retry: 15 minutes (capped at maxInterval)
            // Fifth retry: 15 minutes
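            .maxAttempts(6)  // initial delivery + 5 retries
            .retryOn(FraudServiceUnavailableException.class)
            // Only retry transient failures. With retryOn alone, any other
            // exception (such as InvalidPaymentException) skips the retry
            // topics and goes directly to the DLQ. retryOn and notRetryOn
            // are mutually exclusive on this builder.
            .includeTopic("payment-events")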
            .create(template);
}

The escalating delay gives the fraud detection service progressively more time to recover. A 60-second outage is covered by the first retry. A 15-minute deployment is covered by the fourth retry, which runs about 28 minutes after the first failure (1 + 3 + 9 + 15); the third retry, at 13 minutes, would still be too early. Outages longer than about 43 minutes (the sum of all five delays) result in DLQ messages that require manual intervention.
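The schedule is easy to check by hand or in code. This sketch reproduces only the arithmetic of a capped exponential backoff. BackoffSchedule is a hypothetical helper and makes no claim about how Spring Kafka computes delays internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that tabulates a capped exponential backoff. */
final class BackoffSchedule {
    private BackoffSchedule() {}

    /** Delay before each retry, in milliseconds, capped at maxInterval. */
    static List<Long> delaysMillis(long initialInterval, double multiplier,
                                   long maxInterval, int retries) {
        List<Long> delays = new ArrayList<>();
        double next = initialInterval;
        for (int i = 0; i < retries; i++) {
            delays.add((long) Math.min(next, maxInterval));
            next *= multiplier;
        }
        return delays;
    }

    /** Time from the first failure to each retry, in milliseconds. */
    static List<Long> cumulativeMillis(List<Long> delays) {
        List<Long> totals = new ArrayList<>();
        long sum = 0;
        for (long delay : delays) {
            sum += delay;
            totals.add(sum);
        }
        return totals;
    }
}
```

For delaysMillis(60_000, 3.0, 900_000, 5) the delays are 1, 3, 9, 15, and 15 minutes, and the retries land 1, 4, 13, 28, and 43 minutes after the first failure.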

Testing with Embedded Kafka

// PRODUCTION - Integration test with embedded Kafka
@SpringBootTest
@EmbeddedKafka(
        partitions = 1,
        topics = {"payment-events",
                  "payment-events-retry-0",
                  "payment-events-retry-1",
                  "payment-events-dlq"})
class KafkaRetryIntegrationTest {

    @Autowired
    private KafkaTemplate<String, PaymentEvent> kafkaTemplate;

    @MockBean
    private FraudDetectionClient fraudClient;

    @Autowired
    private KafkaListenerEndpointRegistry registry;

    @Test
    void retriesExhausted_sendsToDeadLetterTopic(
            @Autowired ConsumerFactory<String, PaymentEvent> cf) {

        // Fraud service always fails
        when(fraudClient.score(any()))
                .thenThrow(new FraudServiceUnavailableException(
                        "Connection refused"));

        // Send payment event
        PaymentEvent event = new PaymentEvent(
                "PAY-001", "ACC-123", new BigDecimal("150.00"));
        kafkaTemplate.send("payment-events", event.paymentId(), event);

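        // Wait for retries to exhaust and message to arrive in DLQ.
        // Assumes the test configuration shortens the backoff to a few
        // seconds; with the production delays the DLQ would not be reached
        // within the two-minute timeout below.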
        Consumer<String, PaymentEvent> dlqConsumer =
                cf.createConsumer("test-dlq-group", "test");
        dlqConsumer.subscribe(List.of("payment-events-dlq"));

        await().atMost(Duration.ofMinutes(2)).untilAsserted(() -> {
            ConsumerRecords<String, PaymentEvent> records =
                    dlqConsumer.poll(Duration.ofMillis(100));
            assertThat(records.count()).isGreaterThan(0);

            ConsumerRecord<String, PaymentEvent> dlqRecord =
                    records.iterator().next();
            assertThat(dlqRecord.value().paymentId())
                    .isEqualTo("PAY-001");
        });

        dlqConsumer.close();
    }
}

The Observable Signal: Consumer Lag

Consumer lag is the distance between the latest produced offset and the latest consumed offset. Under normal operation, lag is near zero. When the fraud detection service is unavailable and messages accumulate in retry topics, lag grows.
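The definition translates directly into arithmetic over offsets. As a rough sketch, with ConsumerLag as a hypothetical helper and plain maps of partition number to offset standing in for what a monitoring tool would read from the broker:

```java
import java.util.Map;

/** Hypothetical helper: total lag across the partitions of one topic. */
final class ConsumerLag {
    private ConsumerLag() {}

    /**
     * Lag per partition is the log end offset minus the committed offset,
     * floored at zero. A partition with no committed offset counts from 0.
     */
    static long totalLag(Map<Integer, Long> endOffsets,
                         Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
            long committed =
                    committedOffsets.getOrDefault(entry.getKey(), 0L);
            total += Math.max(0L, entry.getValue() - committed);
        }
        return total;
    }
}
```

Summing across partitions matters because lag is tracked per partition; a single partition can look healthy while the topic as a whole is far behind.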

# PRODUCTION - Prometheus metrics for Kafka resilience
# Spring Kafka exposes consumer lag via Micrometer

# Consumer lag per partition
kafka_consumer_fetch_manager_records_lag{topic="payment-events"}
kafka_consumer_fetch_manager_records_lag{topic="payment-events-retry-0"}

# DLQ message rate
rate(kafka_dlq_received_total[5m])

Lag on payment-events: Growing lag means the consumer is processing slower than the producer is publishing. Possible causes: fraud service latency, consumer under-provisioned.

Lag on retry topics: Some lag is expected here, because messages sit in a retry topic until their delay elapses. Sustained growth means retry processing is falling behind: the fraud service is still unavailable, and newly failed messages are arriving faster than delayed retries drain.

DLQ rate: Any sustained DLQ rate is an alert. It means messages have exhausted all retry attempts. Each DLQ message represents a payment that has not been fraud-checked. The operations team must investigate, fix the underlying issue, and replay DLQ messages.

The critical alert: sum(kafka_consumer_fetch_manager_records_lag{topic="payment-events"}) > 10000. The metric is reported per partition, so the sum gives the total backlog: over 10,000 payment events are waiting for fraud checking. Combined with a circuit breaker open alert, this confirms the fraud detection service is down and payments are accumulating.