Challenges & Learnings from using Kafka/Redpanda at a huge scale
Producer configs: don’t always rely on defaults
Retries for successful delivery
Redpanda continually rebalances partition leaders in the background. Some rebalances take time, and producers may see errors or higher latency during that window. To ride this out, configure bounded retries and larger message batches on the producer.
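A bounded retry is easy to get wrong (unbounded loops, no backoff), so here is a minimal, library-agnostic sketch. The names `produce_with_retries`, `RetriesExhausted`, and the `send` callable are hypothetical and stand in for whatever produce call your client exposes; the defaults are illustrative, not recommendations.

```python
import random
import time
from typing import Callable, Optional


class RetriesExhausted(Exception):
    """Raised when every attempt has failed."""


def produce_with_retries(
    send: Callable[[], None],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    max_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds or max_attempts is reached.

    Returns the number of attempts that were used.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except Exception as exc:  # assumption: a real client would catch only retriable errors
            last_error = exc
            if attempt == max_attempts:
                break
            # Exponential backoff capped at max_delay_s, with full jitter
            # so that many producers do not retry in lockstep.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

Most Kafka clients already implement this internally, so in practice the same effect usually comes from configuration rather than a wrapper. The Java client and librdkafka, for example, expose `retries`, `linger.ms`, and `batch.size`; check your own client's documentation for the equivalent settings.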
Hot Spotting Issues
Avoid Initializing Multiple Kafka Clients
Creating multiple Kafka clients is not ideal and creates connection overload. A single Kafka client per consumer or producer per cluster should work fine for most use cases.
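One way to enforce this is a small registry that hands out a cached client instead of constructing a new one on every call. The sketch below is library-agnostic: `ClientRegistry` and the `factory` callable are hypothetical names, and the factory is assumed to wrap whatever constructor your Kafka client library provides.

```python
import threading
from typing import Any, Callable, Dict, Tuple


class ClientRegistry:
    """Caches one client object per (cluster, role) pair."""

    def __init__(self, factory: Callable[[str, str], Any]) -> None:
        # factory(cluster, role) is assumed to build a real producer or consumer.
        self._factory = factory
        self._clients: Dict[Tuple[str, str], Any] = {}
        self._lock = threading.Lock()

    def get(self, cluster: str, role: str) -> Any:
        """Return the cached client for this pair, creating it on first use."""
        key = (cluster, role)
        with self._lock:  # guard against two threads creating the same client
            client = self._clients.get(key)
            if client is None:
                client = self._factory(cluster, role)
                self._clients[key] = client
            return client
```

Calling `registry.get("cluster-a", "producer")` from any number of request handlers then returns the same object, so the application holds one set of broker connections per role per cluster.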
Deciding the right number of Partitions
Async Produce and TryProduce (Fail Fast)
Binary Serialization and Compression are a must
Graceful Shutdowns: Producers and Consumers
Tiered Storage: Keep Storage cost under control
(Do note that consuming from tiered storage is slightly more expensive for consumers.)
Client-Side Observability
Along with Kafka server-side observability, client-side (Producer & Consumer) observability is a must. It helps in debugging issues much faster. A few metrics to observe:
Producer
Consumers
Common