1 of 12

Challenges & Learnings with using Kafka/Redpanda at a huge scale

- Anuraj Jain

https://www.linkedin.com/in/anuraj-jain-3101

2 of 12

Producer configs: don’t always rely defaults

Batching configs - buffer bytes, buffer count, linger.ms, optimise requests rate, compression
Right produced message acknowledgement
Partitioner should be tuned, in Franz-Go UniformBytesPartitioner can affect connections open/close rate and Kafka produce request size. Can help in pinning a connection

3 of 12

Retries for successful delivery

Redpanda continually rebalances leaders in the background. Some rebalances might take time, and errors or higher latency might be observed during that window. To overcome this, add bounded retries and higher batching of messages during production.

4 of 12

Hot Spotting Issues

Producers can lead to hot-spotting brokers if significant throughput going to only few brokers.
Consumers can lead to hot-spotting brokers if they update their offsets very aggressively (per message) at high volumes. It’s advisable to commit at intervals and have multiple consumer groups. Use Auto-Commits when possible.

5 of 12

No Multiple Initialization on Kafka Client

Creating multiple Kafka clients is not ideal and creates connection overload. A single Kafka client per consumer or producer per cluster should work fine for most use cases.

6 of 12

Deciding right number of Partitions

Wrong partitions count affect all Producers, Consumers and Kafka Server
If you expecting significant throughput on a topic then create partitions in multiples of your brokers (also check if partitions are distributed equally among brokers) OR highly composite numbers
Keep enough partitions that equals your consumers count required at peak traffic times. Also don’t forget to account for festive/sale season traffic peaks
Also remember to not create too many partitions, unnecessary overhead for brokers. You can increase but cannot decrease partitions count

7 of 12

Async Produce and TryProduce (Fail Fast)

Use Async Produce method from Kafka Client whenever possible and add callback for handling any failures.
Using a TryProduce, basically fail-fast instead of blocking when message batch/buffer is already full. Keeps your producer threads/goroutines in control, prevents OOMs on high scale services.

8 of 12

Binary Serialization and Compression is must

Binary Serialization and Compression of messages being produced directly impacts performance of your system. Less throughput, less data & compute at brokers, less to consume by consumers, low e2e latency, lower cloud cost.
Produce & Consume throughout affects compute required for brokers in direct proportion.
Not to mention Serialization with Proto/Avro enforces data contracts for good.

9 of 12

Graceful Shutdowns: Producers and Consumers

Shutdown/SIGTERM for Producer application should be handled, kafka-producer-client should be closed and immediately flush the available messages in the batch/buffer, or wait for linger.ms before the pod dies. Prevents loss of messages.
Shutdown/SIGTERM for Consumer application should be handled, stop consumption or new messages, process and commit any messages consumed in-memory. Prevent loss of messages or duplicate processing.

10 of 12

Tiered Storage : Keep Storage cost in control

For high-volume topics, we retained only a few hours of data in local-storage(SSD) and had longer retention in the tiered storage layer.
Keeps less data in SSD and so lower cost.
Also this helped reduce the volume of data to be transferred among brokers on a node failure.

*(Do note that consuming from tiered storage is slightly more expensive for consumers.)

11 of 12

Client-Side Observability

Along with Kafka server-side observability, client (Producer & Consumer) side observability is a must. Helps in debugging any issues much faster. Few metrics to observe -

Producer

Compression Ratio
Bytes Produced Rate
Batches Produced & Batch Size Rate
Messages Produced Rate
Produce Request Rate & latency

Consumers

Messages consumed Rate
Batches Fetched & Batch Size Rate
Commit Interval and Rate
Heartbeat rate
Fetch Request Rate & Latency

Common

Connections opened/closed rate, active connections
Kafka APIs rate, bytes write/read, success/failure rate, latency