Challenges & Learnings from Using Kafka/Redpanda at Huge Scale

Producer configs: don’t always rely on defaults

  • Batching configs: tune buffer bytes, buffer count and linger.ms to optimise the produce request rate and compression.
  • Choose the right acknowledgement level for produced messages.
  • Tune the partitioner. In franz-go, UniformBytesPartitioner can affect the connection open/close rate and the Kafka produce request size, and can help pin a connection.

Retries for successful delivery

Redpanda continually rebalances partition leaders in the background. Some rebalances take time, and errors or higher latency may be observed during that window. To ride through it, add bounded retries and batch more messages per request when producing.
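
The bounded-retry idea can be sketched with the Go standard library alone. This is a minimal illustration, not the deck's actual producer code: `retryBounded`, `simulate` and the simulated "not leader" error are hypothetical names, and a real Kafka client usually exposes its own retry settings that should be preferred where available.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryBounded calls produce up to maxAttempts times, doubling the wait
// between attempts, and stops early if ctx is cancelled.
// The produce callback is a hypothetical stand-in for a client's produce call.
func retryBounded(ctx context.Context, maxAttempts int, baseDelay time.Duration, produce func(context.Context) error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = produce(ctx); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // out of attempts: do not sleep again
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// simulate fails the first `failures` calls (as during a leader move) and
// reports how many produce calls were made in total.
func simulate(failures, maxAttempts int) (calls int, err error) {
	err = retryBounded(context.Background(), maxAttempts, time.Millisecond, func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("not leader for partition (simulated)")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := simulate(2, 5)
	fmt.Println(calls, err)
}
```

The upper bound matters as much as the retry itself: without it, a long rebalance turns into an unbounded backlog in the producer.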

Hot Spotting Issues

  • Producers can hot-spot brokers if significant throughput goes to only a few brokers.
  • Consumers can hot-spot brokers if they commit offsets very aggressively (per message) at high volumes. It’s advisable to commit at intervals and to have multiple consumer groups. Use auto-commit when possible.
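
Committing at intervals rather than per message can be sketched as below. The `intervalCommitter` type and its `commit` hook are hypothetical; in practice the hook would wrap whatever commit call your Kafka client provides, or you would rely on the client's auto-commit.

```go
package main

import (
	"fmt"
	"time"
)

// intervalCommitter remembers the latest processed offset and only commits
// once enough messages or enough time has accumulated since the last commit.
type intervalCommitter struct {
	maxPending  int
	maxInterval time.Duration
	commit      func(offset int64) // hypothetical hook around the client's commit call
	pending     int
	latest      int64
	lastCommit  time.Time
}

// markProcessed records one processed message and commits if a threshold is hit.
func (c *intervalCommitter) markProcessed(offset int64, now time.Time) {
	c.latest = offset
	c.pending++
	if c.pending >= c.maxPending || now.Sub(c.lastCommit) >= c.maxInterval {
		c.flush(now)
	}
}

// flush commits the latest offset if anything is pending (also used on shutdown).
func (c *intervalCommitter) flush(now time.Time) {
	if c.pending == 0 {
		return
	}
	c.commit(c.latest)
	c.pending = 0
	c.lastCommit = now
}

// countCommits processes n messages and returns how many commit calls resulted.
func countCommits(n, maxPending int) int {
	commits := 0
	start := time.Now()
	c := &intervalCommitter{
		maxPending:  maxPending,
		maxInterval: time.Hour, // large, so only the count threshold triggers here
		commit:      func(int64) { commits++ },
		lastCommit:  start,
	}
	for i := 0; i < n; i++ {
		c.markProcessed(int64(i), start)
	}
	c.flush(start)
	return commits
}

func main() {
	fmt.Println(countCommits(1000, 100)) // 10 commits instead of 1000
}
```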

Avoid Initializing Multiple Kafka Clients

Creating multiple Kafka clients is not ideal and overloads brokers with connections. A single Kafka client per consumer or producer, per cluster, should work fine for most use cases.
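
One common way to enforce a single shared client in Go is `sync.Once`. The sketch below uses a hypothetical `client` struct in place of a real Kafka client; only the initialization pattern is the point.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical stand-in for a real Kafka client handle.
type client struct{ id int }

var (
	once    sync.Once
	shared  *client
	created int // counts how many clients were actually constructed
)

// sharedClient lazily builds the client exactly once, however many
// goroutines call it concurrently.
func sharedClient() *client {
	once.Do(func() {
		created++
		shared = &client{id: created}
	})
	return shared
}

// clientsCreated calls sharedClient from many goroutines and reports how
// many clients were constructed in total.
func clientsCreated(callers int) int {
	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = sharedClient()
		}()
	}
	wg.Wait()
	return created
}

func main() {
	fmt.Println(clientsCreated(50))
}
```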

Deciding the Right Number of Partitions

  • A wrong partition count affects producers, consumers and the Kafka servers alike.
  • If you are expecting significant throughput on a topic, create partitions in multiples of your broker count (and check that partitions are distributed equally among brokers) OR use highly composite numbers.
  • Keep enough partitions to match the number of consumers required at peak traffic times. Don’t forget to account for festive/sale season traffic peaks.
  • Don’t create too many partitions either; they add unnecessary overhead for brokers. You can increase the partition count later but cannot decrease it.
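
As a rough illustration of the "multiple of brokers, at least the peak consumer count" rule, the arithmetic can be written as a small helper. `suggestPartitions` is a hypothetical name, and the result is only a starting point: it ignores per-partition throughput limits and broker overhead.

```go
package main

import "fmt"

// suggestPartitions returns the smallest multiple of brokers that is at
// least peakConsumers, so every consumer at peak can own a partition and
// partitions can spread evenly across brokers.
func suggestPartitions(brokers, peakConsumers int) int {
	if brokers < 1 {
		brokers = 1
	}
	if peakConsumers < 1 {
		peakConsumers = 1
	}
	// Integer ceiling of peakConsumers/brokers, scaled back up by brokers.
	return ((peakConsumers + brokers - 1) / brokers) * brokers
}

func main() {
	fmt.Println(suggestPartitions(6, 40)) // 42
	fmt.Println(suggestPartitions(3, 12)) // 12
}
```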

Async Produce and TryProduce (Fail Fast)

  • Use the async produce method of your Kafka client whenever possible, and add a callback to handle failures.
  • Use TryProduce to fail fast instead of blocking when the message batch/buffer is already full. This keeps your producer threads/goroutines under control and prevents OOMs in high-scale services.
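
The fail-fast behaviour can be illustrated with a bounded channel and a non-blocking send. This is a sketch of the idea only, not of any client's internals: `boundedBuffer`, `tryProduce` and `errBufferFull` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errBufferFull is returned when the buffer cannot accept another message.
var errBufferFull = errors.New("producer buffer full")

// boundedBuffer is a hypothetical in-memory producer buffer with a hard cap.
type boundedBuffer struct{ ch chan []byte }

func newBoundedBuffer(capacity int) *boundedBuffer {
	return &boundedBuffer{ch: make(chan []byte, capacity)}
}

// tryProduce enqueues msg if there is room and otherwise returns
// immediately with errBufferFull instead of blocking the caller.
func (b *boundedBuffer) tryProduce(msg []byte) error {
	select {
	case b.ch <- msg:
		return nil
	default:
		return errBufferFull
	}
}

// fillAndCount attempts `attempts` produces against a buffer that nobody
// drains and reports how many were accepted and rejected.
func fillAndCount(capacity, attempts int) (accepted, rejected int) {
	buf := newBoundedBuffer(capacity)
	for i := 0; i < attempts; i++ {
		if err := buf.tryProduce([]byte("event")); err != nil {
			rejected++ // caller decides: drop, count a metric, or return an error upstream
			continue
		}
		accepted++
	}
	return accepted, rejected
}

func main() {
	fmt.Println(fillAndCount(3, 5)) // 3 2
}
```

With a blocking send, the two rejected calls would instead park their goroutines until the buffer drained, which is how memory and goroutine counts grow during a broker slowdown.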

Binary Serialization and Compression Are a Must

  • Binary serialization and compression of produced messages directly impact the performance of your system: fewer bytes on the wire, less data and compute at the brokers, less for consumers to read, lower end-to-end latency and lower cloud cost.
  • Produce and consume throughput affect the compute required by brokers in direct proportion.
  • Serialization with Proto/Avro also enforces data contracts.
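
To get a feel for the size difference, the sketch below encodes the same records as JSON, as a fixed-width binary layout, and as gzip-compressed binary. It uses `encoding/binary` purely as a stdlib stand-in for Proto/Avro, and the `event` struct and its fields are made up for illustration; real ratios depend heavily on the payload.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// event is a hypothetical record with fixed-width fields (20 bytes in binary).
type event struct {
	UserID uint64 `json:"user_id"`
	Amount uint32 `json:"amount"`
	TS     int64  `json:"ts"`
}

// encodedSizes encodes n events and returns the total bytes as JSON,
// as raw binary, and as gzip-compressed binary.
func encodedSizes(n int) (int, int, int, error) {
	var jsonBuf, binBuf bytes.Buffer
	for i := 0; i < n; i++ {
		e := event{UserID: uint64(i), Amount: uint32(i % 100), TS: 1700000000 + int64(i)}
		j, err := json.Marshal(e)
		if err != nil {
			return 0, 0, 0, err
		}
		jsonBuf.Write(j)
		if err := binary.Write(&binBuf, binary.LittleEndian, e); err != nil {
			return 0, 0, 0, err
		}
	}
	// Compress the whole binary batch, as a producer would per batch.
	var gz bytes.Buffer
	w := gzip.NewWriter(&gz)
	if _, err := w.Write(binBuf.Bytes()); err != nil {
		return 0, 0, 0, err
	}
	if err := w.Close(); err != nil {
		return 0, 0, 0, err
	}
	return jsonBuf.Len(), binBuf.Len(), gz.Len(), nil
}

func main() {
	j, b, z, err := encodedSizes(1000)
	fmt.Println(j, b, z, err)
}
```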

Graceful Shutdowns: Producers and Consumers

  • Handle shutdown/SIGTERM in the producer application: close the Kafka producer client so that it immediately flushes any messages still in the batch/buffer, or wait for linger.ms before the pod dies. This prevents loss of messages.
  • Handle shutdown/SIGTERM in the consumer application: stop consuming new messages, then process and commit any messages already consumed into memory. This prevents loss of messages and duplicate processing.
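
A producer-side shutdown path might look roughly like the sketch below, using `signal.NotifyContext` from the standard library. `runProducer`, `flushedOnShutdown` and the `flush` callback are hypothetical; with a real client, `flush` would be the client's flush/close call. To keep the example self-terminating, it calls `stop()` to cancel the context where a real SIGTERM would arrive.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// runProducer batches records until ctx is cancelled, then flushes
// whatever is still buffered before returning.
func runProducer(ctx context.Context, in <-chan string, batchSize int, flush func([]string)) {
	batch := make([]string, 0, batchSize)
	for {
		select {
		case <-ctx.Done():
			if len(batch) > 0 {
				flush(batch) // final flush so buffered records are not lost
			}
			return
		case rec := <-in:
			batch = append(batch, rec)
			if len(batch) >= batchSize {
				flush(batch)
				batch = make([]string, 0, batchSize)
			}
		}
	}
}

// flushedOnShutdown sends n records, triggers shutdown, and returns how
// many records were flushed in total.
func flushedOnShutdown(n, batchSize int) int {
	// ctx is cancelled on SIGINT/SIGTERM, or when stop is called.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	in := make(chan string) // unbuffered: a completed send means it was received
	done := make(chan struct{})
	flushed := 0
	go func() {
		defer close(done)
		runProducer(ctx, in, batchSize, func(b []string) { flushed += len(b) })
	}()
	for i := 0; i < n; i++ {
		in <- fmt.Sprintf("record-%d", i)
	}
	stop() // stands in for the SIGTERM a pod receives on termination
	<-done // wait for the final flush before exiting
	return flushed
}

func main() {
	fmt.Println(flushedOnShutdown(7, 3)) // 7: two full batches plus one flushed on shutdown
}
```

The consumer side follows the same shape: stop fetching on cancellation, finish the in-memory records, commit, then exit.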

Tiered Storage: Keep Storage Cost in Control

  • For high-volume topics, we retained only a few hours of data in local storage (SSD) and kept longer retention in the tiered storage layer.
  • Less data on SSD means lower cost.
  • This also reduced the volume of data to be transferred among brokers on a node failure.

(Note that consuming from tiered storage is slightly more expensive for consumers.)

Client-Side Observability

Along with Kafka server-side observability, client-side (producer and consumer) observability is a must. It helps in debugging issues much faster. A few metrics to observe:

Producer

  • Compression Ratio
  • Bytes Produced Rate
  • Batches Produced Rate & Batch Size
  • Messages Produced Rate
  • Produce Request Rate & latency

Consumers

  • Messages Consumed Rate
  • Batches Fetched Rate & Batch Size
  • Commit Interval and Rate
  • Heartbeat rate
  • Fetch Request Rate & Latency

Common

  • Connections opened/closed rate, active connections
  • Kafka APIs rate, bytes write/read, success/failure rate, latency
