Commit e6af874

Document metrics and consumer tuning based on metrics (#1280)
Also: fix typo and make metric descriptions consistent.
1 parent 9a60ccd commit e6af874

File tree

5 files changed: +142 −7 lines


docs/consumer-internals.svg

Lines changed: 4 additions & 0 deletions

docs/consumer-tuning.md

Lines changed: 36 additions & 0 deletions
@@ -90,3 +90,39 @@ On older zio-kafka versions `withMaxPollInterval` is not available. Use the foll
⚠️ In zio-kafka versions 2.2 up to 2.5.0 it may also be necessary to increase the `runloopTimeout` setting.
When no stream is processing data for this amount of time (while new data is available), the consumer will halt with a
failure. In zio-kafka 2.5.0 `runloopTimeout` defaults to 4 minutes, a little bit lower than `max.poll.interval.ms`.

## Using metrics to tune the consumer

Zio-kafka exposes [metrics](metrics.md) that can be used to further tune the consumer. To interpret these metrics, you need to know how zio-kafka works internally.

![](consumer-internals.svg)

The runloop is at the heart of every zio-kafka consumer.
It creates a zstream for each partition; this is the zstream your application eventually consumes from.
When the zstream starts, and every time the records queue is empty, it sends a request for data to the runloop.
The request causes the runloop to resume the partition so that the next poll may receive records.
Any received records are put in the records queue.
When the queue reaches a certain size (as determined by the configured `FetchStrategy`), the partition is paused.
Meanwhile, the zstream reads from the queue and emits the records to your application.

An optimally configured consumer has the following properties:

- the zstreams never have to wait for new records (to get high throughput),
- most of the time, the record queues are empty (to get low latency and low heap usage).

The following strategy can help you get to this state:

1. First make sure that `pollTimeout` and `max.poll.records` make sense for the latency and throughput requirements
   of your application.
2. Configure `partitionPreFetchBufferLimit` to `0`.
3. Observe the metric `ziokafka_consumer_queue_polls`, which gives the number of polls during which records are idling in
   the queue.
4. Increase `partitionPreFetchBufferLimit` in steps until most measurements of the `ziokafka_consumer_queue_polls`
   histogram are in the `0` bucket.
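Steps 1 and 2 above could be sketched as follows. This is a minimal, hedged example: the bootstrap server, group id, and the concrete values are placeholders, and the exact builder methods may vary between zio-kafka versions.

```scala
import zio._
import zio.kafka.consumer.ConsumerSettings

// Sketch of a starting point for the tuning strategy above.
// All concrete values are placeholders; adjust them for your application.
val settings: ConsumerSettings =
  ConsumerSettings(List("localhost:9092")) // placeholder bootstrap server
    .withGroupId("my-group")               // placeholder group id
    .withPollTimeout(50.millis)            // step 1: poll timeout for your latency needs
    .withMaxPollRecords(1000)              // step 1: max.poll.records for your throughput needs
    .withPartitionPreFetchBufferLimit(0)   // step 2: start at 0, then increase in steps
```

After observing `ziokafka_consumer_queue_polls`, only the `withPartitionPreFetchBufferLimit` value needs to change between iterations.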
During this process, it is useful to observe the metric `ziokafka_consumer_queue_size` (number of records in the queues) to
see if the queues are indeed increasing in size.

When many (hundreds of) partitions need to be consumed, the metric `ziokafka_consumer_all_queue_size` should also be
observed, as increasing `partitionPreFetchBufferLimit` can lead to high heap usage. (See 'High number of partitions'
above.)
