Commit e6af874

Document metrics and consumer tuning based on metrics (#1280)
Also: fix typo and make metric descriptions consistent.
1 parent 9a60ccd commit e6af874

File tree

5 files changed: +142 −7 lines


docs/consumer-internals.svg

Lines changed: 4 additions & 0 deletions

docs/consumer-tuning.md

Lines changed: 36 additions & 0 deletions
@@ -90,3 +90,39 @@ On older zio-kafka versions `withMaxPollInterval` is not available. Use the foll
⚠️ In zio-kafka versions 2.2 up to 2.5.0 it may also be necessary to increase the `runloopTimeout` setting.
When no stream is processing data for this amount of time (while new data is available), the consumer will halt with a
failure. In zio-kafka 2.5.0 `runloopTimeout` defaults to 4 minutes, a little bit lower than `max.poll.interval.ms`.

## Using metrics to tune the consumer

Zio-kafka exposes [metrics](metrics.md) that can be used to further tune the consumer. To interpret these metrics, you need to know how zio-kafka works internally.

![](consumer-internals.svg)

The runloop is at the heart of every zio-kafka consumer.
It creates a zstream for each partition; this is the zstream your application eventually consumes from.
When the zstream starts, and every time the records queue is empty, it sends a request for data to the runloop.
The request causes the runloop to resume the partition so that the next poll may receive records.
Any received records are put in the records queue.
When the queue reaches a certain size (as determined by the configured `FetchStrategy`), the partition is paused.
Meanwhile, the zstream reads from the queue and emits the records to your application.

An optimally configured consumer has the following properties:

- the zstreams never have to wait for new records (to get high throughput),
- most of the time, the record queues are empty (to get low latency and low heap usage).

The following strategy can help you get to this state:

1. First make sure that `pollTimeout` and `max.poll.records` make sense for the latency and throughput requirements
   of your application.
2. Configure `partitionPreFetchBufferLimit` to `0`.
3. Observe the metric `ziokafka_consumer_queue_polls`, which gives the number of polls during which records are idling in
   the queue.
4. Increase `partitionPreFetchBufferLimit` in steps until most measurements of the `ziokafka_consumer_queue_polls`
   histogram are in the `0` bucket.
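Steps 1 and 2 above could be sketched as follows. This is a minimal, hedged example: the bootstrap server, group id, and the concrete values are placeholders, and the exact builder methods may vary between zio-kafka versions.

```scala
import zio._
import zio.kafka.consumer.ConsumerSettings

// Sketch of a starting point for the tuning strategy above.
// All concrete values are placeholders; adjust them for your application.
val settings: ConsumerSettings =
  ConsumerSettings(List("localhost:9092")) // placeholder bootstrap server
    .withGroupId("my-group")               // placeholder group id
    .withPollTimeout(50.millis)            // step 1: poll timeout for your latency needs
    .withMaxPollRecords(1000)              // step 1: max.poll.records for your throughput needs
    .withPartitionPreFetchBufferLimit(0)   // step 2: start at 0, then increase in steps
```

After observing `ziokafka_consumer_queue_polls`, only the `withPartitionPreFetchBufferLimit` value needs to change between iterations.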
During this process, it is useful to observe the metric `ziokafka_consumer_queue_size` (number of records in the queues) to
see if the queues are indeed increasing in size.

When many (hundreds of) partitions need to be consumed, the metric `ziokafka_consumer_all_queue_size` should also be
observed, as increasing `partitionPreFetchBufferLimit` can lead to high heap usage. (See 'High number of partitions'
above.)
