
More accurate consumption metrics in presence of server restarts and replication #11441

Open
kirkrodrigues opened this issue Aug 25, 2023 · 4 comments

@kirkrodrigues
Contributor

Currently, metrics like REALTIME_ROWS_CONSUMED are emitted by each partition consumer, so we can track real-time consumption at the topic level by aggregating the metric per time window. However, because the metric is emitted by every replica, the aggregation is complicated: ideally we would just divide the sum by the replication factor, but when replicas restart (and start re-consuming a segment) or fall behind in consumption, that calculation is no longer accurate.

I'm wondering if there's any way to emit/calculate such a metric, accurately, at the topic-level without too much overhead.
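To make the problem concrete, here is a minimal sketch of the naive divide-by-replication-factor aggregation and how a replica restart skews it. All names and numbers are illustrative, not Pinot APIs:

```python
# Hypothetical sketch: aggregating per-replica REALTIME_ROWS_CONSUMED
# counter deltas into a topic-level count. Names and data are illustrative.

REPLICATION_FACTOR = 3

# Per-window counter deltas reported by each (partition, replica) consumer.
# In steady state, every replica of a partition reports the same delta.
window_deltas = {
    ("partition-0", "replica-a"): 1000,
    ("partition-0", "replica-b"): 1000,
    ("partition-0", "replica-c"): 1000,
}

def naive_topic_rows(deltas, replication_factor):
    """Divide the summed per-replica deltas by the replication factor."""
    return sum(deltas.values()) / replication_factor

print(naive_topic_rows(window_deltas, REPLICATION_FACTOR))  # 1000.0

# The problem: if replica-c restarts and re-consumes the current segment,
# it re-reports rows the other replicas already counted in earlier windows,
# inflating the topic-level estimate.
window_deltas[("partition-0", "replica-c")] = 3000  # includes re-consumed rows
print(naive_topic_rows(window_deltas, REPLICATION_FACTOR))  # ~1666.7, not 1000
```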

@Jackie-Jiang
Contributor

So this metric emits the total rows consumed for a particular stream partition since the server started. Do you think it would be more useful if it emitted the total rows consumed for the latest consuming segment? Essentially, we would reset the count for each new consuming segment.

cc @mcvsubbu @sajjad-moradi

@mcvsubbu
Contributor

If I understand your comment correctly, you want a metric for the total number of rows consumed by Pinot?
If you are OK getting this metric with a lag (and in steps), then you can derive it from the number of rows in each segment after it is completed. Perhaps the controller emits such a metric now (total number of rows in a table)?

If you want this in realtime, another way of getting this is to see the total number of rows/bytes produced to the topic. If your input stream offers such a metric that may work for you.

At LinkedIn, the METERs are projected as rates, so we get to see the consumption rate of each partition with this metric. We match these against production rates when we want to debug. We also rely heavily on lag metrics (the time or offset lag of any partition of a topic) to assess the quality of service to our consumers.

Generally, the volume of consumption is of less interest to us.
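The rate-matching and lag-tracking approach described above can be sketched as follows. This is a hypothetical illustration with made-up numbers, not LinkedIn's or Pinot's actual metric pipeline:

```python
# Hypothetical sketch: compare per-partition consumption rates against
# production rates, and track offset lag. All numbers are illustrative.

def rate(rows, window_seconds):
    """Rows per second over a metric window."""
    return rows / window_seconds

def offset_lag(latest_produced_offset, latest_consumed_offset):
    """Offset lag for one partition of a topic."""
    return latest_produced_offset - latest_consumed_offset

produced = rate(60_000, 60)  # producer side: 1000 rows/s
consumed = rate(54_000, 60)  # consumer side: 900 rows/s

# A sustained gap between the two rates means the consumer is falling
# behind, which also shows up as a growing offset lag.
assert consumed < produced
print(offset_lag(1_200_000, 1_140_000))  # 60000 offsets behind
```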

@sajjad-moradi
Contributor

If the goal is to catch consumption issues, then, as Subbu mentioned, the freshness metric can be used. If the goal is to know how many rows are in the table, this metric is problematic. If one replica has a consumption issue, the aggregated count divided by the replica count shows an incorrect value: the non-problematic replica(s) serve queries correctly, while the aggregate metric suggests there are fewer rows in the table than are actually served.
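A small numeric illustration of this point, with invented replica counts: when one replica lags, the averaged aggregate understates the rows that the healthy replicas actually serve.

```python
# Hypothetical per-replica row counts for one partition; replica-c is
# lagging while replica-a and replica-b are healthy.
replica_rows = {"replica-a": 10_000, "replica-b": 10_000, "replica-c": 4_000}

# The averaged aggregate understates the table size...
aggregate_estimate = sum(replica_rows.values()) / len(replica_rows)  # 8000.0

# ...while queries routed to a healthy replica see the full row count.
rows_served = max(replica_rows.values())  # 10000

print(aggregate_estimate, rows_served)
```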

@hpvd

hpvd commented Apr 9, 2024

Perhaps this is something to keep in mind when further refining the metrics: #12840
The interesting thing there is not only the standardization, but also the connection between metrics, logs, traces, and profiles.
