
More accurate consumption metrics in presence of server restarts and replication #11441

Open
kirkrodrigues opened this issue Aug 25, 2023 · 4 comments

@kirkrodrigues
Contributor

Currently, metrics like REALTIME_ROWS_CONSUMED are emitted by each partition consumer, so we can track real-time consumption at the topic level by aggregating the metric per time window. However, because the metric is emitted by every replica, the aggregation is complicated: ideally we would just divide the sum by the replication factor, but when replicas restart (and start re-consuming a segment) or fall behind in consumption, that calculation is no longer accurate.

I'm wondering if there's any way to emit/calculate such a metric, accurately, at the topic-level without too much overhead.
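To make the problem concrete, here is a minimal sketch of the naive divide-by-replication-factor aggregation and how a replica restart skews it. All names and numbers are illustrative, not Pinot APIs:

```python
# Hypothetical sketch: aggregating per-replica REALTIME_ROWS_CONSUMED
# counter deltas into a topic-level count. Names and data are illustrative.

REPLICATION_FACTOR = 3

# Per-window counter deltas reported by each (partition, replica) consumer.
# In steady state, every replica of a partition reports the same delta.
window_deltas = {
    ("partition-0", "replica-a"): 1000,
    ("partition-0", "replica-b"): 1000,
    ("partition-0", "replica-c"): 1000,
}

def naive_topic_rows(deltas, replication_factor):
    """Divide the summed per-replica deltas by the replication factor."""
    return sum(deltas.values()) / replication_factor

print(naive_topic_rows(window_deltas, REPLICATION_FACTOR))  # 1000.0

# The problem: if replica-c restarts and re-consumes the current segment,
# it re-reports rows the other replicas already counted in earlier windows,
# inflating the topic-level estimate.
window_deltas[("partition-0", "replica-c")] = 3000  # includes re-consumed rows
print(naive_topic_rows(window_deltas, REPLICATION_FACTOR))  # ~1666.7, not 1000
```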

@Jackie-Jiang
Contributor

So this metric emits the total rows consumed for a particular stream partition since the server started. Do you think it would be more useful if it emitted the total rows consumed for the latest consuming segment? Essentially, we would reset the count for each new consuming segment.

cc @mcvsubbu @sajjad-moradi

@mcvsubbu
Contributor

If I understand your comment correctly, you want a metric for the total number of rows consumed by Pinot?
If you are OK getting this metric with a lag (and in steps), then you can derive it from the number of rows in each segment after it is completed. Perhaps the controller emits such a metric now (total number of rows in a table)?

If you want this in realtime, another way of getting this is to see the total number of rows/bytes produced to the topic. If your input stream offers such a metric that may work for you.

At LinkedIn, the METERs are projected as rates, so we get to see the consumption rate of each partition with this metric. We match these against production rates when we want to debug. We also rely heavily on lag metrics (the time or offset lag of any partition of a topic) to assess the quality of service to our consumers.

Generally, the volume of consumption is of less interest to us.
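The rate-matching and lag-tracking approach described above can be sketched as follows. This is a hypothetical illustration with made-up numbers, not LinkedIn's or Pinot's actual metric pipeline:

```python
# Hypothetical sketch: compare per-partition consumption rates against
# production rates, and track offset lag. All numbers are illustrative.

def rate(rows, window_seconds):
    """Rows per second over a metric window."""
    return rows / window_seconds

def offset_lag(latest_produced_offset, latest_consumed_offset):
    """Offset lag for one partition of a topic."""
    return latest_produced_offset - latest_consumed_offset

produced = rate(60_000, 60)  # producer side: 1000 rows/s
consumed = rate(54_000, 60)  # consumer side: 900 rows/s

# A sustained gap between the two rates means the consumer is falling
# behind, which also shows up as a growing offset lag.
assert consumed < produced
print(offset_lag(1_200_000, 1_140_000))  # 60000 offsets behind
```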

@sajjad-moradi
Contributor

If the goal is to catch consumption issues, then, as Subbu mentioned, the freshness metric can be used. If the goal is to know how many rows are in the table, this metric is problematic. If one replica has a consumption issue, the aggregated count divided by the replica count shows an incorrect value: the non-problematic replica(s) serve queries correctly, while the aggregate metric suggests there are fewer rows in the table than are actually served.
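A small numeric illustration of this point, with invented replica counts: when one replica lags, the averaged aggregate understates the rows that the healthy replicas actually serve.

```python
# Hypothetical per-replica row counts for one partition; replica-c is
# lagging while replica-a and replica-b are healthy.
replica_rows = {"replica-a": 10_000, "replica-b": 10_000, "replica-c": 4_000}

# The averaged aggregate understates the table size...
aggregate_estimate = sum(replica_rows.values()) / len(replica_rows)  # 8000.0

# ...while queries routed to a healthy replica see the full row count.
rows_served = max(replica_rows.values())  # 10000

print(aggregate_estimate, rows_served)
```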

@hpvd

hpvd commented Apr 9, 2024

Perhaps this is something to keep in mind when further refining the metrics: #12840
The interesting thing there is not only the standardization, but also the connection between metrics, logs, traces, and profiles.
