More accurate consumption metrics in presence of server restarts and replication #11441
Comments
So this metric will emit the total rows consumed for a particular stream partition since the server started. Do you think it would be more useful if it emitted the total rows consumed for the latest consuming segment? Essentially, we would reset the count for each new consuming segment.
If I understand your comment correctly, you want a metric for the total number of rows consumed by Pinot? If you want this in real time, another way of getting it is to look at the total number of rows/bytes produced to the topic; if your input stream offers such a metric, that may work for you. At LinkedIn, the METERs are projected as rates, so we get to see the consumption rate for each partition with this metric. We match these against production rates when we want to debug. We also rely heavily on lag metrics (how far behind any partition of a topic is, in time or offsets) to assess the quality of service to our consumers. Generally, the volume of consumption is of less interest to us.
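To make the lag idea concrete, here is a minimal sketch with hypothetical offsets (the values and variable names are illustrative, not Pinot's actual metric export): offset lag per partition is simply the latest offset available upstream minus the offset consumed so far.

```python
# Minimal sketch (hypothetical values) of per-partition offset lag:
# lag = latest offset available in the stream - offset consumed so far.
latest_offsets   = {0: 120_500, 1: 98_200, 2: 110_040}   # reported by the upstream topic
consumed_offsets = {0: 120_480, 1: 97_900, 2: 110_040}   # reported by the consumers

lag = {p: latest_offsets[p] - consumed_offsets[p] for p in latest_offsets}
worst_partition = max(lag, key=lag.get)

print(lag)                                    # {0: 20, 1: 300, 2: 0}
print(worst_partition, lag[worst_partition])  # partition 1 lags by 300 offsets
```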
If the goal is to catch consumption issues, as Subbu mentioned, the freshness metric can be used. If the goal is to know how many rows are in the table, using this metric is problematic. If one replica has a consumption issue, the aggregated count divided by the replica count gives an incorrect value: the non-problematic replica(s) still serve queries correctly, while the aggregate metric suggests there are fewer rows in the table than there really are.
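A minimal numeric sketch of that failure mode, with made-up row counts: when one of two replicas has fallen behind (or is re-consuming after a restart), the aggregate divided by the replication factor understates the rows actually being served.

```python
# Hypothetical values: two replicas of the same partition, one healthy and one
# restarted. Dividing the aggregate by the replication factor understates the
# row count that the healthy replica is actually serving.
replication_factor = 2

rows_consumed = {
    "replica_0": 1_000_000,  # healthy replica, fully caught up
    "replica_1": 400_000,    # restarted replica, still re-consuming the segment
}

aggregate = sum(rows_consumed.values())          # 1,400,000
naive_estimate = aggregate / replication_factor  # 700,000

actual_rows_served = max(rows_consumed.values())  # 1,000,000 served by the healthy replica

print(f"naive estimate: {naive_estimate:,.0f}, actual: {actual_rows_served:,}")
```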
Perhaps this is something to keep in mind when thinking about further refining the metrics: #12840
Currently, metrics like REALTIME_ROWS_CONSUMED are emitted by each partition consumer, which we can use to track real-time consumption at the topic level by aggregating the metric per time window. However, the metric is emitted by each replica, which makes it complicated to aggregate: ideally we would just divide the aggregate by the replication factor, but when replicas restart (and start re-consuming a segment) or fall behind in consumption, this calculation is no longer accurate. I'm wondering if there's any way to emit or calculate such a metric accurately at the topic level without too much overhead.
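One possible direction, sketched below under assumptions: instead of dividing by the replication factor, take the maximum per-window count across replicas of each partition (the most caught-up replica) and sum across partitions. The sample tuples and label names here are hypothetical, not Pinot's actual metric export, and this still over-counts if the most caught-up replica itself re-consumes rows after a restart.

```python
# Hypothetical sketch of a topic-level aggregation that avoids dividing by the
# replication factor: take the max per-window REALTIME_ROWS_CONSUMED per
# partition, then sum across partitions. Sample format and labels are assumed.
from collections import defaultdict

samples = [
    # (table, partition, replica_host, rows_consumed_in_window)
    ("orders_REALTIME", 0, "server-1", 50_000),
    ("orders_REALTIME", 0, "server-2", 20_000),   # restarted, still re-consuming
    ("orders_REALTIME", 1, "server-1", 48_000),
    ("orders_REALTIME", 1, "server-3", 48_500),
]

per_partition = defaultdict(int)
for table, partition, host, rows in samples:
    per_partition[(table, partition)] = max(per_partition[(table, partition)], rows)

topic_level = sum(per_partition.values())  # 50,000 + 48,500 = 98,500
print(topic_level)
```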