But one missing piece is the diagnosis of performance issues. In this proposal, I'd like to take back-pressure rate as an example, because this is the most helpful metric to identify a performance bottleneck.
We already display this on the meta dashboard. However, on-prem deployments often don't use the Prometheus YAML we provide, so the meta dashboard doesn't actually work there.
Proposal
Many systems have embedded or self-contained performance monitoring components. For example,
Flink's dashboard can show the back-pressure rate without Prometheus or similar external tools.
Spark Web UI can show the progress of each stage in a batch job.
MySQL has a performance_schema that exposes a lot of internal information.
Example of Spark web UI:
I'd like to introduce a new self-contained monitoring component on the meta node and the compute nodes (CNs). When requested via RPC, it collects data from each CN and shows the current back-pressure (more accurately, the back-pressure over a recent time window, e.g. the last 15 seconds).
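To make the idea concrete, here is a minimal sketch of the two halves: a per-CN recorder that reports the fraction of the recent window spent blocked, and a meta-side function that polls every CN on request. All names here (`BackPressureWindow`, `record_blocked`, `collect_back_pressure`) are hypothetical, the RPC is replaced by a plain function call, and defining the rate as "blocked time divided by window length" is an assumption, not a description of the existing metric.

```python
from collections import deque
from dataclasses import dataclass, field

WINDOW_SECS = 15.0  # assumed window, taken from the "last 15 seconds" example


@dataclass
class BackPressureWindow:
    """Hypothetical per-CN recorder of intervals during which output was blocked.

    Assumes intervals are recorded in time order and do not overlap.
    """
    window: float = WINDOW_SECS
    _blocked: deque = field(default_factory=deque)  # (start, end) pairs, oldest first

    def record_blocked(self, start: float, end: float) -> None:
        self._blocked.append((start, end))

    def rate(self, now: float) -> float:
        """Fraction of the last `window` seconds spent blocked, in [0, 1]."""
        cutoff = now - self.window
        # Drop intervals that ended before the window started; no history is kept.
        while self._blocked and self._blocked[0][1] <= cutoff:
            self._blocked.popleft()
        # Count only the part of each interval that lies inside [cutoff, now].
        blocked = sum(
            min(end, now) - max(start, cutoff)
            for start, end in self._blocked
            if start < now
        )
        return min(1.0, max(0.0, blocked / self.window))


def collect_back_pressure(compute_nodes: dict, now: float) -> dict:
    """Stand-in for the meta node's RPC handler: ask every CN for its current rate."""
    return {name: cn.rate(now) for name, cn in compute_nodes.items()}


cn1 = BackPressureWindow()
cn1.record_blocked(100.0, 103.0)   # 3 s blocked, inside the window at now=110
cn2 = BackPressureWindow()
cn2.record_blocked(80.0, 84.0)     # ended before the window, so it is ignored
rates = collect_back_pressure({"cn-1": cn1, "cn-2": cn2}, now=110.0)
```

Passing `now` explicitly instead of reading the clock inside `rate` keeps the sketch deterministic; a real component would presumably use a monotonic clock.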
Embedded vs. Prometheus
The embedded metrics are not intended to replace Prometheus; they are just a lightweight complement.
The embedded metrics are designed for end users. The Prometheus metrics are for us, and they are full and complete.
The embedded metrics never store historical data. They can only display the current situation.
The embedded metrics don't need to be persisted. Keep them small and in-memory.
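One way to honor the "no history, small, in-memory" constraints is a fixed-size ring of per-second buckets, so memory stays constant and old data is overwritten rather than stored. This is only a sketch under assumptions: `BucketRing`, its 15 slots, and one-second granularity are hypothetical choices, not part of the proposal.

```python
class BucketRing:
    """Hypothetical constant-memory store: blocked seconds per wall-clock second."""

    def __init__(self, slots: int = 15) -> None:
        self.slots = slots
        self.blocked = [0.0] * slots   # blocked seconds accumulated in each slot
        self.stamp = [-1] * slots      # which second each slot currently holds

    def add(self, second: int, blocked_secs: float) -> None:
        i = second % self.slots
        if self.stamp[i] != second:    # slot holds an expired second: overwrite it
            self.stamp[i] = second
            self.blocked[i] = 0.0
        self.blocked[i] += blocked_secs

    def rate(self, now_second: int) -> float:
        """Blocked fraction over the last `slots` seconds, ending at `now_second`."""
        oldest = now_second - self.slots + 1
        total = sum(
            b for b, s in zip(self.blocked, self.stamp)
            if oldest <= s <= now_second
        )
        return total / self.slots


ring = BucketRing()
ring.add(100, 0.5)
ring.add(101, 1.0)
ring.add(115, 0.25)   # reuses the slot that held second 100
```

Nothing is written to disk and nothing older than the window survives, so a restart simply starts from an empty ring.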
For now, I recommend starting with back-pressure metrics only; I have no further plans beyond that yet. As mentioned before, this is the most helpful metric for identifying a performance bottleneck.
Background
We are actively improving observability in this quarter, including