feat: Embedded (back-pressure) metrics for dashboard #13830

fuyufjh · 2023-12-06T07:46:12Z

Background

We are actively improving observability this quarter, including:

Improve error reporting: refactor: reinvestigate error handling and reporting in the system #11443
Added a system table: Discussion: persist system events and query via SQL #13267

But one missing piece is the diagnosis of performance issues. In this proposal, I'd like to take back-pressure rate as an example, because this is the most helpful metric to identify a performance bottleneck.

We already display this on the meta dashboard. However, on-prem deployments often don't use the Prometheus YAML we provide, so the meta dashboard can't actually work there.

Proposal

Many systems have embedded or self-contained performance monitoring components. For example,

Flink's dashboard can show back-pressure rate without support of Prometheus, etc.
Spark Web UI can show the progress of each stage in a batch job
MySQL has a performance_schema which exposes lots of internal info

Example of Spark web UI:

I'd like to introduce a new self-contained monitoring component on the Meta node and the compute nodes (CNs). When requested via RPC, it collects data from each CN and shows the back-pressure rate at this moment (more accurately, over a recent time period, e.g. the last 15 seconds).
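To make the request path concrete, here is a minimal Python sketch of the idea, not the actual implementation. It assumes each CN exposes a monotonically increasing "time spent blocked" counter, and that a hypothetical `fetch(node)` call (standing in for the RPC) returns two readings of that counter taken roughly 15 seconds apart. The names `Sample`, `back_pressure_rate`, `collect`, and `fetch` are all illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    """One reading of a hypothetical cumulative blocked-time counter."""
    timestamp_s: float  # when the reading was taken
    blocked_s: float    # total seconds spent blocked so far (assumed monotonic)


def back_pressure_rate(earlier: Sample, later: Sample) -> float:
    """Fraction of the interval spent blocked, clamped to [0, 1]."""
    elapsed = later.timestamp_s - earlier.timestamp_s
    if elapsed <= 0:
        return 0.0  # no usable interval, report no back-pressure
    rate = (later.blocked_s - earlier.blocked_s) / elapsed
    return max(0.0, min(1.0, rate))


# Hypothetical RPC stand-in: returns (earlier, later) samples for one node.
Fetch = Callable[[str], Tuple[Sample, Sample]]


def collect(nodes: Sequence[str], fetch: Fetch) -> Dict[str, float]:
    """Query every compute node concurrently and compute its current rate."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # pool.map yields results in the same order as `nodes`.
        results = list(pool.map(fetch, nodes))
    return {node: back_pressure_rate(a, b) for node, (a, b) in zip(nodes, results)}
```

Nothing is stored between requests here: each call to `collect` produces a fresh snapshot, which matches the "current situation only" goal of the proposal.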

Embedded vs. Prometheus

The embedded metrics are not intended to replace Prometheus; they are just a lightweight complement.

The embedded metrics are designed for end users. The Prometheus metrics are for us, and they are full and complete.
The embedded metrics never store historical data. They can only display the current situation.
The embedded metrics don't need to be persisted. Keep them small and in-memory.
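As one possible reading of "small and in-memory", the sketch below keeps only the samples that fall inside a recent window and discards everything older. It is an illustration under assumptions, not the design adopted in the project: the class name `RecentWindow`, the 15-second default, and the cumulative blocked-time counter are all hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RecentWindow:
    """Holds (timestamp_s, blocked_s) samples from the last `window_s` seconds only."""

    def __init__(self, window_s: float = 15.0) -> None:
        self.window_s = window_s
        self._samples: Deque[Tuple[float, float]] = deque()

    def __len__(self) -> int:
        return len(self._samples)

    def record(self, timestamp_s: float, blocked_s: float) -> None:
        """Add a reading of the cumulative counter and evict expired ones."""
        self._samples.append((timestamp_s, blocked_s))
        cutoff = timestamp_s - self.window_s
        # Older samples are dropped, never persisted: no history is kept.
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def current_rate(self) -> Optional[float]:
        """Blocked fraction over the retained window, or None if undetermined."""
        if len(self._samples) < 2:
            return None
        (t0, b0), (t1, b1) = self._samples[0], self._samples[-1]
        if t1 <= t0:
            return None
        return max(0.0, min(1.0, (b1 - b0) / (t1 - t0)))
```

Memory use is bounded by the sampling frequency times the window length, so the component stays cheap regardless of how long the node has been running.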

For now, I recommend starting from back-pressure metrics only. I don't have any future plans now. As mentioned before, this is the most helpful metric to identify a performance bottleneck.


yufansong · 2024-03-06T04:06:42Z

Closing, as this is already finished.

github-actions bot added this to the release-1.6 milestone Dec 6, 2023

fuyufjh added the needs-discussion and needs-design labels Dec 6, 2023

fuyufjh mentioned this issue Dec 18, 2023

feat(dashboard): add diagnose command #14025

Merged

9 tasks

fuyufjh assigned yufansong Jan 9, 2024

yufansong mentioned this issue Jan 25, 2024

feat(meta): add get back pressure RPC for UI dashboard #14790

Merged

9 tasks

fuyufjh changed the title ~~Discussion: Embedded (back-pressure) metrics for dashboard~~ feat: Embedded (back-pressure) metrics for dashboard Jan 25, 2024

This was referenced Jan 29, 2024

refactor(meta): move get_back_pressure into monitor service and make it concurrent #14829

Merged

feat(dashboard): add back_pressure_rate UI #14863

Merged

yufansong closed this as completed Mar 6, 2024

BugenZhao mentioned this issue Jun 18, 2024

visualize inter-job back-pressure rate in dashboard #17298

Closed

fuyufjh mentioned this issue Nov 13, 2024

Explain Analyze #896

Open
