[Metrics] Add the OOM failure graph (#33999)
Add the OOM failure graph.

I added it under the top section (tasks/actors) since it is much easier to discover there and it relates to task and actor failures.

In the long term, we can replace this graph with a failure-type graph (e.g., one that includes application failures, node failures, etc.).
rkooo567 authored Apr 6, 2023
1 parent 07b6721 commit 4c4f35a
Showing 3 changed files with 19 additions and 0 deletions.
4 changes: 4 additions & 0 deletions dashboard/client/src/pages/metrics/Metrics.tsx
@@ -119,6 +119,10 @@ const METRICS_CONFIG: MetricsSectionConfig[] = [
title: "Active Actors by Name",
pathParams: "orgId=1&theme=light&panelId=36",
},
{
title: "Out of Memory Failures by Name",
pathParams: "orgId=1&theme=light&panelId=44",
},
],
},
{
12 changes: 12 additions & 0 deletions dashboard/modules/metrics/dashboards/default_dashboard_panels.py
@@ -239,6 +239,18 @@ def max_plus_pending(max_resource, pending_resource):
),
],
),
Panel(
id=44,
title="Node Out of Memory Failures by Name",
description="The number of tasks and actors killed by the Ray Out of Memory killer due to high memory pressure. Metrics are broken down by IP and the name. https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html.",
unit="failures",
targets=[
Target(
expr='ray_memory_manager_worker_eviction_total{{instance=~"$Instance",{global_filters}}}',
legend="OOM Killed: {{Name}}, {{instance}}",
),
],
),
Panel(
id=34,
title="Node Memory by Component",
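
The doubled braces in the new panel's `expr` suggest that the string is passed through Python's `str.format` before it reaches Grafana. A minimal sketch of that expansion follows, assuming `global_filters` is a plain string of label matchers. The `render_expr` helper and the filter value are hypothetical and are not taken from this commit.

```python
# Template copied from the expr of the new panel (id=44).
EXPR_TEMPLATE = (
    "ray_memory_manager_worker_eviction_total"
    '{{instance=~"$Instance",{global_filters}}}'
)


def render_expr(template: str, global_filters: str) -> str:
    """Expand a panel expression template with str.format.

    '{{' and '}}' collapse to literal braces, and '{global_filters}'
    is replaced by the supplied label matchers.
    """
    return template.format(global_filters=global_filters)


# Hypothetical filter value, for illustration only.
rendered = render_expr(EXPR_TEMPLATE, 'SessionName=~"$SessionName"')
print(rendered)
```

The sketch covers only `expr`. The `{{Name}}` and `{{instance}}` placeholders in `legend` look like Grafana legend templates and are presumably left for Grafana to resolve.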
3 changes: 3 additions & 0 deletions doc/source/ray-observability/ray-metrics.rst
@@ -136,6 +136,9 @@ Ray exports a number of system metrics, which provide introspection into the sta
* - `ray_placement_groups`
- `State`
- Current number of placement groups by state. The State label (e.g., PENDING, CREATED, REMOVED) describes the state of the placement group. See `rpc::PlacementGroupTable <https://github.com/ray-project/ray/blob/e85355b9b593742b4f5cb72cab92051980fa73d3/src/ray/protobuf/gcs.proto#L517>`_ for more information.
* - `ray_memory_manager_worker_eviction_total`
- `Type`, `Name`
- The number of tasks and actors killed by the Ray Out of Memory killer (https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html) broken down by types (whether it is tasks or actors) and names (name of tasks and actors).
* - `ray_node_cpu_utilization`
- `InstanceId`
- The CPU utilization per node as a percentage quantity (0..100). This should be scaled by the number of cores per node to convert the units into cores.
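
To illustrate the `Type` and `Name` breakdown documented for `ray_memory_manager_worker_eviction_total`, here is a minimal sketch that sums counter samples by one label. The sample values, the label values, and the `sum_by` helper are all hypothetical. Only the label keys come from the documentation row above.

```python
from collections import defaultdict

# Hypothetical samples as (labels, counter value) pairs. The label keys
# follow the documented `Type` and `Name` labels; every value is made up.
samples = [
    ({"Type": "task", "Name": "train_step"}, 3.0),
    ({"Type": "task", "Name": "preprocess"}, 1.0),
    ({"Type": "actor", "Name": "ReplayBuffer"}, 2.0),
    ({"Type": "task", "Name": "train_step"}, 4.0),
]


def sum_by(samples, label):
    """Sum sample values grouped by the given label key."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


by_type = sum_by(samples, "Type")
by_name = sum_by(samples, "Name")
print(by_type)
print(by_name)
```

In a real deployment this grouping would more likely be done in the query layer. The sketch only shows what "broken down by type and name" means for the counter.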