[PSU & system health] Support PSU power threshold checking #1060

Merged
merged 23 commits into from
Nov 18, 2022
Changes from 14 commits
39aa11d
PSU power exceeding check - initial revision
stephenxs Aug 4, 2022
6afc6e7
Fix review comments
stephenxs Aug 24, 2022
cb34e5f
Rename platform API
stephenxs Aug 26, 2022
d8d9539
Update field name in state db
stephenxs Aug 26, 2022
cb3615d
System health update
stephenxs Aug 10, 2022
336c5b5
Update database field name
stephenxs Aug 29, 2022
28e9913
Merge branch 'master' into psu-power-threshold-github
liat-grozovik Aug 29, 2022
bd7e2e6
Merge branch 'master' into psu-power-threshold-github
liat-grozovik Sep 19, 2022
e7398f7
Merge branch 'master' into psu-power-threshold-github
liat-grozovik Sep 28, 2022
27ced8c
Merge branch 'master' into psu-power-threshold-github
stephenxs Oct 4, 2022
97c1fae
Rephrase and fix typo
stephenxs Oct 4, 2022
6f9bace
Both critical and warning thresholds should be exposed
stephenxs Oct 4, 2022
c947b6e
Add a picture to describe warning/critical thresholds
stephenxs Oct 11, 2022
ac809ed
Merge branch 'master' into psu-power-threshold-github
liat-grozovik Oct 19, 2022
74a5fbe
Merge branch 'master' into psu-power-threshold-github
liat-grozovik Nov 9, 2022
1ac1eff
Fix Prince's comments
stephenxs Nov 13, 2022
07a314f
power_warning_threshold => power_warning_suppress_threshold
stephenxs Nov 17, 2022
9bcf22d
Update picture
stephenxs Nov 17, 2022
55dea13
Fix comments
stephenxs Nov 17, 2022
0993402
Fix review comments
stephenxs Nov 18, 2022
a0ae272
Update CLI
stephenxs Nov 18, 2022
05c5fea
More warning => warning-suppress
stephenxs Nov 18, 2022
0caacec
Merge branch 'master' into psu-power-threshold-github
prgeor Nov 18, 2022
201 changes: 171 additions & 30 deletions doc/psud/PSU_daemon_design.md
@@ -1,6 +1,6 @@
# SONiC PSU Daemon Design #

### Rev 0.2 ###
### Rev 0.4 ###

### Revision ###

@@ -10,6 +10,7 @@
| 0.1 | | Chen Junchao | Initial version |
| 0.2 | August 4th, 2022 | Stephen Sun | Update according to the current implementation |
| 0.3 | August 8th, 2022 | Or Farfara | Add input current, voltage and max power |
| 0.4 | August 18th, 2022 | Stephen Sun | PSU power threshold checking logic |


## 1. Overview
@@ -28,6 +29,17 @@ The purpose of PSU daemon is to collect platform PSU data and trigger proper act
- whether the PSU voltage exceeds the minimal and maximum thresholds
- whether the PSU temperature exceeds the threshold
- whether the total PSU power consumption exceeds the budget (modular switch only)
- whether PSU power consumption exceeds the PSU threshold

### 1.1 PSU power threshold check

#### 1.1.1 Why we need it

An Ethernet switch is typically equipped with more than one PSU for redundancy. It can be deployed in different scenarios with different types of xSFP modules, traffic types, and traffic loads at different temperatures. All these factors affect the power consumption of an Ethernet switch.

On some platforms, the capacity of a single PSU is not large enough to power all the components and xSFP modules running at their highest performance at the same time. In this case, redundancy is lost and users should be notified. This is achieved by periodically checking the current power of the PSUs against their maximum allowed power, also known as the power thresholds.

On some platforms, the maximum allowed power of the PSUs is not fixed but dynamic, depending on other factors such as the temperature of certain sensors on the switch.

## 2. PSU data collection

@@ -37,40 +49,103 @@ PSU daemon data collection flow diagram:

Now psud collects PSU data via the platform API, and it also supports the platform plugin for backward compatibility. All PSU data is saved to the Redis database for further use.

### 2.1 PSU data collection specific to PSU power exceeding check

We will leverage the existing framework of the PSU daemon, adding the corresponding logic to perform the PSU power check.

Currently, the PSU daemon is woken up periodically and executes the following flows (flows in bold are newly introduced by this feature):

1. Check the PSUs' physical entity information and update it in the database
2. Check the PSUs' presence and power-good information and update it in the database
   - __When a new PSU is detected, check whether it supports the PSU power check by reading its power thresholds.__
3. Check and update the PSUs' data
   - Fetch voltage, current, and power by calling the platform API
   - __Perform the PSU power checking logic__
   - Update all the information in the database

We will detail the new flows in the following sections.

#### New PSU is detected

Basically, there are two scenarios in which a new PSU can be detected:

- When the PSU daemon starts, all PSUs installed on the switch are detected
- When a new PSU is plugged in, the new PSU is detected

When one or more new PSUs are detected and their power is good, the PSU daemon tries to retrieve the warning and critical thresholds for each PSU installed on the switch.

The power check is not performed for a PSU if a `NotImplementedError` exception is raised or `None` is returned while either threshold is being retrieved.
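
The capability check described above can be sketched as follows. This is a minimal illustration, not the psud implementation: `FakePsu` and `supports_power_check` are hypothetical names, and only the two threshold getters come from the API proposed in this change.

```python
class FakePsu:
    """Hypothetical stand-in for a platform PSU object (not a real SONiC class)."""

    def __init__(self, warning=None, critical=None, implemented=True):
        self._warning = warning
        self._critical = critical
        self._implemented = implemented

    def get_psu_power_warning_threshold(self):
        if not self._implemented:
            raise NotImplementedError
        return self._warning

    def get_psu_power_critical_threshold(self):
        if not self._implemented:
            raise NotImplementedError
        return self._critical


def supports_power_check(psu):
    """Return True only if both thresholds can be read and neither is None."""
    try:
        warning = psu.get_psu_power_warning_threshold()
        critical = psu.get_psu_power_critical_threshold()
    except NotImplementedError:
        # The platform does not implement the API: skip power checking.
        return False
    return warning is not None and critical is not None
```

With this helper, a PSU that returns `None` for either threshold is treated the same way as a PSU whose platform does not implement the API.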

#### Alarm raising and clearing threshold

We use asymmetric thresholds between raising and clearing the alarm for the purpose of creating a hysteresis and avoiding alarm flapping.

- an alarm will be raised when a PSU's power is rising across the critical threshold
- an alarm will be cleared when a PSU's power is dropping across the warning threshold
Contributor (resolved): warning threshold -> 'warning suppress' threshold


In case a unified power threshold is used, the alarm status can flap when the power fluctuates around the threshold. For example, in the following picture, the alarm is cleared every time the PSU power drops across the critical threshold and raised every time the PSU power rises across the critical threshold. By having two thresholds, the alarm won't be cleared and raised so frequently.

![](PSU_daemon_design_pictures/PSU-power-thresholds.png)
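
To make the flapping argument concrete, the sketch below counts alarm state changes for the same power readings under a single threshold and under two thresholds. The readings are invented for illustration, and the 38 W / 58 W thresholds are borrowed from the example CLI output in this document; `count_alarm_transitions` is a hypothetical helper.

```python
def count_alarm_transitions(samples, raise_at, clear_below):
    """Count how often the alarm state changes over a sequence of power readings.

    The alarm is raised when power >= raise_at and cleared when
    power < clear_below. Passing the same value for both models a
    single unified threshold.
    """
    raised = False
    transitions = 0
    for power in samples:
        if raised and power < clear_below:
            raised = False
            transitions += 1
        elif not raised and power >= raise_at:
            raised = True
            transitions += 1
    return transitions


# Illustrative readings (watts) fluctuating around a critical threshold of 58 W.
samples = [55, 59, 57, 60, 56, 58, 61, 37]

unified = count_alarm_transitions(samples, raise_at=58, clear_below=58)
hysteresis = count_alarm_transitions(samples, raise_at=58, clear_below=38)
print(unified, hysteresis)  # prints "6 2"
```

With a single threshold the alarm changes state six times; with separate raising and clearing thresholds it changes only twice, once when raised and once when cleared.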

#### PSU power checking logic

For each PSU supporting power checking:

1. Retrieve the current power
2. If flag `PSU power exceeded threshold` is `true`, compare the current power against the warning threshold
   - If `current power` < `warning threshold`
Contributor: warning-suppress threshold
     - Set `PSU power exceeded threshold` to `false`
     - Message in NOTICE level should be logged: `PSU <x>: current power <power> is below the warning threshold <threshold>` where
Contributor: warning-suppress threshold
       - `<x>` is the number of the PSU
       - `<power>` is the current power of the PSU
       - `<threshold>` is the warning threshold of the PSU
   - Otherwise: no action
3. Otherwise, compare the current power against the critical threshold
   - If `current power` >= `critical threshold`
     - Set `PSU power exceeded threshold` to `true`
     - Message in WARNING level should be logged: `PSU <x>: current power <power> is exceeding the critical threshold <threshold>` where
       - `<x>` is the number of the PSU
       - `<power>` is the current power of the PSU
       - `<threshold>` is the critical threshold of the PSU
   - Otherwise: no action
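
The steps above can be sketched as a small per-PSU state holder. This is an illustrative sketch, not the psud code: `PsuPowerChecker` is a hypothetical name, and the `log` callable stands in for syslog (Python's standard `logging` module has no NOTICE level, so the level is passed as a plain string here).

```python
class PsuPowerChecker:
    """Tracks the `PSU power exceeded threshold` flag for a single PSU."""

    def __init__(self, psu_number, log):
        self.psu_number = psu_number
        self.power_exceeded = False
        self._log = log  # callable(level, message); stands in for syslog

    def check(self, power, warning_threshold, critical_threshold):
        """Apply one iteration of the checking logic and return the flag."""
        if self.power_exceeded:
            # Alarm is active: clear it only once the power drops below
            # the warning threshold.
            if power < warning_threshold:
                self.power_exceeded = False
                self._log("NOTICE",
                          "PSU {}: current power {} is below the warning threshold {}".format(
                              self.psu_number, power, warning_threshold))
        elif power >= critical_threshold:
            # Alarm is not active: raise it once the power reaches the
            # critical threshold.
            self.power_exceeded = True
            self._log("WARNING",
                      "PSU {}: current power {} is exceeding the critical threshold {}".format(
                          self.psu_number, power, critical_threshold))
        return self.power_exceeded


messages = []
checker = PsuPowerChecker(1, lambda level, message: messages.append((level, message)))
states = [checker.check(power, 38.0, 58.0) for power in (43.5, 60.0, 45.0, 37.0)]
```

In this run the flag stays `true` at 45.0 W because the power has not yet dropped below the warning threshold, which is the hysteresis the two thresholds are meant to provide. The thresholds are passed on every call because the API allows them to change over time.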

## 3. DB schema for PSU

PSU number is stored in chassis table. Please refer to this [document](https://github.com/sonic-net/SONiC/blob/master/doc/pmon/pmon-enhancement-design.md), section 1.5.2.

PSU information is stored in PSU table:
Contributor: is this table for all PSUs in the system?

Collaborator Author: Yes


; Defines information for a psu
key = PSU_INFO|psu_name ; information for the psu
; field = value
presence = BOOLEAN ; presence state of the psu
model = STRING ; model name of the psu
serial = STRING ; serial number of the psu
revision = STRING ; hardware revision of the PSU
status = BOOLEAN ; status of the psu
change_event = STRING ; change event of the psu
fan = STRING ; fan_name of the psu
led_status = STRING ; led status of the psu
is_replaceable = STRING ; whether the PSU is replaceable
temp = 1*3.3DIGIT ; temperature of the PSU
temp_threshold = 1*3.3DIGIT ; temperature threshold of the PSU
voltage = 1*3.3DIGIT ; the output voltage of the PSU
voltage_min_threshold = 1*3.3DIGIT ; the minimal voltage threshold of the PSU
voltage_max_threshold = 1*3.3DIGIT ; the maximum voltage threshold of the PSU
current = 1*3.3DIGIT ; the current of the PSU
power = 1*3.3DIGIT ; the power of the PSU
input_voltage = 1*3.3DIGIT ; input voltage of the psu
input_current = 1*3.3DIGIT ; input current of the psu
max_power = 1*4.3DIGIT ; power capacity of the psu

; Defines information for a psu
key = PSU_INFO|psu_name ; information for the psu
; field = value
presence = BOOLEAN ; presence state of the psu
model = STRING ; model name of the psu
serial = STRING ; serial number of the psu
revision = STRING ; hardware revision of the PSU
status = BOOLEAN ; status of the psu
change_event = STRING ; change event of the psu
fan = STRING ; fan_name of the psu
led_status = STRING ; led status of the psu
is_replaceable = STRING ; whether the PSU is replaceable
temp = 1*3.3DIGIT ; temperature of the PSU
temp_threshold = 1*3.3DIGIT ; temperature threshold of the PSU
voltage = 1*3.3DIGIT ; the output voltage of the PSU
voltage_min_threshold = 1*3.3DIGIT ; the minimal voltage threshold of the PSU
voltage_max_threshold = 1*3.3DIGIT ; the maximum voltage threshold of the PSU
current = 1*3.3DIGIT ; the current of the PSU
power = 1*4.3DIGIT ; the power of the PSU
input_voltage = 1*3.3DIGIT ; input voltage of the psu
input_current = 1*3.3DIGIT ; input current of the psu
max_power = 1*4.3DIGIT ; power capacity of the psu
power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold
power_warning_threshold = 1*4.3DIGIT ; The power warning threshold
power_critical_threshold = 1*4.3DIGIT ; The power critical threshold
Contributor: are these thresholds the same for all PSUs?

Collaborator Author (stephenxs, Nov 13, 2022): For us, yes. But it's possible for other vendors to have different thresholds among all PSUs.


Now psud only collects and updates the "presence" and "status" fields.
Contributor: We need one field in the DB which indicates that the critical threshold warning is ACTIVE. This is needed for telemetry.

Collaborator Author: power_overload is the field to indicate whether it’s in the warning state
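
As an illustration of the new fields, the sketch below builds the power-related part of a `PSU_INFO` entry as string values. The field names and the `"true"` / `"false"` encoding come from the schema above; the helper name and the two-decimal formatting are assumptions made for this example.

```python
def power_check_fields(power, warning_threshold, critical_threshold, overloaded):
    """Build the power-related PSU_INFO fields as strings.

    Two-decimal formatting is an assumption for this sketch; the schema
    only constrains the digit counts.
    """
    return {
        "power": "{:.2f}".format(power),
        "power_warning_threshold": "{:.2f}".format(warning_threshold),
        "power_critical_threshold": "{:.2f}".format(critical_threshold),
        "power_overload": "true" if overloaded else "false",
    }


fields = power_check_fields(62.62, 38.0, 58.0, overloaded=True)
```

A telemetry consumer would then only need to read `power_overload` to learn whether the alarm is currently active.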


## 4. PSU command

### 4.1 show platform psustatus
There is a sub command "psustatus" under "show platform"

```
@@ -95,14 +170,41 @@ Commands:

The current output for "show platform psustatus" looks like:

```
admin@sonic:~$ show platform psustatus
PSU    Model          Serial        HW Rev      Voltage (V)    Current (A)    Power (W)  Status   LED
-----  -------------  ------------  --------  -------------  -------------  -----------  -------  -----
PSU 1  MTEF-PSF-AC-A  MT1629X14911  A3                12.08           5.19        62.62  WARNING  green
PSU 2  MTEF-PSF-AC-A  MT1629X14913  A3                12.01           4.38        52.50  OK       green
```

The field `Status` represents the status of the PSU, which can be one of the following:
- `OK`, which means no alarm is raised due to PSU power exceeding the threshold
- `Not OK`, which can be caused by:
  - power is not good, which means the PSU is present but has no power (e.g. the power is down or the power cable is unplugged)
- `WARNING`, which can be caused by:
  - power exceeds the PSU's power threshold
Contributor: which threshold? critical or warning?

Collaborator Author: The critical threshold. Will fix it.

Collaborator Author: Updated


### 4.2 psuutil

`psuutil` fetches the information by calling the platform API directly. Both the warning and critical thresholds will be exposed in the output of `psuutil status`.
The "WARNING" state is not exposed because psuutil is a one-time command rather than a daemon, which means it does not store state information. Because it only reads the current values via the platform API, it cannot distinguish between the following states:

1. The power previously exceeded the critical threshold and is now in the range between the warning and critical thresholds, which means the alarm should remain raised
2. The power has not exceeded the critical threshold but is above the warning threshold, which means the alarm should not be raised

Contributor: warning -> 'warning-suppress'

An example of the output:
```
admin@sonic:~$ show platform psustatus
PSU Model Serial HW Rev Voltage (V) Current (A) Power (W) Status LED
----- ------------- ------------ -------- ------------- ------------- ----------- -------- -----
PSU 1 MTEF-PSF-AC-A MT1629X14911 A3 12.09 5.44 64.88 OK green
PSU 2 MTEF-PSF-AC-A MT1629X14913 A3 12.02 4.69 56.25 OK green
PSU Model Serial HW Rev Voltage (V) Current (A) Power (W) Power Warn Thres (W) Power Crit Thres (W) Status LED
----- ------------- ------------ -------- ------------- ------------- ----------- ---------------------- ---------------------- ------- -----
PSU 1 MTEF-PSF-AC-A MT1843K17965 A4 12.02 3.62 43.56 38.00 58.00 OK green
PSU 2 MTEF-PSF-AC-A MT1843K17966 A4 12.04 4.25 51.12 38.00 58.00 OK green

```

In case neither threshold is supported on the platform, `N/A` will be displayed.

## 5. PSU LED management

The purpose of PSU LED management is to notify users about PSU events via the PSU LED or syslog. The PSU daemon psud needs to monitor PSU events (PSU voltage out of range, PSU too hot) and trigger proper actions if necessary.
@@ -168,14 +270,53 @@ class PsuBase(device_base.DeviceBase):

    def get_input_current(self):
        raise NotImplementedError
    ...

    def get_psu_power_warning_threshold(self):
Contributor: can we have a set API for user to override?

Collaborator Author: I do not suggest having an API to override the thresholds. As discussed before, it should be controlled by the platform vendor only because users do not have full knowledge to decide what the threshold should be. For example, in our system, the thresholds depend on temperatures as well.

        """
        Retrieve the warning threshold of the power on this PSU
        The value can be volatile, so the caller should call the API each time it is used.

        Returns:
            A float number, the warning threshold of the PSU in watts.
        """
        raise NotImplementedError

    def get_psu_power_critical_threshold(self):
        """
        Retrieve the critical threshold of the power on this PSU
        The value can be volatile, so the caller should call the API each time it is used.

        Returns:
            A float number, the critical threshold of the PSU in watts.
        """
        raise NotImplementedError
```
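
A vendor implementation of the two methods might look like the sketch below, which shows why the docstrings call the value volatile. Everything here is hypothetical: `PsuBase` is a minimal stand-in for the real base class, `ExampleVendorPsu` is not an existing platform class, and the temperature bands and wattages are invented.

```python
class PsuBase:
    """Minimal stand-in for the platform base class, for illustration only."""

    def get_psu_power_warning_threshold(self):
        raise NotImplementedError

    def get_psu_power_critical_threshold(self):
        raise NotImplementedError


class ExampleVendorPsu(PsuBase):
    """Hypothetical PSU whose power limits shrink as ambient temperature rises."""

    # Invented numbers: (max ambient temperature in C, warning W, critical W)
    _THRESHOLD_TABLE = [
        (35.0, 1000.0, 1100.0),
        (45.0, 900.0, 1000.0),
    ]
    # Used when the temperature is above every band in the table.
    _HOT_THRESHOLDS = (800.0, 900.0)

    def __init__(self, read_ambient_temperature):
        # Callable returning the current temperature; stands in for a sensor read.
        self._read_ambient_temperature = read_ambient_temperature

    def _current_thresholds(self):
        temperature = self._read_ambient_temperature()
        for max_temperature, warning, critical in self._THRESHOLD_TABLE:
            if temperature <= max_temperature:
                return warning, critical
        return self._HOT_THRESHOLDS

    def get_psu_power_warning_threshold(self):
        return self._current_thresholds()[0]

    def get_psu_power_critical_threshold(self):
        return self._current_thresholds()[1]
```

Because the thresholds are recomputed from the sensor on every call, a caller that cached them at PSU detection time could compare the power against stale limits.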

### 6. PSU daemon flow

Supervisord takes charge of this daemon. The daemon loops every 3 seconds, gets the data from psuutil/platform API, and then writes it to the Redis DB.

- psu_num is stored in the "chassis_info" table. It is written only once, when the system boots up or reloads. The key is chassis_name, the field is "psu_num", and the value comes from get_psu_num().
- psu_status and psu_presence are stored in the "psu_info" table and updated every 3 seconds. The key is psu_name, the fields are "presence" and "status", and the values come from get_psu_presence() and get_psu_num().
- The daemon queries PSU events every 3 seconds via the platform API. If an event is detected, it sets the PSU LED color accordingly and triggers the proper syslog message.
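
The polling flow can be sketched with an in-memory dictionary in place of the Redis DB. This is a simplified illustration under several assumptions: `StubPsu`, `update_psu_table`, and `run` are hypothetical names, the getter names on `StubPsu` are chosen for this example, and the `"true"` / `"false"` string encoding of the boolean fields is assumed.

```python
import time


class StubPsu:
    """Hypothetical PSU object exposing only what this sketch needs."""

    def __init__(self, name, presence, power_good):
        self._name = name
        self._presence = presence
        self._power_good = power_good

    def get_name(self):
        return self._name

    def get_presence(self):
        return self._presence

    def get_powergood_status(self):
        return self._power_good


def update_psu_table(db, psus):
    """Write presence and status of every PSU under its PSU_INFO key."""
    for psu in psus:
        db["PSU_INFO|{}".format(psu.get_name())] = {
            "presence": "true" if psu.get_presence() else "false",
            "status": "true" if psu.get_powergood_status() else "false",
        }


def run(db, psus, iterations, interval=3.0, sleep=time.sleep):
    """Poll the PSUs a fixed number of times, pausing between polls."""
    for i in range(iterations):
        update_psu_table(db, psus)
        if i < iterations - 1:
            sleep(interval)
```

The `sleep` parameter is injectable so that the loop can be exercised in a test without waiting; a real daemon would loop until it is asked to stop rather than for a fixed number of iterations.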

### 7. Test cases

#### 7.1 Unit test cases added for PSU power exceeding checking

1. Neither `get_psu_power_warning_threshold` nor `get_psu_power_critical_threshold` is supported by platform API when a new PSU is identified
   In `psu_status`, power exceeding check should be stored as `not supported` and no further function call should be made.
2. Both `get_psu_power_warning_threshold` and `get_psu_power_critical_threshold` are supported by platform API when a new PSU is identified
   In `psu_status`, power exceeding check should be stored as `supported`
3. PSU's power was less than the warning threshold and is in the range (warning threshold, critical threshold): no action
4. PSU's power was in range (warning threshold, critical threshold) and is greater than the critical threshold
   1. if warning was raised, no action expected
   2. if warning was not raised, a warning should be raised
5. PSU's power was less than the warning threshold and is greater than the critical threshold: a warning should be raised
Contributor: it's confusing

Collaborator Author: It's to test the case that the PSU power jumps from a value below the warning suppress threshold to a value above critical threshold.

6. PSU's power was greater than the critical threshold and is in range (warning threshold, critical threshold): no action
7. PSU's power was in range (warning threshold, critical threshold) and is less than the warning threshold:
   1. if warning was raised, the warning should be cleared
   2. if warning was not raised, no action
8. PSU's power was greater than the critical threshold and is less than the warning threshold: the warning should be cleared
9. A PSU becomes absent
10. A PSU becomes `not power good`
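
Cases 3 to 8 lend themselves to a table-driven test. The sketch below is not the actual psud unit test: `next_alarm_state` is a hypothetical pure function that restates the checking logic, and the 38 W / 58 W thresholds and power values are illustrative.

```python
import unittest


def next_alarm_state(raised, power, warning_threshold, critical_threshold):
    """Return the new alarm state given the previous state and the current power."""
    if raised:
        # An active alarm is cleared only below the warning threshold.
        return not power < warning_threshold
    # An inactive alarm is raised at or above the critical threshold.
    return power >= critical_threshold


class TestPowerThresholdTransitions(unittest.TestCase):
    WARNING = 38.0
    CRITICAL = 58.0

    # (case number in the list above, alarm raised before, new power, alarm raised after)
    CASES = [
        (3, False, 45.0, False),  # below warning -> in range: no action
        (4, True, 60.0, True),    # 4.1: already raised, above critical: no action
        (4, False, 60.0, True),   # 4.2: not raised, above critical: raised
        (5, False, 60.0, True),   # jump from below warning to above critical: raised
        (6, True, 45.0, True),    # above critical -> in range: alarm stays raised
        (7, True, 37.0, False),   # 7.1: raised, below warning: cleared
        (7, False, 37.0, False),  # 7.2: not raised, below warning: no action
        (8, True, 37.0, False),   # jump from above critical to below warning: cleared
    ]

    def test_transitions(self):
        for number, raised, power, expected in self.CASES:
            with self.subTest(case=number, raised=raised, power=power):
                actual = next_alarm_state(raised, power, self.WARNING, self.CRITICAL)
                self.assertEqual(actual, expected)
```

Each row encodes only the previous alarm state and the new power reading, since under the logic described earlier those two inputs determine the outcome.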
33 changes: 19 additions & 14 deletions doc/system_health_monitoring/system-health-HLD.md
@@ -207,20 +207,23 @@ To have system status LED can be set by this new service, a system status LED ob
psud needs to collect more PSU data into the DB to satisfy the requirements of this new service. More specifically, psud needs to collect the PSU output voltage, temperature, and their thresholds.

; Defines information for a psu
key = PSU_INFO|psu_name ; information for the psu
; field = value
presence = BOOLEAN ; presence of the psu
model = STRING ; model name of the psu
serial = STRING ; serial number of the psu
status = BOOLEAN ; status of the psu
change_event = STRING ; change event of the psu
fan = STRING ; fan_name of the psu
led_status = STRING ; led status of the psu
temp = INT ; temperature of the PSU
temp_th = INT ; temperature threshold
voltage = INT ; output voltage of the PSU
voltage_max_th = INT ; max threshold of the output voltage
voltage_min_th = INT ; min threshold of the output voltage
key = PSU_INFO|psu_name ; information for the psu
; field = value
presence = BOOLEAN ; presence of the psu
model = STRING ; model name of the psu
serial = STRING ; serial number of the psu
status = BOOLEAN ; status of the psu
change_event = STRING ; change event of the psu
fan = STRING ; fan_name of the psu
led_status = STRING ; led status of the psu
temp = INT ; temperature of the PSU
temp_th = INT ; temperature threshold
voltage = INT ; output voltage of the PSU
voltage_max_th = INT ; max threshold of the output voltage
voltage_min_th = INT ; min threshold of the output voltage
power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold
power_warning_threshold = 1*4.3DIGIT ; The power warning threshold
Contributor: naming not changed to suppress

Collaborator Author: Will fix it

Collaborator Author: Fixed

power_critical_threshold = 1*4.3DIGIT ; The power critical threshold
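
As a sketch of how a health checker could consume these fields, the function below turns a `PSU_INFO` entry into a fault string. The message layout follows the CLI example in this document at this revision; the function name is hypothetical, and reading a `power` field from the same entry is an assumption based on the psud schema.

```python
def psu_power_fault(psu_name, psu_info):
    """Return a fault string when power_overload is "true", otherwise None."""
    if psu_info.get("power_overload") != "true":
        return None
    return "{} power ({:.2f}w) exceeds thresholds (warning: {:.2f}w critical: {:.2f}w)".format(
        psu_name,
        float(psu_info["power"]),
        float(psu_info["power_warning_threshold"]),
        float(psu_info["power_critical_threshold"]),
    )


info = {
    "power": "66.32",
    "power_warning_threshold": "60.00",
    "power_critical_threshold": "70.00",
    "power_overload": "true",
}
fault = psu_power_fault("PSU 1", info)
```

The check keys off `power_overload` rather than comparing the power against a threshold directly, because with hysteresis the power may sit between the two thresholds while the alarm is still active.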

## 5. System health monitor CLI

@@ -275,6 +278,7 @@ When something is wrong
orchagent is not running
Hardware Fault
PSU 1 temp 85C and threshold is 70C
PSU 1 power (66.32w) exceeds thresholds (warning: 60.00w critical: 70.00w)
Contributor: 'thresholds' which threshold?

Collaborator Author (stephenxs, Nov 18, 2022): it's hard to say which threshold is crossed because most times it exceeds the critical threshold but sometimes it can be in (warning_suppress_threshold, critical_threshold) because of hysteresis. I think we can just put the value of warning_suppress_threshold here.

Collaborator Author: Updated

FAN 2 is broken

For the "detail" subcommand, the output lists the status of all services and devices under monitoring, and the ignored service/device list is also displayed.
@@ -290,6 +294,7 @@ Fault condition and CLI output string table
| Any fan is missing/broken |[FAN name] is missing/broken|
| Fan speed is below minimal range|[FAN name] speed is lower than expected|
| PSU power voltage is out of range|[PSU name] voltage is out of range|
| PSU power exceeds threshold|[PSU name] power exceeds threshold|
| PSU temp is too hot|[PSU name] is overheated|
| PSU is in bad status|[PSU name] is broken|
| ASIC temperature is too hot|[ASIC name] is overheated|