Commit f3cf739

yvolynets-mlnx authored and jleveque committed
Monitoring of hardware resources consumed by a device (#439)
* Added initial DUT monitor HLD Signed-off-by: Yuriy Volynets <yuriyv@mellanox.com>
1 parent 37ec231 commit f3cf739

3 files changed

+246
-0
lines changed

doc/DUT_monitor_HLD.md

+246
@@ -0,0 +1,246 @@
1+
Table of Contents
2+
<!-- TOC -->
3+
- [Scope](#scope)
4+
- [Overview](#overview)
5+
- [Quality Objective](#quality-objective)
6+
- [Module design](#module-design)
7+
- [Overall design](#overall-design)
8+
- [Updated directory structure](#updated-directory-structure)
9+
- [Thresholds overview](#thresholds-overview)
10+
- [Thresholds configuration file](#thresholds-configuration-file)
11+
- [Thresholds template](#thresholds-template)
12+
- [Preliminary defaults](#preliminary-defaults)
13+
- [Pytest plugin overview](#pytest-plugin-overview)
14+
- [Pytest option](#pytest-option)
15+
- [Pytest hooks](#pytest-hooks)
16+
- [Classes](#classes)
17+
- [Interaction with DUT](#interaction-with-dut)
18+
- [Tests execution flow](#tests-execution-flow)
19+
- [Extended info to print for error cases](#extended-info-to-print-for-error-cases)
20+
- [Commands to fetch monitoring data](#commands-to-fetch-monitoring-data)
21+
- [Possible future expansion](#possible-future-expansion)
22+
23+
<!-- /TOC -->
24+
25+
### Scope
26+
27+
This document describes the high-level design for verifying the hardware resources consumed by a device. The hardware resources currently verified are CPU, RAM and HDD.
28+
29+
This implementation will be integrated into test cases written with the Pytest framework.
30+
31+
### Overview
32+
33+
During a test run, test cases perform many manipulations on the DUT, including applying different Linux and SONiC configurations and sending traffic.
34+
35+
To make sure that CPU, RAM and HDD utilization on the DUT does not increase during a test run, these parameters can be checked after each test case finishes.
36+
37+
The purpose of this feature is to verify that the resources listed above do not increase during a test run. This is achieved by performing verification after each test case.
38+
39+
### Quality Objective
40+
+ Ensure CPU consumption on DUT does not exceed threshold
41+
+ Ensure RAM consumption on DUT does not exceed threshold
42+
+ Ensure used space in the partition mounted at the root directory "/" does not exceed threshold
43+
44+
### Module design
45+
#### Overall design
46+
The following figure depicts how this feature integrates with the existing Pytest framework.
47+
48+
![](https://github.com/yvolynets-mlnx/SONiC/blob/dut_monitor/images/dut_monitor_hld/Load_flaw.jpg)
49+
50+
The newly introduced feature consists of:
51+
+ Pytest plugin – pytest_dut_monitor.py. Plugin defines:
52+
+ pytest hooks: pytest_addoption, pytest_configure, pytest_unconfigure
53+
+ pytest fixtures: dut_ssh, dut_monitor
54+
+ DUTMonitorPlugin - class to be registered as a plugin. Defines the pytest fixtures listed above
55+
+ DUTMonitorClient - class to control DUT monitoring over SSH
56+
+ The plugin registers new options: "--dut_monitor", "--thresholds_file"
57+
+ Python module - dut_monitor.py. It runs on the DUT, collects CPU, RAM and HDD data, and writes it to log files. Three new files will be created: cpu.log, ram.log, hdd.log.
58+
59+
#### Updated directory structure
60+
+ ./sonic-mgmt/tests/plugins/\_\_init__.yml
61+
+ ./sonic-mgmt/tests/plugins/dut_monitor/thresholds.yml
62+
+ ./sonic-mgmt/tests/plugins/dut_monitor/pytest_dut_monitor.py
63+
+ ./sonic-mgmt/tests/plugins/dut_monitor/dut_monitor.py
64+
+ ./sonic-mgmt/tests/plugins/dut_monitor/errors.py
65+
+ ./sonic-mgmt/tests/plugins/dut_monitor/\_\_init__.py
66+
67+
#### Thresholds overview
68+
To verify that CPU, RAM or HDD utilization on the DUT is not critical, specific thresholds need to be defined.
69+
70+
List of thresholds:
71+
+ Total system CPU consumption
72+
+ Separate process CPU consumption
73+
+ Time duration of CPU monitoring
74+
+ Average CPU consumption during test run
75+
+ Peak RAM consumption
76+
+ RAM consumption delta before and after test run
77+
+ Used disk space
78+
79+
```Total system CPU consumption``` - integer value (percentage). Triggers when total peak CPU consumption is >= the defined value for "Time duration of CPU monitoring" seconds.
80+
81+
```Separate process CPU consumption``` - integer value (percentage). Triggers when peak CPU consumption of any single process is >= the defined value for "Time duration of CPU monitoring" seconds.
82+
83+
```Time duration of CPU monitoring``` - integer value (seconds). Time frame used together with the total or per-process peak CPU consumption verification.
84+
85+
```Average CPU consumption during test run``` - integer value (percentage). Triggers when the average CPU consumption of the whole system between start and stop of monitoring (that is, between test start and test end) is >= the defined value.
86+
87+
```Peak RAM consumption``` - integer value (percentage). Triggers when RAM consumption of the whole system is >= the defined value.
88+
89+
```RAM consumption delta before and after test run``` - integer value (percentage). Difference between consumed RAM before and after a test case. Triggers when the difference is >= the defined value.
90+
91+
```Used disk space``` - integer value (percentage). Triggers when used disk space is >= the defined value.
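
The peak and average CPU thresholds above could be evaluated roughly as follows. This is a minimal sketch, not the plugin's implementation: the function names are hypothetical, and it assumes that samples arrive at a fixed interval (the execution flow mentions 2 seconds) and that a peak threshold triggers only when consecutive samples covering the whole monitoring duration are all at or above the limit.

```python
def sustained_peak_exceeded(samples, limit, duration, interval=2):
    """Return True if `samples` stays >= `limit` for at least `duration` seconds.

    samples  -- CPU readings in percent, one per `interval` seconds
    limit    -- threshold in percent (e.g. cpu_total or cpu_process)
    duration -- window length in seconds (e.g. cpu_measure_duration)

    Assumption: each sample is treated as representing one full interval.
    """
    # Number of consecutive samples needed to cover the window (ceiling division).
    needed = max(1, -(-duration // interval))
    run = 0
    for value in samples:
        # Extend the current run of high samples, or reset it on a low sample.
        run = run + 1 if value >= limit else 0
        if run >= needed:
            return True
    return False


def average_exceeded(samples, limit):
    """Return True if the mean of `samples` is >= `limit` (e.g. cpu_total_average)."""
    return bool(samples) and sum(samples) / len(samples) >= limit
```

With a 10 second duration and 2 second sampling, five consecutive samples at or above the limit are needed before the threshold triggers; a single dip resets the count.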
92+
93+
#### Thresholds configuration file
94+
Default thresholds are defined in the ./sonic-mgmt/tests/plugins/dut_monitor/thresholds.yml file.
95+
96+
The proposal is to define thresholds per platform and per HWSKU. Below is a template of the "thresholds.yml" file, which defines general default thresholds, platform default thresholds, and HWSKU-specific thresholds.
97+
98+
If the HWSKU is not defined for the current DUT, the platform thresholds will be used.
99+
100+
If the platform is not defined for the current DUT, the default thresholds will be used.
101+
102+
##### Thresholds template:
103+
```code
104+
default:
105+
cpu_total: x
106+
cpu_process: x
107+
cpu_measure_duration: x
108+
cpu_total_average: x
109+
ram_peak: x
110+
ram_delta: x
111+
hdd_used: x
112+
113+
platform X:
114+
hwsku: A
115+
cpu_total: x
116+
cpu_process: x
117+
cpu_measure_duration: x
118+
cpu_total_average: x
119+
ram_peak: x
120+
ram_delta: x
121+
hdd_used: x
122+
...
123+
default:
124+
cpu_total: 80
125+
cpu_process: 70
126+
cpu_measure_duration: 10
127+
cpu_total_average: 90
128+
ram_peak: 90
129+
hdd_used: 75
130+
...
131+
```
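
The fallback order described above (HWSKU, then platform, then general default) can be sketched as a small lookup over the parsed file. The nesting used here (platform name, then a `hwsku` mapping and a `default` mapping) is an assumption made for illustration, since the template does not fix an exact YAML layout; `resolve_thresholds` and `EXAMPLE_CONFIG` are hypothetical names. The sketch also merges the levels, so a key missing at a more specific level falls back to the more general one, which is a design choice rather than something the document states.

```python
# Hypothetical parsed content of thresholds.yml (the layout is an assumption).
EXAMPLE_CONFIG = {
    "default": {"cpu_total": 90, "ram_delta": 1},
    "platform_x": {
        "default": {"cpu_total": 80},
        "hwsku": {"A": {"cpu_total": 70}},
    },
}


def resolve_thresholds(config, platform, hwsku):
    """Pick thresholds for a DUT: HWSKU first, then platform default, then general default."""
    # Start from the general defaults.
    resolved = dict(config.get("default", {}))
    platform_cfg = config.get(platform)
    if platform_cfg is None:
        # Unknown platform: general defaults only.
        return resolved
    # Platform defaults override general defaults.
    resolved.update(platform_cfg.get("default", {}))
    # HWSKU-specific values override platform defaults.
    resolved.update(platform_cfg.get("hwsku", {}).get(hwsku, {}))
    return resolved
```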
132+
##### Preliminary defaults
133+
Note: these values need to be tested before they can be defined accurately.
134+
135+
cpu_total: 90
136+
cpu_process: 60
137+
cpu_measure_duration: 10
138+
cpu_total_average: 90
139+
ram_peak: 80
140+
ram_delta: 1
141+
hdd_used: 80
142+
143+
##### How to tune thresholds
144+
1. A user can pass a custom thresholds file for a test run using the "--thresholds_file" pytest option. For example:
145+
```code
146+
py.test TEST_RUN_OPTIONS --thresholds_file THRESHOLDS_FILE_PATH
147+
```
148+
2. A user can update thresholds directly in a test case by using the "dut_monitor" fixture.
149+
For example:
150+
```code
151+
dut_monitor["cpu_total"] = 80
152+
dut_monitor["ram_peak"] = 90
153+
...
154+
```
155+
3. Define thresholds for specific test groups.
156+
For specific test groups such as scale or performance, thresholds can be common. In such a case, a "thresholds.yml" file can be created and placed next to the test module file. The Pytest framework will automatically discover the "thresholds.yml" file and apply the defined thresholds to the current tests.
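
Discovery of a per-group thresholds file could look roughly like this. It is only a sketch: `find_local_thresholds` is a hypothetical helper, and how the plugin obtains the test module path, or how a local file ranks against "--thresholds_file", is not specified in this document.

```python
import os


def find_local_thresholds(test_module_path, filename="thresholds.yml"):
    """Return the path of a thresholds file located next to the test module, or None."""
    module_dir = os.path.dirname(os.path.abspath(test_module_path))
    candidate = os.path.join(module_dir, filename)
    # Only report the file if it actually exists on disk.
    return candidate if os.path.isfile(candidate) else None
```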
157+
158+
159+
### Pytest plugin overview
160+
161+
#### Pytest option
162+
To enable DUT monitoring for each test case, use the following pytest console option: "--dut_monitor"
163+
164+
#### Pytest hooks
165+
The pytest_dut_monitor.py module defines the following hooks:
166+
##### pytest_addoption(parser)
167+
Registers the "--dut_monitor" option. This option is used to trigger device monitoring.
168+
169+
Registers the "--thresholds_file" option. This option takes the path to the thresholds file.
170+
171+
##### pytest_configure(config)
172+
Checks whether the "--dut_monitor" option is used; if so, registers the DUTMonitorPlugin class as a pytest plugin.
173+
##### pytest_unconfigure(config)
174+
Unregisters the DUTMonitorPlugin plugin.
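
The three hooks might be wired together along these lines. This is a sketch under assumptions: the `DUTMonitorPlugin` body is a placeholder (its fixtures are omitted), and the plugin registration name `"dut_monitor"`, the help texts and the option defaults are illustrative choices, not taken from the design.

```python
class DUTMonitorPlugin(object):
    """Placeholder for the plugin class; the fixtures it defines are omitted here."""

    def __init__(self, thresholds_file):
        self.thresholds_file = thresholds_file


def pytest_addoption(parser):
    # Flag that switches monitoring on for the whole run.
    parser.addoption("--dut_monitor", action="store_true", default=False,
                     help="Enable DUT hardware resource monitoring")
    # Optional path to a user-provided thresholds file.
    parser.addoption("--thresholds_file", action="store", default=None,
                     help="Path to a custom thresholds file")


def pytest_configure(config):
    # Register the plugin only when monitoring was requested.
    if config.getoption("--dut_monitor"):
        plugin = DUTMonitorPlugin(config.getoption("--thresholds_file"))
        config.pluginmanager.register(plugin, "dut_monitor")


def pytest_unconfigure(config):
    # Unregister the plugin if it was registered during configure.
    plugin = config.pluginmanager.get_plugin("dut_monitor")
    if plugin is not None:
        config.pluginmanager.unregister(plugin)
```

Registering the class conditionally keeps the autouse fixtures out of runs that do not pass "--dut_monitor".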
175+
176+
### Classes
177+
#### DUTMonitorClient class
178+
Defines an API to:
179+
180+
+ Start monitoring on the DUT
181+
+ Stop monitoring on the DUT. Compare measurements with defined thresholds
182+
+ Execute remote commands via SSH
183+
+ Track SSH connection with DUT
184+
+ Automatically restore SSH connection with DUT while in monitoring mode
185+
186+
#### DUTMonitorPlugin class
187+
Defines the following pytest fixtures:
188+
189+
##### dut_ssh(autouse=True, scope="session")
190+
Establishes an SSH connection with the device and keeps it open for the whole test run.
191+
192+
If the connection to the DUT is broken during the monitoring phase (for example, because a test rebooted the DUT), the client will automatically try to restore the connection for some time (for example, 5 minutes).
193+
194+
If the connection is restored, monitoring is automatically resumed as well, and the dut_monitor fixture will have monitoring results even if a reboot occurred. Monitoring results are therefore not lost when the DUT is rebooted.
195+
196+
If the connection is not restored, an exception is raised indicating that the DUT has become inaccessible.
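
The reconnect behaviour described above amounts to retrying until a deadline. A minimal sketch, assuming a caller-supplied `connect` callable that returns a connection or raises on failure; the design does not name the SSH library, so `restore_connection`, `DUTUnreachableError` and the retry interval are hypothetical.

```python
import time


class DUTUnreachableError(Exception):
    """Raised when the DUT does not come back within the allowed time."""


def restore_connection(connect, timeout=300, retry_interval=5,
                       clock=time.monotonic, sleep=time.sleep):
    """Call `connect` until it succeeds or `timeout` seconds have elapsed.

    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        try:
            return connect()
        except Exception:
            # Any failure counts as "still unreachable" in this sketch.
            if clock() >= deadline:
                raise DUTUnreachableError(
                    "DUT did not become accessible within %d seconds" % timeout)
            sleep(retry_interval)
```

The default `timeout=300` mirrors the 5 minute example in the text; once the call returns, the caller would restart monitoring on the new connection.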
197+
198+
##### dut_monitor(dut_ssh, autouse=True, scope="function")
199+
- Starts DUT monitoring before the test starts
200+
- Stops DUT monitoring after the test finishes
201+
- Gets the measured values and compares them with the defined thresholds
202+
- Generates a pytest error if any resource exceeds its defined threshold.
203+
204+
205+
### Interaction with DUT
206+
207+
![](https://github.com/yvolynets-mlnx/SONiC/blob/dut_monitor/images/dut_monitor_hld/Dut_monitor_ssh.jpg)
208+
209+
### Tests execution flow
210+
211+
+ Start the pytest run with the "--dut_monitor" option added
212+
+ Before each test case - initialize DUT monitoring
213+
+ Start reading CPU, RAM and HDD values every 2 seconds
214+
+ Start test case
215+
+ Wait for the test case to finish
216+
+ Stop reading CPU, RAM and HDD values
217+
+ Display a logging message with the measured parameters
218+
+ After each test case ends, compare the obtained values with the defined thresholds
219+
+ A pytest error will be generated if any resource exceeds its defined threshold. The error message will also show extended output about consumed CPU, RAM and HDD, which is described below. The test case status (pass/fail) is still shown separately. This makes it possible to have separate results for test cases (pass/fail) and errors raised when resource consumption exceeds a threshold.
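
The comparison step at the end of the flow above can be sketched as a pure function over two dictionaries. The names `find_violations` and `format_violations` are hypothetical, and the measured keys are assumed to reuse the threshold names from the template (for example `ram_peak`, `hdd_used`).

```python
def find_violations(measured, thresholds):
    """Return a list of (name, measured_value, limit) for every exceeded threshold.

    A threshold triggers when the measured value is >= the limit. Threshold keys
    with no measured counterpart (such as cpu_measure_duration) are skipped.
    """
    return [(name, measured[name], limit)
            for name, limit in sorted(thresholds.items())
            if name in measured and measured[name] >= limit]


def format_violations(violations):
    """Build a single error message from the list returned by find_violations."""
    return "; ".join("%s: measured %s, threshold %s" % item for item in violations)
```

A fixture teardown could raise an error carrying this message, which keeps the resource error separate from the pass/fail result of the test body.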
220+
221+
222+
#### Extended info to print for error cases
223+
Display output of the following commands:
224+
225+
+ df -h --total /*
226+
+ ps aux --sort rss
227+
+ docker stats --all --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
228+
229+
### Commands to fetch monitoring data
230+
231+
##### Fetch CPU consumption:
232+
ps -A -o pcpu | tail -n+2 | python -c "import sys; print(sum(float(line) for line in sys.stdin))"
233+
##### Fetch RAM consumption:
234+
show system-memory
235+
OR
236+
ps -A -o rss | tail -n+2 | python -c "import sys; print(sum(float(line) for line in sys.stdin))"
237+
238+
##### Fetch HDD usage:
239+
df -hm /
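
On the test-host side, the text returned by these commands has to be turned into numbers. A hedged sketch of two parsers follows: the function names are hypothetical, and the `df` parser assumes the usual column order (Filesystem, 1M-blocks, Used, Available, Use%, Mounted on) with the root entry on a single line.

```python
def total_cpu_percent(ps_output):
    """Sum the %CPU column of `ps -A -o pcpu` output (header line included)."""
    lines = ps_output.strip().splitlines()[1:]  # skip the "%CPU" header
    return sum(float(line) for line in lines if line.strip())


def root_used_percent(df_output):
    """Return the used space in percent from the last line of `df -m /` output."""
    fields = df_output.strip().splitlines()[-1].split()
    # The fifth column holds the usage, e.g. "23%".
    return int(fields[4].rstrip("%"))
```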
240+
241+
242+
### Possible future expansion
243+
244+
Later, this functionality can be integrated with a UI that displays consumed resources and device health during a regression run. Grafana can be used as the UI dashboard.
245+
246+
It can be useful for DUT health debugging and for load/stress testing analysis.

images/dut_monitor_hld/Load_flaw.jpg
