♻️ Improve metrics generation #2726

sanderegg · 2022-01-07T15:54:13Z

What do these changes do?

removed high cardinality metrics in storage and webserver by replacing all metrics that used endpoints with UUIDs inside with the "canonical" path. This translates into endpoint of the form
-http_requests_total{app_name="simcore_service_webserver",endpoint="/v0/projects/{project_id}:open",http_status="200",method="POST"} 1.0
as defined in their respective openapi.

Therefore the number of timeseries for an application can be calculated with: APP_NAMEs x ENDPOINTs x RESPONSE_STATUS x METHOD. ENDPOINT is constrained to the amount of entrypoints in the API. Before it was unconstrained since endpoints were created dynamically with UUIDs.

Also, storage and webserver use the same base module.

Additionally we have the following new metrics:

platform metrics (defaults from the prometheus client)
python GC metrics (defaults from the prometheus client)
process metrics (defaults from the prometheus client)
number of requests currently in process (http_in_flight_requests)
number/sum of time of time taken to complete requests (http_request_latency_seconds)
Please note that the prometheus_client auto-creates a gauge metrics see this issue, but it seems this gets compressed and should not be a hurdle... let's hope that is true.

Related issue/s

How to test

make build
make up-prod
# find the port for webserver/storage
curl WEBSERVER/metrics
curl STORAGE/metrics

Detailed metrics

Platform + GC + Process

# TYPE process_open_fds gauge
process_open_fds 30.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="8",patchlevel="10",version="3.8.10"} 1.0
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 7438.0
python_gc_objects_collected_total{generation="1"} 1783.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 630.0
python_gc_collections_total{generation="1"} 57.0
python_gc_collections_total{generation="2"} 5.0

Example metrics webserver:

# HELP http_requests_total Total requests count
# TYPE http_requests_total counter
http_requests_total{app_name="simcore_service_webserver",endpoint="/v0/",http_status="200",method="GET"} 37.0
http_requests_total{app_name="simcore_service_webserver",endpoint="/socket.io/",http_status="200",method="GET"} 5.0
http_requests_total{app_name="simcore_service_webserver",endpoint="/v0/projects/active",http_status="200",method="GET"} 4.0
# HELP http_requests_created Total requests count
# TYPE http_requests_created gauge
http_requests_created{app_name="simcore_service_webserver",endpoint="/v0/",http_status="200",method="GET"} 1.6415744037080317e+09
http_requests_created{app_name="simcore_service_webserver",endpoint="/socket.io/",http_status="200",method="GET"} 1.6415744055034618e+09
http_requests_created{app_name="simcore_service_webserver",endpoint="/v0/projects/active",http_status="200",method="GET"} 1.6415744055948822e+09
# HELP http_in_flight_requests Number of requests in process
# TYPE http_in_flight_requests gauge
http_in_flight_requests{app_name="simcore_service_webserver",endpoint="/v0/",method="GET"} 0.0
http_in_flight_requests{app_name="simcore_service_webserver",endpoint="/socket.io/",method="GET"} 2.0
http_in_flight_requests{app_name="simcore_service_webserver",endpoint="/v0/projects/active",method="GET"} 0.0
http_in_flight_requests{app_name="simcore_service_webserver",endpoint="/metrics",method="GET"} 1.0
# HELP http_request_latency_seconds Time processing a request
# TYPE http_request_latency_seconds summary
http_request_latency_seconds_count{app_name="simcore_service_webserver",endpoint="/v0/",method="GET"} 37.0
http_request_latency_seconds_sum{app_name="simcore_service_webserver",endpoint="/v0/",method="GET"} 0.027914400037843734
http_request_latency_seconds_count{app_name="simcore_service_webserver",endpoint="/socket.io/",method="GET"} 5.0
http_request_latency_seconds_sum{app_name="simcore_service_webserver",endpoint="/socket.io/",method="GET"} 0.2476124000386335
http_request_latency_seconds_count{app_name="simcore_service_webserver",endpoint="/v0/projects/active",method="GET"} 4.0
http_request_latency_seconds_sum{app_name="simcore_service_webserver",endpoint="/v0/projects/active",method="GET"} 0.025529100035782903
http_request_latency_seconds_count{app_name="simcore_service_webserver",endpoint="/metrics",method="GET"} 0.0
http_request_latency_seconds_sum{app_name="simcore_service_webserver",endpoint="/metrics",method="GET"} 0.0
# HELP http_request_latency_seconds_created Time processing a request
# TYPE http_request_latency_seconds_created gauge
http_request_latency_seconds_created{app_name="simcore_service_webserver",endpoint="/v0/",method="GET"} 1.6415744037050257e+09
http_request_latency_seconds_created{app_name="simcore_service_webserver",endpoint="/socket.io/",method="GET"} 1.6415744054753144e+09
http_request_latency_seconds_created{app_name="simcore_service_webserver",endpoint="/v0/projects/active",method="GET"} 1.6415744055896983e+09
http_request_latency_seconds_created{app_name="simcore_service_webserver",endpoint="/metrics",method="GET"} 1.641574434527328e+09
# HELP simcore_simcore_service_webserver_services_started_total Counts the services started
# TYPE simcore_simcore_service_webserver_services_started_total counter
# HELP simcore_simcore_service_webserver_services_stopped_total Counts the services stopped
# TYPE simcore_simcore_service_webserver_services_stopped_total counter

Checklist

codecov · 2022-01-07T15:55:26Z

Codecov Report

Merging #2726 (35527c6) into master (08196f8) will decrease coverage by 0.1%.
The diff coverage is 13.1%.

@@           Coverage Diff            @@
##           master   #2726     +/-   ##
========================================
- Coverage    78.2%   78.1%   -0.2%     
========================================
  Files         647     647             
  Lines       26656   26657      +1     
  Branches     2584    2581      -3     
========================================
- Hits        20871   20831     -40     
- Misses       5092    5139     +47     
+ Partials      693     687      -6

Flag	Coverage Δ
integrationtests	`66.2% <83.3%> (+<0.1%)`	⬆️
unittests	`73.8% <12.1%> (-0.2%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...library/src/servicelib/aiohttp/monitor_services.py	`0.0% <ø> (ø)`
...rvice-library/src/servicelib/aiohttp/monitoring.py	`0.0% <0.0%> (ø)`
...es/service-library/src/servicelib/logging_utils.py	`0.0% <0.0%> (ø)`
.../director/src/simcore_service_director/producer.py	`61.0% <ø> (-0.3%)`	⬇️
...storage/src/simcore_service_storage/application.py	`44.1% <50.0%> (ø)`
...imcore_service_webserver/diagnostics_monitoring.py	`100.0% <100.0%> (+17.6%)`	⬆️
...ce_webserver/resource_manager/websocket_manager.py	`93.3% <0.0%> (-4.0%)`	⬇️
...ore_service_director_v2/api/routes/computations.py	`78.6% <0.0%> (-2.6%)`	⬇️
...simcore_service_director_v2/modules/dask_client.py	`82.1% <0.0%> (ø)`
... and 1 more

pcrespov

GREAT and necessary improvement and refactor!
Why didn't you apply it as well to storage service?

pcrespov · 2022-01-09T14:44:21Z

packages/service-library/src/servicelib/aiohttp/monitoring.py

+
+def middleware_factory(
+    app_name: str,
+    enter_middleware_cb: Optional[EnterMiddlewareCB],


THANKS for unifying back aiohttp/monitoring
NICE addition enter/exit callbacks!
THOUGHT: callbacks is a very convenient mechanism but it always puzzles me that it might introduce nasty bugs since you are delegating to a third-party a "portion" of the implementation of your function. You loose some control on things like which exceptions it raises or whether it blocks (e.g. forever) the execution of your code, etc . Therefore, it is a good exercise to analyze how these situations might affect your workflow. For instance:

does the middleware handles well exceptions raise by the callbacks?

what if the implementation of the callbacks block?

One needs to judge provided the context. E.g. there is no need to create a complex "task scheduler to implemente a callback timeout mechanism" if we know here functions are typically tiny: a simple WARNING in the doc will suffice in that case.

it is applied to storage. Or do you mean the director-v0?

for the second part, I actually ended up doing it this way because of the diagnostics part in the webserver. I would be in favor of separating the diagnostics part (the module that sense if the webserver is behaving correctly) from the monitoring part.
Anyway your points make sense, I will check these cases.

ah, of course ... it is already there. I just did not see any changed file of storage. NO, director-v0 stays as it is ... do not open pandoras box please :-D

yes, diagnostics should be separated from monitoring. The former is a "conclusion from the data produced in the former"

@pcrespov : I will ask for a new review from you. I added a new function to catch and log exceptions in logging_utils.py

packages/service-library/src/servicelib/aiohttp/monitoring.py

…an exception

improved informations

GitHK

Interesting changes. Not too familiar with them. But I do like the usage of the canonical urls. That shouts very reasonable.

From what I've read here, I'd expect the memory consumption issue to be no longer there. Especially for the webserver and storage services.

If my understanding is correct the Prometheus metrics were the ones using all that memory in both services causing them to degrade?

packages/service-library/src/servicelib/logging_utils.py

pcrespov · 2022-01-10T11:24:33Z

packages/service-library/src/servicelib/logging_utils.py

+        logger.debug("call was cancelled")
+        raise
+    except Exception as exc:  # pylint: disable=broad-except
+        logger.error("Unhandled exception: %s", f"{exc}", exc_info=True)


Checkout the message in the implementation of https://loguru.readthedocs.io/en/stable/_modules/loguru/_logger.html#Logger.catch

Should we simply give it a try? apparently it is can be used together with the normal logger

sanderegg added t:maintenance Some planned maintenance work changelog:♻️refactor labels Jan 7, 2022

sanderegg added this to the Rudolph milestone Jan 7, 2022

sanderegg requested review from GitHK, mrnicegyu11 and pcrespov January 7, 2022 15:54

sanderegg self-assigned this Jan 7, 2022

sanderegg force-pushed the maintenance/reduce_metrics branch from db04125 to 91ccbe0 Compare January 7, 2022 15:56

sanderegg requested a review from mguidon January 8, 2022 16:45

pcrespov approved these changes Jan 9, 2022

View reviewed changes

sanderegg added 10 commits January 10, 2022 08:50

metrics not updated correctly when resp is not prepared

0d08e3d

only pass the canonical path

f6eb751

uniformize monitoring middleware

6cdceb5

fix the path

8392d35

added typical metrics

2b4e853

remove high cardinality labels

f4df05f

use same base for monitoring

9333e5f

make it more readable

2ed5537

@pcrespov review: add some doc on canonical endpoint

8ff5c93

@pcrespov review: ensure an exception in the callback does not raise …

9751e80

…an exception

sanderegg requested a review from pcrespov January 10, 2022 09:06

sanderegg force-pushed the maintenance/reduce_metrics branch from 66be985 to 9751e80 Compare January 10, 2022 09:06

sanderegg added 2 commits January 10, 2022 10:21

@pcrespov review: added docs

ae0e0db

improved informations

upgraded storage target info

a2939a0

GitHK approved these changes Jan 10, 2022

View reviewed changes

sanderegg added 3 commits January 10, 2022 10:38

reduce metrics cardinality in director-v0

3a909f2

linter

6b89320

@pcrespov review: renamed method

35527c6

pcrespov approved these changes Jan 10, 2022

View reviewed changes

sanderegg merged commit 5915b3f into ITISFoundation:master Jan 10, 2022

sanderegg deleted the maintenance/reduce_metrics branch January 10, 2022 12:32

pcrespov mentioned this pull request Jan 26, 2022

maintenance/scaling of the platform ITISFoundation/osparc-issues#428

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

♻️ Improve metrics generation #2726

♻️ Improve metrics generation #2726

sanderegg commented Jan 7, 2022 •

edited

Loading

codecov bot commented Jan 7, 2022 •

edited

Loading

pcrespov left a comment

pcrespov Jan 9, 2022

sanderegg Jan 9, 2022

pcrespov Jan 9, 2022

sanderegg Jan 10, 2022

GitHK left a comment

pcrespov Jan 10, 2022

♻️ Improve metrics generation #2726

♻️ Improve metrics generation #2726

Conversation

sanderegg commented Jan 7, 2022 • edited Loading

What do these changes do?

Related issue/s

How to test

Detailed metrics

Checklist

codecov bot commented Jan 7, 2022 • edited Loading

Codecov Report

pcrespov left a comment

Choose a reason for hiding this comment

pcrespov Jan 9, 2022

Choose a reason for hiding this comment

sanderegg Jan 9, 2022

Choose a reason for hiding this comment

pcrespov Jan 9, 2022

Choose a reason for hiding this comment

sanderegg Jan 10, 2022

Choose a reason for hiding this comment

GitHK left a comment

Choose a reason for hiding this comment

pcrespov Jan 10, 2022

Choose a reason for hiding this comment

sanderegg commented Jan 7, 2022 •

edited

Loading

codecov bot commented Jan 7, 2022 •

edited

Loading