Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

♻️ Improve metrics generation #2726

Merged
merged 15 commits into from
Jan 10, 2022

Conversation

sanderegg
Copy link
Member

@sanderegg sanderegg commented Jan 7, 2022

What do these changes do?

  • removed high cardinality metrics in storage and webserver by replacing all metrics that used endpoints with UUIDs inside with the "canonical" path. This translates into endpoint of the form
    -http_requests_total{app_name="simcore_service_webserver",endpoint="/v0/projects/{project_id}:open",http_status="200",method="POST"} 1.0
    as defined in their respective openapi.

Therefore the number of timeseries for an application can be calculated with: APP_NAMEs x ENDPOINTs x RESPONSE_STATUS x METHOD. ENDPOINT is constrained to the amount of entrypoints in the API. Before it was unconstrained since endpoints were created dynamically with UUIDs.

Also, storage and webserver use the same base module.

Additionally we have the following new metrics:

  • platform metrics (defaults from the prometheus client)
  • python GC metrics (defaults from the prometheus client)
  • process metrics (defaults from the prometheus client)
  • number of requests currently in process (http_in_flight_requests)
  • number/sum of time of time taken to complete requests (http_request_latency_seconds)
    Please note that the prometheus_client auto-creates a gauge metrics see this issue, but it seems this gets compressed and should not be a hurdle... let's hope that is true.

Related issue/s

How to test

make build
make up-prod
# find the port for webserver/storage
curl WEBSERVER/metrics
curl STORAGE/metrics

Detailed metrics

Platform + GC + Process

# TYPE process_open_fds gauge
process_open_fds 30.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="8",patchlevel="10",version="3.8.10"} 1.0
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 7438.0
python_gc_objects_collected_total{generation="1"} 1783.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 630.0
python_gc_collections_total{generation="1"} 57.0
python_gc_collections_total{generation="2"} 5.0

Example metrics webserver:

# HELP http_requests_total Total requests count
# TYPE http_requests_total counter
http_requests_total{app_name="simcore_service_webserver",endpoint="/v0/",http_status="200",method="GET"} 37.0
http_requests_total{app_name="simcore_service_webserver",endpoint="/socket.io/",http_status="200",method="GET"} 5.0
http_requests_total{app_name="simcore_service_webserver",endpoint="/v0/projects/active",http_status="200",method="GET"} 4.0
# HELP http_requests_created Total requests count
# TYPE http_requests_created gauge
http_requests_created{app_name="simcore_service_webserver",endpoint="/v0/",http_status="200",method="GET"} 1.6415744037080317e+09
http_requests_created{app_name="simcore_service_webserver",endpoint="/socket.io/",http_status="200",method="GET"} 1.6415744055034618e+09
http_requests_created{app_name="simcore_service_webserver",endpoint="/v0/projects/active",http_status="200",method="GET"} 1.6415744055948822e+09
# HELP http_in_flight_requests Number of requests in process
# TYPE http_in_flight_requests gauge
http_in_flight_requests{app_name="simcore_service_webserver",endpoint="/v0/",method="GET"} 0.0
http_in_flight_requests{app_name="simcore_service_webserver",endpoint="/socket.io/",method="GET"} 2.0
http_in_flight_requests{app_name="simcore_service_webserver",endpoint="/v0/projects/active",method="GET"} 0.0
http_in_flight_requests{app_name="simcore_service_webserver",endpoint="/metrics",method="GET"} 1.0
# HELP http_request_latency_seconds Time processing a request
# TYPE http_request_latency_seconds summary
http_request_latency_seconds_count{app_name="simcore_service_webserver",endpoint="/v0/",method="GET"} 37.0
http_request_latency_seconds_sum{app_name="simcore_service_webserver",endpoint="/v0/",method="GET"} 0.027914400037843734
http_request_latency_seconds_count{app_name="simcore_service_webserver",endpoint="/socket.io/",method="GET"} 5.0
http_request_latency_seconds_sum{app_name="simcore_service_webserver",endpoint="/socket.io/",method="GET"} 0.2476124000386335
http_request_latency_seconds_count{app_name="simcore_service_webserver",endpoint="/v0/projects/active",method="GET"} 4.0
http_request_latency_seconds_sum{app_name="simcore_service_webserver",endpoint="/v0/projects/active",method="GET"} 0.025529100035782903
http_request_latency_seconds_count{app_name="simcore_service_webserver",endpoint="/metrics",method="GET"} 0.0
http_request_latency_seconds_sum{app_name="simcore_service_webserver",endpoint="/metrics",method="GET"} 0.0
# HELP http_request_latency_seconds_created Time processing a request
# TYPE http_request_latency_seconds_created gauge
http_request_latency_seconds_created{app_name="simcore_service_webserver",endpoint="/v0/",method="GET"} 1.6415744037050257e+09
http_request_latency_seconds_created{app_name="simcore_service_webserver",endpoint="/socket.io/",method="GET"} 1.6415744054753144e+09
http_request_latency_seconds_created{app_name="simcore_service_webserver",endpoint="/v0/projects/active",method="GET"} 1.6415744055896983e+09
http_request_latency_seconds_created{app_name="simcore_service_webserver",endpoint="/metrics",method="GET"} 1.641574434527328e+09
# HELP simcore_simcore_service_webserver_services_started_total Counts the services started
# TYPE simcore_simcore_service_webserver_services_started_total counter
# HELP simcore_simcore_service_webserver_services_stopped_total Counts the services stopped
# TYPE simcore_simcore_service_webserver_services_stopped_total counter

Checklist

@sanderegg sanderegg added t:maintenance Some planned maintenance work changelog:♻️refactor labels Jan 7, 2022
@sanderegg sanderegg added this to the Rudolph milestone Jan 7, 2022
@sanderegg sanderegg self-assigned this Jan 7, 2022
@codecov
Copy link

codecov bot commented Jan 7, 2022

Codecov Report

Merging #2726 (35527c6) into master (08196f8) will decrease coverage by 0.1%.
The diff coverage is 13.1%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #2726     +/-   ##
========================================
- Coverage    78.2%   78.1%   -0.2%     
========================================
  Files         647     647             
  Lines       26656   26657      +1     
  Branches     2584    2581      -3     
========================================
- Hits        20871   20831     -40     
- Misses       5092    5139     +47     
+ Partials      693     687      -6     
Flag Coverage Δ
integrationtests 66.2% <83.3%> (+<0.1%) ⬆️
unittests 73.8% <12.1%> (-0.2%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...library/src/servicelib/aiohttp/monitor_services.py 0.0% <ø> (ø)
...rvice-library/src/servicelib/aiohttp/monitoring.py 0.0% <0.0%> (ø)
...es/service-library/src/servicelib/logging_utils.py 0.0% <0.0%> (ø)
.../director/src/simcore_service_director/producer.py 61.0% <ø> (-0.3%) ⬇️
...storage/src/simcore_service_storage/application.py 44.1% <50.0%> (ø)
...imcore_service_webserver/diagnostics_monitoring.py 100.0% <100.0%> (+17.6%) ⬆️
...ce_webserver/resource_manager/websocket_manager.py 93.3% <0.0%> (-4.0%) ⬇️
...ore_service_director_v2/api/routes/computations.py 78.6% <0.0%> (-2.6%) ⬇️
...simcore_service_director_v2/modules/dask_client.py 82.1% <0.0%> (ø)
... and 1 more

@sanderegg sanderegg force-pushed the maintenance/reduce_metrics branch from db04125 to 91ccbe0 Compare January 7, 2022 15:56
@sanderegg sanderegg requested a review from mguidon January 8, 2022 16:45
Copy link
Member

@pcrespov pcrespov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GREAT and necessary improvement and refactor!
Why didn't you apply it as well to storage service?


def middleware_factory(
app_name: str,
enter_middleware_cb: Optional[EnterMiddlewareCB],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THANKS for unifying back aiohttp/monitoring
NICE addition enter/exit callbacks!
THOUGHT: callbacks is a very convenient mechanism but it always puzzles me that it might introduce nasty bugs since you are delegating to a third-party a "portion" of the implementation of your function. You loose some control on things like which exceptions it raises or whether it blocks (e.g. forever) the execution of your code, etc . Therefore, it is a good exercise to analyze how these situations might affect your workflow. For instance:

  • does the middleware handles well exceptions raise by the callbacks?
  • what if the implementation of the callbacks block?

One needs to judge provided the context. E.g. there is no need to create a complex "task scheduler to implemente a callback timeout mechanism" if we know here functions are typically tiny: a simple WARNING in the doc will suffice in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is applied to storage. Or do you mean the director-v0?

for the second part, I actually ended up doing it this way because of the diagnostics part in the webserver. I would be in favor of separating the diagnostics part (the module that sense if the webserver is behaving correctly) from the monitoring part.
Anyway your points make sense, I will check these cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, of course ... it is already there. I just did not see any changed file of storage. NO, director-v0 stays as it is ... do not open pandoras box please :-D

yes, diagnostics should be separated from monitoring. The former is a "conclusion from the data produced in the former"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pcrespov : I will ask for a new review from you. I added a new function to catch and log exceptions in logging_utils.py

@sanderegg sanderegg requested a review from pcrespov January 10, 2022 09:06
@sanderegg sanderegg force-pushed the maintenance/reduce_metrics branch from 66be985 to 9751e80 Compare January 10, 2022 09:06
Copy link
Contributor

@GitHK GitHK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting changes. Not too familiar with them. But I do like the usage of the canonical urls. That shouts very reasonable.

From what I've read here, I'd expect the memory consumption issue to be no longer there. Especially for the webserver and storage services.

If my understanding is correct the Prometheus metrics were the ones using all that memory in both services causing them to degrade?

packages/service-library/src/servicelib/logging_utils.py Outdated Show resolved Hide resolved
logger.debug("call was cancelled")
raise
except Exception as exc: # pylint: disable=broad-except
logger.error("Unhandled exception: %s", f"{exc}", exc_info=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checkout the message in the implementation of https://loguru.readthedocs.io/en/stable/_modules/loguru/_logger.html#Logger.catch

Should we simply give it a try? apparently it is can be used together with the normal logger

@sanderegg sanderegg merged commit 5915b3f into ITISFoundation:master Jan 10, 2022
@sanderegg sanderegg deleted the maintenance/reduce_metrics branch January 10, 2022 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t:maintenance Some planned maintenance work
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants