-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
♻️ Improve metrics generation #2726
♻️ Improve metrics generation #2726
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2726 +/- ##
========================================
- Coverage 78.2% 78.1% -0.2%
========================================
Files 647 647
Lines 26656 26657 +1
Branches 2584 2581 -3
========================================
- Hits 20871 20831 -40
- Misses 5092 5139 +47
+ Partials 693 687 -6
Flags with carried forward coverage won't be shown. Click here to find out more.
|
db04125
to
91ccbe0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GREAT and necessary improvement and refactor!
Why didn't you apply it as well to storage
service?
|
||
def middleware_factory( | ||
app_name: str, | ||
enter_middleware_cb: Optional[EnterMiddlewareCB], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THANKS for unifying back aiohttp/monitoring
NICE addition enter/exit callbacks!
THOUGHT: callbacks is a very convenient mechanism but it always puzzles me that it might introduce nasty bugs since you are delegating to a third-party a "portion" of the implementation of your function. You loose some control on things like which exceptions it raises or whether it blocks (e.g. forever) the execution of your code, etc . Therefore, it is a good exercise to analyze how these situations might affect your workflow. For instance:
- does the middleware handles well exceptions raise by the callbacks?
- what if the implementation of the callbacks block?
One needs to judge provided the context. E.g. there is no need to create a complex "task scheduler to implemente a callback timeout mechanism" if we know here functions are typically tiny: a simple WARNING in the doc will suffice in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is applied to storage. Or do you mean the director-v0?
for the second part, I actually ended up doing it this way because of the diagnostics part in the webserver. I would be in favor of separating the diagnostics part (the module that sense if the webserver is behaving correctly) from the monitoring part.
Anyway your points make sense, I will check these cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, of course ... it is already there. I just did not see any changed file of storage. NO, director-v0 stays as it is ... do not open pandoras box please :-D
yes, diagnostics should be separated from monitoring. The former is a "conclusion from the data produced in the former"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pcrespov : I will ask for a new review from you. I added a new function to catch and log exceptions in logging_utils.py
66be985
to
9751e80
Compare
improved informations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting changes. Not too familiar with them. But I do like the usage of the canonical urls. That shouts very reasonable.
From what I've read here, I'd expect the memory consumption issue to be no longer there. Especially for the webserver and storage services.
If my understanding is correct the Prometheus metrics were the ones using all that memory in both services causing them to degrade?
logger.debug("call was cancelled") | ||
raise | ||
except Exception as exc: # pylint: disable=broad-except | ||
logger.error("Unhandled exception: %s", f"{exc}", exc_info=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checkout the message in the implementation of https://loguru.readthedocs.io/en/stable/_modules/loguru/_logger.html#Logger.catch
Should we simply give it a try? apparently it is can be used together with the normal logger
What do these changes do?
-
http_requests_total{app_name="simcore_service_webserver",endpoint="/v0/projects/{project_id}:open",http_status="200",method="POST"} 1.0
as defined in their respective openapi.
Therefore the number of timeseries for an application can be calculated with: APP_NAMEs x ENDPOINTs x RESPONSE_STATUS x METHOD. ENDPOINT is constrained to the amount of entrypoints in the API. Before it was unconstrained since endpoints were created dynamically with UUIDs.
Also, storage and webserver use the same base module.
Additionally we have the following new metrics:
Please note that the prometheus_client auto-creates a gauge metrics see this issue, but it seems this gets compressed and should not be a hurdle... let's hope that is true.
Related issue/s
How to test
Detailed metrics
Platform + GC + Process
Example metrics webserver:
Checklist