Instrument the application #2

severo · 2021-08-03T14:32:10Z

Measure the response time, status code, RAM usage, etc., to be able to make decisions (see #1). Also collect statistics about the most common requests (endpoint, dataset, parameters).
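A minimal sketch of what measuring response time and status code could look like, using only the standard library and the plain ASGI interface. The names `MetricsMiddleware`, `demo_app` and `call_once` are hypothetical and not part of this project. A real setup would more likely export these values to a metrics system than keep them in memory.

```python
import asyncio
import time
from collections import Counter, defaultdict


class MetricsMiddleware:
    """Pure-ASGI middleware that records request count and latency per (path, status)."""

    def __init__(self, app):
        self.app = app
        self.requests = Counter()          # (path, status) -> number of requests
        self.seconds = defaultdict(float)  # (path, status) -> cumulative response time

    async def __call__(self, scope, receive, send):
        # Only HTTP requests are measured; other scopes pass through untouched.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        status = 500  # assumption: count as a 500 if the app fails before responding
        start = time.perf_counter()

        async def send_wrapper(message):
            nonlocal status
            if message["type"] == "http.response.start":
                status = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            key = (scope["path"], status)
            self.requests[key] += 1
            self.seconds[key] += time.perf_counter() - start


async def demo_app(scope, receive, send):
    # Hypothetical endpoint, only here to exercise the middleware.
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def call_once(app, path):
    """Drive one fake HTTP request through an ASGI app and return the sent messages."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": path}, receive, send)
    return sent
```

Since the class only relies on the ASGI call signature, it should be usable in a Starlette application through `app.add_middleware(MetricsMiddleware)`. Keying on the raw path is a simplification: paths containing dataset names would need to be normalized to avoid an unbounded number of keys.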

julien-c · 2021-08-06T18:10:57Z

Also probably disk usage, since datasets is probably just filling its cache on disk forever.

severo · 2021-08-09T08:43:11Z

Yes. The disk usage for datasets cache should be low, since we use streaming (or we error): we never download the data itself. But it's better to monitor than to assume.
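As a rough illustration of monitoring rather than assuming, here is a sketch that reports the size of a cache directory and the free space on its disk, using only the standard library. The function names are hypothetical. `DEFAULT_CACHE` is the documented default location of the datasets cache; whether this deployment uses that path is an assumption.

```python
import os
import shutil
from pathlib import Path

# Default datasets cache location; assumed here, the deployment may override it.
DEFAULT_CACHE = Path.home() / ".cache" / "huggingface" / "datasets"


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file disappeared between listing and stat; ignore it.
                pass
    return total


def cache_report(cache_dir=DEFAULT_CACHE):
    """Return the cache size plus total and free bytes of the disk that holds it."""
    usage = shutil.disk_usage(cache_dir)
    return {
        "cache_bytes": directory_size_bytes(cache_dir),
        "disk_total_bytes": usage.total,
        "disk_free_bytes": usage.free,
    }
```

Calling `cache_report()` periodically and exposing the three values as gauges would be enough to confirm that streaming keeps the cache small.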

severo · 2021-09-02T13:17:18Z

Note that issues like huggingface/datasets#2859 can affect CPU and memory usage, and can even block the server completely.

severo · 2021-09-02T13:43:18Z

Two solutions proposed by the infra team:

https://huggingface.slack.com/archives/CTKK32GE8/p1630588820038000

Morgan:

We have been using Prometheus successfully to report this kind of metric on the Inference API.
Some pointers, if it can help:

Philip:

You could also use CloudWatch directly, with either:

the EC2 CloudWatch agent
or watchtower, a Python tool that lets you use the normal logger and log directly to CloudWatch

AFAIK most of our production services use Prometheus + Grafana so far, with ELK for additional information. So you could log directly into the existing system for this.

severo · 2021-09-02T14:03:24Z

See how Prometheus is used in other projects:

As we rely on Starlette, we might want to try https://github.com/perdy/starlette-prometheus

severo · 2021-09-23T12:39:21Z

Also: alert when 500 errors occur. See #21 (comment)

severo · 2022-05-12T09:30:14Z

Best practices on what to monitor: https://prometheus.io/docs/practices/instrumentation

github-actions · 2022-09-16T15:20:16Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

severo · 2022-09-16T20:16:46Z

Already in place. Let's improve when needed.

severo added the enhancement label Aug 3, 2021

severo mentioned this issue Sep 2, 2021

Prevent DoS when accessing some datasets #17

Closed

severo added the low-priority label Jan 26, 2022

severo added the wait-for-datasets-server label Feb 4, 2022

severo closed this as completed Feb 4, 2022

severo reopened this May 3, 2022

severo removed the move-to-datasets-server label May 3, 2022

severo mentioned this issue May 12, 2022

Setup prometheus + grafana #250

Closed

9 tasks

severo mentioned this issue May 20, 2022

debug the memory+cpu usage of python applications #288

Closed

severo closed this as completed Sep 16, 2022
