Observability additions #2963

toiletpapercode · 2023-04-28T21:24:35Z

Add new data sources (cadvisor for docker containers, dcgm-exporter for nvidia gpus)
Add dashboards for docker / gpus
Add mono-board WIP with variable for Datasource and Job

toiletpapercode · 2023-04-28T21:26:59Z

I'm not sure how to integrate the new containers into the deployment process; please let me know how that works.

AbdBarho · 2023-04-29T05:21:39Z

Our deployment setup starts in https://github.com/LAION-AI/Open-Assistant/blob/main/.github/workflows/deploy-to-node.yaml

which references the files in https://github.com/LAION-AI/Open-Assistant/tree/main/ansible

deploy-to-node is the main entrypoint for web / backend, and in the inference subfolder you will find the deployment code for the inference server and workers.

andrewm4894 · 2023-04-29T10:33:25Z

Do you know what Grafana version those dashboards use/need? I tried to copy them into Grafana v9.5.1 (bc353e4b2d) but seemed to be getting an error.

https://community.grafana.com/t/recurrent-issue-the-dashboards-has-been-changed-by-someone-else/41349/11

Not sure if it might be some other issue or not. I was hoping to just copy-paste mono-board.json in to see how it looks.

docker-compose.yaml

toiletpapercode · 2023-04-29T10:43:18Z

It'll be easier to just import it as a new dashboard, rather than pasting it into an existing one. I was using a freshly pulled grafana/grafana docker image.
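For anyone who would rather script the import than click through the UI, the "import it as a new dashboard" suggestion can be sketched against Grafana's HTTP API (POST /api/dashboards/db). This is a minimal sketch and not part of this PR: the base URL, the API token, and the helper names build_import_payload and post_dashboard are assumptions for illustration. Clearing id and uid asks Grafana to create a fresh dashboard rather than update an existing one, which may sidestep the "changed by someone else" conflict.

```python
import json
import urllib.request


def build_import_payload(dashboard: dict, overwrite: bool = False) -> dict:
    """Wrap exported dashboard JSON so Grafana treats it as a new dashboard."""
    fresh = dict(dashboard)  # shallow copy; the caller's dict is left untouched
    fresh["id"] = None       # null id: create a dashboard instead of updating one
    fresh["uid"] = None      # null uid: let Grafana generate a new one
    return {"dashboard": fresh, "overwrite": overwrite}


def post_dashboard(base_url: str, token: str, payload: dict) -> dict:
    """POST the payload to Grafana. base_url and token are placeholders."""
    request = urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A possible use: read mono-board.json with json.load, pass the result through build_import_payload, then call post_dashboard with the address of your own Grafana instance and a token you created there.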

andrewm4894 · 2023-04-29T11:31:06Z

I was trying to run in a codespace on this branch and got this:

docker compose --profile ci --profile observability up --build --attach-dependencies

[+] Running 11/0
 ✔ Container cadvisor                              Created                                                                                                0.0s 
 ✔ Container open-assistant-webdb-1                Running                                                                                                0.0s 
 ✔ Container netdata                               Running                                                                                                0.0s 
 ✔ Container prometheus                            Running                                                                                                0.0s 
 ✔ Container open-assistant-web-1                  Created                                                                                                0.0s 
 ✔ Container open-assistant-maildev-1              Running                                                                                                0.0s 
 ✔ Container open-assistant-redis-1                Running                                                                                                0.0s 
 ✔ Container open-assistant-db-1                   Running                                                                                                0.0s 
 ✔ Container open-assistant-backend-worker-1       Created                                                                                                0.0s 
 ✔ Container open-assistant-backend-1              Created                                                                                                0.0s 
 ✔ Container open-assistant-backend-worker-beat-1  Created                                                                                                0.0s 
Attaching to cadvisor, grafana, netdata, open-assistant-backend-1, open-assistant-backend-worker-1, open-assistant-backend-worker-beat-1, open-assistant-db-1, open-assistant-maildev-1, open-assistant-redis-1, open-assistant-web-1, open-assistant-webdb-1, prometheus
Error response from daemon: driver failed programming external connectivity on endpoint grafana (1020f4e9bb3ab4338326c4012ed943aa4189d5bf72be852f5225b8e5359915bb): Error starting userland proxy: listen tcp4 0.0.0.0:2000: bind: address already in use

Not sure what might already be using port 2000.

toiletpapercode · 2023-04-29T11:39:08Z

You can try following some of these to find what is using it: https://www.cyberciti.biz/faq/unix-linux-check-if-port-is-in-use-command/

Alternatively, change the port mapping at Open-Assistant/docker-compose.yaml:298 to - <free port in codespace>:2000
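Both suggestions above (checking whether port 2000 is taken, and picking a free host port for the mapping) can also be done from Python with only the standard library. This is a hedged sketch: the helper names port_is_free, pick_free_port and compose_port_mapping are made up for illustration, and the check only reports whether a bind succeeds at that moment, not which process holds the port.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if host:port can be bound right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
        return True


def pick_free_port(host: str = "0.0.0.0") -> int:
    """Ask the OS for a currently unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def compose_port_mapping(host_port: int, container_port: int = 2000) -> str:
    """Format a docker compose "host:container" port mapping string."""
    return f"{host_port}:{container_port}"


if __name__ == "__main__":
    if port_is_free(2000):
        print("port 2000 looks free")
    else:
        print("port 2000 is busy, try:", compose_port_mapping(pick_free_port()))
```

The port returned by pick_free_port is released before the function returns, so another process could still grab it before docker does; treat the result as a suggestion.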

andrewm4894 · 2023-04-29T11:50:07Z

I am hoping to get this merged so I can then just use a GitHub codespace to test this properly (just to make sure anyone can run it via a clean GitHub codespace):

#2970

andreaskoepf · 2023-04-30T09:25:08Z

This PR contains quite a lot of elements for which I don't see a clear use case. For example, why is there nvidia-gpus.json? We don't have any GPUs in our backend server.

andreaskoepf · 2023-04-30T09:29:21Z

In general we need a proper plan for monitoring. At least Grafana should be deployed to a separate machine to monitor the health of the system (and send out alerts).

toiletpapercode · 2023-04-30T10:25:13Z

@andreaskoepf We do need to discuss things properly.

I was in the process of writing proper dockerfiles and updating the github actions / ansible for all these new containers yesterday, but I had to pause that and was afk for the rest of the day.

I have added two new containers, cadvisor and dcgm-exporter.

cadvisor will allow monitoring of docker containers (cpu, memory, network, etc.), and should be deployed on any server that is running docker containers.

dcgm-exporter gives stats for nvidia gpus, and should be deployed onto the inference worker servers so that the gpus can be monitored for utilisation and vram usage (or anything else that would be useful). I mentioned this in the discord:

Visibility is always good, if prometheus is going to be scraping the gpu instances for the worker metrics anyway.. might as well pick up the GPU usage
That way if something starts hanging the gpus can be checked, and if a new model is released it can be used to monitor the vram/utilisation for testing more optimised models
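To make the kind of check described above concrete, here is a minimal sketch of reading GPU and container metrics back out of Prometheus through its HTTP query API (/api/v1/query). It is not part of this PR. The Prometheus address and the helper names are assumptions, and the metric names (DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED from dcgm-exporter, container_memory_usage_bytes from cadvisor) should be verified against what your exporters actually expose.

```python
import json
import urllib.parse
import urllib.request

# Metric names as commonly exposed by the exporters; verify on your setup.
GPU_UTILISATION = "DCGM_FI_DEV_GPU_UTIL"          # dcgm-exporter
GPU_VRAM_USED = "DCGM_FI_DEV_FB_USED"             # dcgm-exporter
CONTAINER_MEMORY = "container_memory_usage_bytes"  # cadvisor


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(body: dict) -> list:
    """Turn a Prometheus instant-vector response into (labels, value) pairs."""
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    samples = []
    for item in body["data"]["result"]:
        _timestamp, raw_value = item["value"]  # value arrives as a string
        samples.append((item["metric"], float(raw_value)))
    return samples


def fetch_metric(base_url: str, promql: str) -> list:
    """Run the query against a live Prometheus (base_url is a placeholder)."""
    with urllib.request.urlopen(instant_query_url(base_url, promql)) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, fetch_metric called with your Prometheus address and GPU_UTILISATION would return one (labels, value) pair per GPU that Prometheus currently scrapes, which is enough to spot a hung or idle worker.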

andreaskoepf · 2023-06-07T12:26:10Z

I am closing this since we don't have a clear use case for this right now.

Commit b5c8a3a: Add new data sources (cadvisor for docker containers, dcgm-exporter for nvidia gpus) Add dashboards for docker / gpus Add mono-board WIP with variable for Datasource and Job

toiletpapercode requested review from andreaskoepf, melvinebenezer, yk, olliestanley and AbdBarho as code owners April 28, 2023 21:24

andrewm4894 reviewed Apr 29, 2023

andreaskoepf closed this Jun 7, 2023
