Observability additions #2963

Closed
wants to merge 1 commit into from


toiletpapercode
Contributor

Add new data sources (cadvisor for docker containers, dcgm-exporter for nvidia gpus)
Add dashboards for docker / gpus
Add mono-board WIP with variable for Datasource and Job
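As a rough illustration of what "variable for Datasource and Job" can look like in a Grafana dashboard JSON model, here is a minimal sketch. The contents of mono-board.json are not shown in this thread, so the variable names, labels, and the `label_values(up, job)` query are assumptions rather than the PR's actual definitions.

```python
import json


def build_templating() -> dict:
    """Return a hypothetical `templating` section with two dashboard variables."""
    # Variable that lets the viewer pick which Prometheus data source to use.
    datasource_var = {
        "name": "datasource",
        "label": "Datasource",
        "type": "datasource",
        "query": "prometheus",  # data source plugin type to list
    }
    # Variable that lists scrape jobs found in the selected data source.
    job_var = {
        "name": "job",
        "label": "Job",
        "type": "query",
        "datasource": {"type": "prometheus", "uid": "${datasource}"},
        "query": {
            "query": "label_values(up, job)",  # assumed query, not from the PR
            "refId": "StandardVariableQuery",
        },
        "refresh": 1,  # re-evaluate when the dashboard loads
    }
    return {"list": [datasource_var, job_var]}


if __name__ == "__main__":
    print(json.dumps({"templating": build_templating()}, indent=2))
```

Panels can then reference `${datasource}` as their data source and filter on `job="$job"`, so one board can serve several scrape jobs.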

@toiletpapercode
Contributor Author

I'm not sure how to integrate the new containers into the deployment process; please let me know how that works.

@AbdBarho
Collaborator

Our deployment setup starts in https://github.com/LAION-AI/Open-Assistant/blob/main/.github/workflows/deploy-to-node.yaml

which references the files in https://github.com/LAION-AI/Open-Assistant/tree/main/ansible

deploy-to-node is the main entrypoint for web / backend, and in the inference subfolder you'll find the deployment code for the inference server and workers.

@andrewm4894
Collaborator

Do you know what Grafana version those dashboards use/need? I tried to copy them into Grafana v9.5.1 (bc353e4b2d) but got an error.

https://community.grafana.com/t/recurrent-issue-the-dashboards-has-been-changed-by-someone-else/41349/11

Not sure whether it might be some other issue - I was hoping to just copy-paste mono-board.json in to see how it looks.


@toiletpapercode
Contributor Author

It'll be easier to just import it as a new dashboard, rather than pasting it into an existing one. I was using a freshly pulled grafana/grafana docker image.
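A minimal sketch of the "import as new" idea for anyone scripting it. Grafana's dashboard HTTP API (`POST /api/dashboards/db`) creates a new dashboard when `id` is null, and rejects a save with "The dashboard has been changed by someone else" when the submitted `version` does not match the stored one. Whether that is the exact cause of the error above is an assumption; the helper name and the sample dashboard are hypothetical.

```python
import copy
import json


def as_new_dashboard_payload(dashboard: dict, overwrite: bool = False) -> dict:
    """Wrap an exported dashboard so Grafana treats it as a brand-new one."""
    fresh = copy.deepcopy(dashboard)  # leave the caller's dict untouched
    fresh["id"] = None                # null id asks Grafana to create, not update
    fresh["uid"] = None               # let Grafana generate a new uid
    fresh.pop("version", None)        # drop the stale version used for conflict checks
    return {"dashboard": fresh, "overwrite": overwrite}


if __name__ == "__main__":
    # Hypothetical stand-in for the contents of mono-board.json.
    exported = {"id": 12, "uid": "abc123", "version": 7, "title": "mono-board", "panels": []}
    print(json.dumps(as_new_dashboard_payload(exported), indent=2))
```

The resulting JSON can be sent as the request body to `/api/dashboards/db` on the Grafana instance; the "Import" dialog in the UI achieves the same result interactively.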

@andrewm4894
Collaborator

I was trying to run in a codespace on this branch and got this:

docker compose --profile ci --profile observability up --build --attach-dependencies
[+] Running 11/0
 ✔ Container cadvisor                              Created                                                                                                0.0s 
 ✔ Container open-assistant-webdb-1                Running                                                                                                0.0s 
 ✔ Container netdata                               Running                                                                                                0.0s 
 ✔ Container prometheus                            Running                                                                                                0.0s 
 ✔ Container open-assistant-web-1                  Created                                                                                                0.0s 
 ✔ Container open-assistant-maildev-1              Running                                                                                                0.0s 
 ✔ Container open-assistant-redis-1                Running                                                                                                0.0s 
 ✔ Container open-assistant-db-1                   Running                                                                                                0.0s 
 ✔ Container open-assistant-backend-worker-1       Created                                                                                                0.0s 
 ✔ Container open-assistant-backend-1              Created                                                                                                0.0s 
 ✔ Container open-assistant-backend-worker-beat-1  Created                                                                                                0.0s 
Attaching to cadvisor, grafana, netdata, open-assistant-backend-1, open-assistant-backend-worker-1, open-assistant-backend-worker-beat-1, open-assistant-db-1, open-assistant-maildev-1, open-assistant-redis-1, open-assistant-web-1, open-assistant-webdb-1, prometheus
Error response from daemon: driver failed programming external connectivity on endpoint grafana (1020f4e9bb3ab4338326c4012ed943aa4189d5bf72be852f5225b8e5359915bb): Error starting userland proxy: listen tcp4 0.0.0.0:2000: bind: address already in use

Not sure what might already be using port 2000.

@toiletpapercode
Contributor Author

You can try following some of these to find what is using it: https://www.cyberciti.biz/faq/unix-linux-check-if-port-is-in-use-command/

Alternatively, change Open-Assistant/docker-compose.yaml:298 to - <free port in codespace>:2000
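If the usual port tools are not available in the codespace, a small script can check the port and suggest a replacement. This is only a sketch using Python's standard `socket` module; the function names are made up for illustration, and the check is conservative (a port lingering in TIME_WAIT may also be reported as busy).

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use", as in the Docker error above.
            return False
        return True


def pick_free_port(host: str = "0.0.0.0") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    if port_is_free(2000):
        print("Port 2000 looks free")
    else:
        print(f"Port 2000 is taken; try mapping {pick_free_port()}:2000 instead")
```

The suggested port is released as soon as the script exits, so another process could still claim it before `docker compose up` runs.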

@andrewm4894
Collaborator

andrewm4894 commented Apr 29, 2023

I am hoping to get this merged so I can then use a GitHub codespace to test this properly (just to make sure anyone can run it from a clean GitHub codespace):

#2970

@andreaskoepf
Collaborator

andreaskoepf commented Apr 30, 2023

This PR contains quite a lot of elements for which I don't see a clear use case. For example, why is there nvidia-gpus.json? We don't have any GPUs in our backend server.

@andreaskoepf
Collaborator

In general, we need a proper plan for monitoring. At the least, Grafana should be deployed to a separate machine to monitor the health of the system (and send out alerts).

@toiletpapercode
Contributor Author

@andreaskoepf We do need to discuss things properly.

I was in the process of writing proper Dockerfiles and updating the GitHub Actions / Ansible setup for all these new containers yesterday, but I had to pause that and was afk for the rest of the day.

I have added two new containers, cadvisor and dcgm-exporter.

cadvisor allows monitoring of docker containers (cpu, memory, network, etc.) and should be deployed on any server that is running docker containers.

dcgm-exporter gives stats for nvidia gpus and should be deployed onto the inference worker servers, so that the gpus can be monitored for utilisation and vram usage (or anything else that would be useful). I mentioned this in the discord:

Visibility is always good, if prometheus is going to be scraping the gpu instances for the worker metrics anyway.. might as well pick up the GPU usage
That way if something starts hanging the gpus can be checked, and if a new model is released it can be used to monitor the vram/utilisation for testing more optimised models
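To make the two exporters concrete, here is a sketch of the kind of queries they would enable once Prometheus scrapes them. The metric names are ones cadvisor and dcgm-exporter expose by default; the Prometheus address, the dictionary keys, and the helper name are assumptions for illustration, and the code only builds URLs for Prometheus' instant query endpoint without contacting a server.

```python
from urllib.parse import urlencode

# Assumed address of the Prometheus container; adjust to the real deployment.
PROMETHEUS_URL = "http://localhost:9090"

# Example PromQL expressions, keyed by hypothetical names.
QUERIES = {
    # dcgm-exporter: GPU utilisation in percent, per GPU index.
    "gpu_utilisation_percent": "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)",
    # dcgm-exporter: framebuffer (VRAM) used in MiB, per GPU index.
    "gpu_vram_used_mib": "max by (gpu) (DCGM_FI_DEV_FB_USED)",
    # cadvisor: CPU cores used per container over the last 5 minutes.
    "container_cpu_cores": "sum by (name) (rate(container_cpu_usage_seconds_total[5m]))",
    # cadvisor: memory used per container, in bytes.
    "container_memory_bytes": "sum by (name) (container_memory_usage_bytes)",
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for Prometheus' /api/v1/query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    for name, promql in QUERIES.items():
        print(f"{name}: {instant_query_url(PROMETHEUS_URL, promql)}")
```

The same expressions could back Grafana panels or alert rules, for example to flag a worker whose GPU utilisation stays at zero while requests are queued.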

@andreaskoepf
Collaborator

I am closing this since we don't have any clear use case for this right now.
