Satyaog/feature/covalent #217

Open
wants to merge 3 commits into
base: master

satyaog
Member

@satyaog satyaog commented May 22, 2024

milabench cloud --setup

It creates a system config file and takes a target cloud platform with --run-on.

This starts a local covalent server, which manages the Python code that is executed on the remote. For now this is only somewhat useful: milabench mostly uses ssh commands anyway, and refactoring the pipeline to run code through the covalent interface instead would take some time. I think it could be an interesting approach, but it's a nice-to-have for now.

So milabench cloud --setup sets up the remote and installs the basics on it: the correct python version (needed so that python objects serialize and deserialize correctly between the local and remote machines), pip and venv. venv is used to separate the covalent env from the milabench env, which have incompatible package version requirements (sqlalchemy caused problems). Once this is done, the covalent server is no longer needed.

The system config file should then be used in the install, prepare and run commands. Each of those commands creates a new standalone config for the tests that will be executed and copies it to the remote before the rest of the pipeline runs.

At the end of the run command, the results are copied to the local machine so that a report can be generated.

At the very end, milabench cloud --teardown should be used to release the cloud resources. The --all argument will release all resources of a target cloud platform specified with --run-on.

Check docs/usage.rst for more info
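
The order of the steps above can be sketched as a small helper that only builds the command lines and does not run anything. This is a minimal sketch, not the project's API: `cloud_workflow` and its parameters are hypothetical names, the platform value is a placeholder, and how the system config file is passed to install, prepare and run is not stated here, so the caller supplies those arguments as-is (see docs/usage.rst for the real invocation).

```python
import shlex
from typing import List, Sequence


def cloud_workflow(platform: str, system_args: Sequence[str]) -> List[List[str]]:
    """Return the milabench command lines for one cloud run, in order.

    platform:    value given to --run-on (placeholder, format not specified here)
    system_args: whatever arguments pass the system config file to
                 install/prepare/run (left to the caller; see docs/usage.rst)
    """
    base = ["milabench"]
    extra = list(system_args)
    return [
        # 1. Create the remote and write the system config file.
        base + ["cloud", "--setup", "--run-on", platform],
        # 2-4. Regular pipeline, each step using the system config.
        base + ["install"] + extra,
        base + ["prepare"] + extra,
        base + ["run"] + extra,
        # 5. Release the cloud resources.
        # Assumption: --run-on is also passed on a plain teardown; the
        # description only confirms it together with --all.
        base + ["cloud", "--teardown", "--run-on", platform],
    ]


if __name__ == "__main__":
    # "PLATFORM" and "SYSTEM_ARGS" are placeholders, not real values.
    for argv in cloud_workflow("PLATFORM", ["SYSTEM_ARGS"]):
        print(shlex.join(argv))
```

Keeping the commands as argument lists makes the sequence easy to inspect or log before anything is handed to a shell.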

milabench with slurm

milabench cloud --setup also works with a slurm system configuration, but the --all argument of milabench cloud --teardown is not supported in that case.

Check docs/usage.rst for more info

milabench report --push

Pushes the results to a reports branch, which also stores the status svg and summary.

Example of reports: #210

@satyaog satyaog mentioned this pull request May 22, 2024
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 29f573e to 3bfe690 Compare May 23, 2024 13:53
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 3bfe690 to 65ca09a Compare May 23, 2024 14:02
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 65ca09a to 978e16d Compare May 23, 2024 14:28
@satyaog satyaog force-pushed the satyaog/feature/covalent branch 3 times, most recently from f9a8c6e to 89898ee Compare May 23, 2024 15:11
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 89898ee to 14ffdf1 Compare May 24, 2024 13:03
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 14ffdf1 to 7d15073 Compare May 24, 2024 15:40
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 7d15073 to 052d2b9 Compare May 27, 2024 13:49
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 052d2b9 to 11dd515 Compare May 27, 2024 18:06
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 11dd515 to 3683cb7 Compare May 27, 2024 23:50
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 3683cb7 to fa32dde Compare August 8, 2024 14:24
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from fa32dde to 172c90f Compare August 8, 2024 14:26
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 172c90f to 2f5f981 Compare August 8, 2024 14:53
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from e9a129b to 558a31d Compare August 21, 2024 06:23
@satyaog satyaog requested a deployment to cloud-ci August 21, 2024 07:34 — with GitHub Actions Abandoned
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 558a31d to 9e394be Compare August 22, 2024 04:13
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 9e394be to fdd5270 Compare September 6, 2024 03:24
@satyaog satyaog changed the base branch from master to staging September 6, 2024 03:28
@satyaog satyaog requested a deployment to cloud-ci September 6, 2024 04:48 — with GitHub Actions Abandoned
@satyaog satyaog requested a deployment to cloud-ci September 6, 2024 04:48 — with GitHub Actions Abandoned
@satyaog satyaog requested a deployment to cloud-ci September 6, 2024 08:08 — with GitHub Actions Abandoned
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from fdd5270 to b03a424 Compare September 11, 2024 05:24
covalent is not compatible with milabench as it requires sqlalchemy<2.0.0

Update .github/workflows/cloud-ci.yml
Apply suggestions from code review
Update .github/workflows/cloud-ci.yml
Add azure covalent cloud infra

Add multi-node on cloud

* VM on the cloud might not have enough space on all partitions. Add a workaround which should cover most cases
* Use branch and commit name to versionize reports directories
* Fix parsing error when temperature is not available in nvidia-smi outputs
* export MILABENCH_* env vars to remote
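
The temperature parsing fix mentioned in the list above is not shown here. As an illustration only, a tolerant parser could return `None` instead of raising when a field is not a number. The function name and the sample strings are hypothetical and are not taken from the milabench code.

```python
from typing import Optional


def parse_temperature(field: str) -> Optional[float]:
    """Parse one temperature field; return None when no value is reported.

    Accepts a bare number ("45") or a number with a Celsius suffix ("45 C").
    Anything that is not a number (for example "N/A") yields None.
    """
    text = field.strip()
    # Drop a trailing unit suffix such as the "C" in "45 C".
    if text.endswith("C"):
        text = text[:-1].strip()
    try:
        return float(text)
    except ValueError:
        # Placeholder text instead of a reading: treat as "not available".
        return None


if __name__ == "__main__":
    for sample in ["45 C", "45", "N/A", ""]:
        print(repr(sample), "->", parse_temperature(sample))
```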

Add docs

Fix cloud instance name conflict

This would prevent the CI or multiple contributors from running tests with the same config
Fix github push in CI

* Copy ssh key to allow connections from master to workers
* Use local ip for manager's ip such that workers can find it and connect to it
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from b03a424 to b591c23 Compare September 20, 2024 15:48
@satyaog
Member Author

satyaog commented Sep 20, 2024

Added the slurm covalent plugin to help debug the cloud setups

@satyaog satyaog force-pushed the satyaog/feature/covalent branch from b591c23 to 3b207f8 Compare September 23, 2024 21:42
@satyaog
Member Author

satyaog commented Sep 24, 2024

Tested slurm with:

=================
Benchmark results
=================

System
------
cpu:      AMD EPYC 7543 32-Core Processor
n_cpu:    64
product:  NVIDIA A100-SXM4-80GB
n_gpu:    1
memory:   81920.0

Breakdown
---------
bench                    | fail |   n | ngpu |           perf |   sem% |   std% | peak_memory |          score | weight
diffusion-single         |    0 |   1 |    1 |          28.13 |   0.1% |   0.9% |       53815 |          28.13 |   1.00
dimenet                  |    0 |   1 |    1 |         482.46 |   1.8% |   5.4% |         nan |         482.46 |   1.00
dinov2-giant-single      |    0 |   1 |    1 |          54.12 |   0.6% |   2.1% |       69569 |          54.12 |   1.00
dqn                      |    0 |   1 |    1 | 22934535905.03 |   3.3% |  91.1% |         nan | 22934535905.03 |   1.00
bf16                     |    0 |   1 |    1 |         296.65 |   0.0% |   0.2% |        1609 |         296.65 |   0.00
fp16                     |    0 |   1 |    1 |         295.35 |   0.0% |   0.3% |        1609 |         295.35 |   0.00
fp32                     |    0 |   1 |    1 |          19.17 |   0.0% |   0.0% |        1987 |          19.17 |   0.00
tf32                     |    0 |   1 |    1 |         148.64 |   0.0% |   0.1% |        1987 |         148.64 |   0.00
bert-fp16                |    0 |   1 |    1 |         275.25 |   0.0% |   0.2% |         nan |         275.25 |   0.00
bert-fp32                |    0 |   1 |    1 |          45.64 |   0.0% |   0.1% |       20991 |          45.64 |   0.00
bert-tf32                |    0 |   1 |    1 |         147.32 |   0.1% |   0.4% |         nan |         147.32 |   0.00
bert-tf32-fp16           |    0 |   1 |    1 |         274.37 |   0.2% |   1.3% |         nan |         274.37 |   3.00
reformer                 |    0 |   1 |    1 |          62.86 |   0.1% |   0.4% |         nan |          62.86 |   1.00
t5                       |    0 |   1 |    1 |          52.16 |   0.3% |   0.8% |         nan |          52.16 |   2.00
whisper                  |    0 |   1 |    1 |         520.24 |   1.0% |   3.0% |         nan |         520.24 |   1.00
lightning                |    0 |   1 |    1 |         712.70 |   0.5% |   5.0% |       27183 |         712.70 |   1.00
llava-single             |    0 |   1 |    1 |           2.39 |   0.2% |   1.6% |       72377 |           2.39 |   1.00
llama                    |    0 |   1 |    1 |         466.14 |  11.5% |  72.0% |       27641 |         466.14 |   1.00
llm-lora-single          |    0 |   1 |    1 |        3517.85 |   0.1% |   0.7% |       32995 |        3517.85 |   1.00
pna                      |    0 |   1 |    1 |        5079.10 |   1.9% |   5.6% |       39543 |        5079.10 |   1.00
ppo                      |    0 |   1 |    1 |    32372024.27 |   1.5% |  57.6% |       62159 |    32372024.27 |   1.00
recursiongfn             |    0 |   1 |    1 |        9035.14 |   3.5% |  10.5% |        6935 |        9035.14 |   1.00
rlhf-single              |    0 |   1 |    1 |        2573.66 |   0.3% |   2.8% |       19181 |        2573.66 |   1.00
focalnet                 |    0 |   1 |    1 |         389.95 |   0.7% |   2.3% |        3847 |         389.95 |   2.00
torchatari               |    0 |   1 |    1 |        3592.50 |   1.4% |   5.0% |        3655 |        3592.50 |   1.00
convnext_large-fp16      |    0 |   1 |    1 |         354.76 |   0.5% |   2.6% |         nan |         354.76 |   0.00
convnext_large-fp32      |    0 |   1 |    1 |          60.63 |   0.1% |   0.3% |       55771 |          60.63 |   0.00
convnext_large-tf32      |    0 |   1 |    1 |         160.49 |   0.0% |   0.1% |       49471 |         160.49 |   0.00
convnext_large-tf32-fp16 |    0 |   1 |    1 |         357.23 |   0.2% |   1.2% |         nan |         357.23 |   3.00
regnet_y_128gf           |    0 |   1 |    1 |         123.15 |   0.3% |   0.9% |         nan |         123.15 |   2.00
resnet50                 |    0 |   1 |    1 |        1199.53 |   2.4% |   7.3% |         nan |        1199.53 |   1.00
resnet50-noio            |    0 |   1 |    1 |        1177.09 |   0.0% |   0.2% |       27301 |        1177.09 |   0.00
vjepa-single             |    0 |   1 |    1 |          22.22 |   1.8% |  14.0% |       56005 |          22.22 |   1.00

Scores
------
Failure rate:       0.00% (PASS)
Score:             821.42

=================
Benchmark results
=================

System
------
cpu:      AMD EPYC 7543 32-Core Processor
n_cpu:    64
product:  NVIDIA A100-SXM4-80GB
n_gpu:    4
memory:   81920.0

Breakdown
---------
bench              | fail |   n | ngpu |       perf |   sem% |   std% | peak_memory |      score | weight
brax               |    0 |   1 |    4 |  636209.06 |   0.3% |   0.8% |        2609 |  636209.06 |   1.00
diffusion-gpus     |    0 |   1 |    4 |     109.52 |   0.1% |   0.5% |       58283 |     109.52 |   1.00
dinov2-giant-gpus  |    0 |   1 |    4 |     229.23 |   0.3% |   0.9% |       70961 |     229.23 |   1.00
lightning-gpus     |    0 |   1 |    4 |    2898.55 |   0.3% |   2.6% |       28055 |    2898.55 |   1.00
llm-lora-ddp-gpus  |    0 |   1 |    4 |   10472.82 |   0.6% |   3.1% |       36227 |   10472.82 |   1.00
rlhf-gpus          |    0 |   1 |    4 |    7560.51 |   0.3% |   2.4% |       21489 |    7560.51 |   1.00
resnet152-ddp-gpus |    0 |   1 |    4 |    2438.15 |   0.0% |   0.4% |       27849 |    2438.15 |   0.00
vjepa-gpus         |    0 |   1 |    4 |      78.81 |   3.6% |  28.9% |       63831 |      78.81 |   1.00

Scores
------
Failure rate:       0.00% (PASS)
Score:            2246.64

=================
Benchmark results
=================

System
------
cpu:      AMD EPYC 7543 32-Core Processor
n_cpu:    64
product:  NVIDIA A100-SXM4-80GB
n_gpu:    2
memory:   81920.0

Breakdown
---------
bench              | fail |   n | ngpu |       perf |   sem% |   std% | peak_memory |      score | weight
diffusion-nodes    |    0 |   2 |    4 |      23.50 |   0.5% |   3.7% |       57299 |      23.50 |   1.00
llm-lora-ddp-nodes |    0 |   2 |    4 |    1043.47 |   0.6% |   3.4% |       35199 |    1043.47 |   1.00

Scores
------
Failure rate:       0.00% (PASS)
Score:             156.58
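
As a reading aid for the Score lines: the reported values are numerically consistent with a weighted geometric mean of the per-benchmark scores, where rows with weight 0.00 do not contribute. This is an observation from the numbers in the tables, not something stated here about how milabench computes the score. A minimal check on the multi-node table, with a hypothetical helper name:

```python
import math
from typing import Iterable, Tuple


def weighted_geometric_mean(rows: Iterable[Tuple[float, float]]) -> float:
    """Weighted geometric mean of (score, weight) pairs.

    Rows with weight 0 are skipped, so they do not affect the result.
    """
    total_weight = 0.0
    log_sum = 0.0
    for score, weight in rows:
        if weight == 0:
            continue
        log_sum += weight * math.log(score)
        total_weight += weight
    if total_weight == 0:
        raise ValueError("at least one row needs a non-zero weight")
    return math.exp(log_sum / total_weight)


# (score, weight) copied from the multi-node table above.
multi_node = [(23.50, 1.00), (1043.47, 1.00)]
print(f"{weighted_geometric_mean(multi_node):.2f}")
```

The result is about 156.59, against a reported 156.58. The small gap is what one would expect from the per-benchmark scores being displayed with two decimals.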

Large LLM models (llama3 70B) have been excluded, as I don't have the resources to test them yet.

It should also work on Azure, which I'll test next week.

Base automatically changed from staging to master October 2, 2024 17:00
@satyaog satyaog force-pushed the satyaog/feature/covalent branch from 3b207f8 to f75e3a5 Compare October 3, 2024 19:11
@satyaog satyaog requested a deployment to cloud-ci October 3, 2024 20:47 — with GitHub Actions Abandoned
@satyaog satyaog requested a deployment to cloud-ci October 3, 2024 20:47 — with GitHub Actions Abandoned