Dragon launcher #580
Commits on Apr 4, 2024
Dragon Launcher Prototype (#470)
This is the first prototype of the new Dragon-based launcher. Batch launch is not yet available for Dragon.
[ committed by @al-rigazzi @ankona @MattToast ]
[ reviewed by @MattToast @ankona @al-rigazzi ]
---------
Co-authored-by: Matt Drozt <matthew.drozt@gmail.com>
Co-authored-by: Christopher McBride <christopher.mcbride@gmail.com>
Commit 9de7044
Commits on Apr 10, 2024
Commit 547e20a
Commits on Apr 15, 2024
Decouple authenticator and socket creation (#542)
1. ZMQ authenticators appear to have clashing inproc addresses when they are created from the context returned by the `zmq.Context.instance()` factory method. Those calls were replaced as needed.
2. Updated the underlying `Dragon` library version, which included a breaking change that required the swap from `TemplateProcess` to `ProcessTemplate`.
3. Fixed an incomplete permission set on curve key files.
[ committed by @ankona ]
[ reviewed by @MattToast @al-rigazzi ]
Commit 7f6ecbe
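Item 3 of #542 concerns file permissions on curve key files. A minimal sketch of owner-only key file creation follows, assuming a POSIX file system. The helper names `write_private_key` and `is_owner_only` are hypothetical and are not taken from the SmartSim source.

```python
import os
import stat
from pathlib import Path


def write_private_key(path: Path, key_material: bytes) -> None:
    """Write key material so that only the owning user can read or write it."""
    # Request mode 0o600 at creation time so a new file is never
    # briefly readable by group or other.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as key_file:
        key_file.write(key_material)
    # The creation mode is ignored when the file already exists, so
    # enforce the final mode explicitly as well.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def is_owner_only(path: Path) -> bool:
    """Return True when group and other have no permission bits set."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

A caller could check `is_owner_only(key_path)` before loading a secret key and refuse to start if the check fails.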
Fix name mapping for dragon steps (#551)
Fix a defect in retrieving status updates for the dragon launcher. Pre-dragon launchers use the task/step name to retrieve updates, while the dragon launcher uses the `task_id`. This fix ensures that the name for dragon tasks is mapped appropriately.
[ committed by @ankona ]
[ reviewed by @al-rigazzi ]
Commit ca38b8a
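The kind of name-to-`task_id` mapping described in #551 might look like the sketch below. It is an illustration only: `StepNameMap` and `statuses_by_step_name` are hypothetical names, and the status strings are placeholders rather than values from the launcher.

```python
from typing import Dict, Optional


class StepNameMap:
    """Hypothetical two-way lookup between step names and task IDs."""

    def __init__(self) -> None:
        self._task_id_by_name: Dict[str, str] = {}
        self._name_by_task_id: Dict[str, str] = {}

    def add(self, step_name: str, task_id: str) -> None:
        # Record both directions so either key can be used for lookups.
        self._task_id_by_name[step_name] = task_id
        self._name_by_task_id[task_id] = step_name

    def task_id_for(self, step_name: str) -> Optional[str]:
        return self._task_id_by_name.get(step_name)

    def name_for(self, task_id: str) -> Optional[str]:
        return self._name_by_task_id.get(task_id)


def statuses_by_step_name(
    names: StepNameMap, updates: Dict[str, str]
) -> Dict[str, str]:
    """Re-key updates reported per task ID so they can be read by step name."""
    result: Dict[str, str] = {}
    for task_id, status in updates.items():
        step_name = names.name_for(task_id)
        if step_name is not None:  # skip task IDs that were never registered
            result[step_name] = status
    return result
```

With a mapping like this, code that asks for status by step name keeps working even though the launcher reports updates by `task_id`.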
Commits on Apr 16, 2024
Fix telemetry monitor listener registration timeline (#549)
Reorder experiment startup to ensure that the telemetry monitor registers its event listeners before any entities are launched.
[ committed by @ankona ]
[ approved by @MattToast ]
Commit b473413
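Why the ordering in #549 matters can be shown with a toy event source that does not buffer events for late listeners. Everything here (`EventBus`, `start_experiment`, the event strings) is a hypothetical stand-in and not the telemetry monitor's actual interface.

```python
from typing import Callable, List


class EventBus:
    """Toy event source; events emitted before registration are lost."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str], None]] = []

    def register(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def emit(self, event: str) -> None:
        for listener in list(self._listeners):
            listener(event)


def start_experiment(
    bus: EventBus,
    listener: Callable[[str], None],
    entities: List[str],
    register_first: bool,
) -> None:
    """Launch entities, registering the listener before or after launch."""
    if register_first:
        bus.register(listener)
    for entity in entities:
        bus.emit(f"start:{entity}")  # stands in for launching an entity
    if not register_first:
        bus.register(listener)
```

Calling `start_experiment(..., register_first=True)` delivers every start event to the listener, while `register_first=False` delivers none of them.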
Set correct value for curve server public key (#553)
[ committed by @ankona ]
[ reviewed by @al-rigazzi ]
Commit 5ef4af5
Commits on Apr 23, 2024
Add log file cleanup to dragon entrypoint (#554)
Update the dragon entrypoint to ensure that the log file is removed when the environment is shut down.
Additional updates:
- minor refactor to enable testing entrypoint features
- add tests for entrypoint functions
- update incorrect license clause
[ committed by @ankona ]
[ reviewed by @al-rigazzi ]
Commit 9fd7fe6
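One way to arrange the log file cleanup described in #554 is sketched below using only the standard library. The function names are hypothetical, and the actual entrypoint may hook shutdown differently (for example through signal handlers, which `atexit` alone does not cover).

```python
import atexit
from pathlib import Path


def remove_log_file(log_path: Path) -> None:
    """Delete the log file; do nothing if it is already gone."""
    # missing_ok requires Python 3.8 or newer.
    log_path.unlink(missing_ok=True)


def register_log_cleanup(log_path: Path) -> None:
    """Arrange for the log file to be removed at normal interpreter exit."""
    # atexit handlers run on normal exit and sys.exit(), but not when the
    # process is killed by an unhandled signal.
    atexit.register(remove_log_file, log_path)
```

Keeping the removal in its own small function makes it easy to test without starting the entrypoint.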
Commits on May 7, 2024
Commit 9312176
Commit a550528
Commit 4cc9431
Commit c19651e
Commits on May 9, 2024
Add option to install Dragon runtime (#569)
Add build option to `smart` CLI for installation of Dragon runtime.
### Additional Changes
- minor extract-method refactor to avoid `too-many-statements` linter issue
### Expected Output
```bash
(ss39) mcbridch@hotlum-login:/lus/bnchlu1/mcbridch/ss> smart build --dragon
[SmartSim] INFO Running SmartSim build process...
[SmartSim] INFO Checking requested versions...
[SmartSim] INFO Checking for build tools...
[SmartSim] DEBUG Retrieved asset metadata: GitReleaseAsset(url="https://api.github.com/repos/DragonHPC/dragon/releases/assets/157545149")
[SmartSim] DEBUG Retrieved https://github.com/DragonHPC/dragon/releases/download/v0.8-beta/dragon-0.8-py3.9.4.1-CRAYEX-ac132fe95.tar.gz to /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
[SmartSim] INFO Installing dragon from: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon/dragon-0.8/dragon-0.8-cp39-cp39-linux_x86_64.whl
[SmartSim] DEBUG Deleted asset directory: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
[SmartSim] INFO Dragon installation complete
[SmartSim] INFO Redis build complete!
ML Backends Requested
╒════════════╤════════╤═══════╕
│ PyTorch    │ 2.0.1  │ True  │
│ TensorFlow │ 2.13.1 │ True  │
│ ONNX       │ 1.14.1 │ False │
╘════════════╧════════╧═══════╛
Building for GPU support: False
[SmartSim] INFO Building RedisAI version 1.2.7 from https://github.com/RedisAI/RedisAI.git/
[SmartSim] INFO ML Backends and RedisAI build complete!
[SmartSim] INFO Tensorflow, Torch backend(s) built
[SmartSim] INFO SmartSim build complete!
```
---------
Co-authored-by: Alyssa Cote <46540273+AlyssaCote@users.noreply.github.com>
Co-authored-by: amandarichardsonn <30413257+amandarichardsonn@users.noreply.github.com>
Co-authored-by: Matt Drozt <drozt@hpe.com>
[ reviewed by @al-rigazzi @MattToast ]
[ committed by @ankona ]
Commit f19d7a9
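The log above installs a wheel whose file name carries a `cp39` tag. As a rough illustration of how such a name can be checked against the running interpreter, the sketch below parses the standard wheel file name layout. It is not taken from SmartSim's build code, and both function names are hypothetical.

```python
import sys


def wheel_python_tag(wheel_name: str) -> str:
    """Return the python tag from a wheel file name.

    Wheel names follow the layout
    {distribution}-{version}[-{build}]-{python}-{abi}-{platform}.whl
    """
    if not wheel_name.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {wheel_name!r}")
    parts = wheel_name[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel file name: {wheel_name!r}")
    # The python tag is always third from the end, with or without a build tag.
    return parts[-3]


def matches_running_interpreter(wheel_name: str) -> bool:
    """Check a CPython-specific wheel against the current interpreter version."""
    expected = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return wheel_python_tag(wheel_name) == expected
```

This simple equality check only suits CPython-specific wheels; compound tags such as `py2.py3` would need to be split on `.` before comparing.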
Commit 390f9cf
Commits on May 10, 2024
Dragon Launcher Batch Job Support (#541)
This PR adds several things:
- stdout and stderr redirection for Dragon-launched processes
- `DragonBatchStep`, with logic to keep track of batch jobs run through SLURM and PBS
- additional environment variables in `CONFIG` to help launch Dragon with options
- some mitigation of the Authenticator's locking behavior
- a cooldown period in the `DragonBackend`, so that the telemetry monitor can get updates before the backend shuts down
- the `DragonBackend` status is now a string representation of two tables, one for hosts (indicating Free/Busy status) and one for ProcessGroups (similar to standard WLM output)
- documentation for Dragon
---------
Co-authored-by: Matt Drozt <matthew.drozt@gmail.com>
Co-authored-by: Amanda Richardson <amanda.richardson@hpe.com>
Commit 4a971fc
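The cooldown idea from #541 can be sketched as a small gate that permits shutdown only after a quiet period. This is an assumption about the general shape of such logic, not the `DragonBackend` implementation; `CooldownGate` and its methods are hypothetical.

```python
from typing import Optional


class CooldownGate:
    """Hypothetical helper: allow shutdown only after a quiet period."""

    def __init__(self, cooldown_seconds: float) -> None:
        if cooldown_seconds < 0:
            raise ValueError("cooldown_seconds must be non-negative")
        self._cooldown = cooldown_seconds
        self._last_activity: Optional[float] = None

    def record_activity(self, now: float) -> None:
        """Note the time of the most recent status change."""
        self._last_activity = now

    def may_shut_down(self, now: float) -> bool:
        """Return True once the cooldown has elapsed since the last activity."""
        if self._last_activity is None:
            return True  # nothing has happened, so nothing is pending
        return now - self._last_activity >= self._cooldown
```

In practice the timestamps would likely come from `time.monotonic()`, so that wall-clock adjustments do not shorten or lengthen the cooldown.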
Commit 0096a25
Commits on May 12, 2024
Commit e6dd26c
Commit 6428013
Commit 3fa00e9
Commits on May 13, 2024
Commit f62f2f3