Dragon launcher #580
Merged
This is the first prototype of the new Dragon-based launcher. Batch launch is not yet available for Dragon.

[ committed by @al-rigazzi @ankona @MattToast ] [ reviewed by @MattToast @ankona @al-rigazzi ]

---------

Co-authored-by: Matt Drozt <matthew.drozt@gmail.com>
Co-authored-by: Christopher McBride <christopher.mcbride@gmail.com>
1. ZMQ authenticators appear to have clashing inproc addresses when using the `zmq.Context.instance()` factory method. Replaced as needed.
2. Updated the underlying `Dragon` library version, which included a breaking change that required the swap from `TemplateProcess` to `ProcessTemplate`.
3. Fixed an incomplete permission set on curve key files.

[ committed by @ankona ] [ reviewed by @MattToast @al-rigazzi ]
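As an illustration of item 3, here is a minimal sketch of tightening permissions on curve key files using only the Python standard library. The helper name `restrict_key_permissions`, the `.key` / `.key_secret` suffixes, and the chosen modes are assumptions made for this example; they are not taken from the PR.

```python
import stat
import tempfile
from pathlib import Path


def restrict_key_permissions(key_dir: Path) -> None:
    """Hypothetical helper: lock down a key directory and the key files in it."""
    # Owner-only access to the directory itself (rwx------).
    key_dir.chmod(stat.S_IRWXU)
    for key_file in key_dir.iterdir():
        if key_file.suffix == ".key_secret":
            # Assumed private-key suffix: owner read/write only (rw-------).
            key_file.chmod(stat.S_IRUSR | stat.S_IWUSR)
        elif key_file.suffix == ".key":
            # Assumed public-key suffix: world-readable, owner-writable (rw-r--r--).
            key_file.chmod(
                stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IROTH
            )


# Usage with placeholder files (not real keys).
with tempfile.TemporaryDirectory() as tmp:
    key_dir = Path(tmp)
    secret = key_dir / "server.key_secret"
    secret.write_text("placeholder")
    restrict_key_permissions(key_dir)
    secret_mode = stat.S_IMODE(secret.stat().st_mode)
```

On a POSIX filesystem, `secret_mode` ends up as `0o600`, so only the owning user can read the private key.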
Fix a defect in retrieving status updates for the dragon launcher. Pre-dragon launchers used the task/step name to retrieve updates, while the dragon launcher uses the `task_id`. This fix ensures that the name for dragon tasks is mapped appropriately.

[ committed by @ankona ] [ reviewed by @al-rigazzi ]
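The name-to-`task_id` mapping described above can be sketched as follows. `TaskNameIndex` and its methods are hypothetical names invented for this example and do not reflect SmartSim's actual classes; the point is only that status lookups go through one key, which is the `task_id` when a launcher provides one and the step name otherwise.

```python
from typing import Dict, Optional


class TaskNameIndex:
    """Hypothetical registry mapping a step name to the key used for status lookups."""

    def __init__(self) -> None:
        self._lookup_keys: Dict[str, str] = {}

    def add(self, step_name: str, task_id: Optional[str] = None) -> None:
        # Launchers that report a task ID are queried by that ID;
        # launchers without one fall back to the step name.
        self._lookup_keys[step_name] = task_id if task_id is not None else step_name

    def lookup_key(self, step_name: str) -> str:
        return self._lookup_keys[step_name]


index = TaskNameIndex()
index.add("model-0", task_id="dragon-task-123")  # dragon-style: has a task ID
index.add("model-1")                             # name-based launcher
```

Here `index.lookup_key("model-0")` returns the task ID, while `index.lookup_key("model-1")` returns the step name unchanged.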
Reorder experiment startup to ensure telemetry monitor registers event listeners prior to launching entities. [ committed by @ankona ] [ approved by @MattToast ]
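A toy sketch of why the ordering matters: if events are delivered only to listeners that are already registered, anything published before registration is lost. `EventBus` and `launch` are hypothetical stand-ins for illustration and are not the telemetry monitor's real mechanism.

```python
from typing import Callable, List


class EventBus:
    """Hypothetical in-memory bus: events with no listener at publish time are dropped."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str], None]] = []

    def subscribe(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def publish(self, event: str) -> None:
        for listener in self._listeners:
            listener(event)


def launch(register_first: bool) -> List[str]:
    """Return the events a listener observes for a given startup order."""
    bus = EventBus()
    seen: List[str] = []
    if register_first:
        bus.subscribe(seen.append)
    bus.publish("entity-started")  # stands in for launching an entity
    if not register_first:
        bus.subscribe(seen.append)
    return seen
```

With `register_first=True` the listener records the start event; with `register_first=False` it records nothing.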
[ committed by @ankona ] [ reviewed by @al-rigazzi ]
Update the dragon entrypoint to ensure that the log file is removed when the environment is shut down.

Additional updates:
- minor refactor to enable testing entrypoint features
- add tests for entrypoint functions
- update incorrect license clause

[ committed by @ankona ] [ reviewed by @al-rigazzi ]
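One way to guarantee log-file removal at shutdown is a `try`/`finally` around the entrypoint body. The sketch below assumes hypothetical function names (`remove_log_file`, `run_with_cleanup`) and a placeholder log path; it is not the PR's actual entrypoint code.

```python
import tempfile
from pathlib import Path
from typing import Callable


def remove_log_file(log_path: Path) -> bool:
    """Hypothetical cleanup step: delete the log file; report whether it existed."""
    try:
        log_path.unlink()
        return True
    except FileNotFoundError:
        return False


def run_with_cleanup(log_path: Path, main: Callable[[], int]) -> int:
    """Run `main` and remove the log file afterwards, even if `main` raises."""
    try:
        return main()
    finally:
        remove_log_file(log_path)


# Usage with a throwaway directory and a placeholder log name.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "entrypoint.log"
    log.write_text("started\n")
    exit_code = run_with_cleanup(log, lambda: 0)
    log_still_exists = log.exists()
```

Keeping the cleanup in a small standalone function also makes it easy to unit test in isolation, in line with the refactor mentioned above.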
Add build option to `smart` CLI for installation of Dragon runtime.

### Additional Changes

- minor extract-method refactor to avoid `too-many-statements` linter issue

### Expected Output

```bash
(ss39) mcbridch@hotlum-login:/lus/bnchlu1/mcbridch/ss> smart build --dragon
[SmartSim] INFO Running SmartSim build process...
[SmartSim] INFO Checking requested versions...
[SmartSim] INFO Checking for build tools...
[SmartSim] DEBUG Retrieved asset metadata: GitReleaseAsset(url="https://api.github.com/repos/DragonHPC/dragon/releases/assets/157545149")
[SmartSim] DEBUG Retrieved https://github.com/DragonHPC/dragon/releases/download/v0.8-beta/dragon-0.8-py3.9.4.1-CRAYEX-ac132fe95.tar.gz to /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
[SmartSim] INFO Installing dragon from: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon/dragon-0.8/dragon-0.8-cp39-cp39-linux_x86_64.whl
[SmartSim] DEBUG Deleted asset directory: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
[SmartSim] INFO Dragon installation complete
[SmartSim] INFO Redis build complete!

ML Backends Requested
╒════════════╤════════╤═══════╕
│ PyTorch    │ 2.0.1  │ True  │
│ TensorFlow │ 2.13.1 │ True  │
│ ONNX       │ 1.14.1 │ False │
╘════════════╧════════╧═══════╛

Building for GPU support: False

[SmartSim] INFO Building RedisAI version 1.2.7 from https://github.com/RedisAI/RedisAI.git/
[SmartSim] INFO ML Backends and RedisAI build complete!
[SmartSim] INFO Tensorflow, Torch backend(s) built
[SmartSim] INFO SmartSim build complete!
```

---------

Co-authored-by: Alyssa Cote <46540273+AlyssaCote@users.noreply.github.com>
Co-authored-by: amandarichardsonn <30413257+amandarichardsonn@users.noreply.github.com>
Co-authored-by: Matt Drozt <drozt@hpe.com>

[ reviewed by @al-rigazzi @MattToast ] [ committed by @ankona ]
This PR adds several things:
- stdout and stderr redirection for Dragon-launched processes
- `DragonBatchStep`, with logic to keep track of batch jobs run through SLURM and PBS
- additional environment variables in `CONFIG` to help with launching dragon with options
- some mitigation of the Authenticator's locking behavior
- a cooldown period in the `DragonBackend` to make sure the telemetry monitor can get updates before the backend shuts down
- the `DragonBackend` status is now a string representation of two tables, one for hosts (indicating Free/Busy status) and one for ProcessGroups (similar to standard WLM output)
- documentation for Dragon

---------

Co-authored-by: Matt Drozt <matthew.drozt@gmail.com>
Co-authored-by: Amanda Richardson <amanda.richardson@hpe.com>
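To make the two-table status idea concrete, here is a rough sketch that renders a host table and a process-group table as a single string. The function names, column headers, and sample values are all assumptions for illustration; the real `DragonBackend` output format is not shown in this PR description.

```python
from typing import Dict, List, Sequence


def format_table(headers: Sequence[str], rows: List[Sequence[str]]) -> str:
    """Render a plain-text table with columns padded to the widest cell."""
    widths = [max(len(cell) for cell in column) for column in zip(headers, *rows)]

    def fmt(row: Sequence[str]) -> str:
        return " | ".join(cell.ljust(width) for cell, width in zip(row, widths))

    lines = [fmt(headers), "-+-".join("-" * width for width in widths)]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


def backend_status(hosts: Dict[str, bool], groups: Dict[str, str]) -> str:
    """Hypothetical status string: one table for hosts, one for process groups."""
    host_rows = [(name, "Busy" if busy else "Free") for name, busy in hosts.items()]
    group_rows = list(groups.items())
    return (
        "Hosts\n"
        + format_table(("Host", "Status"), host_rows)
        + "\n\nProcess groups\n"
        + format_table(("Group", "State"), group_rows)
    )


# Sample values are made up for the example.
status = backend_status(
    hosts={"node-1": True, "node-2": False},
    groups={"group-0": "Running"},
)
```

Printing `status` gives a host table with one Free/Busy entry per node, followed by a process-group table in the style of a workload manager's queue listing.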
mellis13
approved these changes
May 13, 2024
LGTM (after the tests finish passing). Lots of incredible work in this PR!
al-rigazzi
added
type: feature
Issues that include feature request or feature idea
area: launcher
Issues related to any of the launchers within SmartSim
area: api
Issues related to API changes
area: Dragon
labels
May 13, 2024
This PR adds a Dragon-based launcher to SmartSim.