Dragon launcher #580

Merged
merged 19 commits into from
May 13, 2024

Commits on Apr 4, 2024

  1. Dragon Launcher Prototype (#470)

    This is the first prototype of the new Dragon-based launcher. Batch launch is not yet available for Dragon.
    
    [ committed by @al-rigazzi @ankona @MattToast ]
    [ reviewed by @MattToast @ankona @al-rigazzi ]
    
    ---------
    
    Co-authored-by: Matt Drozt <matthew.drozt@gmail.com>
    Co-authored-by: Christopher McBride <christopher.mcbride@gmail.com>
    3 people authored Apr 4, 2024
    Commit 9de7044

Commits on Apr 10, 2024

  1. Commit 547e20a

Commits on Apr 15, 2024

  1. Decouple authenticator and socket creation (#542)

    1. ZMQ authenticators appear to have clashing inproc addresses when
    using the `zmq.Context.instance()` factory method. Replaced as needed.
    2. Updated the underlying `Dragon` library version, which included a
    breaking change that required the swap from `TemplateProcess` to
    `ProcessTemplate`
    3. Fixed incomplete permission set on curve key files
    
    [ committed by @ankona ]
    [ reviewed by @MattToast @al-rigazzi ]
    ankona authored Apr 15, 2024
    Commit 7f6ecbe
  2. Fix name mapping for dragon steps (#551)

    ## Fix a defect in retrieving status updates for the dragon launcher. 
    
    Pre-dragon launchers used the task/step name to retrieve updates while
    the dragon launcher uses the `task_id`. This fix ensures that the name
    for dragon tasks is mapped appropriately.
    
    [ committed by @ankona ]
    [ reviewed by @al-rigazzi ]
    ankona authored Apr 15, 2024
    Commit ca38b8a
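
The name mapping described in #551 can be pictured with a small sketch. It is not SmartSim's implementation: `StepNameMap`, its methods, and the `dragon-42` id are hypothetical names chosen for illustration. The idea is only that a name-based launcher can look up status by the step name itself, while an id-based launcher needs the step name mapped to the `task_id` it was given.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StepNameMap:
    """Hypothetical mapping from user-facing step names to launcher keys."""

    _name_to_key: Dict[str, str] = field(default_factory=dict)

    def add(self, step_name: str, task_id: Optional[str] = None) -> None:
        # A name-based launcher reuses the step name as its status key;
        # an id-based launcher stores the task_id it was handed instead.
        self._name_to_key[step_name] = task_id if task_id is not None else step_name

    def status_key(self, step_name: str) -> str:
        # The key to use when asking the launcher for a status update.
        return self._name_to_key[step_name]


step_map = StepNameMap()
step_map.add("model-0")                        # name-based launcher
step_map.add("model-1", task_id="dragon-42")   # id-based launcher (made-up id)
```

With a mapping like this, callers can always ask for status by step name and leave the translation to the launcher-specific key in one place.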

Commits on Apr 16, 2024

  1. Fix telemetry monitor listener registration timeline (#549)

    Reorder experiment startup to ensure telemetry monitor registers event
    listeners prior to launching entities.
    
    [ committed by @ankona ]
    [ approved by @MattToast ]
    ankona authored Apr 16, 2024
    Commit b473413
  2. set correct value for curve server public key (#553)

    [ committed by @ankona ]
    [ reviewed by @al-rigazzi ]
    ankona authored Apr 16, 2024
    Commit 5ef4af5
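
Why the ordering in #549 matters can be shown with a toy event source. This is a sketch under stated assumptions, not the telemetry monitor's code: `EventBus`, `run`, and the event names are hypothetical. It assumes events are delivered only to listeners that are already registered at the moment an event is emitted.

```python
from typing import Callable, List


class EventBus:
    """Toy event source: delivers only to listeners registered so far."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str], None]] = []

    def register(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def emit(self, event: str) -> None:
        for listener in self._listeners:
            listener(event)


def run(register_first: bool) -> List[str]:
    """Return the events a listener observes for a given startup order."""
    bus = EventBus()
    seen: List[str] = []
    if register_first:
        bus.register(seen.append)
    bus.emit("entity-started")      # stands in for launching an entity
    if not register_first:
        bus.register(seen.append)   # too late: the start event is gone
    bus.emit("entity-completed")
    return seen
```

Registering first captures both events; registering after launch silently misses the start event, which is the kind of gap the reordering is meant to close.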

Commits on Apr 23, 2024

  1. Add log file cleanup to dragon entrypoint (#554)

    Update the dragon entrypoint to ensure that the log file is removed when
    the environment is shut down.
    
    Additional updates:
    - minor refactor to enable testing entrypoint features
    - add tests for entrypoint functions
    - update incorrect license clause
    
    [ committed by @ankona ]
    [ reviewed by @al-rigazzi ]
    ankona authored Apr 23, 2024
    Commit 9fd7fe6
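
One common way to guarantee the kind of cleanup described in #554 is a context manager with a `finally` clause. The sketch below is an assumption about the general pattern, not the entrypoint's actual code: `managed_log_file`, `demo`, and the file name `entrypoint.log` are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def managed_log_file(path: str) -> Iterator[str]:
    """Create a log file and remove it when the block exits, even on error."""
    with open(path, "a", encoding="utf-8"):
        pass  # make sure the file exists
    try:
        yield path
    finally:
        # Tolerate the file having been removed already.
        try:
            os.remove(path)
        except FileNotFoundError:
            pass


def demo() -> bool:
    """Check that the log exists while running and is gone afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "entrypoint.log")  # hypothetical name
        with managed_log_file(log_path) as path:
            with open(path, "a", encoding="utf-8") as log:
                log.write("environment running\n")
            existed_during_run = os.path.exists(path)
        removed_after_exit = not os.path.exists(log_path)
    return existed_during_run and removed_after_exit
```

Keeping the cleanup in a small function of its own also makes it easy to test in isolation, in the spirit of the refactor the commit mentions.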

Commits on May 7, 2024

  1. Commit 9312176
  2. formatting

    ankona committed May 7, 2024
    Commit a550528
  3. fix merge conflict fail

    ankona committed May 7, 2024
    Commit 4cc9431
  4. Commit c19651e

Commits on May 9, 2024

  1. Add option to install Dragon runtime (#569)

    Add build option to `smart` CLI for installation of Dragon runtime.
    
    ### Additional Changes
    - minor extract-method refactor to avoid `too-many-statements` linter
    issue
    
    ### Expected Output
    
    ```bash
    (ss39) mcbridch@hotlum-login:/lus/bnchlu1/mcbridch/ss> smart build --dragon
    [SmartSim] INFO Running SmartSim build process...
    [SmartSim] INFO Checking requested versions...
    [SmartSim] INFO Checking for build tools...
    [SmartSim] DEBUG Retrieved asset metadata: GitReleaseAsset(url="https://api.github.com/repos/DragonHPC/dragon/releases/assets/157545149")
    [SmartSim] DEBUG Retrieved https://github.com/DragonHPC/dragon/releases/download/v0.8-beta/dragon-0.8-py3.9.4.1-CRAYEX-ac132fe95.tar.gz to /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
    [SmartSim] INFO Installing dragon from: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon/dragon-0.8/dragon-0.8-cp39-cp39-linux_x86_64.whl
    [SmartSim] DEBUG Deleted asset directory: /lus/bnchlu1/mcbridch/ss/smartsim/_core/.third-party/dragon
    [SmartSim] INFO Dragon installation complete
    [SmartSim] INFO Redis build complete!
    
    ML Backends Requested
    ╒════════════╤════════╤═══════╕
    │ PyTorch    │ 2.0.1  │ True  │
    │ TensorFlow │ 2.13.1 │ True  │
    │ ONNX       │ 1.14.1 │ False │
    ╘════════════╧════════╧═══════╛
    
    Building for GPU support: False
    
    [SmartSim] INFO Building RedisAI version 1.2.7 from https://github.com/RedisAI/RedisAI.git/
    [SmartSim] INFO ML Backends and RedisAI build complete!
    [SmartSim] INFO Tensorflow, Torch backend(s) built
    [SmartSim] INFO SmartSim build complete!
    ```
    
    ---------
    
    Co-authored-by: Alyssa Cote <46540273+AlyssaCote@users.noreply.github.com>
    Co-authored-by: amandarichardsonn <30413257+amandarichardsonn@users.noreply.github.com>
    Co-authored-by: Matt Drozt <drozt@hpe.com>
    
    [ reviewed by @al-rigazzi @MattToast ]
    [ committed by @ankona ]
    ankona authored May 9, 2024
    Commit f19d7a9
  2. Commit 390f9cf

Commits on May 10, 2024

  1. Dragon Launcher Batch Job Support (#541)

    This PR adds several things:
    - stdout and stderr redirection for Dragon-launched processes
    - `DragonBatchStep`, with logic to keep track of batch jobs run through
    SLURM and PBS
    - additional environment variables in `CONFIG` to help with launching
    Dragon with options
    - some mitigation of the Authenticator's locking behavior
    - a cooldown period in the `DragonBackend`, to make sure the telemetry
    monitor can get updates before it shuts down
    - a `DragonBackend` status that is now a string representation of two
    tables, one for hosts (indicating Free/Busy status) and one for
    ProcessGroups (similar to standard WLM output)
    - documentation for Dragon
    
    ---------
    
    Co-authored-by: Matt Drozt <matthew.drozt@gmail.com>
    Co-authored-by: Amanda Richardson <amanda.richardson@hpe.com>
    3 people authored May 10, 2024
    Commit 4a971fc
  2. Commit 0096a25
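
The cooldown period mentioned in #541 can be illustrated with a minimal sketch. It does not reflect how `DragonBackend` is actually written: `CooldownGate`, its methods, and the injectable `clock` parameter are hypothetical. It assumes the goal is simply to delay shutdown until a fixed quiet period has passed since the last status change, so an observer polling for updates has time to see it.

```python
import time
from typing import Callable, Optional


class CooldownGate:
    """Hypothetical helper: allow shutdown only after a quiet period."""

    def __init__(
        self, cooldown_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last_update: Optional[float] = None

    def record_update(self) -> None:
        # Call whenever a job changes state, restarting the quiet period.
        self._last_update = self._clock()

    def can_shutdown(self) -> bool:
        # With no updates recorded there is nothing to wait for.
        if self._last_update is None:
            return True
        return self._clock() - self._last_update >= self._cooldown_s
```

Passing the clock in as a parameter keeps the timing logic testable without real sleeps; a production loop would use the default `time.monotonic`.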

Commits on May 12, 2024

  1. Update changelog

    al-rigazzi committed May 12, 2024
    Commit e6dd26c
  2. Restore changelog checker

    al-rigazzi committed May 12, 2024
    Commit 6428013
  3. Commit 3fa00e9

Commits on May 13, 2024

  1. Commit f62f2f3