Dragon launcher #580

Merged
merged 19 commits into from
May 13, 2024
2 changes: 1 addition & 1 deletion .github/workflows/changelog.yml
@@ -46,4 +46,4 @@ jobs:
        uses: dangoslen/changelog-enforcer@v3.6.0
        with:
          changeLogPath: './doc/changelog.md'
          missingUpdateErrorMessage: 'changelog.md has not been updated'
6 changes: 3 additions & 3 deletions .gitignore
@@ -11,6 +11,7 @@ tests/test_output

# Dependencies
smartsim/_core/.third-party
smartsim/_core/.dragon

# Docs
_build
@@ -22,14 +23,13 @@ venv/
.venv/
env/
.env/
**/.env

# written upon install
smartsim/version.py

smartsim/_core/bin/*-server
smartsim/_core/bin/*-cli

# created upon install
smartsim/_core/bin
smartsim/_core/lib

# optional dev tools
6 changes: 5 additions & 1 deletion README.md
@@ -174,13 +174,17 @@ system with which it has a corresponding `RunSettings` class. If one can be foun
## Experiments on HPC Systems

SmartSim integrates with common HPC schedulers providing batch and interactive
launch capabilities for all applications:

- Slurm
- LSF
- PBSPro
- Local (for laptops/single node, no batch)

In addition, on Slurm and PBS systems, [Dragon](https://dragonhpc.github.io/dragon/doc/_build/html/index.html)
can be used as a launcher. Please refer to the documentation for instructions on
how to install it on your system and use it in SmartSim.
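
As a rough illustration of the paragraph above, the following is a minimal sketch of launching a trivial command through Dragon. It assumes SmartSim is installed with Dragon support and that the function is called from inside a Slurm or PBS allocation. The function name, the experiment name `dragon-hello`, and the model name `hello` are hypothetical and chosen only for this example.

```python
def run_hello_with_dragon(nodes: int = 1, tasks_per_node: int = 1) -> None:
    """Launch `echo hello` through the Dragon launcher.

    Sketch only: assumes SmartSim with Dragon support is installed and that
    this runs inside a Slurm or PBS allocation. Neither is checked here.
    """
    # Imported lazily so this module can be loaded without SmartSim present.
    from smartsim import Experiment
    from smartsim.settings import DragonRunSettings

    # Select Dragon when creating the experiment.
    exp = Experiment("dragon-hello", launcher="dragon")

    # Describe the job: one executable, its arguments, and its placement.
    settings = DragonRunSettings("echo", ["hello"])
    settings.set_nodes(nodes)
    settings.set_tasks_per_node(tasks_per_node)

    # Create the model, run it to completion, and report its final status.
    model = exp.create_model("hello", settings)
    exp.start(model, block=True)
    print(exp.get_status(model))
```

Calling `run_hello_with_dragon()` with no arguments requests a single task on a single node.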


### Interactive Launch Example

118 changes: 115 additions & 3 deletions conftest.py
@@ -31,9 +31,11 @@
import os
import pathlib
import shutil
import subprocess
import signal
import sys
import tempfile
import time
import typing as t
import uuid
import warnings
@@ -44,14 +46,18 @@

import smartsim
from smartsim import Experiment
from smartsim._core.launcher.dragon.dragonConnector import DragonConnector
from smartsim._core.launcher.dragon.dragonLauncher import DragonLauncher
from smartsim._core.config import CONFIG
from smartsim._core.config.config import Config
from smartsim._core.utils.telemetry.telemetry import JobEntity
from smartsim.database import Orchestrator
from smartsim.entity import Model
from smartsim.error import SSConfigError
from smartsim.log import get_logger
from smartsim.settings import (
    AprunSettings,
    DragonRunSettings,
    JsrunSettings,
    MpiexecSettings,
    MpirunSettings,
@@ -60,6 +66,8 @@
    SrunSettings,
)

logger = get_logger(__name__)

# pylint: disable=redefined-outer-name,invalid-name,global-statement

# Globals, yes, but it's a testing file
@@ -73,6 +81,9 @@
test_port = CONFIG.test_port
test_account = CONFIG.test_account or ""
test_batch_resources: t.Dict[t.Any, t.Any] = CONFIG.test_batch_resources
test_output_dirs = 0
mpi_app_exe = None
built_mpi_app = False

# Fill this at runtime if needed
test_hostlist = None
@@ -108,7 +119,7 @@ def print_test_configuration() -> None:

def pytest_configure() -> None:
    pytest.test_launcher = test_launcher
    pytest.wlm_options = ["slurm", "pbs", "lsf", "pals", "dragon"]
    account = get_account()
    pytest.test_account = account
    pytest.test_device = test_device
@@ -125,6 +136,14 @@ def pytest_sessionstart(
    if os.path.isdir(test_output_root):
        shutil.rmtree(test_output_root)
    os.makedirs(test_output_root)
    while not os.path.isdir(test_output_root):
        time.sleep(0.1)

    if CONFIG.dragon_server_path is None:
        dragon_server_path = os.path.join(test_output_root, "dragon_server")
        os.makedirs(dragon_server_path)
        os.environ["SMARTSIM_DRAGON_SERVER_PATH"] = dragon_server_path

    print_test_configuration()


@@ -136,12 +155,62 @@ def pytest_sessionfinish(
    returning the exit status to the system.
    """
    if exitstatus == 0:
        cleanup_attempts = 5
        while cleanup_attempts > 0:
            try:
                shutil.rmtree(test_output_root)
            except OSError as e:
                cleanup_attempts -= 1
                time.sleep(1)
                if not cleanup_attempts:
                    raise
            else:
                break
    else:
        # kill all spawned processes
        if CONFIG.test_launcher == "dragon":
            time.sleep(5)
        kill_all_test_spawned_processes()
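
The retry loop in `pytest_sessionfinish` could also be factored into a small helper. This is a minimal sketch and not part of the PR: the name `rmtree_with_retries` and its default values are hypothetical, chosen to mirror the five attempts and one-second pause used above.

```python
import shutil
import time


def rmtree_with_retries(path: str, attempts: int = 5, delay: float = 1.0) -> None:
    """Remove a directory tree, retrying when an OSError is raised.

    Hypothetical helper: the name and the defaults are illustrative only.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for remaining in range(attempts - 1, -1, -1):
        try:
            shutil.rmtree(path)
            return
        except OSError:
            if remaining == 0:
                # Out of attempts: surface the last error to the caller.
                raise
            # Give lingering processes time to release their files.
            time.sleep(delay)
```

Unlike the loop in the diff, this variant does not sleep after the final failed attempt.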


def build_mpi_app() -> t.Optional[pathlib.Path]:
    global built_mpi_app
    built_mpi_app = True
    cc = shutil.which("cc")
    if cc is None:
        cc = shutil.which("gcc")
    if cc is None:
        return None

    path_to_src = pathlib.Path(FileUtils().get_test_conf_path("mpi"))
    path_to_out = pathlib.Path(test_output_root) / "apps" / "mpi_app"
    os.makedirs(path_to_out.parent, exist_ok=True)
    cmd = [cc, str(path_to_src / "mpi_hello.c"), "-o", str(path_to_out)]
    proc = subprocess.Popen(cmd)
    proc.wait(timeout=1)
    if proc.returncode == 0:
        return path_to_out
    else:
        return None
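
For comparison, here is a sketch of the same "find a compiler, then build" step written with `subprocess.run` instead of the `Popen`/`wait` pair. It is a swapped-in variant, not the PR's code: the names `find_compiler` and `compile_c` and the 30-second default timeout are assumptions made for this example. A timeout is treated as a failed build rather than propagated as an exception.

```python
import pathlib
import shutil
import subprocess
import typing as t


def find_compiler(candidates: t.Sequence[str] = ("cc", "gcc")) -> t.Optional[str]:
    """Return the path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


def compile_c(
    src: pathlib.Path, out: pathlib.Path, timeout: float = 30.0
) -> t.Optional[pathlib.Path]:
    """Compile `src` into `out`; return `out` on success and None otherwise."""
    cc = find_compiler()
    if cc is None:
        return None

    out.parent.mkdir(parents=True, exist_ok=True)
    try:
        result = subprocess.run(
            [cc, str(src), "-o", str(out)], timeout=timeout, check=False
        )
    except subprocess.TimeoutExpired:
        # Treat a hung compiler like a failed build.
        return None
    return out if result.returncode == 0 else None
```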

@pytest.fixture(scope="session")
def mpi_app_path() -> t.Optional[pathlib.Path]:
    """Return path to MPI app if it was built

    return None if it could not or will not be built
    """
    # declare the global so the assignment below updates the module-level cache
    global mpi_app_exe

    if not CONFIG.test_mpi:
        return None

    # if we already tried to build, return what we have
    if built_mpi_app:
        return mpi_app_exe

    # attempt to build, set global
    mpi_app_exe = build_mpi_app()
    return mpi_app_exe


def kill_all_test_spawned_processes() -> None:
    # in case of test failure, clean up all spawned processes
    pid = os.getpid()
@@ -157,6 +226,7 @@ def kill_all_test_spawned_processes() -> None:
        print("Not all processes were killed after test")



def get_hostlist() -> t.Optional[t.List[str]]:
    global test_hostlist
    if not test_hostlist:
@@ -273,6 +343,12 @@ def get_base_run_settings(
        run_args.update(kwargs)
        settings = RunSettings(exe, args, run_command="srun", run_args=run_args)
        return settings
    if test_launcher == "dragon":
        run_args = {"nodes": nodes, "ntasks": ntasks}
        run_args.update(kwargs)
        settings = DragonRunSettings(exe, args, run_args=run_args)
        return settings
    if test_launcher == "pbs":
        if shutil.which("aprun"):
            run_command = "aprun"
@@ -314,6 +390,11 @@ def get_run_settings(
        run_args = {"nodes": nodes, "ntasks": ntasks, "time": "00:10:00"}
        run_args.update(kwargs)
        return SrunSettings(exe, args, run_args=run_args)
    if test_launcher == "dragon":
        run_args = {"nodes": nodes}
        run_args.update(kwargs)
        settings = DragonRunSettings(exe, args, run_args=run_args)
        return settings
    if test_launcher == "pbs":
        if shutil.which("aprun"):
            run_args = {"pes": ntasks}
@@ -372,6 +453,14 @@ def get_orchestrator(nodes: int = 1, batch: bool = False) -> Orchestrator:
            interface=test_nic,
            launcher=test_launcher,
        )
    if test_launcher == "dragon":
        return Orchestrator(
            db_nodes=nodes,
            port=test_port,
            batch=batch,
            interface=test_nic,
            launcher=test_launcher,
        )
    if test_launcher == "lsf":
        return Orchestrator(
            db_nodes=nodes,
@@ -464,6 +553,14 @@ def environment_cleanup(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.delenv("SSKEYOUT", raising=False)


@pytest.fixture(scope="function", autouse=True)
def check_output_dir() -> None:
    global test_output_dirs
    assert os.path.isdir(test_output_root)
    assert len(os.listdir(test_output_root)) >= test_output_dirs
    test_output_dirs = len(os.listdir(test_output_root))
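
The invariant that `check_output_dir` enforces is that the output root exists and never loses entries from one test to the next. A standalone sketch of that check follows. The class `EntryCountGuard` is hypothetical and only mirrors the fixture's two assertions outside of pytest.

```python
import os


class EntryCountGuard:
    """Check that a directory never loses entries between checks.

    Hypothetical helper mirroring the assertions in the fixture above.
    """

    def __init__(self, root: str) -> None:
        self.root = root
        self.seen = 0  # highest entry count observed so far

    def check(self) -> int:
        """Return the current entry count, or raise if the invariant broke."""
        if not os.path.isdir(self.root):
            raise AssertionError(f"{self.root} is missing")
        count = len(os.listdir(self.root))
        if count < self.seen:
            raise AssertionError(f"entries dropped from {self.seen} to {count}")
        self.seen = count
        return count
```

A failure here would suggest that something removed test output while the session was still running.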


@pytest.fixture
def dbutils() -> t.Type[DBUtils]:
    return DBUtils
@@ -696,6 +793,21 @@ def setup_test_colo(
    return colo_model


@pytest.fixture(scope="function")
def global_dragon_teardown() -> None:
    """Connect to a dragon server started at the path indicated by
    the environment variable SMARTSIM_DRAGON_SERVER_PATH and
    force its shutdown to bring down the runtime and allow a subsequent
    allocation of a new runtime.
    """
    if test_launcher != "dragon" or CONFIG.dragon_server_path is None:
        return
    logger.debug(f"Tearing down Dragon infrastructure, server path: {CONFIG.dragon_server_path}")
    dragon_connector = DragonConnector()
    dragon_connector.ensure_connected()
    dragon_connector.cleanup()


@pytest.fixture
def config() -> Config:
    return CONFIG
23 changes: 23 additions & 0 deletions doc/api/smartsim_api.rst
@@ -60,6 +60,7 @@ Types of Settings:
    MpiexecSettings
    OrterunSettings
    JsrunSettings
    DragonRunSettings
    SbatchSettings
    QsubBatchSettings
    BsubBatchSettings
@@ -163,6 +164,28 @@ and within batch launches (e.g., ``QsubBatchSettings``)
    :members:


.. _dragonsettings_api:

DragonRunSettings
-----------------

``DragonRunSettings`` can be used on systems that support Slurm or
PBS, if Dragon is available in the Python environment (see `_dragon_install`
for instructions on how to install it through ``smart``).

``DragonRunSettings`` can be used in interactive sessions (on allocation)
and within batch launches (i.e. ``SbatchSettings`` or ``QsubBatchSettings``,
for Slurm and PBS sessions, respectively).

.. autosummary::

    DragonRunSettings.set_nodes
    DragonRunSettings.set_tasks_per_node

.. autoclass:: DragonRunSettings
    :inherited-members:
    :undoc-members:
    :members:


.. _jsrun_api:

12 changes: 11 additions & 1 deletion doc/changelog.md
@@ -15,6 +15,8 @@ To be released at some future point in time

Description

- Add dragon runtime installer
- Add launcher based on Dragon
- Fix building of documentation
- Preview entities on experiment before start
- Update authentication in release workflow
@@ -59,7 +61,15 @@ Description
- Fix publishing of development docs

Detailed Notes

- Add `--dragon` option to `smart build`. Install appropriate Dragon
runtime from Dragon GitHub release assets.
([SmartSim-PR580](https://github.com/CrayLabs/SmartSim/pull/580))
- Add new launcher, based on [Dragon](https://dragonhpc.github.io/dragon/doc/_build/html/index.html).
The new launcher is compatible with the Slurm and PBS schedulers and can
be selected by specifying ``launcher="dragon"`` when creating an `Experiment`,
or by using ``DragonRunSettings`` to launch a job. The Dragon launcher
is at an early stage of development: early adopters are referred to the
dedicated documentation section to learn more about it. ([SmartSim-PR580](https://github.com/CrayLabs/SmartSim/pull/580))
- Manually ensure that typing_extensions==4.6.1 in Dockerfile used to build
docs. This fixes the deploy_dev_docs Github action ([SmartSim-PR564](https://github.com/CrayLabs/SmartSim/pull/564))
- Added preview functionality to Experiment, including preview of all entities, active infrastructure and