Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dragon Launcher Prototype #470

Merged
merged 56 commits into from
Apr 4, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
45a074f
First working version
al-rigazzi Jan 24, 2024
10780f5
Add stop to dragon, reshuffle files
al-rigazzi Jan 24, 2024
ad61696
Add reconnect functionality
al-rigazzi Jan 24, 2024
d87e826
Add handshake
al-rigazzi Jan 25, 2024
d7bc190
Add timeout to handshake
al-rigazzi Jan 25, 2024
d756f8e
Add schemas for bootstrap
al-rigazzi Jan 26, 2024
941ae02
Rename bootstrap req/rep
al-rigazzi Jan 26, 2024
81a2219
Fix colocated and entry points
al-rigazzi Jan 29, 2024
5a4b91f
Merge branch 'develop' of https://github.com/CrayLabs/SmartSim into p…
al-rigazzi Jan 29, 2024
41c1f24
Fix typehints
al-rigazzi Jan 30, 2024
a7b211a
Fixes to tests
al-rigazzi Jan 30, 2024
a0ba5ca
Make DragonStep proxiable
al-rigazzi Jan 30, 2024
e32c1f3
Patch some telemetry tests for Dragon
al-rigazzi Jan 30, 2024
7a8adfd
Add Dragon to telemetry-supported launchers
al-rigazzi Jan 30, 2024
8c3bd20
Make style
al-rigazzi Jan 30, 2024
5dce0b3
Trimming whitespace
al-rigazzi Jan 30, 2024
10bc01c
Downgrade pydantic for TF compat issues
al-rigazzi Jan 30, 2024
e16b845
Fix typehint for pydantic 1.x
al-rigazzi Jan 30, 2024
02aaf21
Merge branch 'develop' of https://github.com/CrayLabs/SmartSim into p…
al-rigazzi Jan 31, 2024
bf2e2c4
Fix typehint
al-rigazzi Jan 31, 2024
17007ab
Ignore Dragon type
al-rigazzi Jan 31, 2024
ed0c59c
Strong typing issues
al-rigazzi Jan 31, 2024
045945f
Lint
al-rigazzi Jan 31, 2024
25680e7
Patch circular import
al-rigazzi Jan 31, 2024
59aaeac
Reformat entrypoints
al-rigazzi Jan 31, 2024
baaff54
Fix wrong type for DBModel func
al-rigazzi Jan 31, 2024
4426eae
Typecast
al-rigazzi Jan 31, 2024
153a19d
Guess who's black
al-rigazzi Jan 31, 2024
7f436f4
Fix pydantic
al-rigazzi Jan 31, 2024
a2ea9bf
Fix type error
al-rigazzi Jan 31, 2024
ca8c661
Merge branch 'develop' of https://github.com/CrayLabs/SmartSim into p…
al-rigazzi Jan 31, 2024
3e24f9e
Use ProcessGroup
al-rigazzi Jan 31, 2024
99164ae
Fix for new DragonRunRequest.exe type
al-rigazzi Jan 31, 2024
8de8fce
Fix test for Dragon
al-rigazzi Feb 1, 2024
327f917
Export env
al-rigazzi Feb 1, 2024
c6717ae
Make current env hidden field
al-rigazzi Feb 2, 2024
d1aa1ec
Merge branch 'develop' of https://github.com/CrayLabs/SmartSim into p…
al-rigazzi Feb 2, 2024
8e23a05
Use developer setting, protect logger defaults in test
al-rigazzi Feb 2, 2024
b69708e
Simplify log_level translation logic
al-rigazzi Feb 2, 2024
5071300
Merge branch 'logger_enhancements' into protodrg
al-rigazzi Feb 2, 2024
b1150b4
Fix wrong inheritance
al-rigazzi Feb 2, 2024
a80b2e2
Merge branch 'develop' of https://github.com/CrayLabs/SmartSim into p…
al-rigazzi Feb 8, 2024
a2b1d29
Remove driver file
al-rigazzi Feb 8, 2024
c68da95
Type Saftey for Dragon Backend (#3)
MattToast Feb 28, 2024
4f9f72e
Dragon iteration two, clean history (#5)
al-rigazzi Mar 6, 2024
372e152
Treat `DragonStep`s as a Managed Steps (#6)
MattToast Mar 14, 2024
97292df
Turn ``SchemaSerializer`` into ``SchemaRegistry`` (#8)
MattToast Mar 25, 2024
578225d
Dynamically assign port to dragon bootstrap queue (#11)
ankona Mar 27, 2024
42b756e
Remove Inheritance Style Constraint Stings from Schemas (#9)
MattToast Mar 27, 2024
7dab455
merge develop
ankona Mar 29, 2024
d859e33
Merge branch 'develop' into protodrg
ankona Mar 29, 2024
a8da95a
Post develop-merge fixes (#13)
ankona Mar 29, 2024
fe8c7ca
Add KeyManager for managing curve key pairs (#12)
ankona Apr 2, 2024
232d67f
Dragon policies (#7)
al-rigazzi Apr 3, 2024
730889f
Refactor fix (#15)
ankona Apr 3, 2024
befb6b6
add get_secure_socket method (#16)
ankona Apr 4, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 11 additions & 11 deletions .github/workflows/changelog.yml
Original file line number Diff line number Diff line change
Expand Up @@ -28,22 +28,22 @@

name: enforce_changelog

on:
pull_request:
push:
branches:
- develop
# on:
# pull_request:
# push:
# branches:
# - develop

jobs:
changelog:
name: check_changelog
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v4

- name: Changelog Enforcer
uses: dangoslen/changelog-enforcer@v3.6.0
with:
changeLogPath: './doc/changelog.rst'
missingUpdateErrorMessage: 'changelog.rst has not been updated'
- name: Changelog Enforcer
uses: dangoslen/changelog-enforcer@v3.6.0
with:
changeLogPath: "./doc/changelog.rst"
missingUpdateErrorMessage: "changelog.rst has not been updated"
4 changes: 1 addition & 3 deletions .github/workflows/run_tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -57,12 +57,10 @@ jobs:
os: [macos-12, macos-14, ubuntu-20.04] # Operating systems
compiler: [8] # GNU compiler version
rai: [1.2.7] # Redis AI versions
py_v: ["3.8", "3.9", "3.10", "3.11"] # Python versions
py_v: ["3.9", "3.10", "3.11"] # Python versions
exclude:
- os: macos-14
py_v: "3.9"
- os: macos-14
py_v: "3.8"

env:
SMARTSIM_REDISAI: ${{ matrix.rai }}
Expand Down
123 changes: 119 additions & 4 deletions conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,8 +31,10 @@
import os
import pathlib
import shutil
import subprocess
import sys
import tempfile
import time
import typing as t
import uuid
import warnings
Expand All @@ -43,14 +45,17 @@

import smartsim
from smartsim import Experiment
from smartsim._core.launcher.dragon.dragonLauncher import DragonLauncher, _dragon_cleanup
from smartsim._core.config import CONFIG
from smartsim._core.config.config import Config
from smartsim._core.utils.telemetry.telemetry import JobEntity
from smartsim.database import Orchestrator
from smartsim.entity import Model
from smartsim.error import SSConfigError
from smartsim.log import get_logger
from smartsim.settings import (
AprunSettings,
DragonRunSettings,
JsrunSettings,
MpiexecSettings,
MpirunSettings,
Expand All @@ -59,6 +64,8 @@
SrunSettings,
)

logger = get_logger(__name__)

# pylint: disable=redefined-outer-name,invalid-name,global-statement

# Globals, yes, but its a testing file
Expand All @@ -72,6 +79,9 @@
test_port = CONFIG.test_port
test_account = CONFIG.test_account or ""
test_batch_resources: t.Dict[t.Any, t.Any] = CONFIG.test_batch_resources
test_output_dirs = 0
mpi_app_exe = None
built_mpi_app = False

# Fill this at runtime if needed
test_hostlist = None
Expand Down Expand Up @@ -107,7 +117,7 @@ def print_test_configuration() -> None:

def pytest_configure() -> None:
pytest.test_launcher = test_launcher
pytest.wlm_options = ["slurm", "pbs", "lsf", "pals"]
pytest.wlm_options = ["slurm", "pbs", "lsf", "pals", "dragon"]
account = get_account()
pytest.test_account = account
pytest.test_device = test_device
Expand All @@ -124,6 +134,14 @@ def pytest_sessionstart(
if os.path.isdir(test_output_root):
shutil.rmtree(test_output_root)
os.makedirs(test_output_root)
while not os.path.isdir(test_output_root):
time.sleep(0.1)

if CONFIG.dragon_server_path is None:
dragon_server_path = os.path.join(test_output_root, "dragon_server")
os.makedirs(dragon_server_path)
os.environ["SMARTSIM_DRAGON_SERVER_PATH"] = dragon_server_path

print_test_configuration()


Expand All @@ -135,10 +153,58 @@ def pytest_sessionfinish(
returning the exit status to the system.
"""
if exitstatus == 0:
shutil.rmtree(test_output_root)
cleanup_attempts = 5
while cleanup_attempts > 0:
try:
shutil.rmtree(test_output_root)
except OSError as e:
cleanup_attempts -= 1
time.sleep(1)
if not cleanup_attempts:
raise
else:
break

# kill all spawned processes
kill_all_test_spawned_processes()


def build_mpi_app() -> t.Optional[pathlib.Path]:
global built_mpi_app
built_mpi_app = True
cc = shutil.which("cc")
if cc is None:
cc = shutil.which("gcc")
if cc is None:
return None

path_to_src = pathlib.Path(FileUtils().get_test_conf_path("mpi"))
path_to_out = pathlib.Path(test_output_root) / "apps" / "mpi_app"
os.makedirs(path_to_out.parent, exist_ok=True)
cmd = [cc, str(path_to_src / "mpi_hello.c"), "-o", str(path_to_out)]
proc = subprocess.Popen(cmd)
proc.wait(timeout=1)
if proc.returncode == 0:
return path_to_out
else:
# kill all spawned processes in case of error
kill_all_test_spawned_processes()
return None

@pytest.fixture(scope="session")
def mpi_app_path() -> t.Optional[pathlib.Path]:
"""Return path to MPI app if it was built

return None if it could not or will not be built
"""
if not CONFIG.test_mpi:
return None

# if we already tried to build, return what we have
if built_mpi_app:
return mpi_app_exe

# attempt to build, set global
mpi_app_exe = build_mpi_app()
return mpi_app_exe


def kill_all_test_spawned_processes() -> None:
Expand All @@ -156,6 +222,7 @@ def kill_all_test_spawned_processes() -> None:
print("Not all processes were killed after test")



def get_hostlist() -> t.Optional[t.List[str]]:
global test_hostlist
if not test_hostlist:
Expand Down Expand Up @@ -252,6 +319,11 @@ def get_base_run_settings(
run_args.update(kwargs)
settings = RunSettings(exe, args, run_command="srun", run_args=run_args)
return settings
if test_launcher == "dragon":
run_args = {"nodes": nodes}
run_args.update(kwargs)
settings = RunSettings(exe, args, run_command="", run_args=run_args)
return settings
if test_launcher == "pbs":
if shutil.which("aprun"):
run_command = "aprun"
Expand Down Expand Up @@ -293,6 +365,11 @@ def get_run_settings(
run_args = {"nodes": nodes, "ntasks": ntasks, "time": "00:10:00"}
run_args.update(kwargs)
return SrunSettings(exe, args, run_args=run_args)
if test_launcher == "dragon":
run_args = {"nodes": nodes}
run_args.update(kwargs)
settings = DragonRunSettings(exe, args, run_args=run_args)
return settings
if test_launcher == "pbs":
if shutil.which("aprun"):
run_args = {"pes": ntasks}
Expand Down Expand Up @@ -351,6 +428,14 @@ def get_orchestrator(nodes: int = 1, batch: bool = False) -> Orchestrator:
interface=test_nic,
launcher=test_launcher,
)
if test_launcher == "dragon":
return Orchestrator(
db_nodes=nodes,
port=test_port,
batch=batch,
interface=test_nic,
launcher=test_launcher,
)
if test_launcher == "lsf":
return Orchestrator(
db_nodes=nodes,
Expand Down Expand Up @@ -443,6 +528,14 @@ def environment_cleanup(monkeypatch: pytest.MonkeyPatch) -> None:
monkeypatch.delenv("SSKEYOUT", raising=False)


@pytest.fixture(scope="function", autouse=True)
def check_output_dir() -> None:
global test_output_dirs
assert os.path.isdir(test_output_root)
assert len(os.listdir(test_output_root)) >= test_output_dirs
test_output_dirs = len(os.listdir(test_output_root))


@pytest.fixture
def dbutils() -> t.Type[DBUtils]:
return DBUtils
Expand Down Expand Up @@ -678,6 +771,28 @@ def setup_test_colo(
return colo_model


@pytest.fixture(scope="function")
def global_dragon_teardown() -> None:
"""Connect to a dragon server started at the path indicated by
the environment variable SMARTSIM_DRAGON_SERVER_PATH and
force its shutdown to bring down the runtime and allow a subsequent
allocation of a new runtime.
"""
if test_launcher != "dragon" or CONFIG.dragon_server_path is None:
return
exp_path = os.path.join(test_output_root, "dragon_teardown")
os.makedirs(exp_path, exist_ok=True)
exp: Experiment = Experiment("dragon_shutdown", exp_path=exp_path, launcher=test_launcher)
rs = exp.create_run_settings("sleep", ["0.1"])
model = exp.create_model("dummy", run_settings=rs)
exp.generate(model, overwrite=True)
exp.start(model, block=True)

launcher: DragonLauncher = exp._control._launcher
launcher.cleanup()
time.sleep(5)


@pytest.fixture
def config() -> Config:
return CONFIG
Expand Down
3 changes: 2 additions & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -78,7 +78,7 @@ namespace_packages = true
files = [
"smartsim"
]
plugins = []
plugins = ["pydantic.mypy"]
ignore_errors = false

# Dynamic typing
Expand Down Expand Up @@ -124,6 +124,7 @@ module = [
"torch",
"smartsim.ml.torch.*", # must solve/ignore inheritance issues
"watchdog",
"dragon.*",
]
ignore_missing_imports = true
ignore_errors = true
2 changes: 2 additions & 0 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -169,6 +169,8 @@ def has_ext_modules(_placeholder):
"filelock>=3.4.2",
"protobuf~=3.20",
"watchdog>=3.0.0,<4.0.0",
"pydantic==1.10.14",
"pyzmq>=25.1.2",
]

# Add SmartRedis at specific version
Expand Down
12 changes: 2 additions & 10 deletions smartsim/_core/_cli/validate.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,6 @@
import multiprocessing as mp
import os
import os.path
import socket
import tempfile
import typing as t
from types import TracebackType
Expand All @@ -42,6 +41,7 @@
from smartsim._core._cli.utils import SMART_LOGGER_FORMAT
from smartsim._core._install.builder import Device
from smartsim._core.utils.helpers import installed_redisai_backends
from smartsim._core.utils.network import find_free_port
from smartsim.log import get_logger

logger = get_logger("Smart", fmt=SMART_LOGGER_FORMAT)
Expand Down Expand Up @@ -152,8 +152,8 @@ def test_install(
) -> None:
exp = Experiment("ValidationExperiment", exp_path=location, launcher="local")
exp.telemetry.disable()
port = find_free_port() if port is None else port

port = _find_free_port() if port is None else port
with _make_managed_local_orc(exp, port) as client:
logger.info("Verifying Tensor Transfer")
client.put_tensor("plain-tensor", np.ones((1, 1, 3, 3)))
Expand Down Expand Up @@ -206,14 +206,6 @@ def _make_managed_local_orc(
exp.stop(orc)


def _find_free_port() -> int:
"""A 'good enough' way to find an open port to bind to"""
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
sock.bind(("0.0.0.0", 0))
_, port = sock.getsockname()
return int(port)


def _test_tf_install(client: Client, tmp_dir: str, device: Device) -> None:
recv_conn, send_conn = mp.Pipe(duplex=False)
# Build the model in a subproc so that keras does not hog the gpu
Expand Down
Loading
Loading