Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Application executes before colocated Orchestrator is created #522

Merged
merged 21 commits into from
Mar 21, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions doc/changelog.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ To be released at some future point in time

Description

- Colo Orchestrator setup now blocks application start until setup finished.
- ExecArgs handling correction
- ReadTheDocs config file added and enabled on PRs
- Enforce changelog updates
Expand All @@ -30,6 +31,11 @@ Description

Detailed Notes

- The request to the colocated entrypoints file within the shell script
is now a blocking process. Once the Orchestrator is setup, it returns
which moves the process to the background and allows the application to
start. This prevents the application from requesting a ML model or
script that has not been uploaded to the Orchestrator yet. (SmartSim-PR522_)
- Add checks and tests to ensure SmartSim users cannot initialize run settings
with a list of lists as the exe_args argument. (SmartSim-PR517_)
- Add readthedocs configuration file and enable readthedocs builds
Expand All @@ -55,6 +61,7 @@ Detailed Notes
Slurm and Open MPI. (SmartSim-PR520_)


.. _SmartSim-PR522: https://github.com/CrayLabs/SmartSim/pull/522
.. _SmartSim-PR524: https://github.com/CrayLabs/SmartSim/pull/524
.. _SmartSim-PR520: https://github.com/CrayLabs/SmartSim/pull/520
.. _SmartSim-PR518: https://github.com/CrayLabs/SmartSim/pull/518
Expand Down
19 changes: 13 additions & 6 deletions smartsim/_core/entrypoints/colocated.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@
import tempfile
import typing as t
from pathlib import Path
from subprocess import PIPE, STDOUT
from subprocess import STDOUT
from types import FrameType

import filelock
Expand Down Expand Up @@ -177,6 +177,7 @@ def main(
db_scripts: t.List[t.List[str]],
db_identifier: str,
) -> None:
# pylint: disable=too-many-statements
global DBPID # pylint: disable=global-statement

lo_address = current_ip("lo")
Expand All @@ -201,8 +202,17 @@ def main(
# we generally want to catch all exceptions here as
# if this process dies, the application will most likely fail
try:
process = psutil.Popen(cmd, stdout=PIPE, stderr=STDOUT)
DBPID = process.pid
hostname = socket.gethostname()
filename = (
f"colo_orc_{hostname}.log"
if os.getenv("SMARTSIM_LOG_LEVEL") == "debug"
else os.devnull
)
with open(filename, "w", encoding="utf-8") as file:
process = psutil.Popen(cmd, stdout=file.fileno(), stderr=STDOUT)
DBPID = process.pid
# printing to stdout shell file for extraction
print(f"__PID__{DBPID}__PID__", flush=True)
amandarichardsonn marked this conversation as resolved.
Show resolved Hide resolved

except Exception as e:
cleanup()
Expand Down Expand Up @@ -249,9 +259,6 @@ def launch_db_scripts(client: Client, db_scripts: t.List[t.List[str]]) -> None:
# Make sure we don't keep this around
del client

for line in iter(process.stdout.readline, b""):
print(line.decode("utf-8").rstrip(), flush=True)

except Exception as e:
cleanup()
logger.error(f"Colocated database process failed: {str(e)}")
Expand Down
15 changes: 9 additions & 6 deletions smartsim/_core/launcher/colocated.py
Original file line number Diff line number Diff line change
Expand Up @@ -67,9 +67,14 @@
# STDOUT of the job
if colocated_settings["debug"]:
script_file.write("export SMARTSIM_LOG_LEVEL=debug\n")

script_file.write(f"{colocated_cmd}\n")
script_file.write("DBPID=$!\n\n")
script_file.write(f"db_stdout=$({colocated_cmd})\n")

Check warning on line 70 in smartsim/_core/launcher/colocated.py

View check run for this annotation

Codecov / codecov/patch

smartsim/_core/launcher/colocated.py#L70

Added line #L70 was not covered by tests
# extract and set DBPID within the shell script that is
# enclosed between __PID__ and sent to stdout by the colocated
# entrypoints file
script_file.write(

Check warning on line 74 in smartsim/_core/launcher/colocated.py

View check run for this annotation

Codecov / codecov/patch

smartsim/_core/launcher/colocated.py#L74

Added line #L74 was not covered by tests
"DBPID=$(echo $db_stdout | sed -n "
"'s/.*__PID__\\([0-9]*\\)__PID__.*/\\1/p')\n"
)

# Write the actual launch command for the app
script_file.write("$@\n\n")
Expand Down Expand Up @@ -190,10 +195,8 @@
db_script_cmd = _build_db_script_cmd(db_scripts)
db_cmd.extend(db_script_cmd)

# run colocated db in the background
db_cmd.append("&")

cmd.extend(db_cmd)

return " ".join(cmd)


Expand Down
Loading