-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dragon Launcher Prototype #470
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dragon_launcher #470 +/- ##
===================================================
- Coverage 90.70% 85.21% -5.50%
===================================================
Files 65 75 +10
Lines 4498 5303 +805
===================================================
+ Hits 4080 4519 +439
- Misses 418 784 +366
|
""" | ||
|
||
|
||
def parse_salloc(output: str) -> t.Optional[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid re-creating what the slurm launcher already does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file has not reason to exist. Deleted as part of al-rigazzi#7.
current_env: t.Dict[str, t.Optional[str]] = {} | ||
|
||
def __str__(self) -> str: | ||
return str(DragonRunRequestView.parse_obj(self.dict(exclude={"current_env"}))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a lot of work to get to a built-in!
Could we just use:
return self.model_dump_json(exclude={"current_env"})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid model_dump_json
is part of pydantic 2.X, but we are constrained to 1.X because of TF (iirc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iirc, we could, in theory, do something similar with return self.json(exclude={'current_env'})
for a V1 compatible alternative?
# pylint: disable=multiple-statements | ||
|
||
|
||
class DragonResponse(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider separating the schemas for the bootstrap process? That request isn't really related to dragon, it's a SmartSim control event about a resource being set up. I could see a similar event being raised to indicate that any other SmartSimEntity
has been created...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a field about the pid
which is strictly a dragon thing (allows us to shut it down), so I'm going to push back for a little, and we can think about this later if we believe we can generalize it, which I find an intriguing idea.
smartsim/_core/utils/network.py
Outdated
for interface in available_ifs: | ||
if any(interface.startswith(if_prefix) for if_prefix in known_ifs): | ||
return interface, get_ip_from_interface(interface) | ||
return None, None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we return an actual object instead of a tuple, or even a named tuple. There is no indication of which position in the resulting tuple is the address and which is the interface in this style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I felt like it was half baked. Named tuple it is.
smartsim/_core/utils/redis.py
Outdated
@@ -241,7 +241,8 @@ def shutdown_db_node(host_ip: str, port: int) -> t.Tuple[int, str, str]: # cov- | |||
|
|||
if returncode != 0: | |||
logger.error(out) | |||
logger.error(err) | |||
if err: | |||
logger.error(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the RC is non-zero, we shouldn't care about the error message to log... that just hides that the error occurred.
i think you might just get rid of if err
to make sure we don't hide a problem:
logger.error(f"something happened. rc: {returncode}, err: {err}")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done as part of the other PR.
if not single_cmd: | ||
return single_cmd | ||
|
||
if launcher == "dragon": | ||
return False | ||
|
||
if run_command == "srun" and getenv("SLURM_HET_SIZE") is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test doesn't look like it should live in an orchestrator. Perhaps something in a slurm launcher or a slurm commands module could be imported (or this moved there) so we know where slurm biz logic lives?
@@ -847,6 +855,8 @@ def _get_start_script_args( | |||
] | |||
if cluster: | |||
cmd.append("+cluster") # is the shard part of a cluster | |||
if self.launcher == "dragon": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see a LOT of if launcher == "dragon"
repeats here... can we pull the things that deal w/dragon into a single location? Maybe a DragonStep
would have this line, for example, and we'd do step.get_cmd()
As-is, we've completely coupled an orchestrator script to dragon and this way lies madness!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this very check will go away once Dragon process can redirect output to a file. The one above (single_cmd
) is actually something that will go away once we implement MPMD call for Dragon (which should not be difficult at all, just need to understand how to chain calls or sth similar). May I ask you to postpone this correct criticism to the final review?
smartsim/ml/data.py
Outdated
# If the info was not published, proceed with default parameters | ||
logger.warning( | ||
"Could not retrieve data for DataInfo object, the following " | ||
"values will be kept." | ||
) | ||
logger.debug(f"Original error from Redis was {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the original error should be a debug level message. An exception should always be logged, shouldn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Promoted to error
.
""" | ||
self.run_args["nodes"] = int(nodes) | ||
|
||
def set_hostlist(self, host_list: t.Union[str, t.List[str]]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this host list method is copied in multiple places (e.g. AlpsSettings has the only difference being the key node-list
instead of nodelist
.
Can we move this up into the base class or pull it out so we don't repeat ourselves?
If we need a different key, consider instead adding to the base class:
@abc.abstractmethod
@property
def _nodelist_key(self) -> str:
...
so all implementations are only the difference and the line in set_hostlist
can be:
self.run_args[self._nodelist_key] = ",".join(host_list)
- alternative solution might be to implement the mapping from the standard
self.run_args["nodelist"]
key to the launcher-specific version in theget_cmd
method. that way we don't have to worry about tiny differences in keys and we focus on the mapping of what we have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need it like that in this class, as we don't pass things as strings. Can you open a ticket for the refactor for other RunSettings
-derived classes?
:returns: the formatted string of environment variables | ||
:rtype: list[str] | ||
""" | ||
return [f"{k}={v}" for k, v in self.env_vars.items()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description is not consistent with behavior. it doesn't "build a formatted string" - it builds a list of strings. If we turn around and do ",".join(obj.format_env_vars)
, why not do it here and not return a list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed, but I guess your comment applies to SrunSettings
.
@@ -171,7 +173,7 @@ def create_run_settings( | |||
|
|||
def _detect_command(launcher: str) -> str: | |||
if launcher in by_launcher: | |||
if launcher == "local": | |||
if launcher in ["local", "dragon"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we repeated this change in a few places. is it time to get detect command into a shared place?
pytest.skip( | ||
f"Test only runs on systems with PBS or Slurm as WLM. Current launcher: {launcher}" | ||
f"Test only runs on systems with PBS, Dragon, or Slurm as WLM. Current launcher: {launcher}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about making things that are annoyingly easy to miss unmissable!
supported_launchers = ",".join([key.capitalize() for key in by_launcher.keys()])
f"Test only runs on ...{supported_launchers}. you supplied: {launcher}
|
||
|
||
class DragonRunRequestView(DragonRequest): | ||
exe: NonEmptyStr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add our class level docstrings to explain usage? I'm not sure why this guy is a "view" and the others are not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is half a hack to avoid printing the env everywhere, but I'm not sure it is usable anymore because of the serialization/registry activity. Let's keep this comment around.
Rename `SchemaSerializer` to `SchemaRegistry`. Move message formatting into a dedicated `_Message` class. Rename methods. Move schema coercion to and from `_Message` strings into a `SocketSchemaTranslator` class. Better error handling. [ committed by @MattToast ] [ reviewed by @al-rigazzi @ankona ]
This PR adds a dynamic port assignation for DragonLauncher sockets. The function `smartsim._core._cli.validate._find_free_port` was moved and is now `smartsim._core.utils.network.find_free_port`. [ committed by @ankona ] [ reviewed by @al-rigazzi @MattToast ]
Remove the Inheritance style constraint strings from schemas. Use ``t.Annotated`` style ``pydantic`` constraints. Warning: ``t.Annotated`` was introduced in Python 3.9! This change is incompatible with Python 3.8, but seeing as dragon does not support any python versions less than 3.9 this should not be an issue. If needed, we can consider depending on ``typing_extensions`` to supply ``Annotated``. [ committed by @MattToast ] [ reviewed by @al-rigazzi @ankona ]
* copyright update * handle possibly null objects * formatting * temp disable changelog verify on fork [ committed by @ankona ] [ reviewed by @MattToast ]
add key manager and supporting tests [ committed by @ankona ] [ reviewed by @MattToast @al-rigazzi ]
Adds new policies for MPI-like jobs running on multiple hosts. Also adds queues to DragonBackend and several enhancements to the frontend. [ committed by @al-rigazzi ] [ reviewed by @ankona @MattToast ]
[committed by @ankona ] [reviewed by @al-rigazzi ]
Adds method for producing always-secured sockets in Dragon server and launcher. [ committed by @ankona ] [ reviewed by @al-rigazzi @MattToast ]
This is the first prototype of the new Dragon-based launcher.
Issues to fix (or to defer to a future PR):