Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dragon Launcher Prototype #470

Merged
merged 56 commits into from
Apr 4, 2024
Merged

Conversation

al-rigazzi
Copy link
Collaborator

@al-rigazzi al-rigazzi commented Jan 30, 2024

This is the first prototype of the new Dragon-based launcher.

Issues to fix (or to defer to a future PR):

  • Several telemetry test errors
  • Missing batch settings for Dragon launcher
  • Use Process Groups for MPI-based applications - we have process groups but Dragon team is working on a more complete API

Copy link

codecov bot commented Jan 31, 2024

Codecov Report

Attention: Patch coverage is 56.24242% with 361 lines in your changes are missing coverage. Please review.

Project coverage is 85.21%. Comparing base (6f800b1) to head (730889f).

❗ Current head 730889f differs from pull request most recent head befb6b6. Consider uploading reports for the commit befb6b6 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@                 Coverage Diff                 @@
##           dragon_launcher     #470      +/-   ##
===================================================
- Coverage            90.70%   85.21%   -5.50%     
===================================================
  Files                   65       75      +10     
  Lines                 4498     5303     +805     
===================================================
+ Hits                  4080     4519     +439     
- Misses                 418      784     +366     
Files Coverage Δ
smartsim/_core/launcher/__init__.py 100.00% <100.00%> (ø)
smartsim/_core/launcher/colocated.py 97.82% <100.00%> (+0.02%) ⬆️
smartsim/_core/launcher/step/__init__.py 100.00% <100.00%> (ø)
smartsim/_core/launcher/step/step.py 100.00% <ø> (ø)
smartsim/_core/schemas/__init__.py 100.00% <100.00%> (ø)
smartsim/_core/schemas/dragonResponses.py 100.00% <100.00%> (ø)
smartsim/_core/schemas/utils.py 100.00% <100.00%> (ø)
smartsim/_core/utils/helpers.py 91.96% <100.00%> (ø)
smartsim/_core/utils/redis.py 80.76% <100.00%> (+0.37%) ⬆️
smartsim/experiment.py 85.80% <100.00%> (ø)
... and 19 more

... and 1 file with indirect coverage changes

"""


def parse_salloc(output: str) -> t.Optional[str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid re-creating what the slurm launcher already does?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has not reason to exist. Deleted as part of al-rigazzi#7.

current_env: t.Dict[str, t.Optional[str]] = {}

def __str__(self) -> str:
return str(DragonRunRequestView.parse_obj(self.dict(exclude={"current_env"})))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a lot of work to get to a built-in!

Could we just use:

return self.model_dump_json(exclude={"current_env"})

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid model_dump_json is part of pydantic 2.X, but we are constrained to 1.X because of TF (iirc)

Copy link
Member

@MattToast MattToast Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc, we could, in theory, do something similar with return self.json(exclude={'current_env'}) for a V1 compatible alternative?

# pylint: disable=multiple-statements


class DragonResponse(BaseModel):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider separating the schemas for the bootstrap process? That request isn't really related to dragon, it's a SmartSim control event about a resource being set up. I could see a similar event being raised to indicate that any other SmartSimEntity has been created...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a field about the pid which is strictly a dragon thing (allows us to shut it down), so I'm going to push back for a little, and we can think about this later if we believe we can generalize it, which I find an intriguing idea.

for interface in available_ifs:
if any(interface.startswith(if_prefix) for if_prefix in known_ifs):
return interface, get_ip_from_interface(interface)
return None, None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we return an actual object instead of a tuple, or even a named tuple. There is no indication of which position in the resulting tuple is the address and which is the interface in this style.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I felt like it was half baked. Named tuple it is.

@@ -241,7 +241,8 @@ def shutdown_db_node(host_ip: str, port: int) -> t.Tuple[int, str, str]: # cov-

if returncode != 0:
logger.error(out)
logger.error(err)
if err:
logger.error(err)
Copy link
Contributor

@ankona ankona Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the RC is non-zero, we shouldn't care about the error message to log... that just hides that the error occurred.

i think you might just get rid of if err to make sure we don't hide a problem:

logger.error(f"something happened. rc: {returncode}, err: {err}")

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done as part of the other PR.

if not single_cmd:
return single_cmd

if launcher == "dragon":
return False

if run_command == "srun" and getenv("SLURM_HET_SIZE") is not None:
Copy link
Contributor

@ankona ankona Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't look like it should live in an orchestrator. Perhaps something in a slurm launcher or a slurm commands module could be imported (or this moved there) so we know where slurm biz logic lives?

@@ -847,6 +855,8 @@ def _get_start_script_args(
]
if cluster:
cmd.append("+cluster") # is the shard part of a cluster
if self.launcher == "dragon":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see a LOT of if launcher == "dragon" repeats here... can we pull the things that deal w/dragon into a single location? Maybe a DragonStep would have this line, for example, and we'd do step.get_cmd()

As-is, we've completely coupled an orchestrator script to dragon and this way lies madness!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this very check will go away once Dragon process can redirect output to a file. The one above (single_cmd) is actually something that will go away once we implement MPMD call for Dragon (which should not be difficult at all, just need to understand how to chain calls or sth similar). May I ask you to postpone this correct criticism to the final review?

# If the info was not published, proceed with default parameters
logger.warning(
"Could not retrieve data for DataInfo object, the following "
"values will be kept."
)
logger.debug(f"Original error from Redis was {e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the original error should be a debug level message. An exception should always be logged, shouldn't it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Promoted to error.

"""
self.run_args["nodes"] = int(nodes)

def set_hostlist(self, host_list: t.Union[str, t.List[str]]) -> None:
Copy link
Contributor

@ankona ankona Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this host list method is copied in multiple places (e.g. AlpsSettings has the only difference being the key node-list instead of nodelist.

Can we move this up into the base class or pull it out so we don't repeat ourselves?

If we need a different key, consider instead adding to the base class:

@abc.abstractmethod
@property
def _nodelist_key(self) -> str:
  ...

so all implementations are only the difference and the line in set_hostlist can be:

self.run_args[self._nodelist_key] = ",".join(host_list)

  • alternative solution might be to implement the mapping from the standard self.run_args["nodelist"] key to the launcher-specific version in the get_cmd method. that way we don't have to worry about tiny differences in keys and we focus on the mapping of what we have

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need it like that in this class, as we don't pass things as strings. Can you open a ticket for the refactor for other RunSettings-derived classes?

:returns: the formatted string of environment variables
:rtype: list[str]
"""
return [f"{k}={v}" for k, v in self.env_vars.items()]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

description is not consistent with behavior. it doesn't "build a formatted string" - it builds a list of strings. If we turn around and do ",".join(obj.format_env_vars), why not do it here and not return a list?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, but I guess your comment applies to SrunSettings.

@@ -171,7 +173,7 @@ def create_run_settings(

def _detect_command(launcher: str) -> str:
if launcher in by_launcher:
if launcher == "local":
if launcher in ["local", "dragon"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we repeated this change in a few places. is it time to get detect command into a shared place?

pytest.skip(
f"Test only runs on systems with PBS or Slurm as WLM. Current launcher: {launcher}"
f"Test only runs on systems with PBS, Dragon, or Slurm as WLM. Current launcher: {launcher}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about making things that are annoyingly easy to miss unmissable!

supported_launchers = ",".join([key.capitalize() for key in by_launcher.keys()])
f"Test only runs on ...{supported_launchers}. you supplied: {launcher}



class DragonRunRequestView(DragonRequest):
exe: NonEmptyStr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add our class level docstrings to explain usage? I'm not sure why this guy is a "view" and the others are not

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is half a hack to avoid printing the env everywhere, but I'm not sure it is usable anymore because of the serialization/registry activity. Let's keep this comment around.

MattToast and others added 10 commits March 25, 2024 15:34
Rename `SchemaSerializer` to `SchemaRegistry`. Move message formatting into a
dedicated `_Message` class. Rename methods. Move schema coercion to and from
`_Message` strings into a `SocketSchemaTranslator` class. Better error
handling.

  [ committed by @MattToast ]
  [ reviewed by @al-rigazzi @ankona ]
This PR adds a dynamic port assignation for DragonLauncher sockets.
The function `smartsim._core._cli.validate._find_free_port` was moved
and is now `smartsim._core.utils.network.find_free_port`.

[ committed by @ankona ]
[ reviewed by @al-rigazzi @MattToast ]
Remove the Inheritance style constraint strings from schemas. Use
``t.Annotated`` style ``pydantic`` constraints.

Warning:
``t.Annotated`` was introduced in Python 3.9! This change is incompatible with
Python 3.8, but seeing as dragon does not support any python versions less than
3.9 this should not be an issue. If needed, we can consider depending on
``typing_extensions`` to supply ``Annotated``.

[ committed by @MattToast ]
[ reviewed by @al-rigazzi @ankona ]
* copyright update
* handle possibly null objects
* formatting
* temp disable changelog verify on fork

[ committed by @ankona ]
[ reviewed by @MattToast ]
add key manager and supporting tests

[ committed by @ankona ]
[ reviewed by @MattToast @al-rigazzi ]
Adds new policies for MPI-like jobs running on multiple hosts. Also adds queues to DragonBackend and several enhancements to the frontend.

[ committed by @al-rigazzi ]
[ reviewed by @ankona @MattToast ]
[committed by @ankona ]
[reviewed by @al-rigazzi ]
Adds method for producing always-secured sockets in Dragon server and launcher.

[ committed by @ankona ]
[ reviewed by @al-rigazzi @MattToast ]
@al-rigazzi al-rigazzi merged commit 9de7044 into CrayLabs:dragon_launcher Apr 4, 2024
27 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants