Dragon Launcher Prototype #470

al-rigazzi · 2024-01-30T16:13:20Z

This is the first prototype of the new Dragon-based launcher.

Issues to fix (or to defer to a future PR):

Several telemetry test errors
Missing batch settings for Dragon launcher
Use Process Groups for MPI-based applications - we have process groups but Dragon team is working on a more complete API


codecov · 2024-01-31T10:36:47Z

Codecov Report

Attention: Patch coverage is 56.24242%, with 361 lines in your changes missing coverage. Please review.

Project coverage is 85.21%. Comparing base (6f800b1) to head (730889f).

❗ Current head 730889f differs from pull request most recent head befb6b6. Consider uploading reports for the commit befb6b6 to get more accurate results

Additional details and impacted files

@@                 Coverage Diff                 @@
##           dragon_launcher     #470      +/-   ##
===================================================
- Coverage            90.70%   85.21%   -5.50%     
===================================================
  Files                   65       75      +10     
  Lines                 4498     5303     +805     
===================================================
+ Hits                  4080     4519     +439     
- Misses                 418      784     +366

Files	Coverage Δ
smartsim/_core/launcher/__init__.py	`100.00% <100.00%> (ø)`
smartsim/_core/launcher/colocated.py	`97.82% <100.00%> (+0.02%)`	⬆️
smartsim/_core/launcher/step/__init__.py	`100.00% <100.00%> (ø)`
smartsim/_core/launcher/step/step.py	`100.00% <ø> (ø)`
smartsim/_core/schemas/__init__.py	`100.00% <100.00%> (ø)`
smartsim/_core/schemas/dragonResponses.py	`100.00% <100.00%> (ø)`
smartsim/_core/schemas/utils.py	`100.00% <100.00%> (ø)`
smartsim/_core/utils/helpers.py	`91.96% <100.00%> (ø)`
smartsim/_core/utils/redis.py	`80.76% <100.00%> (+0.37%)`	⬆️
smartsim/experiment.py	`85.80% <100.00%> (ø)`
... and 19 more

... and 1 file with indirect coverage changes

ankona · 2024-03-19T22:20:10Z

smartsim/_core/launcher/dragon/dragonParser.py

+"""
+
+
+def parse_salloc(output: str) -> t.Optional[str]:


Can we avoid re-creating what the slurm launcher already does?

This file has no reason to exist. Deleted as part of al-rigazzi#7.

ankona · 2024-03-19T22:27:57Z

smartsim/_core/schemas/dragonRequests.py

+    current_env: t.Dict[str, t.Optional[str]] = {}
+
+    def __str__(self) -> str:
+        return str(DragonRunRequestView.parse_obj(self.dict(exclude={"current_env"})))


This seems like a lot of work to get to a built-in!

Could we just use:

return self.model_dump_json(exclude={"current_env"})

I'm afraid model_dump_json is part of pydantic 2.X, but we are constrained to 1.X because of TF (iirc)

iirc, we could, in theory, do something similar with return self.json(exclude={'current_env'}) for a V1 compatible alternative?

ankona · 2024-03-19T22:33:15Z

smartsim/_core/schemas/dragonResponses.py

+# pylint: disable=multiple-statements
+
+
+class DragonResponse(BaseModel):


Should we consider separating the schemas for the bootstrap process? That request isn't really related to dragon, it's a SmartSim control event about a resource being set up. I could see a similar event being raised to indicate that any other SmartSimEntity has been created...

I added a field about the pid which is strictly a dragon thing (allows us to shut it down), so I'm going to push back for a little, and we can think about this later if we believe we can generalize it, which I find an intriguing idea.

smartsim/_core/schemas/types.py

smartsim/_core/schemas/utils.py

smartsim/_core/utils/helpers.py

ankona · 2024-03-19T22:50:40Z

smartsim/_core/utils/network.py

+    for interface in available_ifs:
+        if any(interface.startswith(if_prefix) for if_prefix in known_ifs):
+            return interface, get_ip_from_interface(interface)
+    return None, None


Could we return an actual object instead of a tuple, or even a named tuple? In this style, nothing indicates which position in the resulting tuple is the address and which is the interface.

Yeah, I felt like it was half baked. Named tuple it is.
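
A minimal sketch of the named-tuple return agreed on in this thread. `InterfaceAddress`, `pick_interface`, and the `lookup_ip` callback are hypothetical names introduced for illustration; the helper that actually landed in `smartsim/_core/utils/network.py` may look different.

```python
import typing as t


class InterfaceAddress(t.NamedTuple):
    """Hypothetical result type: fields are named instead of positional."""

    interface: t.Optional[str]
    address: t.Optional[str]


def pick_interface(
    available_ifs: t.Sequence[str],
    known_ifs: t.Sequence[str],
    lookup_ip: t.Callable[[str], str],
) -> InterfaceAddress:
    """Return the first available interface whose name starts with a known prefix."""
    for interface in available_ifs:
        if any(interface.startswith(prefix) for prefix in known_ifs):
            return InterfaceAddress(interface, lookup_ip(interface))
    # Nothing matched: both fields are None, but they are still named.
    return InterfaceAddress(None, None)


# Stand-in lookup table so the example runs without touching real devices.
fake_ips = {"lo": "127.0.0.1", "eth0": "10.0.0.5"}
result = pick_interface(["lo", "eth0"], ["eth", "hsn"], fake_ips.__getitem__)
print(result.interface, result.address)
```

Callers that already unpack the result as `iface, ip = ...` keep working, since a `NamedTuple` is still a tuple.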

ankona · 2024-03-19T22:53:02Z

smartsim/_core/utils/redis.py

@@ -241,7 +241,8 @@ def shutdown_db_node(host_ip: str, port: int) -> t.Tuple[int, str, str]:  # cov-

    if returncode != 0:
        logger.error(out)
-        logger.error(err)
+        if err:
+            logger.error(err)


if the RC is non-zero, we shouldn't make logging depend on whether there is an error message... that just hides that the error occurred.

i think you might just get rid of if err to make sure we don't hide a problem:

logger.error(f"something happened. rc: {returncode}, err: {err}")

Thanks, done as part of the other PR.
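
A sketch of the unconditional logging suggested above, assuming a helper shaped roughly like the tail of `shutdown_db_node`. `failure_message` and `report_shutdown` are hypothetical names, and the message wording is illustrative only.

```python
import logging

logger = logging.getLogger("shutdown_example")


def failure_message(returncode: int, err: str) -> str:
    """Build the log line; the return code is always present."""
    return f"shutdown failed. rc: {returncode}, err: {err!r}"


def report_shutdown(returncode: int, out: str, err: str) -> None:
    """Log a failed shutdown even when stderr is empty."""
    if returncode != 0:
        logger.error(out)
        # No `if err:` guard here: an empty stderr must not hide the failure.
        logger.error(failure_message(returncode, err))
```

Using `{err!r}` makes an empty error string visible as `''` instead of a blank at the end of the line.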

ankona · 2024-03-19T22:56:42Z

smartsim/database/orchestrator.py

    if not single_cmd:
        return single_cmd

+    if launcher == "dragon":
+        return False
+
    if run_command == "srun" and getenv("SLURM_HET_SIZE") is not None:


This test doesn't look like it should live in an orchestrator. Perhaps something in a slurm launcher or a slurm commands module could be imported (or this moved there) so we know where slurm biz logic lives?

ankona · 2024-03-19T22:59:30Z

smartsim/database/orchestrator.py

@@ -847,6 +855,8 @@ def _get_start_script_args(
        ]
        if cluster:
            cmd.append("+cluster")  # is the shard part of a cluster
+        if self.launcher == "dragon":


i see a LOT of if launcher == "dragon" repeats here... can we pull the things that deal w/dragon into a single location? Maybe a DragonStep would have this line, for example, and we'd do step.get_cmd()

As-is, we've completely coupled an orchestrator script to dragon and this way lies madness!

So, this very check will go away once Dragon process can redirect output to a file. The one above (single_cmd) is actually something that will go away once we implement MPMD call for Dragon (which should not be difficult at all, just need to understand how to chain calls or sth similar). May I ask you to postpone this correct criticism to the final review?
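
For the final review, one possible shape for the decoupling requested above is sketched here. `ExampleStep`, `ExampleDragonStep`, `launcher_args`, and the placeholder flag are all hypothetical; they are not SmartSim's step classes, and the flag is not a real Dragon or SmartSim argument.

```python
import typing as t


class ExampleStep:
    """Hypothetical base step: owns any launcher-specific command tweaks."""

    def __init__(self, base_cmd: t.Sequence[str]) -> None:
        self._base_cmd = list(base_cmd)

    def launcher_args(self) -> t.List[str]:
        # Default: no launcher-specific additions.
        return []

    def get_cmd(self) -> t.List[str]:
        return self._base_cmd + self.launcher_args()


class ExampleDragonStep(ExampleStep):
    def launcher_args(self) -> t.List[str]:
        # Placeholder flag for illustration only.
        return ["+hypothetical_dragon_flag"]


def build_start_cmd(step: ExampleStep) -> t.List[str]:
    # The caller no longer needs an `if launcher == "dragon"` check.
    return step.get_cmd()
```

With this layout, dropping the Dragon-specific behavior later means deleting one override rather than hunting for string comparisons.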

ankona · 2024-03-19T23:03:15Z

smartsim/ml/data.py

            # If the info was not published, proceed with default parameters
            logger.warning(
                "Could not retrieve data for DataInfo object, the following "
                "values will be kept."
            )
+            logger.debug(f"Original error from Redis was {e}")


I don't think the original error should be a debug level message. An exception should always be logged, shouldn't it?

Promoted to error.

smartsim/settings/dragonRunSettings.py

ankona · 2024-03-19T23:12:03Z

smartsim/settings/dragonRunSettings.py

+        """
+        self.run_args["nodes"] = int(nodes)
+
+    def set_hostlist(self, host_list: t.Union[str, t.List[str]]) -> None:


It looks like this host list method is copied in multiple places (e.g. in AlpsSettings, the only difference is the key node-list instead of nodelist).

Can we move this up into the base class or pull it out so we don't repeat ourselves?

If we need a different key, consider instead adding to the base class:

@property
@abc.abstractmethod
def _nodelist_key(self) -> str: ...

so all implementations are only the difference and the line in set_hostlist can be:

self.run_args[self._nodelist_key] = ",".join(host_list)

alternative solution might be to implement the mapping from the standard self.run_args["nodelist"] key to the launcher-specific version in the get_cmd method. that way we don't have to worry about tiny differences in keys and we focus on the mapping of what we have

I think we don't need it like that in this class, as we don't pass things as strings. Can you open a ticket for the refactor for other RunSettings-derived classes?
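
For the follow-up ticket, the abstract-key idea from this thread could look roughly like the sketch below. `BaseRunSettings`, `SlurmLikeSettings`, and `AlpsLikeSettings` are hypothetical stand-ins, not SmartSim's actual settings classes. Note that `@abc.abstractmethod` is applied as the innermost decorator, as the `abc` documentation requires when combining it with `@property`.

```python
import abc
import typing as t


class BaseRunSettings(abc.ABC):
    """Hypothetical base class holding the shared set_hostlist logic."""

    def __init__(self) -> None:
        self.run_args: t.Dict[str, t.Union[int, str]] = {}

    @property
    @abc.abstractmethod
    def _nodelist_key(self) -> str:
        """Launcher-specific key under which the host list is stored."""

    def set_hostlist(self, host_list: t.Union[str, t.List[str]]) -> None:
        if isinstance(host_list, str):
            host_list = [host_list.strip()]
        if not all(isinstance(host, str) for host in host_list):
            raise TypeError("host_list argument must be a list of strings")
        self.run_args[self._nodelist_key] = ",".join(host_list)


class SlurmLikeSettings(BaseRunSettings):
    @property
    def _nodelist_key(self) -> str:
        return "nodelist"


class AlpsLikeSettings(BaseRunSettings):
    @property
    def _nodelist_key(self) -> str:
        return "node-list"
```

Each subclass then carries only the one-line difference, and the validation in `set_hostlist` lives in a single place.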

ankona · 2024-03-19T23:14:35Z

smartsim/settings/dragonRunSettings.py

+        :returns: the formatted string of environment variables
+        :rtype: list[str]
+        """
+        return [f"{k}={v}" for k, v in self.env_vars.items()]


description is not consistent with behavior. it doesn't "build a formatted string" - it builds a list of strings. If we turn around and do ",".join(obj.format_env_vars), why not do it here and not return a list?

Removed, but I guess your comment applies to SrunSettings.

ankona · 2024-03-19T23:15:18Z

smartsim/settings/settings.py

@@ -171,7 +173,7 @@ def create_run_settings(

    def _detect_command(launcher: str) -> str:
        if launcher in by_launcher:
-            if launcher == "local":
+            if launcher in ["local", "dragon"]:


we repeated this change in a few places. is it time to get detect command into a shared place?

ankona · 2024-03-19T23:19:08Z

tests/on_wlm/test_containers_wlm.py

        pytest.skip(
-            f"Test only runs on systems with PBS or Slurm as WLM. Current launcher: {launcher}"
+            f"Test only runs on systems with PBS, Dragon, or Slurm as WLM. Current launcher: {launcher}"


how about making things that are annoyingly easy to miss unmissable!

supported_launchers = ",".join([key.capitalize() for key in by_launcher.keys()])
f"Test only runs on ...{supported_launchers}. you supplied: {launcher}"
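
A runnable sketch of that suggestion. `skip_reason` is a hypothetical helper, and the launcher names are passed in explicitly because the real `by_launcher` mapping may contain entries (such as `local`) that this test does not support.

```python
import typing as t


def skip_reason(launcher: str, supported: t.Iterable[str]) -> str:
    """Build the pytest.skip message from the list of supported launchers."""
    supported_launchers = ", ".join(key.capitalize() for key in supported)
    return (
        f"Test only runs on systems with one of: {supported_launchers}. "
        f"Current launcher: {launcher}"
    )


print(skip_reason("local", ["pbs", "dragon", "slurm"]))
```

One caveat: `str.capitalize()` turns `pbs` into `Pbs`, not `PBS`, so a small display-name mapping may read better than capitalizing the keys.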

ankona · 2024-03-20T20:57:40Z

smartsim/_core/schemas/dragonRequests.py

+
+
+class DragonRunRequestView(DragonRequest):
+    exe: NonEmptyStr


Could we add our class level docstrings to explain usage? I'm not sure why this guy is a "view" and the others are not

It is half a hack to avoid printing the env everywhere, but I'm not sure it is usable anymore because of the serialization/registry activity. Let's keep this comment around.

@MattToast

Rename `SchemaSerializer` to `SchemaRegistry`. Move message formatting into a dedicated `_Message` class. Rename methods. Move schema coercion to and from `_Message` strings into a `SocketSchemaTranslator` class. Better error handling. [ committed by @MattToast ] [ reviewed by @al-rigazzi @ankona ]

@ankona

This PR adds dynamic port assignment for DragonLauncher sockets. The function `smartsim._core._cli.validate._find_free_port` was moved and is now `smartsim._core.utils.network.find_free_port`. [ committed by @ankona ] [ reviewed by @al-rigazzi @MattToast ]
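
For readers unfamiliar with the idea, one common way to find a free port is to bind to port 0 and let the OS choose. The sketch below illustrates that general technique only; it is an assumption, not the body of SmartSim's `find_free_port`, whose signature and strategy may differ.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() returns (address, port) for AF_INET sockets.
        return int(sock.getsockname()[1])


port = find_free_port()
print(port)
```

The port is released when the socket closes, so another process could claim it before the caller binds again; callers should be prepared to retry.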

@MattToast

Remove the Inheritance style constraint strings from schemas. Use ``t.Annotated`` style ``pydantic`` constraints. Warning: ``t.Annotated`` was introduced in Python 3.9! This change is incompatible with Python 3.8, but seeing as dragon does not support any python versions less than 3.9 this should not be an issue. If needed, we can consider depending on ``typing_extensions`` to supply ``Annotated``. [ committed by @MattToast ] [ reviewed by @al-rigazzi @ankona ]

@ankona

* copyright update
* handle possibly null objects
* formatting
* temp disable changelog verify on fork

[ committed by @ankona ] [ reviewed by @MattToast ]

@ankona

add key manager and supporting tests [ committed by @ankona ] [ reviewed by @MattToast @al-rigazzi ]

@al-rigazzi

Adds new policies for MPI-like jobs running on multiple hosts. Also adds queues to DragonBackend and several enhancements to the frontend. [ committed by @al-rigazzi ] [ reviewed by @ankona @MattToast ]

@ankona

[committed by @ankona ] [reviewed by @al-rigazzi ]

@ankona

Adds method for producing always-secured sockets in Dragon server and launcher. [ committed by @ankona ] [ reviewed by @al-rigazzi @MattToast ]

al-rigazzi added 25 commits January 24, 2024 09:43

First working version

45a074f

Add stop to dragon, reshuffle files

10780f5

Add reconnect functionality

ad61696

Add handshake

d87e826

Add timeout to handshake

d7bc190

Add schemas for bootstrap

d756f8e

Rename bootstrap req/rep

941ae02

Fix colocated and entry points

81a2219

Merge branch 'develop' of https://github.com/CrayLabs/SmartSim into p…

5a4b91f

…rotodrg

Fix typehints

41c1f24

Fixes to tests

a7b211a

Make DragonStep proxiable

a0ba5ca

Patch some telemetry tests for Dragon

e32c1f3

Add Dragon to telemetry-supported launchers

7a8adfd

Make style

8c3bd20

Trimming whitespace

5dce0b3

Downgrade pydantic for TF compat issues

10bc01c

Fix typehint for pydantic 1.x

e16b845

Merge branch 'develop' of https://github.com/CrayLabs/SmartSim into p…

02aaf21

…rotodrg

Fix typehint

bf2e2c4

Ignore Dragon type

17007ab

Strong typing issues

ed0c59c

Lint

045945f

Patch circular import

25680e7

Reformat entrypoints

59aaeac

al-rigazzi added 4 commits January 31, 2024 05:01

Fix wrong type for DBModel func

baaff54

Typecast

4426eae

Guess who's black

153a19d

Fix pydantic

7f436f4