Co-located Orchestrator #139
Merged
The scaling tests used the old Experiment.summary() method to obtain job data. Since we decided to refactor that method to remove the pandas requirement, get_job_data was introduced to obtain data from the Controller instance. We intentionally hide job data from the user, which is why the Controller is a private variable in the Experiment, but this method will be expanded later to provide more job data (with filters) once we have a reliable database that persists across experiment runs. Experiment._launch_summary() was also refactored to use the logger instead of prints and to handle the co-located Model case.
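The exact call shape of get_job_data is not shown in this summary, so the sketch below is only a guess at the intended usage and should be read as hypothetical:

```python
from smartsim import Experiment

exp = Experiment("scaling-tests", launcher="slurm")
# ... create and start entities ...

# Hypothetical call shape: fetch the job data tracked by the private Controller.
job_data = exp.get_job_data()  # assumed to return a mapping of job name -> data
```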
For the local launcher, the colocated DB process was not being killed correctly. We now trap and clean up the database process in the bash script, so we can be sure that regardless of the exit code, the database will be terminated.
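The actual cleanup lives in the generated bash launch script as a trap; the sketch below shows the same pattern in Python terms (the redis-server command line is a placeholder):

```python
import atexit
import signal
import subprocess
import sys

# Start the database as a child process (placeholder command line).
db = subprocess.Popen(["redis-server", "--port", "6379"])

def _cleanup():
    # Terminate the database no matter how the wrapper exits.
    if db.poll() is None:
        db.terminate()
        db.wait()

atexit.register(_cleanup)
# Route SIGINT/SIGTERM through a normal exit so the atexit handler runs.
for sig in (signal.SIGINT, signal.SIGTERM):
    signal.signal(sig, lambda signum, frame: sys.exit(1))
```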
Add SbatchSettings.set_cpus_per_task, which appears to have been mistakenly removed.
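For reference, a minimal sketch of the restored setter (the batch argument values here are illustrative):

```python
from smartsim.settings import SbatchSettings

batch = SbatchSettings(nodes=1, time="01:00:00")
batch.set_cpus_per_task(4)  # sets --cpus-per-task for the sbatch submission
```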
Adds one test for colocated models and removes some tests that were unnecessary. We also spotted that the repr methods were being overridden in the base classes of some entities, which they shouldn't be, so that was removed as well.
This option turns off the DB log file, which results in a massive performance boost on shared (networked) file systems.
Launching colocated models with MpirunSettings is now supported as of this commit.
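A minimal sketch of what this enables (the executable name and task count are illustrative):

```python
from smartsim import Experiment
from smartsim.settings import MpirunSettings

exp = Experiment("colo-mpi", launcher="local")

rs = MpirunSettings(exe="./my_app")  # placeholder application binary
rs.set_tasks(8)

model = exp.create_model("colo_mpi_model", rs)
model.colocate_db(port=6379, db_cpus=1)
exp.start(model)
```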
This commit introduces the interrupt strategy needed to handle jobs and tasks during a keyboard interrupt. Follow-ups will include a parameter to kill all tasks on interrupt. This commit is primarily geared towards informing users of colocated tasks that may still be running when an interrupt is triggered. A few changes were also made to the colocated launch script, as it seems psutil does not support CPU pinning on macOS.
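The gist of the strategy, sketched generically here (not the PR's actual implementation): catch the interrupt and report tasks that may still be running instead of silently orphaning them.

```python
import subprocess

# Stand-in for launched tasks (e.g., an application plus its colocated DB).
tasks = {"colocated_model": subprocess.Popen(["sleep", "30"])}

try:
    for proc in tasks.values():
        proc.wait()
except KeyboardInterrupt:
    # Inform the user of tasks that may still be running after the interrupt.
    still_running = [name for name, proc in tasks.items() if proc.poll() is None]
    if still_running:
        print(f"Interrupted: these tasks may still be running: {still_running}")
```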
Add orchestrator methods
Labels:
- area: orchestrator (Issues related to the Orchestrator API, launch, and runtime)
- area: settings (Issues related to Batch or Run settings)
- type: feature (Issues that include a feature request or feature idea)
This PR introduces a long-awaited feature: database colocation. With this feature, users can launch their workload on HPC systems with a single Redis/KeyDB shard placed on each compute node their application uses. This is specifically geared towards tightly coupled, performant, online inference.
As the database is not clustered, locality must be taken into consideration when using this approach. Each MPI rank interacts with the data local to the compute node it is running on. This approach is called locality-based inference. Typically, in SmartSim workflows that use online inference, such as the one described in https://arxiv.org/abs/2104.09355, each MPI rank performs inference with data local to that rank, hence locality-based inference.
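From the application side, each rank simply connects to the shard on its own node. A minimal SmartRedis sketch (the address, key names, and exact client keywords are illustrative and may differ by version):

```python
import numpy as np
from smartredis import Client

# Each rank connects to the non-clustered shard on its own compute node.
client = Client(address="127.0.0.1:6379", cluster=False)

# Stage rank-local data and run inference against the local shard.
# Assumes a model was stored earlier under the key "trained_model".
client.put_tensor("rank_input", np.random.rand(1, 16).astype(np.float32))
client.run_model("trained_model", inputs=["rank_input"], outputs=["rank_output"])
output = client.get_tensor("rank_output")
```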
A new method, Model.colocate_db, can be used to add a co-located orchestrator to any instantiated model (the full interface is shown in the example below). This feature is accomplished by launching through a SmartSim entrypoint, a new concept introduced in this PR. A SmartSim entrypoint is a Python module with a main function for starting or performing a specific task. The entrypoint used to start a colocated database is invoked with python -m smartsim._core.entrypoints.colocated. The typical Orchestrator and Ray starter scripts have also been migrated to entrypoints.
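Conceptually, an entrypoint is just a module runnable with python -m. A minimal sketch of the pattern (not the actual smartsim._core.entrypoints.colocated source):

```python
"""Sketch of a SmartSim-style entrypoint module."""
import argparse

def main() -> None:
    # Parse the arguments the launcher passes to the entrypoint.
    parser = argparse.ArgumentParser(description="start a task")
    parser.add_argument("--port", type=int, default=6379)
    args = parser.parse_args()
    # ... start the database (or other task) here ...
    print(f"starting task on port {args.port}")

if __name__ == "__main__":
    main()
```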
During Step creation, a wrapper script is generated that starts the database through this entrypoint, pins it to the requested number of CPUs (db_cpus), and can optionally restrict the application to the remaining CPUs. A full example of creating a colocated database model is as follows:
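A sketch of that example (the executable is a placeholder, and keyword names for colocate_db come from this PR, so they may differ slightly in later versions):

```python
from smartsim import Experiment
from smartsim.settings import SrunSettings

exp = Experiment("colo-example", launcher="slurm")

# Run settings for the application itself.
rs = SrunSettings(exe="./my_app")  # placeholder binary
rs.set_nodes(1)
rs.set_tasks(16)

model = exp.create_model("colocated_model", rs)

# Place a single, non-clustered DB shard on the model's compute node.
model.colocate_db(
    port=6379,            # port the shard listens on
    db_cpus=1,            # CPUs pinned to the database
    limit_app_cpus=True,  # keep the application off the DB CPUs
    ifname="lo",          # network interface the shard binds to
)

exp.start(model, block=True)
```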
This PR builds on the refactor work done in #134. The changes there are:
- Instead of SlurmOrchestrator(), users will instantiate Orchestrator(launcher='slurm') or even Orchestrator(launcher='auto') (see the sketch after this list).
- A new argument, single_cmd, is used to launch all shards with a single command, using the MPMD mechanism available for every run command (srun, jsrun, aprun, mpirun).
- The launcher-specific Orchestrator classes are kept as thin wrappers that call the base __init__ function with the correct launcher arg.
- Experiment.create_database can be used to create an Orchestrator, similar to what can be done to create an ensemble, a model, and so on.
- The Orchestrator.set_hosts function now only sets the hosts on each DBNode (or on each MPMD run settings), but not on the Orchestrator itself. The reason is that the host of each DBNode is what is used to launch the corresponding Redis server, whereas the name or address needed by the Orchestrator is the one linked to the interface the Redis server is bound to: if they differ, setting the Orchestrator host to the name of the node can result in an error. We now always rely on parsing the output from the redisstarter.py script.
- Parsing the redisstarter.py output for the MPMD case now has two possibilities: either look for N IP addresses in one output (where N is the number of shards), or look for N output files, each containing one IP address. Most run commands rely on the first mechanism; LSF relies on the second one.
- The shard count is not computed as len(Orchestrator) anymore, as this would not work for MPMD instances. We assign the shard count explicitly.
- database_per_host is now removed, and only one DB per node is supported.

TODO:
- Handle the LSF-specific arguments (gpus_per_shard and cpus_per_shard).
- Support colocated launch with the remaining run commands:
  - mpirun
  - jsrun
  - aprun
- Update Experiment._launch_summary with regards to colocated settings.
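A sketch of the new instantiation paths (the database size and port are illustrative):

```python
from smartsim import Experiment
from smartsim.database import Orchestrator

exp = Experiment("db-refactor-example", launcher="auto")

# Launcher-agnostic instantiation introduced in #134.
db = Orchestrator(launcher="slurm", db_nodes=3, port=6379)

# Equivalently, let the Experiment build it, mirroring create_model/create_ensemble:
# db = exp.create_database(db_nodes=3, port=6379)

exp.start(db)
exp.stop(db)
```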