Drop ray support and delete references in docs #263

Merged
merged 2 commits into from
Mar 1, 2023
Changes from 1 commit
2 changes: 1 addition & 1 deletion .github/workflows/run_tests.yml
@@ -105,7 +105,7 @@ jobs:
- name: Install SmartSim (with ML backends)
run: |
python -m pip install git+https://github.com/CrayLabs/SmartRedis.git@develop#egg=smartredis
python -m pip install .[dev,ml,ray]
python -m pip install .[dev,ml]

- name: Install ML Runtimes with Smart
if: contains( matrix.os, 'macos' )
50 changes: 0 additions & 50 deletions README.md
@@ -69,8 +69,6 @@ exchanged between applications at runtime without the utilization of MPI.
- [Local Launch](#local-launch)
- [Interactive Launch](#interactive-launch)
- [Batch Launch](#batch-launch)
- [Ray](#ray)
- [Ray on HPC](#ray-on-hpc)
- [SmartRedis](#smartredis)
- [Tensors](#tensors)
- [Datasets](#datasets)
@@ -284,7 +282,6 @@ initialization. Local launching does not support batch workloads.

# Infrastructure Library Applications
- Orchestrator - In-memory data store and Machine Learning Inference (Redis + RedisAI)
- Ray - Distributed Reinforcement Learning (RL), Hyperparameter Optimization (HPO)

## Redis + RedisAI

@@ -398,53 +395,6 @@ exp.stop(db_cluster)
python run_db_batch.py
```

-----
## Ray

Ray is a distributed computation framework that supports a number of applications
- RLlib - Distributed Reinforcement Learning (RL)
- RaySGD - Distributed Training
- Ray Tune - Hyperparameter Optimization (HPO)
- Ray Serve - ML/DL inference
As well as other integrations with frameworks like Modin, Mars, Dask, and Spark.

Historically, Ray has not been well supported on HPC systems. A few examples exist,
but none are well maintained. Because SmartSim already has launchers for HPC systems,
launching Ray through SmartSim is a relatively simple task.

### Ray on HPC

Below is an example of how to launch a Ray cluster on an HPC system and connect to it.
In this example, we set `batch=True`, which means that the cluster will be started
requesting an allocation through the scheduler (Slurm, PBS, etc). If this code
is run within a sufficiently large interactive allocation, setting `batch=False`
will spin the Ray cluster on the allocated nodes.

```Python
import ray

from smartsim import Experiment
from smartsim.exp.ray import RayCluster

exp = Experiment("ray-cluster", launcher='auto')
# 3 workers + 1 head node = 4 node-cluster
cluster = RayCluster(name="ray-cluster", run_args={},
ray_args={"num-cpus": 24},
launcher='auto', num_nodes=4, batch=True)

exp.generate(cluster, overwrite=True)
exp.start(cluster, block=False, summary=True)

# Connect to the Ray cluster
ctx = ray.init(f"ray://{cluster.get_head_address()}:10001")

# <run Ray tune, RLlib, HPO...>
```

*New in 0.4.0* the auto argument enables the Ray Cluster to be launched
across scheduler types. Both batch launch and interactive launch commands
will be automatically detected and used by SmartSim.

------
# SmartRedis

17 changes: 0 additions & 17 deletions doc/api/smartsim_api.rst
@@ -538,20 +538,3 @@ Slurm
.. automodule:: smartsim.slurm
:members:


Ray
===

.. currentmodule:: smartsim.exp.ray

.. _ray_api:

``RayCluster`` is used to launch a Ray cluster
and can be launched as a batch or in an interactive allocation.

.. autoclass:: RayCluster
:show-inheritance:
:members:
:inherited-members:
:undoc-members:
:exclude-members: batch set_path type
4 changes: 4 additions & 0 deletions doc/changelog.rst
@@ -26,6 +26,7 @@ Description
- Fix bug in colocated database entrypoint when loading PyTorch models
- Add support for RedisAI 1.2.7, pyTorch 1.11.0, Tensorflow 2.8.0, ONNXRuntime 1.11.1
- Allow for models to be launched independently as batch jobs
- Drop support for Ray

Detailed Notes

@@ -38,6 +39,9 @@ Detailed Notes
satisfied, the `Experiment` will attempt to wrap the underlying run command in a batch job using
the object referenced at `Model.batch_settings` as the batch settings for the job. If the check
is not satisfied, the `Model` is launched in the traditional manner as a job step. (PR245_)
- The support for Ray was dropped, as its most recent versions caused problems when deployed through SmartSim.
We plan to release a separate add-on library to accomplish the same results. If
you are interested in getting the Ray launch functionality back in your workflow, please get in touch with us!
Member

Just following the pattern of the changelog.rst, should we link this PR here?

Collaborator Author

Totally, I did not have the link until it was published :)


.. _PR255: https://github.com/CrayLabs/SmartSim/pull/258
.. _PR245: https://github.com/CrayLabs/SmartSim/pull/245
2 changes: 0 additions & 2 deletions doc/index.rst
@@ -23,8 +23,6 @@
tutorials/online_analysis/lattice/online_analysis
tutorials/ml_inference/Inference-in-SmartSim
tutorials/ml_training/surrogate/train_surrogate
tutorials/ray/starting_ray


.. toctree::
:maxdepth: 2
3 changes: 0 additions & 3 deletions doc/installation.rst
@@ -124,11 +124,8 @@ can request their installation through the ``ml`` flag as follows:
.. code-block:: bash

pip install smartsim[ml]
# add ray extra if you would like to use ray with SmartSim as well
pip install smartsim[ml,ray]
# or if using ZSH
pip install smartsim\[ml\]
pip install smartsim\[ml,ray\]


At this point, SmartSim is installed and can be used for more basic features.
1 change: 0 additions & 1 deletion docker/prod/Dockerfile
@@ -52,5 +52,4 @@ RUN python -m pip install smartsim[ml]==0.4.1 jupyter jupyterlab matplotlib && \
rm -rf ~/.cache/pip

# remove non-jupyter notebook tutorials
RUN rm -rf /home/craylabs/tutorials/ray
CMD ["/bin/bash", "-c", "PATH=/home/craylabs/.local/bin:$PATH /home/craylabs/.local/bin/jupyter lab --port 8888 --no-browser --ip=0.0.0.0"]
3 changes: 0 additions & 3 deletions setup.py
@@ -184,9 +184,6 @@ def has_ext_modules(_placeholder):
],
# see smartsim/_core/_install/buildenv.py for more details
"ml": versions.ml_extras_required(),
"ray": [
"ray==1.6",
],
}


4 changes: 1 addition & 3 deletions smartsim/_core/control/controller.py
@@ -290,15 +290,13 @@ def _launch(self, manifest):
raise SmartSimError(msg)
self._launch_orchestrator(orchestrator)

for rc in manifest.ray_clusters: # cov-wlm
rc._update_workers()

if self.orchestrator_active:
self._set_dbobjects(manifest)

# create all steps prior to launch
steps = []
all_entity_lists = manifest.ensembles + manifest.ray_clusters
all_entity_lists = manifest.ensembles
for elist in all_entity_lists:
if elist.batch:
batch_step = self._create_batch_job_step(elist)
29 changes: 11 additions & 18 deletions smartsim/_core/control/manifest.py
@@ -27,13 +27,12 @@
from ...database import Orchestrator
from ...entity import EntityList, SmartSimEntity
from ...error import SmartSimError
from ...exp.ray import RayCluster
from ..utils.helpers import fmt_dict

# List of types derived from EntityList which require specific behavior
# A corresponding property needs to exist (like db for Orchestrator),
# otherwise they will not be accessible
entity_list_exception_types = [Orchestrator, RayCluster]
entity_list_exception_types = [Orchestrator]


class Manifest:
@@ -51,6 +50,7 @@ def __init__(self, *args):
self._check_names(self._deployables)
self._check_entity_lists_nonempty()


@property
def db(self):
"""Return Orchestrator instances in Manifest
@@ -69,6 +69,7 @@ def db(self):
_db = deployable
return _db


@property
def models(self):
"""Return Model instances in Manifest
@@ -82,6 +83,7 @@ def models(self):
_models.append(deployable)
return _models


@property
def ensembles(self):
"""Return Ensemble instances in Manifest
@@ -101,34 +103,23 @@ def ensembles(self):

return _ensembles

@property
def ray_clusters(self):
"""Return all RayCluster instances in Manifest

:return: list of RayCluster instances
:rtype: List[RayCluster]
"""
_ray_cluster = []
for deployable in self._deployables:
if isinstance(deployable, RayCluster):
_ray_cluster.append(deployable)
return _ray_cluster

@property
def all_entity_lists(self):
"""All entity lists, including ensembles and
exceptional ones like Orchestrator and RayCluster
exceptional ones like Orchestrator

:return: list of entity lists
:rtype: List[EntityList]
"""
_all_entity_lists = self.ray_clusters + self.ensembles
_all_entity_lists = self.ensembles
db = self.db
if db is not None:
_all_entity_lists.append(db)

return _all_entity_lists


def _check_names(self, deployables):
used = []
for deployable in deployables:
@@ -139,6 +130,7 @@ def _check_names(self, deployables):
raise SmartSimError("User provided two entities with the same name")
used.append(name)


def _check_types(self, deployables):
for deployable in deployables:
if not (
@@ -149,13 +141,15 @@ def _check_types(self, deployables):
f"Entity has type {type(deployable)}, not SmartSimEntity or EntityList"
)


def _check_entity_lists_nonempty(self):
"""Check deployables for sanity before launching"""

for entity_list in self.all_entity_lists:
if len(entity_list) < 1:
raise ValueError(f"{entity_list.name} is empty. Nothing to launch.")


def __str__(self):
s = ""
e_header = "=== Ensembles ===\n"
@@ -164,8 +158,7 @@ def __str__(self):
if self.ensembles:
s += e_header

# include ray clusters as an ensemble while still in experimental API
all_ensembles = self.ensembles + self.ray_clusters
all_ensembles = self.ensembles
for ensemble in all_ensembles:
s += f"{ensemble.name}\n"
s += f"Members: {len(ensemble)}\n"
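
For readers following the `manifest.py` changes, here is a minimal, self-contained sketch of how the type-filtering properties behave once `RayCluster` is gone. It does not import SmartSim: `EntityList`, `Ensemble`, `Orchestrator`, and `Manifest` below are simplified stand-ins written for illustration, and the body of `ensembles` is an assumption, since the diff hides most of that property. Only `entity_list_exception_types` and the shape of `all_entity_lists` are taken from the hunks above.

```python
class EntityList:
    """Simplified stand-in for SmartSim's EntityList (illustration only)."""

    def __init__(self, name, members):
        self.name = name
        self.entities = list(members)

    def __len__(self):
        return len(self.entities)


class Ensemble(EntityList):
    """Stand-in for an ordinary entity list."""


class Orchestrator(EntityList):
    """Stand-in for the database entity list."""


# As in the diff: Orchestrator is now the only exceptional EntityList type.
entity_list_exception_types = [Orchestrator]


class Manifest:
    """Stand-in Manifest holding the objects passed to an experiment call."""

    def __init__(self, *deployables):
        self._deployables = list(deployables)

    @property
    def db(self):
        # Return the Orchestrator if one was passed in, otherwise None.
        for deployable in self._deployables:
            if isinstance(deployable, Orchestrator):
                return deployable
        return None

    @property
    def ensembles(self):
        # Assumed logic: every EntityList that is not an exceptional type.
        exceptional = tuple(entity_list_exception_types)
        return [
            d for d in self._deployables
            if isinstance(d, EntityList) and not isinstance(d, exceptional)
        ]

    @property
    def all_entity_lists(self):
        # Mirrors the post-PR property: ensembles first, then the database.
        lists = self.ensembles
        db = self.db
        if db is not None:
            lists.append(db)
        return lists


manifest = Manifest(Ensemble("ens", ["m1", "m2"]), Orchestrator("db", ["shard0"]))
ensemble_names = [elist.name for elist in manifest.ensembles]
all_names = [elist.name for elist in manifest.all_entity_lists]
print(ensemble_names, all_names)
```

Because `ensembles` builds a fresh list on every access, appending the database inside `all_entity_lists` does not leak into later calls to `ensembles`.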