CrayLabs · al-rigazzi · Mar 1, 2023 · Feb 25, 2023 · Feb 28, 2023 · MattToast
diff --git a/.github/workflows/run_tests.yml b/.github/workflows/run_tests.yml
@@ -105,7 +105,7 @@ jobs:
       - name: Install SmartSim (with ML backends)
         run: |
           python -m pip install git+https://github.com/CrayLabs/SmartRedis.git@develop#egg=smartredis
-          python -m pip install .[dev,ml,ray]
+          python -m pip install .[dev,ml]
 
       - name: Install ML Runtimes with Smart
         if: contains( matrix.os, 'macos' )

diff --git a/README.md b/README.md
@@ -69,8 +69,6 @@ exchanged between applications at runtime without the utilization of MPI.
     - [Local Launch](#local-launch)
     - [Interactive Launch](#interactive-launch)
     - [Batch Launch](#batch-launch)
-  - [Ray](#ray)
-    - [Ray on HPC](#ray-on-hpc)
 - [SmartRedis](#smartredis)
   - [Tensors](#tensors)
   - [Datasets](#datasets)
@@ -284,7 +282,6 @@ initialization. Local launching does not support batch workloads.
 
 # Infrastructure Library Applications
  - Orchestrator - In-memory data store and Machine Learning Inference (Redis + RedisAI)
- - Ray - Distributed Reinforcement Learning (RL), Hyperparameter Optimization (HPO)
 
 ## Redis + RedisAI
 
@@ -398,53 +395,6 @@ exp.stop(db_cluster)
 python run_db_batch.py
 ```
 
------
-## Ray
-
-Ray is a distributed computation framework that supports a number of applications
- - RLlib - Distributed Reinforcement Learning (RL)
- - RaySGD - Distributed Training
- - Ray Tune - Hyperparameter Optimization (HPO)
- - Ray Serve - ML/DL inference
-As well as other integrations with frameworks like Modin, Mars, Dask, and Spark.
-
-Historically, Ray has not been well supported on HPC systems. A few examples exist,
-but none are well maintained. Because SmartSim already has launchers for HPC systems,
-launching Ray through SmartSim is a relatively simple task.
-
-### Ray on HPC
-
-Below is an example of how to launch a Ray cluster on an HPC system and connect to it.
-In this example, we set `batch=True`, which means that the cluster will be started
-requesting an allocation through the scheduler (Slurm, PBS, etc). If this code
-is run within a sufficiently large interactive allocation, setting `batch=False`
-will spin the Ray cluster on the allocated nodes.
-
-```Python
-import ray
-
-from smartsim import Experiment
-from smartsim.exp.ray import RayCluster
-
-exp = Experiment("ray-cluster", launcher='auto')
-# 3 workers + 1 head node = 4 node-cluster
-cluster = RayCluster(name="ray-cluster", run_args={},
-                     ray_args={"num-cpus": 24},
-                     launcher='auto', num_nodes=4, batch=True)
-
-exp.generate(cluster, overwrite=True)
-exp.start(cluster, block=False, summary=True)
-
-# Connect to the Ray cluster
-ctx = ray.init(f"ray://{cluster.get_head_address()}:10001")
-
-# <run Ray tune, RLlib, HPO...>
-```
-
-*New in 0.4.0* the auto argument enables the Ray Cluster to be launched
-across scheduler types. Both batch launch and interactive launch commands
-will be automatically detected and used by SmartSim.
-
 ------
 # SmartRedis
 

diff --git a/doc/api/smartsim_api.rst b/doc/api/smartsim_api.rst
@@ -538,20 +538,3 @@ Slurm
 .. automodule:: smartsim.slurm
     :members:
 
-
-Ray
-===
-
-.. currentmodule:: smartsim.exp.ray
-
-.. _ray_api:
-
-``RayCluster`` is used to launch a Ray cluster
- and can be launched as a batch or in an interactive allocation.
-
-.. autoclass:: RayCluster
-    :show-inheritance:
-    :members:
-    :inherited-members:
-    :undoc-members:
-    :exclude-members: batch set_path type
diff --git a/doc/changelog.rst b/doc/changelog.rst
@@ -26,6 +26,7 @@ Description
 - Fix bug in colocated database entrypoint when loading PyTorch models
 - Add support for RedisAI 1.2.7, pyTorch 1.11.0, Tensorflow 2.8.0, ONNXRuntime 1.11.1
 - Allow for models to be launched independently as batch jobs
+- Drop support for Ray
 
 Detailed Notes
 
@@ -38,6 +39,9 @@ Detailed Notes
   satisfied, the `Experiment` will attempt to wrap the underlying run command in a batch job using
   the object referenced at `Model.batch_settings` as the batch settings for the job. If the check
   is not satisfied, the `Model` is launched in the traditional manner as a job step. (PR245_)
+- Support for Ray was dropped because its most recent versions caused problems when deployed through SmartSim.
+  We plan to release a separate add-on library that provides the same functionality. If
+  you would like the Ray launch functionality back in your workflow, please get in touch with us!
 
 .. _PR255: https://github.com/CrayLabs/SmartSim/pull/258
 .. _PR245: https://github.com/CrayLabs/SmartSim/pull/245

diff --git a/doc/index.rst b/doc/index.rst
@@ -23,8 +23,6 @@
    tutorials/online_analysis/lattice/online_analysis
    tutorials/ml_inference/Inference-in-SmartSim
    tutorials/ml_training/surrogate/train_surrogate
-   tutorials/ray/starting_ray
-
 
 .. toctree::
    :maxdepth: 2

diff --git a/doc/installation.rst b/doc/installation.rst
@@ -124,11 +124,8 @@ can request their installation through the ``ml`` flag as follows:
 .. code-block:: bash
 
     pip install smartsim[ml]
-    # add ray extra if you would like to use ray with SmartSim as well
-    pip install smartsim[ml,ray]
     # or if using ZSH
     pip install smartsim\[ml\]
-    pip install smartsim\[ml,ray\]
 
 
 At this point, SmartSim is installed and can be used for more basic features.

diff --git a/docker/prod/Dockerfile b/docker/prod/Dockerfile
@@ -52,5 +52,4 @@ RUN python -m pip install smartsim[ml]==0.4.1 jupyter jupyterlab matplotlib && \
     rm -rf ~/.cache/pip
 
 # remove non-jupyter notebook tutorials
-RUN rm -rf /home/craylabs/tutorials/ray
 CMD ["/bin/bash", "-c", "PATH=/home/craylabs/.local/bin:$PATH /home/craylabs/.local/bin/jupyter lab --port 8888 --no-browser --ip=0.0.0.0"]
diff --git a/setup.py b/setup.py
@@ -184,9 +184,6 @@ def has_ext_modules(_placeholder):
     ],
     # see smartsim/_core/_install/buildenv.py for more details
     "ml": versions.ml_extras_required(),
-    "ray": [
-        "ray==1.6",
-    ],
 }
 
 

diff --git a/smartsim/_core/control/controller.py b/smartsim/_core/control/controller.py
@@ -290,15 +290,13 @@ def _launch(self, manifest):
                 raise SmartSimError(msg)
             self._launch_orchestrator(orchestrator)
 
-        for rc in manifest.ray_clusters:  # cov-wlm
-            rc._update_workers()
 
         if self.orchestrator_active:
             self._set_dbobjects(manifest)
 
         # create all steps prior to launch
         steps = []
-        all_entity_lists = manifest.ensembles + manifest.ray_clusters
+        all_entity_lists = manifest.ensembles
         for elist in all_entity_lists:
             if elist.batch:
                 batch_step = self._create_batch_job_step(elist)

diff --git a/smartsim/_core/control/manifest.py b/smartsim/_core/control/manifest.py
@@ -27,13 +27,12 @@
 from ...database import Orchestrator
 from ...entity import EntityList, SmartSimEntity
 from ...error import SmartSimError
-from ...exp.ray import RayCluster
 from ..utils.helpers import fmt_dict
 
 # List of types derived from EntityList which require specific behavior
 # A corresponding property needs to exist (like db for Orchestrator),
 # otherwise they will not be accessible
-entity_list_exception_types = [Orchestrator, RayCluster]
+entity_list_exception_types = [Orchestrator]
 
 
 class Manifest:
@@ -51,6 +50,7 @@ def __init__(self, *args):
         self._check_names(self._deployables)
         self._check_entity_lists_nonempty()
 
+
     @property
     def db(self):
         """Return Orchestrator instances in Manifest
@@ -69,6 +69,7 @@ def db(self):
                 _db = deployable
         return _db
 
+
     @property
     def models(self):
         """Return Model instances in Manifest
@@ -82,6 +83,7 @@ def models(self):
                 _models.append(deployable)
         return _models
 
+
     @property
     def ensembles(self):
         """Return Ensemble instances in Manifest
@@ -101,34 +103,23 @@ def ensembles(self):
 
         return _ensembles
 
-    @property
-    def ray_clusters(self):
-        """Return all RayCluster instances in Manifest
-
-        :return: list of RayCluster instances
-        :rtype: List[RayCluster]
-        """
-        _ray_cluster = []
-        for deployable in self._deployables:
-            if isinstance(deployable, RayCluster):
-                _ray_cluster.append(deployable)
-        return _ray_cluster
 
     @property
     def all_entity_lists(self):
         """All entity lists, including ensembles and
-        exceptional ones like Orchestrator and RayCluster
+        exceptional ones like Orchestrator
 
         :return: list of entity lists
         :rtype: List[EntityList]
         """
-        _all_entity_lists = self.ray_clusters + self.ensembles
+        _all_entity_lists = self.ensembles
         db = self.db
         if db is not None:
             _all_entity_lists.append(db)
 
         return _all_entity_lists
 
+
     def _check_names(self, deployables):
         used = []
         for deployable in deployables:
@@ -139,6 +130,7 @@ def _check_names(self, deployables):
                 raise SmartSimError("User provided two entities with the same name")
             used.append(name)
 
+
     def _check_types(self, deployables):
         for deployable in deployables:
             if not (
@@ -149,13 +141,15 @@ def _check_types(self, deployables):
                     f"Entity has type {type(deployable)}, not SmartSimEntity or EntityList"
                 )
 
+
     def _check_entity_lists_nonempty(self):
         """Check deployables for sanity before launching"""
 
         for entity_list in self.all_entity_lists:
             if len(entity_list) < 1:
                 raise ValueError(f"{entity_list.name} is empty. Nothing to launch.")
 
+
     def __str__(self):
         s = ""
         e_header = "=== Ensembles ===\n"
@@ -164,8 +158,7 @@ def __str__(self):
         if self.ensembles:
             s += e_header
 
-            # include ray clusters as an ensemble while still in experimental API
-            all_ensembles = self.ensembles + self.ray_clusters
+            all_ensembles = self.ensembles
             for ensemble in all_ensembles:
                 s += f"{ensemble.name}\n"
                 s += f"Members: {len(ensemble)}\n"
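
Review note: the `manifest.py` hunks above leave `Orchestrator` as the only exceptional `EntityList` type, so `all_entity_lists` is now the ensembles plus the optional database. The following is a minimal, self-contained sketch of that behavior. The `EntityList`, `Ensemble`, and `Orchestrator` classes are hypothetical stand-ins rather than the real SmartSim classes, and the body of `ensembles` is an assumption, since the diff does not show it in full.

```python
class EntityList(list):
    """Hypothetical stand-in for smartsim.entity.EntityList."""

    def __init__(self, name, members=()):
        super().__init__(members)
        self.name = name


class Ensemble(EntityList):
    """Hypothetical stand-in for an ordinary entity list."""


class Orchestrator(EntityList):
    """Hypothetical stand-in for the database entity list."""


# After this change, Orchestrator is the only exceptional type.
entity_list_exception_types = [Orchestrator]


class Manifest:
    def __init__(self, *deployables):
        self._deployables = list(deployables)

    @property
    def db(self):
        # Return the last Orchestrator found, or None.
        _db = None
        for deployable in self._deployables:
            if isinstance(deployable, Orchestrator):
                _db = deployable
        return _db

    @property
    def ensembles(self):
        # Assumed logic: every EntityList that is not an exceptional type.
        exceptional = tuple(entity_list_exception_types)
        return [
            d
            for d in self._deployables
            if isinstance(d, EntityList) and not isinstance(d, exceptional)
        ]

    @property
    def all_entity_lists(self):
        # `ensembles` builds a fresh list on every access, so appending
        # the database here does not leak into later `ensembles` calls.
        _all_entity_lists = self.ensembles
        db = self.db
        if db is not None:
            _all_entity_lists.append(db)
        return _all_entity_lists


ens = Ensemble("ensemble", ["model_0", "model_1"])
db = Orchestrator("orchestrator", ["shard_0"])
manifest = Manifest(ens, db)
entity_lists = manifest.all_entity_lists
```

In this sketch `entity_lists` holds the ensemble followed by the database, and `manifest.ensembles` still returns only the ensemble afterwards. Any caller that still reads `manifest.ray_clusters` would fail with an `AttributeError`, which is why the controller hunk drops that access as well.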