forked from ray-project/ray
Sync upstream #518
Merged
+95,942
−22,273
…ct#50591)

## Why are these changes needed?

Writing data that contains a tensor column to `ParquetDatasink` with partition column(s) fails because pyarrow has no `hash_list` kernel for tensor types. The current implementation uses pyarrow `groupby.aggregate`, and aggregate doesn't have kernels for tensor types (see the snippet below). This PR rewrites the implementation so that it does not aggregate the non-partition columns, which avoids the issue.

```
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

tensor_type = pa.fixed_shape_tensor(pa.int32(), [2, 2])
x = {"category": ["a", "b"] * 10, "tensor": list(np.random.random((20, 128)))}
schema = pa.schema(
    [
        ("category", pa.dictionary(pa.int32(), pa.string())),
        ("tensor", pa.fixed_shape_tensor(value_type=pa.float32(), shape=(128,))),
    ]
)
t = pa.Table.from_pydict(x, schema=schema)
# Raises ArrowNotImplementedError: no `hash_list` kernel for the tensor type.
t.group_by("category").aggregate([("tensor", "list")])
```

```
Traceback (most recent call last):
  File "/Users/praveengorthy/anyscale/rayturbo/python/ray/data/test_dataset.py", line 63, in <module>
    t.group_by("category").aggregate([("tensor", "list")])
  File "pyarrow/table.pxi", line 5562, in pyarrow.lib.TableGroupBy.aggregate
  File "/opt/miniconda3/lib/python3.9/site-packages/pyarrow/acero.py", line 308, in _group_by
    return decl.to_table(use_threads=use_threads)
  File "pyarrow/_acero.pyx", line 511, in pyarrow._acero.Declaration.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'hash_list' has no kernel matching input types (extension<arrow.fixed_shape_tensor[value_type=float, shape=[128]]>, uint32)
```

## Related issue number

Closes ray-project#50506

Signed-off-by: Praveen Gorthy <praveeng@anyscale.com>
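The idea of splitting rows by partition value without aggregating the remaining columns can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the `ParquetDatasink` implementation: `group_row_indices` and `split_by_partition` are hypothetical helpers, and rows are plain dicts rather than Arrow tables.

```python
from collections import defaultdict
from typing import Any, Dict, Hashable, List, Sequence, Tuple

Row = Dict[str, Any]
Key = Tuple[Hashable, ...]


def group_row_indices(rows: Sequence[Row], partition_cols: Sequence[str]) -> Dict[Key, List[int]]:
    """Map each distinct partition key to the indices of the rows that carry it.

    Only the partition columns are inspected, so the other columns (for
    example a tensor column) never need to be hashed or aggregated.
    """
    groups: Dict[Key, List[int]] = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row[col] for col in partition_cols)
        groups[key].append(i)
    return dict(groups)


def split_by_partition(rows: Sequence[Row], partition_cols: Sequence[str]) -> Dict[Key, List[Row]]:
    """Return {partition key: rows}, selecting whole rows by index."""
    return {
        key: [rows[i] for i in indices]
        for key, indices in group_row_indices(rows, partition_cols).items()
    }


rows = [
    {"category": "a", "tensor": [[1, 2], [3, 4]]},
    {"category": "b", "tensor": [[5, 6], [7, 8]]},
    {"category": "a", "tensor": [[9, 10], [11, 12]]},
]
parts = split_by_partition(rows, ["category"])
```

Because the tensor values are only copied by index and never passed to a grouping kernel, this style of partitioning does not depend on the column type.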
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
…alls (don't serialize lambda for each single call). (ray-project#50527)
so that we can use the image for running related release tests. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ect#50517) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…y-project#50592) Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
…earners=1 (GPU); Remote Learner is always faster now. (ray-project#50600)
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
…0610) for running release tests Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Remove python/ray/serve/ directory during core tests to enforce componentization. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
closes ray-project#45288 Signed-off-by: Hao Chen <chenh1024@gmail.com>
…y-project#50435) ray-project#49317 initiated the decoupling of Ray Train and Ray Tune top-level APIs. This PR updates all of the internal usage in Ray Tune examples and tests to switch from `ray.air` (super out-dated) and `ray.train` imports to `ray.tune` imports instead. See ray-project/enhancements#57 for context around the changes. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
strict check on script execution, and properly install the dependencies Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ray-project#50597)

## Why are these changes needed?

This fixes an issue @erictang000 ran into while using the uv runtime env hook with https://github.com/hiyouga/LLaMA-Factory. LLaMA Factory modifies `sys.argv` before calling `ray.init` (and therefore before the hook runs), which broke the original logic. We now read the command line from `/proc/{pid}/cmdline` instead, and add a test that the hook tolerates changes to `sys.argv`.
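A minimal sketch of reading the command line from `/proc/{pid}/cmdline` rather than `sys.argv` is shown below. The helper names `parse_cmdline` and `original_argv` are hypothetical, and the fallback to `sys.argv` on platforms without `/proc` is an assumption of this sketch, not something the PR description states.

```python
import os
import sys
from typing import List


def parse_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments."""
    # Each argument is terminated by a NUL byte, so drop one trailing NUL
    # before splitting to avoid a spurious empty last argument.
    if raw.endswith(b"\x00"):
        raw = raw[:-1]
    if not raw:
        return []
    return [arg.decode("utf-8", errors="replace") for arg in raw.split(b"\x00")]


def original_argv(pid: int) -> List[str]:
    """Best-effort lookup of the command line a process was started with."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            return parse_cmdline(f.read())
    except OSError:
        # Assumption of this sketch: fall back to sys.argv where /proc is missing.
        return list(sys.argv)
```

Reassigning `sys.argv` inside the Python program does not alter what `/proc/{pid}/cmdline` reports, which is why this lookup tolerates libraries that rewrite `sys.argv`.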
import vllm and say hello --------- Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This PR adds state tracking capabilities to Ray Train V2.

## Key Changes

### State Management

Added a new state tracking system for Train V2 that captures:

- Training run status (INITIALIZING, SCHEDULING, RUNNING, etc.)
- Run attempts within each training run and their statuses
- Training worker metadata (ranks, node IP / PID, etc.)

This is done through the following classes:

- [Read][Write] The `TrainStateActor` is the centralized data access object, which is called to write data (currently in memory) and to read the data.
- [Write] The `TrainStateManager` manages the Ray Train state and writes to the `TrainStateActor`.
- [Write] The `StateManagerCallback` implements the `ControllerCallback` and `WorkerGroupCallback` and maps actions from the `Controller` and `WorkerGroup` to the `TrainStateManager`.
- [Read] The `TrainHead` exposes an endpoint for reading from `TrainStateActor`, and performs additional decoration logic before returning the data.

### Schema

Defined a comprehensive schema for training state, including:

- `TrainRun` - Top-level training run information, which maps to one call to `Trainer.fit()`
- `TrainRunAttempt` - Individual training attempts within a training run (e.g. fault tolerance retries)
- `TrainWorker` - Worker-specific state and metrics
- Status enums for runs, attempts, and actors

---------

Signed-off-by: Matthew Deng <matt@anyscale.com>
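The shape of such a schema can be sketched with dataclasses. Only the class names `TrainRun`, `TrainRunAttempt`, `TrainWorker` and the three status values come from the description above; every field name, the `RunStatus` enum name, and the `InMemoryTrainState` store are hypothetical stand-ins, not the Ray Train V2 definitions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class RunStatus(str, Enum):
    # Only the statuses named in the description; the real enum has more.
    INITIALIZING = "INITIALIZING"
    SCHEDULING = "SCHEDULING"
    RUNNING = "RUNNING"


@dataclass
class TrainWorker:
    world_rank: int  # hypothetical field names throughout
    node_ip: str
    pid: int


@dataclass
class TrainRunAttempt:
    attempt_id: str
    status: RunStatus = RunStatus.INITIALIZING
    workers: List[TrainWorker] = field(default_factory=list)


@dataclass
class TrainRun:
    run_id: str
    status: RunStatus = RunStatus.INITIALIZING
    attempts: List[TrainRunAttempt] = field(default_factory=list)


class InMemoryTrainState:
    """Stand-in for a centralized store: writers upsert runs, readers list them."""

    def __init__(self) -> None:
        self._runs: Dict[str, TrainRun] = {}

    def upsert_run(self, run: TrainRun) -> None:
        self._runs[run.run_id] = run

    def get_runs(self) -> Dict[str, TrainRun]:
        return dict(self._runs)


store = InMemoryTrainState()
run = TrainRun(run_id="run-1")
store.upsert_run(run)

# A write path would update the run as the controller makes progress.
attempt = TrainRunAttempt(attempt_id="attempt-1", status=RunStatus.RUNNING)
attempt.workers.append(TrainWorker(world_rank=0, node_ip="10.0.0.1", pid=1234))
run.attempts.append(attempt)
run.status = RunStatus.RUNNING
store.upsert_run(run)
```

Keeping one run per `Trainer.fit()` call and nesting attempts under it mirrors the hierarchy the schema section describes.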
these are heavy tests at the end of the dependency chain that are not supported by microcheck smart filtering, yet they rarely break Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…t#50616) these are tools to be used to run on a cloud provider Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Bug fixes:

1. The `fn` in the Stage abstraction should be `Type[StatefulStage]`, but it was `StatefulStage` instead.
2. The preprocess / postprocess fns could not be set to `None` before. `None` now means identity.
3. The endpoint you hit in the http processor can reject extra keys, so it's important to take the output of the preprocessor as is as the payload and send it to the underlying endpoint without extra fields. The same applies to the output: it should not be flattened out to clutter the output results. `http_response` is the key you index on to get the response from the endpoint.

---------

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
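Fixes 2 and 3 can be illustrated with a small sketch. All names here (`resolve`, `run_http_stage`, `fake_endpoint`) are hypothetical and only model the behavior described above; they are not the Ray processor API.

```python
from typing import Any, Callable, Dict, Optional

Row = Dict[str, Any]
RowFn = Callable[[Row], Row]


def identity(row: Row) -> Row:
    return row


def resolve(fn: Optional[RowFn]) -> RowFn:
    """Treat None as the identity function."""
    return identity if fn is None else fn


def run_http_stage(
    row: Row,
    send: Callable[[Row], Row],
    preprocess: Optional[RowFn] = None,
    postprocess: Optional[RowFn] = None,
) -> Row:
    # The preprocessor output is the payload, forwarded without extra fields.
    payload = resolve(preprocess)(row)
    # The response stays nested under "http_response" instead of being flattened.
    result = {"http_response": send(payload)}
    return resolve(postprocess)(result)


def fake_endpoint(payload: Row) -> Row:
    """Stand-in endpoint that rejects keys it does not know."""
    extra = set(payload) - {"prompt"}
    if extra:
        raise ValueError(f"unexpected keys: {sorted(extra)}")
    return {"text": payload["prompt"].upper()}


out = run_http_stage({"prompt": "hi"}, fake_endpoint)
picked = run_http_stage(
    {"prompt": "hi", "id": 7},
    fake_endpoint,
    preprocess=lambda row: {"prompt": row["prompt"]},
    postprocess=lambda row: {"answer": row["http_response"]["text"]},
)
```

A strict endpoint such as `fake_endpoint` would fail if the stage added bookkeeping fields to the payload, which is the situation fix 3 guards against.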
Signed-off-by: dentiny <dentinyhao@gmail.com>
…ray-project#50290) `AllToAllOperator` and `ZipOperator` don't implement accurate memory accounting. As a result, if a plan contains either of these operators, the streaming executor falls back to the legacy scheduling algorithm. Fixes ray-project#48104 --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Why are these changes needed? Add in-place ArrowBlockAccessor::random_shuffle Addresses ray-project#42146 --------- Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com> Signed-off-by: srinathk10 <68668616+srinathk10@users.noreply.github.com>
…ray-project#50601) Signed-off-by: 400Ping <43886578+400Ping@users.noreply.github.com>
…ect#50621)

## Why are these changes needed?

* deletes the binder/ folder
* deletes all references to launching Binder notebooks
* unhides some important cells while hiding the output of others
* clarifies the pip requirements for Tune tutorials
* adds `.conda` to `.gitignore`

## Related issue number

N/A

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.

---------

Signed-off-by: Ricardo Decal <rdecal@anyscale.com>
… in memory (ray-project#50121) When `ray.wait` is called with `fetch_local`, we always want to issue pull requests for the objects passed into `ray.wait` that live on different nodes. Right now we don't make that request if we've already gotten `num_objects` from the core worker memory store. Closes ray-project#49257 --------- Signed-off-by: dayshah <dhyey2019@gmail.com>
…ct#50615) The following two functions are the same. https://github.com/ray-project/ray/blob/30df6a43a7324eaf65b83b94b5c1511b0f954a39/python/ray/autoscaler/v2/utils.py#L840-L851 https://github.com/ray-project/ray/blob/30df6a43a7324eaf65b83b94b5c1511b0f954a39/python/ray/autoscaler/v2/sdk.py#L100-L110 --------- Signed-off-by: kaihsun <kaihsun@anyscale.com>
…ject#51016)

This is the same as ray-project#50644.

Currently, tasks are sometimes executed on the main thread and sometimes on other threads that are created by C++. However, C++ threads are only considered Python threads while they execute Python functions. Hence, when a task finishes, the thread-local state is garbage-collected. See ray-project#46336 (comment) for more details.

This PR uses the CPython API to treat C++ threads as Python threads, even if they do not execute Python functions. https://docs.python.org/3/c-api/init.html#non-python-created-threads

This PR handles synchronous tasks. After this PR is merged, the follow-up tasks are:

* Always create a default executor.
* Move the constructor to the default executor.
* Support asynchronous tasks.

## Related issue number

Part of ray-project#46336

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
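Why the lifetime of the executing thread matters for thread-local state can be shown with a pure-Python analogy. This sketch does not use the CPython C API that the PR relies on; it only contrasts tasks run on short-lived threads with tasks run on one long-lived thread, using standard `threading` and `concurrent.futures` calls.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

local = threading.local()


def task() -> int:
    """Increment a per-thread counter and return its new value."""
    local.count = getattr(local, "count", 0) + 1
    return local.count


def run_in_fresh_thread() -> int:
    """Run one task on a brand-new thread that exits afterwards."""
    result: List[int] = []
    t = threading.Thread(target=lambda: result.append(task()))
    t.start()
    t.join()
    return result[0]


# Each task gets a new thread, so the thread-local counter always starts over.
fresh = [run_in_fresh_thread() for _ in range(3)]

# A single long-lived worker thread keeps its thread-local state across tasks.
with ThreadPoolExecutor(max_workers=1) as pool:
    reused = [pool.submit(task).result() for _ in range(3)]
```

Here `fresh` is `[1, 1, 1]` while `reused` is `[1, 2, 3]`: state survives only when the same thread, as Python sees it, runs consecutive tasks.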
…ler State (ray-project#50886)

This is the last PR for the feature of early termination of infeasible tasks. The previous PRs:

1. added logic in GCS to obtain the per-node infeasible resource requests from the autoscaler state;
2. added the new API in raylet_client, node_manager and cluster_task_manager to cancel tasks with certain resource request shapes.

This PR integrates the above PRs:

1. added the logic to call the cancel resource shape API based on the per-node infeasible requests from the autoscaler;
2. put the feature behind a Ray config, with the default set to on;
3. made small improvements to the previous PRs (logging, comments, messages, and an early exit when getting the per-node infeasible requests);
4. added integration tests for both the normal task scheduling case and the actor creation case.

With this change, infeasible tasks (both normal tasks and actor creation tasks) are terminated early by default.

Closes ray-project#45909

---------

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
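Cancelling queued tasks by resource request shape can be sketched as follows. This is a simplified model under stated assumptions: `to_shape` and `cancel_infeasible` are hypothetical helpers, a shape is modeled as a frozen set of `(resource, amount)` pairs, and the real logic lives in the C++ components named above.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Resources = Dict[str, float]
Shape = FrozenSet[Tuple[str, float]]


def to_shape(resources: Resources) -> Shape:
    """Normalize a resource request so equal requests compare equal."""
    return frozenset(resources.items())


def cancel_infeasible(
    queue: Iterable[Tuple[str, Resources]],
    infeasible_requests: Iterable[Resources],
) -> Tuple[List[str], List[str]]:
    """Split queued task ids into (kept, cancelled) by resource shape."""
    infeasible = {to_shape(r) for r in infeasible_requests}
    kept: List[str] = []
    cancelled: List[str] = []
    for task_id, resources in queue:
        target = cancelled if to_shape(resources) in infeasible else kept
        target.append(task_id)
    return kept, cancelled


queue = [
    ("task-1", {"CPU": 1}),
    ("actor-1", {"GPU": 8, "CPU": 4}),
    ("task-2", {"CPU": 4, "GPU": 8}),
]
# Shapes that the autoscaler reported as impossible to satisfy.
kept, cancelled = cancel_infeasible(queue, [{"CPU": 4, "GPU": 8}])
```

Matching on the whole shape rather than on individual resources means a task is cancelled only when its exact request was reported as infeasible.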
Remove a duplicated sentence --------- Signed-off-by: maofagui <maofg92@163.com> Signed-off-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
…quantized models. (ray-project#51007) Fixes a bug where `get_device_capability()` was getting cached based on the wrong environment variables. `vllm` assumes that `CUDA_VISIBLE_DEVICES` has its final value at import time and caches some attributes, such as CUDA device capability. If we later use Ray actors and the modules get serialized, the cache apparently gets serialized too, with the wrong values. This PR clears the cache before creating the engine so that the values are recomputed from the right environment variables. --------- Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
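The failure mode, a value cached before the environment is final, can be reproduced with `functools.lru_cache`. The function `visible_devices` is a hypothetical stand-in for a cached device query and is not vLLM code; the sketch uses `CUDA_VISIBLE_DEVICES`, the variable the CUDA runtime reads.

```python
import os
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def visible_devices() -> Tuple[str, ...]:
    """Hypothetical cached lookup derived from the environment."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return tuple(d for d in raw.split(",") if d)


# The value is computed and cached the first time it is requested.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
at_import = visible_devices()

# The environment changes later (e.g. a worker is assigned one GPU),
# but the cached result still reflects the old value.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
stale = visible_devices()

# Clearing the cache forces the value to be recomputed.
visible_devices.cache_clear()
fresh = visible_devices()
```

After `cache_clear()`, `fresh` reflects the current environment while `stale` still holds the value computed before the change.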
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ct#50990) When trying to write to BigQuery with `dataset.write_bigquery(...)`, I get the exception `TypeError: _create_client() got an unexpected keyword argument 'project'`. This seems to be a slight mismatch in function signatures, and unfortunately it was not picked up by tests because of the level at which mocking was applied. Fixes: ray-project#50991 Signed-off-by: David Farrington <david@shipit.ltd>
Redis was removed from Ray's dependencies more than 2.5 years ago. However, some Redis-related parameters remain in the Ray codebase, such as `redis_max_memory`. This PR deletes the use of `redis_max_memory` in all tests. The next step is to remove `redis_max_memory` from the Ray codebase. --------- Signed-off-by: kaihsun <kaihsun@anyscale.com>
…ray-project#50874) Remove the ability to access private attributes and methods of the TrainContext. --------- Signed-off-by: Hongpeng Guo <hpguo@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com>
content moved to config file; var not used any more Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Add doc for torch profile. Also add example and step-by-step guide to nsight profile. --------- Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
fix windows build Signed-off-by: dayshah <dhyey2019@gmail.com>
ray-project#51041) ## Why are these changes needed? Our tests with pyarrow nightly caught a backwards incompatibility bug with a [recent pyarrow change](apache/arrow#45471). To fix this we simply need to pass along kwargs in our `as_py` method as suggested by the pyarrow team [here](apache/arrow#45471 (comment)). --------- Signed-off-by: Matthew Owen <mowen@anyscale.com>
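The kind of incompatibility described here, and the fix of passing keyword arguments along, can be sketched with stand-in classes. Nothing below is pyarrow or Ray code: `Scalar`, `StrictScalar`, `ForwardingScalar`, `convert_all`, and the option name `new_option` are all hypothetical.

```python
from typing import Any, List


class Scalar:
    """Stand-in for a library scalar type."""

    def __init__(self, value: Any) -> None:
        self.value = value

    def as_py(self, **kwargs: Any) -> Any:
        return self.value


class StrictScalar(Scalar):
    """Old-style override that accepts no keyword arguments."""

    def as_py(self) -> Any:  # breaks once callers start passing options
        return self.value


class ForwardingScalar(Scalar):
    """Override that passes any keyword arguments along to the parent."""

    def as_py(self, **kwargs: Any) -> Any:
        return super().as_py(**kwargs)


def convert_all(scalars: List[Scalar], **kwargs: Any) -> List[Any]:
    """Model of a library caller that now forwards options to each element."""
    return [s.as_py(**kwargs) for s in scalars]


ok = convert_all([ForwardingScalar(1), ForwardingScalar(2)], new_option=True)

try:
    convert_all([StrictScalar(1)], new_option=True)
    strict_failed = False
except TypeError:
    strict_failed = True
```

Accepting `**kwargs` in the override keeps it compatible with callers that pass no options as well as with callers that pass new ones.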
Signed-off-by: dentiny <dentinyhao@gmail.com>
## Why are these changes needed?

Make the deployment retry count configurable through an environment variable.

## Related issue number

This PR addresses ray-project#5071

Since I did not find any references to this behavior in the public docs, I decided not to update any `docs`. Let me know if that's not true.

- Testing Strategy

### updated unit tests

### manual test

1. Create a simple application:

```
import logging

import requests
from fastapi import FastAPI

from ray import serve

fastapi = FastAPI()
logger = logging.getLogger("ray.serve")


@serve.deployment(name="fastapi-deployment", num_replicas=2)
@serve.ingress(fastapi)
class FastAPIDeployment:
    def __init__(self):
        self.counter = 0
        # Fail on purpose so that replica startup is retried.
        raise Exception("test")

    # FastAPI automatically parses the HTTP request.
    @fastapi.get("/hello")
    def say_hello(self, name: str) -> str:
        logger.info("Handling request!")
        return f"Hello {name}!"


my_app = FastAPIDeployment.bind()
```

2. Run the application from the local CLI:

```
MAX_PER_REPLICA_RETRY_MULTIPLIER=1 serve run test:my_app
```

3. From the logs I can see that we retry only once per replica instead of the default `3`: https://gist.github.com/abrarsheikh/e85e00bb94ba443f76f77220b6ace530. Since my app contains 2 replicas, the code retries 2 * 1 times, as expected.
4. Running without overriding the env variable (`serve run test:my_app`) retries 6 times.

---------

Signed-off-by: Abrar Sheikh <abrar2002as@gmail.com>
Signed-off-by: Abrar Sheikh <abrar@abrar-FK79L5J97K.local>
Co-authored-by: Saihajpreet Singh <c-saihajpreet.singh@anyscale.com>
Co-authored-by: Abrar Sheikh <abrar@abrar-FK79L5J97K.local>
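The arithmetic in the manual test (2 replicas retrying 2 * 1 times with the override, 6 times by default) can be written out as a small sketch. The helper names are hypothetical, the environment variable name is taken from the command above, and how Ray Serve reads it internally is not shown in this description.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "MAX_PER_REPLICA_RETRY_MULTIPLIER"  # name as used in the manual test
DEFAULT_MULTIPLIER = 3  # default value mentioned in the description


def retry_multiplier(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the per-replica retry multiplier, falling back to the default."""
    env = os.environ if env is None else env
    return int(env.get(ENV_VAR, DEFAULT_MULTIPLIER))


def max_retries(num_replicas: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Total retry budget for a deployment: replicas times the multiplier."""
    return num_replicas * retry_multiplier(env)


overridden = max_retries(2, {ENV_VAR: "1"})
default = max_retries(2, {})
```

Passing the environment as a mapping keeps the sketch easy to test without mutating `os.environ`.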
stop running serve tests on core changes, especially on c++ changes Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ject#50893)

## Related issue number

Improves readability of Ray Data Quickstart

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.

---------

Signed-off-by: Ricardo Decal <rdecal@anyscale.com>
Signed-off-by: Ricardo Decal <crypdick@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Gene Su <e870252314@gmail.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
wumuzi520
approved these changes
Mar 4, 2025
LGTM