
Sync upstream #518

Merged: 528 commits, Mar 4, 2025

Conversation

@xsuler (Collaborator) commented Mar 4, 2025

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

gvspraveen and others added 30 commits February 14, 2025 04:53
…ct#50591)

## Why are these changes needed?

Writing data containing a tensor column to `ParquetDatasink` with
partition column(s) fails because there is no pyarrow kernel for
`hash_list`. The current implementation uses pyarrow's `groupby.aggregate`,
and the aggregation kernels don't support tensor types (see the snippet below).

This PR rewrites the implementation so that non-partition columns are never
aggregated, avoiding the issue.

```
import numpy as np
import pyarrow as pa

x = {"category": ["a", "b"] * 10, "tensor": list(np.random.random((20, 128)))}
schema = pa.schema(
    [
        ("category", pa.dictionary(pa.int32(), pa.string())),
        ("tensor", pa.fixed_shape_tensor(value_type=pa.float32(), shape=(128,))),
    ]
)
t = pa.Table.from_pydict(x, schema=schema)
t.group_by("category").aggregate([("tensor", "list")])

>> 
Traceback (most recent call last):
  File "/Users/praveengorthy/anyscale/rayturbo/python/ray/data/test_dataset.py", line 63, in <module>
    t.group_by("category").aggregate([("tensor", "list")])
  File "pyarrow/table.pxi", line 5562, in pyarrow.lib.TableGroupBy.aggregate
  File "/opt/miniconda3/lib/python3.9/site-packages/pyarrow/acero.py", line 308, in _group_by
    return decl.to_table(use_threads=use_threads)
  File "pyarrow/_acero.pyx", line 511, in pyarrow._acero.Declaration.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'hash_list' has no kernel matching input types (extension<arrow.fixed_shape_tensor[value_type=float, shape=[128]]>, uint32)
```
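
A minimal sketch of the aggregation-free approach (illustrative only, not the actual Ray Data code): the table is split per partition value with `filter`, so no `hash_list` kernel is ever needed for the tensor column.

```
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

x = {"category": ["a", "b"] * 10, "tensor": list(np.random.random((20, 128)).astype(np.float32))}
schema = pa.schema(
    [
        ("category", pa.dictionary(pa.int32(), pa.string())),
        ("tensor", pa.fixed_shape_tensor(value_type=pa.float32(), shape=(128,))),
    ]
)
t = pa.Table.from_pydict(x, schema=schema)

# Split per partition value instead of aggregating the non-partition columns.
categories = t["category"].cast(pa.string())
for value in pc.unique(categories):
    part = t.filter(pc.equal(categories, value))
    pq.write_table(part, f"category={value.as_py()}.parquet")
```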

## Related issue number

Closes ray-project#50506



Signed-off-by: Praveen Gorthy <praveeng@anyscale.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
…alls (don't serialize lambda for each single call). (ray-project#50527)
so that we can use the image for running related release tests.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…earners=1 (GPU); Remote Learner is always faster now. (ray-project#50600)
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
…0610)

for running release tests

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Remove python/ray/serve/ directory during core tests to enforce componentization.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
closes ray-project#45288

Signed-off-by: Hao Chen <chenh1024@gmail.com>
…y-project#50435)

ray-project#49317 initiated the decoupling
of the Ray Train and Ray Tune top-level APIs. This PR updates all of the
internal usage in Ray Tune examples and tests, switching from `ray.air`
(long outdated) and `ray.train` imports to `ray.tune` imports.

See ray-project/enhancements#57 for context
around the changes.
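
For illustration, the migration looks roughly like the following (a sketch, assuming the Tune-side equivalents are re-exported from the `ray.tune` top level as described in the REP):

```
# Before (outdated top-level imports):
# from ray.air import RunConfig, CheckpointConfig

# After: Tune-only code imports the equivalents from ray.tune instead.
from ray.tune import RunConfig, CheckpointConfig

run_config = RunConfig(
    name="my_experiment",
    checkpoint_config=CheckpointConfig(num_to_keep=2),
)
```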

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
strict check on script execution, and properly install the dependencies

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ray-project#50597)


## Why are these changes needed?

This fixes an issue @erictang000 was running into while using the uv
runtime env hook for https://github.com/hiyouga/LLaMA-Factory. LLaMA
Factory modifies sys.argv before calling ray.init (and therefore the
hook), which broke the original logic. We instead use the value from
`/proc/{pid}/cmdline` and add a test that the hook tolerates changes to
sys.argv.
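
A minimal sketch of the idea (Linux-only; illustrative, not the hook's actual code):

```
import os
import sys


def original_argv(pid=None):
    """Read the launch command from /proc, ignoring later mutations of sys.argv."""
    pid = os.getpid() if pid is None else pid
    # /proc/<pid>/cmdline stores the original NUL-separated arguments.
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        raw = f.read()
    return [arg.decode() for arg in raw.split(b"\0") if arg]


if __name__ == "__main__":
    sys.argv[:] = ["mangled"]  # simulate a framework rewriting sys.argv
    print(original_argv())     # still reflects how the process was launched
```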

import vllm and say hello

---------

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This PR adds state tracking capabilities to Ray Train V2.

## Key Changes
### State Management
Added a new state tracking system for Train V2 that captures:
- Training run status (INITIALIZING, SCHEDULING, RUNNING, etc.)
- Run attempts within each training run and their statuses
- Training worker metadata (ranks, node IP / PID, etc.)

This is done through the following classes:

- [Read][Write] The `TrainStateActor` is the centralized data access
object, which is called to write data (currently held in memory) and to
read it back.
- [Write] The `TrainStateManager` manages the Ray Train state and
writes to the `TrainStateActor`.
- [Write] The `StateManagerCallback` implements the `ControllerCallback`
and `WorkerGroupCallback` and maps actions from the `Controller` and
`WorkerGroup` to the `TrainStateManager`.
- [Read] The `TrainHead` exposes an endpoint for reading from the
`TrainStateActor` and performs additional decoration logic before
returning the data.

### Schema
Defined a comprehensive schema for training state (sketched after this list), including:

- `TrainRun` - Top-level training run information, which maps to one
call to `Trainer.fit()`
- `TrainRunAttempt` - Individual training attempts within a training run
(e.g. fault tolerance retries).
- `TrainWorker` - Worker-specific state and metrics
- Status enums for runs, attempts, and actors
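
A hypothetical sketch of those shapes (field names and statuses here are illustrative, not the exact Ray Train definitions):

```
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RunStatus(Enum):
    INITIALIZING = "INITIALIZING"
    SCHEDULING = "SCHEDULING"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    ERRORED = "ERRORED"


@dataclass
class TrainWorker:
    world_rank: int
    node_ip: str
    pid: int


@dataclass
class TrainRunAttempt:
    attempt_id: str
    status: RunStatus
    workers: List[TrainWorker] = field(default_factory=list)


@dataclass
class TrainRun:
    run_id: str  # one TrainRun maps to one Trainer.fit() call
    name: str
    status: RunStatus
    attempts: List[TrainRunAttempt] = field(default_factory=list)
```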

---------

Signed-off-by: Matthew Deng <matt@anyscale.com>
these are heavy tests at the end of the dependency chain, not supported
by microcheck smart filtering, yet they rarely break

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…t#50616)

these are tools meant to be run on a cloud provider

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Bug fixes:

1. The `fn` in the `Stage` abstraction should actually be `Type[StatefulStage]`,
but it was `StatefulStage` instead.
2. The preprocess / postprocess fns could not be set to `None` before; `None`
now means identity.
3. The endpoint you hit in the HTTP processor can reject extra keys, so it's
important to take the output of the preprocessor as-is as the payload and
send it without extra fields to the underlying endpoint. The same applies to
the output: it should not be flattened out to clutter the output results.
`http_response` is the key you index on to get the response from the
endpoint (see the sketch below).
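
A hypothetical illustration of that payload contract (not the actual `ray.data.llm` implementation; the endpoint URL and field names are placeholders):

```
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint


def preprocess(row):
    # Whatever the preprocessor returns is sent as the payload verbatim,
    # with no extra fields added that the endpoint might reject.
    return {
        "model": "my-model",
        "messages": [{"role": "user", "content": row["prompt"]}],
    }


def http_stage(row):
    payload = preprocess(row)
    resp = requests.post(ENDPOINT, json=payload, timeout=60)
    # Keep the raw response nested under "http_response" instead of
    # flattening it into the row.
    return {**row, "http_response": resp.json()}


def postprocess(row):
    return {"answer": row["http_response"]["choices"][0]["message"]["content"]}
```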

---------

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
…ray-project#50290)

`AllToAllOperator` and `ZipOperator` don't implement accurate memory
accounting. As a result, if a plan contains either of these operators, the
streaming executor falls back to the legacy scheduling algorithm.

Fixes ray-project#48104

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Why are these changes needed?
Add an in-place `ArrowBlockAccessor::random_shuffle` (sketched below).
Addresses ray-project#42146
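
A minimal sketch of the underlying operation (illustrative; not Ray Data's actual block accessor code), shuffling an Arrow block by taking rows in a random permutation:

```
import numpy as np
import pyarrow as pa


def random_shuffle(table, seed=None):
    # Take rows in a random permutation; the PR's in-place variant lives at
    # the block accessor level, this only shows the basic idea.
    indices = np.random.default_rng(seed).permutation(table.num_rows)
    return table.take(indices)


t = pa.table({"x": list(range(10))})
print(random_shuffle(t, seed=0)["x"].to_pylist())
```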

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: srinathk10 <68668616+srinathk10@users.noreply.github.com>
…ray-project#50601)

Signed-off-by: 400Ping <43886578+400Ping@users.noreply.github.com>
…ect#50621)

## Why are these changes needed?

* deletes the binder/ folder
* deletes all references to launching Binder notebooks
* unhides some important cells while hiding the output of others
* clarifies the pip requirements for Tune tutorials
* adds `.conda` to `.gitignore`

## Related issue number

N/A

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Ricardo Decal <rdecal@anyscale.com>
… in memory (ray-project#50121)

We always want to issue pull requests for the objects passed into
`ray.wait` that live on other nodes when `ray.wait` is called with
`fetch_local`. Right now we don't issue that request if we've already
gotten `num_objects` objects from the core worker memory store.

Closes ray-project#49257
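
For illustration, the behavior being fixed roughly corresponds to this usage (a sketch; the task and payload size are arbitrary):

```
import ray

ray.init()


@ray.remote
def produce(i):
    # Large enough to be stored in the object store rather than inlined.
    return bytes(8 * 1024 * 1024)


refs = [produce.remote(i) for i in range(4)]

# With fetch_local set, ray.wait should also start pulling the ready objects
# to this node, even if num_returns of them are already in the core worker
# memory store, so a later ray.get doesn't stall on a cross-node transfer.
ready, unready = ray.wait(refs, num_returns=2, fetch_local=True)
print(len(ready), "ready;", len(unready), "not ready yet")
```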

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
kevin85421 and others added 22 commits March 2, 2025 07:06
…ject#51016)

This is the same as ray-project#50644

Currently, tasks are sometimes executed on the main thread and sometimes
on other threads created by C++. However, C++ threads are only considered
Python threads while they execute Python functions. Hence, when a task
finishes, the thread-local state will be garbage-collected.
See
ray-project#46336 (comment)
for more details.

This PR uses the CPython API to treat C++ threads as Python threads,
even if they do not execute Python functions.
https://docs.python.org/3/c-api/init.html#non-python-created-threads

This PR handles synchronous tasks. After this PR is merged, the
follow-up tasks are:
* Always create a default executor.
* Move the constructor to the default executor.
* Support asynchronous tasks.

## Related issue number

Part of ray-project#46336 

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
…ler State (ray-project#50886)

This is the last PR for the feature of early termination of infeasible
tasks.

The previous PRs:
(1) added logic in GCS to obtain the per node infeasible resource
requests from the autoscaler state;
(2) added the new API in raylet_client, node_manager and
cluster_task_manager to cancel tasks with certain resource request
shapes

This PR added the integration of the above PRs:
(1) added the logic to call the cancel-resource-shape API based on the
per-node infeasible requests from the autoscaler;
(2) put the feature behind a Ray config, with the default set to on;
(3) made small improvements to previous PRs (logging, comments, messages, and
an early exit when getting the per-node infeasible requests);
(4) added integration tests for both the normal task scheduling case and the
actor creation case.

With this change, infeasible tasks (both normal tasks and actor creation
tasks) are terminated early by default (see the illustration below).

Closes ray-project#45909
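
As a hypothetical illustration of the user-visible behavior (the exact error surfaced to the caller is not spelled out here):

```
import ray

ray.init(num_gpus=0)  # a cluster that has no GPUs at all


@ray.remote(num_gpus=128)
def needs_many_gpus():
    return "never runs"


# No current or autoscaled node can ever satisfy num_gpus=128, so the request
# is infeasible. Previously it would stay pending forever; with this feature
# enabled (the default), the task is cancelled early instead.
ref = needs_many_gpus.remote()
```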

---------

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Remove a duplicated sentence


---------

Signed-off-by: maofagui <maofg92@163.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>
)

Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
…quantized models. (ray-project#51007)

Fixes a bug where `get_device_capability()` was getting cached based on
the wrong environment variables. `vllm` assumes that `CUDA_VISIBLE_DEVICES`
already has its final value at import time and caches some attributes, such
as CUDA device compatibility, accordingly.

If we later use Ray actors and serialize the modules, the cache apparently
gets serialized along with the wrong values. This PR clears the cache before
creating the engine so the values are recomputed based on the right env
variables.
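
A generic illustration of this bug class and the fix (the function here is illustrative, not vllm's actual internals):

```
import functools
import os


@functools.lru_cache(maxsize=None)
def get_device_capability():
    # Pretend this inspects whichever GPUs are visible at call time.
    return os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")


os.environ["CUDA_VISIBLE_DEVICES"] = ""    # value at import time
print(get_device_capability())             # caches ""

os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # Ray sets the real value later
print(get_device_capability())             # still "": stale cached value

get_device_capability.cache_clear()        # the fix: clear before creating the engine
print(get_device_capability())             # now "0", recomputed from the right env
```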

---------

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ct#50990)

When trying to write to BigQuery with `dataset.write_bigquery(...)`, I
get the exception `TypeError: _create_client() got an unexpected keyword
argument 'project'`.

This seems to be a slight mismatch in function signatures and was
unfortunately not picked up by tests due to the level at which mocking
was applied.

Fixes: ray-project#50991

Signed-off-by: David Farrington <david@shipit.ltd>
Redis was removed from Ray's dependencies more than 2.5 years ago.
However, some Redis-related parameters remain in the Ray codebase, such
as `redis_max_memory`.

This PR deletes the use of `redis_max_memory` in all tests. The next
step is to remove `redis_max_memory` from the Ray codebase.

---------

Signed-off-by: kaihsun <kaihsun@anyscale.com>
…ray-project#50874)

Remove the ability to access private attributes and methods of the TrainContext.

---------

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
content moved to config file; var not used any more

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Add docs for the torch profiler.
Also add an example and a step-by-step guide for the nsight profiler.

---------

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
fix windows build

Signed-off-by: dayshah <dhyey2019@gmail.com>
ray-project#51041)

## Why are these changes needed?
Our tests with pyarrow nightly caught a backwards-incompatibility bug
with a [recent pyarrow
change](apache/arrow#45471). To fix this, we
simply need to pass along kwargs in our `as_py` method, as suggested by
the pyarrow team
[here](apache/arrow#45471 (comment)).
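
A sketch of the kind of fix described (the class here is hypothetical; the point is simply forwarding keyword arguments instead of hard-coding a zero-argument `as_py`):

```
import pyarrow as pa


class MyExtensionScalar(pa.ExtensionScalar):
    def as_py(self, **kwargs):
        # Forward any keyword arguments newer pyarrow versions pass through
        # to the parent implementation rather than dropping them.
        return super().as_py(**kwargs)
```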


---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
## Why are these changes needed?

Make deployment retry count configurable through environment variable

## Related issue number

This PR addresses ray-project#5071.

Since I did not find any references to this behavior in the public docs,
I decided not to update any docs; let me know if that's not the case.

- Testing Strategy
### updated unit tests

### manual test

1. Create a simple application
```
import logging
from fastapi import FastAPI
from ray import serve

fastapi = FastAPI()
logger = logging.getLogger("ray.serve")

@serve.deployment(name="fastapi-deployment", num_replicas=2)
@serve.ingress(fastapi)
class FastAPIDeployment:
    def __init__(self):
        self.counter = 0
        raise Exception("test")

    # FastAPI automatically parses the HTTP request.
    @fastapi.get("/hello")
    def say_hello(self, name: str) -> str:
        logger.info("Handling request!")
        return f"Hello {name}!"

my_app = FastAPIDeployment.bind()

```

2. Run the application from the local CLI
```
MAX_PER_REPLICA_RETRY_MULTIPLIER=1 serve run test:my_app
```

3. From the logs I can see that we are only retrying once instead of the
default `3`:
https://gist.github.com/abrarsheikh/e85e00bb94ba443f76f77220b6ace530

Since my app contains 2 replicas, the code retries 2 * 1 = 2 times, as
expected.

4. Running without overriding the env variable (`serve run test:my_app`)
retries 6 times.

---------

Signed-off-by: Abrar Sheikh <abrar2002as@gmail.com>
Signed-off-by: Abrar Sheikh <abrar@abrar-FK79L5J97K.local>
Co-authored-by: Saihajpreet Singh <c-saihajpreet.singh@anyscale.com>
Co-authored-by: Abrar Sheikh <abrar@abrar-FK79L5J97K.local>
stop running serve tests on core changes, especially on c++ changes

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ject#50893)

## Related issue number

Improves readability of Ray Data Quickstart


---------

Signed-off-by: Ricardo Decal <rdecal@anyscale.com>
Signed-off-by: Ricardo Decal <crypdick@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@wumuzi520 (Collaborator) left a comment:

LGTM

@xsuler merged commit 91690dd into main on Mar 4, 2025
1 of 5 checks passed