[Datasets] Do not eagerly execute first block for read_xxx API #31558

c21 · 2023-01-10T06:37:30Z

Signed-off-by: Cheng Su <scnju13@gmail.com>

Why are these changes needed?

This PR is a follow-up to #31286 (review). The changes are:

- read_api.py:read_datasource(): Remove the logic that eagerly executes the first block on read.
- Dataset.schema(): Change the default value of fetch_if_missing from False to True, so that execution is always triggered if the schema is missing.
- ExecutionPlan.schema(): If the plan's output is a lazy block list, execute only the first block to get the schema, instead of executing all blocks.
- Other files: Unit test changes to work with the new logic.
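To illustrate the intended behavior, here is a minimal, self-contained sketch in plain Python. It does not use Ray and is not the code in this PR: LazyBlockList, ToyDataset, read_toy, and make_read_task are hypothetical stand-ins, and the schema is modeled as a simple dict of column name to type name. It only shows the idea that a read stays lazy, and that schema() runs just the first read task when the schema is missing.

```python
from typing import Callable, Dict, List, Optional

Block = List[Dict[str, object]]  # toy model: a block is a list of rows


class LazyBlockList:
    """Hypothetical stand-in for a lazy block list: holds read tasks, not data."""

    def __init__(self, read_tasks: List[Callable[[], Block]]):
        self._read_tasks = read_tasks
        self._cache: List[Optional[Block]] = [None] * len(read_tasks)
        self.executed = 0  # number of read tasks actually run so far

    def _get(self, i: int) -> Block:
        # Run the i-th read task at most once and cache its result.
        if self._cache[i] is None:
            self._cache[i] = self._read_tasks[i]()
            self.executed += 1
        return self._cache[i]

    def first_block(self) -> Optional[Block]:
        # Execute only the first read task (or nothing if there are none).
        return self._get(0) if self._read_tasks else None


class ToyDataset:
    """Hypothetical dataset whose schema is fetched on demand."""

    def __init__(self, blocks: LazyBlockList):
        self._blocks = blocks
        self._schema: Optional[Dict[str, str]] = None

    def schema(self, fetch_if_missing: bool = True) -> Optional[Dict[str, str]]:
        # With fetch_if_missing=True (the new default described above),
        # a missing schema triggers execution of the first block only.
        if self._schema is None and fetch_if_missing:
            first = self._blocks.first_block()
            if first:
                self._schema = {k: type(v).__name__ for k, v in first[0].items()}
        return self._schema


def make_read_task(start: int) -> Callable[[], Block]:
    # Each task produces two rows when (and only when) it is called.
    return lambda: [{"id": i, "name": str(i)} for i in range(start, start + 2)]


def read_toy(num_blocks: int = 4) -> ToyDataset:
    # Creating the dataset runs no read task: nothing is executed eagerly.
    tasks = [make_read_task(2 * i) for i in range(num_blocks)]
    return ToyDataset(LazyBlockList(tasks))


ds = read_toy(4)
print(ds._blocks.executed)               # no task has run after the read
print(ds.schema(fetch_if_missing=False)) # schema is unknown without execution
print(ds.schema())                       # triggers the first read task only
print(ds._blocks.executed)
```

In this sketch, calling schema() a second time reuses the cached result, so the remaining read tasks are never run just to answer a schema query.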

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 · 2023-01-10T20:16:49Z

The failed book-documentation check looks unrelated to this PR; the same failure also shows up on other PRs.

clarkzinzow

LGTM!

python/ray/data/_internal/plan.py

python/ray/data/tests/test_dataset.py

Signed-off-by: Cheng Su <scnju13@gmail.com>

Do not eagerly execute first block for read_xxx API

43de763

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 requested review from ericl, scv119, clarkzinzow, jjyao, jianoaix and a team as code owners January 10, 2023 06:37

Fix unit tests

ef83ed1

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 assigned ericl, clarkzinzow and jianoaix Jan 10, 2023

jianoaix approved these changes Jan 10, 2023

clarkzinzow approved these changes Jan 10, 2023

python/ray/data/_internal/plan.py

c21 commented Jan 10, 2023

python/ray/data/tests/test_dataset.py

ericl approved these changes Jan 11, 2023

clarkzinzow merged commit 3fabd7f into ray-project:master Jan 11, 2023

c21 deleted the schema branch January 12, 2023 01:08

AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023

[Datasets] Do not eagerly execute first block for read_xxx API (#31558)

e645d52

Signed-off-by: Cheng Su <scnju13@gmail.com>
