[Data] Refactor block batching to follow iterator pattern #31425
Conversation
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Looking good overall, main note is about using `collections.deque` rather than `queue.Queue` for the sliding prefetch window.
```python
sliding_window = queue.Queue(maxsize=window_size)

# Create the initial set of blocks to prefetch.
while not sliding_window.full():
    try:
        sliding_window.put(next(block_ref_iter))
    except StopIteration:
        break
with stats.iter_wait_s.timer() if stats else nullcontext():
    prefetcher.prefetch_blocks(list(sliding_window.queue))

while not sliding_window.empty():
    block_ref = sliding_window.get()
    try:
        sliding_window.put(next(block_ref_iter))
        with stats.iter_wait_s.timer() if stats else nullcontext():
            prefetcher.prefetch_blocks(list(sliding_window.queue))
    except StopIteration:
        pass
    yield block_ref
    if clear_block_after_read:
        ray._private.internal_api.free(block_ref, local_only=False)
```
I think that we'd want to stick to `collections.deque` for a single-threaded sliding window implementation (more efficient, less complicated semantics):
```python
sliding_window = collections.deque(
    itertools.islice(block_ref_iter, window_size), maxlen=window_size
)
while sliding_window:
    block_ref = sliding_window.popleft()
    try:
        sliding_window.append(next(block_ref_iter))
        with stats.iter_wait_s.timer() if stats else nullcontext():
            prefetcher.prefetch_blocks(list(sliding_window))
    except StopIteration:
        pass
    yield block_ref
    if clear_block_after_read:
        ray._private.internal_api.free(block_ref, local_only=False)
```

(Note: `deque` takes a `maxlen` keyword, not `maxsize`.)
Even after we have a background thread worker, I don't think that we'd want to have a multithreading queue inside of `_prefetch_blocks`. We can keep each of these "batch preprocessing" generators threading-agnostic by pushing the producer generator into the background thread and wiring up a multithreading queue between the producer generator and the consumer generator, which should be a lot cleaner and easier to evolve (and easier to test).
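A minimal sketch of that wiring (hypothetical names, not this PR's actual code): the producer generator runs in a background thread and feeds a `queue.Queue`, while the consumer stays a plain, threading-agnostic generator.

```python
import queue
import threading

_SENTINEL = object()

def make_background_iterator(producer_gen, maxsize=4):
    """Run `producer_gen` in a background thread, yielding its items in order."""
    q = queue.Queue(maxsize=maxsize)

    def _worker():
        try:
            for item in producer_gen:
                q.put(item)
        finally:
            q.put(_SENTINEL)  # Always signal completion, even on error.

    threading.Thread(target=_worker, daemon=True).start()

    while True:
        item = q.get()
        if item is _SENTINEL:
            break
        yield item

# Downstream "batch preprocessing" generators never touch the queue:
blocks = make_background_iterator(iter(range(10)))
print(list(blocks))  # items arrive in order
```

The queue is confined to this one adapter, so every other generator in the chain can be tested single-threaded.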
Good point, updated
```python
sliding_window = queue.Queue(maxsize=window_size)

# Create the initial set of blocks to prefetch.
while not sliding_window.full():
```
It should be noted that this LBYL pattern is not thread-safe in multithreaded code, since the `Queue` class makes no guarantee that a subsequent `put()` will not block even if `full()` returns `False`: https://docs.python.org/3/library/queue.html#queue.Queue.full

The more idiomatic/correct pattern is EAFP, where you try to `sliding_window.put_nowait()` and catch a `queue.Full` exception.

I know that this isn't an issue for single-threaded use of `Queue`, but just pointing it out for the follow-up PR.
```python
# Create the initial set of blocks to prefetch.
while not sliding_window.full():
    try:
        sliding_window.put(next(block_ref_iter))
```
This should probably be `sliding_window.put_nowait()`, since we'd rather throw an error if the `Queue` somehow ends up being full (e.g. due to a bug) than hang forever. Same with `sliding_window.put()` and `sliding_window.get()` below.
Good to know! Will keep this in mind for the next PR, since we changed to `collections.deque` in this one.
```python
for block_ref in block_ref_iter:
    yield block_ref
    if clear_block_after_read:
        ray._private.internal_api.free(block_ref, local_only=False)
```
An interesting thing to note is that this block ref clearing assumes that `block_ref` is no longer in use after control is returned to this generator, i.e. it assumes no buffering by downstream generators, which may or may not hold true for future tweaks. We should keep this in mind.
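To make the hazard concrete, here is a hypothetical downstream generator (not in this PR) that buffers blocks in pairs; the buffered ref is retained past the upstream `yield`, so an eager `free()` upstream would invalidate it:

```python
# Illustrative only: a downstream stage that batches blocks in pairs.
# `block` is held in `buffer` after the upstream generator has resumed,
# so upstream must not free block refs immediately after yielding.
def pairwise_batches(block_iter):
    buffer = []
    for block in block_iter:
        buffer.append(block)  # retained past the upstream yield
        if len(buffer) == 2:
            yield tuple(buffer)
            buffer.clear()
    if buffer:
        yield tuple(buffer)  # flush the leftover partial batch

print(list(pairwise_batches(iter([1, 2, 3]))))  # [(1, 2), (3,)]
```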
Nice refactoring, thanks @amogkam!
```python
# Signal to the batcher that there are no more blocks to add.
batcher.done_adding()

# Get any leftover batches in ShufflingBatcher.
```
nit: `ShufflingBatcher` -> `batcher`?
This is specifically `ShufflingBatcher`. A regular `Batcher` will no longer have any full batches at this point, but `ShufflingBatcher` may still have full batches if the shuffle buffer size is larger than the batch size.
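A toy stand-in (not Ray's `ShufflingBatcher` implementation) showing why: with a shuffle buffer minimum of 10 and a batch size of 4, no batch is released while filling, but two full batches remain to drain after `done_adding()`.

```python
import random

class ToyShufflingBatcher:
    """Minimal sketch: holds items until the shuffle buffer exceeds a
    minimum size, then emits shuffled fixed-size batches."""

    def __init__(self, batch_size, shuffle_buffer_min_size):
        self.batch_size = batch_size
        self.min_size = shuffle_buffer_min_size
        self.buffer = []
        self.done = False

    def add(self, item):
        self.buffer.append(item)

    def done_adding(self):
        self.done = True

    def has_batch(self):
        if self.done:
            return len(self.buffer) >= self.batch_size
        # While filling, only release items above the buffer minimum.
        return len(self.buffer) - self.min_size >= self.batch_size

    def next_batch(self):
        random.shuffle(self.buffer)
        batch = self.buffer[: self.batch_size]
        self.buffer = self.buffer[self.batch_size :]
        return batch

batcher = ToyShufflingBatcher(batch_size=4, shuffle_buffer_min_size=10)
for i in range(10):
    batcher.add(i)
print(batcher.has_batch())  # False: buffer not above the minimum yet

batcher.done_adding()
drained = []
while batcher.has_batch():
    drained.append(batcher.next_batch())
print(len(drained))  # 2 full batches of 4; 2 leftover items stay partial
```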
Thanks for the review guys! I updated the PR, please take another look!
LGTM
LGTM overall, the only thing is the `sliding_window.queue` line!
Just FYI, there seem to be some CI test failures (example):
As discussed offline with @clarkzinzow (#30190 (comment)), this PR refactors block batching to follow a chained iterators pattern.
This allows for more flexibility, composability, and better testing of components upstream of Iterator[Block] (formatting, shuffling, batching, prefetching).
This PR only does a refactor and adds tests. There are no API or functionality changes in this PR. This PR also consolidates the `map_batches` and `iter_batches` codepaths.

Why are these changes needed?
Related issue number

Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.