
[air] Increase global batch size for air_benchmark_tensorflow_mnist_gpu_4x4 #31402

Merged
merged 1 commit into ray-project:master from air/benchmark-tf-gpu on Jan 3, 2023

Conversation

@krfricke (Contributor) commented on Jan 3, 2023

Signed-off-by: Kai Fricke <kai@anyscale.com>

Why are these changes needed?

The benchmark currently times out because a single run takes over 16 minutes. This is a regression compared to, e.g., the 2.0.0 release, where a run took only 4 minutes.

Upon closer investigation, this seems to be related to severe underutilization of the GPUs. With a small batch size, we are bound by compute/data iteration, and this appears to have regressed in later TensorFlow versions.

To achieve shorter training times, we increase the global batch size from 64 (which is tiny) to 1024. This significantly speeds up training (even though the GPUs are still underutilized, at <10% utilization).
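For illustration, here is a minimal, hypothetical sketch (not the benchmark's actual training code) of how a global batch size of 1024 is consumed under TensorFlow's multi-worker data parallelism. The worker count of 16 is an assumption based on the "4x4" benchmark name, and the model/dataset setup is only representative:

```python
# Hypothetical sketch, assuming 16 data-parallel workers (4x4); not the
# benchmark's actual code, just an illustration of the batch size change.
import tensorflow as tf

NUM_WORKERS = 16          # assumption inferred from the "4x4" benchmark name
GLOBAL_BATCH_SIZE = 1024  # raised from 64 in this PR
per_replica_batch = GLOBAL_BATCH_SIZE // NUM_WORKERS  # 64 examples per worker

# Fashion-MNIST: 60k training examples of shape (28, 28).
(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# With Model.fit under a multi-worker strategy, this is the *global* batch;
# the strategy shards each batch across workers (1024 / 16 = 64 per replica).
dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(60_000)
    .batch(GLOBAL_BATCH_SIZE)
)
model.fit(dataset, epochs=4)
```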

Related issue number

Closes #29922

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

[air] Increase global batch size for air_benchmark_tensorflow_mnist_gpu_4x4

Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke changed the title from "[air] Increase global batch size for air_benchmark_tensorflow_mnist_g…" to "[air] Increase global batch size for air_benchmark_tensorflow_mnist_gpu_4x4" on Jan 3, 2023
@krfricke krfricke requested a review from xwjiang2010 January 3, 2023 16:13
@xwjiang2010 (Contributor) left a comment


Should we then further increase the batch size to get reasonable GPU utilization?

@krfricke krfricke merged commit 2c2c8e6 into ray-project:master Jan 3, 2023
@krfricke krfricke deleted the air/benchmark-tf-gpu branch January 3, 2023 17:15
@krfricke (Contributor, Author) commented on Jan 3, 2023

Increasing the batch size does not increase GPU utilization, likely because the dataset size is limited (Fashion-MNIST only has 60k training examples). To improve utilization, we would have to add more data.
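For context, a rough back-of-the-envelope sketch of the arithmetic behind this comment; the 16-worker count is an assumption inferred from the "4x4" benchmark name, not stated in the PR:

```python
# Illustrative arithmetic only; NUM_WORKERS is an assumption (4x4 -> 16).
DATASET_SIZE = 60_000        # Fashion-MNIST training set size
GLOBAL_BATCH_SIZE = 1024     # value after this PR
NUM_WORKERS = 16             # assumed number of data-parallel workers

steps_per_epoch = DATASET_SIZE // GLOBAL_BATCH_SIZE          # 58 steps
examples_per_worker_step = GLOBAL_BATCH_SIZE // NUM_WORKERS  # 64 examples

# With so few examples per worker per step, each GPU spends most of its time
# on data iteration and collective communication rather than compute, so
# raising the batch size further mostly just shrinks steps_per_epoch.
print(f"{steps_per_epoch} steps/epoch, "
      f"{examples_per_worker_step} examples per worker per step")
```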

AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
[air] Increase global batch size for air_benchmark_tensorflow_mnist_gpu_4x4 (#31402)

Signed-off-by: Kai Fricke <kai@anyscale.com>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
[air] Increase global batch size for air_benchmark_tensorflow_mnist_gpu_4x4 (ray-project#31402)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Development

Successfully merging this pull request may close these issues.

[ml release] air_benchmark_tensorflow_mnist_gpu_4x4 times out
2 participants