
[air] Increase global batch size for air_benchmark_tensorflow_mnist_gpu_4x4 #31402

Merged
merged 1 commit into ray-project:master from air/benchmark-tf-gpu on Jan 3, 2023

Conversation

@krfricke (Contributor) commented on Jan 3, 2023

Signed-off-by: Kai Fricke <kai@anyscale.com>

Why are these changes needed?

The benchmark currently times out because a single run takes over 16 minutes. This is a regression compared to, e.g., the 2.0.0 release, where a run took only 4 minutes.

Upon closer investigation, this seems to be related to severe underutilization of the GPUs. With a small batch size, we are bound by compute/data iteration, and this appears to have regressed in later TensorFlow versions.

To achieve shorter training times, we increase the global batch size from 64 (which is tiny) to 1024. This significantly speeds up training (even though the GPUs are still underutilized, at <10% utilization).
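For illustration, here is a minimal, hypothetical sketch (not the benchmark's actual training code) of how a global batch size of 1024 is consumed under TensorFlow's multi-worker data parallelism. The worker count of 16 is an assumption based on the "4x4" benchmark name, and the model/dataset setup is only representative:

```python
# Hypothetical sketch, assuming 16 data-parallel workers (4x4); not the
# benchmark's actual code, just an illustration of the batch size change.
import tensorflow as tf

NUM_WORKERS = 16          # assumption inferred from the "4x4" benchmark name
GLOBAL_BATCH_SIZE = 1024  # raised from 64 in this PR
per_replica_batch = GLOBAL_BATCH_SIZE // NUM_WORKERS  # 64 examples per worker

# Fashion-MNIST: 60k training examples of shape (28, 28).
(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# With Model.fit under a multi-worker strategy, this is the *global* batch;
# the strategy shards each batch across workers (1024 / 16 = 64 per replica).
dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(60_000)
    .batch(GLOBAL_BATCH_SIZE)
)
model.fit(dataset, epochs=4)
```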

Related issue number

Closes #29922

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

[air] Increase global batch size for air_benchmark_tensorflow_mnist_gpu_4x4

Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke changed the title from "[air] Increase global batch size for air_benchmark_tensorflow_mnist_g…" to "[air] Increase global batch size for air_benchmark_tensorflow_mnist_gpu_4x4" on Jan 3, 2023
@krfricke krfricke requested a review from xwjiang2010 January 3, 2023 16:13
@xwjiang2010 (Contributor) left a comment


Should we then further increase the batch size to get reasonable GPU utilization?

@krfricke krfricke merged commit 2c2c8e6 into ray-project:master Jan 3, 2023
@krfricke krfricke deleted the air/benchmark-tf-gpu branch January 3, 2023 17:15
@krfricke (Contributor, Author) commented on Jan 3, 2023

Increasing the batch size does not increase GPU utilization, likely because the dataset size is limited (Fashion-MNIST only has 60k training examples). To improve utilization, we would have to add more data.
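For context, a rough back-of-the-envelope sketch of the arithmetic behind this comment; the 16-worker count is an assumption inferred from the "4x4" benchmark name, not stated in the PR:

```python
# Illustrative arithmetic only; NUM_WORKERS is an assumption (4x4 -> 16).
DATASET_SIZE = 60_000        # Fashion-MNIST training set size
GLOBAL_BATCH_SIZE = 1024     # value after this PR
NUM_WORKERS = 16             # assumed number of data-parallel workers

steps_per_epoch = DATASET_SIZE // GLOBAL_BATCH_SIZE          # 58 steps
examples_per_worker_step = GLOBAL_BATCH_SIZE // NUM_WORKERS  # 64 examples

# With so few examples per worker per step, each GPU spends most of its time
# on data iteration and collective communication rather than compute, so
# raising the batch size further mostly just shrinks steps_per_epoch.
print(f"{steps_per_epoch} steps/epoch, "
      f"{examples_per_worker_step} examples per worker per step")
```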

AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
[air] Increase global batch size for air_benchmark_tensorflow_mnist_gpu_4x4 (#31402)

Signed-off-by: Kai Fricke <kai@anyscale.com>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
[air] Increase global batch size for air_benchmark_tensorflow_mnist_gpu_4x4 (ray-project#31402)

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Development

Successfully merging this pull request may close these issues.

[ml release] air_benchmark_tensorflow_mnist_gpu_4x4 times out
2 participants