[air] Increase global batch size for air_benchmark_tensorflow_mnist_gpu_4x4 (#31402)

The benchmark currently times out because a single run takes over 16 minutes. This is a regression compared to, e.g., the 2.0.0 release, where a run took only 4 minutes.

Upon closer investigation, this seems to be related to severe underutilization of the GPUs. With a small batch size, we are bound by compute/data iteration overhead, and this seems to have regressed in later TensorFlow versions.

To achieve shorter training times, we increase the global batch size from 64 (which is tiny) to 1024. This significantly speeds up training (even though the GPUs are still underutilized, at <10% utilization).

Signed-off-by: Kai Fricke <kai@anyscale.com>
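
For context, a rough back-of-the-envelope sketch of why a global batch size of 64 starves 16 workers. This is not taken from workloads/tensorflow_benchmark.py; the even per-worker split is an assumption based on how data-parallel TensorFlow training typically shards a global batch.

    # Hypothetical illustration: assumes the global batch is sharded evenly across
    # workers, as in typical data-parallel TensorFlow training. Not the benchmark code.
    NUM_WORKERS = 16

    for label, global_batch in (("old", 64), ("new", 1024)):
        per_worker = global_batch // NUM_WORKERS
        print(f"{label}: global batch {global_batch} -> {per_worker} samples per worker per step")

    # old: global batch 64 -> 4 samples per worker per step    (GPU mostly idle, dominated by data iteration)
    # new: global batch 1024 -> 64 samples per worker per step (much better device utilization)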
krfricke authored Jan 3, 2023
1 parent 85f365e commit 2c2c8e6
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion release/release_tests.yaml
@@ -504,7 +504,7 @@

   run:
     timeout: 5400
-    script: python workloads/tensorflow_benchmark.py run --num-runs 3 --num-epochs 200 --num-workers 16 --cpus-per-worker 4 --batch-size 64 --use-gpu
+    script: python workloads/tensorflow_benchmark.py run --num-runs 3 --num-epochs 200 --num-workers 16 --cpus-per-worker 4 --batch-size 1024 --use-gpu

   wait_for_nodes:
     num_nodes: 4
