fix GPU mapping error for Horovod + finetune (#3048)
When fine-tuning with Horovod, the same error as in #2712 is thrown at the place modified in this PR.

It seems `tf.test.is_gpu_available` will try to use all GPUs, but
`tf.config.get_visible_devices` won't.
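
For illustration only (not part of the change), here is a minimal sketch of the two probes, assuming TensorFlow >= 1.14:

```python
import tensorflow as tf

# Only queries the list of devices visible to this process; it does not
# allocate or initialize them, so it is safe under Horovod's one-GPU-per-process mapping.
visible_gpus = tf.config.experimental.get_visible_devices("GPU")
print("visible GPUs:", visible_gpus)

# By contrast, this call creates a device context and may touch every GPU on
# the node, which is what triggers the mapping error described above.
print("is_gpu_available:", tf.test.is_gpu_available())
```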

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
njzjz and pre-commit-ci[bot] authored Dec 11, 2023
1 parent a6f1333 commit 2204ec1
Showing 1 changed file with 8 additions and 1 deletion.
9 changes: 8 additions & 1 deletion deepmd/utils/batch_size.py
@@ -7,8 +7,12 @@
 )
 
 import numpy as np
+from packaging.version import (
+    Version,
+)
 
 from deepmd.env import (
+    TF_VERSION,
     tf,
 )
 from deepmd.utils.errors import (
@@ -59,7 +63,10 @@ def __init__(self, initial_batch_size: int = 1024, factor: float = 2.0) -> None:
             self.minimal_not_working_batch_size = self.maximum_working_batch_size + 1
         else:
             self.maximum_working_batch_size = initial_batch_size
-            if tf.test.is_gpu_available():
+            if (
+                Version(TF_VERSION) >= Version("1.14")
+                and tf.config.experimental.get_visible_devices("GPU")
+            ) or tf.test.is_gpu_available():
                 self.minimal_not_working_batch_size = 2**31
             else:
                 self.minimal_not_working_batch_size = (
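As context (illustrative only, not part of this commit), Horovod's recommended TF2 setup pins one GPU per process with `tf.config.experimental.set_visible_devices`, which is why querying the visible devices respects the per-rank mapping while `tf.test.is_gpu_available` may not:

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    # Restrict this process to the GPU assigned to its local rank,
    # following the pattern from Horovod's documentation.
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# The version-gated check added above now sees only this rank's GPU
# and does not initialize the others.
print(tf.config.experimental.get_visible_devices("GPU"))
```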
