Add Auto Device Map option for BERT Models #26176
Conversation
Thanks a lot for your great contribution! Can you run `make fix-copies` to fix the failing checks? Also, can you confirm the accelerate tests pass?

```
pytest -m accelerate_tests tests/models/bert
```

Thanks!
@younesbelkada, sure! I am travelling to London this weekend and early next week, so I will be able to push the remaining changes and fixes after that. Thanks for taking the time to review this; I will mark the PR as "Ready" once I am done making changes. Cheers!
Hi @younesbelkada, I am facing a rather peculiar issue while testing my changes: the script below errors out.

What's peculiar is that, as soon as I comment out the `device_map='auto'` argument, the error goes away. What I don't understand is the nature of this error, since it is raised when trying to download the model (and not when loading it, which would have been a plausible place for the error to occur). Also, if I download the model with ...

Below is the test script that I am running:

```python
import torch
import random

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('bert-large-uncased', device_map='auto')
print(model(torch.tensor([[random.randint(0, 300) for x in range(512)]])))
```
Hi @tanaymeh, I ran

```python
import torch
import random

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('bert-large-uncased', device_map='auto')
print(model(torch.tensor([[random.randint(0, 300) for x in range(512)]])))
```

with the changes proposed in the PR, and the script worked fine on my end - not sure what is happening. I have also tried to run the accelerate tests and they do seem to fail :/ Let me know if you need any help!
Hi @younesbelkada, I checked line by line, and BERT and RoBERTa have almost exactly the same implementation. I tried debugging, but to no avail; do you suspect any potential causes?
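(For reference, one quick way to compare the two architectures on the attribute that `device_map='auto'` depends on is to inspect `_no_split_modules` directly. This is a minimal sketch; the printed values depend on the installed transformers version.)

```python
from transformers import BertModel, RobertaModel

# device_map="auto" support hinges on the _no_split_modules class attribute;
# it is None for architectures that have not declared accelerate support yet.
print("BertModel:", BertModel._no_split_modules)
print("RobertaModel:", RobertaModel._no_split_modules)
```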
Hmmm, I see. What are the errors you get? Can you share the full traceback?
@younesbelkada Here's the entire error log:

```
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-7.4.2, pluggy-1.0.0
rootdir: /root/new/transformers
configfile: setup.cfg
plugins: hypothesis-6.87.2, anyio-4.0.0
collected 364 items / 361 deselected / 3 selected
tests/models/bert/test_modeling_bert.py FFF [100%]
=================================== FAILURES ===================================
________________________ BertModelTest.test_cpu_offload ________________________
self = <tests.models.bert.test_modeling_bert.BertModelTest testMethod=test_cpu_offload>
@require_accelerate
@mark.accelerate_tests
@require_torch_gpu
def test_cpu_offload(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
if model_class._no_split_modules is None:
continue
inputs_dict_class = self._prepare_for_class(inputs_dict, model_class)
model = model_class(config).eval()
model = model.to(torch_device)
torch.manual_seed(0)
base_output = model(**inputs_dict_class)
model_size = compute_module_sizes(model)[""]
# We test several splits of sizes to make sure it works.
max_gpu_sizes = [int(p * model_size) for p in self.model_split_percents[1:]]
with tempfile.TemporaryDirectory() as tmp_dir:
model.cpu().save_pretrained(tmp_dir)
for max_size in max_gpu_sizes:
max_memory = {0: max_size, "cpu": model_size * 2}
new_model = model_class.from_pretrained(tmp_dir, device_map="auto", max_memory=max_memory)
# Making sure part of the model will actually end up offloaded
self.assertSetEqual(set(new_model.hf_device_map.values()), {0, "cpu"})
> self.check_device_map_is_respected(new_model, new_model.hf_device_map)
tests/test_modeling_common.py:2600:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_modeling_common.py:2529: in check_device_map_is_respected
self.assertEqual(param.device, torch.device("meta"))
E AssertionError: device(type='cpu') != device(type='meta')
----------------------------- Captured stderr call -----------------------------
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
_______________________ BertModelTest.test_disk_offload ________________________
self = <tests.models.bert.test_modeling_bert.BertModelTest testMethod=test_disk_offload>
@require_accelerate
@mark.accelerate_tests
@require_torch_gpu
def test_disk_offload(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
if model_class._no_split_modules is None:
continue
inputs_dict_class = self._prepare_for_class(inputs_dict, model_class)
model = model_class(config).eval()
model = model.to(torch_device)
torch.manual_seed(0)
base_output = model(**inputs_dict_class)
model_size = compute_module_sizes(model)[""]
with tempfile.TemporaryDirectory() as tmp_dir:
model.cpu().save_pretrained(tmp_dir)
with self.assertRaises(ValueError):
max_size = int(self.model_split_percents[0] * model_size)
max_memory = {0: max_size, "cpu": max_size}
# This errors out cause it's missing an offload folder
new_model = model_class.from_pretrained(tmp_dir, device_map="auto", max_memory=max_memory)
max_size = int(self.model_split_percents[1] * model_size)
max_memory = {0: max_size, "cpu": max_size}
new_model = model_class.from_pretrained(
tmp_dir, device_map="auto", max_memory=max_memory, offload_folder=tmp_dir
)
> self.check_device_map_is_respected(new_model, new_model.hf_device_map)
tests/test_modeling_common.py:2565:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_modeling_common.py:2529: in check_device_map_is_respected
self.assertEqual(param.device, torch.device("meta"))
E AssertionError: device(type='cpu') != device(type='meta')
----------------------------- Captured stderr call -----------------------------
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
_____________________ BertModelTest.test_model_parallelism _____________________
self = <tests.models.bert.test_modeling_bert.BertModelTest testMethod=test_model_parallelism>
@require_accelerate
@mark.accelerate_tests
@require_torch_multi_gpu
def test_model_parallelism(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
if model_class._no_split_modules is None:
continue
inputs_dict_class = self._prepare_for_class(inputs_dict, model_class)
model = model_class(config).eval()
model = model.to(torch_device)
torch.manual_seed(0)
base_output = model(**inputs_dict_class)
model_size = compute_module_sizes(model)[""]
# We test several splits of sizes to make sure it works.
max_gpu_sizes = [int(p * model_size) for p in self.model_split_percents[1:]]
with tempfile.TemporaryDirectory() as tmp_dir:
model.cpu().save_pretrained(tmp_dir)
for max_size in max_gpu_sizes:
max_memory = {0: max_size, 1: model_size * 2, "cpu": model_size * 2}
new_model = model_class.from_pretrained(tmp_dir, device_map="auto", max_memory=max_memory)
# Making sure part of the model will actually end up offloaded
> self.assertSetEqual(set(new_model.hf_device_map.values()), {0, 1})
E AssertionError: Items in the second set but not the first:
E 0
tests/test_modeling_common.py:2634: AssertionError
----------------------------- Captured stderr call -----------------------------
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
=============================== warnings summary ===============================
../../../opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py:1373
/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py:1373: PytestConfigWarning: Unknown config option: doctest_glob
self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")
tests/test_modeling_common.py:2746
/root/new/transformers/tests/test_modeling_common.py:2746: PytestUnknownMarkWarning: Unknown pytest.mark.flash_attn_test - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
@mark.flash_attn_test
tests/test_modeling_common.py:2773
/root/new/transformers/tests/test_modeling_common.py:2773: PytestUnknownMarkWarning: Unknown pytest.mark.flash_attn_test - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
@mark.flash_attn_test
tests/test_modeling_common.py:2815
/root/new/transformers/tests/test_modeling_common.py:2815: PytestUnknownMarkWarning: Unknown pytest.mark.flash_attn_test - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
@mark.flash_attn_test
tests/test_modeling_common.py:2857
/root/new/transformers/tests/test_modeling_common.py:2857: PytestUnknownMarkWarning: Unknown pytest.mark.flash_attn_test - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
@mark.flash_attn_test
tests/test_modeling_common.py:2894
/root/new/transformers/tests/test_modeling_common.py:2894: PytestUnknownMarkWarning: Unknown pytest.mark.flash_attn_test - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
@mark.flash_attn_test
tests/test_modeling_common.py:2931
/root/new/transformers/tests/test_modeling_common.py:2931: PytestUnknownMarkWarning: Unknown pytest.mark.flash_attn_test - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
@mark.flash_attn_test
../../../opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py:28
/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py:28: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
from pkg_resources import packaging # type: ignore[attr-defined]
../../../opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py:2871
../../../opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py:2871
/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('ruamel')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/models/bert/test_modeling_bert.py::BertModelTest::test_cpu_offload
FAILED tests/models/bert/test_modeling_bert.py::BertModelTest::test_disk_offload
FAILED tests/models/bert/test_modeling_bert.py::BertModelTest::test_model_parallelism
================ 3 failed, 361 deselected, 10 warnings in 5.16s ================
```
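For context on the `device(type='cpu') != device(type='meta')` assertions above: when accelerate offloads a weight to CPU or disk, the module is left holding a placeholder tensor on PyTorch's `meta` device (which allocates no storage), and `check_device_map_is_respected` verifies exactly that. A minimal sketch of the behaviour being asserted:

```python
import torch

# A tensor on the meta device carries shape/dtype metadata but no storage;
# accelerate parks offloaded parameters there until they are actually needed.
t = torch.empty(2, 2, device="meta")
print(t.device)                          # device(type='meta')
print(t.device == torch.device("meta"))  # True
```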
Hi @younesbelkada, do you have any updates on the issue?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @amyeroberts! Thanks!
@tanaymeh The failures of these tests indicate that the model weights aren't being distributed across devices as expected, e.g. for `test_model_parallelism` the device map should place weights on both GPU 0 and GPU 1, but device 0 never appears in `hf_device_map`.
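(With the PR's changes in place and accelerate installed, one quick way to inspect the placement when debugging this is to print the device map after loading; stock BERT would raise the `does not support device_map` error instead. A sketch:)

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", device_map="auto"
)
# Maps submodule names to devices, e.g. {"bert.embeddings": 0, ...};
# the exact split depends on the available devices and memory.
print(model.hf_device_map)
```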
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I am seeing an error, please help: `ValueError: BertLMHeadModel does not support device_map='auto'`
Hi @bp020108, you're seeing this error as the BERT model classes don't define the `_no_split_modules` attribute that `device_map='auto'` relies on; adding that support is what this PR set out to do.
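(Until such support lands, a safe workaround is to load the model without a device map and move it to a single device yourself. A minimal sketch:)

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load normally (no device_map) and place the whole model on one device.
model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```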
What does this PR do?
This PR adds `device_map="auto"` support for BERT models, for ease of multi-GPU training.

Fixes #25296
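In practice, enabling `device_map="auto"` for an architecture comes down to declaring which submodules accelerate must keep together on one device. A hypothetical sketch of the kind of change involved (the module names below are illustrative assumptions, not the PR's actual diff):

```python
from transformers import PreTrainedModel, BertConfig

class BertPreTrainedModel(PreTrainedModel):
    config_class = BertConfig
    base_model_prefix = "bert"
    supports_gradient_checkpointing = True
    # Modules listed here are never split across devices by accelerate's
    # device_map="auto" planner; embeddings and whole encoder layers are
    # typical choices.
    _no_split_modules = ["BertEmbeddings", "BertLayer"]
```

With that attribute set, accelerate can compute a per-module device map instead of raising the `does not support device_map` error.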
Who can review?
@younesbelkada