Improve the suggested num_workers warning #18591
@@ -11,7 +11,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-import multiprocessing
 import os
 from dataclasses import dataclass, field
 from typing import Any, Iterable, Optional, Tuple, Union
@@ -25,6 +24,7 @@
     _replace_dunder_methods,
     _set_sampler_epoch,
     has_iterable_dataset,
+    suggested_max_num_workers,
 )
 from lightning.fabric.utilities.distributed import DistributedSamplerWrapper
 from lightning.pytorch.overrides.distributed import UnrepeatedDistributedSamplerWrapper
@@ -420,11 +420,11 @@ def _check_dataloader_iterable(
     )


-def _worker_check(dataloader: object, using_spawn: bool, name: str) -> None:
+def _worker_check(trainer: "pl.Trainer", using_spawn: bool, dataloader: object, name: str) -> None:
     if not isinstance(dataloader, DataLoader):
         return

-    num_cpus = multiprocessing.cpu_count()
+    upper_bound = suggested_max_num_workers(trainer.num_devices)

     # ddp_spawn + num_workers > 0 don't mix! tell the user
     if dataloader.num_workers > 0 and using_spawn:
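The new upper bound comes from suggested_max_num_workers, imported above alongside the other lightning.fabric.utilities.data helpers. A minimal sketch of the idea, assuming it simply divides the node's visible CPU cores across the local devices so that the workers of all ranks combined do not oversubscribe the machine (the real helper may count CPUs differently, e.g. honoring cgroup limits):

    import os

    def suggested_max_num_workers(local_world_size: int) -> int:
        # Sketch: split the visible CPU cores evenly across the processes
        # (one per device) running on this node.
        if local_world_size < 1:
            raise ValueError(f"`local_world_size` should be >= 1, got {local_world_size}.")
        cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
        return max(1, cpu_count // local_world_size)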
@@ -442,14 +442,11 @@ def _worker_check(dataloader: object, using_spawn: bool, name: str) -> None:
             "strategy=ddp_spawn and num_workers=0 may result in data loading bottlenecks."
             " Consider setting num_workers>0 and persistent_workers=True"
         )

-    elif dataloader.num_workers <= 2 < num_cpus and not using_spawn:
+    elif dataloader.num_workers <= 2 < upper_bound or dataloader.num_workers < 2 <= upper_bound:
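To see exactly when the new check fires, here is a quick standalone enumeration of the chained comparisons (independent of the diff; Python evaluates a <= b < c as a <= b and b < c):

    # When does the warning fire for a given (num_workers, upper_bound) pair?
    for num_workers in (0, 1, 2, 3):
        for upper_bound in (1, 2, 4):
            fires = num_workers <= 2 < upper_bound or num_workers < 2 <= upper_bound
            # e.g. num_workers=2, upper_bound=4 -> fires (first clause)
            #      num_workers=1, upper_bound=2 -> fires (second clause)
            #      num_workers=2, upper_bound=2 -> does not fire
            print(f"num_workers={num_workers}, upper_bound={upper_bound}: warn={fires}")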
Review comment: The first part is pretty much certain to be True on the majority of setups, I think, since the default in many frameworks is usually 2 workers, most modern desktop machines have lots of cpu-cores, and any serious gpu node will have lots of cpu-cores. Would you consider that …

Reply: If I read you correctly, you suggest simplifying the condition by dropping the first part. Today, if we do small-to-medium-size LLM training on small pre-tokenized datasets that fit in a machine, we're not going to require that many workers, so num_workers=2 is not that unreasonable. We could consider dropping that to …

Review comment: We are in agreement, Adrian. Ideally the general solution would be to provide users with a breakdown of timing for … Now you're switching to measuring the real impact of num_workers on the efficiency of the compute, as compared to the guesswork that is implemented now. So, for example, you could warn a user if you see that dl_time > 10% of compute time, or something like that. I don't have a good general recommendation on a threshold percentage, since an LLM/VLM workload can be very different from a single-gpu workload. But the principle is the same: compute is the most expensive resource, especially if it's rented per hour, so the user wants to detect and remove bottlenecks in order to pay less and, of course, reach the finish line earlier.

Now, with the rise of cloud storage, there is a new problem emerging: the data may not even be present on the compute node at the time of compute. So the DataLoader not only might need to do transforms, it also needs to fetch remote data from the cloud. Here 2 workers are almost never enough. During IDEFICS-80B training we had this issue, but we couldn't raise the number of workers, not because we didn't have the spare cores, but because … I hope that WebDataset and other streaming DataLoader solutions will get better over time, but this was a very painful experience 6 months ago, since we did want more workers but couldn't have them.

In other words, load-based heuristics (like the one I proposed above) are needed to really help users optimize their setup, and one-size-fits-all guessing heuristics will break all the more.
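A load-based check along these lines might look roughly like the sketch below. This is hypothetical, not Lightning API; check_dataloader_bottleneck, train_step, the 10% threshold, and the batch count are all stand-ins:

    import time

    def check_dataloader_bottleneck(dataloader, train_step, threshold=0.10, num_batches=50):
        # Hypothetical sketch of the load-based heuristic proposed above:
        # measure how much of each step is spent waiting on the DataLoader
        # instead of guessing from core counts.
        dl_time = compute_time = 0.0
        it = iter(dataloader)
        for _ in range(num_batches):
            t0 = time.perf_counter()
            try:
                batch = next(it)  # time spent waiting for data
            except StopIteration:
                break
            t1 = time.perf_counter()
            train_step(batch)  # forward/backward/optimizer step
            # note: with async GPU execution, synchronize (e.g.
            # torch.cuda.synchronize()) before reading the clock
            t2 = time.perf_counter()
            dl_time += t1 - t0
            compute_time += t2 - t1
        total = dl_time + compute_time
        if total > 0 and dl_time / total > threshold:
            print(
                f"Data loading is {dl_time / total:.0%} of step time;"
                " consider raising `num_workers` or reducing per-batch transforms."
            )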
         # if changed, update the `filterwarnings` snippet in 'speed.html#num-workers'
         rank_zero_warn(
Review comment: Since our bounds now match PyTorch's, should we filter https://github.com/pytorch/pytorch/blob/v2.0.1/torch/utils/data/dataloader.py#L488-L563 so that users don't get the warning twice? Alternatively, we could remove ours.

Reply: No, the warning in the PyTorch DataLoader is the opposite: it triggers if the user selects an excessive number of workers. Our warning is only triggered if there are fewer than our minimum suggested workers (and we suggest no more than the upper limit defined by PyTorch). These warnings no longer overlap with each other (let me know if you find an edge case) as reported in #15572, so this is fixed now.

Review comment: Do you think you could assert that the dataloader warning is not raised?

Reply: Yes, good idea. I added a test.
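For reference, the filterwarnings snippet mentioned in the code comment above lets a user who has deliberately tuned num_workers silence this warning. A sketch; the message pattern is an assumption and must track the actual warning text:

    import warnings

    # User-side filter for the Lightning num_workers warning (the regex is
    # an assumption; it must match the text emitted by `_worker_check`):
    warnings.filterwarnings("ignore", message=".*does not have many workers.*")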
-            f"The dataloader, {name}, does not have many workers which may be a bottleneck."
-            " Consider increasing the value of the `num_workers` argument`"
-            f" (try {num_cpus} which is the number of cpus on this machine)"
-            " in the `DataLoader` init to improve performance.",
+            f"The '{name}' does not have many workers which may be a bottleneck. Consider increasing the value of the"
+            f" `num_workers` argument` to `num_workers={upper_bound}` in the `DataLoader` to improve performance.",
             category=PossibleUserWarning,
         )
@@ -507,9 +504,10 @@ def _process_dataloader(

     # check the workers
     _worker_check(
-        dataloader,
-        isinstance(strategy, DDPStrategy) and strategy._start_method == "spawn",
-        f"{stage.dataloader_prefix}_dataloader",
+        trainer=trainer,
+        using_spawn=isinstance(strategy, DDPStrategy) and strategy._start_method == "spawn",
+        dataloader=dataloader,
+        name=f"{stage.dataloader_prefix}_dataloader",
     )

     # add worker_init_fn for correct seeding in worker processes
Review comment: You need at least one cpu-core per gpu-bound process, so I think (num_cpus - num_gpus) / num_gpus would make a more sensible recommendation: i.e., if you have 48 cpu-cores and 8 gpus, you need at least 8 cores for the main processes, so only 40 remain, and 40/8 = 5. And then there is the OS as well, which needs at least a core or 2 for its own functioning.
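A sketch of this proposal (the reviewer's suggestion, not what the PR implements; the function name and the os_reserved default are assumptions based on the "core or 2" remark):

    import os

    def suggested_num_workers(num_devices: int, os_reserved: int = 2) -> int:
        # Reserve one core per gpu-bound main process plus a couple for the
        # OS, then split the remainder across the devices.
        num_cpus = os.cpu_count() or 1
        remaining = max(0, num_cpus - num_devices - os_reserved)
        return max(1, remaining // num_devices)

    # With 48 cores and 8 gpus: (48 - 8 - 2) // 8 = 4 workers per device
    # (or (48 - 8) // 8 = 5 with os_reserved=0, matching the 40/8 = 5 example).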