Slow dataloading with big datasets issue persists #2252
Hi ! Sorry to hear that. This may come from another issue then. First can we check if this latency comes from the dataset itself ?

import time
import numpy as np
from datasets import load_from_disk

dataset = load_from_disk(...)  # or from load_dataset...

_start = time.time()
n = 100
for i in np.random.default_rng(42).integers(0, len(dataset), size=n):
    _ = dataset[i]
print(time.time() - _start)

If we see a significant speed difference between your two datasets then it would mean that there's an issue somewhere |
Hi @lhoestq, here is the result. I additionally measured time to
Hmm... I double-checked that it's version 1.6.0. The difference seems quite big; could it be related to the running environment? |
I'm surprised by the speed change. Can you give more details about your dataset ? Also can you explain what parameters you used if you used |
Also, could you give us more info about your environment, like your OS, the version of pyarrow, and whether you're using an HDD or an SSD ? |
Here are some details of my 600GB dataset. This is a dataset AFTER the
Here are some of the parameters to
|
Regarding the environment, I am running the code on a cloud server. Here is some info:
The data is stored on an SSD and it is mounted to the machine via a Network File System. If you could point me to some commands to check the details of the environment, I would be happy to provide the relevant information, @lhoestq ! |
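For reference, a minimal sketch for collecting the version and OS details mentioned above (it assumes the packages are importable in the training environment):

import platform

import datasets
import pyarrow

print("OS:", platform.platform())
print("python:", platform.python_version())
print("datasets:", datasets.__version__)
print("pyarrow:", pyarrow.__version__)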
I am not sure how I could provide you with the reproducible code, since the problem only arises when the data is big. For the moment, I would share the part that I think is relevant. Feel free to ask me for more info.

import datasets
import pytorch_lightning
import torch
import transformers

# `path` and `tok_path` are defined elsewhere in the module
class MyModel(pytorch_lightning.LightningModule):
    def setup(self, stage):
        self.dataset = datasets.load_from_disk(path)
        self.dataset.set_format("torch")

    def train_dataloader(self):
        collate_fn = transformers.DataCollatorForLanguageModeling(
            tokenizer=transformers.ElectraTokenizerFast.from_pretrained(tok_path)
        )
        dataloader = torch.utils.data.DataLoader(
            self.dataset,
            batch_size=32,
            collate_fn=collate_fn,
            num_workers=8,
            pin_memory=True,
        )
        return dataloader
|
Hi ! Sorry for the delay, I haven't had a chance to take a look at this yet. Are you still experiencing this issue ? |
Hi! I just ran the same code with different datasets (one is 60 GB and another 600 GB), and the latter runs much slower. ETA differs by 10x. |
Despite upgrading to datasets 1.6.2, I am still experiencing extremely slow loading (2h00) of a 300 GB local dataset (shard size 1.1 GB) on a local HDD (40 MB/s read speed). This corresponds almost exactly to the total data size divided by the reading speed, implying that the entire dataset is read at each load. Stack details:
|
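As a rough sanity check of those numbers, a back-of-the-envelope sketch (decimal units assumed):

total_bytes = 300e9        # ~300 GB dataset
read_speed = 40e6          # ~40 MB/s HDD read speed
hours = total_bytes / read_speed / 3600
print(hours)               # ~2.08 hours, matching the reported ~2h00 load time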
Hi @BenoitDalFerro how do you load your dataset ? |
Hi @lhoestq thanks for the quick turn-around. Actually I load it the plain vanilla way, without any particular knack or fashion. I tried to look into the documentation for some alternative but couldn't find any
|
I’m facing the same issue when loading a 900GB dataset (stored via |
@tsproisl same here, smells like the same problem. @lhoestq perhaps a solution to detect the bug's location in the code is to track its signature via HD read-usage monitoring; an option is to add a tracking decorator on top of each function and sequentially close all hatches from top to bottom. I suggest pySMART https://pypi.org/project/pySMART/, a Smartmontools implementation. |
I wasn't able to reproduce this on a toy dataset of around 300GB:

import datasets as ds

s = ds.load_dataset("squad", split="train")
s4000 = ds.concatenate_datasets([s] * 4000)
print(ds.utils.size_str(s4000.data.nbytes))  # '295.48 GiB'
s4000.save_to_disk("tmp/squad_4000")

import psutil
import time
from datasets import load_from_disk

disk = "disk0"  # You may have to change your disk here
iocnt1 = psutil.disk_io_counters(perdisk=True)[disk]
time1 = time.time()

s4000_reloaded = load_from_disk("tmp/squad_4000")

time2 = time.time()
iocnt2 = psutil.disk_io_counters(perdisk=True)[disk]

print(f"Blocks read {iocnt2.read_count - iocnt1.read_count}")  # Blocks read 18
print(f"Elapsed time: {time2 - time1:.02f}s")  # Elapsed time: 14.60s

Could you run this on your side and tell me how much time it takes ? Please run this when your machine is idle so that other processes don't interfere. I got these results on my MacBook Pro on datasets 1.6.2 |
@lhoestq thanks, test running as we speak, bear with me |
Just tried on Google Colab and got ~1min for a 15GB dataset (only 200 times SQuAD), while it should be instantaneous. The time is spent reading the Apache Arrow table from the memory mapped file. This might come from a virtual disk management issue. I'm trying to see if I can still speed it up on Colab. |
@lhoestq what is Google Colab's HD read speed? Is it possible to introspect it, incl. the drive type, i.e. SSD or HDD ? |
@lhoestq Thank you! The issue is getting more interesting. The second script is still running, but it's definitely taking much longer than 15 seconds. |
Okay, here’s the output: Also using datasets 1.6.2. Do you have any ideas how to pinpoint the problem? |
The 529.10s was a bit too optimistic. I had cancelled the reading process once before running it completely, so the hard drive cache probably did its work. Here are three consecutive runs: |
@lhoestq
Second test running, bear with me. For Windows users, a slight trick is needed to modify the original "disk0" string: first find the relevant physical-unit key in the dictionary returned by psutil.disk_io_counters(perdisk=True).
In my case it's PhysicalDrive1. Then insert the relevant key's string as the disk variable.
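For reference, a minimal sketch of that lookup (the drive key will differ per machine):

import psutil

# List the per-disk keys; on Windows they look like "PhysicalDrive0", "PhysicalDrive1", ...
print(psutil.disk_io_counters(perdisk=True).keys())

disk = "PhysicalDrive1"  # pick the key of the drive that holds the dataset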
|
@lhoestq
|
@lhoestq any luck ? |
Unfortunately no. Thanks for running the benchmark though, it shows that your machine does a lot of read operations. This is not expected: on other machines it does almost no read operations, which enables a very fast loading. I did some tests on Google Colab and have the same issue. The first time the dataset arrow file is memory mapped always takes a lot of time (the time seems linear with respect to the dataset size). Reloading the dataset is then instantaneous, since the arrow file has already been memory mapped. I also tried using the Arrow IPC file format (see #1933) instead of the current streaming format that we use, but it didn't help. Memory mapping is handled by the OS and depends on the disk you're using, so I'm not sure we can do much about it. I'll continue to investigate anyway, because I still don't know why in some cases it would go through the entire file (high number of read operations). |
@lhoestq thanks for the effort, let's stay in touch |
Just want to say that I am seeing the same issue. Dataset size is 268GB and it takes 3 hours to load |
Cool !
When you process an unshuffled dataset sequentially, it is fast because internally Arrow simply iterates over the record batches. On the other hand, if you use a map-style dataset in PyTorch, then PyTorch samples uniformly from the files on your disk. This is slower for your disk, and it also requires an extra step to get the location of the examples from an index. |
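A small sketch contrasting the two access patterns described above (the path and sample count are placeholders):

import itertools
import time

import numpy as np
from datasets import load_from_disk

dataset = load_from_disk("path/to/dataset")  # placeholder path
n = 10_000

# Sequential access: iterating walks the Arrow record batches in order (disk-friendly)
t0 = time.time()
for example in itertools.islice(dataset, n):
    pass
print("sequential:", time.time() - t0)

# Map-style random access: each dataset[i] may land in a different record batch
t0 = time.time()
for i in np.random.default_rng(0).integers(0, len(dataset), size=n):
    _ = dataset[int(i)]
print("random:", time.time() - t0)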
Now it makes sense. I thought that even a map-style unshuffled dataset would be processed iteratively (I mean from start to end, without any sampling). Great! |
Hey all, I'm facing the same issue with the PILE.
takes ~1h 30min, although I already cached it. Later in my code, I use
Is there any way I can instantiate the (also already cached) tokenized dataset directly without having to wait until |
An alternative is to load the dataset as iterable but this is not implemented yet, see #5481
If you want to skip that step, next time I'd recommend you to save the dataset somewhere after tokenization (e.g. using Though you could look for the cached arrow files in your cache and reload the data from there if you're adventurous. You can use |
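A minimal sketch of the "adventurous" route described above, assuming Dataset.from_file is the intended way to reload a cached Arrow file (the glob pattern is a placeholder for wherever your cache files live):

import glob
import os

from datasets import Dataset, concatenate_datasets

# Reload the already-tokenized cache files directly instead of re-running the processing
pattern = os.path.expanduser("~/.cache/huggingface/datasets/<dataset>/<config>/cache-*.arrow")  # placeholder
parts = [Dataset.from_file(f) for f in sorted(glob.glob(pattern))]
tokenized = concatenate_datasets(parts)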
Thank you so much, you made my day. The adventurous route worked out well; it now only takes ~5 min :) |
@lhoestq I'm also still having this issue where loading is very slow. Do you have any other suggestions on what to try here? What is the best guidance around how to use the library in this case? In case it helps, this is my code for loading and concatenating.
|
Can you try using You'll get an |
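A minimal sketch of loading a saved dataset in an iterable way, assuming the method in question is Dataset.to_iterable_dataset from recent versions of datasets (the path and shard count are placeholders):

from datasets import load_from_disk
from torch.utils.data import DataLoader

dataset = load_from_disk("path/to/dataset")                 # placeholder path
iterable_ds = dataset.to_iterable_dataset(num_shards=128)   # shards allow multiple DataLoader workers
iterable_ds = iterable_ds.with_format("torch")              # yield torch tensors (recent datasets versions)

loader = DataLoader(iterable_ds, batch_size=32, num_workers=8)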
@lhoestq thank you for the suggestion! Can we also use |
Yup using |
@lhoestq Why is memory mapping the full dataset slow? It is said to be memory efficient, so I assumed (if I didn't get it wrong) that mmap should take nearly no time and return immediately, with all data read from disk later when it is actually wanted. So, no pre-fetch here. But that does not seem to be true here. Profiling (in this comment, and also in my profiling) shows |
We memory map the arrow files and read them with pyarrow. However, if you have a slow disk and your dataset has thousands of arrow files with lots of record batches, then it can take a few minutes to run. This is because the record batch metadata has to be read to know the length of the dataset. For iterable datasets we don't need to know the length of the dataset in advance, so we don't need to read the metadata, which makes things faster to load. |
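A minimal sketch of what that metadata read looks like with pyarrow (the file name is a placeholder):

import pyarrow as pa

# Memory-map one Arrow cache file (datasets uses the Arrow streaming format)
source = pa.memory_map("data-00000-of-00100.arrow", "r")  # placeholder file name
reader = pa.ipc.open_stream(source)
table = reader.read_all()  # zero-copy: the data stays memory mapped, but each
                           # record batch's metadata is visited to build the table
print(table.num_rows)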
@lhoestq Thanks for your quick reply and detailed explanation. What proportion of the data does the metadata typically account for? I'd like to give more details on my experiments if you don't mind taking a look. I'm basically calling For a better view of what's in the cache file: I specified the feature of cache file to be It's also interesting that Environment:
|
Thanks for investigating @RmZeta2718 , this is useful information and indeed abnormal. Not sure what would cause that but may I ask you to see if reducing the number of record batches in the mapped dataset helps ? You can try passing |
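A sketch of that kind of change, assuming the parameter in question is writer_batch_size of Dataset.map, which controls how many rows go into each Arrow record batch written to the cache file (the processing function is a placeholder):

# Fewer, larger record batches in the cache file (the default writer_batch_size is 1000)
tokenized = dataset.map(
    tokenize_function,       # placeholder processing function
    batched=True,
    writer_batch_size=100_000,
)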
Sorry for the delay. @lhoestq I found NO performance improvement after increasing it. And it's hard to test, because once the data are loaded they are cached in memory; any consecutive attempt to load the same data returns immediately. Can you guys reproduce the problem? Testing just |
I also faced the slow mmap behaviour when loading a large dataset (~2TB).
Are there any recommendations for the stripe_size or stripe_count of Lustre, or for the shard size, to improve the loading speed? |
Reproducing these issues is not easy on our side, given that they depend on the setup. |
That's helpful information, thanks ! It seems like Lustre doesn't read at full speed with the memory mapping.
I would try increasing the stripe size, in case the memory mapping does too much unnecessary readahead with the default value. |
Hey! This is what I do:

raw_datasets["train"] = load_dataset("openwebtext", split="train[5000:]")
raw_datasets["validation"] = load_dataset("openwebtext", split="train[:5000]")

I mean, there is nothing special in the code, so I believe the slowdown is general and should be reproducible by a plain call to load_dataset. |
I managed to speed up the loading time (on the Lustre file system) by mmapping the arrow shards in parallel (see the script below). Here are some results:

It seems that preloading the files in worker processes (without returning the table) speeds up subsequent loads. Threads are slower to mmap the table but faster to communicate. If this works on other file systems, it may be worth having the option to load the shards in parallel here.

# preload_mmap.py
import datasets
import os
from datasets.table import MemoryMappedTable, concat_tables
import glob
import logging
from time import perf_counter
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool
import sys
import concurrent.futures
import functools
logger = datasets.logging.get_logger(__name__)
datasets.logging.set_verbosity_info()
class catchtime:
    # context manager to measure loading time:
    # https://stackoverflow.com/questions/33987060/python-context-manager-that-measures-time
    def __init__(self, debug_print="Time", logger=logger):
        self.debug_print = debug_print
        self.logger = logger

    def __enter__(self):
        self.start = perf_counter()
        return self

    def __exit__(self, type, value, traceback):
        self.time = perf_counter() - self.start
        readout = f"{self.debug_print}: {self.time:.3f} seconds"
        self.logger.info(readout)


def load_file(f, return_len=False):
    with catchtime(f"Loading {f}", logger=logger):
        ds = MemoryMappedTable.from_file(f)
    if return_len:  # a process pool is slow to serialize the table, so only return its length
        return len(ds)
    return ds


def load_files(
    files, debug_name="dataset_name", num_proc=16, use_threads=False, return_len=False
):
    if use_threads:
        pool_cls = concurrent.futures.ThreadPoolExecutor
        pool_kwargs = {"max_workers": num_proc}
        debug_desc = "threads"
    else:
        pool_cls = Pool
        pool_kwargs = {"processes": num_proc}
        debug_desc = "processes"
    if return_len:
        debug_desc = "(returning len) " + debug_desc
    else:
        debug_desc = "(returning table) " + debug_desc
    with catchtime(
        f"Loading {debug_name} using num of {debug_desc}={num_proc}", logger=logger
    ):
        with pool_cls(**pool_kwargs) as pool:
            result = list(
                pool.map(functools.partial(load_file, return_len=return_len), files)
            )
    return result


def main(use_threads, return_len):
    datasets.logging.set_verbosity_info()
    logging.basicConfig(level=logging.DEBUG, format="%(message)s")
    logger.info("Starting")

    jc = "dataset_name"
    local = "dataset_path"
    split = "train"
    files = glob.glob(os.path.join(local, jc, split, "*.arrow"))
    files = sorted(files)

    ds = load_files(files, jc, use_threads=use_threads, return_len=return_len)
    if not return_len:
        with catchtime("concat_tables", logger=logger):
            ds = concat_tables(ds)
    logger.info("done")


if __name__ == "__main__":
    use_threads = False
    return_len = True
    print(
        "Usage: \n Threads: python preload_mmap.py t \n Threads and concatenate datasets: python preload_mmap.py t c"
        "\n Processes: python preload_mmap.py p \n Processes and concatenate datasets: python preload_mmap.py p c "
    )
    if len(sys.argv) > 1 and sys.argv[1].startswith("t"):
        use_threads = True
    if len(sys.argv) > 2:
        return_len = False
    main(use_threads, return_len)
|
I'm very happy that this approach accelerated my data loading time from 15 minutes to 45 seconds, and my dataset size is on the TB scale. My file system is virtio-fs; I can't know the real underlying file system because my code runs in a virtual machine and I can't access the host machine, but I guess it is a distributed file system similar to Lustre. I don't know why it is so slow, or why multithreading can accelerate it, but what I know is that it really accelerates the load time, and that is vital for me. Thanks for your code. |
Nice to see this method validated on multiple setups ! Would be cool to integrate multithreading when memory mapping the Arrow files, here (datasets/src/datasets/arrow_reader.py, lines 199 to 201 at 796a47e)
and here (for load_from_disk): datasets/src/datasets/arrow_dataset.py, lines 1701 to 1704 at 796a47e.
I can take some time next week to do it, but feel free to open a PR if you want to give it a try |
Threading seems to work faster in arrow_dataset.py. However, changing arrow_reader.py may require changes in the higher-level API. |
Hi,
I reported data fetching being too slow when the data is large (#2210) a couple of weeks ago, and @lhoestq referred me to the fix (#2122).
However, the problem seems to persist. Here are the profiling results:
I see that get_train_batch lags when the data is large. Could this be related to a different issue? I would be happy to provide the necessary information to investigate.