Describe the bug
It's not normal that I have to wait 7-8 hours for a dataset to be loaded from disk when there are no preprocessing steps involved; it's only loaded with load_from_disk. I have 96 CPUs, but only one of them is used for this, and its usage sits at 1%, which is very inefficient. This is happening in the context of language model training, so I'm wasting $100 every time I have to load the dataset from disk again (for example because the spot instance was stopped by AWS and I need to relaunch it).
Steps to reproduce the bug
Just get the OSCAR dataset in Spanish (around 150 GB), save it to disk, and then load the processed dataset. It doesn't depend on the task you're doing; it only depends on the size of the text dataset. Roughly, see the snippet below.
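Something like this (a sketch of what I mean; the config name and path are just examples):

```python
from datasets import load_dataset, load_from_disk

# Download and prepare the Spanish OSCAR split (~150 GB of text).
dataset = load_dataset("oscar", "unshuffled_deduplicated_es", split="train")

# Save the prepared dataset to local disk.
dataset.save_to_disk("/data/oscar_es")

# Reloading it later is the slow part: this single call takes 7-8 hours
# and only uses one CPU at ~1%.
dataset = load_from_disk("/data/oscar_es")
```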
Expected results
I expect the dataset to be loaded in a reasonable time, using the whole machine. If the dataset is stored as multiple files (.arrow) and then loaded from those multiple files, multiprocessing could be used for the loading, so that so much time isn't wasted, as sketched below.
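Something along these lines, for example (just a rough sketch with a made-up path and shard count; Dataset.shard, concatenate_datasets and a process pool are only used to illustrate the idea):

```python
from multiprocessing import Pool

from datasets import Dataset, concatenate_datasets, load_from_disk

NUM_SHARDS = 32  # made-up value, e.g. one shard per group of CPUs


def save_sharded(dataset: Dataset, path: str, num_shards: int = NUM_SHARDS) -> None:
    # Write the dataset as several smaller .arrow directories instead of one big file.
    for index in range(num_shards):
        dataset.shard(num_shards=num_shards, index=index).save_to_disk(f"{path}/shard_{index}")


def load_sharded(path: str, num_shards: int = NUM_SHARDS) -> Dataset:
    # Load every shard in a separate process, then put them back together.
    with Pool(num_shards) as pool:
        shards = pool.map(load_from_disk, [f"{path}/shard_{index}" for index in range(num_shards)])
    return concatenate_datasets(shards)
```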
Environment info
datasets version: 1.8.0
Platform: Ubuntu 18
Python version: 3.8
I've seen you're planning to include a streaming mode for load_dataset, but that only saves the downloading and processing time, which isn't the problem for me; it cannot save the pure loading-from-disk time, so it isn't a solution for my use case, or for anyone who wants to use your library to train a language model.
Hi ! It looks like an issue with the virtual disk you are using.
We load datasets using memory mapping. In general it makes it possible to load very big files instantaneously since it doesn't have to read the file (it just assigns virtual memory to the file on disk).
However, there happen to be issues with virtual disks (for example on spot instances), for which memory mapping does a pass over the entire file, and this takes a while. We are discussing this issue here: #2252
Memory mapping is something handled by the OS so we can't do much about it, though we're still trying to figure out what's causing this behavior exactly to see what we can do.
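For reference, memory-mapping an Arrow file yourself looks roughly like this (a simplified pyarrow sketch, not the exact datasets internals; the path is made up):

```python
import pyarrow as pa

# Map the file into virtual memory: no data is read up front, the OS pages it
# in lazily when the table is actually accessed.
with pa.memory_map("/data/oscar_es/dataset.arrow", "r") as source:
    table = pa.ipc.open_stream(source).read_all()

print(table.num_rows)
```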
Okay, that's exactly my case, with spot instances... So there isn't anything we can change to load the dataset faster? I mean, what do you do internally at Hugging Face to use spot instances with datasets efficiently?
@lhoestq