Dataset load_from_disk is too slow #2547

Open
avacaondata opened this issue Jun 24, 2021 · 3 comments
Labels
bug Something isn't working

Comments


avacaondata commented Jun 24, 2021

@lhoestq

Describe the bug

It's not normal that I have to wait 7-8 hours for a dataset to be loaded from disk when there are no preprocessing steps; it's only loading it with load_from_disk. I have 96 CPUs, but only one is used for this, and its usage is at 1%, which is inefficient. This is happening in the context of language model training, so I'm wasting $100 each time I have to load the dataset from disk again (for example, because the spot instance was stopped by AWS and I need to relaunch it).

Steps to reproduce the bug

Just get the OSCAR dataset in Spanish (around 150 GB), save it to disk, and then try to load the processed dataset, as in the sketch below. The problem is not dependent on the task you're doing; it only depends on the size of the text dataset.
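For reference, a minimal sketch of what I mean (the OSCAR config name and the output path are just examples on my side):

```python
from datasets import load_dataset, load_from_disk

# Download and prepare the Spanish OSCAR split (~150 GB of text).
ds = load_dataset("oscar", "unshuffled_deduplicated_es", split="train")

# Saving is fine; it's the subsequent load that takes 7-8 hours.
ds.save_to_disk("/data/oscar_es")
ds = load_from_disk("/data/oscar_es")  # single process, ~1% CPU usage
```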

Expected results

I expect the dataset to be loaded in a reasonable time, using the whole machine. I mean, if the dataset is stored in multiple files (.arrow) and then loaded from those multiple files, multiprocessing could be used for that, so that so much time isn't wasted.
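Something like this hypothetical sketch of parallel shard reading is what I have in mind (datasets doesn't expose such an API today; the paths and worker count are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor
from glob import glob
import pyarrow as pa

def read_shard(path):
    # Memory-map one .arrow shard and materialize it as a Table.
    with pa.memory_map(path, "r") as source:
        return pa.ipc.open_stream(source).read_all()

# Read all shards in parallel instead of scanning one big file serially.
# (Returning tables from workers copies data between processes; a real
# implementation would need to be smarter, this just illustrates the idea.)
with ProcessPoolExecutor(max_workers=96) as pool:
    tables = list(pool.map(read_shard, sorted(glob("/data/oscar_es/*.arrow"))))
dataset_table = pa.concat_tables(tables)
```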

Environment info

  • datasets version: 1.8.0
  • Platform: Ubuntu 18
  • Python version: 3.8

I've seen you're planning to include a streaming mode for load_dataset, but that only saves the downloading and processing time, which is not the problem for me; it cannot save the pure loading-from-disk time. Therefore, that's not a solution for my use case, or for anyone who wants to use your library to train a language model.
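For clarity, this is the streaming usage I'm referring to (a sketch based on the announced API, not something I've tested):

```python
from datasets import load_dataset

# Streaming skips the download/preprocessing step by iterating over the
# source directly...
ds = load_dataset("oscar", "unshuffled_deduplicated_es",
                  split="train", streaming=True)
for example in ds:
    ...  # ...but it doesn't help once the Arrow files are already on disk
```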

avacaondata added the bug label Jun 24, 2021

lhoestq commented Jun 24, 2021

Hi ! It looks like an issue with the virtual disk you are using.

We load datasets using memory mapping. In general it makes it possible to load very big files instantaneously since it doesn't have to read the file (it just assigns virtual memory to the file on disk).
However, there are known issues with virtual disks (for example on spot instances), where memory mapping does a pass over the entire file, and this takes a while. We are discussing this issue here: #2252

Memory mapping is handled by the OS, so we can't do much about it on our side, though we're still trying to figure out exactly what causes this behavior to see what we can do.
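For context, this is roughly what loading amounts to under the hood (a simplified sketch; the file name is a placeholder):

```python
import pyarrow as pa

# Opening a memory-mapped Arrow file doesn't read its contents: the OS maps
# the file into virtual memory and pages data in lazily on access.
with pa.memory_map("data-00000-of-00001.arrow", "r") as source:
    table = pa.ipc.open_stream(source).read_all()  # near-instant on a normal disk
```

On the affected virtual disks, it's this mapping step itself that ends up touching the whole file.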

avacaondata commented

Okay, that's exactly my case, with spot instances... So is there nothing we can change to be able to load the dataset faster? I mean, what do you do internally at Hugging Face to be able to use spot instances with datasets efficiently?


lhoestq commented Jun 25, 2021

There are no solutions yet, unfortunately.
We're still trying to figure out a way to make loading instantaneous on such disks; I'll keep you posted.
