Dataset load_from_disk is too slow #2547

Open
avacaondata opened this issue Jun 24, 2021 · 3 comments
Labels
bug Something isn't working

Comments


avacaondata commented Jun 24, 2021

@lhoestq

Describe the bug

It's not normal that I have to wait 7-8 hours for a dataset to be loaded from disk when there are no preprocessing steps; it's only loading it with load_from_disk. I have 96 CPUs, but only one is used for this, and its usage is at 1%, which is inefficient. This is happening in the context of language model training, so I'm wasting $100 each time I have to load the dataset from disk again (for example, because the spot instance was stopped by AWS and I need to relaunch it).

Steps to reproduce the bug

Just get the OSCAR dataset in Spanish (around 150 GB), save it to disk, and then try to load the processed dataset, as in the sketch below. The problem is not dependent on the task you're doing; it only depends on the size of the text dataset.
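For reference, a minimal sketch of what I mean (the OSCAR config name and the output path are just examples on my side):

```python
from datasets import load_dataset, load_from_disk

# Download and prepare the Spanish OSCAR split (~150 GB of text).
ds = load_dataset("oscar", "unshuffled_deduplicated_es", split="train")

# Saving is fine; it's the subsequent load that takes 7-8 hours.
ds.save_to_disk("/data/oscar_es")
ds = load_from_disk("/data/oscar_es")  # single process, ~1% CPU usage
```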

Expected results

I expect the dataset to be loaded in a reasonable time, using the whole machine. I mean, if the dataset is stored in multiple files (.arrow) and then loaded from those multiple files, multiprocessing could be used for that, so that so much time isn't wasted.
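Something like this hypothetical sketch of parallel shard reading is what I have in mind (datasets doesn't expose such an API today; the paths and worker count are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor
from glob import glob
import pyarrow as pa

def read_shard(path):
    # Memory-map one .arrow shard and materialize it as a Table.
    with pa.memory_map(path, "r") as source:
        return pa.ipc.open_stream(source).read_all()

# Read all shards in parallel instead of scanning one big file serially.
# (Returning tables from workers copies data between processes; a real
# implementation would need to be smarter, this just illustrates the idea.)
with ProcessPoolExecutor(max_workers=96) as pool:
    tables = list(pool.map(read_shard, sorted(glob("/data/oscar_es/*.arrow"))))
dataset_table = pa.concat_tables(tables)
```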

Environment info

  • datasets version: 1.8.0
  • Platform: Ubuntu 18
  • Python version: 3.8

I've seen you're planning to include a streaming mode for load_dataset, but that only saves the downloading and processing time, which is not the problem for me; it cannot save the pure loading-from-disk time. Therefore, that's not a solution for my use case, or for anyone who wants to use your library to train a language model.
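For clarity, this is the streaming usage I'm referring to (a sketch based on the announced API, not something I've tested):

```python
from datasets import load_dataset

# Streaming skips the download/preprocessing step by iterating over the
# source directly...
ds = load_dataset("oscar", "unshuffled_deduplicated_es",
                  split="train", streaming=True)
for example in ds:
    ...  # ...but it doesn't help once the Arrow files are already on disk
```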

avacaondata added the bug label Jun 24, 2021

lhoestq commented Jun 24, 2021

Hi ! It looks like an issue with the virtual disk you are using.

We load datasets using memory mapping. In general it makes it possible to load very big files instantaneously since it doesn't have to read the file (it just assigns virtual memory to the file on disk).
However, there are known issues with virtual disks (for example on spot instances), where memory mapping does a pass over the entire file, and this takes a while. We are discussing this issue here: #2252

Memory mapping is handled by the OS, so we can't do much about it on our side, though we're still trying to figure out exactly what causes this behavior to see what we can do.
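For context, this is roughly what loading amounts to under the hood (a simplified sketch; the file name is a placeholder):

```python
import pyarrow as pa

# Opening a memory-mapped Arrow file doesn't read its contents: the OS maps
# the file into virtual memory and pages data in lazily on access.
with pa.memory_map("data-00000-of-00001.arrow", "r") as source:
    table = pa.ipc.open_stream(source).read_all()  # near-instant on a normal disk
```

On the affected virtual disks, it's this mapping step itself that ends up touching the whole file.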

avacaondata commented

Okay, that's exactly my case, with spot instances... So is there nothing we can change to be able to load the dataset faster? I mean, what do you do internally at Hugging Face to be able to use spot instances with datasets efficiently?


lhoestq commented Jun 25, 2021

There are no solutions yet, unfortunately.
We're still trying to figure out a way to make loading instantaneous on such disks; I'll keep you posted.
