Load a cached dataset as iterable #5481

Open
lhoestq opened this issue Jan 27, 2023 · 18 comments
Labels
enhancement (New feature or request), good second issue (Issues a bit more difficult than "Good First" issues)

Comments

@lhoestq
Member

lhoestq commented Jan 27, 2023

The idea would be to allow something like

ds = load_dataset("c4", "en", as_iterable=True)

to be used to train models. It would load an IterableDataset from the cached Arrow files.

Cc @stas00

Edit: from the discussion below, we may load from the cache when streaming=True.

lhoestq added the enhancement (New feature or request) label Jan 27, 2023
@jalajk24

Can I work on this issue? I am pretty new to this.

@lhoestq
Member Author

lhoestq commented Jan 30, 2023

Hi ! Sure :) you can comment #self-assign to assign yourself to this issue.

I can give you some pointers to get started:

load_dataset works roughly this way:

  1. it instantiates a dataset builder using load_dataset_builder()
  2. the builder downloads and prepares the dataset as Arrow files in the cache using download_and_prepare()
  3. the builder returns a Dataset object with as_dataset()

One way to approach this would be to implement as_iterable_dataset() in builder.py.

And similarly to as_dataset(), you can use the ArrowReader. It has a get_file_instructions() method that can be helpful: it gives you the files to read as a list of dictionaries with these keys: filename, skip, and take.

The skip and take arguments are used in case the user wants to load a subset of the dataset, e.g.

load_dataset(..., split="train[:10]")
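
To make the flow above concrete, here is a minimal sketch of the existing path and the missing piece (as_iterable_dataset() is what this issue proposes, not an existing method; the dataset name is just an example):

from datasets import load_dataset_builder

builder = load_dataset_builder("c4", "en")
builder.download_and_prepare()           # writes the Arrow files to the cache
ds = builder.as_dataset(split="train")   # existing: returns a map-style Dataset
# ids = builder.as_iterable_dataset()    # proposed: stream from the cached Arrow files instead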

Let me know if you have questions or if I can help :)

@mariosasko
Collaborator

This use-case is a bit specific, and load_dataset already has enough parameters (plus, streaming=True also returns an iterable dataset, so we would have to explain the difference), so I think it would be better to add IterableDataset.from_file to the API (more flexible and aligned with the goal from #3444) instead.

@lhoestq
Member Author

lhoestq commented Jan 31, 2023

This use-case is a bit specific

This allows using datasets for large-scale training, where map-style datasets are too slow and use too much memory in PyTorch. So I would still consider adding it.

Alternatively, we could add this feature one level below:

builder = load_dataset_builder(...)
builder.download_and_prepare()
ids = builder.as_iterable_dataset()

@mariosasko
Collaborator

Yes, I see how this can be useful. Still, I think Dataset.to_iterable + IterableDataset.from_file would be much cleaner in terms of the API design (and more flexible since load_dataset can only access the "initial" (unprocessed) version of a dataset).

And since it can be tricky to manually find the "initial" version of a dataset in the cache, maybe load_dataset could return an iterable dataset streamed from the cache if streaming=True and the cache is up-to-date.
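
For concreteness, a rough sketch of what those proposals could look like (Dataset.to_iterable and IterableDataset.from_file are the names being proposed in this discussion, not existing methods at this point, and the file path is a placeholder):

from datasets import IterableDataset, load_dataset

ds = load_dataset("c4", "en", split="train")                  # map-style Dataset backed by the cached Arrow files
ids = ds.to_iterable()                                        # proposed: wrap the existing Dataset as an IterableDataset
ids = IterableDataset.from_file("path/to/cache/file.arrow")   # proposed: stream directly from one cached Arrow file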

@stas00
Contributor

stas00 commented Jan 31, 2023

This allows using datasets for large-scale training, where map-style datasets are too slow and use too much memory in PyTorch.

I second that. For example, in my last experiment Oscar-en used 16GB of RSS RAM per process, and when using multiple processes the host quickly ran out of CPU memory.

@stas00
Contributor

stas00 commented Jan 31, 2023

And since it can be tricky to manually find the "initial" version of a dataset in the cache, maybe load_dataset could return an iterable dataset streamed from the cache if streaming=True and the cache is up-to-date.

This is exactly the need on JeanZay (HPC) - I have the dataset cache ready, but the compute node is offline, so making streaming work off a local cache would address that need.

Once you have a working POC, I can be the tester.

@lhoestq
Member Author

lhoestq commented Feb 1, 2023

Yes, I see how this can be useful. Still, I think Dataset.to_iterable + IterableDataset.from_file would be much cleaner in terms of the API design (and more flexible since load_dataset can only access the "initial" (unprocessed) version of a dataset).

I like IterableDataset.from_file as well. On the other hand, Dataset.to_iterable first requires loading a Dataset object, which can take time depending on your hardware and your dataset size (sometimes 1h+).

And since it can be tricky to manually find the "initial" version of a dataset in the cache, maybe load_dataset could return an iterable dataset streamed from the cache if streaming=True and the cache is up-to-date.

That would definitely do the job. I was suggesting a different parameter just to make explicit the difference between

  • streaming from the raw data
  • streaming from the local cache

But I'd be fine with streaming from the cache if the cache is up-to-date, since it's always faster. We could log a message as usual to make it explicit that the cache is being used.

@mariosasko
Collaborator

I was suggesting a different parameter just to make explicit the difference between

MosaicML's streaming library does the same (tries to stream from the local cache if possible), so logging a message should be explicit enough :).

@lhoestq
Member Author

lhoestq commented Feb 1, 2023

Ok ! Sounds good then :)

lhoestq added the good second issue (Issues a bit more difficult than "Good First" issues) label Feb 1, 2023
@hamid-vakilzadeh
Contributor

Hi both! It has been a while since my first issue, so I am gonna go for this one! #self-assign

@hamid-vakilzadeh
Contributor

#self-assign

@mariusz-jachimowicz-83
Contributor

I like the idea of IterableDataset.from_file.

@lhoestq
Member Author

lhoestq commented May 15, 2023

#5821 should be helpful for implementing IterableDataset.from_file, since it defines a new ArrowExamplesIterable that takes an Arrow tables generator function (e.g. from a file) and can be used in an IterableDataset.
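
A very rough sketch of how from_file could build on that (this assumes ArrowExamplesIterable takes a generator function plus its kwargs, and that the cached .arrow files are readable as an Arrow IPC stream; the actual signatures may differ):

import pyarrow as pa

from datasets.iterable_dataset import ArrowExamplesIterable, IterableDataset

def _generate_tables(filename):
    # yield (key, pyarrow.Table) pairs from a memory-mapped cached Arrow file
    with pa.memory_map(filename) as source:
        for i, batch in enumerate(pa.ipc.open_stream(source)):
            yield i, pa.Table.from_batches([batch])

def from_file(filename):
    ex_iterable = ArrowExamplesIterable(_generate_tables, kwargs={"filename": filename})
    return IterableDataset(ex_iterable)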

@mariusz-jachimowicz-83
Contributor

@lhoestq I have just started working on this issue.

hamid-vakilzadeh removed their assignment May 15, 2023
@hamid-vakilzadeh
Contributor

@lhoestq Thank you for taking over.

@npuichigo
Contributor

npuichigo commented Jun 23, 2023

So what's the recommended usage of IterableDataset.from_file versus load_dataset? What if I have multiple Arrow files? load_dataset is often convenient for handling that.

@lhoestq
Member Author

lhoestq commented Jun 26, 2023

If you have multiple Arrow files you can load them using

from datasets import load_dataset

data_files = {"train": ["path/to/0.arrow", "path/to/1.arrow", ..., "path/to/n.arrow"]}

ds = load_dataset("arrow", data_files=data_files, streaming=True)

This is equivalent to calling IterableDataset.from_file and concatenate_datasets.
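
For reference, a sketch of that equivalent (the paths are placeholders):

from datasets import IterableDataset, concatenate_datasets

paths = ["path/to/0.arrow", "path/to/1.arrow"]  # one entry per Arrow file
ds = concatenate_datasets([IterableDataset.from_file(path) for path in paths])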
