Load a cached dataset as iterable #5481
Can I work on this issue? I am pretty new to this.
Hi! Sure :) you can comment #self-assign on this issue, and I can give you some pointers to get started: one way to approach this would be to implement it similarly to the existing `load_dataset(..., split="train[:10]")` handling. Let me know if you have questions or if I can help :)
This use-case is a bit specific.
This allows using the cache directly. Alternatively, we could add this feature one level below:

```python
builder = load_dataset_builder(...)
builder.download_and_prepare()
ids = builder.as_iterable_dataset()
```
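A minimal pure-Python sketch of that builder-level idea (the shard layout, the `write_fake_cache` helper, and the `as_iterable_dataset` name here are illustrative assumptions, not the actual `datasets` implementation): the point is to iterate over the prepared cache files one at a time instead of loading everything into memory.

```python
import json
import tempfile
from pathlib import Path

def write_fake_cache(cache_dir, num_shards=3, rows_per_shard=2):
    """Write JSON-lines shards standing in for the prepared Arrow files."""
    paths = []
    for i in range(num_shards):
        path = Path(cache_dir) / f"shard-{i}.jsonl"
        with open(path, "w") as f:
            for j in range(rows_per_shard):
                f.write(json.dumps({"id": i * rows_per_shard + j}) + "\n")
        paths.append(path)
    return paths

def as_iterable_dataset(shard_paths):
    """Yield examples lazily, opening one shard at a time."""
    for path in shard_paths:
        with open(path) as f:
            for line in f:
                yield json.loads(line)

with tempfile.TemporaryDirectory() as cache_dir:
    shards = write_fake_cache(cache_dir)
    first_three = []
    for example in as_iterable_dataset(shards):
        first_three.append(example["id"])
        if len(first_three) == 3:
            break

print(first_three)  # [0, 1, 2]
```

Because `as_iterable_dataset` is a generator, breaking after three examples means the later shards are never even opened.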
Yes, I see how this can be useful. Still, it can be tricky to manually find the "initial" version of a dataset in the cache.
I second that. In my last experiment, e.g., Oscar-en used 16GB of RSS RAM per process, and when using multiple processes the host quickly runs out of CPU memory.
This is exactly the need on JeanZay (HPC): I have the dataset cache ready, but the compute node is offline, so making streaming work off a local cache would address that need. Once you have a working POC, I can be the tester.
I like that idea.

That would definitely do the job. I was suggesting a different parameter just to make the difference between the two behaviors explicit.

But I'd be fine with streaming from the cache if the cache is up to date, since it's always faster. We could log a message as usual to make it explicit that the cache is used.
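A sketch of that logging behavior (the `resolve_source` function and the log message are hypothetical, not the actual `datasets` code): prefer the local cache when it is current, and say so explicitly in the logs.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cache_demo")

def resolve_source(streaming, cache_is_up_to_date):
    """Pick a data source; prefer the local cache when it is current."""
    if streaming and cache_is_up_to_date:
        # Make the fallback visible to the user, as suggested above.
        logger.info("Found an up-to-date cached dataset, streaming from cache.")
        return "cache"
    return "remote"

source = resolve_source(streaming=True, cache_is_up_to_date=True)
print(source)  # cache
```

With an explicit log line, a user who expected fresh remote data can immediately see why the cached copy was served instead.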
MosaicML's
Ok! Sounds good then :)
Hi both! It has been a while since my first issue, so I am gonna go for this one!

#self-assign
#self-assign |
I like the idea.
#5821 should be helpful to implement this.
@lhoestq I have just started working on this issue.
@lhoestq Thank you for taking over.
So what's the recommended usage?
If you have multiple Arrow files you can load them using:

```python
from datasets import load_dataset

data_files = {"train": ["path/to/0.arrow", "path/to/1.arrow", ..., "path/to/n.arrow"]}
ds = load_dataset("arrow", data_files=data_files, streaming=True)
```

This is equivalent to calling the builder directly.
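Conceptually, streaming a list of files like that just chains per-shard iterators. A pure-Python sketch (the in-memory "shards" and the `read_shard` helper are stand-ins for the real Arrow reader):

```python
from itertools import chain, islice

def read_shard(shard):
    """Stand-in for lazily reading one Arrow file."""
    yield from shard

# Three "files" of rows; in reality these would be path/to/{0..n}.arrow.
data_files = [[{"id": 0}, {"id": 1}], [{"id": 2}], [{"id": 3}, {"id": 4}]]

# One flat stream over all shards, consumed lazily.
stream = chain.from_iterable(read_shard(s) for s in data_files)

first_ids = [row["id"] for row in islice(stream, 3)]
print(first_ids)  # [0, 1, 2]
```

`islice` takes the first three rows without ever touching the remaining shards, which is exactly the property that makes streaming from local files cheap.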
The idea would be to allow something like streaming from the cache, to be used to train models. It would load an `IterableDataset` from the cached Arrow files.

Cc @stas00

Edit: from the discussions, we may load from cache when `streaming=True`
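That dispatch can be sketched in pure Python (the `load_dataset_sketch` function and the `cached_arrow_files` parameter are hypothetical names for illustration, not the real `load_dataset` signature): when `streaming=True` and the cache already holds the prepared Arrow files, build the iterable from those files instead of re-downloading.

```python
def load_dataset_sketch(name, streaming, cached_arrow_files):
    """Toy dispatcher: prefer cached Arrow files when streaming."""
    if streaming and cached_arrow_files:
        # Cache is ready: stream straight from the local Arrow files.
        return ("iterable_from_cache", cached_arrow_files)
    if streaming:
        # No usable cache: stream from the remote source.
        return ("iterable_from_remote", name)
    # Default map-style dataset, fully prepared on disk.
    return ("map_style", name)

kind, source = load_dataset_sketch("oscar", streaming=True,
                                   cached_arrow_files=["0.arrow"])
print(kind)  # iterable_from_cache
```

This keeps the public API unchanged: the only behavioral difference is where the stream's bytes come from, which is why a log message announcing the cache hit matters.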