Improve skip_first_batches method to efficiently support IterableDataset and StatefulDataLoader #2859
Comments
You can ping me (@muellerzr) or @SunMarc on these things; Sylvain hasn't worked at HF for well over a year or two now :)
Yes, we are indeed actively looking into this!
Ran into something annoying while looking at this. Merely importing StatefulDataLoader (i.e. putting the import line in the module) breaks an existing test, and I suspect the import itself is the trigger.
At the moment I don't know why this happens, so I can't tell whether this is some misconfiguration in my local workspace or a bug elsewhere. Assuming there aren't other traps, writing the rest of the feature doesn't feel like too much work; the most immediate solution I could think of (that isn't a big refactor) is to just create some subclasses of the existing dataloader wrappers.
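The subclass idea could look roughly like the sketch below. Every name here (`DataLoaderLike`, `ShardMixin`, `StatefulShard`) is a hypothetical stand-in invented for illustration, not an Accelerate class: a mixin carries the wrapper behavior, a subclass combines it with batch-count bookkeeping and exposes `state_dict`.

```python
class DataLoaderLike:
    """Toy base class standing in for a plain DataLoader."""
    def __init__(self, data):
        self.data = list(data)

    def __iter__(self):
        return iter(self.data)


class ShardMixin:
    """Toy stand-in for wrapper behavior (device placement, etc.)."""
    def on_batch(self, batch):
        # A real wrapper would e.g. move the batch to the right device.
        return batch


class StatefulShard(ShardMixin, DataLoaderLike):
    """Hypothetical subclass combining wrapper behavior with state tracking."""
    def __init__(self, data):
        super().__init__(data)
        self._yielded = 0

    def __iter__(self):
        for batch in super().__iter__():
            self._yielded += 1
            yield self.on_batch(batch)

    def state_dict(self):
        # Expose resumption state in the same spirit as StatefulDataLoader.
        return {"yielded": self._yielded}


loader = StatefulShard([10, 20, 30])
out = list(loader)          # consume everything
state = loader.state_dict() # {"yielded": 3}
```

The appeal of this shape is that the state bookkeeping lives in one place and the wrapper behavior is inherited unchanged; the cost is one extra subclass per wrapper type.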
@byi8220 Hi, seems that
Gah, this feature is getting more complicated every second. We're also at the mercy of how
Thanks for mentioning this. Is it accurate to call this a related but separate issue? Please correct me if I'm wrong; my understanding of the problem scope is that:
But regarding the breaking test I mentioned above, I'm unsure if it is related. The test which breaks when importing StatefulDataLoader is check_seedable_sampler. What is very strange about this breakage is that it happens without any changes to the code at all; simply importing the package is enough to make the test fail.
Also, just to elaborate on the problem with StatefulDataLoader I'm running into, in case it's helpful info:
Here's a crude mock of how I think this behavior works: https://pastebin.com/Sk1DfDYz The problem here is that the wrapper consumes batches from the underlying StatefulDataLoader ahead of the caller, so the inner loader's reported state no longer matches what the training loop has actually seen. One solution could be to implement state_dict on the wrapper itself and correct for the batches it has buffered.
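To make the desync concrete, here is a self-contained mock (separate from the pastebin above; all class names are invented for this sketch). It assumes the wrapper prefetches one batch ahead, which leaves the inner loader's recorded progress ahead of what the caller has consumed:

```python
class StatefulLoaderMock:
    """Toy stand-in for StatefulDataLoader: counts batches it has yielded."""
    def __init__(self, data):
        self.data = list(data)
        self.yielded = 0

    def __iter__(self):
        while self.yielded < len(self.data):
            batch = self.data[self.yielded]
            self.yielded += 1
            yield batch

    def state_dict(self):
        return {"yielded": self.yielded}


class PrefetchingWrapper:
    """Toy wrapper that fetches one batch ahead of what it hands out."""
    def __init__(self, inner):
        self.inner = inner

    def __iter__(self):
        it = iter(self.inner)
        pending = next(it, None)
        while pending is not None:
            nxt = next(it, None)  # inner loader is now one batch ahead
            yield pending
            pending = nxt


wrapped = PrefetchingWrapper(StatefulLoaderMock(["b0", "b1", "b2", "b3"]))
seen = []
for batch in wrapped:
    seen.append(batch)
    if len(seen) == 2:
        break  # simulate checkpointing mid-epoch

# The caller has seen 2 batches, but the inner loader reports 3:
drift = wrapped.inner.state_dict()["yielded"] - len(seen)
```

Checkpointing the inner `state_dict` here would over-skip one batch on resume, which is why a wrapper-level `state_dict` that accounts for the buffered batch seems necessary.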
Took a shot at getting StatefulDataLoader connected to this library in #2895. Seems like way more work than I would have imagined, and admittedly it's experimental and there may be issues.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Don't know if this is closed considering the PR is still open... |
Hi all, thank you for developing this great project.
Currently, the implementation naively iterates through all batches until the specified number have been consumed, which can be extremely slow for very large datasets.
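To make the cost concrete, the naive strategy can be sketched like this (an illustration only, not Accelerate's actual implementation; a plain list of batches stands in for a DataLoader):

```python
def skip_first_batches_naive(dataloader, num_batches):
    """Sketch of the naive strategy: every skipped batch is still fully
    produced and then discarded, so skipping N batches costs O(N)
    data-loading work even though nothing is done with those batches."""
    it = iter(dataloader)
    for _ in range(num_batches):
        next(it, None)  # batch is fetched, collated, and thrown away
    yield from it


# A plain list of batches stands in for a DataLoader here.
batches = [[0, 1], [2, 3], [4, 5], [6, 7]]
remaining = list(skip_first_batches_naive(batches, 2))
```

For a map-style dataset a sampler can jump straight to the right index, but for an `IterableDataset` there is no random access, so this linear scan is the fallback, which is exactly the slowness described above.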
The latest version of the datasets library now supports resumable iterable datasets, and there is now a StatefulDataLoader to allow for efficient resumption of training state.
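For readers unfamiliar with the interface, a resumable iterable exposes `state_dict()` / `load_state_dict()` so iteration can continue from a checkpoint without replaying consumed items. The toy class below mimics that shape; it is illustrative only (deliberately simplistic bookkeeping, not the `datasets` or torchdata implementation):

```python
class ResumableRange:
    """Minimal stand-in for a resumable iterable dataset.

    state_dict/load_state_dict mimic the interface exposed by resumable
    iterable datasets and StatefulDataLoader (names assumed from their
    docs); real implementations track shards, shuffling state, etc.
    """
    def __init__(self, n):
        self.n = n
        self._pos = 0  # how many items have been yielded so far

    def __iter__(self):
        while self._pos < self.n:
            value = self._pos
            self._pos += 1
            yield value

    def state_dict(self):
        return {"pos": self._pos}

    def load_state_dict(self, state):
        self._pos = state["pos"]


ds = ResumableRange(5)
it = iter(ds)
first_two = [next(it), next(it)]  # consume two items
checkpoint = ds.state_dict()      # snapshot progress

resumed = ResumableRange(5)
resumed.load_state_dict(checkpoint)  # jump straight past the consumed items
rest = list(resumed)
```

The key property is that resuming costs O(1) bookkeeping instead of re-iterating everything already consumed, which is what makes it attractive for skip_first_batches.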
I'm wondering if there are any plans to leverage these new features in Accelerate to make skip_first_batches more efficient and compatible with the latest datasets capabilities? If not, are there plans to add support for this in the future?
Efficiently skipping batches on huge datasets would significantly speed up resuming interrupted training runs. Let me know if you need any additional information or have thoughts on the best way to approach this.
Thanks for considering this suggestion!