Datasets' cache not re-used #3847
Comments
Actually this is not because of the order of the splits, but most likely because the tokenizer used to process the second split is in a state that has been modified by processing the first split. Therefore, after reloading the first split from the cache, the second split can't be reloaded, since the tokenizer hasn't actually seen the first split (and is therefore considered a different tokenizer). This is a bit trickier to fix; we can explore fixing this next week maybe.
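To illustrate the point, here is a minimal sketch (not from the thread; the checkpoint name is just an example) of how one can check whether a tokenizer hashes differently after it has processed some text. Hasher is the helper that datasets uses to compute fingerprints.

```python
# Minimal sketch: check whether a tokenizer's fingerprint changes after use.
from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # example checkpoint

hash_before = Hasher.hash(tokenizer)
tokenizer("text from the first split")  # may mutate internal state (e.g. added-token bookkeeping)
hash_after = Hasher.hash(tokenizer)

# If the two hashes differ, a map() over the second split can't hit the cache,
# because the map function (which captures the tokenizer) now hashes differently.
print(hash_before == hash_after)
```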
Sorry, didn't have the bandwidth to take care of this yet - will re-assign when I'm diving into it again!
I had this issue with
Before:
After:
Not sure why the second one would work and not the first one - they're basically the same with respect to hashing. In both cases the function is hashed recursively, and therefore the feature_extractor and the tokenizer are hashed the same way. With which tokenizer or feature extractor are you experiencing this behavior? Do you also experience this?
Thanks! Hopefully this can be useful to others, and also to better understand and improve hashing/caching.
I compared the tokenizer items on the first and second runs; there are two different items:
2nd:
On every run of this the special tokens are being added on, and the hash is different on the
Thanks for investigating! Does that mean that
I think one way to fix this would be to make save/from_pretrained deterministic, or make the pickling of
Late to the party, but everything should be deterministic (afaik at least). But
We're computing the hash of the pickle dump of the class, so it should be fine as long as the pickle dump is deterministic.
I've ported wav2vec2.0 fine-tuning into Optimum-Graphcore, which is where I found the issue. The majority of the script was copied from the Transformers version to keep it similar; here is the tokenizer loading section from the source. In the last comment I have two loaded tokenizers, one from run 'N' of the script and one from 'N+1'. I think what's happening is that when you add special tokens (e.g. PAD and UNK), another AddedToken object is appended when the tokenizer is saved, regardless of whether the special tokens are already there. If there were an AddedTokens cleanup at load/save, this could solve the issue, but then is the Trie going to cause the hash to be different? I'm not sure.
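One way to probe that hypothesis is a save/reload round trip and a comparison of the pickle dumps. This is only a sketch: the local directory names are placeholders, and comparing raw pickle bytes is my shortcut, not something from the comment above.

```python
# Sketch: does a save_pretrained / from_pretrained round trip change the tokenizer's
# pickle dump (and therefore its datasets fingerprint)?
import pickle
from transformers import Wav2Vec2CTCTokenizer

tok = Wav2Vec2CTCTokenizer.from_pretrained("./vocab_dir")  # hypothetical local vocab dir
dump_run_n = pickle.dumps(tok)

tok.save_pretrained("./run_n_output")                                   # what run N does at the end
tok_reloaded = Wav2Vec2CTCTokenizer.from_pretrained("./run_n_output")   # what run N+1 does at startup
dump_run_n_plus_1 = pickle.dumps(tok_reloaded)

# If special tokens (PAD/UNK) are re-appended as new AddedToken entries on save,
# the two dumps differ and the cache is invalidated between runs.
print(dump_run_n == dump_run_n_plus_1)
```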
Which Python version are you using? The trie is basically a big dict of dicts, so its deterministic nature depends on the Python version:
Maybe the investigation is actually not finding the right culprit though (the memory id is changed, but
Similar issue found on
Linking in #6179 (comment) with an explanation.
I got the same problem while using Wav2Vec2CTCTokenizer in a distributed experiment (many processes), and found that the problem was localized in the serialization (pickle dump) of the field
Every time you run the Python script (different processes) you get a random result. @lhoestq does it make any sense?
OK, I assume Python's set is just a hash table implementation that internally uses the hash() function. The problem is that Python's hash() for strings is not deterministic across processes. I believe that by setting the environment variable PYTHONHASHSEED to a fixed value, you can force it to be deterministic. I tried it (file
and now every run (
executed as
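For completeness, here is a small standalone script showing the non-determinism (the filename and token values are illustrative, not the ones from the comment above): string hashes are salted per process, which can change set iteration order and therefore the bytes produced by pickle.

```python
# hash_check.py -- illustrative; run it twice to see the values change,
# then run it as `PYTHONHASHSEED=42 python hash_check.py` to make them stable across runs.
import pickle

special_tokens = {"<pad>", "<unk>", "<s>", "</s>", "|"}

print(hash("<pad>"))                       # salted per process unless PYTHONHASHSEED is fixed
print(pickle.dumps(special_tokens).hex())  # sets pickle in iteration order, so this can differ too
```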
Thanks for reporting. I opened a PR here to propose a fix: #6318, and it doesn't require setting PYTHONHASHSEED.
Can you try to install
I patched (*) the file
(*): I am running my experiments inside a Docker container that depends on
Fixed by #6318.
The OP's issue hasn't been fixed; re-opening.
I think the Trie()._tokens of PreTrainedTokenizer needs to be a sorted set, so that the results of
I believe the issue may be linked to tokenization_utils.py#L507, specifically the line where self.tokens_trie.add(token.content) is called. The function _update_trie appears to modify an unordered set. Consequently, this line:
This, in turn, results in inconsistent outputs for both
Alternatively, does the "dumps" function require separate processing for the set?
We did a fix that does sorting whenever we hash sets. The fix is available on
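The idea, in a simplified sketch (this is the principle, not the actual datasets implementation): make the bytes fed to the hasher independent of set iteration order by sorting the elements first.

```python
# Simplified illustration of hashing a set deterministically by sorting its elements.
import hashlib
import pickle

def stable_hash(obj):
    if isinstance(obj, (set, frozenset)):
        # Sort by each element's pickled bytes so mixed-type sets are handled too.
        obj = sorted(obj, key=pickle.dumps)
    return hashlib.md5(pickle.dumps(obj)).hexdigest()

print(stable_hash({"<pad>", "<unk>", "<s>"}))  # same value in every process, every run
```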
Is there a documentation chapter that discusses in which cases you should expect your dataset preprocessing to be cached, including do's and don'ts for the preprocessing functions? I think the Datasets team does an amazing job at tackling this issue on their side, but it would be great to have some guidelines on the user side as well. In our current project we have two cases (text-to-text classification and summarization), and in one of them the cache is sometimes reused when it's not supposed to be, while in the other it's never used at all 😅
You can find some docs here :)
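As a user-side illustration (a sketch, not taken from those docs; the dataset, function, and fingerprint string are placeholders), the main levers are keeping the preprocessing function deterministic and stateless, and, if the automatic hash is still unstable, pinning the fingerprint yourself:

```python
# Sketch of user-side cache control with Dataset.map.
from datasets import load_dataset

ds = load_dataset("glue", "sst2", split="train")  # placeholder dataset

def preprocess(batch):
    # Deterministic, stateless functions are the easiest for datasets to cache reliably.
    return {"length": [len(s) for s in batch["sentence"]]}

processed = ds.map(
    preprocess,
    batched=True,
    load_from_cache_file=True,              # default: reuse the cache when the hash matches
    new_fingerprint="sst2-preprocess-v1",   # optional: fix the fingerprint explicitly
)
```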
Describe the bug
For most tokenizers I have tested (e.g. the RoBERTa tokenizer), the data preprocessing cache is not fully reused in the first few runs, although the .arrow cache files are in the cache directory.
Steps to reproduce the bug
Here is a reproducer. The GPT2 tokenizer works perfectly with caching, but not the RoBERTa tokenizer in this example.
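The original reproducer code isn't preserved in this thread; the sketch below only illustrates its general shape (the dataset and checkpoint names are placeholders), run several times in a row:

```python
# Illustrative reproducer shape: run this script multiple times and watch whether
# every split is loaded from the .arrow cache or gets re-tokenized.
from datasets import load_dataset
from transformers import AutoTokenizer

raw = load_dataset("glue", "sst2")  # placeholder dataset with several splits
tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # swap in "gpt2" to compare

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True)

encoded = raw.map(tokenize, batched=True)
```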
Expected results
No tokenization would be required after the 1st run. Everything should be loaded from the cache.
Actual results
Tokenization for some subsets is repeated on the 2nd and 3rd runs. Starting from the 4th run, everything is loaded from the cache.
Environment info
datasets version: 1.18.3