FileNotFoundError when updating mtime of files in file cache #1675
@Icemole If this occurs again, kindly post the full log here. :) |
I'm not sure that this really helps us here. But certainly having the full log would be helpful. Btw, why is |
I think the log came from before #1674. |
I encountered the same issue:
Also, in the same log file, cached training HDF files are being deleted by returnn because there is not enough free disk space, later causing an error:
|
Full log |
Hm, so this error keeps occurring. Looking at the following code, one thing stood out to me: returnn/returnn/util/file_cache.py Lines 100 to 134 in cf50800
We only add files to the mtime update thread after we think we have copied them over to the local disk. In the situation where the error reoccurred, the file was 24h old, so its mtime was stale. One process found the file, assumed it would still have access to it, proceeded to update its mtime and then tried to read it. Meanwhile, a different process deleted it, because by the stale mtime the file looked expired, and it did not see the mtime update. I think the problem occurs because we no longer use a cache-dir-wide lock, for performance reasons (which is still the right call). Maybe it's enough to lock globally around the cleanup operation, though; a sketch of that idea follows this comment. In that case, if one process were currently doing cleanup, the other processes would wait, and if their file was deleted, they would know about it. Latest log excerpt:
|
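A minimal sketch of that idea, assuming a POSIX `fcntl.flock` on a lock file inside the cache dir; the names `cleanup_lock`, `cleanup_cache_dir`, and `LOCK_FILE_NAME` are hypothetical, not the actual `file_cache.py` API:

```python
import fcntl
import os
import time
from contextlib import contextmanager

LOCK_FILE_NAME = ".cleanup_lock"  # hypothetical lock file inside the cache dir


@contextmanager
def cleanup_lock(cache_dir: str):
    """Inter-process lock that serializes cache cleanup across processes."""
    lock_path = os.path.join(cache_dir, LOCK_FILE_NAME)
    with open(lock_path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks while another process cleans up
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def cleanup_cache_dir(cache_dir: str, max_age_secs: float = 24 * 60 * 60):
    """Delete stale files, holding the lock for the whole sweep."""
    with cleanup_lock(cache_dir):
        now = time.time()
        for root, _dirs, files in os.walk(cache_dir):
            for name in files:
                if name == LOCK_FILE_NAME:
                    continue
                path = os.path.join(root, name)
                try:
                    if now - os.stat(path).st_mtime > max_age_secs:
                        os.remove(path)
                except FileNotFoundError:
                    pass  # someone else removed it already, that is fine
```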
When locking only the cleanup, we need to make sure all processes try to acquire the cleanup lock every time they copy a file, so they wait for any cleanup happening in the meantime. I think this is fine as a necessary evil: cleanup is not going to happen very often (and then not take very long), so the lock will only be held for very short periods each time. This will certainly be under less contention than locking the full cache directory on every file access (cf. #1548). A sketch of the copy path follows this comment. See #1709. |
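Roughly, the copy path could then look like this (a sketch reusing `cleanup_lock` from above; `copy_file_to_cache` is a hypothetical helper standing in for the actual copy logic):

```python
import os


def get_cached_file(cache_dir: str, src_path: str) -> str:
    """Return the local cached path, re-copying if a cleanup deleted the file."""
    cached = os.path.join(cache_dir, src_path.lstrip("/"))
    with cleanup_lock(cache_dir):
        # Short critical section: either we wait for an in-progress cleanup,
        # or we learn right here that our file was deleted in the meantime.
        if os.path.exists(cached):
            os.utime(cached)  # refresh mtime so cleanup sees the file as fresh
            return cached
    # File is gone (or was never cached): copy it again, then register it
    # with the mtime-update thread as before.
    copy_file_to_cache(src_path, cached)  # hypothetical helper
    return cached
```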
So, PR #1709 has been merged now. As I understand it, this should fix the issue here. Please reopen if you think the issue still persists. If you get similar but different problems, or other problems with the file cache (maybe with DistributeFilesDataset), or you are unsure, please open a new issue. |
Sadly, only part of the error message remains, and the rest of the log is lost:
First, we should find out which file it is failing to touch.
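One way to narrow that down would be to wrap the mtime update so the failing path is logged before the exception propagates; a sketch, assuming the touch boils down to an `os.utime` call (the wrapper name `touch_logged` is made up):

```python
import logging
import os


def touch_logged(path: str):
    """Touch a cached file, logging the exact path if it has disappeared."""
    try:
        os.utime(path)  # sets atime/mtime to "now"
    except FileNotFoundError:
        logging.error("file cache: cannot update mtime, file is gone: %r", path)
        raise
```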