-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
regression: ocrd_mets.remove_physical_page broken #957
Comments
BTW, when I set |
|
To sum up: METS caching is seriously broken ATM. But even without it, at least |
I could fix the problems with |
I'd prefer the latter. Re-ordering pages is a very specific and rare use-case, while invalidating the cache probably has other applications, too. (For example during error recovery, if a processor crashed on one page, but may have already gotten to update the METS before it did.) (BTW, if we delegate to a MetsServer, then such reordering should be forbidden altogether.) |
Sorry about deleting the history of
The content of the mets cache updates according to the However, after checking the source code of The cache would break only if the content of the following elements is changed: # Cache for the files (mets:file) - two nested dictionaries
# The outer dictionary's Key: 'fileGrp.USE'
# The outer dictionary's Value: Inner dictionary
# The inner dictionary's Key: 'file.ID'
# The inner dictionary's Value: a 'file' object at some memory location
self._file_cache = {}
# Cache for the pages (mets:div)
# The dictionary's Key: 'div.ID'
# The dictionary's Value: a 'div' object at some memory location
self._page_cache = {}
# Cache for the file pointers (mets:fptr) - two nested dictionaries
# The outer dictionary's Key: 'div.ID'
# The outer dictionary's Value: Inner dictionary
# The inner dictionary's Key: 'fptr.FILEID'
# The inner dictionary's Value: a 'fptr' object at some memory location
self._fptr_cache = {} Since I was not able to run the
Agree. E.g.,
I am also interested in this topic. By Consider these steps and make corrections if needed:
So, AFAIS:
Did I miss something? |
Yes, but look again: it does not in fact break the cache validity at all. And that's the problem really: What is actually tested as a result,
No, I did mean the cached state in memory. Let's say we finally get to have some form of error recovery (e.g. catching anything below
Depends. In the processing server, we should be able to have the METS only in memory throughout the whole workflow. But in a standard CLI run, as soon as a processor is finished, it needs to serialise to disk. With page-wise processing it obviously depends on just how that is implemented: With the current METS splitting, we are in the latter case, whereas with page-parallel API we are in memory-only territory.
Yes. But see above ("continue with next page...")
Definitely. But error handling could always try to undo the last METS action as part of recovery. (So error handling would involve making "backups" available for rollback, as would a MetsServer naturally.)
|
@kba @MehmedGIT so can we close here? |
The problem seems to be resolved now. I think yes. |
See https://github.com/hnesk/browse-ocrd/actions/runs/3573770934/jobs/6008222016
It seems that the (new?) implementation is broken:
Unfortunately, I cannot pinpoint / dissect, because apparently, @MehmedGIT has effectively erased the history of ocrd_mets.py.
The text was updated successfully, but these errors were encountered: