
regression: ocrd_mets.remove_physical_page broken #957

Closed
bertsky opened this issue Nov 29, 2022 · 9 comments

@bertsky
Collaborator

bertsky commented Nov 29, 2022

See https://github.com/hnesk/browse-ocrd/actions/runs/3573770934/jobs/6008222016

It seems that the (new?) implementation is broken:

  File "/home/runner/work/browse-ocrd/browse-ocrd/ocrd_browser/model/document.py", line 383, in delete_page
    self.workspace.mets.remove_physical_page(page_id)
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/ocrd_models/ocrd_mets.py", line 689, in remove_physical_page
    mets_div[0].getparent().remove(mets_div[0])
IndexError: list index out of range

Unfortunately, I cannot pinpoint / dissect, because apparently, @MehmedGIT has effectively erased the history of ocrd_mets.py.

@bertsky
Collaborator Author

bertsky commented Nov 29, 2022

BTW, when I set OCRD_METS_CACHING=True, the ocrd_browser tests yield 3 failures and 4 errors (not just the one above).

@bertsky
Collaborator Author

bertsky commented Nov 29, 2022

OCRD_METS_CACHING=true ocrd workspace -d assets/kant_aufklaerung_1784/data/ list-page
Traceback (most recent call last):
  File "/data/ocr-d/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/decorators.py", line 84, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/ocrd/cli/workspace.py", line 544, in list_pages
    print("\n".join(workspace.mets.physical_pages))
TypeError: sequence item 0: expected str instance, lxml.etree._Element found
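A hypothetical illustration of the TypeError above (simplified with xml.etree instead of lxml, and assuming that the cached physical_pages property yields the cached div elements where callers expect page-ID strings):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<seq><div ID='PHYS_0001'/><div ID='PHYS_0002'/></seq>")
# a page cache keyed by ID, holding the div elements themselves
page_cache = {div.get('ID'): div for div in root}

# Buggy behaviour: returning the cached elements instead of their IDs;
# "\n".join(...) over them raises
# TypeError: sequence item 0: expected str instance, Element found
broken = list(page_cache.values())

# Assumed fix: return the @ID attribute strings
physical_pages = [div.get('ID') for div in page_cache.values()]
listing = "\n".join(physical_pages)   # -> "PHYS_0001\nPHYS_0002"
```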

@bertsky
Collaborator Author

bertsky commented Nov 29, 2022

To sum up: METS caching is seriously broken ATM. But even without it, at least ocrd_mets.remove_physical_page is broken (for the case when the page has already been deleted, so the list returned by xpath is empty, but not None).
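For reference, a minimal sketch of that failure mode and a possible guard (hypothetical reconstruction, not the actual ocrd_mets code; simplified with xml.etree and without METS namespaces):

```python
import xml.etree.ElementTree as ET

def remove_physical_page(tree, page_id):
    """Remove the page div for page_id, tolerating an already-deleted page."""
    seq = tree.find("./div[@TYPE='physSequence']")
    # findall returns an empty list (not None) when the page is missing,
    # so unconditionally indexing [0] raises IndexError; hence the guard.
    divs = seq.findall("div[@ID='%s']" % page_id)
    if divs:
        seq.remove(divs[0])

root = ET.fromstring("<mets><div TYPE='physSequence'>"
                     "<div TYPE='page' ID='PHYS_0001'/></div></mets>")
remove_physical_page(root, 'PHYS_0001')   # removes the page
remove_physical_page(root, 'PHYS_0001')   # now a no-op instead of IndexError
```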

@kba
Member

kba commented Nov 29, 2022

I could fix the problems with OCRD_METS_CACHING and browse-ocrd's test suite except for the test_reorder test. The problem there is that browse-ocrd modifies the underlying XML, but the OcrdMets caching does not know about it. We either need to extend the OcrdMets API to offer the functionality (i.e. reordering of pages) or at least provide a way to let OcrdMets know that it should invalidate and re-fill the cache. @MehmedGIT @hnesk

@bertsky
Collaborator Author

bertsky commented Nov 29, 2022

We either need to extend the OcrdMets API to offer the functionality (i.e. reordering of pages) or at least a way to let OcrdMets know that it should invalidate and re-fill the cache.

I'd prefer the latter. Re-ordering pages is a very specific and rare use-case, while invalidating the cache probably has other applications, too. (For example during error recovery, if a processor crashed on one page, but may have already gotten to update the METS before it did.)

(BTW, if we delegate to a MetsServer, then such reordering should be forbidden altogether.)

@MehmedGIT
Contributor

MehmedGIT commented Nov 30, 2022

Unfortunately, I cannot pinpoint / dissect, because apparently, @MehmedGIT has effectively erased the history of ocrd_mets.py.

Sorry about deleting the history of ocrd_mets.py. I was not aware I did that. I think it's not completely lost and there should be a way to reverse this.

The problem there is that browse-ocrd modifies the underlying XML but the OcrdMets caching does not know about it.

The content of the METS cache is updated through the OcrdMets class method calls, not by changes to the METS file on disk or by direct modification of the underlying element tree in memory. So it is no surprise that the cache breaks: browse-ocrd, as kba pointed out, modifies the element tree directly in some methods, e.g. document.py/reorder(). Hence the test_reorder failure.

However, after checking the source code of reorder(), I am surprised that it breaks the cache, even though it alters the element tree directly. The content of the separate mets:div[@TYPE="page"] elements does not seem to change (or does it?), just their order inside the mets:div[@TYPE="physSequence"] element. That alone should not break the cache, since the cache does not depend on order, only on content.

The cache would break only if the content of the following elements is changed:

  # Cache for the files (mets:file) - two nested dictionaries
  # The outer dictionary's Key: 'fileGrp.USE'
  # The outer dictionary's Value: Inner dictionary
  # The inner dictionary's Key: 'file.ID'
  # The inner dictionary's Value: a 'file' object at some memory location
  self._file_cache = {}

  # Cache for the pages (mets:div)
  # The dictionary's Key: 'div.ID'
  # The dictionary's Value: a 'div' object at some memory location
  self._page_cache = {}

  # Cache for the file pointers (mets:fptr) - two nested dictionaries
  # The outer dictionary's Key: 'div.ID'
  # The outer dictionary's Value: Inner dictionary
  # The inner dictionary's Key: 'fptr.FILEID'
  # The inner dictionary's Value: a 'fptr' object at some memory location
  self._fptr_cache = {}

Since I was not able to run browse-ocrd successfully (neither from source nor via pip install), I could not reproduce the test error to investigate further.

to let OcrdMets know that it should invalidate and re-fill the cache

Agreed. E.g. an OcrdMets.refresh_cache() that forces the cache to iterate the element tree again and re-fill its content accordingly. In addition, we could specify via parameters which cache instance(s) (file cache, page cache, fptr cache) to refresh, for flexibility.
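A sketch of what such a refresh_cache() could look like (this method does not exist in ocrd_models at the time of writing; names, parameters, and the toy data standing in for the lxml tree are all assumptions):

```python
class CachedMets:
    """Toy stand-in for OcrdMets with selectively refreshable caches."""

    def __init__(self, tree):
        self.tree = tree   # stands in for the lxml element tree
        self.refresh_cache()

    def refresh_cache(self, files=True, pages=True, fptrs=True):
        """Invalidate and re-fill the requested caches (all by default)."""
        if files:
            self._file_cache = {use: dict(ids)
                                for use, ids in self.tree['files'].items()}
        if pages:
            self._page_cache = dict(self.tree['pages'])
        if fptrs:
            self._fptr_cache = {div: dict(fp)
                                for div, fp in self.tree['fptrs'].items()}

# Usage: after an out-of-band modification of the tree, one call re-syncs.
tree = {'files': {'OCR-D-IMG': {'f1': 'obj1'}},
        'pages': {'PHYS_0001': 'div1'},
        'fptrs': {'PHYS_0001': {'f1': 'ptr1'}}}
mets = CachedMets(tree)
tree['pages']['PHYS_0002'] = 'div2'           # direct tree modification
mets.refresh_cache(files=False, fptrs=False)  # re-fill only the page cache
```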

(For example during error recovery, if a processor crashed on one page, but may have already gotten to update the METS before it did.)

I am also interested in this topic. By METS do you mean the XML element tree (eTree) in memory or the METS file on disk? I assume it's the latter. However, I cannot see how the cache invalidation could be useful in that scenario.

Consider these steps and make corrections if needed:

  1. The mets file on the disk gets loaded in the memory (eTree)
  2. The 3 cache dictionaries are filled by iterating the relevant parts of the eTree
  3. An ocr-d processor indirectly modifies the content of the eTree by calling some OcrdMets method
  4. The method called in step 3 modifies the eTree
  5. The method called in step 3 modifies the cache instance/s
  6. The processor calls Workspace.save_mets() to store the eTree on the disk (not sure if that happens here or how often? ideally, it should not)
  7. Steps 3-5 (6?) are repeated a few times until the current page is processed
  8. The processor finishes processing the current page / the processor fails to process the current page
  9. The processor calls Workspace.save_mets() to store the eTree on the disk
  10. Steps 3-9 are repeated many times until the OCR-D processor finishes processing all pages
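The numbered steps above, condensed into a toy loop (all names are assumptions, not the real ocrd.Processor API; plain dicts stand in for the eTree and the cache):

```python
def run_processor(mets, cache, page_ids, save):
    """Toy per-page loop mirroring steps 3-10 above."""
    done = []
    for page_id in page_ids:                 # step 10: loop over all pages
        try:
            entry = 'OCR for %s' % page_id   # step 3: processing work
            mets[page_id] = entry            # step 4: mutate the eTree
            cache[page_id] = entry           # step 5: mutate the cache
            done.append(page_id)             # step 8: success
        except Exception:
            pass                             # step 8: failure case
        save(mets)                           # step 9: serialise to disk
    return done

saves = []                                   # records each "save_mets" call
mets, cache = {}, {}
done = run_processor(mets, cache, ['PHYS_0001', 'PHYS_0002'],
                     lambda m: saves.append(dict(m)))
```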

So, AFAIS:

  1. Inconsistency between the XML element tree and the cache could happen only if the processor crashes after completing step 4 and before completing step 5. However, this does/should not affect the mets file on the disk.
  2. If the content of the mets file on the disk changes in step 6 and the processor fails to process the current page in step 8, then we end up with a mets file that is broken/incomplete since the content of the eTree and the cache is broken/incomplete.
  3. If there is no step 6, even if the processor fails to process a page the content of the mets file on the disk will be in a good state and without leftovers from a failed page. The content of the mets file is loaded in the memory and the cache gets filled again.

Did I miss something?

@bertsky
Collaborator Author

bertsky commented Nov 30, 2022

However, after checking the source code of reorder(), I am surprised that it breaks the cache, even though it alters the element tree directly. The content of the separate mets:div[@TYPE="page"] elements does not seem to change (or does it?), just their order inside the mets:div[@TYPE="physSequence"] element. That alone should not break the cache, since the cache does not depend on order, only on content.

Yes, but look again: it does not in fact break the cache validity at all. And that is really the problem: what is actually tested as a result, self.page_ids, delegates to OcrdMets.physical_pages, and thus to your cache. So while self.reorder does reorder the pages in the actual element tree, the cached version still contains the old order (because it has not been invalidated yet).
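A toy demonstration of this staleness (assumptions throughout: xml.etree instead of lxml, and a plain dict in place of the page cache):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<seq><div ID='PHYS_0001'/><div ID='PHYS_0002'/></seq>")
# a page cache keyed by ID, filled once, as OcrdMets does (simplified)
page_cache = {div.get('ID'): div for div in root}

# browse-ocrd-style reorder: mutate the tree directly, bypassing the cache
divs = list(root)
for div in divs:
    root.remove(div)
for div in reversed(divs):
    root.append(div)

tree_order = [div.get('ID') for div in root]   # ['PHYS_0002', 'PHYS_0001']
cached_order = list(page_cache)                # ['PHYS_0001', 'PHYS_0002'] - stale
```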

(For example during error recovery, if a processor crashed on one page, but may have already gotten to update the METS before it did.)

I am also interested in this topic. By METS do you mean the XML element tree (eTree) in the memory or the mets file on the disk? I assume it's the latter. However, I cannot see how the cache invalidation could be useful in that scenario.

No, I did mean the cached state in memory. Let's say we finally get some form of error recovery (e.g. catching anything below Processor.process_page). Now if the processor crashes on one page, but already made its METS action prior to that, and recovery then has the program revert to some dummy or copy behaviour and continue with the next page, then clearly it should also invalidate the cache, so that whatever is now in the tree is also in the cache.

Consider these steps and make corrections if needed:

  1. The mets file on the disk gets loaded in the memory (eTree)
  2. The 3 cache dictionaries are filled by iterating the relevant parts of the eTree
  3. An ocr-d processor indirectly modifies the content of the eTree by calling some OcrdMets method
  4. The method called in step 3 modifies the eTree
  5. The method called in step 3 modifies the cache instance/s
  6. The processor calls Workspace.save_mets() to store the eTree on the disk (not sure if that happens here or how often? ideally, it should not)

Depends. In the processing server, we should be able to keep the METS in memory throughout the whole workflow. But in a standard CLI run, as soon as a processor is finished, it needs to serialise to disk. With page-wise processing it obviously depends on how that is implemented: with the current METS splitting we are in the latter case, whereas with a page-parallel API we are in memory-only territory.

  7. Steps 3-5 (6?) are repeated a few times until the current page is processed
  8. The processor finishes processing the current page / the processor fails to process the current page
  9. The processor calls Workspace.save_mets() to store the eTree on the disk
  10. Steps 3-9 are repeated many times until the OCR-D processor finishes processing all pages

So, AFAIS:

  1. Inconsistency between the XML element tree and the cache could happen only if the processor crashes after completing step 4 and before completing step 5. However, this does/should not affect the mets file on the disk.

Yes. But see above ("continue with next page...")

  2. If the content of the mets file on the disk changes in step 6 and the processor fails to process the current page in step 8, then we end up with a mets file that is broken/incomplete, since the content of the eTree and the cache is broken/incomplete.

Definitely. But error handling could always try to undo the last METS action as part of recovery. (So error handling would involve making "backups" available for rollback, as would a MetsServer naturally.)

  3. If there is no step 6, even if the processor fails to process a page, the content of the mets file on the disk will be in a good state, without leftovers from the failed page. The content of the mets file is loaded into memory and the cache gets filled again.

@bertsky
Collaborator Author

bertsky commented Jan 23, 2023

@kba @MehmedGIT so can we close here?

@MehmedGIT
Contributor

The problem seems to be resolved now. I think yes.

@bertsky bertsky closed this as completed Feb 17, 2023