
regression: ocrd_mets.remove_physical_page broken #957

Closed
bertsky opened this issue Nov 29, 2022 · 9 comments

@bertsky
Collaborator

bertsky commented Nov 29, 2022

See https://github.com/hnesk/browse-ocrd/actions/runs/3573770934/jobs/6008222016

It seems that the (new?) implementation is broken:

  File "/home/runner/work/browse-ocrd/browse-ocrd/ocrd_browser/model/document.py", line 383, in delete_page
    self.workspace.mets.remove_physical_page(page_id)
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/ocrd_models/ocrd_mets.py", line 689, in remove_physical_page
    mets_div[0].getparent().remove(mets_div[0])
IndexError: list index out of range

Unfortunately, I cannot pinpoint / dissect, because apparently, @MehmedGIT has effectively erased the history of ocrd_mets.py.

@bertsky
Collaborator Author

bertsky commented Nov 29, 2022

BTW, when I set OCRD_METS_CACHING=True, the ocrd_browser tests yield 3 failures and 4 errors (not just the one above).

@bertsky
Collaborator Author

bertsky commented Nov 29, 2022

OCRD_METS_CACHING=true ocrd workspace -d assets/kant_aufklaerung_1784/data/ list-page
Traceback (most recent call last):
  File "/data/ocr-d/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/decorators.py", line 84, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/ocrd/cli/workspace.py", line 544, in list_pages
    print("\n".join(workspace.mets.physical_pages))
TypeError: sequence item 0: expected str instance, lxml.etree._Element found
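A hypothetical illustration of the TypeError above (simplified with xml.etree instead of lxml, and assuming that the cached physical_pages property yields the cached div elements where callers expect page-ID strings):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<seq><div ID='PHYS_0001'/><div ID='PHYS_0002'/></seq>")
# a page cache keyed by ID, holding the div elements themselves
page_cache = {div.get('ID'): div for div in root}

# Buggy behaviour: returning the cached elements instead of their IDs;
# "\n".join(...) over them raises
# TypeError: sequence item 0: expected str instance, Element found
broken = list(page_cache.values())

# Assumed fix: return the @ID attribute strings
physical_pages = [div.get('ID') for div in page_cache.values()]
listing = "\n".join(physical_pages)   # -> "PHYS_0001\nPHYS_0002"
```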

@bertsky
Collaborator Author

bertsky commented Nov 29, 2022

To sum up: METS caching is seriously broken ATM. But even without it, at least ocrd_mets.remove_physical_page is broken (for the case when the page has already been deleted, so the list returned by xpath is empty, but not None).
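For reference, a minimal sketch of that failure mode and a possible guard (hypothetical reconstruction, not the actual ocrd_mets code; simplified with xml.etree and without METS namespaces):

```python
import xml.etree.ElementTree as ET

def remove_physical_page(tree, page_id):
    """Remove the page div for page_id, tolerating an already-deleted page."""
    seq = tree.find("./div[@TYPE='physSequence']")
    # findall returns an empty list (not None) when the page is missing,
    # so unconditionally indexing [0] raises IndexError; hence the guard.
    divs = seq.findall("div[@ID='%s']" % page_id)
    if divs:
        seq.remove(divs[0])

root = ET.fromstring("<mets><div TYPE='physSequence'>"
                     "<div TYPE='page' ID='PHYS_0001'/></div></mets>")
remove_physical_page(root, 'PHYS_0001')   # removes the page
remove_physical_page(root, 'PHYS_0001')   # now a no-op instead of IndexError
```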

@kba
Member

kba commented Nov 29, 2022

I could fix the problems with OCRD_METS_CACHING and browse-ocrd's test suite except for the test_reorder test. The problem there is that browse-ocrd modifies the underlying XML, but the OcrdMets caching does not know about it. We either need to extend the OcrdMets API to offer the functionality (i.e. reordering of pages) or at least provide a way to let OcrdMets know that it should invalidate and re-fill the cache. @MehmedGIT @hnesk

@bertsky
Collaborator Author

bertsky commented Nov 29, 2022

We either need to extend the OcrdMets API to offer the functionality (i.e. reordering of pages) or at least a way to let OcrdMets know that it should invalidate and re-fill the cache.

I'd prefer the latter. Re-ordering pages is a very specific and rare use-case, while invalidating the cache probably has other applications, too. (For example during error recovery, if a processor crashed on one page, but may have already gotten to update the METS before it did.)

(BTW, if we delegate to a MetsServer, then such reordering should be forbidden altogether.)

@MehmedGIT
Contributor

MehmedGIT commented Nov 30, 2022

Unfortunately, I cannot pinpoint / dissect, because apparently, @MehmedGIT has effectively erased the history of ocrd_mets.py.

Sorry about deleting the history of ocrd_mets.py. I was not aware I did that. I think it's not completely lost and there should be a way to reverse this.

The problem there is that browse-ocrd modifies the underlying XML but the OcrdMets caching does not know about it.

The content of the METS cache is updated through the OcrdMets class method calls, not by changes to the METS file on disk or by direct modification of the underlying element tree in memory. So it is no surprise that the cache breaks: browse-ocrd, as kba pointed out, modifies the element tree directly in some methods, e.g. document.py/reorder(). Hence the test_reorder failure.

However, after checking the source code of reorder(), I am surprised that it breaks the cache, even though it alters the element tree directly. The content of the separate mets:div[@TYPE="page"] elements does not seem to change (or does it?), just their order inside the mets:div[@TYPE="physSequence"] element. That alone should not break the cache, since the cache does not depend on order, only on content.

The cache would break only if the content of the following elements is changed:

  # Cache for the files (mets:file) - two nested dictionaries
  # The outer dictionary's Key: 'fileGrp.USE'
  # The outer dictionary's Value: Inner dictionary
  # The inner dictionary's Key: 'file.ID'
  # The inner dictionary's Value: a 'file' object at some memory location
  self._file_cache = {}

  # Cache for the pages (mets:div)
  # The dictionary's Key: 'div.ID'
  # The dictionary's Value: a 'div' object at some memory location
  self._page_cache = {}

  # Cache for the file pointers (mets:fptr) - two nested dictionaries
  # The outer dictionary's Key: 'div.ID'
  # The outer dictionary's Value: Inner dictionary
  # The inner dictionary's Key: 'fptr.FILEID'
  # The inner dictionary's Value: a 'fptr' object at some memory location
  self._fptr_cache = {}

Since I was not able to run browse-ocrd successfully (neither from source nor via pip install), I could not reproduce the test error to investigate further.

to let OcrdMets know that it should invalidate and re-fill the cache

Agreed. E.g. an OcrdMets.refresh_cache() that forces the cache to iterate the element tree again and re-fill its content accordingly. In addition, we could specify via parameters which cache instance(s) (file cache, page cache, fptr cache) to refresh, for flexibility.
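A sketch of what such a refresh_cache() could look like (this method does not exist in ocrd_models at the time of writing; names, parameters, and the toy data standing in for the lxml tree are all assumptions):

```python
class CachedMets:
    """Toy stand-in for OcrdMets with selectively refreshable caches."""

    def __init__(self, tree):
        self.tree = tree   # stands in for the lxml element tree
        self.refresh_cache()

    def refresh_cache(self, files=True, pages=True, fptrs=True):
        """Invalidate and re-fill the requested caches (all by default)."""
        if files:
            self._file_cache = {use: dict(ids)
                                for use, ids in self.tree['files'].items()}
        if pages:
            self._page_cache = dict(self.tree['pages'])
        if fptrs:
            self._fptr_cache = {div: dict(fp)
                                for div, fp in self.tree['fptrs'].items()}

# Usage: after an out-of-band modification of the tree, one call re-syncs.
tree = {'files': {'OCR-D-IMG': {'f1': 'obj1'}},
        'pages': {'PHYS_0001': 'div1'},
        'fptrs': {'PHYS_0001': {'f1': 'ptr1'}}}
mets = CachedMets(tree)
tree['pages']['PHYS_0002'] = 'div2'           # direct tree modification
mets.refresh_cache(files=False, fptrs=False)  # re-fill only the page cache
```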

(For example during error recovery, if a processor crashed on one page, but may have already gotten to update the METS before it did.)

I am also interested in this topic. By METS do you mean the XML element tree (eTree) in memory or the METS file on disk? I assume it's the latter. However, I cannot see how the cache invalidation could be useful in that scenario.

Consider these steps and make corrections if needed:

  1. The mets file on the disk gets loaded in the memory (eTree)
  2. The 3 cache dictionaries are filled by iterating the relevant parts of the eTree
  3. An ocr-d processor indirectly modifies the content of the eTree by calling some OcrdMets method
  4. The method called in step 3 modifies the eTree
  5. The method called in step 3 modifies the cache instance/s
  6. The processor calls Workspace.save_mets() to store the eTree on the disk (not sure if that happens here or how often? ideally, it should not)
  7. Steps 3-5 (6?) are repeated a few times until the current page is processed
  8. The processor finishes processing the current page / the processor fails to process the current page
  9. The processor calls Workspace.save_mets() to store the eTree on the disk
  10. Steps 3-9 are repeated many times until the OCR-D processor finishes processing all pages
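The numbered steps above, condensed into a toy loop (all names are assumptions, not the real ocrd.Processor API; plain dicts stand in for the eTree and the cache):

```python
def run_processor(mets, cache, page_ids, save):
    """Toy per-page loop mirroring steps 3-10 above."""
    done = []
    for page_id in page_ids:                 # step 10: loop over all pages
        try:
            entry = 'OCR for %s' % page_id   # step 3: processing work
            mets[page_id] = entry            # step 4: mutate the eTree
            cache[page_id] = entry           # step 5: mutate the cache
            done.append(page_id)             # step 8: success
        except Exception:
            pass                             # step 8: failure case
        save(mets)                           # step 9: serialise to disk
    return done

saves = []                                   # records each "save_mets" call
mets, cache = {}, {}
done = run_processor(mets, cache, ['PHYS_0001', 'PHYS_0002'],
                     lambda m: saves.append(dict(m)))
```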

So, AFAIS:

  1. Inconsistency between the XML element tree and the cache could happen only if the processor crashes after completing step 4 and before completing step 5. However, this does/should not affect the mets file on the disk.
  2. If the content of the mets file on the disk changes in step 6 and the processor fails to process the current page in step 8, then we end up with a mets file that is broken/incomplete since the content of the eTree and the cache is broken/incomplete.
  3. If there is no step 6, even if the processor fails to process a page the content of the mets file on the disk will be in a good state and without leftovers from a failed page. The content of the mets file is loaded in the memory and the cache gets filled again.

Did I miss something?

@bertsky
Collaborator Author

bertsky commented Nov 30, 2022

However, after checking the source code of reorder(), I am surprised that it breaks the cache, even though it alters the element tree directly. The content of the separate mets:div[@TYPE="page"] elements does not seem to change (or does it?), just their order inside the mets:div[@TYPE="physSequence"] element. That alone should not break the cache, since the cache does not depend on order, only on content.

Yes, but look again: it does not in fact break the cache validity at all. And that is really the problem: what is actually tested as a result, self.page_ids, delegates to OcrdMets.physical_pages, and thus to your cache. So while self.reorder does reorder the pages in the actual element tree, the cached version still contains the old order (because it has not been invalidated yet).
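A toy demonstration of this staleness (assumptions throughout: xml.etree instead of lxml, and a plain dict in place of the page cache):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<seq><div ID='PHYS_0001'/><div ID='PHYS_0002'/></seq>")
# a page cache keyed by ID, filled once, as OcrdMets does (simplified)
page_cache = {div.get('ID'): div for div in root}

# browse-ocrd-style reorder: mutate the tree directly, bypassing the cache
divs = list(root)
for div in divs:
    root.remove(div)
for div in reversed(divs):
    root.append(div)

tree_order = [div.get('ID') for div in root]   # ['PHYS_0002', 'PHYS_0001']
cached_order = list(page_cache)                # ['PHYS_0001', 'PHYS_0002'] - stale
```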

(For example during error recovery, if a processor crashed on one page, but may have already gotten to update the METS before it did.)

I am also interested in this topic. By METS do you mean the XML element tree (eTree) in the memory or the mets file on the disk? I assume it's the latter. However, I cannot see how the cache invalidation could be useful in that scenario.

No, I did mean the cached state in memory. Let's say we finally get some form of error recovery (e.g. catching anything below Processor.process_page). Now if the processor crashes on one page, but already made its METS action prior to that, and recovery then has the program revert to some dummy or copy behaviour and continue with the next page, then clearly it should also invalidate the cache, so that whatever is now in the tree is also in the cache.

Consider these steps and make corrections if needed:

  1. The mets file on the disk gets loaded in the memory (eTree)
  2. The 3 cache dictionaries are filled by iterating the relevant parts of the eTree
  3. An ocr-d processor indirectly modifies the content of the eTree by calling some OcrdMets method
  4. The method called in step 3 modifies the eTree
  5. The method called in step 3 modifies the cache instance/s
  6. The processor calls Workspace.save_mets() to store the eTree on the disk (not sure if that happens here or how often? ideally, it should not)

Depends. In the processing server, we should be able to keep the METS in memory throughout the whole workflow. But in a standard CLI run, as soon as a processor is finished, it needs to serialise to disk. With page-wise processing it obviously depends on how that is implemented: with the current METS splitting we are in the latter case, whereas with a page-parallel API we are in memory-only territory.

  7. Steps 3-5 (6?) are repeated a few times until the current page is processed
  8. The processor finishes processing the current page / the processor fails to process the current page
  9. The processor calls Workspace.save_mets() to store the eTree on the disk
  10. Steps 3-9 are repeated many times until the OCR-D processor finishes processing all pages

So, AFAIS:

  1. Inconsistency between the XML element tree and the cache could happen only if the processor crashes after completing step 4 and before completing step 5. However, this does/should not affect the mets file on the disk.

Yes. But see above ("continue with next page...")

  2. If the content of the mets file on the disk changes in step 6 and the processor fails to process the current page in step 8, then we end up with a mets file that is broken/incomplete, since the content of the eTree and the cache is broken/incomplete.

Definitely. But error handling could always try to undo the last METS action as part of recovery. (So error handling would involve making "backups" available for rollback, as would a MetsServer naturally.)

  3. If there is no step 6, even if the processor fails to process a page, the content of the mets file on the disk will be in a good state, without leftovers from the failed page. The content of the mets file is loaded into memory and the cache gets filled again.

@bertsky
Collaborator Author

bertsky commented Jan 23, 2023

@kba @MehmedGIT so can we close here?

@MehmedGIT
Contributor

The problem seems to be resolved now. I think yes.

@bertsky bertsky closed this as completed Feb 17, 2023