checksum and/or file size of models in .PAGE.xml #1183

Open
jbarth-ubhd opened this issue Feb 8, 2024 · 7 comments

Comments

@jbarth-ubhd

For reproducibility, it would be nice to record a checksum and/or file size of the models used in the XML.

@bertsky
Collaborator

bertsky commented Feb 8, 2024

You mean as in

    <mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE" ROLE="OTHER" OTHERROLE="layout/segmentation/region">
      <mets:name>ocrd-tesserocr-recognize v0.17.0 (tesseract 5.3.1-25-gcf23)</mets:name>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="input-file-grp">OCR-D-BIN</mets:note>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="output-file-grp">OCR-D-BIN-OCR-TESS-frak2021</mets:note>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="parameter">{"model": "frak2021", "dpi": 0, "padding": 0, "segmentation_level": "word", "textequiv_level": "word", "overwrite_segments": false, "overwrite_text": true, "shrink_polygons": false, "block_polygons": false, "find_tables": true, "find_staves": false, "sparse_text": false, "raw_lines": false, "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "", "tesseract_parameters": {}, "xpath_parameters": {}, "xpath_model": {}, "auto_model": false, "oem": "DEFAULT"}</mets:note>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:cksum="1509050540 3421140"/>
      <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="page-id"/>
    </mets:agent>

@jbarth-ubhd, or did you mean the PAGE XML?

In METS, we could also use some information on processing dates, e.g. mets:agent/mets:note/@ocrd:date (of xsd:dateTime). What do you think @kba?
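
A rough sketch of what such a dated note could look like when built with the standard library. The `ocrd:date` attribute is only the proposal from this comment, and the helper name `dated_note` is made up for illustration; none of this exists in core yet.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
OCRD_NS = "https://ocr-d.de"  # namespace as used in the snippet above


def dated_note(when=None):
    """Build a mets:note carrying a (proposed) ocrd:date attribute.

    Hypothetical helper: the attribute name is an assumption taken
    from this discussion, not an existing convention.
    """
    when = when or datetime.now(timezone.utc)
    note = ET.Element(f"{{{METS_NS}}}note")
    # ISO 8601 with seconds precision and a UTC offset is a valid xsd:dateTime
    note.set(f"{{{OCRD_NS}}}date", when.isoformat(timespec="seconds"))
    return note
```

Using a timezone-aware timestamp avoids ambiguity when results from different machines end up in the same METS.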

@kba
Member

kba commented Feb 9, 2024

That's a great idea!

Incidentally, we're in the process of dealing with the reality of mass OCR, i.e. what to throw away to keep the amount of data manageable while still retaining as much reproducibility information as possible. This would help.

The tricky part is how and what to hash.

A simple solution would be to assume the checksum is related to the raw data that ocrd resmgr retrieves, i.e. the models or zipped models as they are downloaded via HTTP. We could add the checksum to the resources section of the ocrd-tool.json schema (and therefore the ocrd resmgr schema).

A helpful side effect would be that we notice when models are updated at the same URL (e.g. the messy situation with eynollah currently).
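
A minimal sketch of the hashing part, assuming SHA-256 over the raw downloaded bytes plus the file size. The function names and the idea of an expected checksum stored next to the resource entry are illustrative assumptions, not existing `ocrd resmgr` behaviour.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return (sha256 hex digest, size in bytes) of the file at `path`.

    Reads in chunks so that large model files need not fit in memory.
    """
    sha = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
            size += len(chunk)
    return sha.hexdigest(), size


def matches_expected(path, expected_sha256, expected_size=None):
    """Check a downloaded file against a recorded checksum (hypothetical field).

    A mismatch would indicate that the model behind the URL has changed.
    """
    digest, size = file_digest(path)
    if expected_size is not None and size != expected_size:
        return False
    return digest == expected_sha256.lower()
```

Comparing the size first is only a cheap plausibility check; the digest is what actually identifies the file.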

@kba added the enhancement label Feb 9, 2024
@kba
Member

kba commented Feb 9, 2024

In METS, we could also use some information on processing dates, e.g. mets:agent/mets:note/@ocrd:date (of xsd:dateTime).

Indeed, I will need to improve the page-to-alto conversion soon-ish and find a better solution for dates (I haven't forgotten about the feedback on kba/page-to-alto#37 btw) and other metadata. If we had more granular and easier to interpret date info that would help a lot.

@jbarth-ubhd
Author

jbarth-ubhd commented Feb 9, 2024

@bertsky: I don't have cksum in mets.xml (and not in OCR-D-OCR_00001.xml) (installed ocrd/all docker a few weeks ago):

<mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE" ROLE="OTHER" OTHERROLE="layout/segmentation/region">
  <mets:name>ocrd-tesserocr-recognize v0.17.0 (tesseract 5.3.3)</mets:name>
  <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="input-file-grp">OCR-D-005</mets:note>
  <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="output-file-grp">OCR-D-OCR</mets:note>
  <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="parameter">{"textequiv_level":
"word", "segmentation_level": "region", "overwrite_segments": true, "model": "frak2021",
"dpi": 0, "padding": 0, "overwrite_text": true, "shrink_polygons": false,
"block_polygons": false, "find_tables": true, "find_staves": false, "sparse_text": false,
"raw_lines": false, "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "",
"tesseract_parameters": {}, "xpath_parameters": {}, "xpath_model": {}, "auto_model":
false, "oem": "DEFAULT"}</mets:note>
  <mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="page-id"/>
</mets:agent>

@bertsky
Collaborator

bertsky commented Feb 9, 2024

A simple solution would be to assume the checksum is related to the raw data that ocrd resmgr retrieves, i.e. the models or zipped models as they are downloaded via HTTP. We could add the checksum to the resources section of the ocrd-tool.json schema (and therefore the ocrd resmgr schema).

I don't understand – wouldn't that be the repository side (ocrd-tool.json), rather than the user side (resources.yml)?

We could certainly have resmgr store that information, but what about manual (cp) or existing installations?

I would rather like the processor to look at the file exactly when it is used, i.e. during resolve_resource. Since we have that as a method of the Processor class, how about a little side effect: determining the checksum of the retrieved file and storing it in a hidden attribute of the processor instance, like say self._resources? Then our run_processor could automatically add the checksum info during its workspace.mets.add_agent call – no further code changes required!
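
Roughly what I have in mind, as a self-contained sketch. The class below is a hypothetical stand-in and not the real `Processor`; the `_locate` placeholder, the SHA-256 choice and the JSON note layout are all assumptions for illustration.

```python
import hashlib
import json
import os


class ChecksumRecordingProcessor:
    """Hypothetical stand-in for a processor class (not the real ocrd Processor)."""

    def __init__(self):
        # hidden registry: resolved path -> {"sha256": ..., "size": ...}
        self._resources = {}

    def _locate(self, name):
        """Placeholder lookup that treats `name` as a file path.

        The actual search over resource locations is out of scope here.
        """
        if not os.path.isfile(name):
            raise FileNotFoundError(name)
        return os.path.abspath(name)

    def resolve_resource(self, name):
        """Resolve a resource and, as a side effect, record its checksum."""
        path = self._locate(name)
        sha = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                sha.update(chunk)
        self._resources[path] = {
            "sha256": sha.hexdigest(),
            "size": os.path.getsize(path),
        }
        return path


def agent_notes(processor):
    """Serialize recorded checksums as note payloads for the agent entry."""
    return [
        json.dumps({"resource": os.path.basename(path), **info}, sort_keys=True)
        for path, info in sorted(processor._resources.items())
    ]
```

Because the hash is taken at the moment of use, this also covers models that were copied in manually. Resources that are directories rather than single files would need extra handling, which this sketch leaves out.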

@bertsky
Collaborator

bertsky commented Feb 9, 2024

In METS, we could also use some information on processing dates, e.g. mets:agent/mets:note/@ocrd:date (of xsd:dateTime).

Indeed, I will need to improve the page-to-alto conversion soon-ish and find a better solution for dates (I haven't forgotten about the feedback on kba/page-to-alto#37 btw) and other metadata. If we had more granular and easier to interpret date info that would help a lot.

Isn't that a separate issue though? In ocrd_modelfactory.page_from_image, we do set PAGE's Created and LastChange – but we do not set the latter whenever we add annotation via a processor's save_xml.

The METS side is independent, though.

@kba
Member

kba commented Feb 9, 2024

I would rather like the processor to look at the file exactly when it is used, i.e. during resolve_resource. Since we have that as a method of the Processor class, how about a little side effect: determining the checksum of the retrieved file and storing it in a hidden attribute of the processor instance, like say self._resources? Then our run_processor could automatically add the checksum info during its workspace.mets.add_agent call – no further code changes required!

Yeah, that's the more robust and elegant solution 👍

In METS, we could also use some information on processing dates, e.g. mets:agent/mets:note/@ocrd:date (of xsd:dateTime).

Indeed, I will need to improve the page-to-alto conversion soon-ish and find a better solution for dates (I haven't forgotten about the feedback on kba/page-to-alto#37 btw) and other metadata. If we had more granular and easier to interpret date info that would help a lot.

Isn't that a separate issue though? In ocrd_modelfactory.page_from_image, we do set PAGE's Created and LastChange – but we do not set the latter whenever we add annotation via a processor's save_xml.

The METS side is independent, though.

Yeah, sorry, it's late. We had a call on that subject (getting OCR and metadata into the digital library) today, so it came to mind.

@bertsky: I don't have cksum in mets.xml (and not in OCR-D-OCR_00001.xml) (installed ocrd/all docker a few weeks ago):

@jbarth-ubhd This was just a proposal by @bertsky for how it could eventually look, not the current situation. We'll still need to implement the checksum, of course.
