Running in a docker volume doesn't work #84

Closed
kba opened this issue Oct 27, 2019 · 9 comments
Labels
question Further information is requested

Comments

@kba
Member

kba commented Oct 27, 2019

wget 'https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/736a2f9a-92c6-4fe3-a457-edfa3eab1fe3/data/wundt_grundriss_1896.ocrd.zip'
unzip wundt_grundriss_1896.ocrd.zip
cd data
docker run -u $(id -u) -w /data -v $PWD:/data -- ocrd/tesserocr:edge ocrd-tesserocr-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-DOCKER

This runs ocrd-tesserocr-binarize, but it only changes the serialization of mets.xml and adds the agent; it does not do the actual work. What am I doing wrong?

@mikegerber @bertsky @wrznr Input appreciated, thanks!

kba added the "question" label on Oct 27, 2019
@bertsky
Collaborator

bertsky commented Oct 27, 2019

@kba, I assume you meant ocrd/tesserocr:edge, not ocrd/tesserocr – the latter does not even contain ocrd-tesserocr-binarize yet (at least on dockerhub).

BTW, ocrd-tesserocr-binarize is not going to do any actual work on the page level (neither for input PAGE nor for PAGE generated from the image). That's because Tesseract's API does not allow binarization on the page level, and no effort has been invested in this CLI to apply the method with a fake PSM.SINGLE_BLOCK image of the page. This binarization method is really not worth the effort anyway (it merely offers global Otsu). But the way it fails is an error IMO.

You should at least see a new PAGE output.

I think what's happening is that the runtime parameters do not get passed to the processor somehow. Here's why:

  1. I always get the "No output file group for images specified, falling back to 'OCR-D-IMG-BIN'" warning, regardless of whether I actually provided one.
  2. Any --log-level setting is ignored.
  3. It would explain why no output is written: the processor sees no input file group.

Here is a log output (obtained only via ocrd_logging.py):

19:57:23.571 DEBUG ocrd.processor - Running processor <class 'ocrd_tesserocr.binarize.TesserocrBinarize'>
19:57:23.572 INFO processor.TesserocrBinarize - No output file group for images specified, falling back to 'OCR-D-IMG-BIN'
19:57:23.572 DEBUG ocrd.processor - Processor instance <ocrd_tesserocr.binarize.TesserocrBinarize object at 0x7f2d7cb61cd0> (ocrd-tesserocr-binarize v0.4.1 doing preprocessing/optimization/binarization)
19:57:23.666 INFO ocrd.workspace - Saving mets '/data/mets.xml'

@bertsky
Collaborator

bertsky commented Oct 27, 2019

But as far as I can see the decorators and processor class are all set up correctly. Something wrong with your Dockerfile, at least in the edge version, perhaps?

@mikegerber
Contributor

I gave up debugging this because these files are not the same:

@kba
Member Author

kba commented Oct 31, 2019

> @kba, I assume you meant ocrd/tesserocr:edge, not ocrd/tesserocr – the latter does not even contain ocrd-tesserocr-binarize yet (at least on dockerhub).

Yeah, I should have been clearer: I built ocrd/tesserocr locally from the edge branch.

> But the way it fails is an error IMO.

Yeah, I just want to ensure that the behavior for pip-installed and docker-run is the same. Binarization is a bad example, I agree.

> I think what's happening is that the runtime parameters do not get passed to the processor somehow.

That could well be, thanks, it's a lead.

@kba
Member Author

kba commented Oct 31, 2019

> I gave up debugging this because these files are not the same:
>
> https://github.com/OCR-D/ocrd_tesserocr/blob/master/Dockerfile
>
> https://hub.docker.com/r/ocrd/tesserocr/dockerfile

It's confusing. The first link should be

https://github.com/OCR-D/ocrd_tesserocr/blob/edge/Dockerfile

(i.e. built from the edge branch)

DockerHub only displays the Dockerfile (and README) of the master branch, but it is configured to build master -> latest and edge -> edge.

If you are still willing to debug: The dockerfile in the edge branch builds this image on dockerhub: https://hub.docker.com/layers/ocrd/tesserocr/edge/images/sha256-1f2a30d2f2c2dfc81ba97387a51678c557f24fea672c1ac3670f70ea49f7d153

@mikegerber
Contributor

I'll check it next week!

@mikegerber
Contributor

mikegerber commented Nov 4, 2019

This looks better (note the quotes):

$ docker run -u $(id -u) -w /data -v $PWD:/data -- ocrd/tesserocr:edge "ocrd-tesserocr-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-DOCKER -m mets.xml"
12:29:05.205 INFO processor.TesserocrBinarize - No output file group for images specified, falling back to 'OCR-D-IMG-BIN'
12:29:05.276 INFO processor.TesserocrBinarize - INPUT FILE 0 / phys_0001
12:29:05.282 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0001'
12:29:05.282 WARNING processor.TesserocrBinarize - Page 'phys_0001' contains no text regions
12:29:05.283 INFO processor.TesserocrBinarize - INPUT FILE 1 / phys_0002
12:29:05.284 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0002'
12:29:05.284 WARNING processor.TesserocrBinarize - Page 'phys_0002' contains no text regions
12:29:05.284 INFO processor.TesserocrBinarize - INPUT FILE 2 / phys_0003
12:29:05.285 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0003'
12:29:05.285 WARNING processor.TesserocrBinarize - Page 'phys_0003' contains no text regions
12:29:05.286 INFO processor.TesserocrBinarize - INPUT FILE 3 / phys_0004
12:29:05.286 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0004'
12:29:05.286 WARNING processor.TesserocrBinarize - Page 'phys_0004' contains no text regions
12:29:05.289 INFO ocrd.workspace - Saving mets '/data/mets.xml'

Suggested fix (so the quotes aren't needed anymore):

diff --git a/Dockerfile b/Dockerfile
index c7b5888..0a84f03 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -21,4 +21,4 @@ RUN apt-get update && \
 RUN pip3 install --upgrade pip
 RUN make PYTHON=python3 PIP=pip3 deps install
 
-ENTRYPOINT ["/bin/sh", "-c"]
+ENTRYPOINT []
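
Why the quotes matter with the old entrypoint can be reproduced without Docker. The sketch below is only an illustration: `fake-processor` is a hypothetical stand-in for ocrd-tesserocr-binarize that prints how many arguments it received, and plain `sh -c` plays the role of `ENTRYPOINT ["/bin/sh", "-c"]`.

```shell
# Hypothetical stand-in for the processor: prints its argument count.
tmpdir=$(mktemp -d)
cat > "$tmpdir/fake-processor" <<'EOF'
#!/bin/sh
echo "$#"
EOF
chmod +x "$tmpdir/fake-processor"
PATH="$tmpdir:$PATH"
export PATH

# 1. Like ENTRYPOINT ["/bin/sh", "-c"] with an unquoted command:
#    only "fake-processor" is the command string; -I, IN, -O, OUT
#    become $0..$3 of that string and never reach the program.
unquoted=$(sh -c fake-processor -I IN -O OUT)

# 2. Same entrypoint, whole command passed as one quoted argument.
quoted=$(sh -c "fake-processor -I IN -O OUT")

# 3. Like ENTRYPOINT []: the command is executed directly.
direct=$(fake-processor -I IN -O OUT)

echo "unquoted: $unquoted"   # -> unquoted: 0
echo "quoted:   $quoted"     # -> quoted:   4
echo "direct:   $direct"     # -> direct:   4

rm -rf "$tmpdir"
```

The first case matches the symptom reported above: the processor starts, but with no file groups or other options. For an image that cannot be rebuilt, overriding the entrypoint at run time with `docker run --entrypoint ocrd-tesserocr-binarize ...` may be an alternative, though that was not tried in this thread.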

@bertsky
Collaborator

bertsky commented Nov 4, 2019

> This looks better (note the quotes):

That was it! You have to put all arguments into a single shell-expanded argument.

> -ENTRYPOINT ["/bin/sh", "-c"]
> +ENTRYPOINT []

Great! That's not going to work with our process substitution expressions (for ad-hoc parameter JSON files), but we should have the immediate JSON syntax by now.
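
A rough sketch of the difference, under stated assumptions: with a host-side process substitution (a bash feature) the processor is handed a path such as /dev/fd/63, which belongs to the calling shell on the host and is presumably not visible inside the container. Passing the JSON inline avoids files altogether. The snippet assumes that `-p` accepts a literal JSON string (the "immediate JSON syntax") and uses `operation_level` as an example parameter; both are assumptions about the installed versions. Docker is not invoked here; the snippet only builds the argument list.

```shell
# Example parameter JSON (assumed parameter name, for illustration only).
PARAMS='{"operation_level": "region"}'

# Host-side process substitution would pass a path, not the content:
#   docker run ... ocrd/tesserocr:edge ocrd-tesserocr-binarize -p <(echo "$PARAMS") ...
# Inline JSON passes the content itself as a single argument:
set -- ocrd-tesserocr-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-DOCKER -p "$PARAMS"

# Print one argument per line to confirm the JSON stays in one piece.
printf '%s\n' "$@"
```

Quoting "$PARAMS" matters: without the quotes the shell would split the JSON at the space into two arguments.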

@kba Can you recommend that for module projects' docker files in general?

@kba
Member Author

kba commented Nov 4, 2019

> Suggested fix (so the quotes aren't needed anymore):

Thanks!

> @kba Can you recommend that for module projects' docker files in general?

Indeed we should.
