OCR_ENGINE=None Doesn't work #256

Open
svmrw opened this issue Aug 16, 2024 · 2 comments

svmrw commented Aug 16, 2024

Hello. The README says the following:

By default, marker will use surya for OCR. Surya is slower on CPU, but more accurate than tesseract. If you want faster OCR, set OCR_ENGINE to ocrmypdf. This also requires external dependencies (see above).
If you don't want OCR at all, set OCR_ENGINE to None.

export OCR_ENGINE=None
marker_single ./file.pdf ./marker

Running these commands produces the following error:

pydantic_core._pydantic_core.ValidationError: 1 validation error for Settings
OCR_ENGINE
  Input should be 'surya' or 'ocrmypdf' [type=literal_error, input_value='None', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/literal_error

I want to convert PDFs to Markdown without using OCR.
Almost all PDF files contain text that can be selected and copied, and the embedded images need to be kept as they are. It seems to me that the whole document does not need to be recognized as an image when the text can simply be copied.

Is this possible?
Was it supported before but no longer is?
Or am I doing something wrong?
Thanks.
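
For context on the validation error above: the traceback reports `input_value='None', input_type=str`, so the environment variable arrives as the literal string "None" and is rejected because only 'surya' and 'ocrmypdf' are accepted. Below is a minimal standard-library sketch of how such a string could be normalized to Python's `None` before validation. The helper name `parse_ocr_engine` and the constant `ALLOWED_ENGINES` are hypothetical and are not part of marker; the default of "surya" is assumed from the README text quoted above.

```python
import os
from typing import Mapping, Optional

# Accepted engine names, taken from the error message in this issue.
ALLOWED_ENGINES = ("surya", "ocrmypdf")


def parse_ocr_engine(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Hypothetical helper: read OCR_ENGINE and map the string "None" to None."""
    raw = environ.get("OCR_ENGINE")
    if raw is None:
        # Variable not set: fall back to the default named in the README.
        return "surya"
    value = raw.strip()
    if value.lower() == "none":
        # Environment variables are always strings, so "None" must be
        # converted explicitly to mean "no OCR engine".
        return None
    if value not in ALLOWED_ENGINES:
        raise ValueError(
            f"OCR_ENGINE must be one of {ALLOWED_ENGINES} or 'None', got {value!r}"
        )
    return value
```

Calling `parse_ocr_engine({"OCR_ENGINE": "None"})` returns `None` instead of raising. This only addresses parsing the setting; whether the conversion pipeline then actually skips OCR is a separate question.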

svmrw commented Aug 18, 2024

#257
I tried applying the changes from your commit manually.
The error no longer appears, but Surya OCR still loads and processes the whole file.
That is, OCR_ENGINE=None and OCR_ENGINE=surya behave the same; I see no difference.
I am most likely doing something wrong, so could you please check it yourself?

kyr0 commented Sep 17, 2024

I am running into the same problem. Because OCR drives my machine to its memory limit, I now have to use different software. This is a dead end.
