Extra spaces in any output except txt for non space delimited languages #2702
Comments
I can confirm this with tesseract 5.0.0-alpha-479-g247c Not only This same problem was reported earlier for Original text output (with break after each character can be reproduced by using Similar behavior is also seen for kor, chi_tra, chi_sim etc. |
Update: I fixed the broken link. |
You mean https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/resultiterator.cpp
So, the original intent of this variable is different. Somehow it worked in fixing the problem with CJK text output. I think this issue is related to these being non space delimited languages. |
PDF output also has the same problem. |
Reproducible with chi_tra language + PDF output combination. Suggested new issue name: "Extra spaces in HOCR/PDF output for non space delimited languages" |
#991 is a related issue. |
Affected languages: |
The issue is still present in Japanese recognition. I investigated with the latest master (12e0fb4) of tesseract and the following image. With the latest tessdata_best:
With latest tessdata_best and
With tessdata in old repository(https://github.com/tesseract-ocr/tessdata):
With tessdata_fast:
It seems |
https://github.com/tesseract-ocr/langdata/blob/master/jpn/jpn.config |
Report the issue with jpn_vert.config here: Check also the jpn.traineddata from tessdata_best by unpacking it. If the parameter is missing from this traineddata you should report the issue here: https://github.com/tesseract-ocr/tessdata_best/issues. |
|
Using |
Does The command line I used is: I tried
|
https://en.wikipedia.org/wiki/Scriptio_continua#Decline
|
But it does not inherit it, according to @eighttails, so |
Changes for jpn were made in
Since tessdata was updated with the integer version of tessdata_best before the config file change was made, it has a space after each character. |
If the actual cause is determined, can the tessdata be regenerated with the fix included? The same goes for the chi_* languages, as I can still reproduce this issue with them. |
I think the issue's name is a bit confusing.
It should be "Extra spaces in non space delimited languages" Here is a simple trigger in SubtitleEdit: |
A little confused by all the GitHub issues surrounding this. To be clear, for me PDF output contains extra spaces when CJK languages are OCRed but the plain text does not. Using Is this something that is: (a) fixable; and (b) expected to be fixed? This basically makes OCRing CJK languages to PDF unusable because nothing more than single glyphs can be searched. FYI Acrobat Pro itself OCRs the same documents correctly without the spaces. Thanks! |
It's probably fixable. There is no timeline for fixing this issue. |
I wonder whether |
Hello, I would like to ask which component this bug most likely resides in. I'm not really familiar with C++ or Tesseract, but I would really like to have this bug fixed (even by myself), so any pointers would be appreciated. |
Is there any plan to fix this bug? I seem to have encountered the same problem. I used tesseract to parse the Chinese text in a screenshot, and all the Chinese characters had spaces between them, while the real spaces (one or more) were ignored.
|
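Until this is fixed in Tesseract itself, the symptom described above (a space after every Chinese or Japanese character in text output) can be worked around by post-processing. The following is only a minimal sketch, not part of Tesseract: the function name `strip_cjk_spaces` and the chosen Unicode ranges are my own assumptions. It also cannot tell a spurious space from a genuine one between two CJK characters, and it does nothing for the PDF text layer.

```python
import re

# Assumed ranges (not exhaustive): CJK punctuation, Hiragana, Katakana,
# and CJK Unified Ideographs. Hangul is left out on purpose because
# Korean is written with spaces between words.
_CJK = r"\u3001-\u303f\u3040-\u30ff\u4e00-\u9fff"

# A run of spaces/tabs that has a CJK character on both sides.
_GAP = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove spaces that sit between two CJK characters.

    Spaces next to Latin letters or digits are kept, so mixed
    text such as "日本語 OCR" keeps its word boundary.
    """
    return _GAP.sub("", text)


if __name__ == "__main__":
    print(strip_cjk_spaces("日 本 語 の OCR test"))
```

The lookbehind/lookahead form is used so that consecutive gaps ("日 本 語") are all matched without consuming the characters around them.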
When can this be fixed? |
When someone decides to fix it and sends a PR. |
Tesseract Version: v5.0.0-alpha.20190623
Platform: Windows 10 64-bit
Current Behavior: For the Thai language (almost) every individual character in hOCR output is a word
Expected Behavior: Words (or at least groups of characters) are correctly identified in the regular text output. I would expect the hOCR to show the same.
Original image:
[image: 0_Thai pdf]
hOCR output:
output.hocr.txt
TXT output
output.txt
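For anyone consuming the hOCR directly, one possible stopgap for the per-character words described in this report is to rejoin the words of each line after the fact. This is a rough sketch under assumptions: it relies on the `ocr_line` and `ocrx_word` class names that Tesseract's hOCR output normally uses, the class name `HocrLineJoiner` is hypothetical, and joining without any separator is only appropriate for non space delimited scripts. It does not recover real word boundaries or fix the bounding boxes.

```python
from html.parser import HTMLParser


class HocrLineJoiner(HTMLParser):
    """Collect the text of every ocrx_word inside each ocr_line,
    joined without separators (assumes a non space delimited script)."""

    def __init__(self):
        super().__init__()
        self.lines = []        # one joined string per ocr_line
        self._stack = []       # class attribute of each open <span>
        self._current = None   # word fragments of the line being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        cls = dict(attrs).get("class") or ""
        self._stack.append(cls)
        if cls == "ocr_line":
            self._current = []

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        cls = self._stack.pop()
        if cls == "ocr_line" and self._current is not None:
            self.lines.append("".join(self._current))
            self._current = None

    def handle_data(self, data):
        # Only keep text found directly inside a word span;
        # whitespace between word spans is dropped.
        in_word = bool(self._stack) and self._stack[-1] == "ocrx_word"
        if self._current is not None and in_word:
            self._current.append(data.strip())


def join_hocr_lines(hocr: str) -> list:
    parser = HocrLineJoiner()
    parser.feed(hocr)
    parser.close()
    return parser.lines
```

Usage would be something like `join_hocr_lines(open("output.hocr", encoding="utf-8").read())`, where the file name is just an example.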