Extra spaces in any output except txt for non space delimited languages #2702

Open
FrkBo opened this issue Oct 9, 2019 · 25 comments
Labels
bug non spaced words output issues related output formats

@FrkBo

FrkBo commented Oct 9, 2019

Tesseract Version: v5.0.0-alpha.20190623
Platform: Windows 10 64-bit
Current Behavior: For the Thai language (almost) every individual character in hOCR output is a word
Expected Behavior: Words (or at least groups of characters) are correctly identified in the regular text output. I would expect the hOCR to show the same.

Original image:
0_Thai pdf

hOCR output:
output.hocr.txt

TXT output
output.txt
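
One way to quantify the symptom described above is to count how many hOCR word elements hold a single character. The following is a minimal sketch, assuming the hOCR file marks words with `span` elements of class `ocrx_word`; the class and function names (`WordCollector`, `single_char_ratio`) are hypothetical helpers, not part of Tesseract.

```python
from html.parser import HTMLParser


class WordCollector(HTMLParser):
    """Collect the text of every span whose class list contains "ocrx_word"."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0  # > 0 while we are inside a word span

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup (e.g. <strong>) inside a word span.
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._depth = 1
            self.words.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.words[-1] += data


def single_char_ratio(hocr_text):
    """Return the fraction of non-empty hOCR words that are one code point long.

    Note: len() counts Unicode code points, so a Thai consonant with a
    combining vowel mark counts as more than one character here.
    """
    parser = WordCollector()
    parser.feed(hocr_text)
    words = [w.strip() for w in parser.words if w.strip()]
    if not words:
        return 0.0
    return sum(len(w) == 1 for w in words) / len(words)
```

A ratio close to 1.0 on a page of running text would suggest that almost every character was emitted as its own word.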

@Shreeshrii
Collaborator

Shreeshrii commented Oct 17, 2019

I can confirm this with

tesseract 5.0.0-alpha-479-g247c
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0

Not only HOCR, but tsv and alto output also treat each character as a WORD.

The same problem was reported earlier for text output (Korean etc.). The change that worked for all CJK languages was to set -c preserve_interword_spaces=1, and this setting was added to the config files inside all the related traineddata files. However, that does not seem to have fixed hOCR, TSV and ALTO output.

The original text output (with a break after each character) can be reproduced by using -c preserve_interword_spaces=0.

Similar behavior is also seen for kor, chi_tra, chi_sim etc.
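
Until the TSV renderer is fixed, a caller could rebuild line text from the TSV rows and choose the joiner itself. This is only a sketch of a workaround, assuming the usual TSV header (`level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, ..., `text`) with word rows at level 5; `lines_from_tsv` is a hypothetical helper name.

```python
import csv
import io


def lines_from_tsv(tsv_text, joiner=""):
    """Group word rows (level 5) by text line and join them with `joiner`.

    Use joiner="" for non space delimited scripts and joiner=" " otherwise.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = {}
    for row in reader:
        if row["level"] != "5":
            continue  # skip page, block, paragraph and line rows
        text = (row.get("text") or "").strip()
        if not text:
            continue
        key = tuple(
            int(row[name])
            for name in ("page_num", "block_num", "par_num", "line_num")
        )
        lines.setdefault(key, []).append(text)
    # Sort by (page, block, paragraph, line) to keep reading order.
    return [joiner.join(words) for _, words in sorted(lines.items())]
```

Joining with an empty string also discards any genuine spaces inside a line, so this trades one kind of error for another.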

@amitdo
Collaborator

amitdo commented Oct 23, 2019

preserve_interword_spaces is only used here:
https://github.com/tesseract-ocr/tesseract/blob/cb0c024a6f/src/ccmain/resultiterator.cpp

Update: I fixed the broken link.

@Shreeshrii
Collaborator

You mean https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/resultiterator.cpp

BOOL_VAR_H(preserve_interword_spaces, false,
 "Preserve multiple interword spaces");

So, the original intent of this variable is different. Somehow it worked in fixing the problem with CJK text output.

I think this issue is related to these being non space delimited languages.

@tonyfkchu

PDF output also has the same problem.

@brlin-tw
Contributor

Reproducible with chi_tra language + PDF output combination.

Suggested new issue name: "Extra spaces in HOCR/PDF output for non space delimited languages"

@amitdo amitdo changed the title Extra spaces in HOCR output for the Thai language Extra spaces in any output except txt for the Thai language Apr 22, 2020
@amitdo amitdo changed the title Extra spaces in any output except txt for the Thai language Extra spaces in any output except txt for non space delimited languages Apr 22, 2020
@amitdo
Collaborator

amitdo commented Apr 22, 2020

#991 is a related issue.

@amitdo
Collaborator

amitdo commented May 2, 2020

Affected languages:
Chinese, Japanese, Thai.

@eighttails
Contributor

Issue is still present in Japanese recognition.

I investigated with the latest master (12e0fb4) of tesseract and the following image.
sample

With latest tessdata_best:
(Spaces between all tokens)

種々 の 帳票 文書 か ら 正 確 に 各 文書 に 対応 する 書式 構
造 を 認識 し , 個々 の 項目 デー タ を 抽出 する に は , 帳票
文書 クラ ス を 識別 する こと が 必要 不可 欠 で ある , ここ
で , 帳票 文書 クラ ス は 共通 の 書式 構造 の も と に 作成 さ
れ た 同一 , また は 同種 の 帳票 文書 の 集合 と 定義 する .


With latest tessdata_best and -c preserve_interword_spaces=1 command line option:
(Expected result)

種々の帳票文書から正確に各文書に対応する書式構
造を認識し, 個々の項目データを抽出するには, 帳票
文書クラスを識別することが必要不可欠である, ここ
で, 帳票文書クラスは共通の書式構造のもとに作成さ
れた同一, または同種の帳票文書の集合と定義する.


With tessdata from the old repository (https://github.com/tesseract-ocr/tessdata):
(Spaces between all characters)

種 々 の 帳 票 文 書 か ら 正 確 に 各 文 書 に 対 応 す る 書 式 構
造 を 認 識 し , 個 々 の 項 目 デ ー タ を 抽 出 す る に は , 帳 票
文 書 ク ラ ス を 識 別 す る こ と が 必 要 不 可 欠 で あ る . こ こ
で , 帳 票 文 書 ク ラ ス は 共 通 の 書 式 構 造 の も と に 作 成 さ
れ た 同 一 , ま た は 同 種 の 帳 票 文 書 の 集 合 と 定 義 す る .


With tessdata_fast:
(Almost expected result)

種々の帳票文書から正確に各文書に対応する書式構
造を認識し, 個々の項目データを抽出するには, 帳票
文書クラスを識別することが必要不可欠である, ここ
で, 帳票文書クラスは共通の書式構造のもとに作成さ

れた同一, または同種の帳票文書の集合と定義する.


It seems the preserve_interword_spaces option set in tessdata_best is not working.
I expect text with no spaces even without the -c preserve_interword_spaces=1 command line option.
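
For plain text that already contains the unwanted spaces, a post-processing pass can remove spaces that sit between two CJK characters. This is a hedged sketch of a workaround, not a fix for Tesseract itself: the character ranges below (the iteration mark 々, Hiragana, Katakana, and two CJK ideograph blocks) are an assumption about what counts as CJK, and `strip_cjk_spaces` is a hypothetical helper.

```python
import re

# Assumed character set: U+3005 (々), Hiragana and Katakana (U+3040-U+30FF),
# CJK Extension A (U+3400-U+4DBF) and CJK Unified Ideographs (U+4E00-U+9FFF).
_CJK = "\u3005\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff"

# Match runs of spaces that have a CJK character on both sides.
_SPACE_BETWEEN_CJK = re.compile(f"(?<=[{_CJK}]) +(?=[{_CJK}])")


def strip_cjk_spaces(text):
    """Remove spaces between CJK characters; leave all other spacing alone."""
    return _SPACE_BETWEEN_CJK.sub("", text)


print(strip_cjk_spaces("種々 の 帳票 文書 か ら 正 確 に"))
```

Spaces next to ASCII punctuation or Latin text are left untouched, and any genuine space between two CJK characters is removed as well, so the result is only an approximation of the expected output.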

@eighttails
Contributor

preserve_interword_spaces is set only in jpn_vert.config and not in jpn.config.
It should be set in jpn.config as well.

https://github.com/tesseract-ocr/langdata/blob/master/jpn/jpn.config
https://github.com/tesseract-ocr/langdata/blob/master/jpn_vert/jpn_vert.config

@amitdo
Collaborator

amitdo commented Aug 22, 2021

Report the issue with jpn_vert.config here:
https://github.com/tesseract-ocr/langdata/issues

Check also the jpn.traineddata from tessdata_best by unpacking it. If the parameter is missing from this traineddata you should report the issue here: https://github.com/tesseract-ocr/tessdata_best/issues.
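
Once the config component has been extracted from a traineddata file (for example with the combine_tessdata tool), a check like the following could show whether the parameter is present. This is a minimal sketch, assuming the config is plain text with one `name value` pair per line and `#` starting a comment line; `parse_config` is a hypothetical helper and the sample text is made up for illustration.

```python
def parse_config(text):
    """Parse 'name value' lines into a dict, skipping blanks and # comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split at the first run of whitespace
        params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params


# Illustrative content only, not the real jpn.config.
sample = """\
# hypothetical config
tessedit_load_sublangs jpn_vert
"""
params = parse_config(sample)
print("preserve_interword_spaces" in params)
```

If the parameter is absent from the unpacked config, that would match the behavior reported above for `-l jpn`.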

@stweil
Member

stweil commented Aug 22, 2021

jpn.config loads jpn_vert.config (tessedit_load_sublangs jpn_vert) , so it should get preserve_interword_spaces=1 from there.

@stweil stweil added this to the 5.0.0 milestone Aug 22, 2021
@stweil
Member

stweil commented Aug 22, 2021

Using preserve_interword_spaces=1 not only affects jpn or other models which seem to require it, but also languages or scripts which don't need it. So I'm afraid that setting it in some model files is not a good idea and that a different solution is required.

@eighttails
Contributor

eighttails commented Aug 23, 2021

Does tessedit_load_sublangs jpn_vert work like #include?
According to the OCR result in #2702 (comment), preserve_interword_spaces in the config seems to be either not working or ignored.

The command line I used is:
tesseract sample.png stdout -l jpn
Are jpn_vert and its config loaded automatically when I specify -l jpn?

I tried tesseract.exe sample.png stdout -l jpn+jpn_vert but no luck.

種々 の 帳票 文書 か ら 正 確 に 各 文書 に 対応 する 書式 構
造 を 認識 し , 個々 の 項目 デー タ を 抽出 する に は , 帳票
文書 クラ ス を 識別 する こと が 必要 不可 欠 で ある , ここ
で , 帳票 文書 クラ ス は 共通 の 書式 構造 の も と に 作成 さ
れ た 同一 , また は 同種 の 帳票 文書 の 集合 と 定義 する .
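
To compare the variants discussed in this thread side by side, the command lines can be generated from a script. The sketch below only reuses options that appear in this issue (`stdout`, `-l`, `-c preserve_interword_spaces=...`); the helper names `build_command` and `run_ocr` are hypothetical, and `run_ocr` assumes a `tesseract` binary on the PATH.

```python
import shutil
import subprocess


def build_command(image, lang, preserve_spaces=None):
    """Build a tesseract command line that writes text to stdout.

    preserve_spaces=None leaves the parameter to the traineddata config;
    True/False passes -c preserve_interword_spaces=1 or =0 explicitly.
    """
    cmd = ["tesseract", image, "stdout", "-l", lang]
    if preserve_spaces is not None:
        cmd += ["-c", f"preserve_interword_spaces={int(preserve_spaces)}"]
    return cmd


def run_ocr(image, lang, preserve_spaces=None):
    """Run tesseract and return its text output (requires tesseract installed)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    result = subprocess.run(
        build_command(image, lang, preserve_spaces),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


print(build_command("sample.png", "jpn+jpn_vert"))
print(build_command("sample.png", "jpn", preserve_spaces=True))
```

Running the same image through each variant and diffing the results makes it easier to see which combination the config inside the traineddata actually affects.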

@amitdo
Collaborator

amitdo commented Aug 24, 2021

https://en.wikipedia.org/wiki/Scriptio_continua#Decline

Scriptio continua is still in use in Thai script, other Southeast Asian abugidas: (Burmese, Khmer, Javanese, Balinese, Sundanese script), Lao, and in languages that use Chinese characters (Chinese and Japanese).

@amitdo
Collaborator

amitdo commented Nov 17, 2021

@stweil,

jpn.config loads jpn_vert.config (tessedit_load_sublangs jpn_vert) , so it should get preserve_interword_spaces=1 from there.

But it does not inherit it, according to @eighttails, so preserve_interword_spaces 1 should also be added to jpn.traineddata.

@Shreeshrii
Collaborator

Changes for jpn were made in

Since tessdata was updated with the integer version of tessdata_best before the config file change was made, it has space after each character.

@brlin-tw
Contributor

Since tessdata was updated with the integer version of tessdata_best before the config file change was made, it has space after each character.

If the actual cause is determined, can the tessdata be regenerated with the fix included?

The same applies to the chi_* languages, as I can still reproduce this issue with them.

@cyzs233

cyzs233 commented Feb 5, 2022

I think the issue's name is a bit confusing.

In English, words are separated by a space character. In languages such as Chinese, Japanese and Thai, however, there is often no delimiter between words.

It should be "Extra spaces in non space delimited languages"

Here is a simple trigger in SubtitleEdit:

Untitled

Source.png

@legistek

I'm a little confused by all the GitHub issues surrounding this. To be clear, for me PDF output contains extra spaces when CJK languages are OCRed, but the plain text does not. Using preserve_interword_spaces helps for the plain TXT output but not for PDF.

Is this something that is: (a) fixable; and (b) expected to be fixed? This basically makes OCRing CJK languages to PDF unusable because nothing more than single glyphs can be searched.

FYI Acrobat Pro itself OCRs the same documents correctly without the spaces.

Thanks!

@amitdo
Collaborator

amitdo commented Jul 17, 2023

Is this something that is: (a) fixable; and (b) expected to be fixed?

It's probably fixable. There is no timeline for fixing this issue.

@amitdo amitdo added bug output issues related output formats labels Jul 17, 2023
@stweil
Member

stweil commented Jul 17, 2023

I wonder whether preserve_interword_spaces should exist at all for LSTM results. The code has to be fixed for ALTO, hOCR and PDF output of CJK text, maybe without using that parameter.

@brlin-tw
Contributor

brlin-tw commented Jul 14, 2024

Hello, I would like to ask which component this bug most likely resides in. I'm not really familiar with C++ or Tesseract, but I would really like to have this bug fixed (even by myself), so any pointers would be appreciated.

@fengwk

fengwk commented Aug 31, 2024

Is there any plan to fix this bug? I seem to have encountered the same problem. I used tesseract to parse the Chinese text in a screenshot, and all the Chinese characters had spaces between them, while the real spaces (one or more) were ignored.

世界 和平 -> 世 界 和 平

@stweil stweil modified the milestones: 5.0.0, 5.1.0 Oct 17, 2024
@free-150

When can this be fixed?

@amitdo
Collaborator

amitdo commented Nov 10, 2024

When can this be fixed?

When someone decides to fix it and sends a PR.
