Extra spaces in any output except txt for non space delimited languages #2702

FrkBo · 2019-10-09T19:31:06Z

Tesseract Version: v5.0.0-alpha.20190623
Platform: Windows 10 64-bit
Current Behavior: For the Thai language (almost) every individual character in hOCR output is a word
Expected Behavior: Words (or at least groups of characters) are correctly identified in the regular text output. I would expect the hOCR to show the same.

Original image:

hOCR output:
output.hocr.txt

TXT output
output.txt

Shreeshrii · 2019-10-17T09:46:16Z

I can confirm this with

tesseract 5.0.0-alpha-479-g247c
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0

Not only HOCR, but tsv and alto output also treat each character as a WORD.

This same problem was reported earlier for text output (Korean etc.), and the change that worked for all CJK languages was to use the config option -c preserve_interword_spaces=1. That change was made to all related traineddata files by adding the option to the config files within the traineddata. However, it does not seem to have fixed hOCR, TSV and ALTO output.

The original text output (with a break after each character) can be reproduced by using -c preserve_interword_spaces=0.

Similar behavior is also seen for kor, chi_tra, chi_sim etc.
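
As a possible stopgap for the TSV case described above, the per-character "words" can be regrouped into lines after the fact. The following is a minimal sketch, not a fix in Tesseract itself. It assumes the usual TSV column names (level, page_num, block_num, par_num, line_num, word_num, ..., text) with level 5 marking word rows. The function names and the list of Unicode ranges are illustrative assumptions and may need extending for other scripts.

```python
import csv
import io
from collections import defaultdict

# Assumption: code point ranges of scripts written without spaces between
# words. This list is illustrative, not exhaustive.
NO_SPACE_RANGES = (
    (0x0E00, 0x0E7F),  # Thai
    (0x3000, 0x30FF),  # CJK symbols and punctuation, Hiragana, Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)


def is_unspaced(ch):
    """Return True if ch belongs to one of the ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NO_SPACE_RANGES)


def join_words(words):
    """Join words, omitting the space where both neighbours are unspaced."""
    out = ""
    for word in words:
        if not word:
            continue
        if out and not (is_unspaced(out[-1]) and is_unspaced(word[0])):
            out += " "
        out += word
    return out


def lines_from_tsv(tsv_text):
    """Rebuild text lines from the word rows (level 5) of TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = defaultdict(list)  # keeps the order in which lines first appear
    for row in reader:
        if row["level"] != "5":
            continue
        text = (row["text"] or "").strip()
        if not text:
            continue
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        lines[key].append(text)
    return [join_words(words) for words in lines.values()]
```

This only repairs the text after recognition; the word bounding boxes in the TSV remain per character, so it does not help the searchable PDF case.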

amitdo · 2019-10-23T09:54:07Z

preserve_interword_spaces is only used here:
https://github.com/tesseract-ocr/tesseract/blob/cb0c024a6f/src/ccmain/resultiterator.cpp

Update: I fixed the broken link.

Shreeshrii · 2019-10-23T10:17:59Z

You mean https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/resultiterator.cpp

BOOL_VAR_H(preserve_interword_spaces, false,
 "Preserve multiple interword spaces");

So, the original intent of this variable is different. Somehow it worked in fixing the problem with CJK text output.

I think this issue is related to these being non space delimited languages.

tonyfkchu · 2020-03-29T15:01:04Z

PDF output also has the same problem.

brlin-tw · 2020-04-22T02:31:53Z

Reproducible with chi_tra language + PDF output combination.

Suggested new issue name: "Extra spaces in HOCR/PDF output for non space delimited languages"

amitdo · 2020-04-22T15:45:09Z

#991 is a related issue.

amitdo · 2020-05-02T14:26:31Z

Affected languages:
Chinese, Japanese, Thai.

eighttails · 2021-07-11T10:35:13Z

Issue is still present in Japanese recognition.

I investigated with the latest master (12e0fb4) of Tesseract and the following image.

With latest tessdata_best:
(Spaces between all tokens)

種々 の 帳票 文書 か ら 正 確 に 各 文書 に 対応 する 書式 構
造 を 認識 し , 個々 の 項目 デー タ を 抽出 する に は , 帳票
文書 クラ ス を 識別 する こと が 必要 不可 欠 で ある , ここ
で , 帳票 文書 クラ ス は 共通 の 書式 構造 の も と に 作成 さ
れ た 同一 , また は 同種 の 帳票 文書 の 集合 と 定義 する .

With latest tessdata_best and -c preserve_interword_spaces=1 command line option:
(Expected result)

種々の帳票文書から正確に各文書に対応する書式構
造を認識し, 個々の項目データを抽出するには, 帳票
文書クラスを識別することが必要不可欠である, ここ
で, 帳票文書クラスは共通の書式構造のもとに作成さ
れた同一, または同種の帳票文書の集合と定義する.

With tessdata from the old repository (https://github.com/tesseract-ocr/tessdata):
(Spaces between all characters)

種 々 の 帳 票 文 書 か ら 正 確 に 各 文 書 に 対 応 す る 書 式 構
造 を 認 識 し , 個 々 の 項 目 デ ー タ を 抽 出 す る に は , 帳 票
文 書 ク ラ ス を 識 別 す る こ と が 必 要 不 可 欠 で あ る . こ こ
で , 帳 票 文 書 ク ラ ス は 共 通 の 書 式 構 造 の も と に 作 成 さ
れ た 同 一 , ま た は 同 種 の 帳 票 文 書 の 集 合 と 定 義 す る .

With tessdata_fast:
(Almost expected result)

種々の帳票文書から正確に各文書に対応する書式構
造を認識し, 個々の項目データを抽出するには, 帳票
文書クラスを識別することが必要不可欠である, ここ
で, 帳票文書クラスは共通の書式構造のもとに作成さ

れた同一, または同種の帳票文書の集合と定義する.

It seems the preserve_interword_spaces option set in tessdata_best is not working.
I would expect text with no spaces even without the -c preserve_interword_spaces=1 command line option.
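
The comparison above can be repeated for other images and models with a small script. This is a rough sketch that assumes the tesseract executable is on PATH and the chosen language data is installed. The helper names (tesseract_cmd, space_ratio, compare) are made up for illustration; only the command-line form already used in this thread is relied on.

```python
import subprocess


def tesseract_cmd(image, lang, preserve_spaces=None):
    """Build the argv for a plain-text OCR run that prints to stdout."""
    cmd = ["tesseract", image, "stdout", "-l", lang]
    if preserve_spaces is not None:
        # Same option as on the command line: -c preserve_interword_spaces=0|1
        cmd += ["-c", f"preserve_interword_spaces={int(preserve_spaces)}"]
    return cmd


def space_ratio(text):
    """Fraction of non-newline characters that are ASCII spaces."""
    chars = [c for c in text if c not in "\r\n"]
    return chars.count(" ") / len(chars) if chars else 0.0


def compare(image, lang):
    """Run OCR with the option unset, 0 and 1; return each space ratio."""
    ratios = {}
    for setting in (None, False, True):
        result = subprocess.run(
            tesseract_cmd(image, lang, setting),
            capture_output=True,
            encoding="utf-8",
            check=True,
        )
        ratios[setting] = space_ratio(result.stdout)
    return ratios
```

A call such as compare("sample.png", "jpn") should show a clearly higher ratio for the runs where spaces are inserted between tokens, which makes it easier to tell whether a given traineddata file picks up the setting by default.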

eighttails · 2021-08-22T13:50:37Z

preserve_interword_spaces is set only in jpn_vert.config and not in jpn.config.
It should be set in jpn.config as well.

https://github.com/tesseract-ocr/langdata/blob/master/jpn/jpn.config
https://github.com/tesseract-ocr/langdata/blob/master/jpn_vert/jpn_vert.config

amitdo · 2021-08-22T14:26:11Z

Report the issue with jpn_vert.config here:
https://github.com/tesseract-ocr/langdata/issues

Check also the jpn.traineddata from tessdata_best by unpacking it. If the parameter is missing from this traineddata you should report the issue here: https://github.com/tesseract-ocr/tessdata_best/issues.
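
Once the traineddata has been unpacked (for example with the combine_tessdata tool from the training tools, which is an assumption about the setup), the resulting config file can be checked for the parameter. The sketch below assumes the config is a plain-text list of "name value" lines with "#" starting a comment. The function name read_params and the sample text are hypothetical and not copied from any real traineddata.

```python
def read_params(config_text):
    """Parse 'name value' lines from a config file into a dict of strings."""
    params = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # split name from value at first whitespace
        params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params


# Hypothetical config content, for illustration only.
sample = """\
# example config
tessedit_load_sublangs jpn_vert
preserve_interword_spaces 1
"""

params = read_params(sample)
has_option = params.get("preserve_interword_spaces") == "1"
```

Running read_params on the text of an unpacked jpn.config and jpn_vert.config would show directly which of the two files sets preserve_interword_spaces.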

stweil · 2021-08-22T15:25:01Z

jpn.config loads jpn_vert.config (tessedit_load_sublangs jpn_vert), so it should get preserve_interword_spaces=1 from there.

stweil · 2021-08-22T15:59:53Z

Using preserve_interword_spaces=1 not only affects jpn or other models which seem to require it, but also languages or scripts which don't need it. So I'm afraid that setting it in some model files is not a good idea and that a different solution is required.

eighttails · 2021-08-23T10:29:16Z

Does tessedit_load_sublangs jpn_vert work like #include?
According to the OCR result in #2702 (comment), preserve_interword_spaces in the config seems to be either not working or ignored.

The command line I used is:
tesseract sample.png stdout -l jpn
Are jpn_vert and its config loaded automatically when I specify -l jpn?

I tried tesseract.exe sample.png stdout -l jpn+jpn_vert but no luck.

種々 の 帳票 文書 か ら 正 確 に 各 文書 に 対応 する 書式 構
造 を 認識 し , 個々 の 項目 デー タ を 抽出 する に は , 帳票
文書 クラ ス を 識別 する こと が 必要 不可 欠 で ある , ここ
で , 帳票 文書 クラ ス は 共通 の 書式 構造 の も と に 作成 さ
れ た 同一 , また は 同種 の 帳票 文書 の 集合 と 定義 する .

amitdo · 2021-08-24T16:46:21Z

https://en.wikipedia.org/wiki/Scriptio_continua#Decline

Scriptio continua is still in use in Thai script, other Southeast Asian abugidas: (Burmese, Khmer, Javanese, Balinese, Sundanese script), Lao, and in languages that use Chinese characters (Chinese and Japanese).

amitdo · 2021-11-17T05:34:14Z

@stweil,

> jpn.config loads jpn_vert.config (tessedit_load_sublangs jpn_vert), so it should get preserve_interword_spaces=1 from there.

But it does not inherit it, according to @eighttails, so preserve_interword_spaces 1 should also be added to jpn.traineddata.

Shreeshrii · 2021-11-17T12:09:10Z

Changes for jpn were made in

Since tessdata was updated with the integer version of tessdata_best before the config file change was made, it has space after each character.

brlin-tw · 2021-11-17T14:30:28Z

> Since tessdata was updated with the integer version of tessdata_best before the config file change was made, it has space after each character.

If the actual cause is determined, can the tessdata be regenerated with the fix included?

The same applies to the chi_* languages, where I can still reproduce this issue.

cyzs233 · 2022-02-05T21:52:57Z

I think the issue's name is a bit confusing.

In English, words are separated by a space character. In languages such as Chinese, Japanese and Thai, however, there is often no delimiter between words.

It should be "Extra spaces in non space delimited languages"

Here is a simple trigger in SubtitleEdit:

Source.png

legistek · 2023-07-17T15:42:24Z

A little confused by all the GitHub issues surrounding this. To be clear, for me PDF output contains extra spaces when CJK languages are OCRed but the plain text does not. Using preserve_interword_spaces helps for the plain TXT output but not for PDF.

Is this something that is: (a) fixable; and (b) expected to be fixed? This basically makes OCRing CJK languages to PDF unusable because nothing more than single glyphs can be searched.

FYI Acrobat Pro itself OCRs the same documents correctly without the spaces.

Thanks!

amitdo · 2023-07-17T17:30:35Z

> Is this something that is: (a) fixable; and (b) expected to be fixed?

It's probably fixable. There is no timeline for fixing this issue.

stweil · 2023-07-17T17:39:50Z

I wonder whether preserve_interword_spaces should exist at all for LSTM results. The code has to be fixed for ALTO, hOCR and PDF output of CJK text, maybe without using that parameter.

brlin-tw · 2024-07-14T07:26:30Z

Hello, I would like to ask which component this bug most likely resides in. I'm not really familiar with C++ or Tesseract, but I would really like to have this bug fixed (even by myself), so any pointers would be appreciated.

fengwk · 2024-08-31T11:07:21Z

Is there any plan to fix this bug? I seem to have run into the same problem. I used tesseract to parse the Chinese text in a screenshot, and all the Chinese characters had spaces between them, while the real runs of one or more spaces were ignored.

世界和平 -> 世 界 和 平

free-150 · 2024-11-10T13:40:56Z

When can this be fixed?

amitdo · 2024-11-10T15:00:07Z

> When can this be fixed?

When someone decides to fix it and sends a PR.

amitdo changed the title ~~Extra spaces in HOCR output for the Thai language~~ Extra spaces in any output except txt for the Thai language Apr 22, 2020

amitdo changed the title ~~Extra spaces in any output except txt for the Thai language~~ Extra spaces in any output except txt for non space delimited languages Apr 22, 2020

amitdo mentioned this issue Apr 24, 2020

Chinese recognition was incorrectly segmented by spaces #2814

Closed

stweil added this to the 5.0.0 milestone Aug 22, 2021

amitdo added the non spaced words label Aug 24, 2021

amitdo mentioned this issue Nov 16, 2021

Extra spaces between characters with Japanese and PDF output #3645

Closed

dynobo mentioned this issue Dec 18, 2021

Spaces between Chinese characters dynobo/normcap#158

Closed

amitdo mentioned this issue Dec 28, 2022

Spaces in Japanese ocrmypdf/OCRmyPDF#1041

Closed

amitdo added the bug and output (issues related to output formats) labels Jul 17, 2023

tenpai-git mentioned this issue Mar 20, 2024

Add Japanese Vertical Support Branch for Tesseract and Ocrmypdf OCR eikek/docspell#2505

Merged

stweil modified the milestones: 5.0.0, 5.1.0 Oct 17, 2024

asukaminato0721 mentioned this issue Nov 21, 2024

[bug] blank space between characters in chinese OCR mediar-ai/screenpipe#612

Open
