merge_unicharsets.cpp #1024

amitdo · 2017-07-05T10:27:07Z

https://github.com/tesseract-ocr/tesseract/blob/master/training/merge_unicharsets.cpp

Should we add it to training/Makefile.am?

Shreeshrii · 2017-07-05T15:38:56Z

Amit,

Which kind of unicharsets does it merge?

- The script-based ones given in langdata
- The training-text-based ones created during the training process

Does it just append or also eliminate duplicates?

Where would a merged unicharset be used?

amitdo · 2017-07-05T16:47:10Z

Good questions, Shree. Unfortunately, I simply don't have the answers...

amitdo · 2017-07-05T16:50:21Z

I just found this file and saw that it isn't listed in Makefile.am, which means it won't be compiled, so you can't actually use it.

Shreeshrii · 2017-07-05T16:54:07Z

In that case, we should add it to Makefile.am so that we can test and figure out what it does :-)

amitdo · 2017-07-05T18:52:23Z

It takes two or more unicharset files with this format:
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#the-unicharset-file-format

I don't know when you are supposed to use it.
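To make the input format concrete, here is a minimal C++ sketch that reads the unichar column from a unicharset file. It relies only on what the linked wiki page describes: the first line holds the number of entries, and each following line starts with the unichar, followed by its properties. `ReadUnichars` is a hypothetical helper written for this illustration, not a Tesseract function, and it ignores every field after the first.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Tesseract): returns the first
// whitespace-delimited field, i.e. the unichar, of each entry line.
std::vector<std::string> ReadUnichars(std::istream& in) {
  std::vector<std::string> unichars;
  std::string line;
  if (!std::getline(in, line)) return unichars;  // empty input
  // The first line declares how many entries follow.
  const std::size_t declared = std::stoul(line);
  while (unichars.size() < declared && std::getline(in, line)) {
    std::istringstream fields(line);
    std::string unichar;
    // Skip blank lines; otherwise keep only the first field.
    if (fields >> unichar) unichars.push_back(unichar);
  }
  return unichars;
}
```

Passing a `std::ifstream` opened on a unicharset file should yield one string per entry, which is enough to compare two files by eye before merging them.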

amitdo · 2017-07-05T18:59:16Z

It calls this function to do the merge:

tesseract/ccutil/unicharset.cpp

Line 439 in 29f3de9

void UNICHARSET::AppendOtherUnicharset(const UNICHARSET& src) {
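On the earlier question of whether the merge appends or also eliminates duplicates: the sketch below models a merge that skips unichars already present and keeps existing ids stable. This is an assumption about the behavior, to be checked against the body of `AppendOtherUnicharset` in unicharset.cpp. `ToyUnicharset` and its methods are hypothetical stand-ins, and the real class also carries per-character properties that this toy ignores.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for UNICHARSET (hypothetical): maps unichar strings to ids.
class ToyUnicharset {
 public:
  // Adds a unichar if it is new; returns its id either way.
  int Insert(const std::string& unichar) {
    auto it = ids_.find(unichar);
    if (it != ids_.end()) return it->second;  // duplicate: reuse the id
    const int id = static_cast<int>(unichars_.size());
    unichars_.push_back(unichar);
    ids_.emplace(unichar, id);
    return id;
  }

  // Appends every entry of src that is not already present.
  // Ids assigned before the call are left unchanged.
  void AppendOther(const ToyUnicharset& src) {
    for (const std::string& unichar : src.unichars_) Insert(unichar);
    assert(ids_.size() == unichars_.size());  // invariant: no duplicates
  }

  int size() const { return static_cast<int>(unichars_.size()); }
  const std::string& at(int id) const { return unichars_[id]; }

 private:
  std::vector<std::string> unichars_;          // id -> unichar
  std::unordered_map<std::string, int> ids_;   // unichar -> id
};
```

Under this model, merging {a, b} with {b, c} gives {a, b, c}: the shared entry keeps its original id and only the new one is appended at the end.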

theraysmith · 2017-08-03T19:00:07Z

It could be used to create a combined unicharset for a script-level engine, like the new Latin or Devanagari.
It isn't referenced by the current training documentation, but it might be useful to someone, so it should probably be added to Makefile.am.

Shreeshrii · 2017-09-09T09:30:06Z

@theraysmith

Is there a similar merge_language_model program, used for building a script-level engine?

Recently someone asked me about the following case:

Consider the language combination lat+san+guj, where:

lat is IAST, the Roman transliteration of Sanskrit in Latin script, plus English
san is Sanskrit in Devanagari script
guj is Gujarati in Gujarati script

So something like this needs a combination of Devanagari + Gujarati + san_latn or IAST or Latin.

What would be the best way to do this?

Can multiple training_files.txt for different languages be given as input to lstmtraining, or do they all need to be merged into one big file?

Sample image below:

Here is the merged unicharset for these languages:

deva-iast-guj.lstm-unicharset.txt

Shreeshrii · 2017-09-10T16:25:39Z

> It could be used to create a combined unicharset for a script-level engine, like the new Latin or Devanagari.
> It isn't referenced by the current training documentation, but it might be useful to someone, so it should probably be added to Makefile.am.

Added by commit 9a038f8 as part of PR #1116

amitdo · 2017-09-11T07:15:28Z

Next time, I suggest not mixing unrelated commits in one PR.

amitdo · 2017-09-11T08:19:04Z

Thanks anyway!

amitdo mentioned this issue Aug 3, 2017

Unused function PrepareDistortedPix() #1052

Closed

amitdo closed this as completed Sep 11, 2017
