merge_unicharsets.cpp #1024

Closed
amitdo opened this issue Jul 5, 2017 · 11 comments

Comments

@amitdo
Collaborator

amitdo commented Jul 5, 2017

https://github.com/tesseract-ocr/tesseract/blob/master/training/merge_unicharsets.cpp

Should we add it to training/Makefile.am?

@Shreeshrii
Collaborator

Amit,

Which kind of unicharsets does it merge?

  • The script based ones given in langdata
  • The training text based ones created during training process

Does it just append or also eliminate duplicates?

Where would a merged unicharset be used?

@amitdo
Collaborator Author

amitdo commented Jul 5, 2017

Good questions, Shree. Unfortunately, I simply don't have the answers...

@amitdo
Collaborator Author

amitdo commented Jul 5, 2017

I just found this file and saw that it isn't in Makefile.am, which means it won't be compiled, so you can't actually use it.

@Shreeshrii
Collaborator

In that case, we should add it to Makefile.am so that we can test and figure out what it does :-)

@amitdo
Collaborator Author

amitdo commented Jul 5, 2017

It takes two or more unicharset files with this format:
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#the-unicharset-file-format

I don't know when you are supposed to use it.
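
For illustration, here is a minimal sketch of reading such a file, relying only on the part of the linked format description that says the first line holds the number of entries and each following line describes one unichar. The function name `ReadUnichars` is hypothetical and is not part of Tesseract; the sketch keeps only the first whitespace-separated token of each entry line and ignores the property columns.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not Tesseract code.
// Reads the leading count line, then collects the first token (the unichar)
// of each of the next `count` lines. Property columns are ignored.
// Returns false if the count line is missing or malformed, or if the stream
// holds fewer entry lines than the count line declares.
bool ReadUnichars(std::istream& in, std::vector<std::string>* out) {
  assert(out != nullptr);
  std::string line;
  if (!std::getline(in, line)) return false;

  std::istringstream header(line);
  std::size_t count = 0;
  if (!(header >> count)) return false;

  out->clear();
  for (std::size_t i = 0; i < count; ++i) {
    if (!std::getline(in, line)) return false;
    std::istringstream entry(line);
    std::string unichar;
    if (!(entry >> unichar)) return false;
    out->push_back(unichar);
  }
  return true;
}
```

Feeding it a `std::ifstream` opened on a unicharset file should yield the list of unichars in file order; a real loader would also need to parse the properties on each line.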

@amitdo
Collaborator Author

amitdo commented Jul 5, 2017

It calls this function to do the merge:

void UNICHARSET::AppendOtherUnicharset(const UNICHARSET& src) {
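
On the "append or eliminate duplicates" question above, here is a rough sketch of what an append that skips existing entries looks like. `ToyCharset` and its methods are hypothetical stand-ins built on the standard library, not Tesseract's `UNICHARSET` class, and whether `AppendOtherUnicharset` behaves exactly this way is an assumption that should be checked against the source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a unicharset: an ordered list of UTF-8 strings
// where the position in the list plays the role of the unichar id.
struct ToyCharset {
  std::vector<std::string> unichars;        // entries in insertion order
  std::unordered_set<std::string> present;  // fast membership lookup

  bool Contains(const std::string& u) const { return present.count(u) > 0; }

  // Adds u at the end if it is not already present.
  // Returns true when a new entry was created.
  bool InsertIfNew(const std::string& u) {
    if (Contains(u)) return false;
    unichars.push_back(u);
    present.insert(u);
    return true;
  }

  // Appends every entry of src that this set does not already contain,
  // preserving src's order. Returns the number of entries added.
  std::size_t AppendOther(const ToyCharset& src) {
    assert(&src != this);  // merging a set into itself is not supported
    std::size_t added = 0;
    for (const std::string& u : src.unichars) {
      if (InsertIfNew(u)) ++added;
    }
    return added;
  }
};
```

With this shape, merging {a, b} with {b, c} gives {a, b, c}: existing ids stay stable and only unseen unichars get new ids at the end.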

@theraysmith
Contributor

It could be used to create a combined unicharset for a script-level engine, like the new Latin or Devanagari.
It isn't referenced by the current training documentation, but it might be useful to someone, so it should probably be added to Makefile.am.

@Shreeshrii
Collaborator

Shreeshrii commented Sep 9, 2017

@theraysmith

Is there a similar merge_language_model program, used for building a script-level engine?

Recently someone asked me:

Let us consider the language combination lat+san+guj,

where lat is IAST (the Roman transliteration of Sanskrit) in Latin script, plus English,
san is Sanskrit in Devanagari script, and
guj is Gujarati in Gujarati script.

So something like this needs to combine Devanagari + Gujarati + san_latn (or IAST, or Latin).

What would be the best way to do this?

Can multiple training_files.txt files for different languages be given as input to lstmtraining, or do they all need to be merged into one big file?

Sample image: [image: multi-language]

Here is the merged unicharset for these languages:

deva-iast-guj.lstm-unicharset.txt

@Shreeshrii
Collaborator

Shreeshrii commented Sep 10, 2017

It could be used to create a combined unicharset for a script-level engine, like the new Latin or Devanagari.
It isn't referenced by the current training documentation, but it might be useful to someone, so it should probably be added to Makefile.am.

Added by commit 9a038f8 as part of PR #1116

@amitdo
Collaborator Author

amitdo commented Sep 11, 2017

Next time, I suggest not mixing unrelated commits in one PR.

@amitdo
Collaborator Author

amitdo commented Sep 11, 2017

Thanks anyway!

@amitdo closed this as completed Sep 11, 2017