Use smaller .traineddata files by default #750

Closed
Balearica opened this issue May 1, 2023 · 1 comment

Comments

@Balearica
Member

For certain applications, by far the largest performance bottleneck is downloading the .traineddata file. The default files are very large for two reasons: (1) they contain both a Legacy and an LSTM model, and (2) they are an integerized version of the "tessdata_best" models, which are larger than the "fast" models. The comparisons below show the potential savings from using different .traineddata files.

  1. English (eng)
    1. Current default: 10.4 MB
    2. LSTM-only, "fast" version: 1.9 MB
  2. Simplified Chinese (chi_sim)
    1. Current default: 19.2 MB
    2. LSTM-only, "fast" version: 1.6 MB
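
To put the sizes above in relative terms, here is a small sketch that computes the share of the download that would be avoided. The type and function names (`SizeComparison`, `percentSaved`) are made up for illustration; only the four sizes come from the list above.

```typescript
// Sizes in MB, copied from the comparison above.
interface SizeComparison {
  lang: string;
  currentDefaultMB: number;
  lstmFastMB: number;
}

const comparisons: SizeComparison[] = [
  { lang: "eng", currentDefaultMB: 10.4, lstmFastMB: 1.9 },
  { lang: "chi_sim", currentDefaultMB: 19.2, lstmFastMB: 1.6 },
];

// Percentage of the download avoided, rounded to one decimal place.
function percentSaved(c: SizeComparison): number {
  const fraction = (c.currentDefaultMB - c.lstmFastMB) / c.currentDefaultMB;
  return Math.round(fraction * 1000) / 10;
}

for (const c of comparisons) {
  console.log(`${c.lang}: ${percentSaved(c)}% smaller`);
}
```

By this arithmetic the LSTM-only "fast" files are roughly 82% smaller for English and roughly 92% smaller for Simplified Chinese.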

I have not compared the "fast" and "best" models, so more research is needed before switching the default to a potentially less accurate model. However, simply dropping the Legacy model unless it is specifically requested may yield significantly smaller files with minimal downsides.
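
A minimal sketch of the "drop Legacy unless requested" rule, assuming the decision is keyed on the OCR engine mode. The numeric values mirror Tesseract's `OcrEngineMode` enum; the `pickDataVariant` helper and the variant labels are hypothetical and not part of any existing API.

```typescript
// Engine modes; the numeric values follow Tesseract's OcrEngineMode enum.
const OEM = {
  TESSERACT_ONLY: 0,          // Legacy engine only
  LSTM_ONLY: 1,               // LSTM engine only
  TESSERACT_LSTM_COMBINED: 2, // Legacy and LSTM together
  DEFAULT: 3,                 // whatever the loaded data supports
} as const;

type Oem = (typeof OEM)[keyof typeof OEM];

// Hypothetical labels for the two kinds of .traineddata file discussed here.
type DataVariant = "legacy+lstm" | "lstm-only";

// Hypothetical helper: fetch the larger combined file only when the caller
// explicitly asks for an engine mode that needs the Legacy model.
function pickDataVariant(oem: Oem = OEM.DEFAULT): DataVariant {
  const legacyRequested =
    oem === OEM.TESSERACT_ONLY || oem === OEM.TESSERACT_LSTM_COMBINED;
  return legacyRequested ? "legacy+lstm" : "lstm-only";
}

console.log(pickDataVariant());                   // no explicit request
console.log(pickDataVariant(OEM.TESSERACT_ONLY)); // Legacy explicitly requested
```

Under this rule, callers who never mention the Legacy engine get the smaller file, while anyone who opts into a Legacy mode still gets the combined data.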

Beyond the slow download of a >10 MB file before a (potentially) small recognition task, large files appear to increase the risk of errors caused by network issues. Although English is (likely) the most popular language, a search of the GitHub issues shows that most problems with language data involve Simplified Chinese, whose .traineddata file is ~2x larger than the English one.

@Balearica
Member Author

Closing as this is covered by #806.
