For certain applications, by far the largest performance bottleneck is downloading the .traineddata file. The default files are very large because (1) they contain both a Legacy and an LSTM model and (2) they are integerized versions of the "tessdata_best" models, which are larger than the "fast" models. Some comparisons showing the potential savings from using different .traineddata files are below.
English (eng)
Current default: 10.4 MB
LSTM-only, "fast" version: 1.9 MB
Simplified Chinese (chi_sim)
Current default: 19.2 MB
LSTM-only, "fast" version: 1.6 MB
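To put the sizes above in perspective, here is a minimal sketch that computes the reduction for each language and an idealized transfer time. The file sizes are taken from the comparison above. The 10 Mbit/s connection speed and the helper names (`savings`, `download_seconds`) are assumptions made purely for illustration. The calculation ignores latency, compression, and caching.

```python
# Sizes in MB, copied from the comparison above.
SIZES_MB = {
    "eng": {"default": 10.4, "lstm_fast": 1.9},
    "chi_sim": {"default": 19.2, "lstm_fast": 1.6},
}

# Hypothetical connection speed; not a figure from this issue.
ASSUMED_MBIT_PER_S = 10.0


def savings(default_mb, fast_mb):
    """Return (MB saved, percent reduction) when switching files."""
    saved = default_mb - fast_mb
    return saved, 100.0 * saved / default_mb


def download_seconds(size_mb, mbit_per_s):
    """Idealized transfer time: size in megabits divided by bandwidth."""
    return size_mb * 8 / mbit_per_s


for lang, sizes in SIZES_MB.items():
    saved, pct = savings(sizes["default"], sizes["lstm_fast"])
    before = download_seconds(sizes["default"], ASSUMED_MBIT_PER_S)
    after = download_seconds(sizes["lstm_fast"], ASSUMED_MBIT_PER_S)
    print(f"{lang}: saves {saved:.1f} MB ({pct:.0f}%), "
          f"download {before:.1f}s -> {after:.1f}s at {ASSUMED_MBIT_PER_S:g} Mbit/s")
```

Under these assumptions the reduction is roughly 82% for eng and 92% for chi_sim. These numbers combine both changes (dropping the Legacy model and switching from "best" to "fast"), so they are an upper bound on what removing the Legacy model alone would save.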
I have not experimented with the "fast" vs "best" models, so more research would need to be done before switching the default to a potentially less accurate model. However, simply removing the Legacy model when not specifically requested may result in significantly smaller files with minimal downsides.
Downloading a >10 MB file before performing a (potentially) small recognition task is slow, and large files also appear to increase the risk of errors due to network issues. Although English is (likely) the most popular language, a search of the GitHub issues shows that most problems with language data involve Simplified Chinese, whose .traineddata file is ~2x larger than the English one.