Data Files
- Special Data Files
- Updated LSTM Data Files for Version 4.00
- Data Files for Version 4.00
- Data Files for Version 3.04/3.05
- Cube Data Files for Version 3.04/3.05
- Fraktur Data Files
- Data Files for Version 3.02
- Data Files for Version 2.0x
- Format of traineddata files
Lang Code | Description | 4.0/3.0x traineddata |
---|---|---|
osd | Orientation and script detection | osd.traineddata |
equ | Math / equation detection | equ.traineddata |
Note: These two data files are compatible with older versions of Tesseract. osd is compatible with version 3.01 and up, and equ is compatible with version 3.02 and up.
We have three sets of .traineddata files on GitHub in three separate repositories.
- https://github.com/tesseract-ocr/tessdata_best
- https://github.com/tesseract-ocr/tessdata_fast
- https://github.com/tesseract-ocr/tessdata
Most users will want tessdata_fast, and that is what will be shipped as part of Linux distributions.
tessdata_best is for people willing to trade a lot of speed for slightly better accuracy. It is also the only set of files that can be used for certain retraining scenarios for advanced users.
The third set, tessdata, is the only one that supports the legacy recognizer. The 4.00 files from November 2016 have both legacy and older LSTM models. The current set of files in tessdata has the legacy models and newer LSTM models (integer versions of the 4.00.00 alpha models in tessdata_best).
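As a rough illustration of choosing between the three repositories, the sketch below builds a download URL for one language file. The `/raw/main/` path segment and the short variant names (`best`, `fast`, `legacy`) are assumptions made for this example, not something stated on this page; check the repository's default branch before relying on it.

```python
# Repository URLs as listed on this page; the short keys are hypothetical labels.
REPOSITORIES = {
    "best": "https://github.com/tesseract-ocr/tessdata_best",
    "fast": "https://github.com/tesseract-ocr/tessdata_fast",
    "legacy": "https://github.com/tesseract-ocr/tessdata",
}


def traineddata_url(lang, variant="fast", branch="main"):
    """Return a download URL for <lang>.traineddata.

    Assumption: files are reachable under <repo>/raw/<branch>/ and the
    default branch is called "main".
    """
    if variant not in REPOSITORIES:
        raise ValueError(f"unknown variant {variant!r}, expected one of {sorted(REPOSITORIES)}")
    return f"{REPOSITORIES[variant]}/raw/{branch}/{lang}.traineddata"


if __name__ == "__main__":
    # tessdata_fast is the default because most users will want it.
    print(traineddata_url("eng"))
    # The legacy recognizer is only supported by the files in tessdata.
    print(traineddata_url("deu", variant="legacy"))
```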
Note: When using the new models in the tessdata_best and tessdata_fast repositories, only the new LSTM-based OCR engine is supported. The legacy engine is not supported with these files, so Tesseract's oem modes '0' and '2' won't work with them.
The data files for version 4.00 listed below support the legacy recognizer with --oem 0 and LSTM models with --oem 1.
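The engine-mode statements above can be captured in a small lookup table. This is only a sketch of what this page states: the function name and the three status strings are made up for the example, and any combination the page does not mention is reported as "not stated" rather than guessed.

```python
# What this page says about --oem values per repository.
# Anything not listed here is deliberately left as "not stated".
STATED_OEM_SUPPORT = {
    "tessdata_best": {"supported": {1}, "unsupported": {0, 2}},
    "tessdata_fast": {"supported": {1}, "unsupported": {0, 2}},
    "tessdata": {"supported": {0, 1}, "unsupported": set()},
}


def oem_status(repo, oem):
    """Return 'supported', 'unsupported' or 'not stated' for a repo/--oem pair."""
    stated = STATED_OEM_SUPPORT[repo]
    if oem in stated["supported"]:
        return "supported"
    if oem in stated["unsupported"]:
        return "unsupported"
    return "not stated"


if __name__ == "__main__":
    for repo in STATED_OEM_SUPPORT:
        for oem in (0, 1, 2):
            print(f"{repo:14} --oem {oem}: {oem_status(repo, oem)}")
```

A check like this could run before invoking Tesseract, so that a legacy-only mode is not requested against LSTM-only files.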
Note: The kur data file was not updated from 3.04. For Fraktur, see the section Fraktur Data Files, or use the newer data files from the tessdata_fast or tessdata_best repositories.
Lang Code | Language | 4.0 traineddata |
---|---|---|
afr | Afrikaans | afr.traineddata |
amh | Amharic | amh.traineddata |
ara | Arabic | ara.traineddata |
asm | Assamese | asm.traineddata |
aze | Azerbaijani | aze.traineddata |
aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
bel | Belarusian | bel.traineddata |
ben | Bengali | ben.traineddata |
bod | Tibetan | bod.traineddata |
bos | Bosnian | bos.traineddata |
bul | Bulgarian | bul.traineddata |
cat | Catalan; Valencian | cat.traineddata |
ceb | Cebuano | ceb.traineddata |
ces | Czech | ces.traineddata |
chi_sim | Chinese - Simplified | chi_sim.traineddata |
chi_tra | Chinese - Traditional | chi_tra.traineddata |
chr | Cherokee | chr.traineddata |
cym | Welsh | cym.traineddata |
dan | Danish | dan.traineddata |
deu | German | deu.traineddata |
dzo | Dzongkha | dzo.traineddata |
ell | Greek, Modern (1453-) | ell.traineddata |
eng | English | eng.traineddata |
enm | English, Middle (1100-1500) | enm.traineddata |
epo | Esperanto | epo.traineddata |
est | Estonian | est.traineddata |
eus | Basque | eus.traineddata |
fas | Persian | fas.traineddata |
fin | Finnish | fin.traineddata |
fra | French | fra.traineddata |
frk | Frankish | frk.traineddata |
frm | French, Middle (ca. 1400-1600) | frm.traineddata |
gle | Irish | gle.traineddata |
glg | Galician | glg.traineddata |
grc | Greek, Ancient (-1453) | grc.traineddata |
guj | Gujarati | guj.traineddata |
hat | Haitian; Haitian Creole | hat.traineddata |
heb | Hebrew | heb.traineddata |
hin | Hindi | hin.traineddata |
hrv | Croatian | hrv.traineddata |
hun | Hungarian | hun.traineddata |
iku | Inuktitut | iku.traineddata |
ind | Indonesian | ind.traineddata |
isl | Icelandic | isl.traineddata |
ita | Italian | ita.traineddata |
ita_old | Italian - Old | ita_old.traineddata |
jav | Javanese | jav.traineddata |
jpn | Japanese | jpn.traineddata |
kan | Kannada | kan.traineddata |
kat | Georgian | kat.traineddata |
kat_old | Georgian - Old | kat_old.traineddata |
kaz | Kazakh | kaz.traineddata |
khm | Central Khmer | khm.traineddata |
kir | Kirghiz; Kyrgyz | kir.traineddata |
kor | Korean | kor.traineddata |
kur | Kurdish | kur.traineddata |
lao | Lao | lao.traineddata |
lat | Latin | lat.traineddata |
lav | Latvian | lav.traineddata |
lit | Lithuanian | lit.traineddata |
mal | Malayalam | mal.traineddata |
mar | Marathi | mar.traineddata |
mkd | Macedonian | mkd.traineddata |
mlt | Maltese | mlt.traineddata |
msa | Malay | msa.traineddata |
mya | Burmese | mya.traineddata |
nep | Nepali | nep.traineddata |
nld | Dutch; Flemish | nld.traineddata |
nor | Norwegian | nor.traineddata |
ori | Oriya | ori.traineddata |
pan | Panjabi; Punjabi | pan.traineddata |
pol | Polish | pol.traineddata |
por | Portuguese | por.traineddata |
pus | Pushto; Pashto | pus.traineddata |
ron | Romanian; Moldavian; Moldovan | ron.traineddata |
rus | Russian | rus.traineddata |
san | Sanskrit | san.traineddata |
sin | Sinhala; Sinhalese | sin.traineddata |
slk | Slovak | slk.traineddata |
slv | Slovenian | slv.traineddata |
spa | Spanish; Castilian | spa.traineddata |
spa_old | Spanish; Castilian - Old | spa_old.traineddata |
sqi | Albanian | sqi.traineddata |
srp | Serbian | srp.traineddata |
srp_latn | Serbian - Latin | srp_latn.traineddata |
swa | Swahili | swa.traineddata |
swe | Swedish | swe.traineddata |
syr | Syriac | syr.traineddata |
tam | Tamil | tam.traineddata |
tel | Telugu | tel.traineddata |
tgk | Tajik | tgk.traineddata |
tgl | Tagalog | tgl.traineddata |
tha | Thai | tha.traineddata |
tir | Tigrinya | tir.traineddata |
tur | Turkish | tur.traineddata |
uig | Uighur; Uyghur | uig.traineddata |
ukr | Ukrainian | ukr.traineddata |
urd | Urdu | urd.traineddata |
uzb | Uzbek | uzb.traineddata |
uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
vie | Vietnamese | vie.traineddata |
yid | Yiddish | yid.traineddata |
Note: For Arabic and Hindi you need both the traineddata file and the cube data files.
Lang Code | Language | 3.04 traineddata |
---|---|---|
afr | Afrikaans | afr.traineddata |
amh | Amharic | amh.traineddata |
ara | Arabic | ara.traineddata |
asm | Assamese | asm.traineddata |
aze | Azerbaijani | aze.traineddata |
aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
bel | Belarusian | bel.traineddata |
ben | Bengali | ben.traineddata |
bod | Tibetan | bod.traineddata |
bos | Bosnian | bos.traineddata |
bul | Bulgarian | bul.traineddata |
cat | Catalan; Valencian | cat.traineddata |
ceb | Cebuano | ceb.traineddata |
ces | Czech | ces.traineddata |
chi_sim | Chinese - Simplified | chi_sim.traineddata |
chi_tra | Chinese - Traditional | chi_tra.traineddata |
chr | Cherokee | chr.traineddata |
cym | Welsh | cym.traineddata |
dan | Danish | dan.traineddata |
deu | German | deu.traineddata |
dzo | Dzongkha | dzo.traineddata |
ell | Greek, Modern (1453-) | ell.traineddata |
eng | English | eng.traineddata |
enm | English, Middle (1100-1500) | enm.traineddata |
epo | Esperanto | epo.traineddata |
est | Estonian | est.traineddata |
eus | Basque | eus.traineddata |
fas | Persian | fas.traineddata |
fin | Finnish | fin.traineddata |
fra | French | fra.traineddata |
frk | Frankish | frk.traineddata |
frm | French, Middle (ca. 1400-1600) | frm.traineddata |
gle | Irish | gle.traineddata |
glg | Galician | glg.traineddata |
grc | Greek, Ancient (-1453) | grc.traineddata |
guj | Gujarati | guj.traineddata |
hat | Haitian; Haitian Creole | hat.traineddata |
heb | Hebrew | heb.traineddata |
hin | Hindi | hin.traineddata |
hrv | Croatian | hrv.traineddata |
hun | Hungarian | hun.traineddata |
iku | Inuktitut | iku.traineddata |
ind | Indonesian | ind.traineddata |
isl | Icelandic | isl.traineddata |
ita | Italian | ita.traineddata |
ita_old | Italian - Old | ita_old.traineddata |
jav | Javanese | jav.traineddata |
jpn | Japanese | jpn.traineddata |
kan | Kannada | kan.traineddata |
kat | Georgian | kat.traineddata |
kat_old | Georgian - Old | kat_old.traineddata |
kaz | Kazakh | kaz.traineddata |
khm | Central Khmer | khm.traineddata |
kir | Kirghiz; Kyrgyz | kir.traineddata |
kor | Korean | kor.traineddata |
kur | Kurdish | kur.traineddata |
lao | Lao | lao.traineddata |
lat | Latin | lat.traineddata |
lav | Latvian | lav.traineddata |
lit | Lithuanian | lit.traineddata |
mal | Malayalam | mal.traineddata |
mar | Marathi | mar.traineddata |
mkd | Macedonian | mkd.traineddata |
mlt | Maltese | mlt.traineddata |
msa | Malay | msa.traineddata |
mya | Burmese | mya.traineddata |
nep | Nepali | nep.traineddata |
nld | Dutch; Flemish | nld.traineddata |
nor | Norwegian | nor.traineddata |
ori | Oriya | ori.traineddata |
pan | Panjabi; Punjabi | pan.traineddata |
pol | Polish | pol.traineddata |
por | Portuguese | por.traineddata |
pus | Pushto; Pashto | pus.traineddata |
ron | Romanian; Moldavian; Moldovan | ron.traineddata |
rus | Russian | rus.traineddata |
san | Sanskrit | san.traineddata |
sin | Sinhala; Sinhalese | sin.traineddata |
slk | Slovak | slk.traineddata |
slv | Slovenian | slv.traineddata |
spa | Spanish; Castilian | spa.traineddata |
spa_old | Spanish; Castilian - Old | spa_old.traineddata |
sqi | Albanian | sqi.traineddata |
srp | Serbian | srp.traineddata |
srp_latn | Serbian - Latin | srp_latn.traineddata |
swa | Swahili | swa.traineddata |
swe | Swedish | swe.traineddata |
syr | Syriac | syr.traineddata |
tam | Tamil | tam.traineddata |
tel | Telugu | tel.traineddata |
tgk | Tajik | tgk.traineddata |
tgl | Tagalog | tgl.traineddata |
tha | Thai | tha.traineddata |
tir | Tigrinya | tir.traineddata |
tur | Turkish | tur.traineddata |
uig | Uighur; Uyghur | uig.traineddata |
ukr | Ukrainian | ukr.traineddata |
urd | Urdu | urd.traineddata |
uzb | Uzbek | uzb.traineddata |
uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
vie | Vietnamese | vie.traineddata |
yid | Yiddish | yid.traineddata |
In Tesseract 3.0x Arabic and Hindi use the Cube OCR engine. You need to download the cube files and move them to the same folder where the <ara/hin>.traineddata file is located.
In Tesseract 4.0 the Cube OCR engine was removed from the codebase, so if you are using 4.0 or a newer version these files are not needed.
Hindi:
hin.cube.bigrams,
hin.cube.fold,
hin.cube.lm,
hin.cube.nn,
hin.cube.params,
hin.cube.word-freq,
hin.tesseract_cube.nn
Arabic:
ara.cube.bigrams,
ara.cube.fold,
ara.cube.lm,
ara.cube.nn,
ara.cube.params,
ara.cube.word-freq,
ara.cube.size,
ara.tesseract_cube.nn
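For a Tesseract 3.0x setup, a quick way to confirm that the Cube files listed above sit next to the traineddata file is to check the folder directly. The sketch below assumes the files are named `<lang>.<suffix>` exactly as in the lists above; the function name and the example folder path are hypothetical.

```python
from pathlib import Path

# File name suffixes taken from the Hindi and Arabic lists above.
CUBE_SUFFIXES = {
    "hin": [
        "cube.bigrams", "cube.fold", "cube.lm", "cube.nn",
        "cube.params", "cube.word-freq", "tesseract_cube.nn",
    ],
    "ara": [
        "cube.bigrams", "cube.fold", "cube.lm", "cube.nn",
        "cube.params", "cube.word-freq", "cube.size", "tesseract_cube.nn",
    ],
}


def missing_cube_files(tessdata_dir, lang):
    """Return the names of required files that are absent from tessdata_dir.

    The traineddata file itself is included in the check, because the cube
    files must be in the same folder as it.
    """
    directory = Path(tessdata_dir)
    required = [f"{lang}.traineddata"]
    required += [f"{lang}.{suffix}" for suffix in CUBE_SUFFIXES[lang]]
    return [name for name in required if not (directory / name).is_file()]


if __name__ == "__main__":
    # "./tessdata" is a placeholder path for this example.
    for lang in CUBE_SUFFIXES:
        print(lang, "missing:", missing_cube_files("./tessdata", lang))
```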
These data files were prepared by @paalberti for some old versions of Tesseract. dan_frak, deu_frak and swe_frak were prepared for version 3.00; slk_frak was prepared for 3.01. Updates to these files are available at paalberti/tesseract-dan-fraktur.
Lang Code | Language | 3.0x traineddata |
---|---|---|
dan_frak | Danish - Fraktur | dan_frak.traineddata |
deu_frak | German - Fraktur | deu_frak.traineddata |
slk_frak | Slovak - Fraktur | slk_frak.traineddata |
swe_frak | Swedish - Fraktur | swe-frak.traineddata |
Lang Code | Language | 2.0x traineddata |
---|---|---|
deu | German | tesseract-2.00.deu.tar.gz |
deu-f | German - Fraktur | tesseract-2.01.deu-f.tar.gz |
eng | English | tesseract-2.00.eng.tar.gz |
eus | Basque | tesseract-2.04-eus.tar.gz |
fra | French | tesseract-2.00.fra.tar.gz |
ita | Italian | tesseract-2.00.ita.tar.gz |
nld | Dutch; Flemish | tesseract-2.00.nld.tar.gz |
por | Portuguese | tesseract-2.01.por.tar.gz |
spa | Spanish; Castilian | tesseract-2.00.spa.tar.gz |
vie | Vietnamese | tesseract-2.01.vie.tar.gz |
The traineddata file for each language is an archive file in a Tesseract-specific format. It contains several uncompressed component files that are needed by the Tesseract OCR process. The program combine_tessdata is used to create a traineddata file from the component files, and it can also extract them again, as in the following examples:
combine_tessdata -u eng.traineddata eng.
Extracting tessdata components from eng.traineddata
Wrote eng.unicharset
Wrote eng.unicharambigs
Wrote eng.inttemp
Wrote eng.pffmtable
Wrote eng.normproto
Wrote eng.punc-dawg
Wrote eng.word-dawg
Wrote eng.number-dawg
Wrote eng.freq-dawg
Wrote eng.cube-unicharset
Wrote eng.cube-word-dawg
Wrote eng.shapetable
Wrote eng.bigram-dawg
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
Wrote eng.version
Version string:Pre-4.0.0
1:unicharset:size=7477, offset=192
2:unicharambigs:size=1047, offset=7669
3:inttemp:size=976552, offset=8716
4:pffmtable:size=844, offset=985268
5:normproto:size=13408, offset=986112
6:punc-dawg:size=4322, offset=999520
7:word-dawg:size=1082890, offset=1003842
8:number-dawg:size=6426, offset=2086732
9:freq-dawg:size=1410, offset=2093158
11:cube-unicharset:size=1511, offset=2094568
12:cube-word-dawg:size=1062106, offset=2096079
13:shapetable:size=63346, offset=3158185
14:bigram-dawg:size=16109842, offset=3221531
17:lstm:size=5390718, offset=19331373
18:lstm-punc-dawg:size=4322, offset=24722091
19:lstm-word-dawg:size=7143578, offset=24726413
20:lstm-number-dawg:size=3530, offset=31869991
23:version:size=9, offset=31873521
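The directory lines in the listing above (`index:name:size=..., offset=...`) are easy to read by machine. The following sketch parses a few of them and checks an observation that holds for the numbers shown: each component starts where the previous one ends. The regular expression is written against the output shown here and is an assumption about the format, not a documented interface.

```python
import re

# A few lines copied from the listing above.
LISTING = """\
Version string:Pre-4.0.0
1:unicharset:size=7477, offset=192
2:unicharambigs:size=1047, offset=7669
3:inttemp:size=976552, offset=8716
"""

# Matches lines such as "17:lstm:size=5390718, offset=19331373".
ENTRY = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_directory(text):
    """Return one dict per component line; other lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            index, name, size, offset = match.groups()
            entries.append(
                {"index": int(index), "name": name, "size": int(size), "offset": int(offset)}
            )
    return entries


def is_contiguous(entries):
    """True if every component begins right after the one before it."""
    return all(a["offset"] + a["size"] == b["offset"] for a, b in zip(entries, entries[1:]))


if __name__ == "__main__":
    entries = parse_directory(LISTING)
    for entry in entries:
        print(entry)
    print("contiguous:", is_contiguous(entries))
```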
combine_tessdata -u eng.traineddata eng.
Extracting tessdata components from eng.traineddata
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
Wrote eng.lstm-unicharset
Wrote eng.lstm-recoder
Wrote eng.version
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
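Comparing the two extractions above, the first file contains both legacy components (such as inttemp) and lstm components, while the second contains only lstm components. The sketch below turns that observation into a rough check. Treating `inttemp` as the marker for a legacy model and `lstm` as the marker for an LSTM model is a heuristic assumed for this example, not an official rule.

```python
def component_names(extract_output, prefix="eng."):
    """Collect component names from the 'Wrote <prefix><name>' lines."""
    marker = "Wrote " + prefix
    names = set()
    for line in extract_output.splitlines():
        line = line.strip()
        if line.startswith(marker):
            names.add(line[len(marker):])
    return names


def available_engines(names):
    """Heuristic: 'inttemp' suggests a legacy model, 'lstm' an LSTM model."""
    engines = []
    if "inttemp" in names:
        engines.append("legacy")
    if "lstm" in names:
        engines.append("lstm")
    return engines


if __name__ == "__main__":
    # Shortened excerpts of the two outputs shown above.
    first = "Wrote eng.unicharset\nWrote eng.inttemp\nWrote eng.lstm\nWrote eng.version\n"
    second = "Wrote eng.lstm\nWrote eng.lstm-unicharset\nWrote eng.lstm-recoder\n"
    print(available_engines(component_names(first)))
    print(available_engines(component_names(second)))
```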
There are some proposals to replace the Tesseract archive format with a standard archive format that could also support compression. A discussion on the tesseract-dev forum proposed the ZIP format as early as 2014. In 2017 an experimental implementation was provided as a pull request.
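To give a feel for what a standard container with compression offers, the sketch below packs extracted component files into a ZIP archive using Python's standard library. It is purely illustrative: it is not the layout used by the experimental pull request, Tesseract may not be able to read the result, and the function names are hypothetical.

```python
import zipfile
from pathlib import Path


def pack_components(component_paths, archive_path):
    """Store component files in a deflate-compressed ZIP, flat, by file name."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, component_paths):
            archive.write(path, arcname=path.name)


def list_components(archive_path):
    """Return the member names stored in the archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()


if __name__ == "__main__":
    # Assumes components were extracted earlier with combine_tessdata -u,
    # so files such as eng.lstm exist in the current directory.
    components = sorted(Path(".").glob("eng.lstm*"))
    if components:
        pack_components(components, "eng.zip")
        print(list_components("eng.zip"))
```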
This old wiki is no longer maintained. All pages were moved to tesseract-ocr/tessdoc.
The latest documentation is available at https://tesseract-ocr.github.io/.