-
Notifications
You must be signed in to change notification settings - Fork 9.6k
Data Files
- Special Data Files
- Updated LSTM Data Files for Version 4.00
- Data Files for Version 4.00
- Data Files for Version 3.04/3.05
- Cube Data Files for Version 3.04/3.05
- Fraktur Data Files
- Data Files for Version 3.02
- Data Files for Version 2.0x
- Format of traineddata files
Lang Code | Description | 4.0/3.0x traineddata |
---|---|---|
osd | Orientation and script detection | osd.traineddata |
equ | Math / equation detection | equ.traineddata |
Note: These two data files are compatible with older versions of Tesseract. osd
is compatible with version 3.01 and up, and equ
is compatible with version 3.02 and up.
We have three sets of .traineddata files on GitHub in three separate repositories. Most users
will want tessdata_fast
and that is what will be shipped as part of Linux distributions.
tessdata_best
is for people willing to trade a lot of speed for slightly better accuracy. It is also
better for certain retraining scenarios for advanced users.
The third set (original 4.00 files from November 2016) is for the legacy recognizer.
Note: When using the new models in the 'tessdata_best' and 'tessdata_fast' repositories, only the new LSTM-based OCR engine is supported. The legacy engine is not supported with these files, so Tesseract's oem modes '0' and '2' won't work with them.
https://github.com/tesseract-ocr/tessdata_best https://github.com/tesseract-ocr/tessdata_fast
This set of traineddata files has support for the legacy recognizer, with --oem 0.
Note: The kur
data file was not updated from 3.04. For Fraktur, see the section Fraktur Data Files, or use the new Fraktur.traineddata.
Lang Code | Language | 4.0 traineddata |
---|---|---|
afr | Afrikaans | afr.traineddata |
amh | Amharic | amh.traineddata |
ara | Arabic | ara.traineddata |
asm | Assamese | asm.traineddata |
aze | Azerbaijani | aze.traineddata |
aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
bel | Belarusian | bel.traineddata |
ben | Bengali | ben.traineddata |
bod | Tibetan | bod.traineddata |
bos | Bosnian | bos.traineddata |
bul | Bulgarian | bul.traineddata |
cat | Catalan; Valencian | cat.traineddata |
ceb | Cebuano | ceb.traineddata |
ces | Czech | ces.traineddata |
chi_sim | Chinese - Simplified | chi_sim.traineddata |
chi_tra | Chinese - Traditional | chi_tra.traineddata |
chr | Cherokee | chr.traineddata |
cym | Welsh | cym.traineddata |
dan | Danish | dan.traineddata |
deu | German | deu.traineddata |
dzo | Dzongkha | dzo.traineddata |
ell | Greek, Modern (1453-) | ell.traineddata |
eng | English | eng.traineddata |
enm | English, Middle (1100-1500) | enm.traineddata |
epo | Esperanto | epo.traineddata |
est | Estonian | est.traineddata |
eus | Basque | eus.traineddata |
fas | Persian | fas.traineddata |
fin | Finnish | fin.traineddata |
fra | French | fra.traineddata |
frk | Frankish | frk.traineddata |
frm | French, Middle (ca. 1400-1600) | frm.traineddata |
gle | Irish | gle.traineddata |
glg | Galician | glg.traineddata |
grc | Greek, Ancient (-1453) | grc.traineddata |
guj | Gujarati | guj.traineddata |
hat | Haitian; Haitian Creole | hat.traineddata |
heb | Hebrew | heb.traineddata |
hin | Hindi | hin.traineddata |
hrv | Croatian | hrv.traineddata |
hun | Hungarian | hun.traineddata |
iku | Inuktitut | iku.traineddata |
ind | Indonesian | ind.traineddata |
isl | Icelandic | isl.traineddata |
ita | Italian | ita.traineddata |
ita_old | Italian - Old | ita_old.traineddata |
jav | Javanese | jav.traineddata |
jpn | Japanese | jpn.traineddata |
kan | Kannada | kan.traineddata |
kat | Georgian | kat.traineddata |
kat_old | Georgian - Old | kat_old.traineddata |
kaz | Kazakh | kaz.traineddata |
khm | Central Khmer | khm.traineddata |
kir | Kirghiz; Kyrgyz | kir.traineddata |
kor | Korean | kor.traineddata |
kur | Kurdish | kur.traineddata |
lao | Lao | lao.traineddata |
lat | Latin | lat.traineddata |
lav | Latvian | lav.traineddata |
lit | Lithuanian | lit.traineddata |
mal | Malayalam | mal.traineddata |
mar | Marathi | mar.traineddata |
mkd | Macedonian | mkd.traineddata |
mlt | Maltese | mlt.traineddata |
msa | Malay | msa.traineddata |
mya | Burmese | mya.traineddata |
nep | Nepali | nep.traineddata |
nld | Dutch; Flemish | nld.traineddata |
nor | Norwegian | nor.traineddata |
ori | Oriya | ori.traineddata |
pan | Panjabi; Punjabi | pan.traineddata |
pol | Polish | pol.traineddata |
por | Portuguese | por.traineddata |
pus | Pushto; Pashto | pus.traineddata |
ron | Romanian; Moldavian; Moldovan | ron.traineddata |
rus | Russian | rus.traineddata |
san | Sanskrit | san.traineddata |
sin | Sinhala; Sinhalese | sin.traineddata |
slk | Slovak | slk.traineddata |
slv | Slovenian | slv.traineddata |
spa | Spanish; Castilian | spa.traineddata |
spa_old | Spanish; Castilian - Old | spa_old.traineddata |
sqi | Albanian | sqi.traineddata |
srp | Serbian | srp.traineddata |
srp_latn | Serbian - Latin | srp_latn.traineddata |
swa | Swahili | swa.traineddata |
swe | Swedish | swe.traineddata |
syr | Syriac | syr.traineddata |
tam | Tamil | tam.traineddata |
tel | Telugu | tel.traineddata |
tgk | Tajik | tgk.traineddata |
tgl | Tagalog | tgl.traineddata |
tha | Thai | tha.traineddata |
tir | Tigrinya | tir.traineddata |
tur | Turkish | tur.traineddata |
uig | Uighur; Uyghur | uig.traineddata |
ukr | Ukrainian | ukr.traineddata |
urd | Urdu | urd.traineddata |
uzb | Uzbek | uzb.traineddata |
uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
vie | Vietnamese | vie.traineddata |
yid | Yiddish | yid.traineddata |
Note: For Arabic and Hindi you need both the traineddata file and the cube data files.
Lang Code | Language | 3.04 traineddata |
---|---|---|
afr | Afrikaans | afr.traineddata |
amh | Amharic | amh.traineddata |
ara | Arabic | ara.traineddata |
asm | Assamese | asm.traineddata |
aze | Azerbaijani | aze.traineddata |
aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
bel | Belarusian | bel.traineddata |
ben | Bengali | ben.traineddata |
bod | Tibetan | bod.traineddata |
bos | Bosnian | bos.traineddata |
bul | Bulgarian | bul.traineddata |
cat | Catalan; Valencian | cat.traineddata |
ceb | Cebuano | ceb.traineddata |
ces | Czech | ces.traineddata |
chi_sim | Chinese - Simplified | chi_sim.traineddata |
chi_tra | Chinese - Traditional | chi_tra.traineddata |
chr | Cherokee | chr.traineddata |
cym | Welsh | cym.traineddata |
dan | Danish | dan.traineddata |
deu | German | deu.traineddata |
dzo | Dzongkha | dzo.traineddata |
ell | Greek, Modern (1453-) | ell.traineddata |
eng | English | eng.traineddata |
enm | English, Middle (1100-1500) | enm.traineddata |
epo | Esperanto | epo.traineddata |
est | Estonian | est.traineddata |
eus | Basque | eus.traineddata |
fas | Persian | fas.traineddata |
fin | Finnish | fin.traineddata |
fra | French | fra.traineddata |
frk | Frankish | frk.traineddata |
frm | French, Middle (ca. 1400-1600) | frm.traineddata |
gle | Irish | gle.traineddata |
glg | Galician | glg.traineddata |
grc | Greek, Ancient (-1453) | grc.traineddata |
guj | Gujarati | guj.traineddata |
hat | Haitian; Haitian Creole | hat.traineddata |
heb | Hebrew | heb.traineddata |
hin | Hindi | hin.traineddata |
hrv | Croatian | hrv.traineddata |
hun | Hungarian | hun.traineddata |
iku | Inuktitut | iku.traineddata |
ind | Indonesian | ind.traineddata |
isl | Icelandic | isl.traineddata |
ita | Italian | ita.traineddata |
ita_old | Italian - Old | ita_old.traineddata |
jav | Javanese | jav.traineddata |
jpn | Japanese | jpn.traineddata |
kan | Kannada | kan.traineddata |
kat | Georgian | kat.traineddata |
kat_old | Georgian - Old | kat_old.traineddata |
kaz | Kazakh | kaz.traineddata |
khm | Central Khmer | khm.traineddata |
kir | Kirghiz; Kyrgyz | kir.traineddata |
kor | Korean | kor.traineddata |
kur | Kurdish | kur.traineddata |
lao | Lao | lao.traineddata |
lat | Latin | lat.traineddata |
lav | Latvian | lav.traineddata |
lit | Lithuanian | lit.traineddata |
mal | Malayalam | mal.traineddata |
mar | Marathi | mar.traineddata |
mkd | Macedonian | mkd.traineddata |
mlt | Maltese | mlt.traineddata |
msa | Malay | msa.traineddata |
mya | Burmese | mya.traineddata |
nep | Nepali | nep.traineddata |
nld | Dutch; Flemish | nld.traineddata |
nor | Norwegian | nor.traineddata |
ori | Oriya | ori.traineddata |
pan | Panjabi; Punjabi | pan.traineddata |
pol | Polish | pol.traineddata |
por | Portuguese | por.traineddata |
pus | Pushto; Pashto | pus.traineddata |
ron | Romanian; Moldavian; Moldovan | ron.traineddata |
rus | Russian | rus.traineddata |
san | Sanskrit | san.traineddata |
sin | Sinhala; Sinhalese | sin.traineddata |
slk | Slovak | slk.traineddata |
slv | Slovenian | slv.traineddata |
spa | Spanish; Castilian | spa.traineddata |
spa_old | Spanish; Castilian - Old | spa_old.traineddata |
sqi | Albanian | sqi.traineddata |
srp | Serbian | srp.traineddata |
srp_latn | Serbian - Latin | srp_latn.traineddata |
swa | Swahili | swa.traineddata |
swe | Swedish | swe.traineddata |
syr | Syriac | syr.traineddata |
tam | Tamil | tam.traineddata |
tel | Telugu | tel.traineddata |
tgk | Tajik | tgk.traineddata |
tgl | Tagalog | tgl.traineddata |
tha | Thai | tha.traineddata |
tir | Tigrinya | tir.traineddata |
tur | Turkish | tur.traineddata |
uig | Uighur; Uyghur | uig.traineddata |
ukr | Ukrainian | ukr.traineddata |
urd | Urdu | urd.traineddata |
uzb | Uzbek | uzb.traineddata |
uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
vie | Vietnamese | vie.traineddata |
yid | Yiddish | yid.traineddata |
In Tesseract 3.0x Arabic and Hindi use the Cube OCR engine. You need to download the cube files and move them to the same folder where the <ara/hin>.traineddata file is located.
In Tesseract 4.0 the Cube OCR engine was removed from the codebase, so if you are using 4.0 or a newer version these files are not needed.
Hindi:
hin.cube.bigrams,
hin.cube.fold,
hin.cube.lm,
hin.cube.nn,
hin.cube.params,
hin.cube.word-freq,
hin.tesseract_cube.nn
Arabic:
ara.cube.bigrams,
ara.cube.fold,
ara.cube.lm,
ara.cube.nn,
ara.cube.params,
ara.cube.word-freq,
ara.cube.size,
ara.tesseract_cube.nn
These data files were prepared by @paalberti for some old versions of Tesseract. dan_frak
, deu_frak
and swe_frak
were prepared for version 3.00, slk_frak
was prepared for 3.01. Updates to these files are available at paalberti/tesseract-dan-fraktur.
Lang Code | Language | 3.0x traineddata |
---|---|---|
dan_frak | Danish - Fraktur | dan_frak.traineddata |
deu_frak | German - Fraktur | deu_frak.traineddata |
slk_frak | Slovak - Fraktur | slk_frak.traineddata |
swe_frak | Swedish - Fraktur | swe-frak.traineddata |
Lang Code | Language | 2.0x traineddata |
---|---|---|
deu | German | tesseract-2.00.deu.tar.gz |
deu-f | German - Fraktur | tesseract-2.01.deu-f.tar.gz |
eng | English | tesseract-2.00.eng.tar.gz |
eus | Basque | tesseract-2.04-eus.tar.gz |
fra | French | tesseract-2.00.fra.tar.gz |
ita | Italian | tesseract-2.00.ita.tar.gz |
nld | Dutch; Flemish | tesseract-2.00.nld.tar.gz |
por | Portuguese | tesseract-2.01.por.tar.gz |
spa | Spanish; Castilian | tesseract-2.00.spa.tar.gz |
vie | Vietnamese | tesseract-2.01.vie.tar.gz |
The traineddata
file for each language is an archive file in a Tesseract specific format. It contains several uncompressed component files which are needed by the Tesseract OCR process. The program combine_tessdata
is used to create a tessdata
file from the component files and can also extract them again like in this example:
combine_tessdata -u test/tessdata/eng.traineddata eng.
Extracting tessdata components from test/tessdata/eng.traineddata
Wrote eng.unicharset
Wrote eng.unicharambigs
Wrote eng.inttemp
Wrote eng.pffmtable
Wrote eng.normproto
Wrote eng.punc-dawg
Wrote eng.word-dawg
Wrote eng.number-dawg
Wrote eng.freq-dawg
Wrote eng.cube-unicharset
Wrote eng.cube-word-dawg
Wrote eng.shapetable
Wrote eng.bigram-dawg
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
There are some proposals to replace the Tesseract archive format by a standard archive format which could also support compression. A discussion on the tesseract-dev forum proposed the ZIP format already in 2014. In 2017 an experimental implementation was provided as a pull request.
Old wiki - no longer maintained. The pages were moved, see the new documentation.
These wiki pages are no longer maintained.
All pages were moved to tesseract-ocr/tessdoc.
The latest documentation is available at https://tesseract-ocr.github.io/.