Release 1.2.0
Description
After a lot of unsuccessful experiments, I'm glad to have found a way to improve the accuracy and release it.
I decided to focus on accuracy over quantity for the moment, making sure the algorithm works properly before trying to scale it up.
With this version 1.2.0:
- Both tinyld and tinyld-light are over 97% accuracy on the 16 most common languages
- tinyld global accuracy on all languages (64) is over 95%, and each language has an accuracy > 80%
- This change causes a small disk size increase
Changes
Changes to the algorithm
- Remove the word ranking step
- Improve the n-gram ranking (based on a variable number of grams)
- Add a per-language coefficient to specify more accurately how many n-grams to store per language (optimizes storage space)
- Use 4-grams and 5-grams more often (as a replacement for words)
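To make the n-gram ideas above more concrete, here is a minimal sketch of variable-size n-gram counting with a per-language storage budget. It is not tinyld's actual implementation: the function names (`extractNgrams`, `buildProfile`) and the parameters `baseBudget` and `coefficient` are hypothetical, and the ranking shown (raw frequency) is only an assumption.

```typescript
// Illustrative sketch only, not tinyld's real code. All names are hypothetical.

/** Count every n-gram of the requested sizes found in the text. */
function extractNgrams(text: string, sizes: number[]): Map<string, number> {
  const counts = new Map<string, number>();
  // Split by code point so non-BMP characters are not cut in half.
  const chars = Array.from(text.toLowerCase());
  for (const n of sizes) {
    for (let i = 0; i + n <= chars.length; i++) {
      const gram = chars.slice(i, i + n).join('');
      counts.set(gram, (counts.get(gram) || 0) + 1);
    }
  }
  return counts;
}

/**
 * Keep only the most frequent n-grams (sizes 1 to 5).
 * The number kept is a base budget scaled by a per-language coefficient.
 */
function buildProfile(text: string, baseBudget: number, coefficient: number): string[] {
  const budget = Math.round(baseBudget * coefficient);
  return Array.from(extractNgrams(text, [1, 2, 3, 4, 5]).entries())
    // Most frequent first; ties broken alphabetically for a stable result.
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .slice(0, budget)
    .map(([gram]) => gram);
}
```

In this sketch, a language that needs more n-grams to be told apart from its neighbours would get a coefficient above 1, while an easily recognised one could use a smaller value and save space.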
New API
A few new APIs to get the list of supported languages and their names
import { supportedLanguages, langName, langRegion } from 'tinyld'
// all supported languages (ISO3 format)
supportedLanguages // ['jpn', 'cmn', ...]
// and a few utilities for language codes
langName('jpn') // Japanese
langRegion('jpn') // east-asia
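As one possible use of these helpers, the sketch below groups language codes by region. `groupByRegion` is a hypothetical helper, not part of tinyld; the region lookup is passed in as a parameter so the snippet runs on its own, and with the package installed you would presumably pass `supportedLanguages` and `langRegion`.

```typescript
// Hypothetical helper, not part of tinyld: groups language codes by region.
// `regionOf` is injected so the sketch does not depend on the package.
function groupByRegion(
  codes: readonly string[],
  regionOf: (code: string) => string,
): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const code of codes) {
    const region = regionOf(code);
    if (!groups[region]) groups[region] = [];
    groups[region].push(code);
  }
  return groups;
}

// Assumed usage with tinyld (not executed here):
//   import { supportedLanguages, langRegion } from 'tinyld'
//   const byRegion = groupByRegion(supportedLanguages, langRegion)
```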
Language support
- A few languages were disabled
- A few languages were added
- The total number of languages is now 64. The removed ones were dropped mostly because of poor accuracy (often caused by an insufficient training dataset). I will try to bring them back as soon as possible, once their accuracy passes the 80% threshold.
Per language Detection Accuracy
- Greek (ell) - 100%
- Hindi (hin) - 100%
- Bengali (ben) - 100%
- Thai (tha) - 100%
- Telugu (tel) - 100%
- Gujarati (guj) - 100%
- Tamil (tam) - 100%
- Amharic (amh) - 100%
- Kannada (kan) - 100%
- Burmese (mya) - 100%
- Armenian (hye) - 99.9555%
- Japanese (jpn) - 99.9333%
- Vietnamese (vie) - 99.9067%
- Korean (kor) - 99.8134%
- Khmer (khm) - 99.7354%
- Urdu (urd) - 99.2537%
- Hebrew (heb) - 99.1068%
- Berber (ber) - 99.0135%
- German (deu) - 98.9601%
- Toki Pona (toki) - 98.8801%
- Russian (rus) - 98.8268%
- Persian (pes) - 98.8135%
- Polish (pol) - 98.8002%
- Chinese (cmn) - 98.7602%
- French (fra) - 98.7068%
- Arabic (ara) - 98.4669%
- Finnish (fin) - 98.0936%
- English (eng) - 98.0136%
- Yiddish (yid) - 97.9869%
- Romanian (ron) - 97.9336%
- Mongolian (mon) - 97.8058%
- Lithuanian (lit) - 97.8003%
- Icelandic (isl) - 97.7203%
- Klingon (tlh) - 97.6803%
- Hungarian (hun) - 97.5603%
- Kazakh (kaz) - 97.4214%
- Indonesian (ind) - 97.267%
- Dutch (nld) - 96.8937%
- Tatar (tat) - 96.8271%
- Latvian (lvs) - 96.4734%
- Tagalog (tgl) - 95.8539%
- Ukrainian (ukr) - 95.4673%
- Turkish (tur) - 95.214%
- Portuguese (por) - 95.054%
- Kirundi (run) - 94.6058%
- Turkmen (tuk) - 94.5193%
- Italian (ita) - 94.4541%
- Belarusian (bel) - 94.2808%
- Esperanto (epo) - 93.9475%
- Spanish (spa) - 93.4009%
- Volapuk (vol) - 92.6978%
- Swedish (swe) - 91.9344%
- Irish (gle) - 89.6735%
- Latin (lat) - 89.0948%
- Estonian (est) - 88.6921%
- Czech (ces) - 88.5749%
- Catalan (cat) - 88.0949%
- Danish (dan) - 87.375%
- Afrikaans (afr) - 86.578%
- Bulgarian (bul) - 84.5754%
- Slovak (slk) - 83.4555%
- Serbian (srp) - 83.0823%
- Macedonian (mkd) - 82.709%
- Norwegian (nob) - 81.5358%
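The figures above can be checked against the release claims (64 languages, each above 80%) with a short script. This is only a sketch: the numbers are copied from the list, the variable names are hypothetical, and the unweighted mean computed here is an assumption about how a global figure could be derived, so it may differ from the global accuracy quoted in the description if that one is weighted by sample count.

```typescript
// Per-language accuracy figures copied from the list above: [code, percent].
const accuracy: Array<[string, number]> = [
  ['ell', 100], ['hin', 100], ['ben', 100], ['tha', 100],
  ['tel', 100], ['guj', 100], ['tam', 100], ['amh', 100],
  ['kan', 100], ['mya', 100], ['hye', 99.9555], ['jpn', 99.9333],
  ['vie', 99.9067], ['kor', 99.8134], ['khm', 99.7354], ['urd', 99.2537],
  ['heb', 99.1068], ['ber', 99.0135], ['deu', 98.9601], ['toki', 98.8801],
  ['rus', 98.8268], ['pes', 98.8135], ['pol', 98.8002], ['cmn', 98.7602],
  ['fra', 98.7068], ['ara', 98.4669], ['fin', 98.0936], ['eng', 98.0136],
  ['yid', 97.9869], ['ron', 97.9336], ['mon', 97.8058], ['lit', 97.8003],
  ['isl', 97.7203], ['tlh', 97.6803], ['hun', 97.5603], ['kaz', 97.4214],
  ['ind', 97.267], ['nld', 96.8937], ['tat', 96.8271], ['lvs', 96.4734],
  ['tgl', 95.8539], ['ukr', 95.4673], ['tur', 95.214], ['por', 95.054],
  ['run', 94.6058], ['tuk', 94.5193], ['ita', 94.4541], ['bel', 94.2808],
  ['epo', 93.9475], ['spa', 93.4009], ['vol', 92.6978], ['swe', 91.9344],
  ['gle', 89.6735], ['lat', 89.0948], ['est', 88.6921], ['ces', 88.5749],
  ['cat', 88.0949], ['dan', 87.375], ['afr', 86.578], ['bul', 84.5754],
  ['slk', 83.4555], ['srp', 83.0823], ['mkd', 82.709], ['nob', 81.5358],
];

const THRESHOLD = 80; // minimum accuracy required to keep a language enabled

// Languages that would fail the threshold (expected to be none in this release).
const belowThreshold = accuracy.filter(([, pct]) => pct <= THRESHOLD);

// Simple average over languages; every language counts equally.
const unweightedMean =
  accuracy.reduce((sum, [, pct]) => sum + pct, 0) / accuracy.length;

// Entry with the lowest accuracy.
const lowest = accuracy.reduce((min, cur) => (cur[1] < min[1] ? cur : min));

console.log({
  languages: accuracy.length,
  belowThreshold: belowThreshold.map(([code]) => code),
  unweightedMean: unweightedMean.toFixed(2),
  lowest,
});
```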