Release 1.2.0 · komodojp/tinyld

Description

After a lot of unsuccessful experiments, I'm glad to have found a way to improve the accuracy and to release it.
I decided to focus on accuracy over quantity for the moment, making sure the algorithm works properly before trying to scale it up.

With version 1.2.0:

Both tinyld and tinyld-light reach over 97% accuracy on the 16 most common languages
tinyld's global accuracy across all 64 languages is over 95%, and each language has an accuracy above 80%
This change causes a small increase in disk size

Changes

Changes to the algorithm

Remove the word ranking step
Improve the n-gram ranking (based on a variable number of grams)
Add a per-language coefficient to specify more accurately how many n-grams to store for each language (optimizes storage space)
Use 4-grams and 5-grams more often (as a replacement for words)
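
The changes above can be illustrated with a minimal sketch. This is not tinyld's actual implementation: the names `countNgrams`, `topNgrams`, `baseBudget` and `coefficient` are hypothetical, and the ranking used here (plain frequency) is an assumption. It only shows the general idea of counting n-grams of variable length (1 to 5) and keeping a number of them that is scaled per language.

```typescript
// Hypothetical sketch, NOT tinyld's real code.
type NgramCounts = Map<string, number>;

/** Count every character n-gram of length minN..maxN inside each word of `text`. */
function countNgrams(text: string, minN: number = 1, maxN: number = 5): NgramCounts {
  const counts: NgramCounts = new Map<string, number>();
  const words = text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  for (const word of words) {
    // Array.from splits by code point, so non-BMP characters stay whole.
    const chars = Array.from(word);
    for (let n = minN; n <= maxN; n++) {
      for (let i = 0; i + n <= chars.length; i++) {
        const gram = chars.slice(i, i + n).join("");
        counts.set(gram, (counts.get(gram) || 0) + 1);
      }
    }
  }
  return counts;
}

/**
 * Keep the most frequent n-grams. `coefficient` is an assumed per-language
 * multiplier on `baseBudget`, so some languages can store more grams than others.
 */
function topNgrams(counts: NgramCounts, baseBudget: number, coefficient: number): string[] {
  const budget = Math.max(0, Math.round(baseBudget * coefficient));
  return Array.from(counts.entries())
    // Sort by count (descending), then by gram (ascending) for a stable result.
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .slice(0, budget)
    .map((entry) => entry[0]);
}
```

With such a scheme, a language that is easy to identify (for example one with a unique script) could use a small coefficient, while a language that is often confused with its neighbours could use a larger one.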

New API

A few new APIs to get the list of supported languages and their names

import { supportedLanguages, langName, langRegion } from 'tinyld'

// all supported languages (ISO 639-3 codes)
supportedLanguages // ['jpn', 'cmn', ...]

// and a few utilities for languages
langName('jpn') // Japanese
langRegion('jpn') // east-asia
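
As an illustration of how these helpers could be combined, here is a small sketch that groups language codes by region. `groupByRegion` is a hypothetical helper, not part of tinyld. It takes the lookup function as a parameter so that the snippet runs on its own, and it assumes that `langRegion` returns a plain string, as the example above suggests.

```typescript
// Hypothetical helper, NOT part of tinyld: group language codes by region.
function groupByRegion(
  codes: readonly string[],
  regionOf: (code: string) => string
): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const code of codes) {
    const region = regionOf(code);
    if (!groups[region]) {
      groups[region] = [];
    }
    groups[region].push(code);
  }
  return groups;
}

// With tinyld installed, the call would presumably look like:
//   import { supportedLanguages, langRegion } from 'tinyld'
//   const byRegion = groupByRegion(supportedLanguages, langRegion)
```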

Language support

A few languages were disabled
A few languages were added
The total number of languages is now 64. The languages that were removed were dropped mostly because of poor accuracy (often due to an inadequate training dataset). I will try to bring them back as soon as possible, once their accuracy passes the 80% threshold.

Per-language detection accuracy

 - Greek (ell) - 100%
 - Hindi (hin) - 100%
 - Bengali (ben) - 100%
 - Thai (tha) - 100%
 - Telugu (tel) - 100%
 - Gujarati (guj) - 100%
 - Tamil (tam) - 100%
 - Amharic (amh) - 100%
 - Kannada (kan) - 100%
 - Burmese (mya) - 100%
 - Armenian (hye) - 99.9555%
 - Japanese (jpn) - 99.9333%
 - Vietnamese (vie) - 99.9067%
 - Korean (kor) - 99.8134%
 - Khmer (khm) - 99.7354%
 - Urdu (urd) - 99.2537%
 - Hebrew (heb) - 99.1068%
 - Berber (ber) - 99.0135%
 - German (deu) - 98.9601%
 - Toki Pona (toki) - 98.8801%
 - Russian (rus) - 98.8268%
 - Persian (pes) - 98.8135%
 - Polish (pol) - 98.8002%
 - Chinese (cmn) - 98.7602%
 - French (fra) - 98.7068%
 - Arabic (ara) - 98.4669%
 - Finnish (fin) - 98.0936%
 - English (eng) - 98.0136%
 - Yiddish (yid) - 97.9869%
 - Romanian (ron) - 97.9336%
 - Mongolian (mon) - 97.8058%
 - Lithuanian (lit) - 97.8003%
 - Icelandic (isl) - 97.7203%
 - Klingon (tlh) - 97.6803%
 - Hungarian (hun) - 97.5603%
 - Kazakh (kaz) - 97.4214%
 - Indonesian (ind) - 97.267%
 - Dutch (nld) - 96.8937%
 - Tatar (tat) - 96.8271%
 - Latvian (lvs) - 96.4734%
 - Tagalog (tgl) - 95.8539%
 - Ukrainian (ukr) - 95.4673%
 - Turkish (tur) - 95.214%
 - Portuguese (por) - 95.054%
 - Kirundi (run) - 94.6058%
 - Turkmen (tuk) - 94.5193%
 - Italian (ita) - 94.4541%
 - Belarusian (bel) - 94.2808%
 - Esperanto (epo) - 93.9475%
 - Spanish (spa) - 93.4009%
 - Volapuk (vol) - 92.6978%
 - Swedish (swe) - 91.9344%
 - Irish (gle) - 89.6735%
 - Latin (lat) - 89.0948%
 - Estonian (est) - 88.6921%
 - Czech (ces) - 88.5749%
 - Catalan (cat) - 88.0949%
 - Danish (dan) - 87.375%
 - Afrikaans (afr) - 86.578%
 - Bulgarian (bul) - 84.5754%
 - Slovak (slk) - 83.4555%
 - Serbian (srp) - 83.0823%
 - Macedonian (mkd) - 82.709%
 - Norwegian (nob) - 81.5358%
