
Release 1.2.0

@kefniark kefniark released this 05 Jan 15:49

Description

After a lot of unsuccessful experiments, I'm glad to have found a way to improve the accuracy and release it.
For the moment I decided to focus on accuracy over quantity, making sure the algorithm works properly before trying to scale it up.

With this version 1.2.0:

  • Both tinyld and tinyld-light are over 97% accurate on the 16 most common languages
  • tinyld's global accuracy across all 64 languages is over 95%, and each language has an accuracy above 80%
  • This change causes a small increase in disk size

Changes

Changes to the algorithm

  • Removed the word ranking step
  • Improved the n-gram ranking (based on a variable number of grams)
  • Added a per-language coefficient to specify more accurately how many n-grams to store per language (optimizes storage space)
  • Use 4-grams and 5-grams more often (as a replacement for words)
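
To make the idea of variable-size n-grams and a per-language storage budget more concrete, here is a minimal sketch. It is not tinyld's actual implementation: the function names `countGrams` and `topGrams`, the `baseBudget` parameter, and the way the coefficient scales the budget are all assumptions made for illustration.

```typescript
// Hypothetical sketch, not tinyld's real code.

// Count every n-gram of length minSize..maxSize found in a text.
function countGrams(text: string, minSize = 1, maxSize = 5): Map<string, number> {
  const counts = new Map<string, number>();
  const chars = Array.from(text.toLowerCase());
  for (let size = minSize; size <= maxSize; size++) {
    for (let i = 0; i + size <= chars.length; i++) {
      const gram = chars.slice(i, i + size).join('');
      counts.set(gram, (counts.get(gram) ?? 0) + 1);
    }
  }
  return counts;
}

// Keep only the most frequent grams. The number kept is the base budget
// scaled by a per-language coefficient (assumed to be a simple multiplier).
function topGrams(
  counts: Map<string, number>,
  baseBudget: number,
  coefficient = 1
): Array<[string, number]> {
  const budget = Math.max(1, Math.round(baseBudget * coefficient));
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .slice(0, budget);
}
```

With this sketch, a language that needs more n-grams to be told apart from its neighbours would simply get a coefficient above 1, while an easily recognised language could use a value below 1 to save space.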

New API

A few new APIs to get the list of supported languages and their names:

import { supportedLanguages, langName, langRegion } from 'tinyld'

// all supported languages (ISO3 format)
supportedLanguages // ['jpn', 'cmn', ...]

// and a few utilities for languages
langName('jpn') // Japanese
langRegion('jpn') // east-asia

Language support

  • A few languages were disabled
  • A few languages were added
  • The total number of languages is now 64. The removed ones were mostly dropped because of poor accuracy (often caused by an insufficient training dataset). I will try to bring them back as soon as possible, once their accuracy passes the 80% threshold.
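
The 80% rule described above can be pictured as a simple filter over measured accuracies. This is only an illustrative sketch: `enabledLanguages` and `ACCURACY_THRESHOLD` are hypothetical names, and the `xxx` entry is a made-up placeholder for a language below the threshold, not a real measurement.

```typescript
// Hypothetical sketch of the 80% accuracy gate; not part of tinyld's API.
const ACCURACY_THRESHOLD = 80;

// Return the language codes whose measured accuracy (in percent)
// is strictly above the threshold.
function enabledLanguages(
  accuracy: Record<string, number>,
  threshold = ACCURACY_THRESHOLD
): string[] {
  return Object.keys(accuracy).filter((code) => accuracy[code] > threshold);
}

// 'nob' and 'mkd' use figures from the accuracy list in this release;
// 'xxx' is a placeholder standing in for a disabled language.
const measured: Record<string, number> = { nob: 81.5358, mkd: 82.709, xxx: 75.0 };
const enabled = enabledLanguages(measured);
```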

Per language Detection Accuracy

 - Greek (ell) - 100%
 - Hindi (hin) - 100%
 - Bengali (ben) - 100%
 - Thai (tha) - 100%
 - Telugu (tel) - 100%
 - Gujarati (guj) - 100%
 - Tamil (tam) - 100%
 - Amharic (amh) - 100%
 - Kannada (kan) - 100%
 - Burmese (mya) - 100%
 - Armenian (hye) - 99.9555%
 - Japanese (jpn) - 99.9333%
 - Vietnamese (vie) - 99.9067%
 - Korean (kor) - 99.8134%
 - Khmer (khm) - 99.7354%
 - Urdu (urd) - 99.2537%
 - Hebrew (heb) - 99.1068%
 - Berber (ber) - 99.0135%
 - German (deu) - 98.9601%
 - Toki Pona (toki) - 98.8801%
 - Russian (rus) - 98.8268%
 - Persian (pes) - 98.8135%
 - Polish (pol) - 98.8002%
 - Chinese (cmn) - 98.7602%
 - French (fra) - 98.7068%
 - Arabic (ara) - 98.4669%
 - Finnish (fin) - 98.0936%
 - English (eng) - 98.0136%
 - Yiddish (yid) - 97.9869%
 - Romanian (ron) - 97.9336%
 - Mongolian (mon) - 97.8058%
 - Lithuanian (lit) - 97.8003%
 - Icelandic (isl) - 97.7203%
 - Klingon (tlh) - 97.6803%
 - Hungarian (hun) - 97.5603%
 - Kazakh (kaz) - 97.4214%
 - Indonesian (ind) - 97.267%
 - Dutch (nld) - 96.8937%
 - Tatar (tat) - 96.8271%
 - Latvian (lvs) - 96.4734%
 - Tagalog (tgl) - 95.8539%
 - Ukrainian (ukr) - 95.4673%
 - Turkish (tur) - 95.214%
 - Portuguese (por) - 95.054%
 - Kirundi (run) - 94.6058%
 - Turkmen (tuk) - 94.5193%
 - Italian (ita) - 94.4541%
 - Belarusian (bel) - 94.2808%
 - Esperanto (epo) - 93.9475%
 - Spanish (spa) - 93.4009%
 - Volapuk (vol) - 92.6978%
 - Swedish (swe) - 91.9344%
 - Irish (gle) - 89.6735%
 - Latin (lat) - 89.0948%
 - Estonian (est) - 88.6921%
 - Czech (ces) - 88.5749%
 - Catalan (cat) - 88.0949%
 - Danish (dan) - 87.375%
 - Afrikaans (afr) - 86.578%
 - Bulgarian (bul) - 84.5754%
 - Slovak (slk) - 83.4555%
 - Serbian (srp) - 83.0823%
 - Macedonian (mkd) - 82.709%
 - Norwegian (nob) - 81.5358%