gignu edited this page Mar 23, 2021 · 43 revisions

How it works

This wiki is meant to give you a short overview of how things work under the hood. It is not a detailed description but an approximation and simplification that does not aim to be 100% accurate.

Encoding Detection

Lorem Ipsum

Language Detection

To detect the language, files are read as either UTF-8 or ISO-8859-1, depending on whether UTF-8 has been detected previously.
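The choice between the two encodings can be sketched in Python as follows. This is only an illustration: the function names, the `utf8_detected` flag, and the use of `errors="replace"` are assumptions and are not taken from the actual implementation.

```python
from pathlib import Path


def decode_for_detection(raw: bytes, utf8_detected: bool) -> str:
    """Decode raw file bytes as UTF-8 if it was detected earlier, else as ISO-8859-1."""
    encoding = "utf-8" if utf8_detected else "iso-8859-1"
    # errors="replace" is an assumption: a stray invalid byte should not abort the scan.
    return raw.decode(encoding, errors="replace")


def read_for_detection(path: str, utf8_detected: bool) -> str:
    """Read a file from disk and decode it for the language scan."""
    return decode_for_detection(Path(path).read_bytes(), utf8_detected)
```

ISO-8859-1 maps every byte to a character, so the fallback branch never fails to decode, even though it may produce the wrong characters for files in other encodings.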

Files are scanned for specific words that are unique to one language. Each language has one to three of these words. For example, when a file contains the word "the", it is a strong indication that the language is English. If we find "the" 150 times and find no words that indicate other languages, we can be fairly sure that the file is written in English.
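A minimal sketch of such a scan is shown here. The `MARKERS` table is hypothetical and contains only the two words mentioned on this page. Whole-word, case-insensitive matching is also an assumption about how the real scan behaves.

```python
import re

# Hypothetical marker table; only "the" and "c'est" are named on this page.
MARKERS = {
    "English": ["the"],
    "French": ["c'est"],
}


def count_markers(text, markers=MARKERS):
    """Return the number of whole-word, case-insensitive marker hits per language."""
    counts = {}
    for language, words in markers.items():
        total = 0
        for word in words:
            # The lookarounds keep "the" from matching inside "other" or "then".
            pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
            total += len(re.findall(pattern, text, flags=re.IGNORECASE))
        counts[language] = total
    return counts
```

For example, `count_markers("The cat saw the other dog. C'est la vie.")` counts "The" and "the" for English but skips "other", and counts "C'est" once for French.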

The words for each language are carefully chosen so that they appear with a similar frequency. If "the" appears on average 150 times in an English text of 30000 characters, we need to make sure that "c'est", which is a strong indication of French, also appears about 150 times in a typical French text of 30000 characters.
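To compare candidate words across sample texts of different lengths, a raw count can be scaled to the 30000-character reference length. This page does not describe how the words were actually calibrated, so the helper below and its name are purely illustrative.

```python
def hits_per_30000_chars(hits, text_length, window=30000):
    """Scale a raw hit count to the 30000-character reference length."""
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    return hits * window / text_length
```

Under this scaling, 50 hits in a 10000-character sample and 150 hits in a 30000-character sample both give a rate of 150, so the two words would count as equally frequent.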

After counting the matches for each language, we assume that the language with the most matches is the language the file is written in. What we still need to determine is how likely this assumption is to be true. That's where the confidence score comes into play.
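Picking the language with the most matches can be sketched as below. The function name, the shape of the `counts` dictionary, and the choice to return `None` when nothing matched are assumptions made for illustration.

```python
def guess_language(counts):
    """Return the language with the most marker hits, or None if nothing matched."""
    if not counts or max(counts.values()) == 0:
        return None
    # On a tie, max() returns the first language in dictionary order.
    return max(counts, key=counts.get)
```

With `{"English": 150, "French": 3}` this returns `"English"`. How much trust that answer deserves is a separate question, which is what the confidence score addresses.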

Confidence Score

Lorem Ipsum
