Charset detection #81

aurelien-baudet · 2019-07-06T08:41:44Z

Add real implementations for charset detection.
Use charset detection every time getBytes is used.
Possible libraries or tools:

Apache TikaEncodingDetector / http://mvnrepository.com/artifact/org.apache.any23/apache-any23-encoding
https://www.turro.org/publications/detect-the-charset-in-java-strings
https://github.com/BYVoid/uchardet
juniversalchardet
ICU4J: http://site.icu-project.org/
https://github.com/codehaus/guessencoding
http://jchardet.sourceforge.net/

Also provide a way for users to override automatic guessing for a particular file.
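One possible shape for the override is sketched below. It is only an illustration: the class name `OverridableCharsetResolver`, its methods, and the idea of keying overrides by file path are assumptions, not an existing API of this project. The automatic detector is passed in as a plain function so that any of the libraries listed above could sit behind it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical wrapper: an explicit per-file override wins over automatic guessing.
class OverridableCharsetResolver {
    // Overrides keyed by file path (assumption: paths are used as identifiers).
    private final Map<String, Charset> overrides = new HashMap<>();
    // Any automatic detector, e.g. one backed by a detection library.
    private final Function<byte[], Charset> detector;

    OverridableCharsetResolver(Function<byte[], Charset> detector) {
        this.detector = detector;
    }

    // Register a charset that must be used for the given file.
    void override(String path, Charset charset) {
        overrides.put(path, charset);
    }

    // Return the forced charset if one is registered, otherwise ask the detector.
    Charset resolve(String path, byte[] content) {
        Charset forced = overrides.get(path);
        return forced != null ? forced : detector.apply(content);
    }

    public static void main(String[] args) {
        // Stand-in detector that always answers UTF-8, for demonstration only.
        OverridableCharsetResolver resolver =
                new OverridableCharsetResolver(bytes -> StandardCharsets.UTF_8);
        resolver.override("templates/legacy.txt", StandardCharsets.ISO_8859_1);
        System.out.println(resolver.resolve("templates/legacy.txt", new byte[0]));
        System.out.println(resolver.resolve("templates/other.txt", new byte[0]));
    }
}
```

Keeping the override lookup outside the detector means the detection library can be swapped without touching user configuration.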

Development branch is features/charset/detection

As far as I know, there is no general-purpose library in this area that suits every kind of problem. For each problem, you should test the existing libraries and select the one that best satisfies your constraints, but often none of them is appropriate. In these cases you can write your own encoding detector! As I have written ...

I’ve written a meta Java tool for detecting the charset encoding of HTML web pages, using IBM ICU4j and Mozilla JCharDet as built-in components. Here you can find my tool; please read the README section before anything else. You can also find some basic concepts of this problem in my paper and in its references.
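To give an idea of what a hand-written detector can look like, here is a very small sketch that uses only the JDK. It is not the tool mentioned above and does not use ICU4j or JCharDet: it simply checks whether the bytes form valid UTF-8 with a strict decoder and otherwise returns a caller-supplied fallback. The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical minimal guesser: strict UTF-8 validation with a fallback charset.
class SimpleCharsetGuesser {

    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Guess UTF-8 when the bytes are valid UTF-8, otherwise use the fallback.
    static Charset guess(byte[] data, Charset fallback) {
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : fallback;
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(utf8, StandardCharsets.ISO_8859_1));
        System.out.println(guess(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

Pure ASCII input is valid UTF-8 too, so this check alone cannot tell the single-byte encodings apart; that part still needs a statistical detector or a user-provided hint.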

Below are some helpful comments based on what I’ve experienced in my work:

Charset detection is not a foolproof process, because it is essentially based on statistical data; what actually happens is guessing, not detecting

icu4j, by IBM, is the main tool in this context, imho

Both TikaEncodingDetector and Lucene-ICU4j use icu4j, and in my tests their accuracy did not differ meaningfully from that of icu4j itself (by at most 1%, as I remember)

icu4j is much more general than jchardet; icu4j is only slightly biased toward IBM-family encodings, while jchardet is strongly biased toward UTF-8

Due to the widespread use of UTF-8 in the HTML world, jchardet is overall a better choice than icu4j, but it is not the best choice!

icu4j is great for East Asian encodings such as EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family of encodings

Both icu4j and jchardet perform very poorly on HTML pages encoded in Windows-1251 and Windows-1256. Windows-1251, aka cp1251, is widely used for Cyrillic-based languages like Russian, and Windows-1256, aka cp1256, is widely used for Arabic

Almost all encoding detection tools use statistical methods, so the accuracy of the output depends strongly on the size and the content of the input

Some encodings are essentially the same apart from a few differences, so in some cases the guessed encoding may be nominally wrong and yet still decode the text correctly, as with Windows-1252 and ISO-8859-1 (refer to the last paragraph of section 5.2 of my paper)
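The last point can be checked with the JDK alone. Windows-1252 and ISO-8859-1 differ only in the byte range 0x80-0x9F, so any input that avoids those bytes decodes to the same string under both charsets. The sketch below (the class name is made up for illustration) compares the two decodings of a byte array.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration: a "wrong" guess between these two charsets is often harmless.
class Latin1VersusCp1252 {

    // True if the bytes decode to the same text under both charsets.
    static boolean decodesIdentically(byte[] data) {
        Charset cp1252 = Charset.forName("windows-1252");
        String asLatin1 = new String(data, StandardCharsets.ISO_8859_1);
        String asCp1252 = new String(data, cp1252);
        return asLatin1.equals(asCp1252);
    }

    public static void main(String[] args) {
        // 0xE9 is 'é' in both charsets.
        byte[] shared = {0x63, 0x61, 0x66, (byte) 0xE9};
        // 0x80 is the euro sign in Windows-1252 but a control character in ISO-8859-1.
        byte[] differing = {(byte) 0x80};
        System.out.println(decodesIdentically(shared));
        System.out.println(decodesIdentically(differing));
    }
}
```

When the comparison returns true for a given input, it does not matter which of the two names the detector reported.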

aurelien-baudet added enhancement P2 labels Jul 6, 2019

aurelien-baudet added this to the Release v3.0.0 milestone Jul 6, 2019

aurelien-baudet removed this from the Release v3.0.0 milestone Feb 8, 2020
