You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Also provide a way for users to override automatic guessing for particular file.
Development branch is features/charset/detection
As far as I know, there is no general library in this context to be suitable for all types of problems. So, for each problem you should test the existing libraries and select the best one which satisfies your problem’s constraints, but often none of them is appropriate. In these cases you can write your own Encoding Detector! As I have wrote ...
I’ve wrote a meta java tool for detecting charset encoding of HTML Web pages, using IBM ICU4j and Mozilla JCharDet as the built-in components. Here you can find my tool, please read the README section before anything else. Also, you can find some basic concepts of this problem in my paper and in its references.
Bellow I provided some helpful comments which I’ve experienced in my work:
Charset detection is not a foolproof process, because it is essentially based on statistical data and what actually happens is guessing not detecting
icu4j is the main tool in this context by IBM, imho
Both TikaEncodingDetector and Lucene-ICU4j are using icu4j and their accuracy had not a meaningful difference from which the icu4j in my tests (at most %1, as I remember)
icu4j is much more general than jchardet, icu4j is just a bit biased to IBM family encodings while jchardet is strongly biased to utf-8
Due to the widespread use of UTF-8 in HTML-world; jchardet is a better choice than icu4j in overall, but is not the best choice!
icu4j is great for East Asian specific encodings like EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family encodings
Both icu4j and jchardet are debacle in dealing with HTML pages with Windows-1251 and Windows-1256 encodings. Windows-1251 aka cp1251 is widely used for Cyrillic-based languages like Russian and Windows-1256 aka cp1256 is widely used for Arabic
Almost all encoding detection tools are using statistical methods, so the accuracy of output strongly depends on the size and the contents of the input
Some encodings are essentially the same just with a partial differences, so in some cases the guessed or detected encoding may be false but at the same time be true! As about Windows-1252 and ISO-8859-1. (refer to the last paragraph under the 5.2 section of my paper)
The text was updated successfully, but these errors were encountered:
Add real implementations for charset detection.
Use charset detection everytime getBytes is used.
Possible libraries or tools:
Also provide a way for users to override automatic guessing for particular file.
Development branch is
features/charset/detection
The text was updated successfully, but these errors were encountered: