EncodingDetect #1
Comments
Do you have any suggestions for how to detect the correct encoding? Pull requests are welcome. As a proposal, I could check the HTML header and use the charset declared there.
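To make the "check the HTML header" proposal concrete, here is a minimal sketch in plain Java. It is not Readability4J code: the class name `CharsetSniffer`, the method `sniff`, and the 4096-byte scan limit are assumptions made for illustration. It scans the start of the raw bytes for a `<meta>` charset declaration and returns a fallback when none is found or the name is unknown to the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Readability4J.
public class CharsetSniffer {
    // Covers both <meta charset="gbk"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    public static Charset sniff(byte[] html, Charset fallback) {
        // Assumption: the declaration sits within the first 4096 bytes.
        int len = Math.min(html.length, 4096);
        // ISO-8859-1 maps every byte to one char, so the ASCII markup stays
        // readable whatever the real encoding of the page is.
        String head = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return fallback;
    }
}
```

The returned charset would then be used to turn the downloaded bytes into a `String` before the HTML is handed to the extractor. A charset given in the HTTP `Content-Type` response header, if present, would normally be checked first; that part is left out here.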
Could you provide me with some test data: the source HTML, the actual output, and the expected output? I checked some sites like http://www.sina.com.cn and http://www.huanqiu.com, and both declare their charset as utf-8. For example, for http://news.sina.com.cn/gov/xlxw/2018-07-17/doc-ihfkffam3728018.shtml Readability4J generates this output: https://dankito.net/test/sina-output.html. Maybe you only have to wrap the output in
Does it then work as you expect?
The website I subscribe to is: http://www.shgjj.com/html/zyxw/index.html.
Sorry for keeping you waiting so long! I just tried it with this URL. I just wrapped the Readability4J output in
(not charset="GBK" as suggested in my last post) so that a browser shows the characters correctly. As I don't understand Chinese that well, what would you say: is the output OK?
Thank you, this output is normal!
This is great! But could you improve the automatic detection of a webpage's charset? If the encoding is GB2312 or GBK, it causes errors.
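One possible way to handle GB2312 or GBK pages that declare no charset (or the wrong one) is a trial decode: try strict UTF-8 first and only fall back to a Chinese charset when the bytes are not valid UTF-8. The sketch below is an assumption about how this could be done, not what Readability4J does; `FallbackDecoder` and `decode` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Readability4J.
public class FallbackDecoder {
    public static String decode(byte[] body, Charset fallback) {
        try {
            // REPORT makes the decoder throw on invalid UTF-8 instead of
            // silently inserting replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(body))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: decode with the caller-supplied fallback.
            return new String(body, fallback);
        }
    }
}
```

Called as `FallbackDecoder.decode(bytes, Charset.forName("GBK"))`, this covers GB2312 pages as well, since GBK is a superset of GB2312. It is only a heuristic: a very short GBK text can happen to be valid UTF-8, so an explicit charset declaration should take priority when one is available.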