EncodingDetect #1

wizos · 2018-07-06T14:09:21Z

This is great! But could you improve the automatic detection of a webpage's charset? If the encoding is GB2312 or GBK, it causes errors.

dankito · 2018-07-11T21:27:26Z

Do you have any suggestions on how to detect the correct encoding? Pull requests are welcome.

As a proposal I could check the HTML header and use that one.
But give me some time, I moved recently and in my new apartment there's still a lot to do.
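
A minimal sketch of the "check the HTML header" idea, assuming the raw page bytes are available before they are decoded. The class and method names (`CharsetSniffer`, `sniff`, `decode`) are hypothetical and not part of Readability4J. It only looks at `<meta>` declarations in the first 4096 bytes, ignores the HTTP `Content-Type` header and byte-order marks, and will not work for encodings that are not ASCII-compatible (such as UTF-16).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Readability4J.
class CharsetSniffer {

    // Matches charset=... inside a <meta> tag. Covers both
    //   <meta charset="GBK">
    //   <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in the page head, or the fallback
    // if none is declared or the declared name is not supported.
    static Charset sniff(byte[] html, Charset fallback) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup
        // can be searched without knowing the real encoding yet.
        int limit = Math.min(html.length, 4096);
        String head = new String(html, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }

    // Decodes the page with the declared charset, defaulting to UTF-8.
    static String decode(byte[] html) {
        return new String(html, sniff(html, StandardCharsets.UTF_8));
    }
}
```

The decoded string could then be handed to the extractor as usual. Whether a given JVM supports GBK or GB2312 depends on the installed charsets, so the fallback path matters.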

dankito · 2018-07-17T22:51:07Z

Could you provide me with some test data: the source HTML, the actual output and the expected output?

I checked some sites like http://www.sina.com.cn and http://www.huanqiu.com, and both declare their charset as utf-8.

For example for http://news.sina.com.cn/gov/xlxw/2018-07-17/doc-ihfkffam3728018.shtml Readability4J generates this output: https://dankito.net/test/sina-output.html.

Maybe you only have to wrap the output in

<html>
 <head>
  <meta charset="GBK" /> 
 </head>
 <body>
 <!-- output here -->
 </body>
</html>

Does it then work as you expect?

wizos · 2018-07-18T03:17:26Z

The website I subscribe to is: http://www.shgjj.com/html/zyxw/index.html.
Its output charset is .

dankito · 2018-07-22T21:00:59Z

Sorry for keeping you waiting so long!

I just tried it with this URL
http://www.shgjj.com/html/zyxw/101770.html
and it produced this output
https://dankito.net/test/shgjj-output.html.

I just wrapped the Readability4J output in

<html>
 <head>
  <meta charset="utf-8" /> 
 </head>
 <body>
 <!-- output here -->
 </body>
</html>

(not charset="GBK" as suggested in my last post) so that a browser shows the characters correctly.

As I don't understand Chinese that well, what would you say, is the output OK?
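
The wrapping described in this comment can be sketched roughly as follows. `HtmlWrapper`, `wrap` and `toUtf8Bytes` are hypothetical names for illustration; this is not the code from the library. The sketch assumes the extracted article is already a decoded Java `String`, in which case the declared charset only has to match the encoding used when the string is written out again, which would explain why utf-8 rather than GBK is the right declaration here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Readability4J.
class HtmlWrapper {

    // Wraps extracted article markup in a minimal document that
    // declares its charset, so a browser does not have to guess.
    static String wrap(String articleHtml, String charset) {
        return "<html>\n"
             + " <head>\n"
             + "  <meta charset=\"" + charset + "\" />\n"
             + " </head>\n"
             + " <body>\n"
             + articleHtml + "\n"
             + " </body>\n"
             + "</html>";
    }

    // Encodes the wrapped page with the same charset that the
    // <meta> tag declares, keeping declaration and bytes consistent.
    static byte[] toUtf8Bytes(String articleHtml) {
        return wrap(articleHtml, "utf-8").getBytes(StandardCharsets.UTF_8);
    }
}
```

Declaring one charset while writing the bytes in another would reintroduce garbled characters, so the two should always be chosen together.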

wizos · 2018-07-23T01:28:46Z

Thank you, this output is normal!

dankito closed this as completed Jul 23, 2018

dankito mentioned this issue Jul 23, 2018

Can't get content from zhihu.com #2

Closed

dankito added a commit that referenced this issue Aug 12, 2018

Implemented wrapping content in <html> structure so set encoding, see…

8d4b042

… Issue 1 and 2 (#1).

dankito added the v1.0.1 label Aug 12, 2018
