Can't get content from zhihu.com #2

wudizhuo · 2018-07-23T07:12:42Z

Hi, Readability4J is a nice library. I found a website that doesn't work with Readability4J; please check it, thank you.

https://zhuanlan.zhihu.com/p/22049205

Readability4J can't get content from this URL, but Mozilla's Readability.js works with it. Please check this, thank you.

wudizhuo · 2018-07-23T07:13:10Z

Please let me know if you need further information.

dankito · 2018-07-23T17:30:29Z

This may be the same issue as #1. Did you check whether wrapping the output in

<html>
 <head>
  <meta charset="utf-8" /> 
 </head>
 <body>
 <!-- output here -->
 </body>
</html>

solves the issue?

Wrapping it in the above HTML code produced this output for me: https://dankito.net/test/zhihu-output.html.
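For illustration, the wrapping described above could be sketched in Java roughly as follows. This is only a minimal sketch: the class `HtmlWrapperSketch` and the helper `wrapInHtml` are hypothetical names and not part of Readability4J, and the sample string merely stands in for whatever Article.getContent() returns.

```java
final class HtmlWrapperSketch {

    // Hypothetical helper (not part of Readability4J): wraps an extracted
    // content fragment in a full HTML document that declares UTF-8.
    static String wrapInHtml(String content) {
        return "<html>\n"
            + " <head>\n"
            + "  <meta charset=\"utf-8\" />\n"
            + " </head>\n"
            + " <body>\n"
            + content + "\n"
            + " </body>\n"
            + "</html>";
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by Article.getContent();
        // the escapes are two Chinese characters, to exercise non-ASCII text.
        String extracted = "<div><p>\u77e5\u4e4e</p></div>";
        System.out.println(wrapInHtml(extracted));
    }
}
```

Without the charset declaration, a browser or WebView may guess the wrong encoding for the bare fragment, which would explain garbled non-ASCII text.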

The reason why Article.getContent() returns content in a <div> and not in <html> is that Readability.js does the same.
In the future I may add a method getContentWrappedInHtmlBody() to Article to make it clearer.

Another issue is that the images aren't displayed.
The reason is that the site you mentioned sets the real image URL in the data-original attribute and not in src.
I pushed a commit for Readability4JExtended that fixes that.
Readability4JExtended then produces this output: https://dankito.net/test/zhihu-output-extended.html.

You can use it in this way:

Readability4JExtended readabilityExtended = new Readability4JExtended(/* ... */);
Article article = readabilityExtended.parse();
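
To illustrate the idea behind the data-original fix mentioned above (not the actual commit, which is not shown in this thread), here is a rough, self-contained sketch that copies each image's data-original value into its src attribute. The class `LazyImageSketch` and the method `useDataOriginalAsSrc` are hypothetical names, and the regular expressions assume double-quoted attribute values; a robust implementation would work on a parsed DOM instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyImageSketch {

    // Matches a whole <img ...> tag.
    private static final Pattern IMG =
        Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Captures the double-quoted value of a data-original attribute.
    private static final Pattern DATA_ORIGINAL =
        Pattern.compile("\\bdata-original\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // Matches a src attribute, but not e.g. data-src.
    private static final Pattern SRC =
        Pattern.compile("(?<![\\w-])src\\s*=\\s*\"[^\"]*\"", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: for every <img> carrying data-original,
    // set (or add) src so that it points at the real image URL.
    static String useDataOriginalAsSrc(String html) {
        Matcher img = IMG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (img.find()) {
            String tag = img.group();
            Matcher original = DATA_ORIGINAL.matcher(tag);
            if (original.find()) {
                String srcAttr = "src=\"" + original.group(1) + "\"";
                Matcher src = SRC.matcher(tag);
                if (src.find()) {
                    // Replace the placeholder src with the real URL.
                    tag = src.replaceFirst(Matcher.quoteReplacement(srcAttr));
                } else {
                    // No src at all: add one right after "<img".
                    tag = tag.replaceFirst("(?i)<img",
                        Matcher.quoteReplacement("<img " + srcAttr));
                }
            }
            img.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        img.appendTail(out);
        return out.toString();
    }
}
```

Images without a data-original attribute are left untouched, so the sketch is safe to run over content that mixes lazy-loaded and regular images.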

Do you think the above output is OK and solves the issue?

wudizhuo · 2018-07-30T00:49:22Z

Thanks for your reply. I think the Mozilla library's output does include the tag, but your explanation is very helpful. Thanks, I'll close the issue.

Looking forward to the getContentWrappedInHtmlBody method ^^

dankito · 2018-08-12T20:05:57Z

Just released version 1.0.1.

Article now has the method article.getContentWithUtf8Encoding(), which returns the content wrapped in an HTML body with the encoding set to UTF-8.

wudizhuo closed this as completed Jul 30, 2018

dankito added the v1.0.1 label Aug 12, 2018

dankito mentioned this issue Jun 27, 2021

[Bug] Characters like äüö are output incorrectly #19

Open
