Can't get content from zhihu.com #2

Closed
wudizhuo opened this issue Jul 23, 2018 · 4 comments

@wudizhuo

Hi, Readability4J is a nice library, but I found a website that doesn't work with it. Please take a look, thank you.

https://zhuanlan.zhihu.com/p/22049205

Readability4J can't get the content from this URL, but Mozilla's Readability.js can. Please check this, thank you.

@wudizhuo
Author

Please let me know if you need further information.

@dankito
Owner

dankito commented Jul 23, 2018

This may be the same issue as #1. Did you check whether wrapping the output in

<html>
 <head>
  <meta charset="utf-8" /> 
 </head>
 <body>
 <!-- output here -->
 </body>
</html>

solves the issue?

Wrapping it in the above HTML produced this output for me: https://dankito.net/test/zhihu-output.html.

Article.getContent() returns the content in a <div> rather than in a full <html> document because Readability.js does the same.
In the future I may add a method getContentWrappedInHtmlBody() to Article to make this clearer.
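
For illustration, the wrapping described above could be sketched like this. It is only a sketch, assuming the extracted content is available as a String (for example the result of Article.getContent()). The names HtmlWrapper and wrapInHtmlBody are made up for this example and are not part of Readability4J.

```java
/**
 * Hypothetical helper (not part of Readability4J): wraps an extracted
 * HTML fragment in a minimal document that declares UTF-8.
 */
final class HtmlWrapper {

    private HtmlWrapper() { }

    /** Returns the fragment placed inside <html><head>...</head><body>...</body></html>. */
    static String wrapInHtmlBody(String contentFragment) {
        return "<html>\n"
             + " <head>\n"
             + "  <meta charset=\"utf-8\" />\n"   // tells the browser how to decode the text
             + " </head>\n"
             + " <body>\n"
             + contentFragment + "\n"
             + " </body>\n"
             + "</html>\n";
    }
}
```

Calling HtmlWrapper.wrapInHtmlBody(article.getContent()) would then give a complete document. When saving it to a file, the bytes should also be written as UTF-8, otherwise the meta tag and the actual encoding disagree.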

Another issue is that the images aren't displayed.
The reason is that the site you mentioned puts the real image URL in the data-original attribute instead of src.
I pushed a commit to Readability4JExtended that fixes that.
Readability4JExtended then produces this output: https://dankito.net/test/zhihu-output-extended.html.
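
To make the idea concrete, here is a simplified illustration of copying data-original into src. This is not the commit mentioned above and not how Readability4JExtended is implemented; it is a regex-based stand-in that assumes double-quoted attribute values, and the names LazyImageFixer and useDataOriginalAsSrc are hypothetical. A real implementation would more likely work on a parsed DOM.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: points <img> src at the URL stored in data-original. */
final class LazyImageFixer {

    // Matches a whole <img ...> tag.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Captures the value of a double-quoted data-original attribute.
    private static final Pattern DATA_ORIGINAL =
            Pattern.compile("\\sdata-original\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // Matches an existing double-quoted src attribute, including leading whitespace.
    private static final Pattern SRC =
            Pattern.compile("\\s+src\\s*=\\s*\"[^\"]*\"", Pattern.CASE_INSENSITIVE);

    private LazyImageFixer() { }

    static String useDataOriginalAsSrc(String html) {
        Matcher imgMatcher = IMG_TAG.matcher(html);
        StringBuffer result = new StringBuffer();
        while (imgMatcher.find()) {
            String tag = imgMatcher.group();
            Matcher original = DATA_ORIGINAL.matcher(tag);
            if (original.find() && !original.group(1).isEmpty()) {
                String realUrl = original.group(1);
                // Drop any placeholder src, then add a src that points to the real image.
                String withoutSrc = SRC.matcher(tag).replaceAll("");
                tag = withoutSrc.replaceFirst("(?i)<img",
                        Matcher.quoteReplacement("<img src=\"" + realUrl + "\""));
            }
            // Tags without data-original are appended unchanged.
            imgMatcher.appendReplacement(result, Matcher.quoteReplacement(tag));
        }
        imgMatcher.appendTail(result);
        return result.toString();
    }
}
```

Images that have no data-original attribute are left untouched, so running this over already correct markup should not change it.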

You can use it like this:

Readability4JExtended readabilityExtended = new Readability4JExtended(/* ... */);
Article article = readabilityExtended.parse();

Do you think the above output is OK and solves the issue?

@wudizhuo
Author

wudizhuo commented Jul 30, 2018

Thanks for your reply. I think the Mozilla library's output does come with the tag, but your explanation is very helpful, so I'll close the issue.

Looking forward to the getContentWrappedInHtmlBody method ^^

@dankito
Owner

dankito commented Aug 12, 2018

I just released version 1.0.1.

Article now has the method article.getContentWithUtf8Encoding(), which returns the content wrapped in an HTML body with the encoding set to UTF-8.
