-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
decodeEntities false code tag #1198
Comments
It seems that you need to set it to true.
|
@Jasonlhy if i set true, the chinese will be decoded, eg: const $ = cheerio.load(`<p>哈哈哈哈</p><code><script src></code>`, {
decodeEntities: true
});
console.log($.html()); output: <html><head></head><body><p>哈哈哈哈</p><code><script src></code></body></html> |
After a little bit of research and testing. Parser5 (The HTML parser) does not support the option: Both cases will print out 哈哈哈哈.
I think the reason of this problem is the serializer, it is located at https://github.com/cheeriojs/dom-serializer/blob/master/index.js It seems to change all decoded entities (Including Chinese text) back to encoded one if
|
This is the biggest problem of the project |
I had documents containing html like I kept diff --git a/node_modules/dom-serializer/index.js b/node_modules/dom-serializer/index.js
index 4f41e8f..8f9f14f 100644
--- a/node_modules/dom-serializer/index.js
+++ b/node_modules/dom-serializer/index.js
@@ -136,6 +136,8 @@ function renderText(elem, opts) {
data = entities.encodeXML(data);
}
+ data = data.replace(/</g, '<').replace(/>/g, '>')
+
return data;
} Since that only affects text nodes, it should not mess with anything, but caveat emptor. I wonder if the clean solution here is replacing that |
This issue still exists now and I am crazy because I am building a static site generator with cheerio. This problem behaves in two ways:
Can anybody locate the problem and fix it? Thanks a lot. |
After reading codes for 3 hours I believe I find where problem begins. @bard I think your patch is not correct, not only innerHTMLs but attrs also has this problem. The problem is because cheerio is switching from If we use
But whether
We'd better not modify It's a little bit long because it's complex, but please read and fix it, we cannot always lock version to |
Based on jsdom/jsdom#2501 and inikulin/parse5#261 (comment) it's unlikely that this will be handled in The second thread suggests that it's possible to grab the raw, undecoded text from the input data based on node position, though that would require modifying dom-serializer, too. |
@bard |
due to inikulin/parse5#75 we know Edit: I got a better idea, we just need to remove entities re-encode in Edit: However, this will break compatibility, needs an version bump... Edit: PR is here cheeriojs/dom-serializer#80 |
Thanks @AlynxZhou. Great find. Your 1st Solution worked for me, but you are right that we can't be permanently be bound to version 0.22.0. |
I send a PR to dom-serializer to fix 1.0.0, however no one replied...
…On Mon, Jun 3, 2019, 08:55 Iqbal Ahmed ***@***.***> wrote:
Thanks @AlynxZhou <https://github.com/AlynxZhou>. Great find. Your 1st
Solution worked for me, but you are right that we can't be permanently be
bound to version 0.22.0.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1198?email_source=notifications&email_token=ACANVR42P4FOIMW62TLNKKDPYRTWVA5CNFSM4E7WNJK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWYBZXQ#issuecomment-498080990>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ACANVR7VPU5TRUT4COB57HDPYRTWVANCNFSM4E7WNJKQ>
.
|
@fb55 @jugglinmike this seems to be causing some folks pain. If i recall, we were going to continue supporting both |
This is because parse5 ignores decodeEntity, I modified dom-serializer to fix this, but it needs a version bump. |
I wonder what breaks exactly with decoded entities. The only thing that comes to my mind is that resulting pages are served in encoding other than UTF-8. Otherwise, nowadays there is no reason to prefer HTML entities (with an exception for safety-related HTML escaping) over code points. |
@inikulin The issue is serialization. Some people have entities as plain text in their HTML or want to maintain the encoding style they've used. |
@fb55 But for people who don't want HTML entities in their files, cheerio 1.0.0 is totally broken because it cannot handle things correctly (decodeEntities false, I hope you can consider my PR cheeriojs/dom-serializer#80, I am using cheerio in my tools and I cannot always stick to 0.22.0. |
@AlynxZhou this workaround should help. I've been using it in production since April with no ill effects. |
@claviska Thank you. But what I want is not a workaround, I know I can solve it by a simple hook. However, the problem is a bug in fact, it is because new library's behavior is different from the old one, and it needs to be fixed by remove unsupported option, it's complex and needs a change of logic, workaround can only hide the problem, but the problem still exists and may break other part. Anyway thank you for your help. |
@AlynxZhou I'd like to see it fixed too, but this is open source so our options are to:
What I've proposed can be used as a separate module that you can easily remove once the bug is properly fixed. If you want to use the latest version without this encoding issue, it's probably your best bet. Cheers |
@claviska I have send a PR to dom-serializer which removed decodeEntities because parser5 always do decoding and this options is useless. However, still no one replied. |
To unblock, you can use the old parser: const cheerio = require('cheerio')
const $ = cheerio.load(`<code><script src></code>`, {
_useHtmlParser2: true,
decodeEntities: false,
})
console.log($.html())
// <code><script src></code> This is undocumented and may be removed when we go to 1.0.0. However, since I'm telling people they can use it, we'll probably need to support it. I hope this is okay, @jugglinmike @fb55. @AlynxZhou thanks for taking the time to share your findings. Unfortunately, it's not easy for me to tell if the snippet above solves your problem or not. Can you please reframe your problem in the following way:
This will help us debug your issue. You can refer to the original poster for a decent example of this. |
@matthewmueller No need to check via code, I can tell you this can definitely solve problem, however, as you said, Such snippet is just let 1.0.0 works like 0.22.0, to solve problem we need to consider differences between parse5 and htmlparser2, and I need to say again, encode all things into XML entities is a BAD option and we may consider to drop it because parse5 always decode input. |
It's highly dependent on your use case, which is why we need to provide an option to switch. parse5 is a better supported parser at this point and I believe it was @fb55's idea to switch to it over his own parser. FYI, for the final public API I'd probably try and have a parser interface like this: const parser = {
parse(code: string): AST
}
cheerio.load("...", { parser: parser }) What do you think @jugglinmike @fb55 ? |
@matthewmueller Sorry, I am reading code again now and find that the problem has been fixed by using parse5's serializer instead of dom-serializer, so I am trying to test it, I am REALLY happy to see that it may be fixed! UPDATE: I am sure this is fixed in const cheerio = require('cheerio')
const $ = cheerio.load('<p><哈哈哈</p>')
$('body').html() // output -> '<p><哈哈哈</p>'
( Line 106 in c635bea
|
So, #1307 fixed this issue? |
I think it was fixed before this, you may do a test. |
@matthewmueller Any plan for releasing 1.0.0? Or could you please consider to release another rc version to fix this problem? We really want 1.0.0 branch because it fixed this bug. Thanks! |
I can confirm that the problem is fixed on the 1.0.0 branch 👍 I installed it this way:
This is not ideal and hope that a new RC will soon be released. |
This fixes a bug (cheeriojs/cheerio#1198) that decodes entities
is be fixed ? I also have this problem
emmm , it's ok
|
two years past and they still don't update version on npm, so you will still face this problem if you install 1.0.0-rc3 from npm. You need to install 1.0.0 branch from github. |
Related: - cheeriojs/cheerio#1198 - cheeriojs/cheerio#1219
|
looks like for xhtml output + cjk charaters, import { load } from 'cheerio' // 1.0.0-rc.12
const html = `
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Document</title>
</head>
<body>
<br />
<pre>
// 哈哈哈
// swift generic
var a5 = MemoryLayout<UInt>.size
</pre>
</body>
</html>
`
console.log('-'.repeat(20))
{
const $ = load(html, {})
const y = $.xml().trim()
console.log(y)
// 哈哈哈 <--------- wrong, so `decodeEntities` is needed
// swift generic
// var a5 = MemoryLayout<UInt>.size
}
console.log('-'.repeat(20))
{
const $ = load(html, {
decodeEntities: false,
})
const y = $.xml().trim()
console.log(y)
// 哈哈哈 <--------- correct
// swift generic
// var a5 = MemoryLayout<UInt>.size <--------- wrong, it's code in a `<pre>` tag
}
console.log('-'.repeat(20))
{
const $ = load(html, {
// @ts-ignore
_useHtmlParser2: true,
decodeEntities: false,
})
const y = $.xml().trim()
console.log(y)
// 哈哈哈 <--------- correct
// swift generic
// var a5 = MemoryLayout<UInt>.size <--------- correct
} |
output
you can see that the string
<script src>
has been unescapedThe text was updated successfully, but these errors were encountered: