Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

decodeEntities false code tag #1198

Closed
huanz opened this issue May 14, 2018 · 35 comments
Closed

decodeEntities false code tag #1198

huanz opened this issue May 14, 2018 · 35 comments

Comments

@huanz
Copy link

huanz commented May 14, 2018

const $ = cheerio.load(`<code>&lt;script src&gt;</code>`, {
  decodeEntities: false
});
console.log($.html());

output

<html><head></head><body><code><script src></code></body></html>

you can see that the string &lt;script src&gt; has been unescaped

@huanz huanz changed the title decodeEntities false code tag decodeEntities false code tag May 14, 2018
@Jasonlhy
Copy link

It seems that you need to set it to true.

$ = cheerio.load(`<code>&lt;script src&gt;=></code>`, {   decodeEntities: true }); console.log($.html());
<html><head></head><body><code>&lt;script src&gt;=&gt;</code></body></html>

@huanz
Copy link
Author

huanz commented May 29, 2018

@Jasonlhy if i set true, the chinese will be decoded, eg:

const $ = cheerio.load(`<p>哈哈哈哈</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($.html());

output:

<html><head></head><body><p>&#x54C8;&#x54C8;&#x54C8;&#x54C8;</p><code>&lt;script src&gt;</code></body></html>

@Jasonlhy
Copy link

Jasonlhy commented May 29, 2018

After a little bit of research and testing. Parser5 (The HTML parser) does not support the option: decodeEntities now, the result from parser5 is default decoded (unescaped).

Both cases will print out 哈哈哈哈.

const $ = cheerio.load(`<p>哈哈哈哈</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($('p').text());
const $ = cheerio.load(`<p>哈哈哈哈</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($('p').text());

I think the reason of this problem is the serializer, it is located at https://github.com/cheeriojs/dom-serializer/blob/master/index.js

It seems to change all decoded entities (Including Chinese text) back to encoded one if decodeEntities set to true

function renderText(elem, opts) {
  var data = elem.data || '';

  // if entities weren't decoded, no need to encode them back
  if (opts.decodeEntities && !(elem.parent && elem.parent.name in unencodedElements)) {
    data = entities.encodeXML(data);
  }

  return data;
}

@senwzz
Copy link

senwzz commented Aug 7, 2018

This is the biggest problem of the project

@bard
Copy link

bard commented Apr 10, 2019

I had documents containing html like <pre>&lt;span&gt;a = 1&lt;/span&gt;</pre> (i.e. code samples of html itself) and wanted to process them with cheerio before displaying sending them to the browser. decodeEntities: false obviously was not working due to the problem reported in this issue, and decodeEntities: true was encoding far more than < and >.

I kept decodeEntities: false and used https://www.npmjs.com/package/patch-package to generate and automatically apply this patch to dom-serializer upon every install:

diff --git a/node_modules/dom-serializer/index.js b/node_modules/dom-serializer/index.js
index 4f41e8f..8f9f14f 100644
--- a/node_modules/dom-serializer/index.js
+++ b/node_modules/dom-serializer/index.js
@@ -136,6 +136,8 @@ function renderText(elem, opts) {
     data = entities.encodeXML(data);
   }
 
+  data = data.replace(/</g, '&lt;').replace(/>/g, '&gt;')
+  
   return data;
 }

Since that only affects text nodes, it should not mess with anything, but caveat emptor.

I wonder if the clean solution here is replacing that encodeXML with something more context-aware, i.e. encode depending on the document type and charset.

@AlynxZhou
Copy link

AlynxZhou commented May 19, 2019

This issue still exists now and I am crazy because I am building a static site generator with cheerio. This problem behaves in two ways:

  • First I am using marked.js and it will encode symbols like < in code blocks (<pre></pre>) into &lt, that's what expected. However, if I set decodeEntities: false, cheerio will replace &lt back with <, which is ridiculous!!! Why parser decode &lt into <? I think it should not change this.
  • But if I set decodeEntities: true, all CJK chars will be replaced into HTML entities, which is hard to deal with because some JavaScript codes needs to get string length, HTML entities is OK to render but not easy to process.

Can anybody locate the problem and fix it? Thanks a lot.

@AlynxZhou
Copy link

After reading codes for 3 hours I believe I find where problem begins.

@bard I think your patch is not correct, not only innerHTMLs but attrs also has this problem.

The problem is because cheerio is switching from htmlparser2 to parse5, and decodeEntities is an option to htmlparser2, but not an option to parse5, in version 1.0.0-rc3, they did not pass this option to parse5 too.

If we use htmlparser2 with decodeEntities: false, all things works correctly. However if you install cheerio with npm i -s cheerio, you got 1.0.0-rc3, which uses parse5.

parse5 will parse &lt; into < when it reading input, while htmparser2 with decodeEntities: false won't. And when we get result with .html(), cheerio will use dom-serializer to render strings.

dom-serializer supposes parser understanding decodeEntities, suppose we have &lt; as input, if decodeEntities is true, dom-serializer should get &, else it should get <. When we are using htmlparser2, it works. With decodeEntities: false, it looks like &lt;(input) -> &lt;(parsed) -> &lt;(before and after serialized). With decodeEntities: true, it looks like &lt;(input) -> <(parsed) -> <(before serialized) -> &lt; (after serialized).

But whether decodeEntities is true or not, parse5 always parse &lt; into <, if we set it to true, it looks like &lt;(input) -> <(parsed) -> <(before serialized) -> &lt; (after serialized), however, chars like Chinese and Japanese are ALSO encoded, which is BAD. But if we set it to false, it looks like &lt;(input) -> <(parsed) -> <(before serialized) -> < (after serialized because decodeEntities is false), which is totally a mess.

dom-serializer is right, it has no fault in its behavior. There are two ways to fix:

  • Stop use parse5, let us use htmlparser2 again by install 0.22.0 with npm i -s cheerio@0.22.0 (code in master branch is still 0.22.0, confusing). @fb55 @jugglinmike @matthewmueller
  • Make parse5 able to understand decodeEntites, because sometimes we want to keep &lt; or &amp; in innerHTML, always un-escape them will cause a mess. @inikulin

We'd better not modify dom-serializer, because it's a parse5 issue. parse5 does NOT always un-escape &amp;, it replace &amp in innerHTML into &, but won't replace &amp; in attributes.

It's a little bit long because it's complex, but please read and fix it, we cannot always lock version to 0.22.0.

@bard
Copy link

bard commented May 19, 2019

Based on jsdom/jsdom#2501 and inikulin/parse5#261 (comment) it's unlikely that this will be handled in parse5.

The second thread suggests that it's possible to grab the raw, undecoded text from the input data based on node position, though that would require modifying dom-serializer, too.

@AlynxZhou
Copy link

@bard dom-serializer is used to generate strings from dom, not dom from string. We shouldn't touch it because it works correctly.

@AlynxZhou
Copy link

AlynxZhou commented May 19, 2019

due to inikulin/parse5#75 we know parse5 always decode entities, so decodeEntites is useless, and then we should add another option for encode entities or not, problems are cheerio currently support both parse5(for HTML) and htmlparser2(for XML)...

Edit: I got a better idea, we just need to remove entities re-encode in dom-serializer when working with parse5, I will send a pr.

Edit: However, this will break compatibility, needs an version bump...

Edit: PR is here cheeriojs/dom-serializer#80

@propattern
Copy link

Thanks @AlynxZhou. Great find. Your 1st Solution worked for me, but you are right that we can't be permanently be bound to version 0.22.0.

@AlynxZhou
Copy link

AlynxZhou commented Jun 3, 2019 via email

@matthewmueller
Copy link
Member

@fb55 @jugglinmike this seems to be causing some folks pain. If i recall, we were going to continue supporting both dom-serializer and parse5. do we have a snippet that we can point people to for the old behavior?

@AlynxZhou
Copy link

This is because parse5 ignores decodeEntity, I modified dom-serializer to fix this, but it needs a version bump.

@inikulin
Copy link
Contributor

I wonder what breaks exactly with decoded entities. The only thing that comes to my mind is that resulting pages are served in encoding other than UTF-8. Otherwise, nowadays there is no reason to prefer HTML entities (with an exception for safety-related HTML escaping) over code points.

@fb55
Copy link
Member

fb55 commented Jun 27, 2019

@inikulin The issue is serialization. Some people have entities as plain text in their HTML or want to maintain the encoding style they've used.

@AlynxZhou
Copy link

@fb55 But for people who don't want HTML entities in their files, cheerio 1.0.0 is totally broken because it cannot handle things correctly (decodeEntities false, &lt; becomes <, decodeEntities true, CJK chars becomes entities). The problem is not in cheerio, it's because parse5 always decode entities while parsing. Using decodeEntities to decide whether we should encode is not a good choice, because we cannot control whether decode entities when using parse5, we'd better not to encode entities, just escape HTML is enough. cheerio is expected to be a parser, not encoder, encoding all entities makes it really hard for further string editing.

I hope you can consider my PR cheeriojs/dom-serializer#80, I am using cheerio in my tools and I cannot always stick to 0.22.0.

@AlynxZhou
Copy link

One month passed and this problem still exists:

Screenshot from 2019-08-07 14-30-46

I am afraid that as a Chinese user I have to keep using cheerio 0.22.0 in my project forever because it seems no one will take a look on the reason I found.

@claviska
Copy link

claviska commented Aug 7, 2019

@AlynxZhou this workaround should help. I've been using it in production since April with no ill effects.

@AlynxZhou
Copy link

@claviska Thank you. But what I want is not a workaround, I know I can solve it by a simple hook. However, the problem is a bug in fact, it is because new library's behavior is different from the old one, and it needs to be fixed by remove unsupported option, it's complex and needs a change of logic, workaround can only hide the problem, but the problem still exists and may break other part.

Anyway thank you for your help.

@claviska
Copy link

claviska commented Aug 8, 2019

@AlynxZhou I'd like to see it fixed too, but this is open source so our options are to:

  1. Contribute a fix
  2. Use a workaround
  3. Wait patiently
  4. Use an old version that probably won't get updated 😕

What I've proposed can be used as a separate module that you can easily remove once the bug is properly fixed. If you want to use the latest version without this encoding issue, it's probably your best bet.

Cheers

@AlynxZhou
Copy link

@claviska I have send a PR to dom-serializer which removed decodeEntities because parser5 always do decoding and this options is useless. However, still no one replied.

@matthewmueller
Copy link
Member

matthewmueller commented Aug 18, 2019

To unblock, you can use the old parser:

const cheerio = require('cheerio')
const $ = cheerio.load(`<code>&lt;script src&gt;</code>`, {
  _useHtmlParser2: true,
  decodeEntities: false,
})
console.log($.html())
// <code>&lt;script src&gt;</code>

This is undocumented and may be removed when we go to 1.0.0. However, since I'm telling people they can use it, we'll probably need to support it. I hope this is okay, @jugglinmike @fb55.


@AlynxZhou thanks for taking the time to share your findings. Unfortunately, it's not easy for me to tell if the snippet above solves your problem or not. Can you please reframe your problem in the following way:

  1. Create a minimal code example that illustrates the problem.
  2. What you expect to see.
  3. What you actually see.

This will help us debug your issue. You can refer to the original poster for a decent example of this.

@AlynxZhou
Copy link

@matthewmueller No need to check via code, I can tell you this can definitely solve problem, however, as you said, _useHtmlParser2 is undocumented and may be removed, so I am afraid that I can't use such a unstable feature in my project to fix bugs.

Such snippet is just let 1.0.0 works like 0.22.0, to solve problem we need to consider differences between parse5 and htmlparser2, and I need to say again, encode all things into XML entities is a BAD option and we may consider to drop it because parse5 always decode input.

@matthewmueller
Copy link
Member

matthewmueller commented Aug 19, 2019

It's highly dependent on your use case, which is why we need to provide an option to switch. parse5 is a better supported parser at this point and I believe it was @fb55's idea to switch to it over his own parser.

FYI, for the final public API I'd probably try and have a parser interface like this:

const parser = {
  parse(code: string): AST
}

cheerio.load("...", { parser: parser })

What do you think @jugglinmike @fb55 ?

@AlynxZhou
Copy link

AlynxZhou commented Aug 19, 2019

@matthewmueller Sorry, I am reading code again now and find that the problem has been fixed by using parse5's serializer instead of dom-serializer, so I am trying to test it, I am REALLY happy to see that it may be fixed!

UPDATE: I am sure this is fixed in v1.0.0 branch and now waiting this to be updated to npm happily! If you want to test I write my code here:

const cheerio = require('cheerio')

const $ = cheerio.load('<p>&lt;哈哈哈</p>')

$('body').html() // output -> '<p>&lt;哈哈哈</p>'

'<p><哈哈哈</p>' or you get XML entities means it was not solved, but in my test v1.0.0 works correct!

(

return parse5.serialize(
is the code that fixed the problem.)

@curbengh
Copy link
Contributor

So, #1307 fixed this issue?

curbengh added a commit to curbengh/hexo-nofollow that referenced this issue Aug 19, 2019
@AlynxZhou
Copy link

So, #1307 fixed this issue?

I think it was fixed before this, you may do a test.

@curbengh
Copy link
Contributor

I previously had an issue with embedding xml code block in my blog. I can confirm it's fixed when I use v1.0.0 branch.

OP @huanz can you confirm?

@AlynxZhou
Copy link

@matthewmueller Any plan for releasing 1.0.0? Or could you please consider to release another rc version to fix this problem? We really want 1.0.0 branch because it fixed this bug. Thanks!

@michaelperrin
Copy link

I can confirm that the problem is fixed on the 1.0.0 branch 👍 I installed it this way:

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

This is not ideal and hope that a new RC will soon be released.

michaelperrin added a commit to michaelperrin/miam.js that referenced this issue Aug 1, 2020
This fixes a bug (cheeriojs/cheerio#1198)
that decodes entities
@superhuai
Copy link

superhuai commented Dec 6, 2020

is be fixed ? I also have this problem

const $ = cheerio.load('<p>&lt;哈哈哈</p>')

console.log($('body').html())
npm i cheerio@1.0.0-rc.3

emmm , it's ok

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

@AlynxZhou
Copy link

is be fixed ? I also have this problem

const $ = cheerio.load('<p>&lt;哈哈哈</p>')

console.log($('body').html())
npm i cheerio@1.0.0-rc.3

emmm , it's ok

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

two years past and they still don't update version on npm, so you will still face this problem if you install 1.0.0-rc3 from npm. You need to install 1.0.0 branch from github.

AndrewKvalheim added a commit to AndrewKvalheim/Typeset that referenced this issue Dec 10, 2020
@fb55
Copy link
Member

fb55 commented Dec 21, 2020

cheerio@1.0.0-rc.5 is now on npm. Hoping this is resolved now!

@magicdawn
Copy link

magicdawn commented Apr 26, 2023

looks like for xhtml output + cjk charaters, _useHtmlParser2 is still needed.

import { load } from 'cheerio' // 1.0.0-rc.12

const html = `
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <br />
    <pre>
      // 哈哈哈
      // swift generic
      var a5 = MemoryLayout&lt;UInt&gt;.size
    </pre>
  </body>
</html>

`

console.log('-'.repeat(20))
{
  const $ = load(html, {})
  const y = $.xml().trim()
  console.log(y)
  // &#x54c8;&#x54c8;&#x54c8;                    <--------- wrong, so `decodeEntities` is needed
  // swift generic
  // var a5 = MemoryLayout&lt;UInt&gt;.size
}

console.log('-'.repeat(20))
{
  const $ = load(html, {
    decodeEntities: false,
  })
  const y = $.xml().trim()
  console.log(y)
  // 哈哈哈                                      <--------- correct
  // swift generic
  // var a5 = MemoryLayout<UInt>.size           <--------- wrong, it's code in a `<pre>` tag
}

console.log('-'.repeat(20))
{
  const $ = load(html, {
    // @ts-ignore
    _useHtmlParser2: true,
    decodeEntities: false,
  })
  const y = $.xml().trim()
  console.log(y)
  // 哈哈哈                                      <--------- correct
  // swift generic
  // var a5 = MemoryLayout&lt;UInt&gt;.size     <--------- correct
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests