decodeEntities false code tag #1198

huanz · 2018-05-14T11:18:00Z

const $ = cheerio.load(`<code>&lt;script src&gt;</code>`, {
  decodeEntities: false
});
console.log($.html());

output

<html><head></head><body><code><script src></code></body></html>

you can see that the string <script src> has been unescaped

The text was updated successfully, but these errors were encountered:

Jasonlhy · 2018-05-29T06:14:17Z

It seems that you need to set it to true.

$ = cheerio.load(`<code>&lt;script src&gt;=></code>`, {   decodeEntities: true }); console.log($.html());
<html><head></head><body><code>&lt;script src&gt;=&gt;</code></body></html>

huanz · 2018-05-29T08:37:29Z

@Jasonlhy if i set true, the chinese will be decoded, eg:

const $ = cheerio.load(`<p>哈哈哈哈</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($.html());

output:

<html><head></head><body><p>&#x54C8;&#x54C8;&#x54C8;&#x54C8;</p><code>&lt;script src&gt;</code></body></html>

Jasonlhy · 2018-05-29T17:55:53Z

After a little bit of research and testing. Parser5 (The HTML parser) does not support the option: decodeEntities now, the result from parser5 is default decoded (unescaped).

Both cases will print out 哈哈哈哈.

const $ = cheerio.load(`<p>哈哈哈哈</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($('p').text());

const $ = cheerio.load(`<p>哈哈哈哈</p><code>&lt;script src&gt;</code>`, {
  decodeEntities: true
});
console.log($('p').text());

I think the reason of this problem is the serializer, it is located at https://github.com/cheeriojs/dom-serializer/blob/master/index.js

It seems to change all decoded entities (Including Chinese text) back to encoded one if decodeEntities set to true

function renderText(elem, opts) {
  var data = elem.data || '';

  // if entities weren't decoded, no need to encode them back
  if (opts.decodeEntities && !(elem.parent && elem.parent.name in unencodedElements)) {
    data = entities.encodeXML(data);
  }

  return data;
}

senwzz · 2018-08-07T05:43:49Z

This is the biggest problem of the project

bard · 2019-04-10T08:21:55Z

I had documents containing html like <pre><span>a = 1</span></pre> (i.e. code samples of html itself) and wanted to process them with cheerio before displaying sending them to the browser. decodeEntities: false obviously was not working due to the problem reported in this issue, and decodeEntities: true was encoding far more than < and >.

I kept decodeEntities: false and used https://www.npmjs.com/package/patch-package to generate and automatically apply this patch to dom-serializer upon every install:

diff --git a/node_modules/dom-serializer/index.js b/node_modules/dom-serializer/index.js
index 4f41e8f..8f9f14f 100644
--- a/node_modules/dom-serializer/index.js
+++ b/node_modules/dom-serializer/index.js
@@ -136,6 +136,8 @@ function renderText(elem, opts) {
     data = entities.encodeXML(data);
   }
 
+  data = data.replace(/</g, '&lt;').replace(/>/g, '&gt;')
+  
   return data;
 }

Since that only affects text nodes, it should not mess with anything, but caveat emptor.

I wonder if the clean solution here is replacing that encodeXML with something more context-aware, i.e. encode depending on the document type and charset.

AlynxZhou · 2019-05-19T03:41:56Z

This issue still exists now and I am crazy because I am building a static site generator with cheerio. This problem behaves in two ways:

First I am using marked.js and it will encode symbols like < in code blocks (<pre></pre>) into &lt, that's what expected. However, if I set decodeEntities: false, cheerio will replace &lt back with <, which is ridiculous!!! Why parser decode &lt into <? I think it should not change this.
But if I set decodeEntities: true, all CJK chars will be replaced into HTML entities, which is hard to deal with because some JavaScript codes needs to get string length, HTML entities is OK to render but not easy to process.

Can anybody locate the problem and fix it? Thanks a lot.

AlynxZhou · 2019-05-19T07:11:52Z

After reading codes for 3 hours I believe I find where problem begins.

@bard I think your patch is not correct, not only innerHTMLs but attrs also has this problem.

The problem is because cheerio is switching from htmlparser2 to parse5, and decodeEntities is an option to htmlparser2, but not an option to parse5, in version 1.0.0-rc3, they did not pass this option to parse5 too.

If we use htmlparser2 with decodeEntities: false, all things works correctly. However if you install cheerio with npm i -s cheerio, you got 1.0.0-rc3, which uses parse5.

parse5 will parse < into < when it reading input, while htmparser2 with decodeEntities: false won't. And when we get result with .html(), cheerio will use dom-serializer to render strings.

dom-serializer supposes parser understanding decodeEntities, suppose we have < as input, if decodeEntities is true, dom-serializer should get &, else it should get <. When we are using htmlparser2, it works. With decodeEntities: false, it looks like <(input) -> <(parsed) -> <(before and after serialized). With decodeEntities: true, it looks like <(input) -> <(parsed) -> <(before serialized) -> < (after serialized).

But whether decodeEntities is true or not, parse5 always parse < into <, if we set it to true, it looks like <(input) -> <(parsed) -> <(before serialized) -> < (after serialized), however, chars like Chinese and Japanese are ALSO encoded, which is BAD. But if we set it to false, it looks like <(input) -> <(parsed) -> <(before serialized) -> < (after serialized because decodeEntities is false), which is totally a mess.

dom-serializer is right, it has no fault in its behavior. There are two ways to fix:

Stop use parse5, let us use htmlparser2 again by install 0.22.0 with npm i -s cheerio@0.22.0 (code in master branch is still 0.22.0, confusing). @fb55 @jugglinmike @matthewmueller
Make parse5 able to understand decodeEntites, because sometimes we want to keep < or & in innerHTML, always un-escape them will cause a mess. @inikulin

We'd better not modify dom-serializer, because it's a parse5 issue. parse5 does NOT always un-escape &, it replace &amp in innerHTML into &, but won't replace & in attributes.

It's a little bit long because it's complex, but please read and fix it, we cannot always lock version to 0.22.0.

bard · 2019-05-19T07:58:33Z

Based on jsdom/jsdom#2501 and inikulin/parse5#261 (comment) it's unlikely that this will be handled in parse5.

The second thread suggests that it's possible to grab the raw, undecoded text from the input data based on node position, though that would require modifying dom-serializer, too.

AlynxZhou · 2019-05-19T08:15:39Z

@bard dom-serializer is used to generate strings from dom, not dom from string. We shouldn't touch it because it works correctly.

AlynxZhou · 2019-05-19T08:48:28Z

due to inikulin/parse5#75 we know parse5 always decode entities, so decodeEntites is useless, and then we should add another option for encode entities or not, problems are cheerio currently support both parse5(for HTML) and htmlparser2(for XML)...

Edit: I got a better idea, we just need to remove entities re-encode in dom-serializer when working with parse5, I will send a pr.

Edit: However, this will break compatibility, needs an version bump...

Edit: PR is here cheeriojs/dom-serializer#80

propattern · 2019-06-03T00:54:50Z

Thanks @AlynxZhou. Great find. Your 1st Solution worked for me, but you are right that we can't be permanently be bound to version 0.22.0.

AlynxZhou · 2019-06-03T01:12:38Z

I send a PR to dom-serializer to fix 1.0.0, however no one replied...

…

On Mon, Jun 3, 2019, 08:55 Iqbal Ahmed ***@***.***> wrote: Thanks @AlynxZhou <https://github.com/AlynxZhou>. Great find. Your 1st Solution worked for me, but you are right that we can't be permanently be bound to version 0.22.0. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1198?email_source=notifications&email_token=ACANVR42P4FOIMW62TLNKKDPYRTWVA5CNFSM4E7WNJK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWYBZXQ#issuecomment-498080990>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACANVR7VPU5TRUT4COB57HDPYRTWVANCNFSM4E7WNJKQ> .

matthewmueller · 2019-06-03T03:50:18Z

@fb55 @jugglinmike this seems to be causing some folks pain. If i recall, we were going to continue supporting both dom-serializer and parse5. do we have a snippet that we can point people to for the old behavior?

AlynxZhou · 2019-06-03T05:31:31Z

This is because parse5 ignores decodeEntity, I modified dom-serializer to fix this, but it needs a version bump.

inikulin · 2019-06-27T18:33:51Z

I wonder what breaks exactly with decoded entities. The only thing that comes to my mind is that resulting pages are served in encoding other than UTF-8. Otherwise, nowadays there is no reason to prefer HTML entities (with an exception for safety-related HTML escaping) over code points.

fb55 · 2019-06-27T18:51:57Z

@inikulin The issue is serialization. Some people have entities as plain text in their HTML or want to maintain the encoding style they've used.

AlynxZhou · 2019-07-02T03:56:35Z

@fb55 But for people who don't want HTML entities in their files, cheerio 1.0.0 is totally broken because it cannot handle things correctly (decodeEntities false, < becomes <, decodeEntities true, CJK chars becomes entities). The problem is not in cheerio, it's because parse5 always decode entities while parsing. Using decodeEntities to decide whether we should encode is not a good choice, because we cannot control whether decode entities when using parse5, we'd better not to encode entities, just escape HTML is enough. cheerio is expected to be a parser, not encoder, encoding all entities makes it really hard for further string editing.

I hope you can consider my PR cheeriojs/dom-serializer#80, I am using cheerio in my tools and I cannot always stick to 0.22.0.

AlynxZhou · 2019-08-07T06:33:24Z

One month passed and this problem still exists:

I am afraid that as a Chinese user I have to keep using cheerio 0.22.0 in my project forever because it seems no one will take a look on the reason I found.

claviska · 2019-08-07T13:52:16Z

@AlynxZhou this workaround should help. I've been using it in production since April with no ill effects.

AlynxZhou · 2019-08-08T01:34:25Z

@claviska Thank you. But what I want is not a workaround, I know I can solve it by a simple hook. However, the problem is a bug in fact, it is because new library's behavior is different from the old one, and it needs to be fixed by remove unsupported option, it's complex and needs a change of logic, workaround can only hide the problem, but the problem still exists and may break other part.

Anyway thank you for your help.

claviska · 2019-08-08T14:24:23Z

@AlynxZhou I'd like to see it fixed too, but this is open source so our options are to:

Contribute a fix
Use a workaround
Wait patiently
Use an old version that probably won't get updated 😕

What I've proposed can be used as a separate module that you can easily remove once the bug is properly fixed. If you want to use the latest version without this encoding issue, it's probably your best bet.

Cheers

AlynxZhou · 2019-08-09T03:21:34Z

@claviska I have send a PR to dom-serializer which removed decodeEntities because parser5 always do decoding and this options is useless. However, still no one replied.

matthewmueller · 2019-08-18T16:23:55Z

To unblock, you can use the old parser:

const cheerio = require('cheerio')
const $ = cheerio.load(`<code>&lt;script src&gt;</code>`, {
  _useHtmlParser2: true,
  decodeEntities: false,
})
console.log($.html())
// <code>&lt;script src&gt;</code>

This is undocumented and may be removed when we go to 1.0.0. However, since I'm telling people they can use it, we'll probably need to support it. I hope this is okay, @jugglinmike @fb55.

@AlynxZhou thanks for taking the time to share your findings. Unfortunately, it's not easy for me to tell if the snippet above solves your problem or not. Can you please reframe your problem in the following way:

Create a minimal code example that illustrates the problem.
What you expect to see.
What you actually see.

This will help us debug your issue. You can refer to the original poster for a decent example of this.

AlynxZhou · 2019-08-19T01:45:45Z

@matthewmueller No need to check via code, I can tell you this can definitely solve problem, however, as you said, _useHtmlParser2 is undocumented and may be removed, so I am afraid that I can't use such a unstable feature in my project to fix bugs.

Such snippet is just let 1.0.0 works like 0.22.0, to solve problem we need to consider differences between parse5 and htmlparser2, and I need to say again, encode all things into XML entities is a BAD option and we may consider to drop it because parse5 always decode input.

matthewmueller · 2019-08-19T09:27:09Z

It's highly dependent on your use case, which is why we need to provide an option to switch. parse5 is a better supported parser at this point and I believe it was @fb55's idea to switch to it over his own parser.

FYI, for the final public API I'd probably try and have a parser interface like this:

const parser = {
  parse(code: string): AST
}

cheerio.load("...", { parser: parser })

What do you think @jugglinmike @fb55 ?

AlynxZhou · 2019-08-19T09:54:15Z

@matthewmueller Sorry, I am reading code again now and find that the problem has been fixed by using parse5's serializer instead of dom-serializer, so I am trying to test it, I am REALLY happy to see that it may be fixed!

UPDATE: I am sure this is fixed in v1.0.0 branch and now waiting this to be updated to npm happily! If you want to test I write my code here:

const cheerio = require('cheerio')

const $ = cheerio.load('<p>&lt;哈哈哈</p>')

$('body').html() // output -> '<p>&lt;哈哈哈</p>'

'<p><哈哈哈</p>' or you get XML entities means it was not solved, but in my test v1.0.0 works correct!

(

cheerio/lib/static.js

Line 106 in c635bea

return parse5.serialize(

is the code that fixed the problem.)

curbengh · 2019-08-19T11:17:14Z

So, #1307 fixed this issue?

cheeriojs/cheerio#1198 (comment)

AlynxZhou · 2019-08-19T11:32:49Z

So, #1307 fixed this issue?

I think it was fixed before this, you may do a test.

curbengh · 2019-08-20T02:02:08Z

I previously had an issue with embedding xml code block in my blog. I can confirm it's fixed when I use v1.0.0 branch.

OP @huanz can you confirm?

AlynxZhou · 2019-09-26T09:06:40Z

@matthewmueller Any plan for releasing 1.0.0? Or could you please consider to release another rc version to fix this problem? We really want 1.0.0 branch because it fixed this bug. Thanks!

michaelperrin · 2020-08-01T22:20:08Z

I can confirm that the problem is fixed on the 1.0.0 branch 👍 I installed it this way:

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

This is not ideal and hope that a new RC will soon be released.

This fixes a bug (cheeriojs/cheerio#1198) that decodes entities

superhuai · 2020-12-06T10:23:54Z

is be fixed ？ I also have this problem

const $ = cheerio.load('<p>&lt;哈哈哈</p>')

console.log($('body').html())

npm i cheerio@1.0.0-rc.3

emmm , it's ok

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

AlynxZhou · 2020-12-07T00:56:53Z

is be fixed ？ I also have this problem

const $ = cheerio.load('<p>&lt;哈哈哈</p>')

console.log($('body').html())

npm i cheerio@1.0.0-rc.3

emmm , it's ok

yarn add https://github.com/cheeriojs/cheerio.git#1.0.0

two years past and they still don't update version on npm, so you will still face this problem if you install 1.0.0-rc3 from npm. You need to install 1.0.0 branch from github.

Related: - cheeriojs/cheerio#1198 - cheeriojs/cheerio#1219

fb55 · 2020-12-21T22:30:07Z

cheerio@1.0.0-rc.5 is now on npm. Hoping this is resolved now!

magicdawn · 2023-04-26T12:12:00Z

looks like for xhtml output + cjk charaters, _useHtmlParser2 is still needed.

import { load } from 'cheerio' // 1.0.0-rc.12

const html = `
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <br />
    <pre>
      // 哈哈哈
      // swift generic
      var a5 = MemoryLayout&lt;UInt&gt;.size
    </pre>
  </body>
</html>

`

console.log('-'.repeat(20))
{
  const $ = load(html, {})
  const y = $.xml().trim()
  console.log(y)
  // &#x54c8;&#x54c8;&#x54c8;                    <--------- wrong, so `decodeEntities` is needed
  // swift generic
  // var a5 = MemoryLayout&lt;UInt&gt;.size
}

console.log('-'.repeat(20))
{
  const $ = load(html, {
    decodeEntities: false,
  })
  const y = $.xml().trim()
  console.log(y)
  // 哈哈哈                                      <--------- correct
  // swift generic
  // var a5 = MemoryLayout<UInt>.size           <--------- wrong, it's code in a `<pre>` tag
}

console.log('-'.repeat(20))
{
  const $ = load(html, {
    // @ts-ignore
    _useHtmlParser2: true,
    decodeEntities: false,
  })
  const y = $.xml().trim()
  console.log(y)
  // 哈哈哈                                      <--------- correct
  // swift generic
  // var a5 = MemoryLayout&lt;UInt&gt;.size     <--------- correct
}

huanz changed the title ~~decodeEntities false code tag~~ decodeEntities false code tag May 14, 2018

momocow mentioned this issue Feb 26, 2019

cheerio.load() with {decodeEntities: false} causes html escaped text to be unescaped. curbengh/hexo-nofollow#3

Closed

bard mentioned this issue Apr 10, 2019

Can you have & not encoded to & in url query strings and not have html entities decoded? #1132

Closed

curbengh mentioned this issue May 18, 2019

the problem when i hexo deploy curbengh/hexo-nofollow#4

Closed

AlynxZhou mentioned this issue May 19, 2019

Fixed HTML symbols escaping bug. cheeriojs/dom-serializer#79

Closed

AlynxZhou mentioned this issue May 19, 2019

Version bump: Support cheerio 1.0.0-RC3 with parse5. cheeriojs/dom-serializer#80

Closed

ZhenTek mentioned this issue Jul 29, 2019

viewBox in SVG but generate viewbox insteadof website-scraper/node-website-scraper#369

Closed

AlynxZhou mentioned this issue Aug 9, 2019

Change escape function to fix bug with default parse5 parser cheeriojs/dom-serializer#85

Closed

curbengh mentioned this issue Aug 15, 2019

Performance Data of Hexo hexojs/hexo#3663

Closed

matthewmueller mentioned this issue Aug 18, 2019

Non-latin text should not be HTML encoded automatically with @html selector matthewmueller/x-ray#350

Merged

4 tasks

curbengh added a commit to curbengh/hexo-nofollow that referenced this issue Aug 19, 2019

chore(deps): use v1.0.0 branch of cheerio

81afa24

cheeriojs/cheerio#1198 (comment)

liolok mentioned this issue Aug 29, 2019

Broken tags in HTML code block liolok/hexo-asset-link#2

Closed

curbengh mentioned this issue Oct 27, 2019

Replace cheerio with native API hexojs/hexo#3677

Closed

4 tasks

Aaronius mentioned this issue Jul 24, 2020

Non-breaking spaces decoded when decodeEntities = false? #1454

Closed

michaelperrin added a commit to michaelperrin/miam.js that referenced this issue Aug 1, 2020

Update Cheerio to 1.0.0 branch

7d48bd0

This fixes a bug (cheeriojs/cheerio#1198) that decodes entities

AndrewKvalheim added a commit to AndrewKvalheim/Typeset that referenced this issue Dec 10, 2020

Document bug

41abb74

Related: - cheeriojs/cheerio#1198 - cheeriojs/cheerio#1219

fb55 closed this as completed Dec 21, 2020

ncksllvn mentioned this issue Mar 11, 2021

[Prearchive] Patch for encoded HTML entities becoming decoded in DOM department-of-veterans-affairs/vets-website#16335

Merged

5 tasks

vejja mentioned this issue Mar 5, 2024

Integrate option to disable Cheerio decodeEntities nuxt-modules/security#393

Closed

decodeEntities false code tag #1198

decodeEntities false code tag #1198

Comments

huanz commented May 14, 2018

Jasonlhy commented May 29, 2018

huanz commented May 29, 2018 • edited Loading

Jasonlhy commented May 29, 2018 • edited Loading

senwzz commented Aug 7, 2018

bard commented Apr 10, 2019

AlynxZhou commented May 19, 2019 • edited Loading

AlynxZhou commented May 19, 2019

bard commented May 19, 2019

AlynxZhou commented May 19, 2019

AlynxZhou commented May 19, 2019 • edited Loading

propattern commented Jun 3, 2019

AlynxZhou commented Jun 3, 2019 via email

matthewmueller commented Jun 3, 2019

AlynxZhou commented Jun 3, 2019

inikulin commented Jun 27, 2019

fb55 commented Jun 27, 2019

AlynxZhou commented Jul 2, 2019

AlynxZhou commented Aug 7, 2019

claviska commented Aug 7, 2019

AlynxZhou commented Aug 8, 2019

claviska commented Aug 8, 2019

AlynxZhou commented Aug 9, 2019

matthewmueller commented Aug 18, 2019 • edited Loading

AlynxZhou commented Aug 19, 2019

matthewmueller commented Aug 19, 2019 • edited Loading

AlynxZhou commented Aug 19, 2019 • edited Loading

curbengh commented Aug 19, 2019

AlynxZhou commented Aug 19, 2019

curbengh commented Aug 20, 2019

AlynxZhou commented Sep 26, 2019

michaelperrin commented Aug 1, 2020

superhuai commented Dec 6, 2020 • edited Loading

AlynxZhou commented Dec 7, 2020

fb55 commented Dec 21, 2020

magicdawn commented Apr 26, 2023 • edited Loading

huanz commented May 29, 2018 •

edited

Loading

Jasonlhy commented May 29, 2018 •

edited

Loading

AlynxZhou commented May 19, 2019 •

edited

Loading

AlynxZhou commented May 19, 2019 •

edited

Loading

matthewmueller commented Aug 18, 2019 •

edited

Loading

matthewmueller commented Aug 19, 2019 •

edited

Loading

AlynxZhou commented Aug 19, 2019 •

edited

Loading

superhuai commented Dec 6, 2020 •

edited

Loading

magicdawn commented Apr 26, 2023 •

edited

Loading