Non-terminated HTML Entities are not recognized properly #2207

Muthukirthan · 2024-10-06T07:02:44Z

Case1

Input: a&nbspc
Browser result: a c
&nbsp is recognized as the &nbsp; HTML entity

Jsoup parsed content: a&nbspc
Browser result: a&nbspc
&nbsp is not recognized, so the browser shows a different result

Case2

Input: a&nbsp&shyc
Browser result: a ­c
&nbsp and &shy are recognized as the &nbsp; and &shy; HTML entities respectively

Jsoup parsed content: a &shyc
Browser result: a &shyc
&nbsp is recognized (possibly because it is followed by an & character), but &shy is not recognized as &shy;, so the browser shows a different result

Case3

Input: a&shyc&nbsp
Browser result: a­c 
&nbsp and &shy are recognized as the &nbsp; and &shy; HTML entities respectively

Jsoup parsed content: a&shyc 
Browser result: a&shyc 
&nbsp is recognized (because the string ends with that entity), but &shy is not recognized as &shy;, so the browser shows a different result

After checking a few more cases, this issue appears only for named entities (such as &nbsp;, &amp;, &quot;, and others) where the entity is not terminated by a semicolon and is followed by letters. Hexadecimal and decimal numeric entities are detected even when they are not terminated by a semicolon.

Examples:
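Detected properly, as expected:
&nbsp,ddd (expected a non-breaking space followed by ,ddd; got the same result)
&nbsp ddd (expected a non-breaking space followed by a space and ddd; got the same result)
djdjb&nbsp (expected djdjb followed by a non-breaking space; got the same result)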

Not detected (ISSUES):
&nbspdhdj (expected a non-breaking space followed by dhdj, but got &nbspdhdj)
&ampdfgsj (expected &dfgsj, but got &ampdfgsj)

Browsers decode these HTML entities. I also validated this at https://mothereff.in/html-entities.
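The observation that numeric references decode without a semicolon can be illustrated with a small, self-contained Java sketch. It is not jsoup's code: the class name `NumericRefSketch` and its `decode` method are hypothetical, and it handles only `&#…` and `&#x…` references. The point it shows is that the run of digits ends at the first non-digit character, so no terminator is needed to find where the reference stops.

```java
// Sketch only: decodes decimal (&#160) and hexadecimal (&#xA0) character
// references, with or without a trailing ';'. Hypothetical helper, not jsoup's API.
class NumericRefSketch {
    static String decode(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            if (!in.startsWith("&#", i)) {
                out.append(in.charAt(i++));
                continue;
            }
            int p = i + 2;
            int radix = 10;
            if (p < in.length() && (in.charAt(p) == 'x' || in.charAt(p) == 'X')) {
                radix = 16;
                p++;
            }
            int start = p;
            // Consume digits valid in this radix; the first other character ends the run.
            while (p < in.length() && Character.digit(in.charAt(p), radix) >= 0) {
                p++;
            }
            if (p == start) {
                // "&#" with no digits: keep the text unchanged.
                out.append('&');
                i++;
                continue;
            }
            int codePoint;
            try {
                codePoint = Integer.parseInt(in.substring(start, p), radix);
            } catch (NumberFormatException e) {
                codePoint = 0xFFFD; // value too large to parse
            }
            if (!Character.isValidCodePoint(codePoint)) {
                codePoint = 0xFFFD; // replacement character for out-of-range values
            }
            out.appendCodePoint(codePoint);
            i = p;
            // The semicolon is optional; skip it when present.
            if (i < in.length() && in.charAt(i) == ';') {
                i++;
            }
        }
        return out.toString();
    }
}
```

For example, `NumericRefSketch.decode("a&#160c")` yields `a`, a non-breaking space, then `c`. A hexadecimal reference followed by a letter in the range a to f would absorb that letter as another digit, which is why the hexadecimal case is best tried with a following character such as `z`.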

Parser: HTML parser
Escape mode: the result is the same for both base and extended. In xhtml escape mode the nbsp entity is written as &#xa0;, but the result is otherwise the same.

I also raised a related question about entities: #2206


jhy · 2024-11-21T05:15:10Z

I guess the spec changed here a bit (https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state). jsoup actually used to work like that way back when. Will need to review.

jhy · 2024-11-21T22:07:28Z

Here's the change from when we used to look for the longest matching prefix: a31ec08
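As a rough illustration of the longest-matching-prefix idea, here is a minimal, self-contained Java sketch. It is not jsoup's implementation and not the commit referenced above: the class name `NamedRefSketch`, its `decode` method, and the three-entry table are assumptions made for the example. It also covers text content only and ignores the separate rule the spec applies inside attribute values.

```java
import java.util.Map;

// Sketch only: a tiny, hypothetical table of names that HTML also accepts
// without a trailing ';'. The real table is far larger.
class NamedRefSketch {
    static final Map<String, String> LEGACY = Map.of(
        "nbsp", "\u00A0",
        "shy", "\u00AD",
        "amp", "&");

    static boolean isAsciiAlnum(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    static String decode(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            // Find the run of ASCII letters and digits that follows '&'.
            int end = i + 1;
            while (end < in.length() && isAsciiAlnum(in.charAt(end))) {
                end++;
            }
            // Try the whole run first, then ever shorter prefixes, so the
            // longest known name wins ("nbspc" fails, "nbsp" matches).
            String match = null;
            int matchEnd = -1;
            for (int j = end; j > i + 1; j--) {
                String name = in.substring(i + 1, j);
                if (LEGACY.containsKey(name)) {
                    match = LEGACY.get(name);
                    matchEnd = j;
                    break;
                }
            }
            if (match == null) {
                // No known name: keep the '&' as literal text.
                out.append('&');
                i++;
                continue;
            }
            out.append(match);
            i = matchEnd;
            // The semicolon is optional for these names; skip it when present.
            if (i < in.length() && in.charAt(i) == ';') {
                i++;
            }
        }
        return out.toString();
    }
}
```

With this sketch, the three reported inputs (`a&nbspc`, `a&nbsp&shyc`, `a&shyc&nbsp`) all decode the way the report says browsers do, while an unknown name such as `&foo` is left untouched.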

jhy · 2024-11-22T02:04:13Z

Thanks for the clear report! Fixed.

Muthukirthan changed the title ~~🚨 Jsoup - Entities are not recognized properly and &shy is not treated like other entities~~ 🚨 Jsoup - Html Entities are not recognized properly Nov 7, 2024

jhy changed the title ~~🚨 Jsoup - Html Entities are not recognized properly~~ HTML Entities are not recognized properly Nov 21, 2024

jhy changed the title ~~HTML Entities are not recognized properly~~ Non-terminated HTML Entities are not recognized properly Nov 21, 2024

jhy added the bug label ("Confirmed bug that we should fix") Nov 21, 2024

Muthukirthan mentioned this issue Nov 21, 2024

Not able to identify escaped/unescaped html entity in the text nodes #2206

Closed

jhy closed this as completed in 5ee376b Nov 22, 2024

jhy added the fixed label Nov 22, 2024

jhy added this to the 1.18.2 milestone Nov 22, 2024
