Non-terminated HTML Entities are not recognized properly #2207

Closed
Muthukirthan opened this issue Oct 6, 2024 · 3 comments
Labels: bug (Confirmed bug that we should fix), fixed
Milestone: 1.18.2

Muthukirthan commented Oct 6, 2024

Case1

Input: <p>a&nbspc</p>
Browser result: a c (a non-breaking space between "a" and "c")
&nbsp is recognized as the &nbsp; HTML entity.

jsoup parsed content: <p>a&amp;nbspc</p>
Browser result: a&nbspc
&nbsp is not recognized, so the browser shows a different result.


Case2

Input: <p>a&nbsp&shyc</p>
Browser result: a ­c (a non-breaking space followed by a soft hyphen)
&nbsp and &shy are recognized as the &nbsp; and &shy; HTML entities respectively.

jsoup parsed content: <p>a&nbsp;&amp;shyc</p>
Browser result: a &shyc
&nbsp is recognized (possibly because it is followed by an & character), but &shy is not recognized as &shy;. The browser shows a different result.


Case3

Input: <p>a&shyc&nbsp</p>
Browser result: a­c (a soft hyphen after "a", then a trailing non-breaking space)
&nbsp and &shy are recognized as the &nbsp; and &shy; HTML entities respectively.

jsoup parsed content: <p>a&amp;shyc&nbsp;</p>
Browser result: a&shyc
&nbsp is recognized (as the string ends with that entity), but &shy is not recognized as &shy;. The browser shows a different result.


After checking a few more cases, this issue appears only for named entities (like &nbsp;, &amp;, &quot;, and others) where the entity is not terminated with a semicolon and is immediately followed by letters. Hexadecimal and decimal numeric entities are detected even when they are not terminated with a semicolon.

Examples:
Detected properly, as expected:
&nbsp,ddd (expected &nbsp;,ddd and got the same result)
&nbsp ddd (expected &nbsp; ddd and got the same result)
djdjb&nbsp (expected djdjb&nbsp; and got the same result)

Invalid detections (ISSUES):
&nbspdhdj (expected &nbsp;dhdj but got &amp;nbspdhdj)
&ampdfgsj (expected &amp;dfgsj but got &amp;ampdfgsj)

Browsers detect these HTML entities. I also validated this with https://mothereff.in/html-entities.
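For contrast with the named-entity cases, the numeric forms mentioned above can be decoded without a semicolon by consuming digits greedily and stopping at the first non-digit. The following is a minimal sketch of that idea only; it is not jsoup's code. The class name `NumericReferenceSketch` and its `decode` method are hypothetical, and the sketch assumes simplified error handling (any unparsable or out-of-range value becomes U+FFFD) rather than the full set of rules in the HTML spec.

```java
final class NumericReferenceSketch {
    // Hypothetical helper: ASCII digit check for radix 10 or 16.
    private static boolean isDigit(char ch, int radix) {
        if (ch >= '0' && ch <= '9') return true;
        return radix == 16 && ((ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F'));
    }

    // Decodes &#NNN and &#xHHH references, with or without a trailing semicolon.
    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("&#", i)) {
                int pos = i + 2;
                int radix = 10;
                if (pos < input.length() && (input.charAt(pos) == 'x' || input.charAt(pos) == 'X')) {
                    radix = 16;
                    pos++;
                }
                int digitsStart = pos;
                // Greedily consume digits; the first non-digit ends the reference.
                while (pos < input.length() && isDigit(input.charAt(pos), radix)) pos++;
                if (pos > digitsStart) {
                    int codePoint;
                    try {
                        codePoint = Integer.parseInt(input.substring(digitsStart, pos), radix);
                    } catch (NumberFormatException e) {
                        codePoint = -1; // too many digits to fit in an int
                    }
                    // Simplification (assumption): anything invalid becomes U+FFFD.
                    if (codePoint <= 0 || !Character.isValidCodePoint(codePoint)) codePoint = 0xFFFD;
                    out.appendCodePoint(codePoint);
                    // The semicolon is optional; swallow it when present.
                    if (pos < input.length() && input.charAt(pos) == ';') pos++;
                    i = pos;
                    continue;
                }
            }
            // Not a numeric reference: copy the character through unchanged.
            out.append(input.charAt(i));
            i++;
        }
        return out.toString();
    }
}
```

Because digits form a closed set, the end of a numeric reference is unambiguous in the decimal case. A named reference followed by letters has no such natural boundary, which is why it needs a table lookup instead.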

Parser: HTML parser
Escape mode: Same result for both base and extended. In xhtml escape mode the nbsp entity is written as &#xa0;, but the result is otherwise the same.

I also raised a related question about entities: #2206

Muthukirthan changed the title from "🚨 Jsoup - Entities are not recognized properly and &shy is not treated like other entities" to "🚨 Jsoup - Html Entities are not recognized properly" on Nov 7, 2024
jhy changed the title from "🚨 Jsoup - Html Entities are not recognized properly" to "HTML Entities are not recognized properly" on Nov 21, 2024
jhy changed the title from "HTML Entities are not recognized properly" to "Non-terminated HTML Entities are not recognized properly" on Nov 21, 2024
Owner

jhy commented Nov 21, 2024

I guess the spec changed here a bit (https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state). jsoup actually used to work like that way back when. Will need to review.

Owner

jhy commented Nov 21, 2024

Here's the change from when we used to look for the longest matching prefix: a31ec08
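For readers unfamiliar with the approach, longest-matching-prefix lookup for named references in text content can be sketched roughly as below. This is an illustration under stated assumptions, not jsoup's implementation or the actual fix: the class `NamedReferenceSketch`, its `TABLE`, and `MAX_NAME` are hypothetical, the table holds only four names (the real table in the HTML spec is much larger), and the separate rule the spec applies inside attribute values is ignored.

```java
import java.util.Map;

final class NamedReferenceSketch {
    // Tiny illustrative subset of the named character reference table.
    // Legacy names are listed both with and without the trailing semicolon.
    private static final Map<String, String> TABLE = Map.of(
        "nbsp;", "\u00A0", "nbsp", "\u00A0",
        "shy;", "\u00AD", "shy", "\u00AD",
        "amp;", "&", "amp", "&",
        "quot;", "\"", "quot", "\""
    );

    // Length of the longest key in TABLE above ("nbsp;" and "quot;").
    private static final int MAX_NAME = 5;

    // Decodes named references in text content using longest-prefix matching.
    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '&') {
                String match = longestMatch(input, i + 1);
                if (match != null) {
                    out.append(TABLE.get(match));
                    i += 1 + match.length(); // skip '&' plus the matched name
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // Tries the longest candidate first and shrinks until a table entry matches.
    private static String longestMatch(String input, int start) {
        int maxEnd = Math.min(input.length(), start + MAX_NAME);
        for (int end = maxEnd; end > start; end--) {
            String candidate = input.substring(start, end);
            if (TABLE.containsKey(candidate)) return candidate;
        }
        return null; // no known name starts here; the '&' stays literal
    }

    public static void main(String[] args) {
        System.out.println(decode("&ampdfgsj")); // prints "&dfgsj"
    }
}
```

With this approach, `a&nbspc` from Case 1 decodes to "a", a non-breaking space, then "c", because `nbsp` is the longest table entry that prefixes `nbspc`. A production parser would more likely walk a sorted table or trie than probe substrings, but the matching rule is the same.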

jhy closed this as completed in 5ee376b on Nov 22, 2024
jhy added the fixed label on Nov 22, 2024
jhy added this to the 1.18.2 milestone on Nov 22, 2024
Owner

jhy commented Nov 22, 2024

Thanks for the clear report! Fixed.
