Muthukirthan changed the title from "🚨 Jsoup - Entities are not recognized properly and `&shy;` is not treated like other entities" to "🚨 Jsoup - Html Entities are not recognized properly" on Nov 7, 2024

jhy changed the title from "🚨 Jsoup - Html Entities are not recognized properly" to "HTML Entities are not recognized properly" on Nov 21, 2024

jhy changed the title from "HTML Entities are not recognized properly" to "Non-terminated HTML Entities are not recognized properly" on Nov 21, 2024

Case1

Input: `<p>a&nbspc</p>`

Browser result: `a c` (`&nbsp` is recognized as the `&nbsp;` HTML entity)

Jsoup parsed content: `<p>a&amp;nbspc</p>`

Browser result: `a&nbspc` (`&nbsp` is not recognized, so the browser shows a different result)

Case2

Input: `<p>a&nbsp&shyc</p>`

Browser result: `a c` (`&nbsp` and `&shy` are recognized as the `&nbsp;` and `&shy;` HTML entities respectively)

Jsoup parsed content: `<p>a&nbsp;&amp;shyc</p>`

Browser result: `a &shyc` (`&nbsp` is recognized, possibly because of the `&` character that follows it, but `&shy` is not recognized as `&shy;`, so the browser shows a different result)

Case3

Input: `<p>a&shyc&nbsp</p>`

Browser result: `ac` followed by a non-breaking space (`&nbsp` and `&shy` are recognized as the `&nbsp;` and `&shy;` HTML entities respectively)

Jsoup parsed content: `<p>a&amp;shyc&nbsp;</p>`

Browser result: `a&shyc` followed by a non-breaking space (`&nbsp` is recognized because the string ends with that entity, but `&shy` is not recognized as `&shy;`, so the browser shows a different result)

On checking a few more cases, this issue is seen only for named entities (like `&nbsp;`, `&amp;`, `&quot;`, and others) where the entity is not terminated with a semicolon and is followed by letters. Hexadecimal and numeric entities are detected even if they are not terminated with a semicolon.

Examples:

Proper detection, as expected:

- `&nbsp,ddd` (expected `&nbsp;,ddd`, and got the same result)
- `&nbsp ddd` (expected `&nbsp; ddd`, and got the same result)
- `djdjb&nbsp` (expected `djdjb&nbsp;`, and got the same result)

Invalid detections (ISSUES):

- `&nbspdhdj` (expected `&nbsp;dhdj` but got `&amp;nbspdhdj`)
- `&ampdfgsj` (expected `&amp;dfgsj` but got `&amp;ampdfgsj`)

Browsers are able to detect these HTML entities. Validated with https://mothereff.in/html-entities as well.

Parser: HTML parser

Escape mode: Same result for both `base` and `extended`. The nbsp entity is replaced by `&#xa0;` in `xhtml` escape mode, but the result is the same.

I also raised this question related to entities: #2206
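
To make the expected browser behavior concrete, here is a minimal, self-contained sketch of how a decoder could accept named entities that lack a trailing semicolon even when letters follow. It is not jsoup's implementation: the class name `LegacyEntityDecoder`, the `decode` method, and the six-entry table are hypothetical and only for illustration. The sketch also assumes plain text content; browsers apply a stricter rule inside attribute values.

```java
import java.util.Map;

public class LegacyEntityDecoder {
    // Hypothetical, deliberately tiny subset of the named entities that
    // browsers accept without a terminating semicolon.
    private static final Map<String, String> LEGACY = Map.of(
            "nbsp", "\u00A0",
            "shy", "\u00AD",
            "amp", "&",
            "quot", "\"",
            "lt", "<",
            "gt", ">");

    // Decodes known named entities in text content, with or without ';'.
    public static String decode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char ch = text.charAt(i);
            if (ch == '&') {
                String name = longestMatch(text, i + 1);
                if (name != null) {
                    out.append(LEGACY.get(name));
                    i += 1 + name.length();
                    // The semicolon is optional: consume it only if present.
                    if (i < text.length() && text.charAt(i) == ';') {
                        i++;
                    }
                    continue;
                }
            }
            // Not an entity we know: copy the character through unchanged.
            out.append(ch);
            i++;
        }
        return out.toString();
    }

    // Returns the longest known entity name starting at 'start', or null.
    private static String longestMatch(String text, int start) {
        String best = null;
        for (String name : LEGACY.keySet()) {
            if (text.startsWith(name, start)
                    && (best == null || name.length() > best.length())) {
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(decode("&ampdfgsj")); // prints &dfgsj
    }
}
```

With this rule, `a&nbspc` from Case1 decodes to `a`, a non-breaking space, and `c`, which matches the browser result reported above.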