Skip to content
This repository has been archived by the owner on Apr 12, 2024. It is now read-only.

ngSanitize cant escape all unicode characters #5088

Closed
jtangelder opened this issue Nov 22, 2013 · 5 comments
Closed

ngSanitize cant escape all unicode characters #5088

jtangelder opened this issue Nov 22, 2013 · 5 comments

Comments

@jtangelder
Copy link
Contributor

ngSanitize has a bug escaping unicode chars that arent in the range of charCodeAt, see the fiddle below. Removing this replace function fixes the problem, but i was wondering why this is done in the first place. Unicode chars can be placed inside documents safely, especially since utf-8 became a standard charset these days.

http://jsfiddle.net/jtangelder/SQf7w/

replace(NON_ALPHANUMERIC_REGEXP, function(value){

I can do a PR if it's ok!

@ghost ghost assigned IgorMinar Jan 4, 2014
@IgorMinar
Copy link
Contributor

I can't think of a reason why we still need to do this.

@mhevery any idea?

I suggest that you send us a PR and we'll evaluate it there. At first glance it seems that it should be safe to remove, but we need to have a careful look at this before merging any change. In any case, this is a legitimate bug IMO, I'm just not sure of the constraints for the right solution.

@mhevery
Copy link
Contributor

mhevery commented Jan 14, 2014

is it possible that you could construct a valid and safe UTF8 string which would be an unsafe string in another encoding? Not sure I would be that eager to remove this.

@jtangelder
Copy link
Contributor Author

Another fix could be to check if document.charset is at UTF-8 or UTF-16... Maybe someone with more knowledge of charsets/encoding could take a look at this? This could fix the issues @mhevery brings up.

@anders
Copy link

anders commented Mar 10, 2014

The problem is JavaScript uses surrogate pairs, see: http://stackoverflow.com/questions/3744721/javascript-strings-outside-of-the-bmp

@memolog
Copy link
Contributor

memolog commented Mar 24, 2014

Hi,

I got the same issue recently.
It's the surrogate pairs issue as @anders mentioned in the above.

We could handle it before escaping unicode chars as the following diff
https://gist.github.com/memolog/79c88598b12d309368e5

see also http://mdn.beonex.com/en/Core_JavaScript_1.5_Reference/Global_Objects/String/charCodeAt.html

thanks!

caitp pushed a commit that referenced this issue May 2, 2014
The encodeEndities function encode non-alphanumeric characters to entities with charCodeAt.
charCodeAt does not return one value when their unicode codeponts is higher than 65,356.
It returns surrogate pair, and this is why the Emoji which has higher codepoints is garbled.
We need to handle them properly.

Closes #5088
Closes #6911
@caitp caitp closed this as completed in 627b035 May 2, 2014
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants