ngSanitize can't escape all Unicode characters #5088

jtangelder · 2013-11-22T08:09:36Z

ngSanitize has a bug when escaping Unicode characters that fall outside the range charCodeAt can return for a single character; see the fiddle below. Removing this replace function fixes the problem, but I was wondering why it is done in the first place. Unicode characters can be placed inside documents safely, especially now that UTF-8 has become the standard charset.

http://jsfiddle.net/jtangelder/SQf7w/

angular.js/src/ngSanitize/sanitize.js

Line 370 in a7e12b7

replace(NON_ALPHANUMERIC_REGEXP, function(value){

I can do a PR if it's ok!
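
For readers following along, here is a minimal sketch of the failure mode. It is not the actual sanitize.js code: `NON_ASCII` and `naiveEncode` are hypothetical stand-ins for NON_ALPHANUMERIC_REGEXP and its callback, and the sketch assumes the callback encodes each match with `charCodeAt(0)`.

```javascript
// Hypothetical stand-in for the pattern in sanitize.js:
// matches anything outside printable ASCII.
const NON_ASCII = /[^\x20-\x7E]/g;

// Assumed behavior of the callback: encode each match as a numeric entity.
function naiveEncode(text) {
  // Without the `u` flag the regex matches one UTF-16 code unit at a time,
  // so a character outside the BMP produces two separate matches.
  return text.replace(NON_ASCII, function (value) {
    return '&#' + value.charCodeAt(0) + ';';
  });
}

const pileOfPoo = '\uD83D\uDCA9'; // U+1F4A9, stored as two code units

console.log(naiveEncode('\u00E9'));   // &#233; (fine, a single code unit)
console.log(naiveEncode(pileOfPoo));  // &#55357;&#56489; (two invalid entities)
```

The two entities in the last line each refer to a lone surrogate, which is not a valid character reference, so the browser renders garbage instead of the original character.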

IgorMinar · 2014-01-04T00:47:29Z

I can't think of a reason why we still need to do this.

@mhevery any idea?

I suggest that you send us a PR and we'll evaluate it there. At first glance it seems that it should be safe to remove, but we need to have a careful look at this before merging any change. In any case, this is a legitimate bug IMO, I'm just not sure of the constraints for the right solution.

mhevery · 2014-01-14T15:26:51Z

Is it possible to construct a valid and safe UTF-8 string that would be an unsafe string in another encoding? I'm not sure I would be that eager to remove this.

jtangelder · 2014-01-15T15:19:38Z

Another fix could be to check whether document.charset is UTF-8 or UTF-16... Maybe someone with more knowledge of charsets and encodings could take a look at this? That could address the issue @mhevery brings up.

anders · 2014-03-10T18:30:34Z

The problem is that JavaScript uses surrogate pairs; see: http://stackoverflow.com/questions/3744721/javascript-strings-outside-of-the-bmp
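
To make the surrogate pair point concrete, the sketch below lists the UTF-16 code units of a string and labels each one. `describeCodeUnits` is a hypothetical helper written for illustration only; it relies on nothing beyond `String.prototype.charCodeAt` and the standard surrogate ranges.

```javascript
// Hypothetical helper: classify each UTF-16 code unit in a string.
function describeCodeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    let kind = 'bmp';
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      kind = 'high surrogate';
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      kind = 'low surrogate';
    }
    units.push({ unit: unit, kind: kind });
  }
  return units;
}

const emoji = '\uD83D\uDE00'; // U+1F600, one visible character

console.log(emoji.length);             // 2
console.log(describeCodeUnits(emoji)); // a high surrogate followed by a low surrogate
```

A single visible character therefore has length 2, and `charCodeAt` can only ever hand back one half of it per call.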

memolog · 2014-03-24T15:02:30Z

Hi,

I ran into the same issue recently.
It's the surrogate pair issue that @anders mentioned above.

We could handle surrogate pairs before escaping Unicode characters, as in the following diff:
https://gist.github.com/memolog/79c88598b12d309368e5

See also http://mdn.beonex.com/en/Core_JavaScript_1.5_Reference/Global_Objects/String/charCodeAt.html

thanks!

The encodeEntities function encodes non-alphanumeric characters as entities using charCodeAt. charCodeAt does not return a single value for characters whose Unicode code point is higher than 65,535. Such a character is stored as a surrogate pair, and charCodeAt returns only one half of it per call, which is why emoji with higher code points are garbled. We need to handle them properly. Closes #5088 Closes #6911
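
The approach described in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not necessarily the merged code: `SURROGATE_PAIR`, `NON_ASCII`, and `encodeWithSurrogates` are hypothetical names, and `NON_ASCII` is a simplified stand-in for NON_ALPHANUMERIC_REGEXP.

```javascript
// Hypothetical patterns, assumed for illustration.
const SURROGATE_PAIR = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g; // high + low surrogate
const NON_ASCII = /[^\x20-\x7E]/g;                        // simplified stand-in

function encodeWithSurrogates(text) {
  return text
    // First pass: combine each surrogate pair into its real code point.
    .replace(SURROGATE_PAIR, function (pair) {
      const hi = pair.charCodeAt(0);
      const low = pair.charCodeAt(1);
      const codePoint = (hi - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
      return '&#' + codePoint + ';';
    })
    // Second pass: encode the remaining single code units as before.
    // Entities emitted by the first pass are plain ASCII and are left alone.
    .replace(NON_ASCII, function (value) {
      return '&#' + value.charCodeAt(0) + ';';
    });
}

console.log(encodeWithSurrogates('a\uD83D\uDCA9b')); // a&#128169;b
```

Running the surrogate pass first matters: once the pair has been split by the per-unit pass, the information needed to rebuild the code point is gone from the output.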

ghost assigned IgorMinar Jan 4, 2014

jtangelder added a commit to jtangelder/angular.js that referenced this issue Jan 15, 2014

fix for angular#5088

b705acf

jtangelder mentioned this issue Jan 15, 2014

fix(ngSanitize): removed encoding of non alphanumeric on textContent #5819

Closed

torhve mentioned this issue Mar 7, 2014

Encoding problems with high multibyte UTF-8 glowing-bear/glowing-bear#222

Closed

memolog mentioned this issue Mar 29, 2014

fix(ngSanitize): encode surrogate pair properly #6911

Closed

caitp closed this as completed in 627b035 May 2, 2014
