Crusher splits Unicode high/low surrogate pairs, producing incorrect strings #50

Siorki · 2016-03-07T22:47:58Z

Tried to pack '🔥🔥🔥' (3x \uD83D\uDD25 ) -> crashes in getByteLength -> encodeURI. Ty if u find time to take a look

Run under Chrome and FF, both fail in getByteLength (regPack.js:123) on error "malformed URI sequence". The original JSCrush suffers the same issue.

magnitudoOrg · 2016-03-08T11:09:01Z

There are byteLength(str) solutions without use of encodeURI(), but that may not solve the problem as allowing unsupported Unicode sequences probably will cause problems on other parts of the source (the calling fn prob. should not produce these seq?).

However, answer 7 from fuweichin on Jan 21 2016 provides an impl. where you can change the line throw new Error("UCS-2 String malformed") and try out what happens if this is ignored.
(answer 2 from lovasoa Apr 27 2014 may look tempting but returns wrong length results for some Unicode chars)

http://stackoverflow.com/questions/5515869/string-length-in-bytes-in-javascript answer 7

//count UTF-8 bytes of a string
function byteLengthOf(s){
    //assuming the String is UCS-2(aka UTF-16) encoded
    var n=0;
    for(var i=0,l=s.length; i<l; i++){
        var hi=s.charCodeAt(i);
        if(hi<0x0080){ //[0x0000, 0x007F]
            n+=1;
        }else if(hi<0x0800){ //[0x0080, 0x07FF]
            n+=2;
        }else if(hi<0xD800){ //[0x0800, 0xD7FF]
            n+=3;
        }else if(hi<0xDC00){ //[0xD800, 0xDBFF]
            var lo=s.charCodeAt(++i);
            if(i<l&&lo>=0xDC00&&lo<=0xDFFF){ //followed by [0xDC00, 0xDFFF]
                n+=4;
            }else{
                throw new Error("UCS-2 String malformed");
            }
        }else if(hi<0xE000){ //[0xDC00, 0xDFFF]
            throw new Error("UCS-2 String malformed");
        }else{ //[0xE000, 0xFFFF]
            n+=3;
        }
    }
    return n;
}

var s="\u0000\u007F\u07FF\uD7FF\uDBFF\uDFFF\uFFFF";
console.log("expect byteLengthOf(s) to be 14, actually it is %s.",byteLengthOf(s));

Siorki · 2016-03-08T20:20:33Z

The issue only happens with Unicode surrogate pairs.

The crusher phase has no knowledge of Unicode constraints and is ignorant of high/low surrogate pairs, thus will not hesitate to break them (as does the naive string reverse function).

This is what happens with your example : the string \uDD25\uD83D (which is the pair making the "astral" character, reversed) is considered for the dictionary, but crashes getByteLength() because it is actually malformed.

Replacing encodeURI() avoids the crash but, as you mention, does not solve the underlying issue and may cause problems further down the line.

The solution would be to avoid dictionary strings starting in the middle of an astral character. Proposed fix, upon building the dictionary in regPack.js:150-161, if the current character is a high surrogate, skip the next one entirely.

Siorki · 2016-10-15T21:31:18Z

Possible solution in ES6 : iterate String with for .. of
From http://exploringjs.com/es6/ch_strings.html#ch_strings §6.4.1 :

for (const ch of 'x\uD83D\uDE80y') {
    console.log(ch.length);
}
// Output:
// 1
// 2
// 1

Siorki · 2017-01-31T20:29:00Z

Pushing this to 5.0 as an attempt to fix #64. Unicode characters may become more prevalent in js1k 2017 and the fix is quite easy. I initially wanted to wait for the overhaul of the crusher code (#7), which was postponed to 6.0.

… 0x10000) Tentative fix for #64 - supposed duplicate

Siorki added the bug label Mar 7, 2016

Siorki self-assigned this Mar 7, 2016

Siorki changed the title ~~Unsupported Unicode sequence in getByteLength()~~ Crusher splits Unicode high/low surrogate pairs, producing incorrect strings Mar 8, 2016

Siorki mentioned this issue Mar 8, 2016

Fixed base64 html output for non Latin1 chars #51

Closed

Siorki mentioned this issue Jan 30, 2017

Bug: Strings contains UTF8 break regpack #64

Closed

Siorki added this to the 5.0 milestone Jan 31, 2017

Siorki added a commit that referenced this issue Jan 31, 2017

#50 : do not split surrogate pair for astral characters (code point >…

7d2ab39

… 0x10000) Tentative fix for #64 - supposed duplicate

Siorki closed this as completed Jan 31, 2017

Siorki added a commit that referenced this issue Feb 20, 2017

Added #50 to v5.0.0 contents

b0bb249

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crusher splits Unicode high/low surrogate pairs, producing incorrect strings #50

Crusher splits Unicode high/low surrogate pairs, producing incorrect strings #50

Siorki commented Mar 7, 2016

magnitudoOrg commented Mar 8, 2016

Siorki commented Mar 8, 2016

Siorki commented Oct 15, 2016

Siorki commented Jan 31, 2017

Crusher splits Unicode high/low surrogate pairs, producing incorrect strings #50

Crusher splits Unicode high/low surrogate pairs, producing incorrect strings #50

Comments

Siorki commented Mar 7, 2016

magnitudoOrg commented Mar 8, 2016

Siorki commented Mar 8, 2016

Siorki commented Oct 15, 2016

Siorki commented Jan 31, 2017