Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crusher splits Unicode high/low surrogate pairs, producing incorrect strings #50

Closed
Siorki opened this issue Mar 7, 2016 · 4 comments
Closed
Assignees
Labels
Milestone

Comments

@Siorki
Copy link
Owner

Siorki commented Mar 7, 2016

Tweet from @magnitudoOrg :

Tried to pack '🔥🔥🔥' (3x \uD83D\uDD25 ) -> crashes in getByteLength -> encodeURI. Ty if u find time to take a look

Run under Chrome and FF, both fail in getByteLength (regPack.js:123) on error "malformed URI sequence". The original JSCrush suffers the same issue.

@Siorki Siorki added the bug label Mar 7, 2016
@Siorki Siorki self-assigned this Mar 7, 2016
@magnitudoOrg
Copy link

There are byteLength(str) solutions without use of encodeURI(), but that may not solve the problem as allowing unsupported Unicode sequences probably will cause problems on other parts of the source (the calling fn prob. should not produce these seq?).

However, answer 7 from fuweichin on Jan 21 2016 provides an impl. where you can change the line throw new Error("UCS-2 String malformed") and try out what happens if this is ignored.
(answer 2 from lovasoa Apr 27 2014 may look tempting but returns wrong length results for some Unicode chars)

http://stackoverflow.com/questions/5515869/string-length-in-bytes-in-javascript answer 7

//count UTF-8 bytes of a string
function byteLengthOf(s){
    //assuming the String is UCS-2(aka UTF-16) encoded
    var n=0;
    for(var i=0,l=s.length; i<l; i++){
        var hi=s.charCodeAt(i);
        if(hi<0x0080){ //[0x0000, 0x007F]
            n+=1;
        }else if(hi<0x0800){ //[0x0080, 0x07FF]
            n+=2;
        }else if(hi<0xD800){ //[0x0800, 0xD7FF]
            n+=3;
        }else if(hi<0xDC00){ //[0xD800, 0xDBFF]
            var lo=s.charCodeAt(++i);
            if(i<l&&lo>=0xDC00&&lo<=0xDFFF){ //followed by [0xDC00, 0xDFFF]
                n+=4;
            }else{
                throw new Error("UCS-2 String malformed");
            }
        }else if(hi<0xE000){ //[0xDC00, 0xDFFF]
            throw new Error("UCS-2 String malformed");
        }else{ //[0xE000, 0xFFFF]
            n+=3;
        }
    }
    return n;
}

var s="\u0000\u007F\u07FF\uD7FF\uDBFF\uDFFF\uFFFF";
console.log("expect byteLengthOf(s) to be 14, actually it is %s.",byteLengthOf(s));

@Siorki
Copy link
Owner Author

Siorki commented Mar 8, 2016

The issue only happens with Unicode surrogate pairs.

The crusher phase has no knowledge of Unicode constraints and is ignorant of high/low surrogate pairs, thus will not hesitate to break them (as does the naive string reverse function).

This is what happens with your example : the string \uDD25\uD83D (which is the pair making the "astral" character, reversed) is considered for the dictionary, but crashes getByteLength() because it is actually malformed.

Replacing encodeURI() avoids the crash but, as you mention, does not solve the underlying issue and may cause problems further down the line.

The solution would be to avoid dictionary strings starting in the middle of an astral character. Proposed fix, upon building the dictionary in regPack.js:150-161, if the current character is a high surrogate, skip the next one entirely.

@Siorki Siorki changed the title Unsupported Unicode sequence in getByteLength() Crusher splits Unicode high/low surrogate pairs, producing incorrect strings Mar 8, 2016
@Siorki
Copy link
Owner Author

Siorki commented Oct 15, 2016

Possible solution in ES6 : iterate String with for .. of
From http://exploringjs.com/es6/ch_strings.html#ch_strings §6.4.1 :

for (const ch of 'x\uD83D\uDE80y') {
    console.log(ch.length);
}
// Output:
// 1
// 2
// 1

@Siorki Siorki added this to the 5.0 milestone Jan 31, 2017
@Siorki
Copy link
Owner Author

Siorki commented Jan 31, 2017

Pushing this to 5.0 as an attempt to fix #64. Unicode characters may become more prevalent in js1k 2017 and the fix is quite easy. I initially wanted to wait for the overhaul of the crusher code (#7), which was postponed to 6.0.

Siorki added a commit that referenced this issue Jan 31, 2017
… 0x10000)

Tentative fix for #64 - supposed duplicate
@Siorki Siorki closed this as completed Jan 31, 2017
Siorki added a commit that referenced this issue Feb 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants