Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minify unicode-range: U+2600-26FF → U+26?? #344

Closed
wants to merge 1 commit into from
Closed

Minify unicode-range: U+2600-26FF → U+26?? #344

wants to merge 1 commit into from

Conversation

JRaspass
Copy link

@tdewolff Consider this my opening salvo, it works, but it's not very elegant nor efficient, I particularly don't like the copy(), any feedback would be much appreciated.

I've decided to just tackle the first part of #321 in this PR as it was simpler imo and both optimisations are fairly self contained.

Updates #321

Copy link
Owner

@tdewolff tdewolff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent first try! I've added some comments to the code, let me know what you think!

@@ -1201,6 +1201,35 @@ func (c *cssMinifier) minifyProperty(prop Hash, values []Token) []Token {
values[0].Data = oneBytes
values[0].Ident = 0
}
case UnicodeRange:
for i, value := range values {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be aware that value may be just a CommaToken which you may need to continue?

css/css.go Outdated
copy(data, value.Data)

// If we have a range of exactly two parts.
if parts := bytes.Split(data, []byte{'-'}); len(parts) == 2 {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove U+ first, for speed just check if the first byte is U and the second a +, then take a slice starting at 2 to skip them both. Next we have two cases: one single value or a range. Start looking byte for byte until we meet the end or -. Keep track of leading 0s so we trim those too. If this is a range, continue like you do.

css/css.go Outdated
parts[0] = bytes.TrimPrefix(parts[0], []byte("U+"))

// And both parts are the same length.
if len(parts[0]) == len(parts[1]) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a requirement as far as I can see

css/css.go Outdated
}

// If both parts now match we only need one.
if bytes.Equal(parts[0], parts[1]) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this check first: in reverse order check if the first part is a zero and the second part an F. If this is not the case, check that both bytes are equal, or the left side is a 0 or is shorter than the right side (e.g. U+0-04FF => U+4?? as far as I can see).

If you put this check first before setting the ?s, we can get rid of the copy().

@JRaspass
Copy link
Author

Take 2, I no longer have the copy(), it now handles a mixture of case in the U and the hexadecimal based on lowercase examples at https://www.w3.org/TR/css-fonts-3/#unicode-range-desc

I believe both parts do have to be the same length with the exception of a special case where the left side can be grown to be the same length, e.g. U+0-FFU+00-FFU+??, this PR doesn't handle that.

I also believe your explicit example is invalid and U+0-04FF cannot become U+4?? as that's equivalent to U+400-4FF.

Also there's a TODO test that I would like to resolve before this PR is merged but I don't yet see an elegant way to do that. It's not a bug, just a missed optimisation, it won't replace with question marks if there's no common prefix.

And finally I wasn't sure if the bitwise mask for case insensitive byte checking was too esoteric, if so

// Starts with "U+..." or "u+..."
if len(value.Data) >= 2 && value.Data[0]|32 == 'u' && value.Data[1] == '+' {

could become

// Starts with "U+..." or "u+..."
if bytes.HasPrefix(value.Data, []byte("U+")) || bytes.HasPrefix(value.Data, []byte("u+")) {

// U+2600-26FF → U+26??
for i, value := range values {
// Starts with "U+..." or "u+..."
if len(value.Data) >= 2 && value.Data[0]|32 == 'u' && value.Data[1] == '+' {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a personal style I have, but I prefer to write it the other way around, as in 1 < len(value.Data) or 2 <= len(value.Data). That way the left side is always smaller then the right side, which is more intuitive (e.g. when we write intervals). It's easier to process mentally, for me at least.

for i, value := range values {
// Starts with "U+..." or "u+..."
if len(value.Data) >= 2 && value.Data[0]|32 == 'u' && value.Data[1] == '+' {
hyphen := bytes.IndexByte(value.Data, '-')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to avoid function in bytes like these, as this makes it easier to write code non-optimally. For example, in this case you should parse from left to right and put the bytes in left. Then if you encounter -, stop parsing and put the remainder in right.

// Skip over "U+", we have a range of two same length parts.
left := value.Data[2:hyphen]
right := value.Data[hyphen+1:]
if len(left) != len(right) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you said, this is not supported yet, but as you parse from left to right like I explain above, you can keep track of leading zeros. Then you can check for lengths to be equal (ignoring leading zeros).

}

// Starting at the ends compare each part byte by byte.
for j := len(left); j > 0; j-- {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too do I prefer j < 0


// Starting at the ends compare each part byte by byte.
for j := len(left); j > 0; j-- {
if left[j-1] != '0' || right[j-1]|32 != 'f' {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a really smart way to check whether it's either upper or lower case!

// Starting at the ends compare each part byte by byte.
for j := len(left); j > 0; j-- {
if left[j-1] != '0' || right[j-1]|32 != 'f' {
if bytes.EqualFold(left[:j], right[:j]) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bit wasteful to check the entire length every time, the first character will be checked for the same thing many times!

@tdewolff
Copy link
Owner

The bitwise mask is excellent! Be aware that the alternative you put is pretty slow, you need to do two byte-slice allocations and two function calls!

I like where this is going ;-), I've left a few comments we can discuss, let me know how it goes.

@tdewolff tdewolff closed this in 75cef5c Mar 15, 2021
@tdewolff
Copy link
Owner

This has now been implemented

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants