Minify unicode-range: U+2600-26FF → U+26?? #344

JRaspass · 2020-10-17T09:02:29Z

@tdewolff Consider this my opening salvo. It works, but it's neither very elegant nor efficient, and I particularly don't like the copy(). Any feedback would be much appreciated.

I've decided to tackle only the first part of #321 in this PR, as it was simpler in my opinion and both optimisations are fairly self-contained.

Updates #321

tdewolff

Excellent first try! I've added some comments to the code, let me know what you think!

tdewolff · 2020-10-17T15:04:19Z

css/css.go

@@ -1201,6 +1201,35 @@ func (c *cssMinifier) minifyProperty(prop Hash, values []Token) []Token {
 			values[0].Data = oneBytes
 			values[0].Ident = 0
 		}
+	case UnicodeRange:
+		for i, value := range values {


Be aware that value may be just a CommaToken, in which case you may need to continue to the next value.

tdewolff · 2020-10-17T15:10:44Z

css/css.go

+			copy(data, value.Data)
+
+			// If we have a range of exactly two parts.
+			if parts := bytes.Split(data, []byte{'-'}); len(parts) == 2 {


I would remove U+ first: for speed, just check that the first byte is U and the second is +, then take a slice starting at index 2 to skip them both. Next we have two cases: a single value or a range. Scan byte by byte until we reach the end or a -. Keep track of leading 0s so we trim those too. If this is a range, continue as you do now.
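The suggestion above can be sketched roughly as follows. This is only an illustration, not the code from the PR: splitRange and trimZeros are hypothetical names, the input is assumed to be a well-formed unicode-range token without wildcards, and the leading-zero trimming lives in a small helper for readability rather than being folded into the same loop as the reviewer proposes.

```go
package main

// trimZeros drops leading '0' bytes but always keeps at least one digit.
func trimZeros(b []byte) []byte {
	i := 0
	for i+1 < len(b) && b[i] == '0' {
		i++
	}
	return b[i:]
}

// splitRange is a hypothetical helper (an assumption, not the PR's code).
// It checks for a leading "U+" or "u+", skips it, then scans from left to
// right and splits at the first '-'. For a single value, right is nil.
func splitRange(data []byte) (left, right []byte, ok bool) {
	if len(data) < 3 || data[0]|32 != 'u' || data[1] != '+' {
		return nil, nil, false
	}
	data = data[2:] // skip "U+"
	for i := 0; i < len(data); i++ {
		if data[i] == '-' {
			return trimZeros(data[:i]), trimZeros(data[i+1:]), true
		}
	}
	return trimZeros(data), nil, true
}
```

Under these assumptions, splitRange([]byte("U+0026-00FF")) returns the parts 26 and FF, and both are subslices of the input, so nothing is copied.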

tdewolff · 2020-10-17T15:11:23Z

css/css.go

+				parts[0] = bytes.TrimPrefix(parts[0], []byte("U+"))
+
+				// And both parts are the same length.
+				if len(parts[0]) == len(parts[1]) {


This isn't a requirement, as far as I can see.

tdewolff · 2020-10-17T15:14:34Z

css/css.go

+					}
+
+					// If both parts now match we only need one.
+					if bytes.Equal(parts[0], parts[1]) {


Put this check first: in reverse order, check whether the byte in the first part is a 0 and the byte in the second part is an F. If this is not the case, check that both bytes are equal, or that the left side is a 0 or is shorter than the right side (e.g. U+0-04FF => U+4?? as far as I can see).

If you put this check first before setting the ?s, we can get rid of the copy().
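A minimal sketch of that ordering, assuming both parts have the same length, contain only hex digits, and already have the U+ removed. The name toWildcard is hypothetical and this is not the PR's code. Because every check runs before any byte is written, left can be rewritten in place and no copy() is needed.

```go
package main

// toWildcard is a hypothetical helper (an assumption, not the PR's code).
// It reports whether left-right can be written with trailing '?' wildcards
// and, only if so, overwrites the tail of left in place.
func toWildcard(left, right []byte) ([]byte, bool) {
	if len(left) != len(right) || len(left) == 0 {
		return nil, false
	}
	// Walk from the end while left has '0' and right has 'F' or 'f'.
	j := len(left)
	for 0 < j && left[j-1] == '0' && right[j-1]|32 == 'f' {
		j--
	}
	if j == len(left) {
		return nil, false // no trailing 0/F pair, nothing to replace
	}
	// The remaining prefixes must match; |32 folds A-F onto a-f and leaves
	// the digits 0-9 unchanged, so this is only valid for hex digits.
	for i := 0; i < j; i++ {
		if left[i]|32 != right[i]|32 {
			return nil, false
		}
	}
	// All checks passed, so it is now safe to modify left.
	for i := j; i < len(left); i++ {
		left[i] = '?'
	}
	return left, true
}
```

With these assumptions, the parts 2600 and 26FF become 26??, while 2600 and 27FF are rejected and left stays untouched. The prefix is also compared once in total rather than once per trailing digit.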

JRaspass · 2020-10-18T14:23:43Z

Take 2. I no longer have the copy(), and it now handles mixed case in both the U and the hexadecimal digits, based on the lowercase examples at https://www.w3.org/TR/css-fonts-3/#unicode-range-desc

I believe both parts do have to be the same length, with the exception of a special case where the left side can be grown to the same length, e.g. U+0-FF → U+00-FF → U+??. This PR doesn't handle that.

I also believe your explicit example is invalid: U+0-04FF cannot become U+4??, as that's equivalent to U+400-4FF.

Also, there's a TODO test that I would like to resolve before this PR is merged, but I don't yet see an elegant way to do that. It's not a bug, just a missed optimisation: it won't replace with question marks if there's no common prefix.

And finally, I wasn't sure if the bitwise mask for case-insensitive byte checking was too esoteric. If so,
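To make the equivalence concrete, here is a small sketch that computes the bounds a wildcard covers by replacing each ? with 0 for the lower bound and with F for the upper bound. The helper name wildcardBounds is hypothetical and it takes the part after U+.

```go
package main

import (
	"strconv"
	"strings"
)

// wildcardBounds is a hypothetical helper (an assumption for illustration).
// It returns the first and last code point covered by a wildcard such as
// "4??", which is the text that follows "U+".
func wildcardBounds(w string) (lo, hi uint64, err error) {
	lo, err = strconv.ParseUint(strings.ReplaceAll(w, "?", "0"), 16, 32)
	if err != nil {
		return 0, 0, err
	}
	hi, err = strconv.ParseUint(strings.ReplaceAll(w, "?", "F"), 16, 32)
	if err != nil {
		return 0, 0, err
	}
	return lo, hi, nil
}
```

wildcardBounds("4??") gives 0x400 and 0x4FF, so U+4?? starts at U+400 and does not cover U+0 to U+3FF, whereas wildcardBounds("??") gives 0x0 and 0xFF, which matches U+0-FF.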

// Starts with "U+..." or "u+..."
if len(value.Data) >= 2 && value.Data[0]|32 == 'u' && value.Data[1] == '+' {

could become

// Starts with "U+..." or "u+..."
if bytes.HasPrefix(value.Data, []byte("U+")) || bytes.HasPrefix(value.Data, []byte("u+")) {
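For readers unfamiliar with the mask: 32 (0x20) is the bit that separates upper- and lowercase ASCII letters, so 'U'|32 and 'u'|32 both equal 'u', and no other byte value does. The sketch below puts the two variants side by side as standalone functions; the function names are hypothetical and only serve the comparison.

```go
package main

import "bytes"

// hasPrefixMask uses the bitwise mask: setting bit 0x20 maps 'U' (0x55)
// onto 'u' (0x75), and only those two byte values produce 'u'.
func hasPrefixMask(data []byte) bool {
	return 2 <= len(data) && data[0]|32 == 'u' && data[1] == '+'
}

// hasPrefixBytes is the more explicit alternative from the comment above.
func hasPrefixBytes(data []byte) bool {
	return bytes.HasPrefix(data, []byte("U+")) || bytes.HasPrefix(data, []byte("u+"))
}
```

Both functions agree for every possible first byte, so the choice between them is about readability and cost rather than behaviour.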

tdewolff · 2020-10-19T21:32:54Z

css/css.go

+		// U+2600-26FF → U+26??
+		for i, value := range values {
+			// Starts with "U+..." or "u+..."
+			if len(value.Data) >= 2 && value.Data[0]|32 == 'u' && value.Data[1] == '+' {


This is a personal style of mine, but I prefer to write it the other way around, as in 1 < len(value.Data) or 2 <= len(value.Data). That way the left side is always smaller than the right side, which is more intuitive (e.g. when we write intervals). It's easier to process mentally, for me at least.

tdewolff · 2020-10-19T21:35:41Z

css/css.go

+		for i, value := range values {
+			// Starts with "U+..." or "u+..."
+			if len(value.Data) >= 2 && value.Data[0]|32 == 'u' && value.Data[1] == '+' {
+				hyphen := bytes.IndexByte(value.Data, '-')


Try to avoid functions from the bytes package like these, as they make it easy to write non-optimal code. For example, in this case you should parse from left to right and put the bytes in left. Then, if you encounter a -, stop parsing and put the remainder in right.

tdewolff · 2020-10-19T21:38:44Z

css/css.go

+				// Skip over "U+", we have a range of two same length parts.
+				left := value.Data[2:hyphen]
+				right := value.Data[hyphen+1:]
+				if len(left) != len(right) {


As you said, this is not supported yet, but if you parse from left to right as I explained above, you can keep track of leading zeros. Then you can check for the lengths to be equal (ignoring leading zeros).
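One possible way to cover the unequal-length case, offered only as a sketch: treat the shorter left side as if it were padded with zeros up to the length of the right side, without allocating. The names leftDigit and wildcardCount are hypothetical, and the code assumes both parts hold only hex digits and that superfluous leading zeros were already trimmed.

```go
package main

// leftDigit returns the digit of left that lines up with index i of right
// when both are aligned at their last digit. Missing digits read as '0'.
func leftDigit(left []byte, rightLen, i int) byte {
	off := rightLen - len(left)
	if i < off {
		return '0'
	}
	return left[i-off]
}

// wildcardCount is a hypothetical helper (an assumption, not the PR's code).
// It returns how many trailing digits of the range can become '?'.
func wildcardCount(left, right []byte) (int, bool) {
	if len(right) < len(left) || len(right) == 0 {
		return 0, false
	}
	// Count trailing positions where left is '0' and right is 'F' or 'f'.
	j := len(right)
	for 0 < j && leftDigit(left, len(right), j-1) == '0' && right[j-1]|32 == 'f' {
		j--
	}
	if j == len(right) {
		return 0, false
	}
	// The remaining prefix must be equal, ignoring hex digit case.
	for i := 0; i < j; i++ {
		if leftDigit(left, len(right), i)|32 != right[i]|32 {
			return 0, false
		}
	}
	return len(right) - j, true
}
```

Under these assumptions, the parts 0 and FF yield two wildcards (U+??), while 0 and 4FF are rejected because the virtual prefix 0 differs from 4.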

tdewolff · 2020-10-19T21:42:48Z

css/css.go

+				}
+
+				// Starting at the ends compare each part byte by byte.
+				for j := len(left); j > 0; j-- {


Here too I prefer 0 < j.

tdewolff · 2020-10-19T21:43:53Z

css/css.go

+
+				// Starting at the ends compare each part byte by byte.
+				for j := len(left); j > 0; j-- {
+					if left[j-1] != '0' || right[j-1]|32 != 'f' {


This is actually a really smart way to check whether it's either upper or lower case!

tdewolff · 2020-10-19T21:44:28Z

css/css.go

+				// Starting at the ends compare each part byte by byte.
+				for j := len(left); j > 0; j-- {
+					if left[j-1] != '0' || right[j-1]|32 != 'f' {
+						if bytes.EqualFold(left[:j], right[:j]) {


This is a bit wasteful: it checks the entire prefix every time, so the first character will be compared many times!

tdewolff · 2020-10-19T21:47:01Z

The bitwise mask is excellent! Be aware that the alternative you put is pretty slow: it needs two byte-slice allocations and two function calls!

I like where this is going ;-), I've left a few comments we can discuss, let me know how it goes.

tdewolff · 2021-03-15T19:35:58Z

This has now been implemented.

tdewolff reviewed Oct 17, 2020

Minify unicode-range: U+2600-26FF → U+26??

4747db3

Updates #321

tdewolff reviewed Oct 19, 2020

tdewolff closed this in 75cef5c Mar 15, 2021

JRaspass mentioned this pull request Mar 16, 2021

unicode-range could be minified #321

Closed
