Use LSS comparison for `sort-alternatives` #423

RunDevelopment · 2022-04-20T12:51:25Z

This PR implements a char set comparison function that has the properties described in #420. It works by comparing two char sets by comparing the lists of characters accepted by the char sets. At least, that's what it behaves like. It's still implemented based on ranges.

That's what I implemented, but I'm not sure whether this is actually better than what we had. As we can see when looking at the tests, (?:sample|(?:un)?stable) is now sorted "correctly". However, there are a few examples where the order is now worse. E.g. (false|[ft]|true) used to be ordered as ([ft]|false|true) which was a lot better IMO.

Frankly, I'm starting to question whether sorting based on prefixes is even a good idea. It's (relatively) easy to do, but is it the best we can do?

How about this sorting method: Let a and b be the two alternatives we want to compare. Let w be the lexicographically smallest string that is accepted by either a or b but not both. If no such string exists, then a == b. If w is accepted by a, then a < b. If w is accepted by b, then a > b.

This sorting method will neatly order (?:sample|(?:un)?stable) and ([ft]|false|true). It's also more intuitive IMO.

The only problem with this sorting method is that it doesn't work. The "lexicographically smallest string" is only defined for finite sets of strings. I.e. What's the lexicographically smallest string for a+b? ab > aab > aaab > aaaaaaaaaaaab > ... I'm not sure how this could be fixed.

ota-meshi · 2022-04-21T05:05:57Z

Thank you for the detailed explanation.

I think it would be best if we could sort it in lexicographically as you say.

The only problem with this sorting method is that it doesn't work. The "lexicographically smallest string" is only defined for finite sets of strings. I.e. What's the lexicographically smallest string for a+b? ab > aab > aaab > aaaaaaaaaaaab > ... I'm not sure how this could be fixed.

What do you think about limiting the maximum number of repeats? (For example, 10 repeats)
I think it works for most of the regexes that are actually used.

RunDevelopment · 2022-04-21T12:41:26Z

What do you think about limiting the maximum number of repeats?

That's difficult to do right. E.g. how do we limit the repeats in ((((a+)+)+)+)+?

I was thinking about implementing this a little more. To find the lexicographically smallest string (LSS) in either a or b but not both, we would need to compute the (symmetric) difference of a and b. This is a O(2^n) operation with regexes and might be a little too slow for us.

So how about we use a hybrid method? I was thinking about this:

Find the LSS for both a and b respectively. Let's call those strings a' and b'.
If both a' and b' are defined/were computed successfully:
1. If a' < b', then return a < b.
2. If a' > b', then return a > b.
Otherwise, compare using prefixes as implemented in this PR.

This has a bunch of nice properties:

If the LSS is defined, it's used in the comparison.
We don't have to compute any differences, so it's fast.
The LSS is computed per alternative, so it's easy to cache.
Two alternatives having the same LSS should be pretty rare in the wild, so we will likely not fallback to prefix-based comparisons too often.

However, there's a problem again: the comparison function isn't transitive. The problem is that based on LSS comparison, an alternative without an LSS is "equal" to all other alternatives. This can be fixed using a little trick: if the LSS of an alternative isn't defined, we compute the prefix and pick the LSS of the prefix. With this little trick, the comparison function becomes transitive.

I'll try implementing this.

ota-meshi · 2022-04-21T23:18:18Z

That's difficult to do right. E.g. how do we limit the repeats in ((((a+)+)+)+)+?

You are right. The maximum is not a good idea 😓.

The hybrid method sounds good to me 👍

RunDevelopment · 2022-04-22T11:53:01Z

Alright, I implemented it.

Computing the LSS is fairly simple for NFAs, so I convert alternatives into NFAs using refa and then compute the LSS. Since this is fairly costly, I also implemented a "fast" path when the alternative is just a simple sequence of characters.

I tested this on Prism and the results are pretty good. They are mostly (~99%) the same as before, but it finally gets difficult cases like (?:sample|(?:un)?stable) right.

ota-meshi

Amazing! Thank you

Use char sequences comparison for sort-alternatives

c288d3e

Implemented LSS-based comparison

873bb7e

RunDevelopment changed the title ~~Use char sequences comparison for sort-alternatives~~ Use LSS comparison for sort-alternatives Apr 22, 2022

RunDevelopment added 2 commits April 22, 2022 13:37

Fallback to prefix-based comparison

e867f67

Added a little more documentation

66dc675

RunDevelopment marked this pull request as ready for review April 22, 2022 11:43

RunDevelopment requested a review from ota-meshi April 22, 2022 11:43

RunDevelopment added the enhancement New feature or request label Apr 22, 2022

ota-meshi approved these changes Apr 23, 2022

View reviewed changes

ota-meshi merged commit cde355f into master Apr 23, 2022

ota-meshi deleted the issue420 branch April 23, 2022 01:05

RunDevelopment mentioned this pull request Apr 23, 2022

Change character set comparison function in sort-alternatives? #420

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use LSS comparison for `sort-alternatives` #423

Use LSS comparison for `sort-alternatives` #423

RunDevelopment commented Apr 20, 2022

ota-meshi commented Apr 21, 2022

RunDevelopment commented Apr 21, 2022

ota-meshi commented Apr 21, 2022

RunDevelopment commented Apr 22, 2022 •

edited

Loading

ota-meshi left a comment

Use LSS comparison for sort-alternatives #423

Use LSS comparison for sort-alternatives #423

Conversation

RunDevelopment commented Apr 20, 2022

ota-meshi commented Apr 21, 2022

RunDevelopment commented Apr 21, 2022

ota-meshi commented Apr 21, 2022

RunDevelopment commented Apr 22, 2022 • edited Loading

ota-meshi left a comment

Choose a reason for hiding this comment

Use LSS comparison for `sort-alternatives` #423

Use LSS comparison for `sort-alternatives` #423

RunDevelopment commented Apr 22, 2022 •

edited

Loading