Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use LSS comparison for sort-alternatives #423

Merged
merged 4 commits into from
Apr 23, 2022
Merged

Use LSS comparison for sort-alternatives #423

merged 4 commits into from
Apr 23, 2022

Conversation

RunDevelopment
Copy link
Collaborator

This PR implements a char set comparison function that has the properties described in #420. It works by comparing two char sets by comparing the lists of characters accepted by the char sets. At least, that's what it behaves like. It's still implemented based on ranges.

That's what I implemented, but I'm not sure whether this is actually better than what we had. As we can see when looking at the tests, (?:sample|(?:un)?stable) is now sorted "correctly". However, there are a few examples where the order is now worse. E.g. (false|[ft]|true) used to be ordered as ([ft]|false|true) which was a lot better IMO.


Frankly, I'm starting to question whether sorting based on prefixes is even a good idea. It's (relatively) easy to do, but is it the best we can do?

How about this sorting method: Let a and b be the two alternatives we want to compare. Let w be the lexicographically smallest string that is accepted by either a or b but not both. If no such string exists, then a == b. If w is accepted by a, then a < b. If w is accepted by b, then a > b.

This sorting method will neatly order (?:sample|(?:un)?stable) and ([ft]|false|true). It's also more intuitive IMO.

The only problem with this sorting method is that it doesn't work. The "lexicographically smallest string" is only defined for finite sets of strings. I.e. What's the lexicographically smallest string for a+b? ab > aab > aaab > aaaaaaaaaaaab > ... I'm not sure how this could be fixed.

@ota-meshi
Copy link
Owner

Thank you for the detailed explanation.

I think it would be best if we could sort it in lexicographically as you say.

The only problem with this sorting method is that it doesn't work. The "lexicographically smallest string" is only defined for finite sets of strings. I.e. What's the lexicographically smallest string for a+b? ab > aab > aaab > aaaaaaaaaaaab > ... I'm not sure how this could be fixed.

What do you think about limiting the maximum number of repeats? (For example, 10 repeats)
I think it works for most of the regexes that are actually used.

@RunDevelopment
Copy link
Collaborator Author

What do you think about limiting the maximum number of repeats?

That's difficult to do right. E.g. how do we limit the repeats in ((((a+)+)+)+)+?


I was thinking about implementing this a little more. To find the lexicographically smallest string (LSS) in either a or b but not both, we would need to compute the (symmetric) difference of a and b. This is a O(2^n) operation with regexes and might be a little too slow for us.

So how about we use a hybrid method? I was thinking about this:

  1. Find the LSS for both a and b respectively. Let's call those strings a' and b'.
  2. If both a' and b' are defined/were computed successfully:
    1. If a' < b', then return a < b.
    2. If a' > b', then return a > b.
  3. Otherwise, compare using prefixes as implemented in this PR.

This has a bunch of nice properties:

  • If the LSS is defined, it's used in the comparison.
  • We don't have to compute any differences, so it's fast.
  • The LSS is computed per alternative, so it's easy to cache.
  • Two alternatives having the same LSS should be pretty rare in the wild, so we will likely not fallback to prefix-based comparisons too often.

However, there's a problem again: the comparison function isn't transitive. The problem is that based on LSS comparison, an alternative without an LSS is "equal" to all other alternatives. This can be fixed using a little trick: if the LSS of an alternative isn't defined, we compute the prefix and pick the LSS of the prefix. With this little trick, the comparison function becomes transitive.

I'll try implementing this.

@ota-meshi
Copy link
Owner

That's difficult to do right. E.g. how do we limit the repeats in ((((a+)+)+)+)+?

You are right. The maximum is not a good idea 😓.

The hybrid method sounds good to me 👍

@RunDevelopment RunDevelopment changed the title Use char sequences comparison for sort-alternatives Use LSS comparison for sort-alternatives Apr 22, 2022
@RunDevelopment RunDevelopment marked this pull request as ready for review April 22, 2022 11:43
@RunDevelopment
Copy link
Collaborator Author

RunDevelopment commented Apr 22, 2022

Alright, I implemented it.

Computing the LSS is fairly simple for NFAs, so I convert alternatives into NFAs using refa and then compute the LSS. Since this is fairly costly, I also implemented a "fast" path when the alternative is just a simple sequence of characters.

I tested this on Prism and the results are pretty good. They are mostly (~99%) the same as before, but it finally gets difficult cases like (?:sample|(?:un)?stable) right.

@RunDevelopment RunDevelopment added the enhancement New feature or request label Apr 22, 2022
Copy link
Owner

@ota-meshi ota-meshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing! Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants