-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use LSS comparison for sort-alternatives
#423
Conversation
Thank you for the detailed explanation. I think it would be best if we could sort it in lexicographically as you say.
What do you think about limiting the maximum number of repeats? (For example, 10 repeats) |
That's difficult to do right. E.g. how do we limit the repeats in I was thinking about implementing this a little more. To find the lexicographically smallest string (LSS) in either So how about we use a hybrid method? I was thinking about this:
This has a bunch of nice properties:
However, there's a problem again: the comparison function isn't transitive. The problem is that based on LSS comparison, an alternative without an LSS is "equal" to all other alternatives. This can be fixed using a little trick: if the LSS of an alternative isn't defined, we compute the prefix and pick the LSS of the prefix. With this little trick, the comparison function becomes transitive. I'll try implementing this. |
You are right. The maximum is not a good idea 😓. The hybrid method sounds good to me 👍 |
sort-alternatives
sort-alternatives
Alright, I implemented it. Computing the LSS is fairly simple for NFAs, so I convert alternatives into NFAs using refa and then compute the LSS. Since this is fairly costly, I also implemented a "fast" path when the alternative is just a simple sequence of characters. I tested this on Prism and the results are pretty good. They are mostly (~99%) the same as before, but it finally gets difficult cases like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing! Thank you
This PR implements a char set comparison function that has the properties described in #420. It works by comparing two char sets by comparing the lists of characters accepted by the char sets. At least, that's what it behaves like. It's still implemented based on ranges.
That's what I implemented, but I'm not sure whether this is actually better than what we had. As we can see when looking at the tests,
(?:sample|(?:un)?stable)
is now sorted "correctly". However, there are a few examples where the order is now worse. E.g.(false|[ft]|true)
used to be ordered as([ft]|false|true)
which was a lot better IMO.Frankly, I'm starting to question whether sorting based on prefixes is even a good idea. It's (relatively) easy to do, but is it the best we can do?
How about this sorting method: Let
a
andb
be the two alternatives we want to compare. Letw
be the lexicographically smallest string that is accepted by eithera
orb
but not both. If no such string exists, thena == b
. Ifw
is accepted bya
, thena < b
. Ifw
is accepted byb
, thena > b
.This sorting method will neatly order
(?:sample|(?:un)?stable)
and([ft]|false|true)
. It's also more intuitive IMO.The only problem with this sorting method is that it doesn't work. The "lexicographically smallest string" is only defined for finite sets of strings. I.e. What's the lexicographically smallest string for
a+b
?ab
>aab
>aaab
>aaaaaaaaaaaab
> ... I'm not sure how this could be fixed.