-
-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement UAX31's quick check algorithm #16383
Conversation
Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please see CONTRIBUTING.md for more information. If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment. Bugzilla referencesYour PR doesn't reference any Bugzilla issue. If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog. Testing this PR locallyIf you don't have a local development environment setup, you can use Digger to test this PR: dub run digger -- build "master + dmd#16383" |
b0f7ebd
to
a44fc2a
Compare
GCC was very helpful in erroring out with ImportC test.
Had to disable silently accept there. If we had a way of knowing that the CPP is GCC, we'd be able to turn this warning off and therefore turn the warning back on. |
0a76991
to
039666a
Compare
I have no idea why Nor is it doing it in my benchmark testsuite which verifies it against the old search algorithm. |
1efeb4f
to
f07a676
Compare
Oh hello there, so the buckets aren't being computed... Table itself is ok, but not the computation of the bucket offets. |
0eaae20
to
d29f012
Compare
Looks like there is a bug some place regarding signed vs unsigned integer comparisons at CTFE.
Makes sense to be getting that, preprocessor runs before us. Gonna have to disable it as well, not much I can do without knowing which preprocessor it is. |
6f5972e
to
aa04be1
Compare
Looks like there is a bug hiding out for the It'll need to be fixed prior to merging. But that is not a problem for today. |
auto sNormalized = toNFC(s); | ||
if (sNormalized in stringtable) | ||
{ | ||
free(sNormalized); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mem.xfree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exact function doesn't matter it's pseudo code.
What I'm worried about is whether the string interner stringtable
is capable of handling the pattern of storing unnormalized as key to normalized value.
Write an identifier one way 100 times, which should result in normalization only the 1 time if it wasn't normalized.
This is something I need feedback on before I start next PR in a few months time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stringtable uses a hash of the bytes in the identifier. It does not attempt normalization. Doing that would slow it way down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that isn't what I am suggesting.
The stringtable is only responsible for storing key to value map. Unlike currently I would need to make the key be a different string to the value.
Turns out this is as simple as adding an extra parameter (key) which gets sent to the hash instead of the value parameter.
The quick check algorithm quickly determines if normalization needs to occur for Unicode, for ASCII its always off.
idPoolNormalize
then takes whether it is currently normalized and performs the normalization calling into the stringtable as describe in the pseudo code.
I do believe this is all solved for next PR.
@@ -639,6 +639,26 @@ dmd -cov -unittest myprog.d | |||
`Turns off all array bounds checking, even for safe functions. $(RED Deprecated | |||
(use $(TT $(SWLINK -boundscheck)=off) instead).)`, | |||
), | |||
Option("normalization=<strategy>", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need all these strategies? Doesn't C23 specify behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C23 does not specify. Neither does UAX31, as long as we handle it in some stated way (or pickable) we are good.
We don't need the error case as that is already handled with warning as error. We also don't need to expose deprecation externally.
Next PR would be normalization that would get exposed (currently internal only) and later made the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pick the simplest way and leave it at that. Why should we do normalization at all? I'm beginning to like Nim's strategy of every non-ASCII code unit is an identifier character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't handle normalization in some form a D identifier may look the same, but won't actually match to a C identifier.
Erroring out basically requires a subset of the normalization process except less well-documented. I haven't implemented the full algorithm, only enough to decrease the cost of normalization to "don't worry about it" levels.
Performing normalization autocorrects, and the pseudo code in idPoolNormalize
means you only have to do it once per string representation.
Note: just having a larger table doesn't handle normalization, as long as we accept Unicode it will always be able to be an issue for someone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking at Nim's documentation.
Case insensitive comparisons of identifiers: https://nim-lang.org/docs/manual.html#lexical-analysis-identifier-equality
So that would be a "fun" idea to propose.
Anyway, Nim transcompiles to C or C++. We don't.
All their identifiers get verified by the C/C++ compiler, which also means in practice they are defined against UAX31 already with normalization (erroring). Even if their language manual doesn't.
We would not be copying them if we changed our tables. They still have a lot more logic going on than we have had.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we don't do normalization. If a user needs it, he can run a filter over his source files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot recommend this.
You have to implement a check for normalization, and it has to be a complete isNFC
check to be C23 compliant, unless you want code to not compile that is required to compile.
From C23:
C requires that identifiers be in Normalization Form C and therefore identifiers that compare the
same under NFC are equivalent. This is described in 6.4.2.
To detect if you are in form C requires that you basically perform normalization, except you do not store the result. https://unicode.org/reports/tr15/#Detecting_Normalization_Forms
That includes having all the same tables as you would've had otherwise. Not a great tradeoff. Especially when you consider code formatters should be doing it only on identifiers anyway.
|
||
Unless you use non-ASCII characters, this change will not affect you. | ||
|
||
Due to bug [24507](https://issues.dlang.org/show_bug.cgi?id=24507) if you use GCC as your preprocessor with any identifier that is not normalized to form C, by default it will produce a warning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just support C23 and leave it at that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are, the problem is GCC is picking different behaviors on failure that we want to override.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the rest of the text, normalization form C is the big breaking change that people need to be aware of in this PR.
Which is what C23 is defined against.
D.5 Equivalent Normalized Identifiers
1 UAX #31 requires that implementations describe how identifiers are compared and considered
equivalent.
2 C requires that identifiers be in Normalization Form C and therefore identifiers that compare the
same under NFC are equivalent. This is described in 6.4.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be surprised if anyone even uses normalized identifiers. If we support various incompatible normalized identifier schemes, then the user will be creating code that won't link with other ImportC code.
I suggest pick the C23 scheme and not do any of these other bug accommodations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does gcc have some flag to pick different schemes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. They can pick between a few, but the important ones are C and KC.
I do not want to offer any other normalization forms than C to a user.
D form is easily done once you've got C (it's a subset) so I'll write the wrapper function for it during the next PR that adds the normalization function itself.
But the KC/KD require extra tables, while I'm not against seeing them added I don't see a reason for dmd to need it.
The compiler switches introduced here are to pick what to do when it is not normalized.
Right now that is to accept it as-is (with varying levels of errorness) and later that would be to normalize it.
Without this, I cannot verify the quick check algorithm as part of the test suite currently.
The goal of the switches is to give a ± 10 release window when people can transition forward and backward as required.
That window can be changed, but I believe that may work well so as to try and get one GDC release on either side of changing defaults to C23.
We may not keep any of these switches long-term.
Before removing them happens ImportC should be defined against C23 and we need to transition everyone over to C23.
Once that happens picking tables becomes unnecessary and can all be cleaned up.
But until then to switch 100% over right now would be a breaking change without a transition backwards hence the need for switches.
Unless of course, you're telling me to do the hard break, at which point I'll do a different PR removing all that infrastructure.
Iain may have opinions on switches, as well as if we support NFKC. So we do need to hear from him.
All the Unicode tables got a reformat to be more compact when it is generated. Intellij really wasn't happy with the significant line count with only one entry per line. Pairs now have 4, otherwise it's 8 values per line. Now it even syntax highlights! |
6c0655b
to
521c6b1
Compare
Ignoring coverage upload, CI is green. However, I've remembered that the table generator must not emit entries that span multiple planes. They must be split. |
521c6b1
to
58adbd2
Compare
Right fixed, keeping all the planes separate, actually made the index table for CCC smaller. |
Adding some more test cases would probably be a good idea for the tables. Could probably generate the test cases with the help of the table generator, will do that another day, but it is getting closer I think. |
58adbd2
to
fbc02c1
Compare
Tests for all tables start/middle/end against their indexes have been added. Please review @WalterBright, scope should be met for this PR now and I've tried to do everything you wanted so if I haven't I need more clarification. |
fbc02c1
to
a073763
Compare
Right, we are at an impasse and it is not being resolved for 2.109. @WalterBright since you don't like my solution I'll let you figure out what you want to implement instead of normalization like I wanted to do. A bug fix that was included in this has been split out into #16410 it must go into 2.109 otherwise the ImportC character table switch won't work. |
@rikkimax I still like the 2000 line reduction in code! |
Unfortunately, that was just a reformatting! You are welcome to PR anything you do like out of this PR, the branch will stay in case you change your mind as it was complete and passing CI. |
Thank you, @rikkimax ! I really do appreciate the time and effort you have put into this. Yes, I would like to keep the branch around in case I'm wrong (although I have yet to be wrong about anything :-) but ya never know! ). |
@rikkimax I'd like to see this added as a library function! |
Looks like std.uni is missing a Contrary to what it may appear, my focus upstream is more on foundational stabilization of D than niceties like this function. Shared library support, identifiers, memory safety + temporal, reference counting, things like that are more important to me atm. |
I sure hate to not use all the work you've put into this. It sure would be ideal as a library function. |
It's not a huge loss, other parts of the PR will get merged in eventually. The improvements to the search algorithm for bucketing will be a good speed up, unfortunately, the increased table sizes gave a bit of a performance degradation. I'll do it if you don't want to probably for a point release. For now, I have my own stuff to work on as well as the owner escape analysis DIP, which has come along quite nicely. Already heard positive things about it, and Bruce was able to finally understand what DIP1000 is trying to do! He is quite keen to talk to you about it at next meetup! |
If you really want to see some of my work not go to waste, why not have a look at my member of operator PR? This unblocks my design for sum types and solves the zero size struct problem for value type exceptions (aka zero size struct exceptions ala C++ proposal). |
PR number 2.
A few changes from the original approach:
Fibonaccian search unfortunately did not win out in my benchmarks for what I tried and it required a lot of extra ROM for both the standard algorithms.
Two things I want to draw your attention to @WalterBright:
acquireIdentifierFromString
function. I'm going to guess that you won't like this function existing, I don't. But I couldn't find a way to merge the pathways wrt. universal character names + D identifiers. So it had to be split out.idPoolNormalize
, there is pseudo code for what that function needs to do oncetoNFC
has been implemented. I don't particularly want to touch string interning itself, so I'd appreciate if you could look at it and give me some feedback about how you would want it to be implemented or do a pass over it to make it possible.