Glob patterns and umlauts on HFS vs. APFS #845
Yeah, ripgrep doesn't handle normalization at all. From your example, it looks like HFS specifically uses the decomposed normal form. This doesn't just impact glob matching, but actual search as well. In particular, if you search for a composed Unicode codepoint, you won't find files that contain the same glyph in decomposed form, and vice versa. This is because ripgrep doesn't really know which normal form to use. If we know that HFS always uses a specific Unicode normal form, then in theory we could normalize the pattern to that form before matching.
This is fraught with complications and unknowns, some of them significant.
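To make the underlying mismatch concrete, here is a minimal illustration (plain Python, not ripgrep's code) of how the composed and decomposed forms of the same glyph are distinct codepoint sequences, which is why a byte-oriented matcher treats them as different strings:

```python
import unicodedata

composed = "\u00fc"     # "ü" as a single precomposed codepoint (NFC form)
decomposed = "u\u0308"  # "u" followed by COMBINING DIAERESIS (NFD form)

# The two strings render identically but compare unequal,
# so a literal or glob match on one misses the other.
assert composed != decomposed

# Normalizing either side to a common form makes them compare equal.
assert unicodedata.normalize("NFD", composed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
```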
In other words, the only real way to solve this problem is to build normalization support into the regex engine itself. This is basically a rewrite and drastically alters the performance profile of regular expressions. If you read UTS#18, you can see that the Unicode people are well aware of this, which is why features like canonical equivalence are pushed into "extended" level 2 support, which very few regex engines support at all. Unless there is a simple fix I'm missing (perhaps even one that partially fixes the issue), then I suspect this is a wontfix.
One possibility is to decompose codepoints in a glob that aren't part of a character class. It would be a frustrating half-measure, and it would still be problematic with respect to needing to detect the HFS file system. One possible way to solve that would be to expose a flag that enables the normalization pass unconditionally, for when you know you're only searching an HFS file system. This wouldn't work if you search across multiple file systems, and a flag like that doesn't seem like good UX to me.
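A rough sketch of that half-measure, using Python's `fnmatch` as a stand-in for ripgrep's glob matcher. The `decompose_glob` helper is hypothetical, and this naive version ignores character classes entirely, which is exactly the complication described above:

```python
import fnmatch
import unicodedata

def decompose_glob(pattern: str) -> str:
    # Hypothetical pre-processing pass: NFD-normalize the whole pattern.
    # A real implementation would have to skip character classes like [ü];
    # blindly decomposing those changes their meaning (see below).
    return unicodedata.normalize("NFD", pattern)

# HFS stores filenames in decomposed form.
hfs_name = "u\u0308"   # decomposed "ü" as stored on disk
user_glob = "*\u00fc"  # the user typed a composed "ü"

assert not fnmatch.fnmatchcase(hfs_name, user_glob)              # misses today
assert fnmatch.fnmatchcase(hfs_name, decompose_glob(user_glob))  # matches after the pass
```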
Thanks for the great analysis! Reading UTS#18, this seems like a more general problem than just HFS, because other filesystems don't bother with normalization at all. UTS#18 suggests using NFD (or NFKD) for regular expression matching (HFS is using NFD). Maybe the regex engine could have a mode where it normalizes all inputs (both the glob and the filename to match against) and character classes to NFD. That would also deal with the case where a filesystem without normalization (which seems to be most of them, aside from HFS) ends up with a mixture of representations of the same visual character. Maybe that would leave the regex engine's core untouched, but I'm just guessing. As an aside: we noticed that while OSX's APFS does not change the filename's representation, it still does not allow you to create two files whose names have the same normalized representation. So it is aware of the equivalence.
This is basically what I suggested above, and it is problematic for exactly the reasons stated, specifically with regard to character classes. Note that UTS#18 S2.1 (on canonical equivalents) is specifically suggesting that the end user construct their pattern such that it uses NFD, likely for exactly the reasons I mentioned. Unicode hints at this with the last bullet point, which is critical: "Applying the matching algorithm on a code point by code point basis, as usual." Translating the input to NFD is definitely not something that should be in the regex engine itself, mostly because it doesn't confer any advantages there. It would be something that ripgrep would do as a pre-processing step outside the regex engine if we were to pursue that path. As I hinted above, this would drastically alter the performance profile of ripgrep. Unicode normalization is decidedly not cheap.
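The character-class problem is easy to demonstrate. Naively NFD-normalizing a pattern that contains `[ü]` turns a one-character class into a class over two separate codepoints, which matches the wrong things (Python's `re` standing in for the regex engine here, purely for illustration):

```python
import re
import unicodedata

pattern = "^[\u00fc]$"  # a character class containing the composed "ü"
nfd_pattern = unicodedata.normalize("NFD", pattern)

# NFD turns [ü] into [u + COMBINING DIAERESIS]: a class matching EITHER
# "u" or the combining mark alone, not the two-codepoint sequence.
assert nfd_pattern == "^[u\u0308]$"

assert re.match(nfd_pattern, "u")             # now wrongly matches a bare "u"
assert not re.match(nfd_pattern, "u\u0308")   # and misses the decomposed "ü"
```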
Indeed! On Unix (and probably Windows too), it is entirely possible to create a file that contains a composed codepoint in its name, and then use the decomposed codepoint in a glob (and vice versa), and that would result in a match failure even though the text strings look the same to an end user. It is a frustrating UX, no doubt about it. Presumably HFS is trying to fix that.
Now that is interesting!
Sounds good. I'll look into mitigating this on VS Code's side. We might get away with supplying two exclusion patterns when the NFD form differs from the user input.
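A sketch of that mitigation (the helper name is hypothetical, not VS Code's actual code): emit both the NFC and NFD forms of a pattern whenever they differ, and supply every variant to the matcher.

```python
import unicodedata

def pattern_variants(pattern: str) -> list[str]:
    # If the composed and decomposed forms differ, supply both so that
    # files stored in either representation are matched.
    nfc = unicodedata.normalize("NFC", pattern)
    nfd = unicodedata.normalize("NFD", pattern)
    return [nfc] if nfc == nfd else [nfc, nfd]

assert pattern_variants("*.txt") == ["*.txt"]                   # ASCII: one variant
assert pattern_variants("*\u00fc") == ["*\u00fc", "*u\u0308"]   # umlaut: two variants
```

As noted elsewhere in this thread, this only works safely for literal characters; patterns with character classes would need more care.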
@chrmarti Aye. Just be careful! If you replace a `ü` inside a character class with its decomposed form, the class will match the `u` and the combining mark as separate alternatives. If you just have a literal `ü` outside a class, replacing it with the decomposed form should be fine.
I came across this again today. It would be great if ripgrep could optionally normalize all input text. For me it was not filenames: the actual file contents were in a different form, and I had to search for both the decomposed and composed forms.
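For content search, the workaround the commenter describes amounts to searching for an alternation of both normal forms of the query. A hedged sketch (the helper is illustrative only, not a ripgrep feature):

```python
import re
import unicodedata

def both_forms_regex(literal: str) -> str:
    # Build an alternation of the NFC and NFD forms of a literal query,
    # so a single search hits text stored in either representation.
    forms = {unicodedata.normalize(f, literal) for f in ("NFC", "NFD")}
    return "|".join(re.escape(f) for f in sorted(forms))

# The same word in decomposed form, then composed form:
haystack = "Gru\u0308n und Gr\u00fcn"
assert len(re.findall(both_forms_regex("Gr\u00fcn"), haystack)) == 2
```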
Nothing has changed since my comments above, so I don't see this happening. Sorry.
What version of ripgrep are you using?
ripgrep 0.8.1 (rev c8e9f25)
+SIMD -AVX
What operating system are you using ripgrep on?
OSX 10.12.6 and 10.13.3
If this is a bug, what are the steps to reproduce the behavior?
Glob patterns do not match umlauts because HFS stores filenames in a normalized (decomposed) encoding.
Create a corpus of files named with the umlaut, entered in different input forms. Also create a third through the command line for comparison. Running ls then shows all three files.

If this is a bug, what is the actual behavior?
Running

rg --files -g '*ü'

shows no output on OSX 10.12.6 with HFS, while on OSX 10.13.3 with APFS it does produce matches (though not all three files).

If this is a bug, what is the expected behavior?
Ideally, on both versions of OSX we would get all three files as matches. The problem comes from HFS converting all three variations of the umlaut filename to the decomposed (3-byte UTF-8) sequence, whereas APFS leaves the representation of the filename as it is given.
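The filesystem behavior described above can be simulated without a filesystem at all (a plain-Python illustration, with NFD normalization standing in for what HFS does on file creation):

```python
import unicodedata

# Two ways the filename "ü" can arrive at the filesystem:
typed_composed = "\u00fc"    # precomposed, 2 bytes in UTF-8
typed_decomposed = "u\u0308" # decomposed, 3 bytes in UTF-8

# HFS normalizes every variant to the decomposed form on creation:
for name in (typed_composed, typed_decomposed):
    assert unicodedata.normalize("NFD", name) == "u\u0308"

# APFS stores whatever sequence it is given, so the names stay distinct:
assert typed_composed != typed_decomposed
assert len(typed_composed.encode("utf-8")) == 2
assert len(typed_decomposed.encode("utf-8")) == 3
```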
Found via microsoft/vscode#43691.
/cc @joaomoreno