-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Port Collator expressions to core/iOS/Android/Qt #11869
Conversation
I've pushed initial implementations for darwin and Android, but without any testing so far -- basically just making sure I can hook everything up. Doing that exercise made me realize I hadn't fully thought through how the PRIMARY/SECONDARY/TERTIARY collator strengths (used by Android/ICU) would translate to I implemented a workaround of the form:
There is another wrinkle here in that it is up to the locale what PRIMARY/SECONDARY/TERTIARY actually mean, so it's not guaranteed they'll map the way people expect to "case" and "diacritic", but I think it should usually (?) be a pretty good mapping. |
I think the collator strengths will correspond to the flags found on other platforms, at least for Latin alphabets. Vietnamese is as good a stress test for Latin alphabets as I can think of:
|
platform/darwin/src/collator.mm
Outdated
} | ||
|
||
int compare(const std::string& lhs, const std::string& rhs) const { | ||
NSString* nsLhs = [NSString stringWithUTF8String:lhs.c_str()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: you can write @(lhs.c_str())
instead of calling -[NSString stringWithUTF8String:]
explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
platform/darwin/src/collator.mm
Outdated
int compare(const std::string& lhs, const std::string& rhs) const { | ||
NSString* nsLhs = [NSString stringWithUTF8String:lhs.c_str()]; | ||
// TODO: verify "abc" != "abcde" -- the "range" argument seems strange to me | ||
// https://developer.apple.com/documentation/foundation/nsstring/1414561-compare |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does seem weird that there isn’t a -compare:options:locale:
that doesn’t take an NSRange
. It looks like CFString has the same quirk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could implement this ourselves with a length check -- ie:
- First check that lhs == rhs over lhs's range
- If they're equal, check if rhs is longer than lhs, in which case rhs > lhs no matter what's in the extra characters
But I can imagine cases where that's not necessarily correct because the "extra characters" affect the sort order of the original strings (e.g. "ue" gets split up but it's in a german locale that treats "ue" as equal to "ü")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After playing around with this method a bit, I’m satisfied that the length check won’t even be necessary. As far as I can tell, a string always sorts after its prefix. The case below is the closest I could come to an unexpected result, but I think even this is correct. (Pardon the Swift; the playground makes it much easier to try out possibilities.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also verified the pre-fix sort behavior, and the only other concern would be character folding right at the end of the range -- however, it appears Apple doesn't do "ü"->"ue" type folding.
I just pushed an initial implementation of BCP 47 parsing in I did the most basic test of creating a collator with the unlikely language tag |
I'm not sure why spirit doesn't backtrack on reaching end of input without matching, and it seems like it'd be better to configure that globally, but for now I added explicit lookaheads to the |
81f0c09
to
a331587
Compare
🤔Another wrinkle, this time on the Darwin side. The The good news is: the rest of the collator tests running on darwin pass! (can't unignore them yet, though, since they don't have a "default" implementation). @kkaefer and/or @brunoabinader: what do you think of doing a very minimal default/linux/Qt implementation of I can see the counter-argument that if we don't do locale support on Qt, then either (1) we're piling up hidden technical debt, or (2) locale-awareness wasn't really necessary in the first place. But I still think this is a good punt -- if we start developing bilingual in-car navigation apps w/ Qt, this would be enough to get started. |
I just pushed a "default" implementation using nunicode (depends on adding nunicode 1.8 to mason: mapbox/mason#616). I believe the nunicode implementation should be able to pass all the tests except |
30180fc
to
23b33b3
Compare
Capturing some very useful conversation with @brunoabinader:
|
3fff201
to
f5f9b95
Compare
Initially had some issues with stability of the MapSnapshotter when running render test but that shouldn't happen anymore. Is this happening randomly or specific to your test? |
I only just figured out how to get the Studio logcat to keep showing logs for a dead process and it hasn't crashed since I've been following it that way -- but my guess would be the crash is in the collator test, considering my dev process has been "keep changing JNI invocations until it seems to work". 😜 |
@tobrun Actually it looks like the crash is occurring outside of
This is just a guess, but I noticed the screen dim on my test device (a Pixel 2) just before the crash. Maybe some Android user-interaction idle timeout is messing with it? |
Thank you for adding that information, will try reproducing and I'm hoping that this is the source of the crash. Fwiw on my test device I always have the following enabled:
|
@kkaefer I checked in your vendoring script and used it to re-import nunicode. One issue is that the nunicode implementation files expect to be in the same relative path as the header files, but I didn't want to pollute the root header search path with all the nunicode stuff, so instead I had to modify all the includes to use I squashed and re-wrote the commit history, in the process taking out a few TODOs that were covered by review comments. I also stopped using our hand-rolled BCP 47 parser in the darwin implementation, based on @1ec5 's assurance that "language identifiers" can already handle BCP 47. I think this is ready? Follow up work:
|
platform/darwin/src/collator.mm
Outdated
(diacriticSensitive ? 0 : NSDiacriticInsensitiveSearch)) | ||
, locale(locale_ ? | ||
[[NSLocale alloc] initWithLocaleIdentifier:@((*locale_).c_str())] : | ||
[NSLocale currentLocale]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although localeIdentifier
is BCP 47 compliant when NSLocale is created with a BCP 47 locale identifier, +[NSLocale currentLocale]
places an underscore before the region code, for backwards compatibility. So the one modification we need to make is to replace underscores with hyphens. (Alternatively, we can build a canonical BCP 47 code by stringing together the languageCode
, scriptCode
, and regionCode
properties instead of using localeIdentifier
.) Otherwise we’re in good shape on iOS and macOS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean in resolvedLocale
we should do the _
/-
replacement? I guess since _
is not a valid part of any BCP 47 tag, that's a relatively safe replacement to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the replacement -- experimentally, it looks like the identifier passed into initWithLocaleIdentifier
gets round-tripped intact whether it uses the underscore or not. Not sure what pathway currentLocale
goes down.
// may append the region tag with an "_". Changing that to a "-" makes the identifier BCP 47 compliant. | ||
// Experimentally, "zh-Hans-HK" and "zh-Hans_HK" both round trip -- if the second is used by | ||
// `currentLocale`, we don't want to return the underscore. | ||
return [[locale localeIdentifier] stringByReplacingOccurrencesOfString:@"_" withString:@"-"].UTF8String; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this is good enough to avoid the underscore issue, since none of the ISO or UN region codes have hyphens or underscores. If we want to be super safe and protect against extensions in the locale identifier, then this code would ensure that we only get the bits we care about:
// NSLocale.languageCode & company are unavailable in iOS 9.
NSArray *keys = @[NSLocaleLanguageCode, NSLocaleScriptCode, NSLocaleCountryCode];
NSMutableArray *components = [NSMutableArray array];
for (NSString *key in keys) {
NSString *code = [locale objectForKey:NSLocaleLanguageCode];
if (code) {
[components addObject:code];
}
}
return [components componentsJoinedByString:@"-"].UTF8String;
(Or the C++ string stream equivalent.) But I’m not even sure that an extension can ever be present in +[NSLocale currentLocale]
. For example, my Mac’s locale settings are all over the place, including custom sorting and date formats, but the locale identifier is still just vi_US
, which would be fine with just the underscore replacement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think if there's an extension that doesn't conform to BCP 47 conventions, all bets are off anyway.
We should try to get your personal machine containerized and used by CI for testing locale functionality. 😉
cmake/nunicode.cmake
Outdated
@@ -0,0 +1,38 @@ | |||
macro(mbgl_nunicode_core) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using a macro mixin, we could also use an interface library in CMake, like this: https://github.com/mapbox/mapbox-gl-native/blob/master/cmake/loop-uv.cmake
This allows you to add target_link_libraries(mbgl-core PRIVATE nunicode)
to "mix in" the sources and include paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, building libnu as a standalone static library also makes sense because it allows us to restrict the define flags to just the nunicode files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 Standalone static probably makes sense and is most similar to how we used to load it with mason. I'll look into setting that up (my CMake game is not that strong, but time to learn).
Do you have any concerns about the manual stripping of unused files I did in the process of vendoring? It's a potential source of confusion, but just helps reduce the total number of lines of code that have to get pulled into the repo, included in search, etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The define flags actually have to be public because they toggle sections of the header files on and off (unless we just hard-wire the headers to the state we want).
nunicode/LICENSE
Outdated
@@ -0,0 +1,19 @@ | |||
Copyright (c) 2013 Aleksey Tulinov <aleksey.tulinov@gmail.com> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have this at a level that is not the root level? e.g. in vendor/nunicode
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand -- are you suggesting just the license should live in vendor/nunicode
, or the entire nunicode
should live under a vendor
folder? Moving everything into a vendor
folder under the root makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the entire nunicode folder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@friedbunny I think the plan is to add changelog entries with the platform specific NSExpression and Expression.java wrappers. A possible entry could be:
|
ad763e2
to
cc98bfc
Compare
@jfirebaugh I just rebased on top of your DSL changes from #12258, and had to modify the |
Yeah, the DSL is piecemeal as needed. |
- Version bump to 1.8 necessary for "unaccent" functionality - Qt now depends on nunicode, ruling out use of precompiled binaries
Cross platform parsing and evaluation code.
- Based on nunicode - Not locale-aware - Used by linux and Qt builds
- Uses NSString for comparison - Uses NSLocale for loading locales
- Uses java.text.Collator for comparison - Uses java.util.Locale for locale loading - Uses LanguageTag for BCP 47 parsing - Falls back to non-locale-aware nunicode/default comparison for case-sensitive/diacritic-insensitive. - Testing these changes depends on running Android render tests - "collator" is not yet exposed in the SDK bindings.
When finished, this PR will fix issue #11692 (port of mapbox/mapbox-gl-js#6270). The functionality requires 3 separate platform implementations. Currently, I have an initial implementation of the core part (expression parsing/evaluation etc.), and I plan to start working through the platform implementations starting with iOS/macOS.
ICU (requires new Mason build with bundled collators)Qt implementation?Using default/nunicode for now/cc @anandthakker @1ec5 @nickidlugash @brunoabinader @tobrun