-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace PluralElementsV1 with PluralElementsPackedULE #5510
Conversation
This can't land until #5511 and #5520 land, and the commit history is a little messy because of it, but I thought it might be good to give some context about how I plan to use the new zerovec abstractions. This reduces the baked data size for the finely sliced data structs. It also makes PluralElementsPackedULE (with all the proper bounds and impls) which I intend to use in datetime 2.0. |
components/plurals/src/provider.rs
Outdated
// TODO: Inform the user more nicely that their data doesn't fit in our packed structure | ||
panic!("other value too long to be packed: {self:?}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way I could get rid of the explicit panic would be to restrict the PatternElement's T to be a certain length (create a newtype such as ShortT).
I've said this before, I'm really not in favor of tying the 2.0 release timeline to brand new ULE ideas. I don't think we have the time to design that properly. I don't like designing these things without getting an understanding of the variety of use cases, and we can't get that understanding with such short notice. I'm open to doing the MultiVar fix because it's one we've wanted for ages and have a plan for that that I think will be sufficiently general and reduce unsafe surface. VarTuple is rather straightforward as well and I'm fine with that, but I'm not really in favor of this set of zerovec changes. |
Hm? This PR is the subject specific ULE for packed plural patterns. There are no zerovec changes in this PR. This PR depends on #5520 which contains zerovec changes which I think are strongly motivated and justified independent of MultiVarULE, but I can change #5520 to be plurals-specific if it can land faster that way. |
This PR is largely the result of our discussion on Friday where we decided to proceed with Option 3 for packed patterns. In order to do Option 3, I needed a better solution for packed plurals. |
Also, please take a look at the fingerprint file in this PR. The savings for baked data are rather substantial. Generalized MultiFieldULE would get us somewhat close, but this is I feel a mostly optimal solution. I spent a lot of time designing this PR's ULE. TinyVarVar is a component I split out because I like to generalize when possible. |
Yes, this PR requires #5520 which is a new ULE representation that hasn't been discussed.
My impression of our discussion on Friday was that we decided explicitly to not add more abstractions to zerovec and instead use MultiField (and/or do manual things for patterns), which we had decided to fast track for 2.0. This feels counter to that discussion? (To be clear: I'm fine with #5511, it's #5520 that's a surprise to me)
Yes please. Generic unsafe code is a heavier burden on maintenance and while I like to generalize where possible too, I absolutely do not want to do that last minute. We have one use case for this generic type at the moment, that is not enough for us to know whether this abstraction is at the right level of specificity. And as far as I can tell #5520 is just (the fixed version of)
It seems like we have a path forward for now, but for future reference I'm really not happy with the dynamic where we design new ULE things last minute when we come up with a new way to save size. I've expressed this before: I'm fine doing it when we've had time to get a grasp of the variety of use cases, but that takes time and I don't think we have time to do this well here. The way I see things are that the data design phase for a component (which should typically be many many months before stabilization) is when we may decide to invent stuff like this, and we can give those design processes the time they deserve in such a case. In the near-stabilization data optimization phase, we might tweak the representation to pack it further, but it is too late to invent new abstractions and at best we should do component-specific manual ULE impls (I'm not overly happy with those either but I think they're fine as we still figure things out). So I basically don't find "the savings for this are substantial" to be a valid argument here. Unfortunately pattern data design got punted close to the end of 2.0 so there never was a separate initial data design phase, but I don't think that obviates us being careful here.
I feel like "somewhat close" is fine? I don't think we need to fully optimize this last minute like this. My preference is rather to keep unsafe abstractions low but be "somewhat close" on datasize than proliferate them. |
FWIW, I'm really not seeing this so far, either the strong motivation or the justification independent of MultiFieldsULE. I think (an improved) MultiFieldsULE probably can cover this with no data size diff1, and even if it does; it's unclear to me that that the relatively small wins there will be worth it. We should probably prioritize the VarULE format refactors to be done sooner rather than later so we can get numbers. Footnotes
|
Right, packed plural patterns has long been a priority, and going with Option 3 made datetime 2.0 depend on it (which may not have been clear in our discussion on Friday) I implemented a packed plural solution here which I think is near optimal ("optimal" defined as being at least as good as what is would be with MFULE improvements). There is some but not a lot of unsafe code, and I made an effort for the unsafe code to be reasonably to review. I'm trying to determine the paths forward:
|
How about this: can you give specific suggestions on how to iterate on the data struct design? I spent hours looking at this, and I kept reaching mental blocks (one reason I scheduled that meeting on Friday), and this PR is the what I've come up with. It satisfies several constraints:
The layout of PluralElementsPackedULE is: /// A bitpacked DST for [`PluralElements`].
///
/// Can be put in a [`Cow`] or a [`VarZeroVec`].
#[derive(Debug, PartialEq, Eq)]
#[repr(transparent)]
pub struct PluralElementsPackedULE<V: VarULE + ?Sized> {
_v: PhantomData<V>,
/// Invariant Representation:
///
/// First byte: `d...mmmm`
/// - `d` = 0 if singleton, 1 if a map
/// - `...` = padding, should be 0
/// - `mmmm` = [`FourBitMetadata`] for the default value
///
/// Remainder: either [`PluralElementsSingletonVarULE`] or [`PluralElementsMultiVarULE`]
/// based on the discriminant `d`
bytes: [u8],
}
/// The type of bytes[1..] if there is a singleton pattern.
type PluralElementsSingletonVarULE<V> = V;
/// The type of bytes[1..] if there are special patterns.
type PluralElementsMultiVarULE<V> = TinyVarVarULE<V, PluralElementsTupleSliceVarULE<V>>;
/// The type of the special patterns list.
type PluralElementsTupleSliceVarULE<V> = VarZeroSlice<VarTupleULE<PluralCategoryAndMetadata, V>>; I think I still satisfy my objectives if I were to switch |
Another design would be to merge TinyVarVarULE into the representation invariants of PluralElementsPackedULE
This is identical to the current byte representation, but without the objectionable |
Yes, but we never designed the data for them early.
To be clear: I'm fine with the underlying data model. I'm not fine with the introduction of TinyVarVar. I would support:
So I think the data struct design is fine, I also think us allowing for some inefficiencies is fine, so a less efficient data model would also be okay, but I don't think that's actually a strong tradeoff we're facing at this moment if we fix MFULE. |
I need |
@sffc that's fine, it should be generic over one parameter then. Not ideal, but we're hoping to get past that anyway. Ideally I'd suggest wrapping MultiFields and only having a little bit of unsafe, but doing a custom thing is fine as well. Switching MultiFields over to having Format, and supporting Index8, are both very straightforward changes overall which we can easily do today if you need those wins. It's not all of the wins we need here but it covers some of them. |
Since I'm hoping to get rid of the TinyVarVar implementation anyway and you've already done the work I'm also fine with the minimal change in this PR of moving that code into the plurals crate, making it private, and wrapping it to give you I wish to be very careful about additional unsafe code in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'd like to see in this PR:
struct PluralElementsV1<'data> {
// your custom encoding
}
// reverted, i.e. a wrapper around `PluralElementsV1` until we make it generic
struct PluralPatterns<'data, B: PatternBackend> {
pub strings: PluralElementsV1<'data>,
pub _phantom: PhantomData<B>,
}
What we can then do in a subsequent PR:
struct PluralElementsV1<'data, T: VarULE> {
// your custom encoding
}
type PluralPatterns<'data, PatternBackend> =
PluralElementsV1<'data, Pattern<B, B::Store>>;
struct PluralElementsV1<'data> {
// your custom encoding
} With a hard-coded |
For now with a hardcoded |
ok, coming right up. I want to name it |
type PluralPatterns<'data, PatternBackend> =
PluralElementsV1<'data, Pattern<B, B::Store>>; It sounds like you agree, but I don't plan to do this in this PR |
Done. There are now a total of only 6 lines changed in |
components/plurals/src/provider.rs
Outdated
@@ -368,115 +375,705 @@ pub enum PluralElementsKeysV1 { | |||
// TODO: Make the str generic | |||
pub struct PluralElementsFieldV1<'data>(pub PluralElementsKeysV1, pub Cow<'data, str>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is dead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
#[cfg_attr(feature = "datagen", databake(path = icu_plurals::provider))] | ||
#[yoke(prove_covariance_manually)] | ||
/// A data-struct-optimized representation of [`PluralElements`]. | ||
pub struct PluralElementsV1<'data> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move the new struct to the top above all the implementation details
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still 200 lines below where it used to be, after things like struct PluralElementsUnpackedBytes<'a>
. This is the most important type in this file after the data struct definitions, it shouldn't be burried in a bunch of implementation details and private structs.
My mental model of file structure for structs like struct A(B, C)
, struct B(D)
is to declare the types and impls in the order A, B, D, C, as this makes the data structure and control flow in impls flow with the reading direction. What you're doing is pretty much the reverse, which it makes it very hard to find relevant types in a file.
where | ||
T: PartialEq, | ||
{ | ||
fn get_specials_tuples(&self) -> impl Iterator<Item = (PluralElementsKeysV1, &T)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this has a single call site, which would be more readable if this wasn't defined on another type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree that the call site would be more readable. It is the job of PluralElements
to provide an iterator over the non-empty values it contains. I would consider making this a public function of PluralElements
, but I didn't want to scope creep, so I left it in the provider module.
@@ -43,5 +43,5 @@ pub struct CurrencyExtendedDataV1<'data> { | |||
/// Regards to the [Unicode Report TR35](https://unicode.org/reports/tr35/tr35-numbers.html#Currencies), | |||
/// If no matching for specific count, the `other` count will be used. | |||
#[cfg_attr(feature = "serde", serde(borrow))] | |||
pub display_names: PluralElementsV1<'data>, | |||
pub display_names: PluralElementsPackedCowStr<'data>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a fan of calling this PackedCowStr
. I think PluralElementsV1
was fine as it's our first iteration of PluralElements
. The user here really doesn't need to care about encoding details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't typically call things V1
unless they are actually data structs. I let it land with this name initially because it's unstable, but I consider it an improvement to call the type by what it actually is.
} | ||
|
||
/// Helper function to access a value from [`PluralElementsTupleSliceVarULE`] | ||
fn get_special<V: VarULE + ?Sized>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this can be in get
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this as an associated function on type PluralElementsTupleSliceVarULE<V>
, but I can't add associated functions to typedefs. I also stylistically prefer to keep functions at the module level because there is less nesting / indentation.
But, I have the sense that you feel more strongly about this style (fully encapsulating private functions and structs when possible) than I do, so I'll move it for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I ended up calling this function in one more place (the new serde impl) so I'm leaving it where it is.
components/plurals/src/provider.rs
Outdated
V: VarULE + PartialEq + ?Sized, | ||
{ | ||
#[cfg(feature = "datagen")] | ||
fn to_plural_elements(&self) -> PluralElements<(FourBitMetadata, &V)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also only used by a single call site, serde::Serialize
. I'd prefer if that defined its own local struct with a derive(Serialize)
, so the PluralElements
doesn't have to expose Serialize
(it's not great having a public type be Serialize
if it doesn't work for non-human readable formats).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically PluralElements
does serialize to non-human-readable formats, it's just not optimized for it. But, we should discourage that. I'm convinced; I'll make a private struct for the serialization.
components/plurals/src/provider.rs
Outdated
/// | ||
/// See <https://github.com/unicode-org/icu4x/pull/1556> | ||
#[cfg(feature = "serde")] | ||
pub fn deserialize_plural_elements_packed_cow<'de, 'data, D, V>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be private, if you're using a cow you might as well use the PluralElementsPackedCowStr
struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it public because if you want to deserialize something other than a str
, you'll need it. I consider it part of the value proposition of PatternElementsPackedULE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I managed to make PluralElementsPackedCow
be generic, so now I am able to make the fn be private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
observation: postcard size is slightly increasing, baked is decreasing
-currency/extended@1, en/ESA, 108B, 62B, a051b393cf885669
-currency/extended@1, en/ESB, 128B, 82B, 610579242f4b65f9
-currency/extended@1, en/ESP, 84B, 38B, 34b7b7dccf3b5fd2
+currency/extended@1, en/ESA, 86B, 63B, d05fac7f1be65e18
+currency/extended@1, en/ESB, 106B, 83B, 91a6ef580bbe95f3
+currency/extended@1, en/ESP, 62B, 39B, 2e4f2ff2266524df
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no postcard size difference for data structs that have only the other
plural. I guess where there are specials, the bytes get one byte bigger because there's a new discriminant bit (pulling something that was previously in postcard land into zerovec land).
Previously:
// No Specials
[postcard length] [other pattern] [postcard length 0]
// With Specials
[postcard length] [other pattern] [postcard length] [specials]
Now:
// No Specials
[postcard length] [discriminant] [other pattern]
// With Specials
[postcard length] [discriminant] [other length] [other pattern] [specials]
} | ||
|
||
// Need a manual impl because the derive(Clone) impl bounds are wrong | ||
impl<'data, V> Clone for PluralElementsPackedCow<'data, V> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've gone above and beyond on this PR. Ready to merge. Note: Some of the CI jobs are being killed with exit code 143, but re-running fixes them. |
#[cfg_attr(feature = "datagen", databake(path = icu_plurals::provider))] | ||
#[yoke(prove_covariance_manually)] | ||
/// A data-struct-optimized representation of [`PluralElements`]. | ||
pub struct PluralElementsV1<'data> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still 200 lines below where it used to be, after things like struct PluralElementsUnpackedBytes<'a>
. This is the most important type in this file after the data struct definitions, it shouldn't be burried in a bunch of implementation details and private structs.
My mental model of file structure for structs like struct A(B, C)
, struct B(D)
is to declare the types and impls in the order A, B, D, C, as this makes the data structure and control flow in impls flow with the reading direction. What you're doing is pretty much the reverse, which it makes it very hard to find relevant types in a file.
Depends on #5511
Depends (no longer) on #5520