Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace PluralElementsV1 with PluralElementsPackedULE #5510

Merged
merged 39 commits into from
Sep 11, 2024

Conversation

sffc
Copy link
Member

@sffc sffc commented Sep 6, 2024

Depends on #5511
Depends (no longer) on #5520

@sffc sffc changed the title Start sketching out PackedSkeletonDataV2 Add PatternElementsPackedULE and use it Sep 8, 2024
@sffc sffc marked this pull request as ready for review September 8, 2024 11:01
@sffc sffc requested review from zbraniecki, pdogr and a team as code owners September 8, 2024 11:01
@sffc
Copy link
Member Author

sffc commented Sep 8, 2024

This can't land until #5511 and #5520 land, and the commit history is a little messy because of it, but I thought it might be good to give some context about how I plan to use the new zerovec abstractions.

This reduces the baked data size for the finely sliced data structs. It also makes PluralElementsPackedULE (with all the proper bounds and impls) which I intend to use in datetime 2.0.

Comment on lines 821 to 822
// TODO: Inform the user more nicely that their data doesn't fit in our packed structure
panic!("other value too long to be packed: {self:?}")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way I could get rid of the explicit panic would be to restrict the PatternElement's T to be a certain length (create a newtype such as ShortT).

@Manishearth
Copy link
Member

It also makes PluralElementsPackedULE (with all the proper bounds and impls) which I intend to use in datetime 2.0.

I've said this before, I'm really not in favor of tying the 2.0 release timeline to brand new ULE ideas. I don't think we have the time to design that properly. I don't like designing these things without getting an understanding of the variety of use cases, and we can't get that understanding with such short notice.

I'm open to doing the MultiVar fix because it's one we've wanted for ages and have a plan for that that I think will be sufficiently general and reduce unsafe surface. VarTuple is rather straightforward as well and I'm fine with that, but I'm not really in favor of this set of zerovec changes.

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

Hm? This PR is the subject specific ULE for packed plural patterns. There are no zerovec changes in this PR.

This PR depends on #5520 which contains zerovec changes which I think are strongly motivated and justified independent of MultiVarULE, but I can change #5520 to be plurals-specific if it can land faster that way.

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

This PR is largely the result of our discussion on Friday where we decided to proceed with Option 3 for packed patterns. In order to do Option 3, I needed a better solution for packed plurals.

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

Also, please take a look at the fingerprint file in this PR. The savings for baked data are rather substantial. Generalized MultiFieldULE would get us somewhat close, but this is I feel a mostly optimal solution. I spent a lot of time designing this PR's ULE. TinyVarVar is a component I split out because I like to generalize when possible.

@Manishearth
Copy link
Member

Hm? This PR is the subject specific ULE for packed plural patterns. There are no zerovec changes in this PR.

Yes, this PR requires #5520 which is a new ULE representation that hasn't been discussed.

This PR is largely the result of our discussion on Friday where we decided to proceed with Option 3 for packed patterns. In order to do Option 3, I needed a better solution for packed plurals.

My impression of our discussion on Friday was that we decided explicitly to not add more abstractions to zerovec and instead use MultiField (and/or do manual things for patterns), which we had decided to fast track for 2.0. This feels counter to that discussion?

(To be clear: I'm fine with #5511, it's #5520 that's a surprise to me)

I can change #5520 to be plurals-specific if it can land faster that way

Yes please. Generic unsafe code is a heavier burden on maintenance and while I like to generalize where possible too, I absolutely do not want to do that last minute. We have one use case for this generic type at the moment, that is not enough for us to know whether this abstraction is at the right level of specificity.

And as far as I can tell #5520 is just (the fixed version of) MultiFieldsULE with additional support for a u8 length format: that is something I've wanted to support as well and is easy to do. This is exactly what I'm getting at when talking about figuring out if something is sufficiently generic or specific.

Also, please take a look at the fingerprint file in this PR. The savings for baked data are rather substantial.

It seems like we have a path forward for now, but for future reference I'm really not happy with the dynamic where we design new ULE things last minute when we come up with a new way to save size. I've expressed this before: I'm fine doing it when we've had time to get a grasp of the variety of use cases, but that takes time and I don't think we have time to do this well here.

The way I see things are that the data design phase for a component (which should typically be many many months before stabilization) is when we may decide to invent stuff like this, and we can give those design processes the time they deserve in such a case. In the near-stabilization data optimization phase, we might tweak the representation to pack it further, but it is too late to invent new abstractions and at best we should do component-specific manual ULE impls (I'm not overly happy with those either but I think they're fine as we still figure things out).

So I basically don't find "the savings for this are substantial" to be a valid argument here. Unfortunately pattern data design got punted close to the end of 2.0 so there never was a separate initial data design phase, but I don't think that obviates us being careful here.

Generalized MultiFieldULE would get us somewhat close, but this is I feel a mostly optimal solution.

I feel like "somewhat close" is fine? I don't think we need to fully optimize this last minute like this. My preference is rather to keep unsafe abstractions low but be "somewhat close" on datasize than proliferate them.

@Manishearth
Copy link
Member

Manishearth commented Sep 9, 2024

think are strongly motivated and justified independent of MultiVarULE

FWIW, I'm really not seeing this so far, either the strong motivation or the justification independent of MultiFieldsULE. I think (an improved) MultiFieldsULE probably can cover this with no data size diff1, and even if it does; it's unclear to me that that the relatively small wins there will be worth it.

We should probably prioritize the VarULE format refactors to be done sooner rather than later so we can get numbers.

Footnotes

  1. It may be tricky to make the varint repr work via VarZeroVecFormat but that's not happening here anyway, it was a potential future optimization, and if the time comes for such an optimization we can design it then.

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

Right, packed plural patterns has long been a priority, and going with Option 3 made datetime 2.0 depend on it (which may not have been clear in our discussion on Friday)

I implemented a packed plural solution here which I think is near optimal ("optimal" defined as being at least as good as what is would be with MFULE improvements). There is some but not a lot of unsafe code, and I made an effort for the unsafe code to be reasonably to review.

I'm trying to determine the paths forward:

  1. Merge these PRs as-is, and perhaps migrate to MFULE later when the capabilities are there. Also potentially change StrStrPair to use TinyVarVar and remove MFULE from 2.0 data
  2. Move TinyVarVar into the icu_plurals crate and treat it as internal to PatternElementsPackedULE
  3. Rewrite this PR to use MFULE instead of TinyVarVar, and probably accelerate plans to improve MFULE

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

How about this: can you give specific suggestions on how to iterate on the data struct design?

I spent hours looking at this, and I kept reaching mental blocks (one reason I scheduled that meeting on Friday), and this PR is the what I've come up with. It satisfies several constraints:

  1. Plural patterns should be encoded in-place, rather than at the same level as lengths and variants (Option 3)
  2. Singleton plural patterns should not take up more space than they currently do in datetime: achieved by a single discriminant bit plus FourBitMetadata (which lets us share the existing plural datetime patterns metadata byte).
  3. At the same time, optimize the stack representation of PluralElementsPacked to be no larger than 4 usizes, which was a 2024 blocker but not a 2.0 blocker: this PR achieves it

The layout of PluralElementsPackedULE is:

/// A bitpacked DST for [`PluralElements`].
///
/// Can be put in a [`Cow`] or a [`VarZeroVec`].
#[derive(Debug, PartialEq, Eq)]
#[repr(transparent)]
pub struct PluralElementsPackedULE<V: VarULE + ?Sized> {
    _v: PhantomData<V>,
    /// Invariant Representation:
    ///
    /// First byte: `d...mmmm`
    /// - `d` = 0 if singleton, 1 if a map
    /// - `...` = padding, should be 0
    /// - `mmmm` = [`FourBitMetadata`] for the default value
    ///
    /// Remainder: either [`PluralElementsSingletonVarULE`] or [`PluralElementsMultiVarULE`]
    /// based on the discriminant `d`
    bytes: [u8],
}

/// The type of bytes[1..] if there is a singleton pattern.
type PluralElementsSingletonVarULE<V> = V;

/// The type of bytes[1..] if there are special patterns.
type PluralElementsMultiVarULE<V> = TinyVarVarULE<V, PluralElementsTupleSliceVarULE<V>>;

/// The type of the special patterns list.
type PluralElementsTupleSliceVarULE<V> = VarZeroSlice<VarTupleULE<PluralCategoryAndMetadata, V>>;

I think I still satisfy my objectives if I were to switch PluralElementsMultiVarULE<V> to use MultiFieldVarULE, because that would only impact the cases where there is more than one plural form, which is not the common case.

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

Another design would be to merge TinyVarVarULE into the representation invariants of PluralElementsPackedULE

First byte: d...mmmm
d = 0 if singleton, 1 if a map
... = padding
mmmm = FourBitMetadata for the default value

If d is 0:
  Remainder: a single instance of V

If d is 1:
  Second byte: wwwwwwww = length of the first (default) V
  Next: an instance of V of length wwwwwwww
  Remainder: VarZeroSlice of plural specials

This is identical to the current byte representation, but without the objectionable TinyVarVarULE being split out.

@Manishearth
Copy link
Member

Right, packed plural patterns has long been a priority, and going with Option 3 made datetime 2.0 depend on it (which may not have been clear in our discussion on Friday)

Yes, but we never designed the data for them early.

How about this: can you give specific suggestions on how to iterate on the data struct design?

To be clear: I'm fine with the underlying data model. I'm not fine with the introduction of TinyVarVar. I would support:

  • Doing a one-off manual VarULE type for the specific types in question (not generic), that lives in the component crate. I think this is what you suggest in your comment above?
  • "Rewrite this PR to use MFULE instead of TinyVarVar, and probably accelerate plans to improve MFULE"

So I think the data struct design is fine, I also think us allowing for some inefficiencies is fine, so a less efficient data model would also be okay, but I don't think that's actually a strong tradeoff we're facing at this moment if we fix MFULE.

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

I need PluralElementsPackedULE<V> to be generic over V because we already have cases for multiple types of V.

@Manishearth
Copy link
Member

@sffc that's fine, it should be generic over one parameter then. Not ideal, but we're hoping to get past that anyway.

Ideally I'd suggest wrapping MultiFields and only having a little bit of unsafe, but doing a custom thing is fine as well.

Switching MultiFields over to having Format, and supporting Index8, are both very straightforward changes overall which we can easily do today if you need those wins. It's not all of the wins we need here but it covers some of them.

@Manishearth
Copy link
Member

Since I'm hoping to get rid of the TinyVarVar implementation anyway and you've already done the work I'm also fine with the minimal change in this PR of moving that code into the plurals crate, making it private, and wrapping it to give you PluralElementsPackedULE<V>. If this ends up being a part of the release, that's also fine.

I wish to be very careful about additional unsafe code in zerovec that's going to make it for the 2.0 release, since I'm gearing up to get unsafe review for that crate and that's going to take quite a while.

@sffc sffc changed the title Add PatternElementsPackedULE and use it Replace PluralElementsV1 with PatternElementsPackedULE Sep 10, 2024
Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'd like to see in this PR:

struct PluralElementsV1<'data> {
   // your custom encoding
}

// reverted, i.e. a wrapper around `PluralElementsV1` until we make it generic
struct PluralPatterns<'data, B: PatternBackend> {
   pub strings: PluralElementsV1<'data>,
   pub _phantom: PhantomData<B>,
}

What we can then do in a subsequent PR:

struct PluralElementsV1<'data, T: VarULE> {
   // your custom encoding
}

type PluralPatterns<'data, PatternBackend> =
        PluralElementsV1<'data, Pattern<B, B::Store>>;

@sffc
Copy link
Member Author

sffc commented Sep 10, 2024

struct PluralElementsV1<'data> {
   // your custom encoding
}

With a hard-coded str or with V?

@robertbastian
Copy link
Member

For now with a hardcoded str, making it generic and removing PluralPatterns can be a followup. It's actually quite involved (I just tried it at head) because Pattern is missing a lot of implementations (Deref, ToOwned, even VarULE).

@sffc
Copy link
Member Author

sffc commented Sep 10, 2024

ok, coming right up.

I want to name it PluralElementsPackedCowStr

@sffc
Copy link
Member Author

sffc commented Sep 10, 2024

type PluralPatterns<'data, PatternBackend> =
        PluralElementsV1<'data, Pattern<B, B::Store>>;

It sounds like you agree, but I don't plan to do this in this PR

@sffc
Copy link
Member Author

sffc commented Sep 10, 2024

Done. There are now a total of only 6 lines changed in icu_experimental and they are all either changing imports or changing PluralElements<T> so that it doesn't always borrow T. This in effect means that my PR is fully internal to icu_plurals from the perspective of icu_experimental. I do not seek to scope-creep this PR into icu_experimental.

@@ -368,115 +375,705 @@ pub enum PluralElementsKeysV1 {
// TODO: Make the str generic
pub struct PluralElementsFieldV1<'data>(pub PluralElementsKeysV1, pub Cow<'data, str>);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is dead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

#[cfg_attr(feature = "datagen", databake(path = icu_plurals::provider))]
#[yoke(prove_covariance_manually)]
/// A data-struct-optimized representation of [`PluralElements`].
pub struct PluralElementsV1<'data> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move the new struct to the top above all the implementation details

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still 200 lines below where it used to be, after things like struct PluralElementsUnpackedBytes<'a>. This is the most important type in this file after the data struct definitions, it shouldn't be burried in a bunch of implementation details and private structs.

My mental model of file structure for structs like struct A(B, C), struct B(D) is to declare the types and impls in the order A, B, D, C, as this makes the data structure and control flow in impls flow with the reading direction. What you're doing is pretty much the reverse, which it makes it very hard to find relevant types in a file.

where
T: PartialEq,
{
fn get_specials_tuples(&self) -> impl Iterator<Item = (PluralElementsKeysV1, &T)> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this has a single call site, which would be more readable if this wasn't defined on another type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree that the call site would be more readable. It is the job of PluralElements to provide an iterator over the non-empty values it contains. I would consider making this a public function of PluralElements, but I didn't want to scope creep, so I left it in the provider module.

@@ -43,5 +43,5 @@ pub struct CurrencyExtendedDataV1<'data> {
/// Regards to the [Unicode Report TR35](https://unicode.org/reports/tr35/tr35-numbers.html#Currencies),
/// If no matching for specific count, the `other` count will be used.
#[cfg_attr(feature = "serde", serde(borrow))]
pub display_names: PluralElementsV1<'data>,
pub display_names: PluralElementsPackedCowStr<'data>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a fan of calling this PackedCowStr. I think PluralElementsV1 was fine as it's our first iteration of PluralElements. The user here really doesn't need to care about encoding details.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't typically call things V1 unless they are actually data structs. I let it land with this name initially because it's unstable, but I consider it an improvement to call the type by what it actually is.

}

/// Helper function to access a value from [`PluralElementsTupleSliceVarULE`]
fn get_special<V: VarULE + ?Sized>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this can be in get

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this as an associated function on type PluralElementsTupleSliceVarULE<V>, but I can't add associated functions to typedefs. I also stylistically prefer to keep functions at the module level because there is less nesting / indentation.

But, I have the sense that you feel more strongly about this style (fully encapsulating private functions and structs when possible) than I do, so I'll move it for you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I ended up calling this function in one more place (the new serde impl) so I'm leaving it where it is.

V: VarULE + PartialEq + ?Sized,
{
#[cfg(feature = "datagen")]
fn to_plural_elements(&self) -> PluralElements<(FourBitMetadata, &V)> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also only used by a single call site, serde::Serialize. I'd prefer if that defined its own local struct with a derive(Serialize), so the PluralElements doesn't have to expose Serialize (it's not great having a public type be Serialize if it doesn't work for non-human readable formats).

Copy link
Member Author

@sffc sffc Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically PluralElements does serialize to non-human-readable formats, it's just not optimized for it. But, we should discourage that. I'm convinced; I'll make a private struct for the serialization.

///
/// See <https://github.com/unicode-org/icu4x/pull/1556>
#[cfg(feature = "serde")]
pub fn deserialize_plural_elements_packed_cow<'de, 'data, D, V>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be private, if you're using a cow you might as well use the PluralElementsPackedCowStr struct

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it public because if you want to deserialize something other than a str, you'll need it. I consider it part of the value proposition of PatternElementsPackedULE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I managed to make PluralElementsPackedCow be generic, so now I am able to make the fn be private.

@robertbastian robertbastian changed the title Replace PluralElementsV1 with PatternElementsPackedULE Replace PluralElementsV1 with PluralElementsPackedULE Sep 10, 2024
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: postcard size is slightly increasing, baked is decreasing

-currency/extended@1, en/ESA, 108B, 62B, a051b393cf885669
-currency/extended@1, en/ESB, 128B, 82B, 610579242f4b65f9
-currency/extended@1, en/ESP, 84B, 38B, 34b7b7dccf3b5fd2
+currency/extended@1, en/ESA, 86B, 63B, d05fac7f1be65e18
+currency/extended@1, en/ESB, 106B, 83B, 91a6ef580bbe95f3
+currency/extended@1, en/ESP, 62B, 39B, 2e4f2ff2266524df

Copy link
Member Author

@sffc sffc Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no postcard size difference for data structs that have only the other plural. I guess where there are specials, the bytes get one byte bigger because there's a new discriminant bit (pulling something that was previously in postcard land into zerovec land).

Previously:

// No Specials
[postcard length] [other pattern] [postcard length 0]

// With Specials
[postcard length] [other pattern] [postcard length] [specials]

Now:

// No Specials
[postcard length] [discriminant] [other pattern]

// With Specials
[postcard length] [discriminant] [other length] [other pattern] [specials]

@sffc sffc requested a review from robertbastian September 10, 2024 21:00
}

// Need a manual impl because the derive(Clone) impl bounds are wrong
impl<'data, V> Clone for PluralElementsPackedCow<'data, V>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sffc
Copy link
Member Author

sffc commented Sep 11, 2024

I think I've gone above and beyond on this PR. Ready to merge.

Note: Some of the CI jobs are being killed with exit code 143, but re-running fixes them.

#[cfg_attr(feature = "datagen", databake(path = icu_plurals::provider))]
#[yoke(prove_covariance_manually)]
/// A data-struct-optimized representation of [`PluralElements`].
pub struct PluralElementsV1<'data> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still 200 lines below where it used to be, after things like struct PluralElementsUnpackedBytes<'a>. This is the most important type in this file after the data struct definitions, it shouldn't be burried in a bunch of implementation details and private structs.

My mental model of file structure for structs like struct A(B, C), struct B(D) is to declare the types and impls in the order A, B, D, C, as this makes the data structure and control flow in impls flow with the reading direction. What you're doing is pretty much the reverse, which it makes it very hard to find relevant types in a file.

@robertbastian robertbastian merged commit c0e2dbc into unicode-org:main Sep 11, 2024
28 checks passed
@sffc sffc deleted the packedpattern branch September 11, 2024 15:25
@sffc sffc mentioned this pull request Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants