Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TinyVarVar and TinyVarVarULE #5520

Closed
wants to merge 12 commits into from
Closed

Conversation

sffc
Copy link
Member

@sffc sffc commented Sep 8, 2024

Depends on #5511

This is not the same as the MultiFieldsULE refactor: the length byte is currently 1 byte only, with the potential to expand to a varint in the future. This encoding should work for most use cases currently in ICU4X, including most things involving plurals, which are typically an enum between leaf strings or patterns.

Comment on lines 294 to 296
// Invariants:
// - the slice begins with a byte indicating the cutoff between V0 and V1
// - the high bit of the first byte is 0 (reserved for a future varint)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I discovered after running this through full datagen that there is some Adlam text for currency long names or something that is long enough that it violates this invariant.

Should I:

  1. Make everyone use u16
  2. Implement a varint approach (I have one stashed)
  3. Start using the high bit, which should cover the known cases in Adlam, without the ability to extend it later to varint

Copy link
Member Author

@sffc sffc Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that if I want to use a varint in the future, I can store the varint discriminant externally since there are some unused bits in #5510. So for now I'll go ahead and just use the high bit in the TinyVarVar.

@Manishearth
Copy link
Member

At first glance I'm not really in favor of yet another specialized ULE representation: we discussed this, we should be optimizing MultiVar for this, and it seems like the small-length optimization does not seem to be getting much wins anyway since you're forced to use u16 (which MultiVar uses by default!)

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

At first glance I'm not really in favor of yet another specialized ULE representation: we discussed this, we should be optimizing MultiVar for this, and it seems like the small-length optimization does not seem to be getting much wins anyway since you're forced to use u16 (which MultiVar uses by default!)

This PR uses u8, not u16.

The initial version used 7 bits of the u8, but there was a small (<1%) set of strings that didn't fit, so now it uses all 8 bits and there's plenty of room

@Manishearth
Copy link
Member

Ah, I see.

Still, is there any difference between this type and:

That last one isn't planned for 2.0 but it's a relatively small fix, comparatively.

I think that's a far better design for this.

Copy link

dpulls bot commented Sep 9, 2024

🎉 All dependencies have been resolved !

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

If MultiFieldsULE also gains a VarZeroVecFormat argument, then I think the representations would converge, yes.

If we merge this PR and also migrate StrStrPair over to it, then we can decouple 2.0 from MFULE work. We can migrate later without changing the postcard. That seems like a win?

@Manishearth
Copy link
Member

Manishearth commented Sep 9, 2024

2.0 is already blocked on MultiFieldsULE work, I think. It's not just StrStrPair, there's also

pub struct CodePointInversionListAndStringList<'data> {

Anyway, I plan to do the VZV changes for 2.0.

One way to decouple 2.0 from this is in a more general way is to design an internal TwoFieldsULE<Format> which is a generalized version of TinyVarVar with a MultiFieldsULE-like API, and then use it for n = 2 in make_varule. We never use it directly, much like how we never use MultiFieldsULE directly, but rather all callers just use it via the macro.

We could do TwoFieldsULE<T, U, Format> as well, though then the MultiFieldsULE-calling code would not work the same for TwoFieldsULE, and it's internal anyway.

I'm fine doing this since it introduces a new abstraction, but one that is primarily for internal use and has a path to being replaced with a general API.

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

Thanks for flagging CodePointInversionListAndStringList.

So, I was hoping that we would be working toward making MultiFieldsULE a public-facing class. TinyVarVar is intended as public-facing. One thing I think we did agree on recently was that we would not make as many ULE types, instead referencing existing ones when possible. Citation: #5127 (comment). These abstractions make it possible to apply the same principle to VarULE.

One thing we haven't discussed as much is that I think it would be good for us to get to the point where we have derive(ULE) and derive(VarULE) that work only on repr(transparent) types (which we could alternatively get via macro_rules), and then developers build out their ULE representation explicitly using things like VarTupleULE and MultiFieldsULE directly.

I'm not opposed to changing TinyVarVar<U, V> to TinyVarVar<U, V, Format>

@Manishearth
Copy link
Member

So, I was hoping that we would be working toward making MultiFieldsULE a public-facing class.

We can't, it would need variadic generics to do so.

However, we could do TwoFieldsULE<T, U, Format> and have that be public facing, and have it internally wrap a MultiFieldsULE<2>. This would have a much better encapsulated amount of unsafe.

I want time to figure this out, so I'd rather close this PR for now.

One thing I think we did agree on recently was that we would not make as many ULE types, instead referencing existing ones when possible.

No, that was specifically around wrappers.

I do think we should consider moving towards less use of the derives where it can be done without tradeoffs. I don't think this means we immediately make new abstractions. I'd like us to consider the design much more carefully. I suspect by the end of this effort we will have a bunch of well designed abstractions that obviate a lot of custom design work. VarTuple is one such abstraction that can take on a huge load, it's one I've been considering for a while, which is why when Rob asked for it I was immediately in favor. I'm very unconvinced for the specific design of TinyVarVar.

I have found that whenever we do this the proposed abstraction often doesn't have the right level of genericness (as I've pointed out here! MultiFields already can have this level of genericness) and that immediately makes me want to have much more time to consider designs even after figuring out the genericness.

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

I'll go out on a limb and say that I think we might not ever need more than the following composable ULEs:

  • ZeroVec (list of ULE)
  • VarZeroVec (list of VarULE)
  • Tuple2ULE..Tuple6ULE (named fields for ULE)
  • VarVar2ULE..VarVar6ULE (named fields for VarULE)
  • VarTupleULE (composing a ULE and a VarULE)

Sitting on top of these, we can have safe abstractions such as ZeroMap and VarZeroSet which are transparent over one of the sanctioned impls.

And then we can get rid of make_ule and make_varule.

I understand your desire for more time, but we already have what can be described as a patchwork of types in the zerovec crate. It is a pre-1.0 crate so that we can continue using it as a sandbox and discover what works for different use cases.

@Manishearth
Copy link
Member

I'll go out on a limb and say that I think we might not ever need more than the following composable ULEs:

Oh, this has been my position for ages, this is exactly why I am so strongly against one-off designs. I'm not 100% sure I want to get rid of make_ule and make_varule1, but yes, I think this is more or less the set of abstractions we should be building towards, though potentially tweaking the number of max fields on the tuples based on how best we can encapsulate that unsafe code (if it's going to only be a little bit of unsafe code wrapping MultiFields that we can macro, great, let's do six, if its going to be a load of custom unsafe code per TupleNULE, we should limit the number.

This is exactly why I was fine with VarTupleULE without much discussion, I've long felt that that was a missing piece in this holistic design.

I've been meaning to file an issue about this for a while, but I just haven't had time to sit down and think about VarULE.

I understand your desire for more time, but we already have what can be described as a patchwork of types in the zerovec crate. It is a pre-1.0 crate so that we can continue using it as a sandbox and discover what works for different use cases.

We're coming very close to needing unsafe audit, and I think the current situation in the crate is rather bad for that. I very much do not want to make it worse, which is why I really want to approach this problem holistically rather than with one-offs, and I want to do that soon. I think I have a decent holistic design in mind2, which is close to what you describe, I have been meaning to document it, and will do so today. I don't think we need to implement the entire design for 2.0, but I think we should implement the unsafe cleanups involved and the parts of the design we need now.

This matters now, this isn't something that only matters for hypothetical polished zerovec 1.0.

In this case it appears that "TinyVarVar" ends up being more or less one of the types we may want in the holistic design3. I'm fine with either landing a temporary pattern crate one-off4 as mentioned in the other issue, or implementing it in zerovec with MultIFieldsULE.

However, in general I am not a happy with needing to suddenly review a potentially one-off abstraction that gets us great size wins but is tied to a proximal release timeline. It's not necessarily the case that we'll end up realizing that this is bit-compatible with existing abstractions (like my realization about this basically being an optimized-as-planned, typed, MultiFieldsULE with Index8), and I don't want to accidentally commit us to the bit representation in a way that is incompatible with a potential holistic design. My general attitude is that I'm happy sacrificing minor size wins to be able to have better unsafe abstractions that cover more cases. This situation appears to end up not being one, I think, but I'm not sure, and I'd rather not keep being put in the position of having to figure that out in short notice.


For next steps:

  • Let's either land this as a targeted one off in the patterns code, or as a zerovec abstraction that internally uses MultiFieldsULE
  • I'm going to go file that issue
  • I'm going to prioritize fixing MultiFieldsULE which unblocks us for 2.0. I'll do it after I finish my current two open PRs.

Footnotes

  1. If we can get rid of all other custom derive deps for a particular crate, maybe, otherwise I kinda prefer the custom derives

  2. Not just from an API point of view, but an implementation point of view. I've filed a patchwork of issues around aspecs of this.

  3. though, not specific to "tiny". I'm also not a fan of the "VarVar" name, but that's different and bikesheddable.

  4. Why a one-off in the component crate? Because it keeps the amount of unsafe in this crate, which is already a lot, down, and we're not going to need unsafe audits for every component crate any time soon. Plus a little bit of private, targeted unsafe in a component crate is typically easy to get reviewed compared to something more public and generic. I still wish to keep unsafe in components to a bare minimum, but in this specific case I prefer things being in the component until we have a holistic design.

@sffc
Copy link
Member Author

sffc commented Sep 9, 2024

Ok I rewrote #5510 to be independent of this.

Time was not wasted because I applied the same sort of coding style from this PR into #5510.

@sffc sffc closed this Sep 9, 2024
@sffc sffc deleted the tinyvarvar branch September 9, 2024 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants