-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TinyVarVar and TinyVarVarULE #5520
Conversation
utils/zerovec/src/ule/vartuple.rs
Outdated
// Invariants: | ||
// - the slice begins with a byte indicating the cutoff between V0 and V1 | ||
// - the high bit of the first byte is 0 (reserved for a future varint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I discovered after running this through full datagen that there is some Adlam text for currency long names or something that is long enough that it violates this invariant.
Should I:
- Make everyone use
u16
- Implement a varint approach (I have one stashed)
- Start using the high bit, which should cover the known cases in Adlam, without the ability to extend it later to varint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that if I want to use a varint in the future, I can store the varint discriminant externally since there are some unused bits in #5510. So for now I'll go ahead and just use the high bit in the TinyVarVar.
At first glance I'm not really in favor of yet another specialized ULE representation: we discussed this, we should be optimizing MultiVar for this, and it seems like the small-length optimization does not seem to be getting much wins anyway since you're forced to use u16 (which MultiVar uses by default!) |
This PR uses u8, not u16. The initial version used 7 bits of the u8, but there was a small (<1%) set of strings that didn't fit, so now it uses all 8 bits and there's plenty of room |
Ah, I see. Still, is there any difference between this type and:
That last one isn't planned for 2.0 but it's a relatively small fix, comparatively. I think that's a far better design for this. |
🎉 All dependencies have been resolved ! |
If MultiFieldsULE also gains a VarZeroVecFormat argument, then I think the representations would converge, yes. If we merge this PR and also migrate StrStrPair over to it, then we can decouple 2.0 from MFULE work. We can migrate later without changing the postcard. That seems like a win? |
2.0 is already blocked on MultiFieldsULE work, I think. It's not just StrStrPair, there's also
Anyway, I plan to do the VZV changes for 2.0. One way to decouple 2.0 from this is in a more general way is to design an internal We could do I'm fine doing this since it introduces a new abstraction, but one that is primarily for internal use and has a path to being replaced with a general API. |
Thanks for flagging So, I was hoping that we would be working toward making MultiFieldsULE a public-facing class. TinyVarVar is intended as public-facing. One thing I think we did agree on recently was that we would not make as many ULE types, instead referencing existing ones when possible. Citation: #5127 (comment). These abstractions make it possible to apply the same principle to VarULE. One thing we haven't discussed as much is that I think it would be good for us to get to the point where we have I'm not opposed to changing |
We can't, it would need variadic generics to do so. However, we could do I want time to figure this out, so I'd rather close this PR for now.
No, that was specifically around wrappers. I do think we should consider moving towards less use of the derives where it can be done without tradeoffs. I don't think this means we immediately make new abstractions. I'd like us to consider the design much more carefully. I suspect by the end of this effort we will have a bunch of well designed abstractions that obviate a lot of custom design work. VarTuple is one such abstraction that can take on a huge load, it's one I've been considering for a while, which is why when Rob asked for it I was immediately in favor. I'm very unconvinced for the specific design of TinyVarVar. I have found that whenever we do this the proposed abstraction often doesn't have the right level of genericness (as I've pointed out here! MultiFields already can have this level of genericness) and that immediately makes me want to have much more time to consider designs even after figuring out the genericness. |
I'll go out on a limb and say that I think we might not ever need more than the following composable ULEs:
Sitting on top of these, we can have safe abstractions such as ZeroMap and VarZeroSet which are transparent over one of the sanctioned impls. And then we can get rid of I understand your desire for more time, but we already have what can be described as a patchwork of types in the zerovec crate. It is a pre-1.0 crate so that we can continue using it as a sandbox and discover what works for different use cases. |
Oh, this has been my position for ages, this is exactly why I am so strongly against one-off designs. I'm not 100% sure I want to get rid of This is exactly why I was fine with VarTupleULE without much discussion, I've long felt that that was a missing piece in this holistic design. I've been meaning to file an issue about this for a while, but I just haven't had time to sit down and think about VarULE.
We're coming very close to needing unsafe audit, and I think the current situation in the crate is rather bad for that. I very much do not want to make it worse, which is why I really want to approach this problem holistically rather than with one-offs, and I want to do that soon. I think I have a decent holistic design in mind2, which is close to what you describe, I have been meaning to document it, and will do so today. I don't think we need to implement the entire design for 2.0, but I think we should implement the unsafe cleanups involved and the parts of the design we need now. This matters now, this isn't something that only matters for hypothetical polished zerovec 1.0. In this case it appears that "TinyVarVar" ends up being more or less one of the types we may want in the holistic design3. I'm fine with either landing a temporary pattern crate one-off4 as mentioned in the other issue, or implementing it in zerovec with MultIFieldsULE. However, in general I am not a happy with needing to suddenly review a potentially one-off abstraction that gets us great size wins but is tied to a proximal release timeline. It's not necessarily the case that we'll end up realizing that this is bit-compatible with existing abstractions (like my realization about this basically being an optimized-as-planned, typed, MultiFieldsULE with Index8), and I don't want to accidentally commit us to the bit representation in a way that is incompatible with a potential holistic design. My general attitude is that I'm happy sacrificing minor size wins to be able to have better unsafe abstractions that cover more cases. This situation appears to end up not being one, I think, but I'm not sure, and I'd rather not keep being put in the position of having to figure that out in short notice. For next steps:
Footnotes
|
Depends on #5511
This is not the same as the MultiFieldsULE refactor: the length byte is currently 1 byte only, with the potential to expand to a varint in the future. This encoding should work for most use cases currently in ICU4X, including most things involving plurals, which are typically an enum between leaf strings or patterns.