Layout of char #101

Open
gnzlbg opened this issue Mar 15, 2019 · 2 comments
Labels
A-layout Topic: Related to data structure layout (`#[repr]`) C-open-question Category: An open question that we should revisit S-not-opsem Despite being in this repo, this is not primarily a T-opsem question

Comments

@gnzlbg
Contributor

gnzlbg commented Mar 15, 2019

Currently we only say that char has size 4. Is there anything we can say about its alignment beyond "implementation-defined"?
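For reference, a minimal sketch of how to observe both values on a given target. The helper name `char_layout` is made up for illustration; the size is the documented 4 bytes, while the printed alignment is only what the current target happens to use, not a guarantee.

```rust
use std::mem::{align_of, size_of};

/// Returns (size, alignment) of `char` in bytes on the current target.
/// `char_layout` is a hypothetical helper, not a standard library function.
fn char_layout() -> (usize, usize) {
    (size_of::<char>(), align_of::<char>())
}

fn main() {
    let (size, align) = char_layout();
    // The size is documented as 4; the alignment is target-dependent.
    println!("size_of::<char>()  = {size}");
    println!("align_of::<char>() = {align}");
    // Printed for comparison only; equality with u32 is not asserted here.
    println!("align_of::<u32>()  = {}", align_of::<u32>());
}
```

Since a type's size is always a multiple of its alignment, the alignment can only be 1, 2, or 4.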

Also @ubsan mentioned:

I'd argue it'd be useful to be ABI compatible with char32_t (note: that type only exists in C++).

People have argued that they shouldn't be ABI compatible, since char32_t doesn't have the correctness guarantees Rust's char does; I would argue that it's the same idea as C-like enums in Rust vs enums in C++.
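To make the "correctness guarantees" concrete: a `char` must be a Unicode scalar value, so not every 32-bit pattern is valid. A small sketch using `char::from_u32`; the helper name `is_valid_char_bits` is hypothetical.

```rust
/// Returns true if `bits` is a valid `char`, i.e. a Unicode scalar value.
/// Hypothetical helper wrapping the standard `char::from_u32`.
fn is_valid_char_bits(bits: u32) -> bool {
    char::from_u32(bits).is_some()
}

fn main() {
    // Surrogates (0xD800..=0xDFFF) and values above 0x10FFFF are rejected.
    for bits in [0x41u32, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000] {
        println!("{bits:#X}: valid char = {}", is_valid_char_bits(bits));
    }
}
```

A value arriving from foreign code as a plain 32-bit integer would need this kind of check before it can be treated as a `char`.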

This question is still unresolved.


We probably also want to mention whether on all currently supported platforms the ABI of char is INTEGER or AGGREGATE.

@gnzlbg gnzlbg added the A-layout Topic: Related to data structure layout (`#[repr]`) label Mar 15, 2019
@RalfJung RalfJung added the C-open-question Category: An open question that we should revisit label Aug 14, 2019
@joshtriplett
Member

joshtriplett commented Nov 21, 2019

An interesting thought that came up today: there's an alternative representation that might make sense. Since any valid Unicode character can be represented as at most 4 bytes of UTF-8, we could represent char as a [u8; 4] containing UTF-8.

That would have the interesting property of allowing a function on char to return a &str without allocating memory (rather than the current char::encode_utf8). And &str or String could return a char much more quickly, without having to translate UTF-8 first.

This would have tradeoffs, as well. In exchange for fast character -> UTF-8 and UTF-8 string -> character operations, we'd have to change the implementations of functions that do Unicode character classification and similar, to either convert to a character number first, or provide an optimized mapping from UTF-8 directly to the needed classification (which may be possible in some cases).

I don't know if we'd want to consider this tradeoff, but it seems worth considering. If code can't currently assume the representation is UCS-4 (e.g. transmuting to/from u32), then we can consider such an alternative representation.
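A rough sketch of what such a representation could look like as a library type. `Utf8Char` and its methods are hypothetical names, and zero-padding the unused trailing bytes is an assumption of this sketch, not something proposed above.

```rust
/// Hypothetical type: one Unicode scalar value stored as UTF-8 bytes.
/// Unused trailing bytes are zero (an assumption of this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Utf8Char {
    bytes: [u8; 4],
}

impl Utf8Char {
    fn new(c: char) -> Self {
        let mut bytes = [0u8; 4];
        c.encode_utf8(&mut bytes);
        Utf8Char { bytes }
    }

    /// Length of the encoding in bytes, derived from the leading byte.
    fn len(&self) -> usize {
        match self.bytes[0] {
            0x00..=0x7F => 1,
            0xC0..=0xDF => 2,
            0xE0..=0xEF => 3,
            _ => 4,
        }
    }

    /// Borrows the character as a `&str` without copying or allocating.
    fn as_str(&self) -> &str {
        std::str::from_utf8(&self.bytes[..self.len()]).expect("valid UTF-8 by construction")
    }

    /// Decodes back to a scalar value; this is the extra cost classification would pay.
    fn to_char(&self) -> char {
        self.as_str().chars().next().expect("exactly one character stored")
    }
}

fn main() {
    let c = Utf8Char::new('€');
    println!("{:?} is {} bytes and decodes to {:?}", c.as_str(), c.len(), c.to_char());
}
```

One side effect worth noting: a `[u8; 4]` has alignment 1, so this representation would also bear on the alignment question at the top of the issue.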

@Diggsey

Diggsey commented Nov 21, 2019

@joshtriplett that sounds like a useful type, but IMO the extra complexity, combined with the extra cost of converting to the corresponding scalar, makes it not worthwhile.

However, it might be a cool idea as the basis for a "fixed size string" type similar to fixed-size arrays. For example, with const generics, FixedStr<1> could be equivalent to the type you are describing and have From/Into implementations for char, whilst also being able to be borrowed as a &str.
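A minimal sketch of the fixed-size string idea, under two assumptions that differ from the description above: `N` counts bytes rather than characters (so the single-character case is `FixedStr<4>`), and the length is kept in a separate field. All names here are hypothetical.

```rust
/// Hypothetical fixed-capacity string: up to `N` bytes of UTF-8 stored inline.
/// Assumption: `N` is a byte count, and the length lives in its own field.
#[derive(Clone, Copy, Debug)]
struct FixedStr<const N: usize> {
    bytes: [u8; N],
    len: usize,
}

impl<const N: usize> FixedStr<N> {
    /// Copies `s` into inline storage, or returns `None` if it does not fit.
    fn new(s: &str) -> Option<Self> {
        if s.len() > N {
            return None;
        }
        let mut bytes = [0u8; N];
        bytes[..s.len()].copy_from_slice(s.as_bytes());
        Some(FixedStr { bytes, len: s.len() })
    }

    /// Borrows the contents as a `&str` without allocating.
    fn as_str(&self) -> &str {
        std::str::from_utf8(&self.bytes[..self.len]).expect("valid UTF-8 by construction")
    }
}

/// Any `char` fits in 4 bytes of UTF-8, so this conversion cannot fail.
impl From<char> for FixedStr<4> {
    fn from(c: char) -> Self {
        let mut bytes = [0u8; 4];
        let len = c.encode_utf8(&mut bytes).len();
        FixedStr { bytes, len }
    }
}

fn main() {
    let one = FixedStr::<4>::from('é');
    let word = FixedStr::<8>::new("hello").expect("fits in 8 bytes");
    println!("{:?} {:?}", one.as_str(), word.as_str());
}
```

Because of the extra `len` field, this sketch is larger than 4 bytes even for a single character; a version that derives the length from the bytes themselves would avoid that.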

@JakobDegen JakobDegen added the S-not-opsem Despite being in this repo, this is not primarily a T-opsem question label Jul 25, 2023