Layout of char #101

gnzlbg · 2019-03-15T10:53:27Z

Currently we only say that char has size 4. Is there anything we can say about its alignment beyond "implementation-defined"?
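For reference, a minimal sketch of how one might inspect what the current target actually does. The helper name `char_layout` is made up for illustration. The printed alignment is only an observation about one target, not a guarantee of the kind this issue asks about.

```rust
use std::mem::{align_of, size_of};

/// Returns (size, alignment) of `char` in bytes on the current target.
/// `char_layout` is a hypothetical helper, not a standard library function.
fn char_layout() -> (usize, usize) {
    (size_of::<char>(), align_of::<char>())
}

fn main() {
    let (size, align) = char_layout();
    println!("char: size = {}, align = {}", size, align);
    // Printed side by side with u32 for comparison only.
    println!(
        "u32:  size = {}, align = {}",
        size_of::<u32>(),
        align_of::<u32>()
    );
}
```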

Also @ubsan mentioned:

I'd argue it'd be useful to be ABI compatible with char32_t (note: that type only exists in C++).

People have argued that they shouldn't be ABI compatible, since char32_t doesn't have the correctness guarantees Rust's char does; I would argue that it's the same idea as C-like enums in Rust vs enums in C++.

This question is still unresolved.
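To make the "correctness guarantees" point concrete, here is a small sketch, assuming a char32_t value arrives on the Rust side as a plain u32. The function name `char_from_char32` is hypothetical. A char32_t can hold any 32-bit pattern, while char only admits Unicode scalar values, so a checked conversion is needed at the boundary.

```rust
/// Validate a raw 32-bit value (for example a C++ `char32_t` passed as `u32`).
/// Returns `None` for surrogates and for values above U+10FFFF.
/// `char_from_char32` is a hypothetical name used for illustration.
fn char_from_char32(raw: u32) -> Option<char> {
    char::from_u32(raw)
}

fn main() {
    for &raw in &[0x41u32, 0x1F600, 0xD800, 0x11_0000] {
        match char_from_char32(raw) {
            Some(c) => println!("{:#X} is a valid char: {:?}", raw, c),
            None => println!("{:#X} is not a Unicode scalar value", raw),
        }
    }
}
```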

We probably also want to mention whether on all currently supported platforms the ABI of char is INTEGER or AGGREGATE.


joshtriplett · 2019-11-21T21:28:34Z

An interesting thought that came up today: there's an alternative representation that might make sense. Since any valid Unicode character can be represented as at most 4 bytes of UTF-8, we could represent char as a [u8; 4] containing UTF-8.

That would have the interesting property of allowing a function on char to return a &str without allocating memory (rather than the current char::encode_utf8). And &str or String could return a char much more quickly, without having to translate UTF-8 first.

This would have tradeoffs, as well. In exchange for fast character -> UTF-8 and UTF-8 string -> character operations, we'd have to change the implementations of functions that do Unicode character classification and similar, to either convert to a character number first, or provide an optimized mapping from UTF-8 directly to the needed classification (which may be possible in some cases).
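A rough sketch of what such a representation could look like as a library type. The name `Utf8Char` and its methods are invented here for illustration. This is not how char is implemented, and it ignores niche and validity questions. The length is recovered from the leading byte, so borrowing as &str needs no extra storage, while getting the scalar value back requires a decode step.

```rust
/// Hypothetical UTF-8-backed character: up to 4 bytes of UTF-8, zero padded.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Utf8Char([u8; 4]);

impl Utf8Char {
    fn from_char(c: char) -> Self {
        let mut buf = [0u8; 4];
        c.encode_utf8(&mut buf);
        Utf8Char(buf)
    }

    /// Encoded length in bytes, derived from the UTF-8 leading byte.
    fn len(&self) -> usize {
        match self.0[0] {
            0x00..=0x7F => 1,
            0xC0..=0xDF => 2,
            0xE0..=0xEF => 3,
            _ => 4,
        }
    }

    /// Borrow as `&str` without allocating or re-encoding.
    fn as_str(&self) -> &str {
        std::str::from_utf8(&self.0[..self.len()]).expect("valid UTF-8 by construction")
    }

    /// Recover the scalar value; this is the decode cost discussed above.
    fn to_char(&self) -> char {
        self.as_str().chars().next().expect("exactly one character")
    }
}

fn main() {
    let c = Utf8Char::from_char('é');
    println!("{:?} as str: {}, as char: {:?}", c, c.as_str(), c.to_char());
}
```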

I don't know if we'd want to consider this tradeoff, but it seems worth considering. If code can't currently assume the representation is UCS-4 (e.g. transmuting to/from u32), then we can consider such an alternative representation.
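To illustrate the kind of code that would be affected, a short sketch with hypothetical helper names. The safe conversion keeps working under any representation. The transmute only gives the scalar value if char is stored as that value, which is the assumption in question. The last helper shows the bytes a UTF-8 representation would hold instead.

```rust
/// Representation-independent: always yields the Unicode scalar value.
fn scalar_value(c: char) -> u32 {
    u32::from(c)
}

/// Representation-dependent: assumes `char` is stored as its scalar value.
fn scalar_value_via_transmute(c: char) -> u32 {
    // SAFETY: `char` and `u32` have the same size and every `char` bit
    // pattern is a valid `u32`. The *meaning* of the result depends on the
    // in-memory representation of `char`.
    unsafe { std::mem::transmute::<char, u32>(c) }
}

/// What a zero-padded UTF-8 representation would contain for the same char.
fn utf8_bytes(c: char) -> [u8; 4] {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf);
    buf
}

fn main() {
    let c = 'é';
    println!("scalar value:  {:#X}", scalar_value(c));
    println!("via transmute: {:#X}", scalar_value_via_transmute(c));
    println!("UTF-8 bytes:   {:X?}", utf8_bytes(c));
}
```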

Diggsey · 2019-11-21T22:14:45Z

@joshtriplett that sounds like a useful type, but IMO the extra complexity combined with the extra cost of converting to the corresponding scalar make it not worthwhile.

However, it might be a cool idea as the basis for a "fixed size string" type similar to fixed-size arrays. For example with const generics: FixedStr<1> could be equivalent to the type you are describing and have From/To implementations for char, whilst also being able to be borrowed as a &str.
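A minimal sketch of that idea, with assumptions stated up front. `FixedStr` here is a made-up type, not an existing crate. Stable const generics cannot express a capacity of 4 * N bytes, so in this sketch N counts bytes rather than characters and FixedStr<4> plays the role of the single-character case. It also stores an explicit length, so it is larger than the 4-byte type described above.

```rust
/// Hypothetical fixed-capacity string holding up to N bytes of UTF-8.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct FixedStr<const N: usize> {
    bytes: [u8; N],
    len: usize,
}

impl<const N: usize> FixedStr<N> {
    /// Copies `s` into the buffer, or returns `None` if it does not fit.
    fn new(s: &str) -> Option<Self> {
        if s.len() > N {
            return None;
        }
        let mut bytes = [0u8; N];
        bytes[..s.len()].copy_from_slice(s.as_bytes());
        Some(FixedStr { bytes, len: s.len() })
    }

    /// Borrow the contents as `&str` without allocating.
    fn as_str(&self) -> &str {
        std::str::from_utf8(&self.bytes[..self.len]).expect("built from a valid &str")
    }
}

/// Any `char` fits in 4 bytes of UTF-8.
impl From<char> for FixedStr<4> {
    fn from(c: char) -> Self {
        let mut bytes = [0u8; 4];
        let len = c.encode_utf8(&mut bytes).len();
        FixedStr { bytes, len }
    }
}

fn main() {
    let one = FixedStr::<4>::from('é');
    let word = FixedStr::<8>::new("héllo").expect("fits in 8 bytes");
    println!("{} {}", one.as_str(), word.as_str());
}
```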

gnzlbg added the A-layout Topic: Related to data structure layout (`#[repr]`) label Mar 15, 2019

RalfJung added the C-open-question Category: An open question that we should revisit label Aug 14, 2019

JakobDegen added the S-not-opsem Despite being in this repo, this is not primarily a T-opsem question label Jul 25, 2023
