char8_t and std::u8string support #1914

Closed
LeonF23 opened this issue Jan 28, 2020 · 4 comments
Labels
kind: enhancement/improvement state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated

Comments

@LeonF23

LeonF23 commented Jan 28, 2020

Hello,
when porting our codebase to std=c++2a, compatibility with nlohmann::json breaks, since we use u8"" string literals to assign strings to json objects.
I tried changing ObjectType's StringType to std::u8string, but that does not work because some types are hardcoded to char in json.hpp, for example the serializer's output_adapter_t<char> (line 14587) and others.

So my questions are:
Is there any native support for char8_t planned in the future?
Are there any known workarounds to add char8_t support by hand right now?

Greetings
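
One possible stopgap for the second question, sketched under the assumption that the library's string type stays `std::string`: copy the bytes of a `u8""` literal into a `std::string` before assigning it. The helper name `to_char_string` is hypothetical and not part of nlohmann::json; `__cpp_char8_t` is the standard feature-test macro.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of nlohmann::json.
// Plain char overload: used before C++20 (or with char8_t disabled),
// where u8"" literals are still arrays of char.
inline std::string to_char_string(const char* s) {
    assert(s != nullptr);
    return std::string(s);
}

#if defined(__cpp_char8_t)
// C++20 overload: u8"" literals are arrays of char8_t. Reading those bytes
// through a const char* is permitted, so the UTF-8 code units are copied as-is.
inline std::string to_char_string(const char8_t* s) {
    assert(s != nullptr);
    return std::string(reinterpret_cast<const char*>(s));
}
#endif
```

With this in place, a call such as `to_char_string(u8"text")` yields a `std::string` in both language modes. The cost is a runtime copy per literal, which may or may not matter for a given codebase.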

@nlohmann
Owner

You are right, std::u8string is currently not supported. I currently see no blocker in supporting it, but I cannot promise any timeline for the feature. Any help (and PRs) welcome!

@LeonF23
Author

LeonF23 commented Jan 29, 2020

So, just out of curiosity, I replaced most of the char occurrences with char8_t and set StringType to std::u8string in the single json.hpp header. It builds successfully but produces garbage at runtime. Here is the commit for reference, but please do not try this at home.

I am fairly sure that I introduced undefined behaviour, which is no wonder with such a straightforward type replacement (maybe I also replaced way too much; I am also not very familiar with the codebase). I guess it fails somewhere in the dump(..) / dump_escaped() / decode() area, because calls to output_adapter_t<char8_t>::write_characters(...) sometimes overwrite the beginning of the output adapter's underlying string, even though the function calls StringType::append or StringType::push_back.

So it is somehow possible to build with char8_t enabled, and with a lot of tweaking and tinkering the internal functionality could probably be kept as it is without any UB.
But char8_t support that can actually be released needs a lot more.

StringType::CharT template parameters would be needed so that it is possible to switch between char and char8_t. Some detection mechanism for whether char8_t is available, using compiler-dependent defines, could also be useful.
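
A minimal sketch of what such detection and a character-type template parameter could look like. It assumes the standard `__cpp_char8_t` feature-test macro is enough for detection; all `demo_`/`DEMO_` names are hypothetical and do not exist in json.hpp.

```cpp
#include <string>

// Detection: pick the code unit type that u8"" literals have in this mode.
#if defined(__cpp_char8_t)
    #define DEMO_HAS_CHAR8_T 1
    using demo_u8char_t = char8_t;
#else
    #define DEMO_HAS_CHAR8_T 0
    using demo_u8char_t = char;
#endif

using demo_u8string = std::basic_string<demo_u8char_t>;

// Code written against CharT instead of a hardcoded char:
// wraps a string in double quotes for any character type.
template <typename CharT>
std::basic_string<CharT> demo_quote(const std::basic_string<CharT>& s) {
    std::basic_string<CharT> out;
    out.reserve(s.size() + 2);
    out.push_back(static_cast<CharT>('"'));
    out.append(s);
    out.push_back(static_cast<CharT>('"'));
    return out;
}
```

The same `demo_quote` then serves both `std::string` and `demo_u8string`, which is roughly the property the hardcoded `output_adapter_t<char>` lacks.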

Some mechanism to allow backwards compatibility is needed. That is why this paper exists. It contains some recommendations on how to be compatible in both directions. For example, a user-defined literal "U8(...)" that can switch between char and char8_t would be possible. But I am not quite sure whether that would break something for end users. Those decisions have to be made very carefully.
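
As a rough illustration of the "U8(...)" idea, here is a macro variant rather than a true user-defined literal. It is an assumption about one way to do it, not the paper's exact wording and not anything nlohmann::json provides: the hypothetical `DEMO_U8` always yields a `const char*`, whether `u8""` literals are `char` or `char8_t` arrays.

```cpp
#include <string>

// Hypothetical macro: pastes the u8 prefix onto the literal and views the
// resulting UTF-8 bytes as const char*. In pre-C++20 modes the cast is a
// no-op; in C++20 it converts from const char8_t*.
#define DEMO_U8(literal) reinterpret_cast<const char*>(u8##literal)

// Example use: the result can initialize a std::string in either mode.
inline std::string demo_greeting() {
    return DEMO_U8("hello");
}
```

One known limitation of this approach: `reinterpret_cast` is not allowed in constant expressions, so `DEMO_U8("...")` cannot be used to initialize `constexpr` data.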

I think a lot of the unit tests would have to be rewritten. For example, a
CHECK(j["url"] == "https://github.com/nlohmann/json"); as it is right now would fail because there is no comparison operator between char and char8_t.
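
To make the comparison problem concrete, here is a sketch of a byte-wise comparison between strings of different one-byte character types. Both helpers are hypothetical names for illustration; a real solution inside the library would more likely be an overloaded operator.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: build a std::basic_string from a literal of any
// character type, so the same line works for "..." and u8"...".
template <typename CharT>
std::basic_string<CharT> demo_str(const CharT* s) {
    return std::basic_string<CharT>(s);
}

// Hypothetical helper: compare two strings whose code units are one byte
// wide (char, char8_t, unsigned char) by their raw bytes.
template <typename StrA, typename StrB>
bool demo_equal_bytes(const StrA& a, const StrB& b) {
    static_assert(sizeof(typename StrA::value_type) == 1 &&
                  sizeof(typename StrB::value_type) == 1,
                  "only byte-sized code units can be compared this way");
    if (a.size() != b.size()) {
        return false;
    }
    return std::memcmp(a.data(), b.data(), a.size()) == 0;
}
```

A test could then compare a `std::string` against the result of `demo_str(u8"...")` without caring which character type the literal has in the current language mode.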

And I guess there is way more to think about. I just wanted to write down my thoughts; maybe they help someone else develop a strategy.

@stale

stale bot commented Feb 28, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Feb 28, 2020
@stale stale bot closed this as completed Mar 6, 2020
@patrolez

patrolez commented Aug 18, 2021

Hello,
I was reading Changelog.md, which literally includes the title of this issue, and I believe it should not be closed.

- char8\_t and std::u8string support [\#1914](https://github.com/nlohmann/json/issues/1914)

I think this issue should be renamed to Lack of char8_t and std::u8string support.

The problem will not go away and will keep coming up, since C++20 was released some time ago and use of its features is stabilizing.

As JSON files are by specification encoded in UTF-8, and it is illegal to include BOM bytes at the beginning of files, I would say that std::string and the char type should be used in "not encoding-aware"/"unknown encoding" contexts OR "encoded in ANY encoding" contexts, but I may not be sufficiently aware of the C++ committee's definitions.

On the other hand, I think there is a C++ specification hell when it comes to distinguishing the traits of written human language in digital form:

  • locale/regionalization awareness (related to world politics, local culture/habits/conventions and a locally understandable representation of the true meaning),
  • encoded sequences of bytes (8-bit packs) with guaranteed decodability (related to digital in-memory storage) [I guess that is the intended meaning of char8_t strings],
  • code-points, graphemes, glyphs awareness.

Here, validity at the code-point level is directly required to ensure decodability. I guess this criterion blurs the line between the two layers.

https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling

So I think that, since C++20, std::string should denote bytes with unknown encoding and std::u8string should denote UTF-8 that has been validated at the code-point level, and the code should reflect the same.

As I can read in the FAQ:

- Invalid surrogates (e.g., incomplete pairs such as `\uDEAD`) will yield parse errors.

I think that after passing this step, every key/value JSON string should/could be represented by std::u8string and char8_t strings.
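
A sketch of the "validate first, then use a char8_t string" idea. It is a standalone illustration that follows the well-formed byte ranges from the Unicode standard; it is not how nlohmann::json's parser works internally, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical validator: true if `bytes` is well-formed UTF-8
// (no overlong forms, no surrogates, nothing above U+10FFFF).
inline bool demo_is_valid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;     // continuation bytes expected
        unsigned char lo = 0x80;   // allowed range of the first continuation byte
        unsigned char hi = 0xBF;
        if (b0 <= 0x7F)                    { extra = 0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { extra = 1; }
        else if (b0 == 0xE0)               { extra = 2; lo = 0xA0; }  // rejects overlong forms
        else if (b0 == 0xED)               { extra = 2; hi = 0x9F; }  // rejects surrogates
        else if (b0 >= 0xE1 && b0 <= 0xEF) { extra = 2; }
        else if (b0 == 0xF0)               { extra = 3; lo = 0x90; }  // rejects overlong forms
        else if (b0 >= 0xF1 && b0 <= 0xF3) { extra = 3; }
        else if (b0 == 0xF4)               { extra = 3; hi = 0x8F; }  // rejects > U+10FFFF
        else { return false; }  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        if (n - i - 1 < extra) { return false; }  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if (c < lo || c > hi) { return false; }
            lo = 0x80;  // later continuation bytes use the full range
            hi = 0xBF;
        }
        i += extra + 1;
    }
    return true;
}

#if defined(__cpp_char8_t)
// Only produce a char8_t string once the bytes are known to be valid UTF-8.
inline bool demo_to_u8string(const std::string& bytes,
                             std::basic_string<char8_t>& out) {
    if (!demo_is_valid_utf8(bytes)) { return false; }
    out.assign(bytes.begin(), bytes.end());
    return true;
}
#endif
```

Under this convention, holding a char8_t string would imply that validation has already happened, while a plain `std::string` makes no such promise.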

This is being tested by

SECTION("incorrect sequences")

but it has some sections turned off:

I am not deep enough into Unicode and the strict C++ specs, so maybe we can ask @tahonermann :P
