Update to Unicode 16.0.0 #5

Merged
merged 7 commits into from
Sep 17, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -1,6 +1,6 @@
# Upa IDNA

Upa IDNA is the [Unicode IDNA Compatibility Processing (UTS #46)](https://www.unicode.org/reports/tr46/) C++ library. It is compliant with the latest 15.1.0 version of the Unicode standard.
Upa IDNA is the [Unicode IDNA Compatibility Processing (UTS #46)](https://www.unicode.org/reports/tr46/) C++ library. It is compliant with the 16.0.0 version of the Unicode standard.

This library implements two functions from [UTS #46](https://www.unicode.org/reports/tr46/): [`to_ascii`](https://www.unicode.org/reports/tr46/#ToASCII) and [`to_unicode`](https://www.unicode.org/reports/tr46/#ToUnicode), and two functions from the [WHATWG URL Standard](https://url.spec.whatwg.org/): [`domain_to_ascii`](https://url.spec.whatwg.org/#concept-domain-to-ascii) and [`domain_to_unicode`](https://url.spec.whatwg.org/#concept-domain-to-unicode). It has no dependencies and requires C++11 or later.

11 changes: 4 additions & 7 deletions include/upa/idna/idna.h
@@ -9,7 +9,6 @@
#include "idna_table.h"
#include "iterate_utf.h"
#include "nfc.h"
#include <algorithm>
#include <string>
#include <type_traits> // std::make_unsigned

@@ -119,13 +118,11 @@ inline bool map(std::u32string& mapped, const CharT* input, const CharT* input_e
}
break;
default:
// CP_DISALLOWED
// CP_NO_STD3_MAPPED, CP_NO_STD3_VALID if Option::UseSTD3ASCIIRules
// Starting with Unicode 15.1.0 - don't record an error
// CP_DISALLOWED or
// CP_NO_STD3_VALID if Option::UseSTD3ASCIIRules
// Starting with Unicode 15.1.0 don't record an error
if (is_to_ascii && // to_ascii optimization
((value & util::CP_DISALLOWED_STD3) == 0
? !std::binary_search(std::begin(util::comp_disallowed), std::end(util::comp_disallowed), cp)
: !std::binary_search(std::begin(util::comp_disallowed_std3), std::end(util::comp_disallowed_std3), cp)))
((value & util::CP_DISALLOWED_STD3) == 0 || cp > 0x3E || cp < 0x3C))
return false;
mapped.push_back(cp);
break;
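
The new condition replaces two binary searches with a plain range test on `cp`. The following is a minimal sketch of why the two forms can agree, assuming the three-entry `comp_disallowed_std3` table holds U+003C, U+003D and U+003E (its contents are not shown in this diff). The names `kAssumedTable`, `in_table`, `outside_range` and `self_check` are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// ASSUMPTION: contents of the three-entry table; the diff shows only its size.
static const std::uint8_t kAssumedTable[3] = { 0x3C, 0x3D, 0x3E }; // '<', '=', '>'

// Old style: membership test by binary search in a sorted table.
inline bool in_table(char32_t cp) {
    return cp <= 0xFF &&
        std::binary_search(std::begin(kAssumedTable), std::end(kAssumedTable),
                           static_cast<std::uint8_t>(cp));
}

// New style: the range test as written in the diff.
// True when cp lies outside U+003C..U+003E.
inline bool outside_range(char32_t cp) {
    return cp > 0x3E || cp < 0x3C;
}

// Checks every Unicode code point: "not in table" must equal "outside range".
inline void self_check() {
    for (char32_t cp = 0; cp <= 0x10FFFF; ++cp)
        assert(!in_table(cp) == outside_range(cp));
}
```

Under that assumption the range test needs neither the table nor `<algorithm>`, which is consistent with the removed `#include <algorithm>` in this file.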
4 changes: 1 addition & 3 deletions include/upa/idna/idna_table.h
@@ -24,7 +24,6 @@ const std::uint32_t CP_MAPPED = 0x0002 << 16;
const std::uint32_t CP_DEVIATION = CP_VALID | CP_MAPPED; // 0x0003 << 16
const std::uint32_t CP_DISALLOWED_STD3 = 0x0004 << 16;
const std::uint32_t CP_NO_STD3_VALID = CP_VALID | CP_DISALLOWED_STD3;
const std::uint32_t CP_NO_STD3_MAPPED = CP_MAPPED | CP_DISALLOWED_STD3;
const std::uint32_t MAP_TO_ONE = 0x0008 << 16;
// General_Category=Mark
const std::uint32_t CAT_MARK = 0x0010 << 16;
@@ -57,8 +56,7 @@ extern const std::uint32_t blockData[];
extern const std::uint16_t blockIndex[];
extern const char32_t allCharsTo[];

extern const std::uint32_t comp_disallowed[5];
extern const std::uint32_t comp_disallowed_std3[21];
extern const std::uint8_t comp_disallowed_std3[3];

extern const std::uint8_t asciiData[128];
// END-GENERATED
14 changes: 6 additions & 8 deletions src/idna.cpp
@@ -54,9 +54,11 @@ bool processing_mapped(std::u32string* pdecoded, const std::u32string& mapped, O
// P4 - Convert/Validate
if (label_end - label >= 4 && label[0] == 'x' && label[1] == 'n' && label[2] == '-' && label[3] == '-') {
if (*(label_end - 1) == '-' && label_end - label != 5) {
// For compatibility with ICU, report errors on "xn--" or "xn--ascii-" labels.
// Ignore "xn---", it will fail punycode::decode.
// More info: https://github.com/whatwg/url/issues/760#issuecomment-1462706617
// > 4. Processing - 4. - 3. If (after Punycode decode) the label is empty, or if the label
// > contains only ASCII code points, record that there was an error.
// 1) "xn--" is decoded to empty label
// 2) "xn--ascii-" is decoded to "ascii"
// Note: "xn---" is ignored here, because it will fail punycode::decode
error = true;
// Decode "xn--ascii-" to "ascii" for to_unicode:
if (pdecoded && label_end - label > 5) {
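
The label test in this hunk can be read in isolation. Here is a minimal sketch of the same condition on a `std::string`, assuming a label that ends in `-` after the `xn--` prefix carries only basic (ASCII) code points. The helpers `is_flagged_xn_label` and `ascii_part` are hypothetical and not part of the library.

```cpp
#include <string>

// Hypothetical helper: true for labels the hunk marks as errors, i.e. labels
// that start with "xn--", end with '-', and are not exactly five characters.
inline bool is_flagged_xn_label(const std::string& label) {
    const bool has_ace_prefix =
        label.size() >= 4 && label.compare(0, 4, "xn--") == 0;
    if (!has_ace_prefix)
        return false;
    // "xn---" (length 5) is skipped here; per the comment in the diff,
    // it is left to fail in punycode::decode instead.
    return label.back() == '-' && label.size() != 5;
}

// Hypothetical helper: the text between "xn--" and the trailing '-',
// e.g. "ascii" for "xn--ascii-". Returns an empty string for "xn--".
inline std::string ascii_part(const std::string& label) {
    if (!is_flagged_xn_label(label) || label.size() <= 5)
        return std::string();
    return label.substr(4, label.size() - 5);
}
```

In this sketch `is_flagged_xn_label("xn--")` and `is_flagged_xn_label("xn--ascii-")` are both true, while `"xn---"` is not flagged.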
@@ -303,7 +305,6 @@ bool to_ascii_mapped(std::string& domain, const std::u32string& mapped, Option o

// A2 - Break the result into labels at U+002E FULL STOP
if (mapped.length() == 0) {
// to simplify root label detection
if (detail::has(options, Option::VerifyDnsLength))
ok = false;
} else {
@@ -312,9 +313,6 @@ bool to_ascii_mapped(std::string& domain, const std::u32string& mapped, Option o
std::size_t domain_len = domain.length() + static_cast<std::size_t>(-1);
bool first_label = true;
split(first, last, 0x002E, [&](const char32_t* label, const char32_t* label_end) {
// root is ending empty label
const bool is_root = (label == last);

// join
if (first_label) {
first_label = false;
@@ -342,7 +340,7 @@ bool to_ascii_mapped(std::string& domain, const std::u32string& mapped, Option o
}

// A4 - DNS length restrictions
if (detail::has(options, Option::VerifyDnsLength) && !is_root) {
if (detail::has(options, Option::VerifyDnsLength)) {
const std::size_t label_length = domain.length() - label_start_ind;
// A4_2
if (label_length < 1 || label_length > 63)
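
With the `is_root` exemption removed, the 1 to 63 length check appears to apply to every label, including an empty label after a trailing dot. A minimal standalone sketch of that per-label rule follows, operating on an already-ASCII domain string. The function `labels_have_dns_length` is a hypothetical helper, not the library's API, and it does not cover the total-length limit.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true when every dot-separated label is 1..63 bytes long.
// An empty input, or an empty label (such as the one after a trailing '.'),
// makes the check fail.
inline bool labels_have_dns_length(const std::string& domain) {
    std::size_t start = 0;
    while (true) {
        const std::size_t dot = domain.find('.', start);
        const std::size_t end = (dot == std::string::npos) ? domain.size() : dot;
        const std::size_t label_length = end - start;
        if (label_length < 1 || label_length > 63)
            return false;
        if (dot == std::string::npos)
            return true;   // last label checked
        start = dot + 1;   // continue after the dot
    }
}
```

In this sketch `"example.com"` passes, while `"example.com."` fails because its final label is empty.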