GitHub - railgunlabs/charisma: Secure Unicode® character decoders and encoders.

Charisma is a Unicode® character decoder and encoder library that conforms to the MISRA C:2012 coding standard. It provides functions for decoding and encoding characters safely in UTF-8, UTF-16, and UTF-32 (big or little endian). It can recover from malformed characters, allowing decoding to continue.

Why?

There are many Unicode character decoders floating about, but most are unsafe and do not support recovering from malformed character sequences. Attempting to decode or incorrectly recover from malformed text with these decoders can lead to security vulnerabilities. It's critical for software that processes external text to use a robust character decoder that can detect malformed character sequences.
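To make the risk concrete, the following is a minimal sketch of one classic problem: an overlong UTF-8 encoding. It is not taken from Charisma; the helper names `naive_decode2` and `strict_decode2` are hypothetical and handle only two-byte sequences. A decoder that trusts the lead byte turns the bytes `C0 AF` into `/` (U+002F), which can slip past a filter that only looks for the single byte `2F`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical naive decoder: trusts the lead byte and validates nothing. */
static uint32_t naive_decode2(const uint8_t *s)
{
    assert(s != NULL);
    return ((uint32_t)(s[0] & 0x1Fu) << 6) | (uint32_t)(s[1] & 0x3Fu);
}

/* Hypothetical strict decoder for two-byte sequences only.
   Returns false for overlong forms and bad continuation bytes. */
static bool strict_decode2(const uint8_t *s, size_t len, uint32_t *cp)
{
    assert(s != NULL);
    assert(cp != NULL);
    if (len < 2u) {
        return false;                 /* truncated input */
    }
    if ((s[0] < 0xC2u) || (s[0] > 0xDFu)) {
        return false;                 /* 0xC0 and 0xC1 can only start overlong forms */
    }
    if ((s[1] & 0xC0u) != 0x80u) {
        return false;                 /* second byte is not a continuation byte */
    }
    *cp = ((uint32_t)(s[0] & 0x1Fu) << 6) | (uint32_t)(s[1] & 0x3Fu);
    return true;
}
```

Given the bytes `C0 AF`, the naive helper yields U+002F while the strict helper rejects the input; given `C3 B6`, the strict helper yields U+00F6 (ö).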

Features

Safely decode and encode Unicode characters
Safely recover from malformed character sequences
Supports UTF-8, UTF-16-BE, UTF-16-LE, UTF-32-BE, and UTF-32-LE
Supports both null terminated and non-null terminated strings
Reentrant implementation
Lightweight (< 200 semicolons)
Extensively tested (see below)
No dependencies
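As an illustration of what the big-endian and little-endian variants mean in practice, here is a small sketch of UTF-16 encoding. It is not Charisma's API; `sketch_utf16_encode` is a hypothetical helper written for this example, and it assumes the caller supplies a four-byte output buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write code point cp as UTF-16 bytes into out[4].
   Returns the number of bytes written (2 or 4), or 0 if cp is a
   surrogate or lies beyond U+10FFFF. */
static size_t sketch_utf16_encode(uint32_t cp, bool big_endian, uint8_t out[4])
{
    uint16_t units[2] = {0u, 0u};
    size_t n = 0u;

    assert(out != NULL);
    if ((cp > 0x10FFFFu) || ((cp >= 0xD800u) && (cp <= 0xDFFFu))) {
        return 0u;                    /* not a Unicode scalar value */
    }
    if (cp < 0x10000u) {
        units[0] = (uint16_t)cp;      /* one code unit */
        n = 1u;
    } else {
        uint32_t v = cp - 0x10000u;   /* split 20 bits into a surrogate pair */
        units[0] = (uint16_t)(0xD800u | (v >> 10));
        units[1] = (uint16_t)(0xDC00u | (v & 0x3FFu));
        n = 2u;
    }
    for (size_t k = 0u; k < n; k++) {
        uint8_t hi = (uint8_t)(units[k] >> 8);
        uint8_t lo = (uint8_t)(units[k] & 0xFFu);
        out[2u * k] = big_endian ? hi : lo;       /* byte order is the only difference */
        out[(2u * k) + 1u] = big_endian ? lo : hi;
    }
    return 2u * n;
}
```

For U+1F98A (🦊) the helper writes `D8 3E DD 8A` in big-endian order and `3E D8 8A DD` in little-endian order.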

MISRA C:2012 Compliance

Charisma honors all Required, Mandatory, and Advisory rules defined by MISRA C:2012 and its four amendments. The complete compliance table is documented here.

Ultra Portable

Charisma is ultra portable. It's written in C99 and requires only a few features from libc, which are listed in the following table.

Header	Types	Macros
stdint.h	`uint8_t`, `uint16_t`, `int32_t`, `uint32_t`
stdbool.h		`bool`, `true`, `false`
assert.h		`assert`

How Charisma is Tested

100% branch coverage
Unit tests
Fuzz tests
Static analysis
Valgrind analysis
Code sanitizers (UBSAN, ASAN, and MSAN)
Extensive use of assert() and run-time checks

Example

This code snippet demonstrates how to decode UTF-8 text.

const char8_t *string = "The quick 갈색 🦊 กระโดด över the 怠け者 🐶.";
int32_t index = 0;
for (;;)
{
    uchar cp = 0x0;
    int32_t r = utf8_decode(string, -1, &index, &cp);
    if (r == 0)
    {
        break; // end of string
    }
    else if (r < 0)
    {
        // malformed character sequence
    }

    // Malformed character sequences will be
    // recovered from and returned as U+FFFD.
    printf("U+%04X\n", cp);
}
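For readers curious about what recovery to U+FFFD can look like internally, the following is a simplified, standalone sketch. It is not Charisma's implementation, and `sketch_decode` is a hypothetical function whose return convention (1, 0, negative) merely mirrors the loop above. It also assumes a simple recovery policy: one U+FFFD per rejected sequence, resuming at the first byte that was not consumed, which is less fine-grained than the "maximal subpart" practice recommended by the Unicode Standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu

/* Hypothetical helper: decode one code point from s[*i .. len) and advance *i.
   Returns 1 on success, -1 on a malformed sequence (*cp is set to U+FFFD),
   and 0 at the end of the input. */
static int sketch_decode(const uint8_t *s, size_t len, size_t *i, uint32_t *cp)
{
    uint32_t min = 0u;   /* smallest code point allowed for this length */
    int need = 0;        /* continuation bytes still expected */
    uint8_t b;

    assert((s != NULL) && (i != NULL) && (cp != NULL));
    if (*i >= len) {
        return 0;
    }
    b = s[(*i)++];
    if (b < 0x80u) {
        *cp = b;                                   /* ASCII */
        return 1;
    } else if ((b >= 0xC2u) && (b <= 0xDFu)) {
        *cp = b & 0x1Fu; need = 1; min = 0x80u;
    } else if ((b >= 0xE0u) && (b <= 0xEFu)) {
        *cp = b & 0x0Fu; need = 2; min = 0x800u;
    } else if ((b >= 0xF0u) && (b <= 0xF4u)) {
        *cp = b & 0x07u; need = 3; min = 0x10000u;
    } else {
        *cp = SKETCH_REPLACEMENT;                  /* stray continuation or invalid lead */
        return -1;
    }
    for (; need > 0; need--) {
        if ((*i >= len) || ((s[*i] & 0xC0u) != 0x80u)) {
            *cp = SKETCH_REPLACEMENT;              /* truncated: resume at the offending byte */
            return -1;
        }
        *cp = (*cp << 6) | (uint32_t)(s[(*i)++] & 0x3Fu);
    }
    if ((*cp < min) || (*cp > 0x10FFFFu) || ((*cp >= 0xD800u) && (*cp <= 0xDFFFu))) {
        *cp = SKETCH_REPLACEMENT;                  /* overlong, out of range, or surrogate */
        return -1;
    }
    return 1;
}
```

Feeding the bytes `41 C3 42` through this sketch yields U+0041, then U+FFFD for the truncated two-byte sequence, then U+0042, so one bad byte does not stop decoding of the text that follows it.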

Building

Download the latest release and build with

$ ./configure
$ make
$ make install

or build with CMake.

Related Work

Charisma is focused on decoding and encoding Unicode characters. If you need Unicode algorithms, like normalization or collation, then use Unicorn.

License

Charisma is dual-licensed under the GNU Lesser General Public License version 3 (LGPL v3) and a proprietary license, which can be purchased from Railgun Labs.

The unit tests are not open source. Access to them is granted exclusively to commercial licensees.

Unicode® is a registered trademark of Unicode, Inc. in the United States and other countries. This project is not in any way associated with or endorsed or sponsored by Unicode, Inc. (aka The Unicode Consortium).
