A library to assist in security-testing Unicode-enabled applications. The original intent of putting this together was threefold:
- To provide a reduced set of useful Unicode input to a software fuzzer
- To document historically problematic Unicode character sequences which might negatively affect protocols and Web applications
- To look up mappings for ASCII-equivalent characters
For example, the best-fit and normalization mappings are useful when testing Web applications for cross-site scripting (XSS) or SQL injection (SQLi) vulnerabilities: they supply alternative characters which map back, or transform, to the intended ASCII-encoded input, such as "<" or "'".
Additionally, many problem characters have been pre-defined as a small set, reducing the number of iterations a fuzzer might need to perform.
Major features:
- best fit mappings
- Unicode normalization mappings
- hard-coded Unicode characters useful in fuzzing
For fuzzing applications it includes:
- ill-formed byte sequences
- non-characters
- private use area (PUA)
- unassigned code points
- code points with special meaning such as the BOM and RLO
- half-surrogate values
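To make the categories above concrete, here is a minimal sketch with one sample value per category. The class name, the `Samples` table, and the `Describe` helper are hypothetical and exist only for this illustration; they are not the contents of Fuzzer.cs. The ill-formed sequence is kept as raw bytes because it cannot be represented in a .NET string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FuzzCategorySamples
{
    // Hypothetical sample table for illustration; not part of the UniHax API.
    public static readonly Dictionary<string, string> Samples = new Dictionary<string, string>
    {
        { "non-character", "\uFFFF" },
        { "private use area (PUA)", "\uE000" },
        { "unassigned (value taken from the Fuzzer.cs excerpt)", "\u0FED" },
        { "byte order mark (BOM)", "\uFEFF" },
        { "right-to-left override (RLO)", "\u202E" },
        { "lone low half-surrogate", "\uDEAD" },
    };

    // Ill-formed UTF-8: 0xC0 0xBC is an overlong (non-shortest form) encoding of "<".
    public static readonly byte[] OverlongLessThan = { 0xC0, 0xBC };

    // Formats each UTF-16 code unit of the input as U+XXXX.
    public static string Describe(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        foreach (var entry in Samples)
        {
            Console.WriteLine("{0}: {1}", entry.Key, Describe(entry.Value));
        }
        Console.WriteLine("ill-formed UTF-8: {0}", BitConverter.ToString(OverlongLessThan));
    }
}
```

A fuzzer would typically splice values like these into otherwise valid input and watch how the target parses, stores, or displays them.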
The Windows Forms application loads the UniHax library, mainly to test the best-fit and normalization mappings.
If you enter a single ASCII character, all of its equivalent characters are displayed.
For example, if you're testing a Web application and want equivalents for the "<" character (U+003C), enter it as input and select either "best-fit mapping", which is tied to a charset encoding, or "normalization" equivalents. For this character, the following are best-fits:
- U+003B in the APL-ISO-IR-68 encoding
- U+0014 in the CP424 encoding
- etc...
The following characters map to U+003C under Unicode normalization (compatibility decomposition):
- U+FE64 SMALL LESS-THAN SIGN
- U+FF1C FULLWIDTH LESS-THAN SIGN
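The two normalization mappings listed above can be checked with the .NET built-in `String.Normalize`. This is only a sketch of the underlying behavior, not the UniHax lookup itself; `NormalizationDemo` and `ToNfkc` are names made up for the example.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Applies compatibility composition (NFKC), which folds the
    // "small" and "fullwidth" variants back to their ASCII form.
    public static string ToNfkc(string input)
    {
        return input.Normalize(NormalizationForm.FormKC);
    }

    public static void Main()
    {
        foreach (string candidate in new[] { "\uFE64", "\uFF1C" })
        {
            string folded = ToNfkc(candidate);
            Console.WriteLine("U+{0:X4} -> U+{1:X4}", (int)candidate[0], (int)folded[0]);
        }
        // prints:
        // U+FE64 -> U+003C
        // U+FF1C -> U+003C
    }
}
```

Only the compatibility forms (NFKC and NFKD) perform this folding; under NFC or NFD both characters are left unchanged. So whether a filter can be bypassed depends on which form the application applies, and at what point relative to its input validation.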
The library defines a small set of problematic Unicode characters in Fuzzer.cs, such as the following:
/// <summary>
/// An unassigned code point U+0FED
/// </summary>
public static readonly string uUnassigned = "\u0FED";
/// <summary>
/// An illegal low half-surrogate U+DEAD
/// </summary>
public static readonly string uDEAD = "\uDEAD";
The following method returns those characters as a byte array in any encoding:
public byte[] GetCharacterBytes(string encoding, string character)
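As a rough illustration of what such a method involves, the sketch below encodes a character with the standard `System.Text.Encoding` API. The helper `EncodeCharacter` is hypothetical and is an assumption about the general approach, not the UniHax source.

```csharp
using System;
using System.Text;

public static class EncodeSketch
{
    // Hypothetical helper: resolves an encoding by name and encodes the input.
    // Encoding.GetEncoding throws ArgumentException for an unknown name.
    public static byte[] EncodeCharacter(string encodingName, string character)
    {
        Encoding encoding = Encoding.GetEncoding(encodingName);
        return encoding.GetBytes(character);
    }

    public static void Main()
    {
        // U+FF1C FULLWIDTH LESS-THAN SIGN in three encodings.
        foreach (string name in new[] { "utf-8", "utf-16", "utf-16BE" })
        {
            byte[] bytes = EncodeCharacter(name, "\uFF1C");
            Console.WriteLine("{0}: {1}", name, BitConverter.ToString(bytes));
        }
        // prints:
        // utf-8: EF-BC-9C
        // utf-16: 1C-FF
        // utf-16BE: FF-1C
    }
}
```

One caveat with this naive approach: encodings obtained this way use a replacement fallback by default, so a lone surrogate such as U+DEAD is encoded in UTF-8 as U+FFFD (EF BF BD) and never reaches the target as-is. Emitting the raw surrogate bytes requires building the byte sequence by hand.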
A second method returns any Unicode character as a malformed byte sequence by trimming the last byte of its encoded form:
public byte[] GetCharacterBytesMalformed(string encoding, string character)
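The trimming idea can be sketched as follows. `EncodeTruncated` and `IsWellFormedUtf8` are hypothetical helpers written for this example, and the body is an assumption about the technique described above, not the library's code.

```csharp
using System;
using System.Text;

public static class MalformedSketch
{
    // Hypothetical helper: encodes the input, then drops the final byte.
    // For a character that encodes to a single byte, the result is empty.
    public static byte[] EncodeTruncated(string encodingName, string character)
    {
        byte[] bytes = Encoding.GetEncoding(encodingName).GetBytes(character);
        if (bytes.Length == 0)
        {
            return bytes;
        }
        byte[] trimmed = new byte[bytes.Length - 1];
        Array.Copy(bytes, trimmed, trimmed.Length);
        return trimmed;
    }

    // Uses a strict decoder (throwOnInvalidBytes: true) to test well-formedness.
    public static bool IsWellFormedUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // U+FF1C is EF BC 9C in UTF-8; trimming leaves the incomplete sequence EF BC.
        byte[] malformed = EncodeTruncated("utf-8", "\uFF1C");
        Console.WriteLine(BitConverter.ToString(malformed));   // prints EF-BC
        Console.WriteLine(IsWellFormedUtf8(malformed));        // prints False
    }
}
```

Truncation only yields an ill-formed sequence when the character occupies more than one byte in the chosen encoding, so multi-byte characters are the useful inputs here.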
This project also contains pre-generated data files in the /data folder and a Mapping class (Mapping.cs) which can look up mapping equivalents for the following:
- ASCII-equivalent best-fit mappings across legacy character encodings
- ASCII-equivalent mappings for Unicode normalization types. For example, Web browsers commonly use a form of normalization to keep URL content and host names compatible.
For more on Unicode Normalization see TR15: http://www.unicode.org/reports/tr15/
Unicode-Hax by Chris Weber is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at https://github.com/cweb/unicode-hax.