
A Note on Encodings

Martin Büttner edited this page Mar 2, 2014 · 1 revision

Getting encodings right is always a pain, so this document serves as an overview of the (potentially) different encodings involved when using the library. If you use UTF-8 everywhere on your page and your server, you may safely ignore this.

Side note: URL encoding vs. encoding of a URL

This distinction can cause quite some confusion. URLs are usually URL encoded. In most cases, this means replacing a questionable character with its byte value in the hexadecimal form %xx, such that a space (for instance) becomes %20.

The question is how characters outside of the ASCII range are encoded. The answer is that URL encoding doesn't care about that: it simply encodes bytes. How those bytes are interpreted after applying URL decoding is up to the server, and it could really be any encoding. Say we have a query parameter named foobär (note the umlaut). How this parameter appears in the URL depends on the server's URL encoding:

  • In ISO-8859-1, ä is encoded as 0xE4, so we'd expect foob%E4r.
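  • In UTF-16 (big-endian), ä is encoded as 0x00E4, so we'd expect foob%00%E4r (simplified: strictly speaking, the other characters would take two bytes each as well).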
  • In UTF-8, ä is encoded as 0xC3A4, so we'd expect foob%C3%A4r.
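
The byte sequences in the list above can be reproduced with a few lines of Python. This is only an illustrative sketch using the standard library's urllib.parse.quote; it is not part of the library, and the variable names are made up for the example. For UTF-16, only the ä itself is encoded (big-endian, no byte order mark), since a full UTF-16 string would also spend two bytes on every ASCII letter.

```python
from urllib.parse import quote

name = "foobär"

# Convert the text to bytes in the given encoding, then %xx-encode those bytes.
latin1 = quote(name, encoding="iso-8859-1")
utf8 = quote(name, encoding="utf-8")

# Only the umlaut, as big-endian UTF-16 without a byte order mark.
utf16_umlaut = quote("ä".encode("utf-16-be"))

print(latin1)        # foob%E4r
print(utf8)          # foob%C3%A4r
print(utf16_umlaut)  # %00%E4
```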

So in the remainder of this article, when we refer to URL encoding, we don't mean %xx encoding, but the encoding by which the URL-decoded bytes are interpreted.

Different encodings in the communication

Since there are several different participants in the communication handled by the library (see Page vs. Client vs. Server), there are also (potentially) different encodings involved; after all, each participant could be configured differently. There are five primary places for strings in the communication, which could all be encoded differently:

  • the rendered page
  • URLs to the Client
  • URLs to the Server
  • raw data (JSON) returned by the Server
  • strings stored in \FACTFinder\Data objects within the library

Luckily, the FACT-Finder server consistently uses UTF-8. We decided to keep all data within the library stored in UTF-8 as well (because UTF-8 everywhere!). However, we have no control over the encoding you render your shop pages in or the encoding you have configured for the URLs to your server. These can be configured in the library's XML configuration:

<encoding>
    <pageContent>UTF-8</pageContent>
    <clientUrl>ISO-8859-1</clientUrl>
</encoding>

What will the library do with these configuration values? All URLs intended for the Client (e.g. the URL stored in an Item object) will be converted to the configured clientUrl encoding before (%xx) URL encoding is applied. Note that the fully URL-encoded strings are then still valid UTF-8, because they consist of ASCII characters only (for instance, each resulting % is represented by the single byte 0x25). The library itself never applies the pageContent encoding. However, you can get your hands on an instance of an EncodingConverter, which provides the method encodeContentForPage. You can use this method to prepare all strings from the library when rendering your shop page.
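
The two conversions described above can be sketched in Python as follows. This is a conceptual illustration only, not the library's code: the function names and the two constants are hypothetical stand-ins for the pageContent and clientUrl settings, and the real EncodingConverter may work differently in detail.

```python
from urllib.parse import quote

# Hypothetical stand-ins for the <pageContent> and <clientUrl> settings.
PAGE_CONTENT_ENCODING = "utf-8"
CLIENT_URL_ENCODING = "iso-8859-1"


def encode_for_client_url(utf8_value: bytes) -> str:
    """Convert a UTF-8 value to the Client URL encoding, then %xx-encode it."""
    converted = utf8_value.decode("utf-8").encode(CLIENT_URL_ENCODING)
    # With safe="", everything except unreserved ASCII characters is escaped,
    # so the result is pure ASCII.
    return quote(converted, safe="")


def encode_for_page(utf8_value: bytes) -> bytes:
    """Convert a UTF-8 value to the page content encoding for rendering."""
    return utf8_value.decode("utf-8").encode(PAGE_CONTENT_ENCODING)


stored = "foobär".encode("utf-8")  # strings are kept as UTF-8 internally
url_part = encode_for_client_url(stored)

print(url_part)            # foob%E4r
print(url_part.isascii())  # True
```

Because the URL-encoded result contains only ASCII characters, it reads the same in UTF-8 and in ISO-8859-1, which is why it can be passed around safely regardless of the page encoding.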