Why can't I get my special characters to display properly?

Posted by Dominic Cronin at Sep 23, 2013 10:15 PM | Permalink

Filed under: debugging, Tridion, encoding

Today there was a question (http://tridion.stackexchange.com/q/2891/129) on the Tridion Stack Exchange that referred to putting superscript characters in a non-RTF field in Tridion. I started to answer it there, but soon realised that my answer was for a rather broader question - "How can I figure it out if funky characters don't display properly?"

Assuming you are using UTF-8, the best way to verify the data at each stage is as follows:

Install a good byte editor. I personally use a freeware tool: http://mh-nexus.de/en/
Understand how UTF-8 works and be prepared to decode characters with a pencil and a sheet of paper. Refer to http://www.ietf.org/rfc/rfc3629.txt and particularly the table in section 3 (UTF-8 definition). This way you can translate UTF-8 bytes to Unicode code points.
Use the code charts at www.unicode.org/charts to verify the character in Unicode.

Tridion itself treats everything as Unicode, and will be able to cope with pretty much any character, including those in the Klingon language (which the ConScript Unicode Registry places in the Private Use Area at U+F8D0..U+F8FF), but good luck if you don't have a Klingon font installed.

So taking the trademark symbol as an example, and using the information available at https://en.wikipedia.org/wiki/Trademark_symbol...

Open Notepad, hold down Alt and type 0153 on the numeric keypad. Save the file as UTF-8 and open it up with your byte editor. N.B. Don't ever copy/paste interesting characters, because the Windows clipboard will try to help - which is not what you want when debugging.

You should see the following three bytes (possibly preceded by a byte order mark, which in UTF-8 is EF BB BF - if in doubt, surround your TM with known characters):

E2 84 A2
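If you'd rather cross-check what the byte editor shows, a few lines of Python will do it. This is only a sketch: the helper name hex_dump is my own invention, and I'm assuming that "utf-8-sig" is a fair stand-in for an editor that writes a BOM.

```python
# The trade mark sign surrounded by known characters, as suggested above.
text = "a\u2122b"

plain = text.encode("utf-8")         # UTF-8 without a byte order mark
with_bom = text.encode("utf-8-sig")  # UTF-8 with the BOM prepended


def hex_dump(data: bytes) -> str:
    """Hypothetical helper: format bytes the way a byte editor shows them."""
    return " ".join(f"{b:02X}" for b in data)


print(hex_dump(plain))     # 61 E2 84 A2 62
print(hex_dump(with_bom))  # EF BB BF 61 E2 84 A2 62
```

The 61 and 62 on either side are the "a" and "b", which makes it easy to see where the interesting bytes start and stop.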

Open up Windows "calc" in programmers' mode and set the word length to DWord. Flipping between Hex and Binary, your three bytes end up looking like this:

11100010 10000100 10100010
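If you don't have calc to hand, the same hex-to-binary flip can be sketched in Python (the variable names here are just for illustration):

```python
# The three bytes seen in the byte editor.
data = bytes.fromhex("E2 84 A2")

# Show each byte as eight binary digits, padded with leading zeros.
bits = " ".join(f"{b:08b}" for b in data)
print(bits)  # 11100010 10000100 10100010
```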

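Referring back to http://www.ietf.org/rfc/rfc3629.txt, you can strip the marker bits (the leading 1110 of the first byte and the leading 10 of each following byte) and join what remains into the bit sequence: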

0010000100100010

which, of course, you immediately feed back into calc to translate it to the hex value 0x2122.
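The pencil-and-paper decoding above can be sketched in Python as a sanity check. The function name decode_three_byte is made up for this example, and it only handles the three-byte pattern from the RFC table (1110xxxx 10xxxxxx 10xxxxxx); it makes no attempt to reject overlong forms or surrogates, so treat it as a learning aid rather than a real decoder.

```python
def decode_three_byte(b0: int, b1: int, b2: int) -> int:
    """Decode one three-byte UTF-8 sequence into a code point."""
    if (b0 & 0xF0) != 0xE0:
        raise ValueError("lead byte does not start with 1110")
    for b in (b1, b2):
        if (b & 0xC0) != 0x80:
            raise ValueError("continuation byte does not start with 10")
    # Keep 4 payload bits from the lead byte and 6 from each continuation
    # byte, then join them into a single 16-bit value.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte(0xE2, 0x84, 0xA2)
print(f"{code_point:016b}")   # 0010000100100010
print(f"U+{code_point:04X}")  # U+2122
```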

You can then look in the relevant Unicode chart... searching at http://www.unicode.org/charts/#symbols for 2122, we end up at http://www.unicode.org/charts/PDF/U2100.pdf and discover that this code point is "2122 ™ TRADE MARK SIGN".
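The chart lookup can also be cross-checked from Python's standard library, which ships a copy of the Unicode character database. A minimal sketch (bear in mind that the database version depends on your Python version, so very new characters may be missing):

```python
import unicodedata

# Look up the official character name for code point U+2122.
name = unicodedata.name(chr(0x2122))
print(name)  # TRADE MARK SIGN
```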

If, by this point, you can't see a trade mark sign, it's probably because you haven't correctly told the browser what encoding you've used for the bytes you've sent, or because the font you are using doesn't know how to display that character.
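To see what a wrong encoding declaration does, here is a small sketch that reads the same three bytes twice: once as UTF-8, and once as Windows-1252. The choice of Windows-1252 is my assumption about a typical wrong guess; your browser might pick something else. I print code points rather than the characters themselves so that the output doesn't depend on your console's own encoding.

```python
raw = bytes.fromhex("E2 84 A2")

as_utf8 = raw.decode("utf-8")     # the intended reading: one character
as_cp1252 = raw.decode("cp1252")  # the wrong reading: three characters


def code_points(s: str) -> str:
    """Hypothetical helper: list the code points in a string."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


print(code_points(as_utf8))    # U+2122
print(code_points(as_cp1252))  # U+00E2 U+201E U+00A2
```

Those three code points are "â", "„" and "¢", which is the familiar garbage you get on a page when a trade mark sign has been sent as UTF-8 but read as Windows-1252.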

You can also use this process in the opposite direction, going from a code point to a byte sequence.
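Going in that opposite direction can be sketched the same way. Again, encode_three_byte is a name I've made up, and the function only covers code points that need exactly three bytes (U+0800 to U+FFFF, excluding the surrogate range).

```python
def encode_three_byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point does not need exactly three bytes")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])


encoded = encode_three_byte(0x2122)
print(" ".join(f"{b:02X}" for b in encoded))  # E2 84 A2
```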

Understanding how this all works is essential to your peace of mind when dealing with encoding and character display issues.

