Thursday, January 22, 2009

Character Encoding Rules of %54%68%75%6D%62

Most software engineers are aware that there are some Security Issues that have something to do with character encoding (can you see the ceremonial waving of hands?). In a nutshell, the issues stem from the fact that people want software to make decisions based on strings of characters (e.g. when the password matches, let the user log on; allow anonymous FTP when the path to the requested file is in one of these directories; disallow these HTML tags in blog posts), but computers think in bits. When there is a single, well-defined bi-directional mapping (aka encoding in the character set sense) between characters and their bit representations, there's no problem. But when there are multiple character sets to choose from (which is definitely true here in the Internet age), the same string of bits could mean several different strings of characters and vice versa. The same string might even be a different size (in bytes, characters, or both) in different encodings. Therefore, bits that don't match could represent characters that do, or vice versa, and unless the programmer took precautions, the software is going to make the wrong decision some of the time.
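To make that concrete, here is a minimal Python sketch. The byte values and the choice of UTF-8 and Latin-1 are mine, picked only for illustration. It reads one string of bits as two different strings of characters, then writes one character as two different strings of bits.

```python
# The same two bytes, read under two different character set encodings.
raw = b"\xc3\xa9"
as_utf8 = raw.decode("utf-8")      # one character: U+00E9 (e with acute accent)
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3 followed by U+00A9

# The same single character, written under two different encodings.
text = "\u00e9"
utf8_bytes = text.encode("utf-8")      # b"\xc3\xa9", 2 bytes
latin1_bytes = text.encode("latin-1")  # b"\xe9", 1 byte

print(len(as_utf8), len(as_latin1))        # 1 2
print(len(utf8_bytes), len(latin1_bytes))  # 2 1
```

A byte-for-byte comparison of utf8_bytes and latin1_bytes says "no match" even though both hold the same character, which is exactly the wrong decision described above.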

Let's be the good guy first. What precautions should we take?

1. As Joel Spolsky succinctly puts it, "It does not make sense to have a string without knowing what encoding it uses." Know the encoding each string uses, including endianness if applicable.
2. Ensure each string is legal in its encoding. When you convert from one encoding to another, check twice: once before conversion (to avoid confusing the conversion algorithm) and once after (to avoid confusing the string consumer).
3. Convert strings to the same canonical encoding before comparison. Canonical may be more restrictive than legal.
4. Take encoding into account when you calculate the length of a string or the size of a string or character, use regular expressions, access the nth character in a string, or otherwise manipulate strings and characters.
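Precautions 2 through 4 might look something like the Python sketch below. The helper names decode_strict and canonical are hypothetical, and using Unicode NFC as "the" canonical form is my assumption for the example; your application may need something stricter.

```python
import unicodedata

def decode_strict(raw: bytes, encoding: str) -> str:
    # Precaution 2: refuse byte sequences that are illegal in the declared encoding.
    return raw.decode(encoding, errors="strict")

def canonical(text: str) -> str:
    # Precaution 3: NFC is assumed here as the canonical form.
    return unicodedata.normalize("NFC", text)

composed = "\u00e9"     # e with acute accent as a single code point
decomposed = "e\u0301"  # plain e followed by a combining acute accent

same_before = composed == decomposed                        # False
same_after = canonical(composed) == canonical(decomposed)   # True

# Precaution 4: length depends on what you are counting.
code_points = len(decomposed)                 # 2
utf8_size = len(decomposed.encode("utf-8"))   # 3

# b"\xc0\xaf" is an overlong (illegal) UTF-8 spelling of "/".
try:
    decode_strict(b"\xc0\xaf", "utf-8")
    overlong_rejected = False
except UnicodeDecodeError:
    overlong_rejected = True
```

The two spellings of the accented e only compare equal after both sides have been put into the same canonical form, and the overlong slash never makes it past the legality check.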

Sound simple? Great, because I left out this one itsy bitsy teeny weeny little detail: character escaping. Character escaping (like the percent-encoding used in URIs) is another kind of encoding, usually used to separate the control and data channels in a mixed communication medium. Unlike the character set encodings (like UTF-8) discussed above, for which one character set per string is the limit, character escaping can nest and can also cover only parts of the string. For example, we could have an HTML document encoded in UTF-8, and it could contain a URI (with the appropriate portions percent-encoded) which has been base64 encoded and stashed in a hidden form field. All the same precautions apply, but to be safe (or for that matter correct), we need to rephrase precaution #1 and add a couple more.

1. Know all the encodings each string uses (or should use), including endianness if applicable, and the order in which they were (or should be) applied.
5. Change character set encoding whenever you like, but decode character escapes in the reverse of the order in which they were applied.
6. Extract the string to be decoded from the surrounding context before unescaping characters (lest you mix the control and data channels).

In our example, to get the original URI components back, we would need to grab the value of the form field, decode it using base64, then extract each URI component and separately percent-unescape it.
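Here is a rough Python sketch of that round trip, using only the standard library. The URI, the parameter name q, and the value a&b=c are made up for the example. It also shows what goes wrong if you unescape before extracting the component.

```python
import base64
from urllib.parse import quote, unquote, urlsplit, parse_qsl

# Encoding side: percent-escape the component, build the URI, then base64 it.
value = "a&b=c"  # hypothetical query value containing URI control characters
uri = "http://example.com/search?q=" + quote(value, safe="")
field = base64.b64encode(uri.encode("ascii")).decode("ascii")

# Decoding side: undo the escapes in reverse order (precaution 5) ...
decoded_uri = base64.b64decode(field).decode("ascii")
query = urlsplit(decoded_uri).query        # "q=a%26b%3Dc"
# ... and extract the component before unescaping it (precaution 6).
name, _, escaped = query.partition("=")
recovered = unquote(escaped)               # "a&b=c"

# Wrong way: unescape the whole URI first, then split it.
wrong_query = urlsplit(unquote(decoded_uri)).query  # "q=a&b=c"
wrong_pairs = parse_qsl(wrong_query)                # [("q", "a"), ("b", "c")]
```

In the wrong version, the data characters & and = have leaked into the control channel, and one parameter has quietly become two.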

Is that all? No such luck! Unless you make a habit of talking to yourself, the strings you're manipulating are either coming from somewhere else or going somewhere else. Perhaps they are destined for a human, via some display mechanism. Our precautions to this point cover computers, which distinguish characters by comparing bits, but humans distinguish characters based on appearance. Some character sets (including all flavors of Unicode) contain more than one character with the same or very similar glyph. So, we need another precaution:

7. Display different characters using dissimilar glyphs.
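As a small illustration (the strings are my own, and this is one possible approach rather than a recommendation), here are two names that render nearly identically in most fonts but differ by one character, along with a crude way to display them so a human can tell them apart.

```python
import unicodedata

latin = "paypal"
mixed = "p\u0430ypal"  # second letter is U+0430, not the Latin letter a

equal_to_computer = latin == mixed  # False, though the glyphs look alike

# Name each character so the difference is visible to a reader.
second_char_name = unicodedata.name(mixed[1])  # "CYRILLIC SMALL LETTER A"

# One crude display option: show every non-ASCII character as an escape.
display_form = mixed.encode("ascii", "backslashreplace").decode("ascii")
print(display_form)  # p\u0430ypal
```

Escaping everything outside ASCII is obviously too blunt for text that is legitimately written in other scripts, so treat it as a demonstration of the problem rather than a fix.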

On the other hand, perhaps the other party to your communication is another computer. In this situation, rule of thumb #1 effectively means you know what character set and character escaping the other party is using. Or at least what character set they say they are using. Or what character set they are supposed to be using. Or what character set it seems like they are using, based on the bits in the string and the information you expect it to contain.

This brings us neatly to our attacker (and to the end of the precautions a good guy ought to take), so next week, let's switch hats and be evil for a change.

--Brenda
