Friday, January 30, 2009

UTF-8 Characters to Watch Out For

Let's say your target application makes a decision using some string you control, and you'd like it to do something more interesting than the developer meant it to do. You know the application processes UTF-8, and you know UTF-8 decoders are notorious for implementation-specific behavior. Now what?

The first thing you need is the Unicode code point that corresponds to the character you want to use. It consists of 4-6 hex digits and usually looks like U+HHHH, or maybe U+HHHHHH. Your mission is to create a UTF-8-like string of bytes that, when decoded, will result in this code point.

In theory, there is only one such string of bits, which you would create using the algorithm in RFC 3629 (or more likely, by using a Unicode library). To borrow from the RFC, this string of bytes looks like this:

Character number range (hexadecimal)

Significant Bits

UTF-8 octet sequence (binary)

0000 0000-0000 007F

up to 7

0xxxxxxx

0000 0080-0000 07FF

8-11

110xxxxx 10xxxxxx

0000 0800-0000 FFFF

12-16

1110xxxx 10xxxxxx 10xxxxxx

0001 0000-0010 FFFF

17-21

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


Basically, the number of leading 1s at the beginning of the first byte indicates how many total bytes are used to represent this character, and the bits in the character are filled into the positions marked x, from least to most significant.

In practice, several other representations might also work, depending on the exact decoding algorithm your target uses:
  1. The alternative 1-byte representation.
  2. Overlong (but otherwise legal) representations.
  3. Really overlong representations.
  4. Alternative succeeding byte representations.

Alternative 1-Byte Representation

The alternative 1-byte representation looks like this:

Character number range (hexadecimal)

Significant Bits

UTF-8 octet sequence (binary)

0000 0000-0000 003F

up to 6

10xxxxxx


This representation is illegal; in legal UTF-8, such a byte (beginning with the bits 10) can only appear as the second or following byte in a multi-byte sequence. But maybe the decoder assumes a new character begins here, and then counts the number of leading 1s (1) to decide how many total bytes this character takes. This leaves you with only 6 bits of actual data, so it will only work on code points U+0000-U+003F. This does not include the alphabet, but does include interesting tidbits like NUL, CR, LF, <, >, and several forms of quotes.

Overlong Representations

To get overlong representations, ignore the character number range column in the original conversion table and use all the octet sequences that have enough least significant bits to hold the character you want:

Character number range (hexadecimal)

Significant Bits

UTF-8 octet sequence (binary)

0000 0000-0000 007F

up to 7

0xxxxxxx

110xxxxx 10xxxxxx

1110xxxx 10xxxxxx 10xxxxxx

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

0000 0080-0000 07FF

8-11

110xxxxx 10xxxxxx

1110xxxx 10xxxxxx 10xxxxxx

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

0000 0800-0000 FFFF

12-16

1110xxxx 10xxxxxx 10xxxxxx

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


These representations would be legal except that UTF-8 specifically disallows overlong representations. But maybe the decoder is mindlessly copying the significant bits, not examining the bits for pesky details like legality.

Really Overlong Representations

Why stop at 4 bytes? Continue the pattern with more bytes:

Character number range (hexadecimal)

Significant Bits

UTF-8 octet sequence (binary)

0000 0000-
0010 FFFF

up to 21

111110nn 10nnnxxx 10xxxxxx 10xxxxxx 10xxxxxx

1111110n 10nnnnnn 10nnnxxx 10xxxxxx 10xxxxxx 10xxxxxx

11111110 10nnnnnn 10nnnnnn 10nnnxxx 10xxxxxx 10xxxxxx 10xxxxxx

...


These representations are illegal; legal UTF-8 is at most 4 bytes long and contains at most 21 bits of data. But maybe the decoder will mindlessly decode one byte for each leading 1 in the first byte and perhaps even write those extra ns back over preceding bits in the result string, or whatever else happens to precede this character in memory.

Alternative Succeeding Byte Representations

To get an alternative succeeding byte representation, just replace the first 2 bits of the second & following bytes (which are supposed to be 10):

Character number range (hexadecimal)

Significant Bits

UTF-8 octet sequence (binary)

0000 0080-0000 07FF

8-11

110xxxxx 00xxxxxx

110xxxxx 01xxxxxx

110xxxxx 11xxxxxx

0000 0800-0000 FFFF

12-16

1110xxxx 00xxxxxx 00xxxxxx

1110xxxx 00xxxxxx 01xxxxxx

1110xxxx 00xxxxxx 11xxxxxx

1110xxxx 01xxxxxx 00xxxxxx

1110xxxx 01xxxxxx 01xxxxxx

1110xxxx 01xxxxxx 11xxxxxx

1110xxxx 11xxxxxx 00xxxxxx

1110xxxx 11xxxxxx 01xxxxxx

1110xxxx 11xxxxxx 11xxxxxx

0001 0000-0010 FFFF

17-21

11110xxx 00xxxxxx 00xxxxxx 00xxxxxx

11110xxx 00xxxxxx 01xxxxxx 00xxxxxx

11110xxx 00xxxxxx 11xxxxxx 00xxxxxx

11110xxx 01xxxxxx 00xxxxxx 00xxxxxx

11110xxx 01xxxxxx 01xxxxxx 00xxxxxx

11110xxx 01xxxxxx 11xxxxxx 00xxxxxx

11110xxx 11xxxxxx 00xxxxxx 00xxxxxx

11110xxx 11xxxxxx 01xxxxxx 00xxxxxx

11110xxx 11xxxxxx 11xxxxxx 00xxxxxx

11110xxx 00xxxxxx 00xxxxxx 01xxxxxx

11110xxx 00xxxxxx 01xxxxxx 01xxxxxx

11110xxx 00xxxxxx 11xxxxxx 01xxxxxx

11110xxx 01xxxxxx 00xxxxxx 01xxxxxx

11110xxx 01xxxxxx 01xxxxxx 01xxxxxx

11110xxx 01xxxxxx 11xxxxxx 01xxxxxx

11110xxx 11xxxxxx 00xxxxxx 01xxxxxx

11110xxx 11xxxxxx 01xxxxxx 01xxxxxx

11110xxx 11xxxxxx 11xxxxxx 01xxxxxx

11110xxx 00xxxxxx 00xxxxxx 11xxxxxx

11110xxx 00xxxxxx 01xxxxxx 11xxxxxx

11110xxx 00xxxxxx 11xxxxxx 11xxxxxx

11110xxx 01xxxxxx 00xxxxxx 11xxxxxx

11110xxx 01xxxxxx 01xxxxxx 11xxxxxx

11110xxx 01xxxxxx 11xxxxxx 11xxxxxx

11110xxx 11xxxxxx 00xxxxxx 11xxxxxx

11110xxx 11xxxxxx 01xxxxxx 11xxxxxx

11110xxx 11xxxxxx 11xxxxxx 11xxxxxx


As you can see, there are a lot of combinations (and of course you'll want to try this technique on the overlong representations as well). These representations are illegal; every succeeding byte in a multi-byte representation must start with the bits 10. But maybe the decoder knows where a character starts and just unpacks the significant bits from each succeeding byte without checking the first 2 bits.

Have fun identifying applications that don't follow the character encoding rules of thumb, and don't get into trouble!

--Brenda

(If you're looking for more do-it-yourself entertainment along these lines, you might want to check out ligatures, surrogate pairs and astral planes, try representing code points that are out-of-range for UTF-8 using really overlong representations, or figure out how to design UTF-8 strings that decode differently depending whether the decoder is moving backwards or forwards in the string.)

0 comments: