The first thing you need is the Unicode code point that corresponds to the character you want to use. It consists of 4-6 hex digits and usually looks like U+HHHH, or maybe U+HHHHHH. Your mission is to create a UTF-8-like string of bytes that, when decoded, will result in this code point.
In theory, there is only one such string of bits, which you would create using the algorithm in RFC 3629 (or more likely, by using a Unicode library). To borrow from the RFC, this string of bytes looks like this:
| Character number range (hexadecimal) | Significant Bits | UTF-8 octet sequence (binary) |
| 0000 0000-0000 007F | up to 7 | 0xxxxxxx |
| 0000 0080-0000 07FF | 8-11 | 110xxxxx 10xxxxxx |
| 0000 0800-0000 FFFF | 12-16 | 1110xxxx 10xxxxxx 10xxxxxx |
| 0001 0000-0010 FFFF | 17-21 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Basically, the number of leading 1s at the beginning of the first byte indicates how many total bytes are used to represent this character, and the bits in the character are filled into the positions marked x, from least to most significant.
In practice, several other representations might also work, depending on the exact decoding algorithm your target uses:
- The alternative 1-byte representation.
- Overlong (but otherwise legal) representations.
- Really overlong representations.
- Alternative succeeding byte representations.
Alternative 1-Byte Representation
The alternative 1-byte representation looks like this:
| Character number range (hexadecimal) | Significant Bits | UTF-8 octet sequence (binary) |
| 0000 0000-0000 003F | up to 6 | 10xxxxxx |
This representation is illegal; in legal UTF-8, such a byte (beginning with the bits 10) can only appear as the second or following byte in a multi-byte sequence. But maybe the decoder assumes a new character begins here, and then counts the number of leading 1s (1) to decide how many total bytes this character takes. This leaves you with only 6 bits of actual data, so it will only work on code points U+0000-U+003F. This does not include the alphabet, but does include interesting tidbits like NUL, CR, LF, <, >, and several forms of quotes.
Overlong Representations
To get overlong representations, ignore the character number range column in the original conversion table and use all the octet sequences that have enough least significant bits to hold the character you want:
| Character number range (hexadecimal) | Significant Bits | UTF-8 octet sequence (binary) |
| 0000 0000-0000 007F | up to 7 | 0xxxxxxx 110xxxxx 10xxxxxx 1110xxxx 10xxxxxx 10xxxxxx 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
| 0000 0080-0000 07FF | 8-11 | 110xxxxx 10xxxxxx 1110xxxx 10xxxxxx 10xxxxxx 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
| 0000 0800-0000 FFFF | 12-16 | 1110xxxx 10xxxxxx 10xxxxxx 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
These representations would be legal except that UTF-8 specifically disallows overlong representations. But maybe the decoder is mindlessly copying the significant bits, not examining the bits for pesky details like legality.
Really Overlong Representations
Why stop at 4 bytes? Continue the pattern with more bytes:
| Character number range (hexadecimal) | Significant Bits | UTF-8 octet sequence (binary) |
| 0000 0000- | up to 21 | 111110nn 10nnnxxx 10xxxxxx 10xxxxxx 10xxxxxx 1111110n 10nnnnnn 10nnnxxx 10xxxxxx 10xxxxxx 10xxxxxx 11111110 10nnnnnn 10nnnnnn 10nnnxxx 10xxxxxx 10xxxxxx 10xxxxxx ... |
These representations are illegal; legal UTF-8 is at most 4 bytes long and contains at most 21 bits of data. But maybe the decoder will mindlessly decode one byte for each leading 1 in the first byte and perhaps even write those extra ns back over preceding bits in the result string, or whatever else happens to precede this character in memory.
Alternative Succeeding Byte Representations
To get an alternative succeeding byte representation, just replace the first 2 bits of the second & following bytes (which are supposed to be 10):
| Character number range (hexadecimal) | Significant Bits | UTF-8 octet sequence (binary) |
| 0000 0080-0000 07FF | 8-11 | 110xxxxx 00xxxxxx 110xxxxx 01xxxxxx 110xxxxx 11xxxxxx |
| 0000 0800-0000 FFFF | 12-16 | 1110xxxx 00xxxxxx 00xxxxxx 1110xxxx 00xxxxxx 01xxxxxx 1110xxxx 00xxxxxx 11xxxxxx 1110xxxx 01xxxxxx 00xxxxxx 1110xxxx 01xxxxxx 01xxxxxx 1110xxxx 01xxxxxx 11xxxxxx 1110xxxx 11xxxxxx 00xxxxxx 1110xxxx 11xxxxxx 01xxxxxx 1110xxxx 11xxxxxx 11xxxxxx |
| 0001 0000-0010 FFFF | 17-21 | 11110xxx 00xxxxxx 00xxxxxx 00xxxxxx 11110xxx 00xxxxxx 01xxxxxx 00xxxxxx 11110xxx 00xxxxxx 11xxxxxx 00xxxxxx 11110xxx 01xxxxxx 00xxxxxx 00xxxxxx 11110xxx 01xxxxxx 01xxxxxx 00xxxxxx 11110xxx 01xxxxxx 11xxxxxx 00xxxxxx 11110xxx 11xxxxxx 00xxxxxx 00xxxxxx 11110xxx 11xxxxxx 01xxxxxx 00xxxxxx 11110xxx 11xxxxxx 11xxxxxx 00xxxxxx
11110xxx 00xxxxxx 00xxxxxx 01xxxxxx 11110xxx 00xxxxxx 01xxxxxx 01xxxxxx 11110xxx 00xxxxxx 11xxxxxx 01xxxxxx 11110xxx 01xxxxxx 00xxxxxx 01xxxxxx 11110xxx 01xxxxxx 01xxxxxx 01xxxxxx 11110xxx 01xxxxxx 11xxxxxx 01xxxxxx 11110xxx 11xxxxxx 00xxxxxx 01xxxxxx 11110xxx 11xxxxxx 01xxxxxx 01xxxxxx 11110xxx 11xxxxxx 11xxxxxx 01xxxxxx
11110xxx 00xxxxxx 00xxxxxx 11xxxxxx 11110xxx 00xxxxxx 01xxxxxx 11xxxxxx 11110xxx 00xxxxxx 11xxxxxx 11xxxxxx 11110xxx 01xxxxxx 00xxxxxx 11xxxxxx 11110xxx 01xxxxxx 01xxxxxx 11xxxxxx 11110xxx 01xxxxxx 11xxxxxx 11xxxxxx 11110xxx 11xxxxxx 00xxxxxx 11xxxxxx 11110xxx 11xxxxxx 01xxxxxx 11xxxxxx 11110xxx 11xxxxxx 11xxxxxx 11xxxxxx |
As you can see, there are a lot of combinations (and of course you'll want to try this technique on the overlong representations as well). These representations are illegal; every succeeding byte in a multi-byte representation must start with the bits 10. But maybe the decoder knows where a character starts and just unpacks the significant bits from each succeeding byte without checking the first 2 bits.
Have fun identifying applications that don't follow the character encoding rules of thumb, and don't get into trouble!
--Brenda
(If you're looking for more do-it-yourself entertainment along these lines, you might want to check out ligatures, surrogate pairs and astral planes, try representing code points that are out-of-range for UTF-8 using really overlong representations, or figure out how to design UTF-8 strings that decode differently depending whether the decoder is moving backwards or forwards in the string.)
0 comments:
Post a Comment