Update: 4/21/09: This is another good link, on UTF-8 bitmasks, multiple URL encodings for the same character, multilevel decoding, and hiding JavaScript in encoded URLs:
http://www.technicalinfo.net/papers/URLEmbeddedAttacks.html
If you worry much about security, this is a very interesting read.
----
If you work with UTF-8, one of the variable-width encodings for Unicode, it can be a pain to decode a sequence. I prefer to think of it in terms of bit masks rather than algebraically, but Wikipedia only goes up to the 4-byte sequence.
This is my shorthand for up to 31 bits of source data expressed in six bytes, not that I can imagine when you'd need that.
Multi-byte sequences have an initial byte whose uppermost bits tell you how long the sequence is, and then the rest of the bits are data. Each subsequent extended byte always has a 2-bit header of "10".
Basically, each time you extend the sequence, the header bit sequence in the lead byte grows by 1 bit and the new byte contributes 6 bits, so each additional byte adds a net of only 5 bits of storage. (The first extension, from 1 byte to 2, nets only 4 bits, because the lead byte's header grows from "0" to "110".)
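The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name payload_bits is my own, and it assumes the original 1-6 byte scheme described in this post.

```python
def payload_bits(n_bytes: int) -> int:
    """Data bits carried by an n-byte sequence in the 1-6 byte scheme."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("sequence length must be 1..6")
    if n_bytes == 1:
        return 7  # a single "0" marker bit leaves 7 data bits
    lead_header = n_bytes + 1        # n one-bits followed by a zero
    lead_data = 8 - lead_header      # what is left in the lead byte
    return lead_data + 6 * (n_bytes - 1)  # each "10" byte carries 6 bits


for n in range(1, 7):
    bits = payload_bits(n)
    print(n, bits, 2 ** bits - 1)
# 1 7 127
# 2 11 2047
# 3 16 65535
# 4 21 2097151
# 5 26 67108863
# 6 31 2147483647
```

The last column is the largest value each length can hold.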
You might find these pages less cryptic:
http://en.wikipedia.org/wiki/Utf-8#Description
http://www1.tip.nl/~t876506/utf8tbl.html
Legend for each Group
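Range (bits, decimal range, UTF-8 sequence length)
Marker bits for each byte (decimal value of the marker with the data bits zeroed)
Bitmasks for the data bits of each byte (a comma marks each 8-bit boundary of the decoded value, counted from the RIGHT)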
1-7 Bits [0 - 127] 1 byte
0 (zero marker)
0111-1111
8-11 Bits [128 - 2,047] 2 bytes
110(192) 10(128)
0001-11,11 0011-1111
12-16 Bits [2,048 - 65,535] 3 bytes
1110(224) 10(128) 10
0000-1111 0011-11,11 0011-1111
17-21 Bits [65,536 - 2,097,151] 4 bytes
1111-0(240) 10(128) 10 10
0000-0111 0011-,1111 0011-11,11 0011-1111
- - - Wikipedia's table stops here - - -
22-26 Bits [2,097,152 - 67,108,863] 5 bytes
1111-10(248) 10(128) 10 10 10
0000-0011 00,11-1111 0011-,1111 0011-11,11 0011-1111
27-31 Bits [67,108,864 - 2,147,483,647] 6 bytes
1111-110(252) 10(128) 10 10 10 10
0000-0001 0011-1111 00,11-1111 0011-,1111 0011-11,11 0011-1111
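Here is how the table might translate into code. Treat it as a rough sketch rather than a reference decoder: the names LEAD_PATTERNS, decode_sequence and encode_code_point are made up for this example, and it follows the 1-6 byte scheme above. Note that the current UTF-8 standard (RFC 3629) stops at 4 bytes and U+10FFFF, and that this sketch does not reject overlong encodings or surrogates, which is exactly the kind of gap the security paper linked at the top discusses.

```python
# One row per group in the table:
# (mask selecting the marker bits, marker value, data-bit mask, length)
LEAD_PATTERNS = [
    (0b10000000, 0b00000000, 0b01111111, 1),
    (0b11100000, 0b11000000, 0b00011111, 2),
    (0b11110000, 0b11100000, 0b00001111, 3),
    (0b11111000, 0b11110000, 0b00000111, 4),
    (0b11111100, 0b11111000, 0b00000011, 5),
    (0b11111110, 0b11111100, 0b00000001, 6),
]
CONT_MARKER_MASK = 0b11000000  # top two bits of a continuation byte
CONT_MARKER = 0b10000000       # ...which must be "10"
CONT_DATA_MASK = 0b00111111    # the six data bits


def decode_sequence(data: bytes) -> int:
    """Decode exactly one 1-6 byte sequence into an integer value."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    for marker_mask, marker, data_mask, length in LEAD_PATTERNS:
        if (lead & marker_mask) == marker:
            break
    else:
        raise ValueError("not a valid lead byte")
    if len(data) != length:
        raise ValueError("expected %d bytes, got %d" % (length, len(data)))
    value = lead & data_mask
    for byte in data[1:]:
        if (byte & CONT_MARKER_MASK) != CONT_MARKER:
            raise ValueError("continuation byte must start with 10")
        value = (value << 6) | (byte & CONT_DATA_MASK)
    return value


def encode_code_point(value: int) -> bytes:
    """Encode an integer (0 .. 2**31 - 1) using the shortest sequence."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if value < 0x80:
        return bytes([value])
    # Pick the shortest multi-byte form whose data bits can hold the value.
    for _mask, marker, data_mask, length in LEAD_PATTERNS[1:]:
        capacity = data_mask.bit_length() + 6 * (length - 1)
        if value < (1 << capacity):
            break
    tail = []
    for _ in range(length - 1):
        tail.append(CONT_MARKER | (value & CONT_DATA_MASK))
        value >>= 6
    # What remains of value now fits in the lead byte's data bits.
    return bytes([marker | value] + tail[::-1])
```

For example, decode_sequence("€".encode("utf-8")) gives 8364 (0x20AC), and encode_code_point(0x20AC) gives the three bytes E2 82 AC. Anything above 2,097,151 comes out as 5 or 6 bytes, which a standards-conforming decoder will refuse.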
Be careful: UTF-16 is also variable width and is not represented here, though I suspect it looks similar to the casual observer.