UTF-8 compaction mode is principally designed to support data systems with 8-bit communications paths. It has the clear advantage that the character addresses U+0000 to U+007F, corresponding to ASCII (and ISO 646:1991) values 00hex to 7Fhex, are represented by single octets of the same value. It is straightforward both to generate and to parse, and it produces reasonable compaction.
Input and output of up to 21-bit Unicode 3 character addresses for all 1 114 112 characters on the 17 Code Planes 0 through 16 can be cumbersome in normal byte-oriented data systems. In Table B.1, the length of the binary data representation of characters to be encoded (ignoring leading zero bits) determines how many UTF-8 bytes are required.
| Data type and length | Unicode address (binary format) | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
| --- | --- | --- | --- | --- | --- |
| Up to 7 bits, encoded as 7-bit ASCII or ISO 646 | 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 8 to 11 bits | 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| 12 to 16 bits (BMP) | zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 17 to 21 bits, Code Planes 1-16 | 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
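The bit layout of Table B.1 can be sketched as a short encoder. This is a minimal illustration and not part of the specification: the function name `utf8_encode` is hypothetical, and the sketch applies the bit patterns only, without rejecting any code point that Table B.1 itself does not exclude (surrogate values, for example, pass through unchanged).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one character address using the bit patterns of Table B.1.

    Hypothetical helper for illustration; it applies the table literally
    and performs no checks beyond the overall U+0000..U+10FFFF range.
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("character address outside U+0000..U+10FFFF")
    if cp <= 0x7F:
        # Up to 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 8 to 11 bits: 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 12 to 16 bits: 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 17 to 21 bits: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# One example per row of the table.
print(utf8_encode(0x41).hex())     # 41
print(utf8_encode(0xE9).hex())     # c3a9
print(utf8_encode(0x20AC).hex())   # e282ac
print(utf8_encode(0x1F600).hex())  # f09f9880
```

For character addresses outside the surrogate range, the result should agree with Python's built-in `chr(cp).encode("utf-8")`.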
During decoding, the number of bytes in each UTF-8 byte sequence can be immediately determined from the first byte of each sequence.
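As a rough illustration of this property, the sequence length follows from the leading bits of the first byte alone. The function name `utf8_sequence_length` is an assumption made for this sketch. It inspects the bit pattern only, so a first byte that matches a pattern may still be illegal under Table B.2.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the number of bytes in a sequence that starts with first_byte.

    Hypothetical helper: it looks only at the leading bits of the first
    byte and does not check the byte against the legal ranges.
    """
    if (first_byte & 0x80) == 0x00:  # 0xxxxxxx
        return 1
    if (first_byte & 0xE0) == 0xC0:  # 110yyyyy
        return 2
    if (first_byte & 0xF0) == 0xE0:  # 1110zzzz
        return 3
    if (first_byte & 0xF8) == 0xF0:  # 11110uuu
        return 4
    # 10xxxxxx is a trailing byte; 11111xxx matches no pattern.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


data = bytes.fromhex("41c3a9e282acf09f9880")
i = 0
lengths = []
while i < len(data):
    n = utf8_sequence_length(data[i])
    lengths.append(n)
    i += n
print(lengths)  # [1, 2, 3, 4]
```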
Legal UTF-8 byte sequences shall conform to Unicode Technical Report 27 as summarized in Table B.2.
| Unicode address range | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
| --- | --- | --- | --- | --- |
| U+0000 to U+007F | 00...7F | | | |
| U+0080 to U+07FF | C2...DF | 80...BF | | |
| U+0800 to U+0FFF | E0 | A0...BF | 80...BF | |
| U+1000 to U+FFFF | E1...EF | 80...BF | 80...BF | |
| U+10000 to U+3FFFF | F0 | 90...BF | 80...BF | 80...BF |
| U+40000 to U+FFFFF | F1...F3 | 80...BF | 80...BF | 80...BF |
| U+100000 to U+10FFFF | F4 | 80...8F | 80...BF | 80...BF |
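A table-driven check of the legal byte ranges might look like the sketch below. The names `LEGAL_SEQUENCES` and `is_legal_utf8` are hypothetical. Two assumptions are worth stating: the F4 row uses a second-byte range of 80...8F, which is the range that keeps the decoded value at or below U+10FFFF, and the E1...EF row is kept as a single range, so the sketch does not apply the exclusion of ED A0...BF (surrogate values) that later Unicode versions add.

```python
# Each entry: (first-byte range, [allowed range for each trailing byte]).
# Ranges mirror the legal-sequence table; F4 is limited to 80..8F (assumption
# stated above, chosen so that results never exceed U+10FFFF).
LEGAL_SEQUENCES = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_legal_utf8(data: bytes) -> bool:
    """Return True if data consists only of byte sequences listed above."""
    i = 0
    while i < len(data):
        # Find the table row whose first-byte range contains data[i].
        for (first_lo, first_hi), trailing in LEGAL_SEQUENCES:
            if first_lo <= data[i] <= first_hi:
                break
        else:
            return False  # first byte appears in no row (e.g. C0, C1, F5)
        tail = data[i + 1 : i + 1 + len(trailing)]
        if len(tail) != len(trailing):
            return False  # sequence is truncated
        for byte, (t_lo, t_hi) in zip(tail, trailing):
            if not t_lo <= byte <= t_hi:
                return False  # trailing byte outside its allowed range
        i += 1 + len(trailing)
    return True


print(is_legal_utf8(bytes.fromhex("e282ac")))    # True
print(is_legal_utf8(bytes.fromhex("c080")))      # False
print(is_legal_utf8(bytes.fromhex("f4908080")))  # False
```

The second and third inputs show why the ranges are narrower than the bit patterns alone would suggest: C0 80 is an overlong form of U+0000, and F4 90 80 80 would decode to a value above U+10FFFF.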