About UTF-8

Published: 2007-05-26

Annex B UTF-8


UTF-8 compaction mode is principally designed to support data systems with 8-bit communications paths. It has the clear advantage that the character addresses U+0000hex to U+007Fhex, corresponding to ASCII (and ISO 646:1991) values 00hex to 7Fhex, are represented by single octets of the same value. It is straightforward both to generate and to parse, and it produces reasonable compaction.
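The ASCII property described above can be checked with a short Python sketch. The sample string is an arbitrary illustration, not part of the annex; it only shows that text limited to U+0000 through U+007F yields the same octets under ASCII and UTF-8.

```python
# Illustrative sample text (an assumption, not taken from the annex).
text = "Annex B UTF-8"

ascii_bytes = text.encode("ascii")   # one octet per character, values 00..7F
utf8_bytes = text.encode("utf-8")    # UTF-8 encoding of the same characters

# For characters in U+0000..U+007F the two encodings are identical.
assert utf8_bytes == ascii_bytes
print(utf8_bytes.hex(" "))
```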


Input and output of up to 21-bit Unicode 3 character addresses for all 1 114 112 characters (17 × 65 536) on the 17 Code Planes 0 through 16 can be cumbersome in normal byte-oriented data systems. As shown in Table B.1, the number of significant bits in the binary representation of the character to be encoded (ignoring leading zero bits) determines how many UTF-8 bytes are required.


Table B.1: UTF-8 byte sequences for Unicode character addresses

Data type and length                               Unicode address (binary format)   1st Byte   2nd Byte   3rd Byte   4th Byte
Up to 7 bits, encoded as 7-bit ASCII or ISO 646    00000000 0xxxxxxx                 0xxxxxxx
8 to 11 bits                                       00000yyy yyxxxxxx                 110yyyyy   10xxxxxx
16 bits (BMP)                                      zzzzyyyy yyxxxxxx                 1110zzzz   10yyyyyy   10xxxxxx
21 bits, Code Planes 1-16                          000uuuuu zzzzyyyy yyxxxxxx        11110uuu   10uuzzzz   10yyyyyy   10xxxxxx
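The bit layouts of Table B.1 can be sketched as a small Python encoder. This is a minimal illustration, not a reference implementation: the function name `utf8_encode` is hypothetical, and the sketch applies only the bit patterns of Table B.1 (it does not reject surrogate addresses, which Python's built-in encoder does).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode character address using the layouts of Table B.1."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"character address out of range: {cp:#x}")
    if cp <= 0x7F:
        # Up to 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 8 to 11 bits: 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # Up to 16 bits (BMP): 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Up to 21 bits (Code Planes 1-16): 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, `utf8_encode(0x20AC)` returns the three bytes E2 82 AC, and `utf8_encode(0x10FFFF)` returns F4 8F BF BF.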


During decoding, the number of bytes in each UTF-8 byte sequence can be immediately determined from the first byte of each sequence.
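A minimal Python sketch of this length lookup follows, assuming only the first-byte patterns of Table B.1. The function name `utf8_sequence_length` is hypothetical. Note that matching a pattern does not make a first byte legal: values such as C0, C1 and F5 to F7 fit the patterns but fall outside the ranges of Table B.2.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by the first byte (Table B.1 patterns)."""
    if 0x00 <= first_byte <= 0x7F:
        return 1  # 0xxxxxxx
    if 0xC0 <= first_byte <= 0xDF:
        return 2  # 110yyyyy
    if 0xE0 <= first_byte <= 0xEF:
        return 3  # 1110zzzz
    if 0xF0 <= first_byte <= 0xF7:
        return 4  # 11110uuu
    # 10xxxxxx is a continuation byte; 11111xxx matches no row of Table B.1.
    raise ValueError(f"not a valid first byte: {first_byte:#04x}")
```

A decoder can call this on the first byte and then read the remaining `length - 1` continuation bytes without scanning ahead.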


Legal UTF-8 byte sequences shall conform to Unicode Technical Report 27 as summarized in Table B.2.






Table B.2: Unicode address ranges for legal UTF-8 byte sequences

Unicode address range    1st Byte   2nd Byte   3rd Byte   4th Byte
U+0000 to U+007F         00...7F
U+0080 to U+07FF         C2...DF    80...BF
U+0800 to U+0FFF         E0         A0...BF    80...BF
U+1000 to U+FFFF         E1...EF    80...BF    80...BF
U+10000 to U+3FFFF       F0         90...BF    80...BF    80...BF
U+40000 to U+FFFFF       F1...F3    80...BF    80...BF    80...BF
U+100000 to U+10FFFF     F4         80...8F    80...BF    80...BF
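The ranges of Table B.2 can be turned into a simple legality check, sketched below in Python. The names `ROWS` and `is_legal_utf8` are hypothetical. Two assumptions are worth stating: the F4 row uses 80...8F for the second byte, because U+10FFFF encodes as F4 8F BF BF and any larger second byte would exceed U+10FFFF; and, like the table as summarized here, the sketch does not exclude the surrogate addresses U+D800 to U+DFFF, which later Unicode versions make illegal in UTF-8.

```python
# Each row: (first-byte low, first-byte high, sequence length,
#            second-byte low, second-byte high). Third and fourth bytes,
# when present, are always in 80..BF.
ROWS = [
    (0x00, 0x7F, 1, None, None),  # U+0000   to U+007F
    (0xC2, 0xDF, 2, 0x80, 0xBF),  # U+0080   to U+07FF
    (0xE0, 0xE0, 3, 0xA0, 0xBF),  # U+0800   to U+0FFF
    (0xE1, 0xEF, 3, 0x80, 0xBF),  # U+1000   to U+FFFF
    (0xF0, 0xF0, 4, 0x90, 0xBF),  # U+10000  to U+3FFFF
    (0xF1, 0xF3, 4, 0x80, 0xBF),  # U+40000  to U+FFFFF
    (0xF4, 0xF4, 4, 0x80, 0x8F),  # U+100000 to U+10FFFF
]

def is_legal_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data matches a row of ROWS."""
    i = 0
    while i < len(data):
        first = data[i]
        row = next((r for r in ROWS if r[0] <= first <= r[1]), None)
        if row is None:
            return False  # e.g. C0, C1, F5..FF, or a stray continuation byte
        _, _, length, lo2, hi2 = row
        seq = data[i:i + length]
        if len(seq) < length:
            return False  # sequence truncated at end of input
        if length > 1 and not lo2 <= seq[1] <= hi2:
            return False  # second byte outside the range for this row
        if any(not 0x80 <= b <= 0xBF for b in seq[2:]):
            return False  # third or fourth byte is not 80..BF
        i += length
    return True
```

The restricted second-byte ranges in the E0, F0 and F4 rows are what reject overlong encodings (such as E0 80 80) and addresses above U+10FFFF (such as F4 90 80 80).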


Original source: http://www.ltesting.net