About UTF-8

Published: 2007-07-04

Annex B: UTF-8


UTF-8 compaction mode is principally designed to support data systems with 8-bit communications paths. It has the clear advantage that the character addresses U+0000 to U+007F, which correspond to the ASCII (and ISO 646:1991) values 00 hex to 7F hex, are represented by single octets of the same value. It is straightforward to generate and to parse, and it produces reasonable compaction.
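The single-octet property for U+0000 to U+007F can be checked with a short Python sketch. The helper name `ascii_octets` is made up for this illustration; it relies only on Python's built-in `str.encode`.

```python
def ascii_octets(text):
    """Return the UTF-8 octets of `text` as a list of integers."""
    return list(text.encode("utf-8"))


sample = "UTF-8"
octets = ascii_octets(sample)

# For characters in U+0000..U+007F, each octet equals the character address.
assert octets == [ord(ch) for ch in sample]

print(" ".join(f"{b:02X}" for b in octets))  # 55 54 46 2D 38
```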


Input and output of up to 21-bit Unicode 3 character addresses for all 1 114 112 characters on the 17 Code Planes 0 through 16 can be cumbersome in normal byte-oriented data systems. In Table B.1, the length of the binary representation of the character to be encoded (ignoring leading zero bits) determines how many UTF-8 bytes are required.
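As a rough illustration of that rule, the following Python sketch derives the byte count from the number of significant bits. The function name `utf8_length` is an assumption made for this example, and the sketch does not treat surrogate addresses specially.

```python
def utf8_length(code_point):
    """Return how many UTF-8 bytes a Unicode character address needs."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("address is outside Code Planes 0 through 16")
    bits = code_point.bit_length()  # significant bits, leading zeros ignored
    if bits <= 7:
        return 1
    if bits <= 11:
        return 2
    if bits <= 16:
        return 3
    return 4  # up to 21 bits


# 17 planes of 65 536 addresses each give the 1 114 112 total.
assert 17 * 0x10000 == 1114112
print(utf8_length(0x41), utf8_length(0xE9), utf8_length(0x20AC), utf8_length(0x10000))
```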


Table B.1: UTF-8 byte sequences for Unicode character addresses

Data type and length                             Unicode address (binary format)  1st byte  2nd byte  3rd byte  4th byte
-----------------------------------------------  -------------------------------  --------  --------  --------  --------
Up to 7 bits, encoded as 7-bit ASCII or ISO 646  000000000xxxxxxx                 0xxxxxxx
8 to 11 bits                                     00000yyyyyxxxxxx                 110yyyyy  10xxxxxx
16 bits (BMP)                                    zzzzyyyyyyxxxxxx                 1110zzzz  10yyyyyy  10xxxxxx
21 bits, Code Planes 1-16                        000uuuuuzzzzyyyyyyxxxxxx         11110uuu  10uuzzzz  10yyyyyy  10xxxxxx

During decoding, the number of bytes in each UTF-8 byte sequence can be determined immediately from the first byte of the sequence.
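One way to read the length off the first byte is to test its leading bits against the patterns of Table B.1. The sketch below assumes a hypothetical helper name, `sequence_length`; it checks only the bit pattern, so first bytes such as C0, C1, and F5 to F7 pass here even though Table B.2 does not list them as legal.

```python
def sequence_length(first_byte):
    """Return the UTF-8 sequence length implied by the first byte's bit pattern."""
    if first_byte & 0x80 == 0x00:   # 0xxxxxxx
        return 1
    if first_byte & 0xE0 == 0xC0:   # 110yyyyy
        return 2
    if first_byte & 0xF0 == 0xE0:   # 1110zzzz
        return 3
    if first_byte & 0xF8 == 0xF0:   # 11110uuu
        return 4
    # 10xxxxxx is a trailing byte; 11111xxx is not used by any sequence.
    raise ValueError("byte cannot start a UTF-8 sequence")


data = "A\u00e9\u20ac".encode("utf-8")
print([sequence_length(b) for b in (data[0], data[1], data[3])])  # [1, 2, 3]
```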


Legal UTF-8 byte sequences shall conform to Unicode Technical Report 27, as summarized in Table B.2.






Table B.2: Unicode address ranges for legal UTF-8 byte sequences

Unicode address range   1st byte  2nd byte  3rd byte  4th byte
----------------------  --------  --------  --------  --------
U+0000 to U+007F        00...7F
U+0080 to U+07FF        C2...DF   80...BF
U+0800 to U+0FFF        E0        A0...BF   80...BF
U+1000 to U+FFFF        E1...EF   80...BF   80...BF
U+10000 to U+3FFFF      F0        90...BF   80...BF   80...BF
U+40000 to U+FFFFF      F1...F3   80...BF   80...BF   80...BF
U+100000 to U+10FFFF    F4        80...8F   80...BF   80...BF

Original source: http://www.ltesting.net