Annex B UTF-8
UTF-8 compaction mode is principally designed to support data systems with 8-bit communications paths. It has the clear advantage that the character addresses U+0000 to U+007F, corresponding to ASCII (and ISO 646:1991) values 00 to 7F hex, are represented by single octets of the same value. It is straightforward both to generate and to parse, and it produces reasonable compaction.
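The single-octet property can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `ascii_is_unchanged` is hypothetical and not part of this annex.

```python
def ascii_is_unchanged(text: str) -> bool:
    """Return True if every character of `text` lies in U+0000..U+007F
    and its UTF-8 encoding is the single octet of the same value."""
    encoded = text.encode("utf-8")
    return len(encoded) == len(text) and all(
        ord(ch) <= 0x7F and ord(ch) == octet
        for ch, octet in zip(text, encoded)
    )


print(ascii_is_unchanged("Annex B"))  # ASCII-only input
print(ascii_is_unchanged("é"))        # U+00E9 needs two octets
```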
Input and output of up to 21-bit Unicode 3 character addresses for all 1 114 112 characters on the 17 Code Planes 0 through 16 can be cumbersome in normal byte-oriented data systems. In Table B.1, the length of the binary data representation of the character to be encoded (ignoring leading zero bits) determines how many UTF-8 bytes are required.
Table B.1: UTF-8 byte sequences for Unicode character addresses
Data type and length | Unicode address (binary format) | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
Up to 7 bits, encoded as 7-bit ASCII or ISO 646 | 00000000 0xxxxxxx | 0xxxxxxx | | | |
8 to 11 bits | 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
16 bits (BMP) | zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
21 bits, Code Planes 1-16 | 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
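The bit layouts of Table B.1 can be sketched as a small Python encoder. This is a minimal illustration rather than a normative implementation: the function name `utf8_encode` is hypothetical, and the sketch applies only the bit-length rule of Table B.1, so it does not filter out individual addresses that an application may wish to reject.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode character address following Table B.1."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("address outside Code Planes 0 through 16")

    # Number of significant bits, ignoring leading zero bits.
    bits = code_point.bit_length()

    if bits <= 7:    # 0xxxxxxx
        return bytes([code_point])
    if bits <= 11:   # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if bits <= 16:   # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x20AC).hex())  # EURO SIGN, a 14-bit address
```

For ordinary characters the result agrees with Python's built-in `str.encode("utf-8")`, which offers a convenient cross-check.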
During decoding, the number of bytes in each UTF-8 byte sequence can be immediately determined from the first byte of each sequence.
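As a sketch of how a decoder might do this, the function below inspects only the leading bit pattern of the first byte, as laid out in Table B.1. The name `utf8_sequence_length` is hypothetical. The check is deliberately limited to the bit pattern, so it does not by itself confirm that the complete sequence is legal.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the length of a UTF-8 sequence given its first byte."""
    if (first_byte & 0x80) == 0x00:   # 0xxxxxxx
        return 1
    if (first_byte & 0xE0) == 0xC0:   # 110yyyyy
        return 2
    if (first_byte & 0xF0) == 0xE0:   # 1110zzzz
        return 3
    if (first_byte & 0xF8) == 0xF0:   # 11110uuu
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is never used.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


data = "A€".encode("utf-8")
print([utf8_sequence_length(data[0]), utf8_sequence_length(data[1])])
```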
Legal UTF-8 byte sequences shall conform to Unicode Technical Report 27, as summarized in Table B.2.
Table B.2: Unicode address ranges for legal UTF-8 byte sequences
Unicode address range | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
U+0000 to U+007F | 00…7F | | | |
U+0080 to U+07FF | C2…DF | 80…BF | | |
U+0800 to U+0FFF | E0 | A0…BF | 80…BF | |
U+1000 to U+FFFF | E1…EF | 80…BF | 80…BF | |
U+10000 to U+3FFFF | F0 | 90…BF | 80…BF | 80…BF |
U+40000 to U+FFFFF | F1…F3 | 80…BF | 80…BF | 80…BF |
U+100000 to U+10FFFF | F4 | 80…8F | 80…BF | 80…BF |
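A table-driven check of the ranges in Table B.2 might look like the following Python sketch. The names `LEGAL_ROWS` and `is_legal_utf8` are hypothetical. The sketch assumes that the second byte after a first byte of F4 is limited to 80…8F, since any larger value would encode an address above U+10FFFF. Like the table, it does not single out any range inside U+1000 to U+FFFF for special treatment.

```python
# Each row: (first-byte range, second-byte range or None,
#            number of further bytes, each in 80..BF)
LEGAL_ROWS = [
    ((0x00, 0x7F), None, 0),
    ((0xC2, 0xDF), (0x80, 0xBF), 0),
    ((0xE0, 0xE0), (0xA0, 0xBF), 1),
    ((0xE1, 0xEF), (0x80, 0xBF), 1),
    ((0xF0, 0xF0), (0x90, 0xBF), 2),
    ((0xF1, 0xF3), (0x80, 0xBF), 2),
    ((0xF4, 0xF4), (0x80, 0x8F), 2),  # assumption: capped at U+10FFFF
]


def is_legal_utf8(data: bytes) -> bool:
    """Return True if `data` consists only of sequences listed in Table B.2."""
    i = 0
    while i < len(data):
        for (first_lo, first_hi), second, extra in LEGAL_ROWS:
            if first_lo <= data[i] <= first_hi:
                break
        else:
            return False  # first byte matches no row of the table

        length = 1 if second is None else 2 + extra
        seq = data[i:i + length]
        if len(seq) < length:
            return False  # sequence is truncated
        if second is not None:
            if not second[0] <= seq[1] <= second[1]:
                return False
            if any(not 0x80 <= b <= 0xBF for b in seq[2:]):
                return False
        i += length
    return True


print(is_legal_utf8("€".encode("utf-8")))  # E2 82 AC
print(is_legal_utf8(b"\xc0\x80"))          # C0 is not a legal first byte
```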