Network Working Group M. Crispin
Request for Comments: 4042 Panda Programming
Category: Informational 1 April 2005
UTF-9 and UTF-18
Efficient Transformation Formats of Unicode
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2005).
Abstract
ISO-10646 defines a large character set called the Universal
Character Set (UCS), which encompasses most of the world's writing
systems. The same set of codepoints is defined by Unicode, which
further defines additional character properties and other
implementation details. By policy of the relevant standardization
committees, changes to Unicode and amendments and additions to
ISO-10646 track each other, so that the character repertoires and
code point assignments remain in synchronization.
The current representation formats for Unicode (UTF-7, UTF-8, UTF-16)
are inefficient in both storage and computation on platforms that use
the 9-bit nonet as a natural storage unit instead of the 8-bit octet.
This document describes a transformation format of Unicode that takes
advantage of the nonet so that the format is efficient in both storage
and computation.
1. Introduction
A number of Internet sites utilize platforms that are not based upon
the traditional 8-bit byte or octet. One such platform is the PDP-10,
which is based upon a 36-bit word. On these platforms, it is
wasteful to represent data in octets, since 4 bits are left unused in
each word. The 9-bit nonet is a much more sensible representation.
Although these platforms support IETF standards, many of these
platforms still utilize a text representation based upon the septet,
which is only suitable for [US-ASCII] (although it has been used for
various ISO 646 national variants).
To maximize international and multi-lingual interoperability, the IAB
has recommended ([IAB-CHARACTER]) that [ISO-10646] be the default
coded character set.
Although other transformation formats of [UNICODE] exist, and
conceivably can be used on nonet-oriented machines (most notably
[UTF-8]), they suffer significant disadvantages:
[UTF-8]
requires one to three octets to represent codepoints in the
Basic Multilingual Plane (BMP), four octets to represent
[UNICODE] codepoints outside the BMP, and six octets to
represent non-[UNICODE] codepoints. When stored in nonets,
this results in as many as four wasted bits per [UNICODE]
character.
[UTF-16]
requires a hexadecet to represent codepoints in the BMP, and
two hexadecets to represent [UNICODE] codepoints outside the
BMP. When stored in nonet pairs, this results in as many as
four wasted bits per [UNICODE] character. This transformation
format requires complex surrogates to represent codepoints
outside the BMP, and can not represent non-[UNICODE] codepoints
at all.
[UTF-7]
requires one to five septets to represent codepoints in the
BMP, and as many as eight septets to represent codepoints
outside the BMP. When stored in nonets, this results in as
many as sixteen wasted bits per character. This transformation
format requires very complex and computationally expensive
shifting and "modified BASE64" processing, and can not
represent non-[UNICODE] codepoints at all.
By comparison, UTF-9 uses one to two nonets to represent codepoints
in the BMP, three nonets to represent [UNICODE] codepoints outside
the BMP, and three or four nonets to represent non-[UNICODE]
codepoints. There are no wasted bits, and as the examples in this
document demonstrate, the computational processing is minimal.
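The wasted-bit figures quoted for the other formats can be checked with
a few lines of arithmetic. The following is a minimal sketch, assuming
that each code unit (septet, octet, or hexadecet) is stored in the
smallest whole number of nonets that can hold it. The helper name
wasted_bits is hypothetical and is not part of any specification.

```python
NONET_BITS = 9


def wasted_bits(unit_bits: int, units: int) -> int:
    """Bits left unused when `units` code units of `unit_bits` bits each
    are stored in whole nonets (assumption: no packing across units)."""
    nonets_per_unit = -(-unit_bits // NONET_BITS)  # ceiling division
    return units * (nonets_per_unit * NONET_BITS - unit_bits)


print(wasted_bits(8, 4))   # UTF-8, four octets in four nonets   -> 4
print(wasted_bits(16, 2))  # UTF-16, two hexadecets in nonet pairs -> 4
print(wasted_bits(7, 8))   # UTF-7, eight septets in eight nonets -> 16
print(wasted_bits(9, 3))   # UTF-9, three nonets                  -> 0
```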
Transformation between [UTF-8] and UTF-9 is straightforward, with
most of the complexity in the handling of [UTF-8]. It is hoped that
future extensions to protocols such as SMTP will permit the use of
UTF-9 in these protocols between nonet platforms without the use of
[UTF-8] as an "on the wire" format.
Similarly, transformation between [UNICODE] codepoints and UTF-18 is
also quite simple. Although (like UCS-2) UTF-18 only represents a
subset of the available [UNICODE] codepoints, it encompasses the
non-private codepoints that are currently assigned in [UNICODE].
1.1. Conventions Used in This Document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14, RFC 2119
[KEYWORDS].
2. Overview
UTF-9 encodes [UNICODE] codepoints in the low order 8 bits of a
nonet, using the high order bit to indicate continuation. Surrogates
are not used.
[UNICODE] codepoints in the range U+0000 - U+00FF ([US-ASCII] and
Latin 1) are represented by a single nonet; codepoints in the range
U+0100 - U+FFFF (the remainder of the BMP) are represented by two
nonets; and codepoints in the range U+10000 - U+10FFFF (remainder of
[UNICODE]) are represented by three nonets.
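The relationship between codepoint range and encoded length can be
sketched as follows. This is an illustrative helper only (the name
utf9_length is hypothetical). It takes the three-nonet range to begin
at U+10000, the first codepoint beyond the BMP, and it assumes that
the length is simply the number of octets remaining once leading zero
octets are dropped.

```python
def utf9_length(codepoint: int) -> int:
    """Number of nonets needed for one codepoint: one nonet per octet,
    counting from the most significant non-zero octet."""
    if not 0 <= codepoint <= 0x7FFFFFFF:
        raise ValueError("codepoint outside the ISO-10646 range")
    # bit_length() is 0 for U+0000, which still occupies one nonet.
    return max(1, (codepoint.bit_length() + 7) // 8)


print(utf9_length(0x0041))    # 1  (US-ASCII)
print(utf9_length(0x00FF))    # 1  (Latin 1)
print(utf9_length(0x0100))    # 2  (start of the rest of the BMP)
print(utf9_length(0xFFFF))    # 2  (end of the BMP)
print(utf9_length(0x10000))   # 3  (first codepoint beyond the BMP)
print(utf9_length(0x10FFFF))  # 3  (last [UNICODE] codepoint)
```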
Non-[UNICODE] codepoints in [ISO-10646] (that is, codepoints in the
range 0x110000 - 0x7fffffff) can also be represented in UTF-9 by
obvious extension, but this is not discussed further as these
codepoints have been removed from [ISO-10646] by ISO.
UTF-18 encodes [UNICODE] codepoints in the Basic Multilingual Plane
(BMP, plane 0), Supplementary Multilingual Plane (SMP, plane 1),
Supplementary Ideographic Plane (SIP, plane 2), and Supplementary
Special-purpose Plane (SSP, plane 14) in a single 18-bit value. It
does not encode planes 3 through 13, which are currently unused; nor
planes 15 or 16, which are private spaces.
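Four planes of 65,536 codepoints each account for exactly the 2^18
values available in an 18-bit quantity. The following sketch checks
only which codepoints fall within the planes listed above; it
deliberately does not show how a codepoint is packed into an 18-bit
value, since this overview does not define that mapping. The names
UTF18_PLANES and utf18_representable are hypothetical.

```python
# Planes named in the overview: BMP, SMP, SIP, and SSP.
UTF18_PLANES = (0, 1, 2, 14)


def utf18_representable(codepoint: int) -> bool:
    """True if the codepoint lies in a plane that UTF-18 encodes."""
    if not 0 <= codepoint <= 0x10FFFF:
        return False
    return (codepoint >> 16) in UTF18_PLANES  # plane = bits above the low 16


# Four planes of 2^16 codepoints fill an 18-bit value exactly.
assert len(UTF18_PLANES) * 0x10000 == 1 << 18

print(utf18_representable(0x0041))    # True  (plane 0)
print(utf18_representable(0xE0001))   # True  (plane 14)
print(utf18_representable(0x30000))   # False (plane 3, unused)
print(utf18_representable(0x100000))  # False (plane 16, private)
```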
Normally, UTF-9 and UTF-18 should only be used in the context of 9
bit storage and transport. Although some protocols, e.g., [FTP],
support transport of nonets, the current IETF protocol suite is quite
deficient in this area. The IETF is urged to take action to improve
IETF protocol support for nonets.
3. UTF-9 Definition
A UTF-9 stream represents [ISO-10646] codepoints using 9-bit nonets.
The low-order 8 bits of a nonet form an octet, and the high-order bit
indicates continuation.
UTF-9 does not use surrogates; consequently a UTF-16 value must be
transformed into the UCS-4 equivalent, and the surrogate values
U+D800 - U+DFFF are never transmitted in UTF-9.
Octets of the [UNICODE] codepoint value are then copied into
successive UTF-9 nonets, starting with the most-significant non-zero
octet. All but the least significant octet have the continuation bit
set in the associated nonet.
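The copying rule described above can be sketched as follows. This is a
minimal illustration, not a reference implementation: the function and
constant names are hypothetical, nonets are modeled as Python integers
in the range 0 - 0o777, and the caller is assumed to have already
combined any UTF-16 surrogate pair into a single codepoint.

```python
CONTINUATION = 0o400  # high-order (ninth) bit of a nonet


def utf9_encode(codepoint: int) -> list[int]:
    """Encode one codepoint as a list of nonets."""
    if not 0 <= codepoint <= 0x7FFFFFFF:
        raise ValueError("codepoint outside the ISO-10646 range")
    # Drop leading zero octets, but keep one octet for U+0000.
    octets = codepoint.to_bytes(4, "big").lstrip(b"\x00") or b"\x00"
    # Every octet except the least significant carries the continuation bit.
    nonets = [octet | CONTINUATION for octet in octets[:-1]]
    nonets.append(octets[-1])
    return nonets


def utf9_decode(nonets) -> list[int]:
    """Decode a sequence of nonets into codepoints.  A trailing sequence
    that ends with the continuation bit set is silently dropped here."""
    codepoints = []
    value = 0
    for nonet in nonets:
        value = (value << 8) | (nonet & 0o377)
        if not nonet & CONTINUATION:  # last nonet of this codepoint
            codepoints.append(value)
            value = 0
    return codepoints


def octal(nonets) -> str:
    """Render nonets as three-digit octal values for display."""
    return " ".join(format(n, "03o") for n in nonets)


print(octal(utf9_encode(0x0041)))  # 101
print(octal(utf9_encode(0x00C0)))  # 300
print(octal(utf9_encode(0x0391)))  # 403 221
```

Decoding the concatenated output of the encoder returns the original
codepoints, for example utf9_decode([0o101, 0o403, 0o221]) yields
[0x41, 0x391].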
Examples:
Character  Name                               UTF-9 (in octal)
---------  ----                               ----------------
U+0041     LATIN CAPITAL LETTER A             101
U+00C0     LATIN CAPITAL LETTER A WITH GRAVE  300
U+0391     GREEK CAPITAL LETTER ALPHA         403 221
U+611B