INTERNET DRAFT                                                  J. Abela
Expires: 23 June 1998                                                HSC
                                                        23 December 1997


                 UTF-9, a transformation format of UCS


Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   To learn the current status of any Internet-Draft, please check the
   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   ftp.isi.edu (US West Coast), munnari.oz.au (Pacific Rim), or
   ds.internic.net (US East Coast).

   Distribution of this document is unlimited.

Abstract

   ISO/IEC 10646 defines a multi-octet character set called the
   Universal Character Set (UCS) which encompasses most of the world's
   writing systems. Multi-octet characters, however, are not compatible
   with many current applications and protocols, and this has led to the
   development of a few so-called UCS transformation formats (UTF), each
   with different characteristics. UTF-9, the object of this memo, has
   the characteristic of preserving the full ISO-Latin1 range, providing
   compatibility with file systems, parsers and other software that rely
   on ISO-Latin1 values.

   ISO-Latin1 is almost as widespread as ASCII in many countries,
   especially in most of western Europe, and is the default character
   set for HTML. A compatible encoding seems desirable, where possible.

1. Introduction

   ISO/IEC 10646-1 [ISO-10646] defines a multi-octet character set
   called the Universal Character Set (UCS), which encompasses most of
   the world's writing systems.
   Two multi-octet encodings are defined, a four-octet per character
   encoding called UCS-4 and a two-octet per character encoding called
   UCS-2, able to address only the first 64K characters of the UCS (the
   Basic Multilingual Plane, BMP), outside of which there are currently
   no assignments.

   It is noteworthy that the same set of characters is defined by the
   Unicode standard [UNICODE], which further defines additional
   character properties and other application details of great interest
   to implementors, but does not have the UCS-4 encoding. Up to the
   present time, changes in Unicode and amendments to ISO/IEC 10646 have
   tracked each other, so that the character repertoires and code point
   assignments have remained in sync. The relevant standardization
   committees have committed to maintain this very useful synchronism.

   The UCS-2 and UCS-4 encodings, however, are hard to use in many
   current applications and protocols that assume 8 or even 7 bit
   characters. Even newer systems able to deal with 16 bit characters
   cannot process UCS-4 data. This situation has led to the development
   of so-called UCS transformation formats (UTF), each with different
   characteristics.

   UTF-1 has only historical interest, having been removed from ISO/IEC
   10646. UTF-7 has the quality of encoding the full BMP repertoire
   using only octets with the high-order bit clear (7 bit US-ASCII
   values, [US-ASCII]), and is thus deemed a mail-safe encoding
   ([RFC2152]).

   UTF-8 uses all bits of an octet, but has the quality of preserving
   the full US-ASCII range: US-ASCII characters are encoded in one octet
   having the normal US-ASCII value, and any octet with such a value can
   only stand for a US-ASCII character, and nothing else.

   UTF-9, the object of this memo, has the quality of preserving the
   full ISO-Latin1 range: ISO-Latin1 characters are encoded in one octet
   having the normal ISO-Latin1 value.

   UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
   into pairs of UCS-2 values from a reserved range.
   UTF-16 impacts UTF-9 in that UCS-2 values from the reserved range
   must be treated specially in the UTF-9 transformation.

   UTF-9 encodes UCS-2 or UCS-4 characters as a varying number of
   octets, where the number of octets, and the value of each, depend on
   the integer value assigned to the character in ISO/IEC 10646. This
   transformation format has the following characteristics (all values
   are in hexadecimal):

   -  Character values from 0000 0000 to 0000 007F and 0000 00A0 to
      0000 00FF (Latin1 repertoire) correspond to octets 00 to 7F and
      A0 to FF (8 bit Latin1 values). A direct consequence is that a
      plain Latin1 string is also a valid UTF-9 string. Note that the
      converse does not hold: an octet with a Latin1 value in a UTF-9
      string does not necessarily stand for a Latin1 character.

   -  US-ASCII values do not appear otherwise in a UTF-9 encoded
      character stream. This provides compatibility with file systems
      or other software (e.g. the printf() function in C libraries)
      that parse based on US-ASCII values but are transparent to other
      values. However, note that Latin1 octets in a UTF-9 stream may be
      non-Latin1 characters when used as part of multi-octet sequences.

   -  Round-trip conversion is easy between UTF-9 and either of UCS-4,
      UCS-2.

   -  The first octet of a multi-octet sequence indicates the number of
      octets in the sequence.

   -  The UTF-9 encoding of a character is never longer than its UTF-8
      encoding.

   -  Unlike UTF-8, there is no reliable way to find character
      boundaries in a UTF-9 octet stream.

   The definition of UTF-9 is heavily based on that of UTF-8. More
   information about UTF, Unicode, and their various versions can be
   found in RFC 2044.

UTF-9 definition

   In UTF-9, characters are encoded using sequences of 1 to 5 octets.
   The only octet of a "sequence" of one is in the ranges 00 to 7F or
   A0 to FF. In a sequence of n octets, n>1, the initial octet is in the
   range 80 to 9F. This octet specifies the length of the sequence, and
   its low-order bits contain bits from the value of the character.
   Each of the remaining octets has its high-order bit set to 1, and
   its seven low-order bits are used to encode the character.
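
   As an illustration of the rules above, the following Python sketch
   encodes UCS-4 values into UTF-9 octets. It is not part of the UTF-9
   definition. The names utf9_sequence_length, utf9_encode_char and
   utf9_encode are hypothetical, and the bit layouts are taken from the
   table of octet formats in this section. Two points are assumptions:
   values 0080 to 009F, which cannot stand for themselves as single
   octets, are given the two-octet form, and values from the UTF-16
   reserved range get no special treatment.

```python
def utf9_sequence_length(first_octet: int) -> int:
    """Return the length of the sequence introduced by first_octet."""
    if not 0 <= first_octet <= 0xFF:
        raise ValueError("not an octet")
    if first_octet < 0x80 or first_octet >= 0xA0:
        return 1  # 00-7F and A0-FF stand for themselves
    if first_octet <= 0x8F:
        return 2  # 1000xxxx
    if first_octet <= 0x93:
        return 3  # 100100xx
    if first_octet <= 0x97:
        return 4  # 100101xx
    return 5      # 10011xxx


def utf9_encode_char(value: int) -> bytes:
    """Encode one UCS-4 value as a UTF-9 octet sequence."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("outside the UCS-4 range")
    if value <= 0x7F or 0xA0 <= value <= 0xFF:
        return bytes([value])
    # ASSUMPTION: 0080-009F use the two-octet form (not listed in the table).
    if value <= 0x7FF:
        lead, n_trail = 0x80, 1
    elif value <= 0xFFFF:
        lead, n_trail = 0x90, 2
    elif value <= 0x7FFFFF:
        lead, n_trail = 0x94, 3
    else:
        lead, n_trail = 0x98, 4
    # The initial octet carries the highest value bits.
    octets = [lead | (value >> (7 * n_trail))]
    # Each remaining octet has its high bit set and carries seven bits.
    for shift in range(7 * (n_trail - 1), -1, -7):
        octets.append(0x80 | ((value >> shift) & 0x7F))
    return bytes(octets)


def utf9_encode(text: str) -> bytes:
    """Encode a Python string, one code point at a time."""
    return b"".join(utf9_encode_char(ord(ch)) for ch in text)
```

   Under these assumptions, utf9_encode("A\u2262\u0391.") yields the
   octets 41 90 C4 E2 87 91 2E.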

   The table below summarizes the format of these different octet
   types. The letter x indicates bits available for encoding bits of the
   UCS-4 character value.

   UCS-4 range (hex)     UTF-9 octet sequence (binary)
   0000 0000-0000 007F   0xxxxxxx
   0000 00A0-0000 00BF   101xxxxx
   0000 00C0-0000 00FF   11xxxxxx
   0000 0100-0000 07FF   1000xxxx 1xxxxxxx
   0000 0800-0000 FFFF   100100xx 1xxxxxxx 1xxxxxxx
   0001 0000-007F FFFF   100101xx 1xxxxxxx 1xxxxxxx 1xxxxxxx
   0080 0000-7FFF FFFF   10011xxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx

Examples

   The Latin1 sequence "Noël" should be encoded as follows:

      UCS-2: 004E 006F 00EB 006C
      UTF-9: 4E 6F EB 6C
      UTF-8: 4E 6F C3 AB 6C

   The UCS-2 sequence "A<NOT IDENTICAL TO><ALPHA>." should be encoded as
   follows:

      UCS-2: 0041 2262 0391 002E
      UTF-9: 41 90 C4 E2 87 91 2E
      UTF-8: 41 E2 89 A2 CE 91 2E

   The UCS-2 sequence representing the Hangul characters for the Korean
   word "hangugo" should be encoded as follows:

      UCS-2: D55C AD6D C5B4
      UTF-9: 93 AA DC 92 DA ED 93 8B B4
      UTF-8: ED 95 9C EA B5 AD EC 96 B4

Security Considerations

   Implementors of UTF-9 need to consider the security aspects of how
   they handle illegal UTF-9 sequences. It is conceivable that in some
   circumstances an attacker would be able to exploit an incautious
   UTF-9 parser by sending it an octet sequence that is not permitted by
   the UTF-9 syntax.

   A particularly subtle form of this attack could be carried out
   against a parser which performs security-critical validity checks
   against the UTF-9 encoded form of its input, but interprets certain
   illegal octet sequences as characters. For example, a parser might
   prohibit the NUL character when encoded as the single-octet sequence
   00, but allow the illegal two-octet sequence 80 80 and interpret it
   as a NUL character. Another example might be a parser which prohibits
   the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal
   octet sequence 2F 2E 80 AE 2F, in which the two octets 80 AE would
   be interpreted as 2E (".").

Acknowledgments

   Most of the text of this memo comes from the UTF-8 memo by Francois
   Yergeau.
   The following have participated in the drafting of this memo:
   Antoine Leca and Francois Yergeau.

Bibliography

   [ISO-10646]  ISO/IEC 10646-1:1993. International Standard --
                Information technology -- Universal Multiple-Octet
                Coded Character Set (UCS) -- Part 1: Architecture and
                Basic Multilingual Plane. Five amendments and a
                technical corrigendum have been published up to now.

   [RFC2152]    D. Goldsmith, M. Davis, "UTF-7: A Mail-safe
                Transformation Format of Unicode", RFC 2152, Taligent
                inc., May 1997. (Obsoletes RFC1642)

   [UNICODE]    The Unicode Consortium, "The Unicode Standard --
                Version 2.0", Addison-Wesley, 1996.

   [US-ASCII]   Coded Character Set--7-bit American Standard Code for
                Information Interchange, ANSI X3.4-1986.

Author's Address

   Jerome Abela
   Herve Schauer Consultants
   142, rue de Rivoli
   75001 Paris
   France

   Phone: +33 141 409 700
   Fax:   +33 141 409 709
   EMail: Jerome.Abela@hsc.fr
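
   The concern described under Security Considerations can be
   illustrated with a strict decoder. The following Python sketch is
   not part of the UTF-9 definition. The name utf9_decode_strict and the
   _FORMS table are hypothetical, and the bit patterns are taken from
   the table of octet formats. Two points are assumptions: every
   non-shortest form (such as 80 80 or 80 AE) is treated as illegal, and
   values 0080 to 009F are accepted in the two-octet form. Values from
   the UTF-16 reserved range are not checked.

```python
# Each entry: (lead mask, lead pattern, trailing octets, smallest value).
# ASSUMPTION: 0x80 is the smallest value allowed in the two-octet form,
# because the table of octet formats does not list 0080-009F.
_FORMS = (
    (0xF0, 0x80, 1, 0x80),      # 1000xxxx
    (0xFC, 0x90, 2, 0x800),     # 100100xx
    (0xFC, 0x94, 3, 0x10000),   # 100101xx
    (0xF8, 0x98, 4, 0x800000),  # 10011xxx
)


def utf9_decode_strict(data: bytes) -> list:
    """Decode UTF-9 octets into a list of UCS-4 values, or raise ValueError."""
    values = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80 or lead >= 0xA0:
            # Single octets 00-7F and A0-FF stand for themselves.
            values.append(lead)
            i += 1
            continue
        # Every octet in 80-9F matches exactly one of the four patterns.
        mask, _, n_trail, smallest = next(
            f for f in _FORMS if (lead & f[0]) == f[1])
        trail = data[i + 1:i + 1 + n_trail]
        # Trailing octets must all be present and have the high bit set.
        if len(trail) != n_trail or any(t < 0x80 for t in trail):
            raise ValueError("malformed sequence at offset %d" % i)
        value = lead & (0xFF ^ mask)
        for t in trail:
            value = (value << 7) | (t & 0x7F)
        # Reject values that have a shorter encoding.
        if value < smallest or 0xA0 <= value <= 0xFF:
            raise ValueError("non-shortest form at offset %d" % i)
        values.append(value)
        i += 1 + n_trail
    return values
```

   With this sketch, the octets 2F 2E 2E 2F decode to four US-ASCII
   values, while 80 80 and 2F 2E 80 AE 2F both raise ValueError
   instead of being interpreted as NUL or as "/../".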