Internet-Draft                                            H. Alvestrand
draft-alvestrand-i18n-howto-00.txt                        Cisco Systems
Target Category: Informational                             January 2001
Expires: July 2001

             Protocol Redesigner's Handbook - volume i18n
           Guidelines for internationalization of protocols

Status of this Memo

   The file name of this memo is draft-alvestrand-i18n-howto-00.txt

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   Comments on this draft should be sent to the mailing list
   intloc@ops.ietf.org. This is NOT an open mailing list.

Abstract

   This document attempts to give guidelines for the people who have
   to deal with existing protocols where issues of languages and
   character sets were not considered from the beginning, and tries to
   help them a little along the way. Some of the advice might also be
   useful for people designing new protocols.

Guidelines for protocol internationalization          Harald Alvestrand
draft-alvestrand-i18n-howto-00.txt                    Expires July 2001

1. Introduction

   Human beings on our planet have, past and present, used a number of
   languages. These have been represented in a number of media using a
   variety of encoding systems, most commonly in scripts using some
   kinds of characters.

   These days, humans tend to want to use the Internet to communicate
   between themselves, and to interact with information stores on the
   Net. This means that they have to use Internet protocols to
   communicate. And they will want to represent the encodings they are
   used to from off the Net when they use the Internet protocols.

   And they expect the Right Thing to happen.

   This document talks about what doing the Right Thing means.

2. Classes of information

   Most protocols are designed with pieces that belong in various
   categories:

   .  Protocol elements, defined by the protocol designer, never shown
      to the user, and never changed.

   .  Managed-namespace identifiers, defined by some orderly process,
      intended to be used by any protocol user anywhere.

   .  Global-scope identifiers, intended for visibility to any user
      who has a use for them anywhere, but not completely managed by a
      central authority.

   .  Local-scope identifiers, intended for visibility to a small set
      of users, but possibly visible in several contexts.

   .  Data content, intended for visibility within a certain context
      only.

   Internationalizing a piece in this context means making it capable
   of representing information relevant to any user, no matter which
   script or language this user uses. This may involve dealing with
   character representation, processing rules, language tagging or
   other functions as appropriate.

   For each element to be considered, there are 3 alternatives:

   1. State that the element is a textual element for which the user
      decides the appropriate content. Basically, it has to be
      internationalized.

   2. State that the element has to be in a very limited
      representation (such as the A-Z 0-9 character repertoire) so
      that it can be globally recognized and entered.

   3. State that the element is immutable, invisible and inviolable,
      and therefore internationalization is irrelevant.

   Internationalization requirements started out with data content
   (MIME for email, for instance), and are working their way up the
   chain. For a long time (see RFC XXX, for instance), we thought that
   global-scope identifiers like DNS names should be kept in category
   2 (limited repertoire), but increasing pressure from the community
   of people who do not use ASCII in their daily lives has led to a
   reconsideration here (IDN).

   The current thinking of the authors of this document, which is
   suggested as IETF policy, is that protocol elements should NOT be
   internationalized; their values should be either binary or
   invariant-subset ASCII. This makes testing and debugging easier,
   and does not limit the expressive power of any protocol.

3. Designing Internet internationalization

3.1 Basic concepts for the Internet

   The fundamental difference between common
   internationalization/localization and Internet protocol
   internationalization is this:

   ON THE INTERNET, THE TWO ENDS OF THE COMMUNICATION ARE NOT IN THE
   SAME PLACE.

   This means, in particular, that:

   .  The two ends of the communication do not share a common context
      such as a "locale", or even a country.

   .  The two ends of the communication do not necessarily have ANY
      common knowledge except for the implementation of the protocol.
      With implementations in local networks, not even Internet access
      can be assumed, so even references to Internet-accessible
      resources are not guaranteed to work.

   This means that:

   .  ALL information required for correct operation of the protocol
      must be specified in the protocol documentation, or be carried
      in-band.

   .  When user preferences are involved, where multiple values are
      possible, the specification must guarantee a least common subset
      of identifiers, and properly handle the enumeration of
      identifiers (for instance by IANA registration).

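   One way to read the "least common subset" requirement is sketched
   below in Python. This is an illustration only, not something this
   document specifies: the function name choose_charset and the choice
   of "utf-8" as the mandatory baseline are assumptions made for the
   example. The point is that because every implementation must
   support the baseline, negotiation can never end with no usable
   value, even when the two ends share no other knowledge.

```python
# Hypothetical baseline that the protocol specification would require
# every implementation to support (an assumption for this sketch).
MANDATORY_CHARSET = "utf-8"


def choose_charset(client_prefs, server_supported):
    """Return the first client preference the server supports.

    Falls back to the mandatory baseline, so a result always exists
    even if the two preference lists have nothing in common.
    """
    supported = {name.lower() for name in server_supported}
    supported.add(MANDATORY_CHARSET)
    for name in client_prefs:
        if name.lower() in supported:
            return name.lower()
    return MANDATORY_CHARSET


# Overlapping preferences: the client's first supported choice wins.
print(choose_charset(["ISO-8859-1", "UTF-8"], ["utf-8", "iso-8859-1"]))
# Disjoint preferences: the guaranteed baseline is used.
print(choose_charset(["koi8-r"], ["iso-2022-jp"]))
```

   In a real protocol the identifiers themselves would come from a
   registry (for instance one maintained by IANA), so that both ends
   agree on what each name means.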
   When discussing internationalization, it is also very important to
   use common terminology. The terminology of this field is littered
   with seemingly simple words that are used for different things by
   different people, with "character set", "script" and "language"
   being high on the list of abused terms. Refer to a commonly
   accepted set of definitions, such as [Hoffman].

3.2 Internationalization components outside IETF scope

   Internationalizing a program or a service involves much more than
   the protocols. But these other matters are not IETF issues, and do
   not impinge upon the IETF standards process except indirectly.

   In particular:

   .  The IETF does not standardize user interfaces. This means that
      input methods, display methods and display characteristics are
      out of scope for the IETF. (However, information about such
      methods and characteristics may at times have to be communicated
      using parameters of IETF protocols.)

   .  The IETF does not standardize data repositories.

   .  The IETF does not standardize APIs, except for the rare case of
      an API to a protocol.

   This also means that the presentation of data, and any conversions
   performed upon data in order to present it, are outside the scope
   of IETF standards. The IETF standards are concerned with
   communicating the data needed, not with how the data are presented.

3.3 Operations likely to be impacted by internationalization

   A basic level of internationalization is text representation. A
   protocol where it is not possible to send an Arabic letter SAD
   (U+0635), and let the recipient recognize this as such, is useless
   for communication in Arabic.

   This was addressed in RFC 2277, "IETF Policy on Character Sets and
   Languages".

   This is sufficient for handling text where that text is not treated
   further by the protocol endpoint entities.

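   The basic requirement can be illustrated with a short Python
   sketch. It assumes UTF-8 as the charset on the wire; the helper
   names to_wire and from_wire are hypothetical and exist only for
   this example. The letter SAD leaves the sender as two octets and is
   recognized by the recipient only because both ends agree on the
   charset; a recipient that guesses differently sees unrelated
   characters.

```python
import unicodedata


def to_wire(text: str) -> bytes:
    """Sender side: turn characters into octets (UTF-8 assumed)."""
    return text.encode("utf-8")


def from_wire(octets: bytes) -> str:
    """Recipient side: recover characters; invalid UTF-8 raises."""
    return octets.decode("utf-8")


sad = "\u0635"
octets = to_wire(sad)
received = from_wire(octets)

print(octets.hex())                # d8b5
print(unicodedata.name(received))  # ARABIC LETTER SAD

# A recipient that wrongly assumes ISO-8859-1 gets two unrelated
# Latin-1 characters (U+00D8 and U+00B5) instead of one Arabic letter.
print(octets.decode("iso-8859-1") == "\u00d8\u00b5")  # True
```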
   But there are a number of things that make more trouble:

   .  Matching. If the protocol has any operation where one party
      gives a text element, and the other party performs an action
      based on the content of that text element, matching must take
      place. This needs specification.

   .  Sorting. If the protocol ever recognizes "ordered" or "less
      than" in any way, shape or form on textual elements, sorting
      needs specification.

   .  Canonicalization. If the protocol ever expects to binary compare
      two objects for equality, or compute checksums over the objects
      as done for digital signatures, the implementations will often
      want to ensure that when a human looking at the data in the
      object thinks that it is unchanged, it actually compares equal.
      The most common method of doing this is to define a single
      "canonical" form for the data.

   .  Field truncation. In single-byte encodings, one is guaranteed
      that a field value produced by truncating a longer value is at
      least a valid string. With multibyte encodings, this is not the
      case; with variable-length encodings like UTF-8, there is no way
      to know without inspecting the string where legal truncation
      points may be.

   .  Checks for legal and illegal characters. In some cases, one
      wants to specify things like "no spaces". One then has to
      consider whether this means no SPACE (U+0020), no space
      separator (Unicode general category Zs), or no white space at
      all (a class that includes TAB, for instance).

   .  (more here)

4. Specific sorting, matching and canonicalization options

   The cardinal rule of protocol internationalization should be:

   DO NOT INVENT ANYTHING IF YOU CAN AVOID IT.

   There are a number of ready-made things available, and a number of
   pitfalls that these things have already dealt with.

   However, there is no substitute for actually understanding the
   tools you are using.

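   Three of the trouble spots listed in section 3.3 (field truncation,
   canonicalization, and "no spaces" checks) can be sketched with
   Python's standard unicodedata module. This is a minimal sketch, not
   a recommendation from this document: the function names are
   hypothetical, the choice of normalization form NFC is an assumption
   (no form is named here), and str.isspace() is only Python's own
   notion of white space.

```python
import unicodedata


def truncate_utf8(data: bytes, max_octets: int) -> bytes:
    """Cut valid UTF-8 to at most max_octets without splitting a character.

    This only protects encoded characters; it can still separate a
    base letter from a following combining mark.
    """
    if len(data) <= max_octets:
        return data
    end = max_octets
    # Continuation octets have the form 10xxxxxx. If the first octet
    # being cut off is one, back up to the start of that character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both (NFC assumed)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def space_classes(ch: str) -> tuple:
    """Report which of three 'space' definitions a character meets:
    (is U+0020, is general category Zs, is white space per Python)."""
    return (ch == "\u0020", unicodedata.category(ch) == "Zs", ch.isspace())


# "cafe" with a precomposed e-acute is 5 octets; a 4-octet field must
# drop the whole final character, not half of it.
print(truncate_utf8("caf\u00e9".encode("utf-8"), 4))  # b'caf'

# Precomposed and decomposed forms look identical to a human but
# differ octet for octet until they are normalized.
print("caf\u00e9" == "cafe\u0301")                    # False
print(same_text("caf\u00e9", "cafe\u0301"))           # True

# TAB is neither SPACE nor a space separator, yet it is white space.
print(space_classes("\t"))                            # (False, False, True)
```

   A protocol specification would still have to state which of these
   definitions it means; the sketch only shows that they give
   different answers.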
   (specifics: Unicode identifier definition, UTF-8, ACAP/IMAP
   comparator library, IDN nameprep ... suggestions!)

4.1 Internationalized encodings

   When you transport I18N script across the wire, you don't actually
   transport the script itself. You are transporting the bits which
   represent the script. How the bits are assembled from and
   disassembled into scripts depends on character sets and encodings.

   There is no hard and fast rule about which character sets and
   encodings are appropriate for I18N, but ISO 10646 is recommended as
   the character set, and UTF-8 or UTF-32 as the encoding.

   I18N is not just a simple "8-bit clean" problem. ISO 10646 is a
   31-bit character set and thus "8-bit" is technically not
   sufficient. An encoding is how you transport an I18N script through
   your constrained environment.

4.2 Normalization

   Normalization is needed when you want canonical forms of scripts,
   e.g. in cases when you need to do matching, comparison or sorting
   of I18N elements.

   If normalization is needed, a good starting reference would be ...

   Do remember that normalization may be a one-way function which may
   not preserve the original form.

5. Security Considerations

   The security implications of improperly done internationalization
   can be considerable. For instance:

   .  If one does not specify whether input lengths are counted in
      characters or octets, buffer overflows are likely.

   .  If multiple representations of the same character are allowed,
      multiple items can appear to the user to have the same name,
      even though they are distinct. This can be used as an attack.
      (Note that this is hard to avoid - Greek uppercase Alpha and
      ASCII uppercase A may look VERY much like each other in common
      fonts, and it does not make sense to outlaw either of those.)

   .  Signature failures due to improper canonicalization are a
      security problem, too (more here)

6. Acknowledgements

   This document has benefited from many rounds of review and comments
   in various fora of the IETF and the Internet working groups.

   Any list of contributors is bound to be incomplete; please regard
   the following as only a selection from the group of people who have
   contributed to make this document what it is today.

   In alphabetical order:

   Patrik Faltstrom (apologies for the lack of internationalization),
   Paul Hoffman, James Seng

7. Author's Address

   Harald Tveit Alvestrand
   Cisco Systems
   Weidemanns vei 27
   7043 Trondheim
   NORWAY

   EMail: Harald@Alvestrand.no
   Phone: +47 73 50 33 52

8. References

   [ISO 639]
      ISO 639:1988 (E/F) - Code for the representation of names of
      languages - The International Organization for Standardization,
      1st edition, 1988-04-01. Prepared by ISO/TC 37 - Terminology
      (principles and coordination). Note that a new version (ISO
      639-1:2000) is in preparation at the time of this writing.

   [ISO 639-2]
      ISO 639-2:1998 - Codes for the representation of names of
      languages -- Part 2: Alpha-3 code - edition 1, 1998-11-01, 66
      pages, prepared by a Joint Working Group of ISO TC46/SC4 and ISO
      TC37/SC2.

   [ISO 3166]
      ISO 3166:1988 (E/F) - Codes for the representation of names of
      countries - The International Organization for Standardization,
      3rd edition, 1988-08-15.

   [RFC 1327]
      Kille, S., "Mapping between X.400 (1988) / ISO 10021 and RFC
      822", RFC 1327, University College London, May 1992.

   [RFC 1521]
      Borenstein, N., and N. Freed, "MIME Part One: Mechanisms for
      Specifying and Describing the Format of Internet Message
      Bodies", RFC 1521, Bellcore, Innosoft, September 1993.

   [RFC 2026]
      Bradner, S., "The Internet Standards Process -- Revision 3", RFC
      2026, October 1996.

   [RFC 2028]
      Hovey, R., and S. Bradner, "The Organizations Involved in the
      IETF Standards Process", RFC 2028, October 1996.

   [RFC 2119]
      Bradner, S., "Key words for use in RFCs to Indicate Requirement
      Levels", RFC 2119, March 1997.

   [RFC 2234]
      Crocker, D., Ed., and P. Overell, "Augmented BNF for Syntax
      Specifications: ABNF", RFC 2234, November 1997.

   [RFC 2616]
      Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
      Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
      HTTP/1.1", RFC 2616, June 1999.

   [RFC 2860]
      Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
      Understanding Concerning the Technical Work of the Internet
      Assigned Numbers Authority", RFC 2860, June 2000.