Internet-Draft                                       H. Alvestrand
draft-alvestrand-i18n-howto-00.txt
                                                     Cisco Systems
Target Category: Informational                                     
                                                      January 2001
                                                Expires: July 2001


Protocol Redesigner's Handbook - volume i18n

Guidelines for internationalization of protocols


Status of this Memo
     The file name of this memo is draft-alvestrand-i18n-howto-00.txt
     This document is an Internet-Draft and is in full conformance with
     all provisions of Section 10 of RFC 2026.
     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as Internet-
     Drafts.
     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-Drafts
     as reference material or to cite them other than as "work in
     progress."
     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt
     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.
Comments on this draft should be sent to the mailing list
intloc@ops.ietf.org. This is NOT an open mailing list.

Abstract
This document attempts to give guidelines for the people who have to
deal with existing protocols where issues of languages and character
sets were not considered from the beginning, and tries to help them a
little along the way. Some of the advice might also be useful for
people designing new protocols.
Guidelines for protocol internationalization     Harald Alvestrand
draft-alvestrand-i18n-howto-00.txt               Expires July 2001

1. Introduction


Human beings on our planet have, past and present, used a number of
languages.
These have been represented in a number of media using a variety of
encoding systems, most commonly in scripts using some kinds of
characters.
These days, humans tend to want to use the Internet to communicate
between themselves, and to interact with information stores on the Net.
This means that they have to use Internet protocols to communicate. And
they will want to represent the encodings they are used to from off the
Net when they use the Internet protocols.
And they expect the Right Thing to happen.
This document talks about what doing the Right Thing means.

2. Classes of information
Most protocols are designed with pieces that belong in various
categories:
. Protocol elements, defined by the protocol designer, never shown to
  the user, and never changed.
. Managed-namespace identifiers, defined by some orderly process,
  intended to be used by any protocol user anywhere.
. Global-scope identifiers, intended for visibility to any user who has
  a use for them anywhere, but not completely managed by a central
  authority.
. Local-scope identifiers, intended for visibility to a small set of
  users, but possibly visible in several contexts.
. Data content, intended for visibility within a certain context only.
Internationalizing a piece in this context means making it capable of
representing information relevant to any user, no matter which script
or language this user uses. This may involve dealing with character
representation, processing rules, language tagging or other functions
as appropriate.
For each element to be considered, there are three alternatives:
1. State that the element is a textual element for which the user
  decides the appropriate content.  Basically, it has to be
  internationalized.
2. State that the element has to be in a very limited representation
  (such as the A-Z 0-9 character repertoire) so that it can be globally
  recognized and entered.
3. State that the element is immutable, invisible and inviolable, and
  therefore internationalization is irrelevant.


Internationalization requirements started out with data content (MIME
for email, for instance), and are working their way up the chain. For a
long time (see RFC XXX, for instance), we thought that global-scope
identifiers like DNS names should be kept in category 2 (limited
repertoire), but increasing pressure from the community of people who
do not use ASCII in their daily lives has led to a reconsideration here
(IDN).
The current thinking of the authors of this document, which is
suggested as IETF policy, is that protocol elements should NOT be
internationalized; their values should be either binary or invariant-
subset ASCII. This makes testing and debugging easier, and does not
limit the expressive power of any protocol.

3. Designing Internet internationalization

3.1 Basic concepts for the Internet
The fundamental difference between common
internationalization/localization and Internet protocol
internationalization is this:
ON THE INTERNET, THE TWO ENDS OF THE COMMUNICATION ARE NOT IN THE SAME
PLACE.
This means, in particular, that:
. The two ends of the communication do not share a common context such
  as a "locale", or even a country.
. The two ends of the communication do not necessarily have ANY common
  knowledge except for the implementation of the protocol. With
  implementations in local networks, not even Internet access can be
  assumed, so even references to Internet-accessible resources are not
  guaranteed to work.
This means that:
. ALL information required for correct operation of the protocol must
  be specified in the protocol documentation, or be carried in-band.
. When user preferences are involved, where multiple values are
  possible, the specification must guarantee a least common subset of
  identifiers, and properly handle the enumeration of identifiers (for
  instance by IANA registration).
When discussing internationalization, it is also very important to use
common terminology. The terminology of this field is littered with
seemingly simple words that are used for different things by different
people, with "character set", "script" and "language" being high on the
list of abused terms. Refer to a commonly accepted set of definitions,
such as [Hoffman].

3.2 Internationalization components outside IETF scope
Internationalizing a program or a service involves much more than the
protocols. But these other matters are not IETF issues, and do not
impinge upon the IETF standards process except indirectly.
In particular:
. The IETF does not standardize user interfaces. This means that input
  methods, display methods and display characteristics are out of scope
  for the IETF. (However, information about such methods and
  characteristics may at times have to be communicated using parameters
  of IETF protocols.)
. The IETF does not standardize data repositories.
. The IETF does not standardize APIs, except for the rare case of an
  API to a protocol.
This also means that the presentation of data, and conversions
performed on data in order to present it, are outside the scope of
IETF standards. The IETF standards are concerned with communicating the
data needed, not how the data are presented.

3.3 Operations likely to be impacted by internationalization
A basic level of internationalization is text representation. A
protocol where it is not possible to send an Arabic letter SAD
(U+0635), and let the recipient recognize this as such, is useless for
communication in Arabic.
This was addressed in RFC 2277, "IETF Policy on Character Sets and
Languages".
This is sufficient for handling text where that text is not processed
further by the protocol endpoint entities. But a number of operations
cause more trouble:
. Matching. If the protocol has any operation where one party gives a
  text element, and the other party performs an action based on the
  content of that text element, matching must take place. This needs
  specification.
. Sorting. If the protocol ever recognizes "ordered" or "less than" in
  any way, shape or form on textual elements, sorting needs
  specification.
. Canonicalization. If the protocol ever expects to binary compare two
  objects for equality, or compute checksums over the objects as done
  for digital signatures, the implementations will often want to ensure
  that when a human looking at the data in the object thinks that it is
  unchanged, it actually compares equal. The most common method of
  doing this is to define a single "canonical" form for the data.
. Field truncation. In single-byte encodings, one is guaranteed that a
  field value produced by truncating a longer value is at least a valid
  string. With multibyte encodings, this is not the case; with
  variable-length encodings like UTF-8, there is no way to know without
  inspecting the string where legal truncation points may be.
. Checks for legal and illegal characters. In some cases, one wants to
  specify things like "no spaces". One then has to consider whether
  this means no SPACE (U+0020), no space separators (Unicode general
  category Zs), or no whitespace (a class that includes TAB, for
  instance).
. (more here)
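
As a minimal sketch of two of the items above (field truncation and
checks for legal characters), the following Python fragment shows one
way these might be handled. The function names truncate_utf8 and
space_kind are illustrative only and not part of any specification.
Python's str.isspace is used here as an approximation of "whitespace",
and truncation is done per code point, so a base character may still
be separated from a following combining mark; both are assumptions of
this sketch.

```python
import unicodedata


def truncate_utf8(text: str, max_octets: int) -> bytes:
    """Encode text as UTF-8 and cut it to at most max_octets octets
    (max_octets >= 0) without splitting a multi-octet character."""
    data = text.encode("utf-8")
    if len(data) <= max_octets:
        return data
    cut = max_octets
    # UTF-8 continuation octets have the bit pattern 10xxxxxx. If the
    # first octet to be dropped is a continuation octet, the cut would
    # fall inside a character, so back up to the character's start.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


def space_kind(ch: str) -> str:
    """Classify one character under three readings of "no spaces"."""
    if ch == "\u0020":
        return "SPACE"
    if unicodedata.category(ch) == "Zs":
        return "space separator"
    if ch.isspace():
        return "other whitespace"
    return "not whitespace"


# "a" followed by ARABIC LETTER SAD is 3 octets in UTF-8; a limit of
# 2 octets must drop the whole second character, not half of it.
short = truncate_utf8("a\u0635", 2)
kinds = [space_kind(c) for c in (" ", "\u00a0", "\t", "a")]
```

A protocol specification would still have to say which of the three
readings of "space" it means; the code only shows that they differ.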

4. Specific sorting, matching and canonicalization options
The cardinal rule of protocol internationalization should be:
DO NOT INVENT ANYTHING IF YOU CAN AVOID IT.
There are a number of ready-made things available, and a number of
pitfalls that these things have already dealt with.
However, there is no substitute for actually understanding the tools
you are using.
(specifics: Unicode identifier definition, UTF-8, ACAP/IMAP comparator
library, IDN nameprep... suggestions!)

4.1 Internationalized encodings
When you transport I18N script across the wire, you don't actually
transport the script itself. You are transporting the bits which
represent the script. How the bits are assembled from and disassembled
into scripts depends on character sets and encodings.

There is no hard and fast rule as to which character sets and encodings
are appropriate for I18N, but the recommendation is ISO 10646 for the
character set, and UTF-8 or UTF-32 for the encoding.

I18N is not just a simple "8-bit clean" problem.
ISO 10646 is a 31-bit character set and thus "8-bit" is technically not
sufficient. An encoding is how you transport an I18N script through
your constrained environment.
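
To illustrate the difference between a character and the bits that
represent it, here is a small Python sketch using the Arabic letter
SAD mentioned in section 3.3. The helper name wire_forms is
hypothetical, and the ISO 8859-1 ("latin-1") decoding at the end is
only an example of what a receiver might wrongly assume.

```python
SAD = "\u0635"  # ARABIC LETTER SAD


def wire_forms(text: str) -> dict:
    """Show the code points of text and the octets that two encodings
    would put on the wire (hexadecimal, space separated)."""
    return {
        "code points": [f"U+{ord(ch):04X}" for ch in text],
        "UTF-8": text.encode("utf-8").hex(" "),
        "UTF-32BE": text.encode("utf-32-be").hex(" "),
    }


forms = wire_forms(SAD)

# The same octets mean something else if the receiver assumes a
# different character set and encoding.
misread = SAD.encode("utf-8").decode("latin-1")
```

The same character occupies two octets in UTF-8 and four in UTF-32,
and a receiver that guesses the wrong encoding sees two unrelated
characters instead of one SAD. This is why the encoding has to be
fixed by the specification or carried in-band.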


4.2 Normalization
Normalization is needed when you want canonical forms of scripts, e.g.
in cases when you need to do matching, comparison or sorting of I18N
elements. If normalization is needed, a good starting reference would
be ...

Do remember that normalization may be a one-way function which may not
preserve the original form.
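
A minimal sketch of both points, assuming the Unicode normalization
forms NFC and NFKC as implemented by Python's unicodedata module (this
document does not itself select a normalization form, so that choice
is an assumption of the example, as is the helper name nfc_equal):

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two forms look the same to a human, but a binary comparison
# sees different strings; comparing normalized forms matches them.
raw_match = composed == decomposed
normalized_match = nfc_equal(composed, decomposed)

# Normalization is one-way: the ANGSTROM SIGN becomes LATIN CAPITAL
# LETTER A WITH RING ABOVE under NFC, and the "fi" ligature becomes
# the two letters "fi" under NFKC. Neither original can be recovered
# from the normalized result.
angstrom_nfc = unicodedata.normalize("NFC", "\u212b")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb01")
```

A protocol that needs the original form (for display, say) therefore
has to keep it alongside the normalized form used for matching.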


5. Security Considerations
The security implications of improperly done internationalization can
be considerable.
For instance:
. If one does not specify whether input lengths are counted in
  characters or octets, buffer overflows are likely.
. If multiple representations of the same character are allowed,
  multiple items can appear to the user to have the same name, even
  though they are distinct. This can be used as an attack.
  (Note that this is hard to avoid: Greek uppercase Alpha and ASCII
  uppercase A may look VERY much like each other in common fonts, and
  it does not make sense to outlaw either of those.)
. Signature failures due to improper canonicalization are a security
  problem, too.
(more here)
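
The first two items can be made concrete with a short Python sketch.
The helper name fits and its "unit" parameter are hypothetical, and
"characters" is taken here to mean Unicode code points, which is an
assumption; a specification could equally count something else.

```python
import unicodedata


def fits(text: str, limit: int, unit: str) -> bool:
    """Check a length limit counted in code points or in UTF-8 octets."""
    if unit == "characters":
        return len(text) <= limit
    if unit == "octets":
        return len(text.encode("utf-8")) <= limit
    raise ValueError("unit must be 'characters' or 'octets'")


# Eight ARABIC LETTER SAD characters take sixteen octets in UTF-8, so
# the same value passes one reading of "at most 8" and fails the other.
name = "\u0635" * 8
ok_as_characters = fits(name, 8, "characters")
ok_as_octets = fits(name, 8, "octets")

# Look-alike characters: the two strings below are distinct and stay
# distinct even after compatibility normalization (NFKC).
latin_a = "A"
greek_alpha = "\u0391"
still_distinct = unicodedata.normalize("NFKC", greek_alpha) != latin_a
```

An implementation that validates in one unit and allocates in the
other is the buffer overflow case described above.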

6. Acknowledgements
This document has benefited from many rounds of review and comments in
various fora of the IETF and the Internet working groups.
Any list of contributors is bound to be incomplete; please regard the
following as only a selection from the group of people who have
contributed to make this document what it is today.
In alphabetical order:
Patrik Faltstrom (apologies for the lack of internationalization), Paul
Hoffman, James Seng

7. Author's Address
Harald Tveit Alvestrand
Cisco Systems
Weidemanns vei 27
7043 Trondheim
NORWAY
EMail: Harald@Alvestrand.no
Phone: +47 73 50 33 52

8. References

[ISO 639]
     ISO 639:1988 (E/F) - Code for the representation of names of
     languages - The International Organization for Standardization,
     1st edition, 1988-04-01 Prepared by ISO/TC 37 - Terminology
     (principles and coordination).
     Note that a new version (ISO 639-1:2000) is in preparation at the
     time of this writing.
[ISO 639-2]
     ISO 639-2:1998 - Codes for the representation of names of
     languages -- Part 2: Alpha-3 code  - edition 1, 1998-11-01, 66
     pages, prepared by a Joint Working Group of ISO TC46/SC4 and ISO
     TC37/SC2.

[ISO 3166]
     ISO 3166:1988 (E/F) - Codes for the representation of names of
     countries - The International Organization for Standardization,
     3rd edition, 1988-08-15.
[RFC 1327]
     Kille, S., "Mapping between X.400 (1988) / ISO 10021 and RFC 822",
     RFC 1327, University College London, May 1992.
[RFC 1521]
     Borenstein, N., and N. Freed, "MIME Part One: Mechanisms for
     Specifying and Describing the Format of Internet Message Bodies",
     RFC 1521, Bellcore, Innosoft, September 1993.
[RFC 2026]
     Bradner, S., "The Internet Standards Process -- Revision 3", RFC
     2026, October 1996.
[RFC 2028]
     Hovey, R., and S. Bradner, "The Organizations Involved in the IETF
     Standards Process", RFC 2028, October 1996.
[RFC 2119]
     Bradner, S., "Key words for use in RFCs to Indicate Requirement
     Levels", RFC 2119, March 1997.
[RFC 2234]
     Crocker, D., Ed., and P. Overell, "Augmented BNF for Syntax
     Specifications: ABNF", RFC 2234, November 1997.
[RFC 2616]
     Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
     Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
     HTTP/1.1", RFC 2616, June 1999.
[RFC 2860]
     Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
     Understanding Concerning the Technical Work of the Internet
     Assigned Numbers Authority", RFC 2860, June 2000.

