Network Working Group                                         J. Klensin
Internet-Draft                                          October 22, 2003
Expires: April 21, 2004

                Internationalization of Email Addresses
                  draft-klensin-emailaddr-i18n-01.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on April 21, 2004.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

Internationalization of electronic mail addresses is, if anything, more important than the already-completed effort for domain names. In most of the contexts in which they are used, domain names can be hidden within or as part of various types of references. Email addresses, by contrast, are crucial: use of names of people or organizations as, or as part of, the email local part is, for obvious reasons, a well-established tradition on the network. Preventing people from spelling their names correctly is, in the long term, inexcusable. At the same time, email addresses pose a number of special problems -- they are more difficult than simple domain names in some respects, but actually easier in others.
This document discusses the issues with internationalization of email addresses, explains why some obvious approaches are incompatible with the definitions and use of Internet mail, and proposes a solution that is likely to serve users and the network well for the long term.

Table of Contents

1.    Introduction
2.    History, Context, and Design Constraints
2.1   The Presentation Issue
2.2   MUAs, MTAs, addresses, and learning from MIME and ESMTP
2.3   An MUA-only-based Solution is Not Necessary
2.3.1 Obtaining an Internationalized Email Address
2.3.2 Relay environment
2.3.3 Internationalizing the Sender
2.4   A Solution Based on MUA Changes Alone is Unworkable
2.4.1 MX Diversion
2.4.2 Embedded commands
2.5   Encoding the Whole Address String
2.6   Looking back and looking forward
2.7   Summary of Design Issues and Tradeoffs
3.    A Mail Transport-level Protocol
3.1   General Principles and Objectives
3.2   Framework for the Internationalization Extension
3.3   The Address Internationalization Service Extension
3.4   Extended Mailbox Address Syntax
3.5   Additional ESMTP Changes and Clarifications
3.5.1 The Initial SMTP Exchange
3.5.2 Trace Fields
3.6   Protocol Loose Ends
3.6.1 Punycode in Domain Names?
3.6.2 Local Character Codes in Local Parts?
3.6.3 Restrictions on Characters in Local Part?
3.6.4 Requirement for 8BITMIME?
3.6.5 Message Header and Body Issues with MTA Approach?
3.6.6 Variant Addresses (Aliases) in a Command Verb
3.6.7 The Received field 'for' clause
4.    Impact on the MUA and on Message Headers
5.    Internationalization and Full Localization
6.    Advice to Designers and Operators of Mail-receiving Systems
7.    Security considerations
8.    Acknowledgements
9.    An Appeal
      Normative References
      Informative References
      Author's Address
      Intellectual Property and Copyright Statements

1. Introduction

Internationalization of electronic mail addresses is, if anything, more important than the already-completed effort for domain names. In most of the contexts in which they are used, domain names can be hidden within, or as part of, various types of references or the references themselves may be hidden. It also remains controversial whether internationalization of domain names is actually necessary, no matter how attractive and important it may appear at first glance. Email addresses, by contrast, are crucial: use of names of people or organizations as, or as part of, the email local part is, for obvious reasons, a well-established tradition on the network.
Preventing people from spelling their names correctly is, in the long term, inexcusable.

However, while it is tempting to ignore them, email addresses pose a number of special problems. Unlike domain names -- and, consequently, the domain part of an email address (after the last "@") -- the local part (or mailbox name) is essentially unconstrained with regard to syntax or the characters used. There are no special delimiters comparable to the period used to separate domain name labels, there is no standardized structure comparable to the domain name system's hierarchy, and it has always been a firm protocol requirement that no host other than the one to which final delivery is made is permitted to parse or interpret the address (see section 2.3.10 of [RFC2821]). In some respects, this makes things much harder: it is far more difficult to know what behavior will cause existing systems to cease working properly. In others, it actually makes them easier, since the originating system is not required to, and indeed must not, understand how the receiving one will interpret an address. The balance of this document explores these issues in more detail.

While much of the description here depends on the abstractions of "Mail Transfer Agent" ("MTA") and "Mail User Agent" ("MUA"), it is important to understand that those terms and the underlying concepts postdate the design of the Internet's email architecture and the "protocols on the wire" principle. The latter two concepts have prevented any strong and standardized distinctions about how MTAs and MUAs interact on a given origin or destination host (or even whether they are separate).

This document assumes a reasonable understanding of the protocols and terminology of the most recent core email standards documented in RFC 2821 [RFC2821] and RFC 2822 [RFC2822].

In its present Internet-Draft form, the document contains a great deal of explanatory material and rationale for the approach chosen.
The actual protocol material appears almost entirely in Section 3, especially Section 3.2 through Section 3.4, and in Section 4. If this document appears to be a candidate for standards-track publication, the explanatory material, rationale, and most of the other background materials should be moved to a separate document. Those who wish to skip the reasoning and comparison to other alternatives in this document and examine the protocol proposal should skip to those sections.

2. History, Context, and Design Constraints

Several key issues in how email works and is handled impose significant constraints on the solution space.

Email is often used as a transport mechanism for information that will be acted on by computers, not merely read by people. While the approach is not common, some of the systems that use it that way encode routing, processing, or validation information into the envelope address fields. More commonly, recipient systems use special address formats to encode local routing or priority information. In recent years, some of these addressing techniques have become important anti-spam tools for some users and communities.

These techniques have a long history. Most or all of them conform to email standards and practices that, in turn, go back to the first uses of email on the ARPANet. Backward-compatibility -- not damaging the interoperability of standards-conforming programs that are now deployed and working correctly -- makes it inappropriate to make decisions by conducting user surveys and concluding that "not too many" people will be hurt. Any new system must preserve existing practices and flexibilities unless there are overwhelming reasons -- e.g., an absence of plausible alternatives -- to not do so.
Historically, when one of these approaches has required that the email address local part be partitioned into components that are then interpreted differently or in some special sequence, the information has been organized according to some lexical convention, typically either based on one or more delimiters or on some sort of position and length notation (or a mixture of the two for different purposes). Either may be applied left-to-right or right-to-left and, again, we have a history of each, including the notorious "!a!b!c!d%e%f" local parts.

2.1 The Presentation Issue

Before continuing, it is important to note that any internationalization system, regardless of how it is implemented at the protocol level, will require changes at the user interface level if it is to function in a way that end users consider reasonable. Unless addresses are presented to the user in familiar characters and formats, the user's perception will be, not of internationalization and behavior that is user- and culturally-friendly, but of a relatively hostile environment.

One thing we have almost certainly learned from nearly forty years of experience with email is that users strongly prefer email addresses that closely resemble names to those involving, e.g., user ID numbers or complex coding that makes the local part appear as gibberish. Indeed, that principle -- of wanting local parts to appear intelligible -- is arguably the entire reason for wanting to internationalize these addresses. If a user sees "xn--fltstrm-5wa1o" (a punycode form) or "F=E4ltstr=F6m" (the MIME quoted-printable form), rather than the correctly-written localized string, the result is almost certain to be unhappiness.
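
To make the two encoded forms quoted above concrete, the following is a minimal sketch of how they can be derived from the same name. It is illustrative only and not part of the proposal: it assumes Python 3, its built-in "idna" codec (which implements the IDNA ToASCII operation), and ISO-8859-1 as the charset behind the quoted-printable form. The variable names are hypothetical.

```python
import quopri

name = "Fältström"

# ASCII-compatible ("punycode") form of the lowercased string, produced by
# Python's built-in "idna" codec.
ace = name.lower().encode("idna").decode("ascii")

# MIME quoted-printable form, assuming the octets are ISO-8859-1.
qp = quopri.encodestring(name.encode("iso-8859-1")).decode("ascii")

# The UTF-8 octets that a transport-level approach would carry directly.
utf8 = name.encode("utf-8")

print(ace)   # xn--fltstrm-5wa1o
print(qp)    # F=E4ltstr=F6m
print(utf8)  # the raw UTF-8 bytes; decoding them yields the name unchanged
```

Only the UTF-8 form decodes back to the correctly-written name without the reader's software having to recognize a special coding convention.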
2.2 MUAs, MTAs, addresses, and learning from MIME and ESMTP

The development and deployment of MIME [RFC2045] provided a number of important lessons for the community about how to design extensions and enhanced features without harm to the installed and conforming email system. Perhaps the most important of these was that it is easier, and often more expedient, to make changes that have impact only on mail user agents. If it is possible to make changes that way -- generally changes that involve only message headers and the message body or body parts -- users who need particular features can switch to user agents that support them or press for those features in the user agents they have already selected. Even in the worst case in which support for features the user considers critical is not readily available, it is possible, with proper user agent design, to save the entire message to a file and then use stand-alone software to interpret the information and perform the desired functions.

Providing these functions in the message headers and body permits them to be moved opaquely through the mail transport system, thus avoiding any requirement to modify originating or delivery MTAs or intermediate relays. In practice, the user may have little control over those systems. Since changes to them typically impact large numbers of users, those who are responsible for them are often reluctant to make changes in response to the needs of a few users.

It is hence reasonable to conclude that, if it is feasible to support address internationalization strictly at the MUA level, keeping the internationalized addresses opaque to the transport system, that is a more desirable approach than requiring MTA changes. The MUA-only approach has been carefully examined by others [I-D.hoffman-imaa]. This document argues that

1. addressing is a fundamental MTA-level function,

2. some of the complexities encountered when trying to encode addresses so as to avoid MTA interactions are symptoms that attempting to "hide" the MTA function so that it can be handled by MUAs is not an architecturally desirable approach,

3. the restrictions on email uses and syntax required to provide internationalization at MUA level are unnecessarily risky, and almost certainly damaging, to deployed email infrastructure, and

4. MTA-level solutions are feasible, architecturally more elegant, and perhaps not as difficult to deploy in relevant communities as the strongest advocates of the MUA-only approach appear to imagine.

See Section 2.7 for additional discussion on this point.

The decision as to what to do in message bodies and formats (e.g., [RFC2822] and MIME [RFC2045]) and what to handle in message transport (i.e., [E]SMTP) is critical because, as discussed below, the level at which something is handled is both determined by, and determines, how information is appropriately encoded. This decision ultimately depends on the application of two principles:

1. If body content is opaque, anything still visible to transport requires transport negotiation.

2. Anything an MTA -- origin, relay, MX, gateway, delivery -- needs to understand or process must be handled as part of mail transport.

The discussion below might be titled "why the MTA must get involved".

Whether mail addresses meet these criteria, and hence must be comprehensible in transport, depends on how much the sending MUA needs to know to construct, and the delivery MTA needs to know to deliver, a message. Traditionally, we have kept the former knowledge level at zero: if a sender produces "!a!b!c@example.com" in response to information that it is a valid address, it still does not know whether this is a "bang path" or a slightly-perverse name for a single mailbox.
Is "xyz%def@example.com" a specification for routing to mailbox "xyz" on host "def" or a mailbox on the example.com host named "xyz%def"? Are "foo+bar@..." or "foo-baz@..." subaddresses "bar" and "baz" for the mailbox "foo", or are they simple addresses? Is "jjoneschem@labs.example.com" a local mailbox on that host or an instruction to route mail to "jjones" in the chemistry department?

Under the rules established in [RFC0821] and [RFC1123], as summarized and updated in [RFC2821], all of those decisions are up to "example.com", its MX alternatives, or hosts in that domain, and they may make very local decisions about them. For example, "xyz%def" might be a mailbox while "xyz%ghi" might be a route; "foo-baz" might be a subaddress while "foo-blog" might be a mailbox. The sender cannot, in the general case, know.

Worse, while non-alphanumeric characters like "+", "-", and "%" have been used in these examples, delimiters for subaddresses, implicit routing, embedded commands, and so on are, again, up to the destination MTA and its interpretations. "X" might be as good a delimiter as "+". It might even be a better one in some applications. And, since local-parts are defined as case-sensitive, "x" might be a normal address character in the same address in which "X" was an important delimiter. Of course, in a completely non-ASCII environment, it would make sense to substitute characters from the local script for "+", "-", "%", and so on. If one wants a string completely in local language (i.e., non-ASCII) characters, then there may be no desire to break that convention in order to use an ASCII delimiter (see Section 5 for additional discussion of this issue).

It is not even necessary to use a delimiter to support some forms of subaddressing or local routing.
Suppose an organization adopted the convention that externally-visible email address local parts were structured as, e.g., a three-letter department code, followed by a five-letter code representing the individual, optionally followed by a code representing a project. Many organizations use just such systems and there is no way (and no need) for an email sender to understand the system or whether it is actually used for mail routing internally.

Consequently, the idea of a sender breaking an address up into its component parts and encoding those parts separately, or even just doing an encoding in sections that preserves the positions of the delimiters (as measured from the left), is an impossibility without major, incompatible, and retroactive changes in how mail addressing is defined.

2.3 An MUA-only-based Solution is Not Necessary

2.3.1 Obtaining an Internationalized Email Address

One of the classic arguments for an approach based on MUA changes only (to international addresses or anything else) is that users will be able to install and use solutions on their own, even if the administrators of their systems are unenthused about the particular function or extension and delay, or decline, to install it. That argument was certainly true for MIME, especially in the presence of the capability to store messages as files and apply post-MUA tools. But it does not seem to apply to email addresses. In general, users cannot create email accounts or aliases that control delivery of messages from external systems. Those accounts and aliases must be created by system administrators responsible for the mail servers. If they are not sympathetic to internationalized mailbox names, such names will not exist on the receiving system.
Having apparatus to send those names through the protocols will be essentially useless: a message that bounces because the relevant account or mailbox does not exist will bounce equally well whether the target address is in ASCII or in some other script and whether or not the receiving MTA is required to explicitly agree to accept internationalized addresses. Conversely, if the administrators of the mail system host are sympathetic to internationalization, it is reasonable to expect that appropriate software can and will be installed at the MTA level.

An apparently important exception to the position taken in the above paragraph arises for subscription, often free, email services such as those operated under the "Hotmail", "Netscape", and "Juno" names. Some of these systems permit users to select their own names (local parts) through an automated process. If the user creates a mailbox using an encoded name, users with MUAs that support the encoding will be able to send mail using a name in the user's preferred characters. But the user cannot know what capabilities the correspondents will have available, and hence must give out both the name in local characters and the encoded form. This is unlikely to be considered desirable. More important, if the user has presentation software that recognizes the coding conventions, then he or she will be able to see the original-language names in incoming messages.

Consider this practice from a user point of view. First, the domain names for these systems will generally continue to be in ASCII, so the goal of an email address that is entirely or predominantly in the user's language will be unattainable. If the domain names are non-ASCII (i.e., are IDNA encodings of non-ASCII strings), it is reasonable to assume that an operator who would choose such a name would be willing to internationalize its MTA.
Second, such systems are most often accessed through web-based interfaces where most email header information appears to the user's browser as running text. Because an email local part can, today, take on the form of almost any ASCII string, it is not reasonable to expect that a browser, even one with some localized functions, will be able to accurately detect an embedded, specially-coded, mailbox local part and correctly decode and render it. Heuristics based on detection of an at-sign ("@") will, of course, work for many, perhaps most, cases, but will also produce a certain number of false positives, perhaps destroying URLs or examples in the text.

It is worth noting that any recognition and decoding of local parts using a local encoding relies on heuristics that may fail: all such strings are historically-valid email local parts, and, unlike the DNS situation, it is impossible to conduct a reliable survey to determine that no one is using any particular encoding form, especially if the encoding indicator appears embedded in the local part string, rather than as a prefix.

By contrast, if the MTA sees a Unicode string, and Unicode strings are placed in message headers and message bodies as needed, the transition may be more difficult, but no long-term user confusion or exposure to ugly encodings will be necessary.

2.3.2 Relay environment

As in many other areas with email, the difficulties with an MTA-based model for internationalization of addresses arise, not when the originating MTA communicates directly with the delivery MTA, but when relay MTAs are involved. If both the sending and receiving systems support internationalized addresses, it is still possible that an intermediate relay will not do so, forcing mail to bounce that could be delivered if there were a direct connection between sender and receiver.
But, as with the installation of email addresses on a system, relays do not get inserted in the mail path by accident. If internationalized addresses are important to the destination host, its administrators will choose lower-preference MX hosts or other relays that can support internationalized addresses.

2.3.3 Internationalizing the Sender

If we assume a destination host that can accept, and properly handle, an internationalized address, and we assume that any MX-designated intermediaries for that host will be chosen to be similarly capable, one situation is left in which it would be advantageous to have an MUA-only-based solution. If an originating/sending system is not capable of generating or sending an internationalized address, but the prospective receiving system is, it would be good to enable the originating user to generate, and somehow send to, the relevant address.

This is a real issue, and deserves some serious consideration. But it seems better to find a good temporary, transitional, mechanism for it than to permanently burden the email system with an uncomfortable mechanism just to accommodate this case. One example of a transitional mechanism might be to use ESMTP tunneling over MIME [RFC2442] to route the address and message to a friendly gateway host that would unpack the message and transmit it using this specification. Other examples, less attractive at first glance but still plausible, would include defining and using small variations on the message encapsulation mechanisms that are integral to MIME [RFC2046], or the more complex encapsulation designed for HTML [RFC2557], to accomplish the same purpose.

So, a user with an MUA that has the capability to handle an internationalized address, but who does not have access to an originating MTA with the capabilities defined here, may be given access to a reasonable transition strategy until the needed capabilities are available.
Note that this does not require an open relay, since all of the user authentication capabilities of ESMTP [RFC2554] and SUBMIT [RFC2476] would be available. One can even imagine a service with a per-message charging system, which would presumably encourage rapid upgrading.

2.4 A Solution Based on MUA Changes Alone is Unworkable

The examples given above are, perhaps obviously, not the only ones. Other issues arise with intermediate MX relay and gateway hosts, commands embedded in local parts, and special formats used in gateways to other environments, among other cases.

2.4.1 MX Diversion

If the domain part of an email address is associated with several MX records and the mail is delivered to one of them that is not the best preference host, the receiving host is not required to use SMTP. If, instead, it performs some gateway function, it may need to inspect or alter the local part to determine how to route and deliver the message. If the local part were encoded in some fashion that prevented that inspection process, and the MTA was not aware that it needed to apply special techniques, mail delivery might well fail.

2.4.2 Embedded commands

In addition to the address forms with special syntax or semantics described elsewhere, systems have been developed that embed commands in address local parts. These might, of course, use entirely different syntax parts and formats than are typical in conventional addresses and, in an internationalized environment, might reasonably use character coding conventions that are neither ASCII nor Unicode-based.

A number of specialized applications of email do require, or recommend, specific syntax in the local part.
These are identified, not to indicate that they are the only cases (they are not) but to reinforce the point that one must be quite cautious in doing anything that makes global assumptions about local part syntax and significant characters. These applications include local part explicit routing with the "percent hack" [RFC1123], gateways to and from X.400 environments [RFC2156], and gateways to fax systems [RFC3192].

2.5 Encoding the Whole Address String

Much of the above demonstrates why selective encoding of parts of the local-part string is not practical, will exclude many important cases, or will subject users to permanent use of the cryptic encoded forms. Why, then, not encode the entire string and insist that the delivery MTA recognize the presence of an encoded form and do whatever decoding is needed before it does other processing? There are three major reasons not to approach the problem this way:

1. Any change in address syntax interpretation is likely to be a major, incompatible, change, since we do not now impose any restrictions on how an MTA is organized or even on how, or whether, the MTA and MUA functions are actually divided up on a given host. Converting user agents to handle international forms of addresses in a way that does not produce user astonishment is likely to be a major undertaking, regardless of what is done to the protocols and at what level.

2. Imposing a requirement that MTAs "understand" local-parts so that they can be partially decoded as part of mail routing would seem to defeat the main goal of encoding internationalized strings into a compact ASCII-compatible form, i.e., to keep MTAs from needing to understand the extended naming system.

3. We potentially have three different encodings of an internationalized string: the one used by the MTA, the one used by the MUA, and the one seen by the user through applications software or the operating system's display interface. Having all three of these identical or closely compatible is desirable from the standpoint of user understanding and debugging. Having them different can cause many "interesting" problems, e.g., having to return an error message that uses different coding, and hence might represent an entirely different string, than the string the user put into the process.

Instead, it would seem sensible to move from a straightforward encoding of mail addresses in ASCII to a straightforward encoding in Unicode via UTF-8 [RFC2277], imposing only those restrictions on the characters in the local part that are implied by Unicode itself.

2.6 Looking back and looking forward

Another principle is implied by some of the discussion above. Internationalization measures for the Internet will be with us for as long as there are multiple languages and scripts in the world, i.e., probably forever. If a satisfactory long-term solution can be found, and a reasonable transition strategy can be defined for it, it is much better to optimize for the long term. The alternative of making things more difficult or less functional -- for the transport, the MUA, and/or the user interface system -- forever in order to save some small effort in transition, or even to make the transition a few months faster, represents a very poor tradeoff.

2.7 Summary of Design Issues and Tradeoffs

Each of the above subsections describes a strong case for continuing to treat addressing as an MTA function, opaque except at the end systems.
The main alternative is to rely on the sending system being able to understand the addressing system of the target host, and any relays accessed through MX relays, potentially needing to be able to remove IDN encoding ("punycode" or otherwise) in order to determine how to process or route the message. That alternative violates a long-standing and important design principle of Internet email, complicates a number of other cases, and does not offer sufficient transition advantages to be worth any of those difficulties.

The protocol proposed here takes a giant step toward true internationalization of electronic mail, providing a good functional approximation to what we might have done several decades ago had Unicode and the necessary understanding been available. It does not go as far as one could imagine going in providing address forms that would be compatible with local styles and models all over the world. The issues in considering, and taking, those extra steps are discussed in Section 5.

3. A Mail Transport-level Protocol

3.1 General Principles and Objectives

1. Whatever encoding is used should apply to the whole address and be directly compatible with software used at the user interface.

2. An SMTP relay must either recognize the format explicitly, agreeing to do so via an ESMTP option, or bounce the message so that the sender can make another plan.

3. If any charset other than UTF-8 or punycode is permitted and used for the local part, its interpretation at the "what does this mean" level is the responsibility of the receiving MTA.

3.2 Framework for the Internationalization Extension

The following service extension is defined:

1. the name of the SMTP service extension is "Internationalized Addresses";

2. the EHLO keyword value associated with this extension is "I18N";

3. No parameter values are defined for this EHLO keyword value.
In order to permit future (although unanticipated) extensions, the EHLO response MUST NOT contain any parameters. If a parameter appears, the SMTP client that is conformant to this version of this specification MUST treat the ESMTP response as if the I18N keyword did not appear.

4. no parameters are added to any SMTP command.

[[Note in draft: A variation on this is probably excess complexity, rather than a good tradeoff, but should be considered in terms of whether it would be a good transitional aid. It would be possible to permit an optional parameter on the MAIL and RCPT commands that would specify an all-ASCII address to be used if an MTA (SMTP Sender) encounters an SMTP Receiver that does not support this extension. Such a parameter might be called "AddressVariant" or even just "alias". It would be especially useful in error handling if used on the MAIL command. ]]

5. no additional SMTP verbs are defined by this extension.

Most of the remainder of this memo specifies how support for the extension affects the behavior of an SMTP client and server and what message header changes it implies.

3.3 The Address Internationalization Service Extension

In the absence of this extension, SMTP clients and servers are constrained to using only those addresses permitted by RFC 2821. The local parts of those addresses may be made up of any ASCII characters, although certain of them must be quoted as specified there. It is notable in an internationalization context that there is a long history on some systems of using overstruck ASCII characters (a character, a backspace, and another character) within a quoted string to approximate non-ASCII characters. This form of internationalization should probably be phased out as this extension becomes widely deployed, but backward-compatibility considerations require that it continue to be supported.
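
The keyword rule in Section 3.2 (the extension counts as offered only when "I18N" appears with no parameters) can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name offers_i18n is hypothetical, and the input is assumed to be the EHLO reply lines with the "250-" or "250 " prefix already removed. Keyword matching is case-insensitive, as RFC 2821 requires for EHLO keywords.

```python
def offers_i18n(ehlo_lines):
    """Return True only if the EHLO reply offers I18N with no parameters.

    ehlo_lines is assumed to hold the text of each reply line after the
    reply code; the first line is the server's greeting/domain line.
    """
    for line in ehlo_lines[1:]:
        words = line.split()
        if words and words[0].upper() == "I18N":
            # Any parameter after the keyword means the client must behave
            # as if the keyword had not appeared at all.
            return len(words) == 1
    return False


print(offers_i18n(["mail.example.com", "8BITMIME", "I18N"]))  # True
print(offers_i18n(["mail.example.com", "I18N FOO"]))          # False
```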
An SMTP Server that announces this extension MUST be prepared to accept a UTF-8 string [RFC2279] in any position in which RFC 2821 specifies that a "mailbox" may appear. That string must be parsed only as specified in RFC 2821, i.e., by separating the mailbox into source route, local part and domain part, using only the characters colon (U+003A), comma (U+002C), and at-sign (U+0040) as specified there. Once isolated by this parsing process, the local part MUST be treated as opaque unless the SMTP Server is the final delivery MTA. Any domain names that are to be looked up in the DNS MUST be processed into punycode form as specified in IDNA [RFC3490] unless they are already in that form. Any domain names that are to be compared to local strings SHOULD be checked for validity and then MUST be compared as specified in IDNA.

An SMTP Client that receives the I18N extension keyword MAY transmit a mailbox name as an internationalized string in UTF-8 form. It MAY transmit the domain part of that string in either punycode (derived from the IDNA process) or UTF-8 form but, if it sends the domain in UTF-8, it SHOULD first verify that the string is valid for a domain name according to IDNA rules. As required by RFC 2821, it MUST NOT attempt to parse, evaluate, or transform the local part in any way. If the I18N SMTP extension is not offered by the Server, the SMTP Client MUST NOT transmit an internationalized address. Instead, it MUST either return the message to the user as undeliverable or replace it, using some process outside the scope of this specification such as a directory lookup, with a local-part that conforms to the syntax rules of RFC 2821.

3.4 Extended Mailbox Address Syntax

RFC 2821, section 4.1.2, defines the syntax of a mailbox as

   Mailbox = Local-part "@" Domain
   Local-part = Dot-string / Quoted-string
         ; MAY be case-sensitive
   Dot-string = Atom *("." Atom)
   Atom = 1*atext
   Quoted-string = DQUOTE *qcontent DQUOTE
   Domain = (sub-domain 1*("." sub-domain)) / address-literal
   sub-domain = Let-dig [Ldh-str]

(see that document for productions and definitions not provided here -- their details are not important to understanding this specification).

The key changes made by this specification are, informally, to

   o Change the definition of "sub-domain" to permit either the definition above or a UTF-8 (or other, see Section 3.6.1) string representing a label that is conformant with IDNA [RFC3490]. That sub-domain string MUST NOT contain the characters "@" or ".".
   o Change the definition of "Atom" to permit either the definition above or a UTF-8 (or other, see Section 3.6.3) string. That string MUST NOT contain any of the ASCII characters (either graphics or controls) that are not permitted in "atext"; it is otherwise unrestricted.

3.5 Additional ESMTP Changes and Clarifications

The mail transport process involves addresses ("mailboxes") and domain names in contexts in addition to the MAIL and RCPT commands and extended alternatives to them. In general, the rule is that, when RFC 2821 specifies a mailbox, UTF-8 is used for the entire string; when it specifies a domain name, the name should be in punycode form if its raw form is non-ASCII. The following subsections list and discuss all of the relevant cases. [[Note in draft: I hope]]

3.5.1 The Initial SMTP Exchange

When an SMTP or ESMTP connection is opened, the server sends a "banner" response consisting of the 220 reply code and some information. The client then sends the EHLO command. Since the client cannot know whether the server supports internationalized addresses until after it receives the response from EHLO, any domain names that appear in this dialogue, or in responses to EHLO, must be in hostname form, i.e., internationalized ones must be in punycode form.
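The handling of the two halves of a mailbox described in Sections 3.3 through 3.5.1 can be sketched as follows: the local part is isolated and left untouched, while the domain is converted to punycode form before it is used in the EHLO dialogue or looked up in the DNS. This is a minimal sketch, not a definitive implementation. The function names are hypothetical, source routes are ignored, and the conversion relies on the "idna" codec in the Python standard library, which implements the IDNA process of [RFC3490].

```python
def split_mailbox(mailbox: str):
    """Split "local-part@domain" at the last at-sign (U+0040).

    Assumption for this sketch: no source route is present. The local
    part is returned exactly as received and is never interpreted.
    """
    local_part, at_sign, domain = mailbox.rpartition("@")
    if not at_sign or not local_part or not domain:
        raise ValueError("expected a mailbox of the form local-part@domain")
    return local_part, domain


def domain_to_hostname_form(domain: str) -> str:
    """Return the punycode (ASCII) form of a possibly non-ASCII domain."""
    # The stdlib "idna" codec applies nameprep and punycode label by label;
    # all-ASCII domains pass through unchanged.
    return domain.encode("idna").decode("ascii")


if __name__ == "__main__":
    local_part, domain = split_mailbox("jörg@bücher.example")
    print(local_part)                       # the opaque local part
    print(domain_to_hostname_form(domain))  # xn--bcher-kva.example
```

Splitting at the last at-sign is a simplification that matches the extended syntax above, in which neither a sub-domain nor a non-ASCII local part may contain "@".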
3.5.2 Trace Fields

Internationalized domain names in Received fields should be transmitted in Unicode form. Addresses in "for" clauses need further examination and might be treated differently depending on whether 8BITMIME is a requirement for internationalized addresses. The reasoning in the introductory portion of Section 4 strongly suggests that these addresses be in Unicode form, rather than some specialized encoding, but a counterargument is that users do not look at Received fields and, if there is a standard encoding available that is completely interoperable and information-preserving, it should be used for both domain names and addresses (perhaps in a comment or other supplemental information).

3.6 Protocol Loose Ends

These issues should be resolved, and this section eliminated, before the document is considered complete.

3.6.1 Punycode in Domain Names?

It is not clear whether the flexibility of being able to pass domain names in punycode, as well as UTF-8, form is needed. If it is not, it should be eliminated as excess complexity.

3.6.2 Local Character Codes in Local Parts?

There are some reasons for permitting local-parts to be written in locally-used character codes, i.e., in other than the UTF-8 encoding of Unicode. It clearly increases flexibility, and the mailbox part can be defined as a simple octet string (as it essentially is in the sections above). We can reasonably expect that some systems, operating in local environments, will use local character codes no matter what we specify. On the other hand, having an application presented with an octet (or bit) string and not knowing what charset is involved would wreak havoc on any attempt to intelligently display local parts: if one cannot know the character coding being used, then it is not possible to accurately decode the characters and display appropriate character glyphs.
Use of local coding also implies an encoding for the local part different from that for the domain part -- any MTA in the path must be able to resolve the domain part into something that can be looked up in the DNS and resolved and that, in turn, requires a globally-known encoding.

3.6.3 Restrictions on Characters in Local Part?

This specification is extremely liberal about what can be included in a UTF-8 string that represents a local-part. In return, it effectively prohibits the use of quoted strings, or quoted characters, in non-ASCII local parts. Quoted strings and characters in local parts have, in general, been nothing but trouble and there appears to be no reason to carry that trouble forward into an internationalized world (and the much greater complexity that quoting in that environment might imply). There may also be a strong case for applying restrictions, e.g., by use of a stringprep [RFC3454] profile that would eliminate particularly problematic characters while not forcing, e.g., even an approximation to case-mapping (remember that ASCII local-parts are inherently case sensitive, even though local systems are encouraged to not take advantage of that feature).

3.6.4 Requirement for 8BITMIME?

This extension is carefully defined to be independent of "8BITMIME". However, given the length of time 8BITMIME has been around, the amount of deployment of it that exists, and the rather low likelihood that any MTA implementer in his or her right mind will go to the trouble of implementing this extension without also implementing 8BITMIME, it may be sensible to permit this extension only if 8BITMIME also appears.

3.6.5 Message Header and Body Issues with MTA Approach?

By viewing i18n addresses as an MTA problem, this document may not address all of the interesting 2822/MIME and MUA implementation and presentation style issues.
In particular, if both this extension and 8BITMIME are in use, is it sensible to drop the requirement for RFC 2047/2231 encoding of personal name fields? And, whether or not that requirement is dropped, is the MUA description of Section 4 adequate?

3.6.6 Variant Addresses (Aliases) in a Command Verb

A determination should be made as to whether a parameter to the MAIL and RCPT commands that would specify an alternate, ASCII-only, address is desirable and the text in Section 3.2, item 4, corrected accordingly.

3.6.7 The Received field 'for' clause

Decide what to do about the value of the "for" clause in Received fields. See Section 3.5.2.

4. Impact on the MUA and on Message Headers

In addition to the Received headers, mentioned above, there are many other places in MUAs or in user presentation in which email addresses or domain names appear. Each one, whether the conventional From/To/Cc header fields, or Message-IDs, or In-Reply-To fields that may contain addresses or domain names, or in message bodies or elsewhere, must be examined from an internationalization perspective. The user will expect to see mailbox and domain names in local characters, and to see them consistently: a situation in which an address is coded one way in a "From" field, another way in a signature line in the body of the message, and, apparently arbitrarily, in one or the other of those forms in Return-Path, Received, or reference fields, will create confusion and frustration. Variations on that problem will exist with any internationalization method, whether transport or MUA-only in structure. Perhaps, if we have to live with it for a short time as a transition activity, that is worthwhile.
But the only practical way to avoid it, in both the medium and the longer term, is to have the encodings used in transport be as nearly as possible the same as the encodings used in message headers and message bodies.

...More discussion on specific headers to be supplied in the next version...

5. Internationalization and Full Localization

Whenever one considers a new protocol, or revision of an existing one, for internationalization or other aspects of support for an improved user interface, important tradeoffs arise. These tradeoffs can be described in several ways, e.g.,

   o Simplicity versus localization capability
   o User convenience, especially within a particular area or culture, versus global interoperability
   o and so on

Maximum global interoperability is obtained by confining a protocol to a very limited number of characters, ideally ones that are easily distinguishable by people. The historical choice in this regard has been the 26 upper-case ASCII letters, plus digits, plus a very small number of special characters. It is probably no coincidence that these characters (with different, bit-minimizing, encodings) are the normal ones in early telegraphy and subsequent Telex character sets. But, as soon as users start looking at these characters, the complaints start too: text in all-upper-case is ugly, people should be able to write their names as they normally do and not in some transliterated or variant form, people should be able to communicate in their own languages using their own character sets, and so on. Ultimately, not only are the characters used in writing at issue; so are the structures for constructing, e.g., command sequences, with different preferences typically reflecting the grammatical structures of different languages.
With sufficient ingenuity, all of these requirements can be accommodated, but typically at the cost of convenient use by people outside the locality or cultural group, or of global interoperability.

Email addresses illustrate this problem at its most difficult. They are seen and used by end users and there has been little success in hiding the forms that are actually used in the protocols. Worldwide, most communication is almost certainly among people who share languages and cultural assumptions, not in situations in which global interoperability is important (and where it is important that global interoperability be convenient and very reliable). On the other hand, situations and communications that require global interoperability are still common and are commercially and intellectually important.

So the question is how far one should go. It is clearly important and sensible to accommodate local character sets, and to do so in a way that creates maximum convenience and attractive user interfaces in the long term. But, as pointed out in passing in Section 3.3, RFC 2821 still requires the ASCII at-sign character to divide the local part from the domain name. If even lexical support for the long-deprecated source routes is to be provided, comma and colon must also be supported. This implies that a mailbox name that is completely in some character script other than ASCII is impossible without further changes to the email protocols. In addition, the ordering implied by the "local-part@domain" construction, usually read in English as "local part at domain", seems quite strange and foreign in some other languages and cultures.

It is interesting that X.400 avoided this delimiter and ordering problem entirely by using Distinguished Names in which the various elements of an address were explicitly identified.
But, when Distinguished Names appear at the presentation layer or above, they appear with the various fields identified by tags which are, themselves, keywords that use a very restricted set of ASCII (actually ISO 646 or IA5) characters.

In principle at least, the protocol extensions proposed here could be further extended to specify a separator character to distinguish local part from domain name and the order in which those names occurred. For example, the MAIL and RCPT commands could be extended with parameters like

   SEPARATOR="UTF-8-character"
   ORDER-RL

to identify a form consisting of the domain name followed by the local part, separated by the designated character. But, while this would not impose particularly heavy burdens on SMTP processors, it would be a potential nightmare for users, who would have no way to accurately identify the components of an email address, at least without significant out-of-band information. In addition, going that far would almost certainly touch off the debate, again, as to whether domain names should be presented in little-endian or big-endian order -- an issue that is, again, culturally sensitive as to which one feels most natural.

It is not clear how far one should go, and the community should consider the issue very carefully.

6. Advice to Designers and Operators of Mail-receiving Systems

As discussed above, in the historical Internet email context, the interpretation and permitted syntax for an email local-part is entirely the responsibility of the receiving system. Systems can get themselves into trouble and, more particularly, can seriously restrict the number and type of users who can send mail to their users, by poor choices of format and syntax.
For example, general advice to system designers has long included "treat addresses in a case-independent fashion" and "do not use addresses that require quoting" in order to increase the odds that remote users will be able to properly compose and transmit intended addresses. In a way, that advice is an extreme generalization of the "receiver" side of the robustness principle: being generous in what one accepts implies accepting as many plausible variations of an address local-part string as possible and designing the strict forms of those strings to facilitate differentiation when it is appropriate.

As one moves toward internationalization of local parts, an expanded version of these principles is useful and may be even more appropriate, even though it is neither necessary nor desirable to turn those principles into protocol requirements. For example, a receiving host should normally consider any string that would match under nameprep rules -- or perhaps any string that would match under an expanded stringprep protocol -- as matching for local-part purposes. An even more "liberal" receiving host might use some sort of variant tables for its script(s) of interest to further expand the matching rules. But, whatever extended matching rules the local host adopts, those rules are a property of that host. Senders should continue to be conservative about what they send, and relays should continue to avoid presumptions about their understanding of the content of local-parts. Receiving systems that have reason to adopt more restricted syntax rules, or interpretations of matching, should continue to be able to do so.

7. Security considerations

Any expansion of permitted characters and encoding forms in email addresses raises the risk, however slight, of misdirected or undeliverable mail.
The problem is worsened if address information is carried in local character sets and must be converted to some standard form. Any conversion of character sets may also be problematic for digitally-signed information. Modulo those concerns, the ideas proposed here do not introduce new security issues.

8. Acknowledgements

The author acknowledges the contributions and comments of Dave Crocker in a personal conversation, and the efforts of a private discussion group, led by Paul Hoffman and Adam Costello, to develop an MUA-only solution to this problem. The author had hoped that effort would succeed, since the idea of requiring transport changes to support internationalization (or any other new function) is unattractive and should be avoided when possible. Difficulties that group has encountered in properly defining a number of boundary conditions, including appropriate delimiters for permitting internal parsing of the local part and problems with right-to-left characters and substrings, have led to the conclusion that it is time to get a specific, transport-based, approach on the table. While their ideas have inspired several of the properties of this proposal they are, of course, not responsible for the result and will probably disagree with it.

Comments from Adam Costello on the first public draft were particularly helpful, and James Seng identified some internationalization issues that had not been addressed in the previous version.

9. An Appeal

The author received a number of favorable comments on the general principles and design discussed in early drafts of this specification. He is not, however, able to continue its development as a one-person, or even one-person with occasional comments from others, basis.
In particular, he has almost no resources for developing MTA, MUA, or presentation code to test and demonstrate the concepts and details outlined above; without such resources, this approach will, inevitably, fail sooner or later. So those who consider the idea attractive should think about, and develop, ways to join with the author in design team and development efforts.

Normative References

[RFC0821] Postel, J., "Simple Mail Transfer Protocol", STD 10, RFC 821, August 1982.
[RFC1123] Braden, R., "Requirements for Internet Hosts - Application and Support", STD 3, RFC 1123, October 1989.
[RFC2279] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC 2279, January 1998.
[RFC2821] Klensin, J., "Simple Mail Transfer Protocol", RFC 2821, April 2001.
[RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003.
[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003.
[RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003.

Informative References

[I-D.hoffman-imaa] Hoffman, P. and A. Costello, "Internationalizing Mail Addresses in Applications (IMAA)", draft-hoffman-imaa-03 (work in progress), October 2003.
[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", RFC 2045, November 1996.
[RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", RFC 2046, November 1996.
[RFC2056] Denenberg, R., Kunze, J. and D. Lynch, "Uniform Resource Locators for Z39.50", RFC 2056, November 1996.
[RFC2156] Kille, S., "MIXER (Mime Internet X.400 Enhanced Relay): Mapping between X.400 and RFC 822/MIME", RFC 2156, January 1998.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP 18, RFC 2277, January 1998.
[RFC2442] Freed, N., Newman, D. and M. Hoy, "The Batch SMTP Media Type", RFC 2442, November 1998.
[RFC2476] Gellens, R. and J. Klensin, "Message Submission", RFC 2476, December 1998.
[RFC2554] Myers, J., "SMTP Service Extension for Authentication", RFC 2554, March 1999.
[RFC2556] Bradner, S., "OSI connectionless transport services on top of UDP Applicability Statement for Historic Status", RFC 2556, March 1999.
[RFC2557] Palme, J., Hopmann, A., Shelness, N. and E. Stefferud, "MIME Encapsulation of Aggregate Documents, such as HTML (MHTML)", RFC 2557, March 1999.
[RFC2822] Resnick, P., "Internet Message Format", RFC 2822, April 2001.
[RFC3192] Allocchio, C., "Minimal FAX address format in Internet Mail", RFC 3192, October 2001.
[RFC3454] Hoffman, P. and M. Blanchet, "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002.

Author's Address

   John C Klensin
   1770 Massachusetts Ave, #322
   Cambridge, MA 02140
   USA

   Phone: +1 617 491 5735
   EMail: john-ietf@jck.com

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11.
Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assignees.
This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgment

Funding for the RFC Editor function is currently provided by the Internet Society.