Internet Engineering Task Force                                 N. Teint
Internet-Draft                                            March 23, 2010
Intended status: Experimental
Expires: September 24, 2010

Extending Internationalised Domain Names in Applications to Other Protocols (X-IDNA)
draft-teint-xidna-base-00

Abstract

Prior to Internationalised Domain Names in Applications (IDNA), there was no standard method for domains, names, addresses and similar identifiers to use characters outside the ASCII repertoire. This still applies to many identifiers that are not domain names, such as email addresses (local-part), newsgroup names, etc.

This document extends the mechanism defined in IDNA to other protocols and their identifiers. As with IDNA, these identifiers may be drawn from a large repertoire (Unicode) and are mapped to backward-compatible identifiers using only ASCII characters.

For valid domain names, X-IDNA produces the same encoding as IDNA, even when these domain names are embedded in other addresses.

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on September 24, 2010.

Teint Expires September 24, 2010 [Page 1]
Internet-Draft Extending IDNA to Other Protocols March 2010

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors.
All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the BSD License.

Table of Contents

   1.  Introduction
       1.1.  Overview
       1.2.  Rationale
       1.3.  Requirements Language
       1.4.  IDNA 2008
   2.  Definitions
       2.1.  Addresses, Normalised Addresses and Labels
       2.2.  ACE Prefix
       2.3.  Address slots
       2.4.  Characters
   3.  Requirements and Applicability
       3.1.  Requirements
       3.2.  Applicability and X-IDNA Profiles
   4.  Address Conversion
       4.1.  Address Input
       4.2.  Conversion To Unicode
       4.3.  Address Normalisation
       4.4.  Unicode Normalisation
       4.5.  Extraction of Labels and Delimiters
       4.6.  A-Label Input
       4.7.  Validation and Character List Testing
       4.8.  Punycode Conversion
       4.9.  Re-Assembly
   5.  Address Validation
       5.1.  Label Validation
             5.1.1.  Hyphen Restrictions
             5.1.2.  Leading Combining Marks
             5.1.3.  Contextual Rules
             5.1.4.  Labels Containing Characters Written Right to Left
             5.1.5.  Successful Punycode Encoding
       5.2.  Other Syntax Restrictions
       5.3.  Local Restrictions
   6.  Acknowledgements
   7.  IANA Considerations
   8.  Security Considerations
   9.  References
       9.1.  Normative References
       9.2.  Informative References
   Author's Address

1.  Introduction

1.1.  Overview

X-IDNA works by extracting anything from the address that fits the syntax of a valid domain name "label", i.e. strings that roughly match the "LDH" syntax for "A-labels" and "U-labels". These extracted, putative labels are then put through a conversion the normative part of which is identical to the normative part of IDNA2008. The characters that do not form labels, the separators, are solely drawn from the ASCII repertoire (potentially mapped from Unicode lookalikes) and thus need no internationalisation.
Special processing, called address normalisation, ensures that addresses considered equivalent in a protocol that allows arbitrary "quoting" or "escaping" produce the same "labels". X-IDNA Profiles state to which (part of) address specifications X-IDNA is applied and what steps have to be taken for address normalisation.

1.2.  Rationale

Unlike other methods for address internationalisation (such as allowing UTF-8), X-IDNA, like IDNA, allows the graceful introduction of internationalised addresses not only by avoiding upgrades to existing infrastructure (such as DNS servers and mail transport agents), but also by allowing some limited use of internationalised addresses in applications by using the ASCII-encoded representation of the labels containing non-ASCII characters. While such names are user-unfriendly to read and type, and hence not optimal for user input, they can be used as a last resort to allow rudimentary usage of internationalised addresses. For example, they might be the best choice for display if it were known that relevant fonts were not available on the user's computer.

For protocols that have been extended to allow Unicode addresses to be used directly, X-IDNA also provides a way to "downgrade" the addresses that does not require lookups in a database or transmission of alternative ASCII addresses.

When strings covered by one profile of X-IDNA end up in places that are covered by a different profile, it does not matter whether the labels are converted to A-Labels first and then put into the other string or vice-versa, provided that due care has been taken in defining these profiles.

The same is true for IDNA and X-IDNA as long as the domain name is valid, i.e. as long as it consists entirely of LDH-Labels.
For example, if a valid domain name is put into the local-part of an email address, the conversion of the domain's U-Labels to A-Labels will be identical in IDNA and X-IDNA.

This property is best demonstrated by the examples in [I-D.teint-xidna-zonefile].

1.3.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

1.4.  IDNA 2008

As X-IDNA essentially piggy-backs on IDNA 2008, the reader ought to be familiar with the following specifications: [I-D.ietf-idnabis-defs], [I-D.ietf-idnabis-protocol], [I-D.ietf-idnabis-mappings], [I-D.ietf-idnabis-tables], [I-D.ietf-idnabis-bidi] and [I-D.ietf-idnabis-rationale].

2.  Definitions

2.1.  Addresses, Normalised Addresses and Labels

An "address" is defined in this document to be a protocol element used in addressing network resources, such as domain names, names, addresses and similar identifiers. An address is in a protocol-dependent format. An X-IDNA profile is required to specify which protocol elements constitute an address and how to map addresses to normalised addresses.

For the purposes of this specification, the definition of an address need not contain the complete "address field" defined in the other protocol. For example, a protocol that has strings consisting of locally-assigned names, domain names and numeric addresses may specify that the locally-assigned names are covered by X-IDNA whereas the domain names are covered by IDNA and the numeric addresses are not internationalised.
A different protocol that allows address specifications containing both strings used for resource identification and free-form text intended for human consumption may specify that X-IDNA applies to the resource identification part of the address specification whereas the human-readable text is covered by a different internationalisation protocol.

A "normalised address" is defined in this document to be an address that is in a format suitable as input to the generic X-IDNA protocol defined in this document. A normalised address consists of one or more labels, each separated by one or more separators. It is produced in the normalisation steps (see Section 4.3 to Section 4.4) of the X-IDNA protocol defined in this document. For the purposes of this document, a normalised address need not be suitable for equivalence comparison.

A "label" is defined in this document to be a part of a normalised address that does not contain a separator. It is produced in the label extraction step (see Section 4.5) of the X-IDNA protocol defined in this document. This definition of "label" is a generalisation of that found in [I-D.ietf-idnabis-defs]: the term is applied to strings that are not part of a domain name.

The definitions of "A-Label", "fake A-Label", "U-Label", "LDH-Label", "R-LDH-Label" and "NR-LDH-Label" are taken from [I-D.ietf-idnabis-defs] but applied to strings that are not part of a domain name.

A "separator" is a (base) character that appears between labels and is passed through the X-IDNA process as-is. Separators are also extracted in the label extraction step (see Section 4.5) of the X-IDNA protocol defined in this document.

An "ASCII address" is an address that consists entirely of base characters. It can contain A-Labels, fake A-Labels, R-LDH-Labels and NR-LDH-Labels and separators.

A "Unicode address" may consist of both base and extended characters.
It can contain A-Labels, fake A-Labels, R-LDH-Labels, NR-LDH-Labels, U-Labels and separators.

An "internationalised address" is either an ASCII address or a Unicode address.

2.2.  ACE Prefix

The "ACE prefix" is defined in this document to be the string of ASCII characters "xn--" that appears at the beginning of every A-Label. "ACE" stands for "ASCII-Compatible Encoding".

2.3.  Address slots

An "address slot" is defined in this document to be a protocol element or a function argument or a return value (and so on) explicitly designated for carrying an "address". Examples of address slots include the email address in the parameter to the SMTP MAIL or RCPT commands or in the "From:" field of an email message header; the newsgroup name appearing in Netnews; and the domain name in the QNAME field of a DNS query. A string that has the syntax of an address but that appears in general text is not in an address slot. For example, an email address appearing in the plain text body of an email message is not occupying an address slot.

An "X-IDNA-aware address slot" is defined to be an address slot explicitly designated for carrying an internationalised address as defined in profiles based on this document. The designation may be static (for example, in the specification of the protocol or interface) or dynamic (for example, as a result of negotiation in an interactive session).

An "X-IDNA-unaware address slot" is defined for this set of documents to be any address slot that is not an X-IDNA-aware address slot. Obviously, this includes any address slot whose specification predates X-IDNA or a profile defined for these types of addresses. For domain names, it includes any domain name slot (as defined in [I-D.ietf-idnabis-defs], Section 2.3.2.6) that predates IDNA.

These definitions are generalisations of the "domain name slots" defined in [I-D.ietf-idnabis-defs], Section 2.3.2.6.
2.4.  Characters

A "base character" is a Unicode character in the range U+0000..U+007F. (These Unicode characters correspond to the "ASCII" character set and are also known as "ASCII characters".)

An "extended character" is a Unicode character that is not a base character, i.e. a character in the range U+0080 up to the maximum Unicode codepoint (U+10FFFF as of [Unicode]).

3.  Requirements and Applicability

3.1.  Requirements

X-IDNA makes the following requirements:

1.  Whenever an address is put into an X-IDNA-unaware address slot (see Section 2.3), it MUST contain only base characters and follow the syntax mandated by the specification defining the address type. If the address type is case-sensitive, any A-Label within the address MUST be converted to lowercase.

2.  A-Labels and U-Labels (see Section 2.1) within an address MUST be compared using equivalent forms: either both A-Label forms or both U-Label forms. Because A-Labels and U-Labels can be transformed into each other without loss of information, these comparisons are equivalent. A pair of A-Labels MUST be compared as case-insensitive ASCII. U-Labels MUST be compared as-is, without case-folding or other intermediate steps. Note that it is not necessary to validate labels in order to compare them and that successful comparison does not imply validity. In many cases, not limited to comparison, validation may be important for other reasons and SHOULD be performed.

    Separators and labels that are neither A-Labels nor U-Labels (e.g., fake A-Labels and NR-LDH-Labels) MUST NOT match any A-Label or U-Label; other than that, X-IDNA does not make requirements for these (e.g., NR-LDH-Labels may be compared with or without case-folding, depending on the address slot type).

3.  When addresses are not validated, they MUST conform to the requirements of Section 4. When addresses are validated, they MUST conform to the requirements of Section 5.

3.2.  Applicability and X-IDNA Profiles

The application of X-IDNA to any type of address depends on an additional specification that provides the details. These other specifications are referred to as X-IDNA Profiles.

Each definition of an X-IDNA Profile MUST include all of the following:

1.  The protocol or protocols to which X-IDNA is applied and the syntax elements of these protocols that represent an X-IDNA-unaware address slot.

2.  The steps necessary to map any syntactically valid address to a normalised address (as described in Section 4.3).

3.  The parts of the protocol that require checking the address for validity (as described in Section 5).

A specification MAY also include:

o  Syntax extensions for X-IDNA-aware address slots.

o  Methods for interoperation with other ways of address internationalisation.

4.  Address Conversion

Before an Internationalised Address is put into an X-IDNA-unaware slot, it MUST be converted to an ASCII Address using the following procedure.

Although some validity checks are necessary to avoid serious problems with the protocol, the tests are permissive and rely on the assumption that names that can be successfully used are valid. That assumption is, however, a weak one because the presence of wild cards in the receiving system might cause a string that has not been explicitly defined and validated to be successfully used as an address.

This procedure is a generalisation of the Domain Name Lookup Protocol defined in [I-D.ietf-idnabis-protocol], Section 5.

4.1.  Address Input

The user supplies a string in the local character set, for example by typing it or clicking on, or copying and pasting, a resource identifier, e.g., a URI ([RFC3986]) or IRI ([RFC3987]) from which the address is extracted. Alternatively, some process not directly involving the user may read the string from a file or obtain it in some other way. Processing in this step and that specified in Section 4.2, Section 4.3 and Section 4.4 are local matters, to be accomplished prior to actual invocation of X-IDNA.

4.2.  Conversion To Unicode

The string is converted from the local character set into Unicode, if it is not already in Unicode. Depending on local needs, this conversion may involve mapping some characters into other characters as well as coding conversions.

This section defines a general algorithm that applications ought to implement in order to produce Unicode code points that will be valid under the IDNA protocol. An application might implement the full mapping as described below, or can choose a different mapping. In fact, an application might want to implement a full mapping that is substantially compatible with the original IDNA protocol instead of the algorithm given here.

The general algorithm that an application (or the input method provided by an operating system) ought to use is relatively straightforward:

1.  Upper case characters are mapped to their lower case equivalents by using the algorithm for mapping case in Unicode characters.

2.  Full-width and half-width characters (those defined with Decomposition Types <wide> and <narrow>) are mapped to their decomposition mappings as shown in the Unicode character database.

3.  The application can also choose to map some or all of the following separator characters to ASCII characters which are roughly equivalent:

    *  mapped to U+0020 (SPACE):
       +  all characters having a canonical or compatibility decomposition to U+0020

    *  mapped to U+0022 (QUOTATION MARK):
       +  U+201C (LEFT DOUBLE QUOTATION MARK)
       +  U+201D (RIGHT DOUBLE QUOTATION MARK)
       +  U+201E (DOUBLE LOW-9 QUOTATION MARK)
       +  U+201F (DOUBLE HIGH-REVERSED-9 QUOTATION MARK)
       +  U+2033 (DOUBLE PRIME)
       +  U+301D (REVERSED DOUBLE PRIME QUOTATION MARK)
       +  U+301E (DOUBLE PRIME QUOTATION MARK)
       +  U+301F (LOW DOUBLE PRIME QUOTATION MARK)

    *  mapped to U+0027 (APOSTROPHE):
       +  U+2018 (LEFT SINGLE QUOTATION MARK)
       +  U+2019 (RIGHT SINGLE QUOTATION MARK)
       +  U+201B (SINGLE HIGH-REVERSED-9 QUOTATION MARK)
       +  U+2032 (PRIME)

    *  mapped to U+002C (COMMA):
       +  U+201A (SINGLE LOW-9 QUOTATION MARK)
       +  U+3001 (IDEOGRAPHIC COMMA)

    *  mapped to U+002E (FULL STOP):
       +  U+3002 (IDEOGRAPHIC FULL STOP)

    *  mapped to U+003C (LESS-THAN SIGN):
       +  U+2039 (SINGLE LEFT-POINTING ANGLE QUOTATION MARK)

    *  mapped to U+003E (GREATER-THAN SIGN):
       +  U+203A (SINGLE RIGHT-POINTING ANGLE QUOTATION MARK)

    *  mapped to U+007C (VERTICAL LINE):
       +  U+00A6 (BROKEN BAR)

    *  mapped to U+007E (TILDE):
       +  U+301C (WAVE DASH)

Unicode Normalisation ought to be deferred until after the Address Normalisation defined in the following section.

4.3.  Address Normalisation

The string is then normalised as specified in the X-IDNA Profile applicable for the type of address slot into which the string is intended to be put. This step maps a syntactically valid address to a normalised address, from which labels can be extracted. It is defined by the X-IDNA Profile.
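As an illustration only, the optional separator mapping of Section 4.2, step 3, could be sketched in Python as follows. The names SEPARATOR_MAP, maps_to_space and map_separators are hypothetical and not defined by X-IDNA. The sketch also assumes that "decomposition to U+0020" means that the full compatibility decomposition (NFKD) of the character is exactly a single SPACE; characters that decompose to SPACE plus a combining mark are left alone under this reading.

```python
import unicodedata

# Lookalike separators and their ASCII targets, copied from the list in
# Section 4.2, step 3. The table name is illustrative only.
SEPARATOR_MAP = {
    "\u201C": '"', "\u201D": '"', "\u201E": '"', "\u201F": '"',
    "\u2033": '"', "\u301D": '"', "\u301E": '"', "\u301F": '"',
    "\u2018": "'", "\u2019": "'", "\u201B": "'", "\u2032": "'",
    "\u201A": ",", "\u3001": ",",
    "\u3002": ".",
    "\u2039": "<",
    "\u203A": ">",
    "\u00A6": "|",
    "\u301C": "~",
}


def maps_to_space(ch):
    """Assumed reading: the NFKD form of ch is exactly one U+0020."""
    return ch != " " and unicodedata.normalize("NFKD", ch) == " "


def map_separators(text):
    """Replace lookalike separators by their rough ASCII equivalents."""
    out = []
    for ch in text:
        if ch in SEPARATOR_MAP:
            out.append(SEPARATOR_MAP[ch])
        elif maps_to_space(ch):
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)
```

Because the mapping is optional, an application that applies only part of the table would simply drop entries from SEPARATOR_MAP; extended characters that are not listed pass through unchanged.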
The rules for normalisations defined by X-IDNA Profiles are:

o  If the protocol to which the X-IDNA profile applies considers two addresses as equivalent, the normalisation procedure MUST produce strings that yield exactly the same labels in the following step, the label extraction. (However, the X-IDNA Profile may specify a normalisation that does not produce output directly suitable for equivalence comparison.)

o  The runs of label characters produced by the normalisation SHOULD be as long as possible. In particular, the normalisation SHOULD treat extended characters as "unreserved", "text", etc. It SHOULD NOT mechanically insert "quote" or "escape" characters between characters that can be part of a label.

   These two requirements are usually met if the normalisation removes all optional separators ("quotes" or "escape" characters), especially those appearing between characters that can be part of a label. For example, if a protocol allows all characters to be "quoted" by preceding them with an optional U+002F (SOLIDUS), this character needs to be removed if it is followed by any letter, digit, hyphen or extended character; if a protocol allows non-significant QUOTATION MARKs (U+0022) anywhere in the string, these need to be removed if they appear within a label; and so on.

o  Extended characters SHOULD be either mapped to ASCII characters or treated as belonging to the character class that needs no "quoting" or "escaping" otherwise defined by the protocol.

   This requires special care for extended characters in implementations. Legacy code may mechanically escape or quote them because they have traditionally not been in the category defined as "unreserved", "text", etc. For example, an extended character may be assigned to the character class "atext" defined in [RFC5322] for the purposes of the normalisation.
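The SOLIDUS example given in the rules above can be sketched as follows. This is a minimal illustration for a purely hypothetical protocol in which U+002F optionally quotes the character that follows it; the names QUOTE, is_label_char and normalise_address are assumptions and not part of any X-IDNA Profile.

```python
QUOTE = "/"  # hypothetical quoting character (U+002F SOLIDUS)


def is_label_char(ch):
    """Letters, digits, hyphen, or any extended character (U+0080 and up)."""
    return ord(ch) >= 0x80 or ch.isalnum() or ch == "-"


def normalise_address(address):
    """Drop quoting characters that precede a potential label character.

    A quote in front of any other base character is kept, together with
    the character it quotes, because it may be significant there.
    """
    out = []
    i = 0
    while i < len(address):
        ch = address[i]
        if ch == QUOTE and i + 1 < len(address):
            quoted = address[i + 1]
            # Consume the quote and the quoted character as one unit so
            # that a quoted SOLIDUS is never mistaken for a new quote.
            out.append(quoted if is_label_char(quoted) else QUOTE + quoted)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

With this sketch, the equivalent spellings "b/ü/cher" and "bücher" both normalise to "bücher" and therefore produce the same label in the extraction step, which is what the first rule requires.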
An application MAY also choose to map invalid addresses to syntactically valid addresses, for example by "escaping" or "quoting" problematic base characters; or it MAY reject invalid addresses in this step.

4.4.  Unicode Normalisation

Depending on local needs, the string can then be mapped using Unicode Normalization Form C (NFC).

4.5.  Extraction of Labels and Delimiters

The normalised address is then split into putative labels and separators. The labels extracted in this step have not been checked for conformance and are therefore referred to as "putative".

The putative labels will start with a letter, digit or extended character, which may be followed by a string that consists of zero or more letters, digits, extended characters or the HYPHEN-MINUS, and ends in a letter, digit or extended character. The delimiters will be base characters that are not digits or letters.

The label extraction differs significantly from the procedure specified in [I-D.ietf-idnabis-mappings] by defining a number of separators in addition to U+002E (FULL STOP).

The purpose of this procedure is to ensure that all labels are IDNA-valid (as defined in [I-D.ietf-idnabis-defs]) regardless of the range of non-LDH characters allowed in the address.

o  The following Unicode codepoints are always part of a label:

   *  U+0030 to U+0039 (DIGIT ZERO to DIGIT NINE)
   *  U+0041 to U+005A (LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z)
   *  U+0061 to U+007A (LATIN SMALL LETTER A to LATIN SMALL LETTER Z)
   *  all codepoints from U+0080 upwards

   That is, all ASCII digits and letters and all non-ASCII codepoints.
o  The following Unicode codepoints are always separators:

   *  U+0000 to U+002C (NULL to COMMA)
   *  U+002E to U+002F (FULL STOP to SOLIDUS)
   *  U+003A to U+0040 (COLON to COMMERCIAL AT)
   *  U+005B to U+0060 (LEFT SQUARE BRACKET to GRAVE ACCENT)
   *  U+007B to U+007F (LEFT CURLY BRACKET to DELETE)

   That is, all ASCII codepoints except for the hyphen, digits and letters.

o  The following Unicode codepoint is part of a label or a separator, depending on context:

   *  U+002D (HYPHEN-MINUS)

   A sequence of one or more HYPHEN-MINUS characters is composed of separators exactly if the run

   *  appears at the beginning or end of the address, OR
   *  is preceded or followed by a separator.

   That is, hyphens that would start or end a label (and thus violate the syntax for LDH-Labels) are separators; other hyphens are part of a label.

4.6.  A-Label Input

If a putative label extracted in the previous step appears to be an A-Label (i.e., it starts with "xn--", interpreted case-insensitively, and does not contain extended characters), the application MAY attempt to convert it to a U-Label, first ensuring that the A-Label is entirely in lower case (converting it to lower case if necessary), and apply the tests of Section 4.7 and the conversion of Section 4.8 to that form. If the label is converted to Unicode (i.e., to U-Label form) using the Punycode decoding algorithm, then the processing specified in those two sections MUST be performed. If any of these steps fails or rejects the label (i.e., it is a fake A-Label), the putative label MUST be used as-is.

o  the original MUST be used if the X-IDNA profile allows R-LDH-Labels, and

o  it MAY be rejected if the X-IDNA profile does not allow R-LDH-Labels.

If a label consists entirely of base characters, the following two steps (Section 4.7 and Section 4.8) are skipped for this label.

4.7.  Validation and Character List Testing

The putative labels extracted in the previous step are checked to verify that all characters that appear in them are valid as input to X-IDNA processing. As discussed above, this check is liberal in order to allow for compatibility with future extensions.

Putative labels with any of the following characteristics MUST be rejected in this step:

o  Labels that are not in NFC ([UAX15])

o  Labels containing "--" (two consecutive hyphens) in the third and fourth character positions

o  Labels whose first character is a combining mark (see Section 2.11 of [Unicode])

o  Labels containing prohibited code points, i.e., those that are assigned to the "DISALLOWED" category of [I-D.ietf-idnabis-tables]

o  Labels containing code points that are identified in [I-D.ietf-idnabis-tables] as "CONTEXTJ", i.e., requiring exceptional contextual rule processing on lookup, but that do not conform to those rules. Note that this implies that a rule needs to be defined, not null: a character that requires a contextual rule but for which the rule is null is treated in this step as having failed to conform to the rule.

o  Labels containing code points that are identified in [I-D.ietf-idnabis-tables] as "CONTEXTO", but for which no such rule appears in the table of rules.
   Applications resolving DNS names or carrying out equivalent operations are not required to test contextual rules for "CONTEXTO" characters, only to verify that a rule is defined (although they MAY make such tests to provide better protection or give better information to the user).

o  Labels containing code points that are unassigned in the version of Unicode being used by the application, i.e., in the "UNASSIGNED" category of [I-D.ietf-idnabis-tables].

   This requirement means that the application needs to use a list of unassigned characters that is matched to the version of Unicode that is being used for the other requirements in this section. It is not required that the application know which version of Unicode is being used; that information might be part of the operating environment in which the application is running.

In addition, the application SHOULD apply the following test:

o  Verification that the string is compliant with the requirements for right to left characters, specified in [I-D.ietf-idnabis-bidi].

This test may be omitted in special circumstances, such as when the lookup application knows that the conditions are enforced elsewhere, because an attempt to use such strings as an address will almost certainly lead to an error except when wild cards are present on a receiving system. However, applying the test is likely to give an earlier detection and much better information about the reason for a failure -- information that may be usefully passed to the user when that is feasible -- than later failure alone.

For all other strings, the lookup application MUST rely on the protocol using the address to determine the validity of the address and the characters it contains. If they can successfully be used, they are presumed to be valid; if they are not, their possible validity is not relevant.
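As an illustration only, the label extraction of Section 4.5, a subset of the tests of this section, and the conversion and re-assembly of Sections 4.8 and 4.9 could be sketched in Python as follows. All function names are hypothetical. The sketch assumes that the input has already been mapped and normalised (Sections 4.2 to 4.4); it omits the DISALLOWED, CONTEXTJ, CONTEXTO and UNASSIGNED tests and the right-to-left test, because they need the tables of [I-D.ietf-idnabis-tables] and [I-D.ietf-idnabis-bidi], and it does not attempt the optional A-Label decoding of Section 4.6. It relies on the "punycode" codec of the Python standard library.

```python
import unicodedata

ACE_PREFIX = "xn--"


def is_label_char(ch):
    """ASCII letters and digits, and every codepoint from U+0080 upwards."""
    return ord(ch) >= 0x80 or ch.isalnum()


def split_address(address):
    """Split a normalised address into ("label", s) and ("sep", s) runs."""
    n = len(address)
    in_label = []  # one flag per character of the address
    i = 0
    while i < n:
        if address[i] != "-":
            in_label.append(is_label_char(address[i]))
            i += 1
            continue
        j = i
        while j < n and address[j] == "-":
            j += 1
        # A hyphen run belongs to a label only if it has label characters
        # on both sides; otherwise every hyphen in the run is a separator.
        inner = (i > 0 and j < n and is_label_char(address[i - 1])
                 and is_label_char(address[j]))
        in_label.extend([inner] * (j - i))
        i = j
    tokens = []
    for ch, flag in zip(address, in_label):
        kind = "label" if flag else "sep"
        if tokens and tokens[-1][0] == kind:
            tokens[-1] = (kind, tokens[-1][1] + ch)
        else:
            tokens.append((kind, ch))
    return tokens


def check_label(label):
    """Apply only those tests of this section that need no IDNA tables."""
    if unicodedata.normalize("NFC", label) != label:
        raise ValueError("label is not in NFC")
    if label[2:4] == "--":
        raise ValueError("'--' in third and fourth character positions")
    if unicodedata.category(label[0]).startswith("M"):
        raise ValueError("label starts with a combining mark")


def to_ascii_address(address):
    """Convert labels with extended characters and re-assemble the address."""
    out = []
    for kind, text in split_address(address):
        # Labels made up of base characters only skip both steps.
        if kind == "label" and any(ord(c) >= 0x80 for c in text):
            check_label(text)
            text = ACE_PREFIX + text.encode("punycode").decode("ascii")
        out.append(text)
    return "".join(out)
```

For example, to_ascii_address("bücher@münchen.example") returns "xn--bcher-kva@xn--mnchen-3ya.example": the separators "@" and "." pass through unchanged, and the label "example" is left alone because it contains only base characters.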
While an application may reasonably issue warnings about strings it believes may be problematic, applications that decline to process a string that conforms to the rules above (i.e., do not allow putting it into an address slot) are not in conformance with this protocol.

4.8.  Punycode Conversion

The string that has now been validated is converted to ACE form by applying the Punycode algorithm to the string and then adding the ACE prefix.

4.9.  Re-Assembly

The A-Labels resulting from the conversion in Section 4.8 or supplied directly (see Section 4.6) are combined with the delimiters (see Section 4.5), in the original order, to form an ASCII Address (A-Address).

The ASCII Address can then be put into an X-IDNA-unaware address slot and be used as a normal address. The use of this address can obviously either succeed or fail (resulting in a lookup failure, bounce message, etc.).

5.  Address Validation

Whenever an X-IDNA Profile mandates that the addresses be validated, the following procedure MUST be followed.

Addresses ought to be validated whenever an address is defined, registered, etc. An X-IDNA Profile defines when and by whom addresses are validated.

5.1.  Label Validation

In order to validate individual labels embedded in the address, the address is normalised as specified in Section 4.3 and then the labels are extracted as specified in Section 4.5. The following validation steps apply to the extracted labels.

The labels (in the form of a Unicode string, i.e., a string that at least superficially appears to be a U-label) are then examined, performing tests that require examination of more than one character. Character order is considered to be the on-the-wire order, not the display order.

5.1.1.  Hyphen Restrictions

If the Unicode string contains non-base characters, it MUST NOT contain "--" (two consecutive HYPHEN-MINUS characters) in the third and fourth character positions.
Note: The Unicode string will not start or end with a "-" (HYPHEN-MINUS) at this point.

5.1.2.  Leading Combining Marks

The Unicode string MUST NOT begin with a combining mark or combining character (see Section 2.11 of [Unicode] for an exact definition).

5.1.3.  Contextual Rules

The Unicode string MUST NOT contain any characters whose validity is context-dependent, unless the validity is positively confirmed by a contextual rule. To check this, each code-point marked as CONTEXTJ or CONTEXTO in [I-D.ietf-idnabis-tables] MUST have a non-null rule. If such a code-point is missing a rule, it is invalid. If the rule exists but the result of applying the rule is negative or inconclusive, the proposed label is invalid.

5.1.4.  Labels Containing Characters Written Right to Left

If the proposed label contains any characters that are written from right to left, it MUST meet the BIDI criteria [I-D.ietf-idnabis-bidi].

5.1.5.  Successful Punycode Encoding

The Unicode string MUST be convertible to ACE form using the Punycode algorithm ([RFC3492]), i.e., it MUST NOT cause an overflow.

5.2.  Other Syntax Restrictions

In A-Address form, the address MUST conform to the syntax defined by the X-IDNA-unaware specification for the address type. This MAY include length restrictions, syntax restrictions regarding separators, etc.

5.3.  Local Restrictions

In addition to the rules and tests above, there are many reasons why a site, registry, or administrator could reject an address. The responsible entity is expected to establish or follow policies about addresses it wishes to define or register. Policies are likely to be informed by the local languages and the scripts that are used to write them and may depend on many factors including what characters are in the label (for example, an address may be rejected based on other addresses already registered).
The same considerations as for IDNA registrations apply; see Section 3.2 of [I-D.ietf-idnabis-rationale] for a discussion and recommendations about registry policies.

While X-IDNA, unlike IDNA 2008, allows addresses that contain fake A-Labels and R-LDH-Labels, responsible entities ought to avoid such addresses except when backwards-compatibility requires them.

6. Acknowledgements

The larger part of this specification was lifted directly from IDNA 2008; it would not have been possible without the excellent work that went into those specifications.

7. IANA Considerations

This memo includes no request to IANA. Instead, the definition of an X-IDNA Profile ought to be coordinated with the entity that controls the specification of the address slot type to which X-IDNA is applied.

8. Security Considerations

X-IDNA shares the Security Considerations for IDNA, which are described in [I-D.ietf-idnabis-defs], except for the special issues associated with right to left scripts and characters. The latter are discussed in [I-D.ietf-idnabis-bidi].

In addition, each X-IDNA Profile will require additional Security Considerations, which MUST be discussed in the document defining the Profile.

9. References

9.1. Normative References

[I-D.ietf-idnabis-bidi]  Alvestrand, H. and C. Karp, "Right-to-left scripts for IDNA", draft-ietf-idnabis-bidi-07 (work in progress), January 2010.

[I-D.ietf-idnabis-defs]  Klensin, J., "Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework", draft-ietf-idnabis-defs-13 (work in progress), January 2010.
[I-D.ietf-idnabis-mappings]  Resnick, P. and P. Hoffman, "Mapping Characters in IDNA", draft-ietf-idnabis-mappings-05 (work in progress), October 2009.

[I-D.ietf-idnabis-protocol]  Klensin, J., "Internationalized Domain Names in Applications (IDNA): Protocol", draft-ietf-idnabis-protocol-18 (work in progress), January 2010.

[I-D.ietf-idnabis-rationale]  Klensin, J., "Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale", draft-ietf-idnabis-rationale-17 (work in progress), January 2010.

[I-D.ietf-idnabis-tables]  Faltstrom, P., "The Unicode code points and IDNA", draft-ietf-idnabis-tables-09 (work in progress), January 2010.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003.

[UAX15]  Davis, M., Whistler, K., and M. Duerst, "Unicode Normalization Forms, Revision 31", UAX #15, September 2009.

[Unicode]  Unicode Consortium, "Unicode Standard, Version 5.2", December 2009.

9.2. Informative References

[I-D.teint-xidna-zonefile]  Teint, N., "An X-IDNA Profile for DNS Zone Master Files", draft-teint-xidna-zonefile-00 (work in progress), March 2010.

[RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, January 2005.

[RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource Identifiers (IRIs)", RFC 3987, January 2005.

Author's Address

Nick Teint

Email: nick.teint@googlemail.com