Internet Engineering Task Force                                 N. Teint
Internet-Draft                                            March 23, 2010
Intended status: Experimental
Expires: September 24, 2010


   Extending Internationalised Domain Names in Applications to Other
                           Protocols (X-IDNA)
                       draft-teint-xidna-base-00

Abstract

   Prior to Internationalised Domain Names in Applications (IDNA), there
   was no standard method for domains, names, addresses and similar
   identifiers to use characters outside the ASCII repertoire.  This
   still applies to many identifiers that are not domain names, such as
   email addresses (local-part), newsgroup names, etc.

   This document extends the mechanism defined in IDNA to other
   protocols and their identifiers.  As with IDNA, these identifiers may
   be drawn from a large repertoire (Unicode) and are mapped to
   backward-compatible identifiers using only ASCII characters.

   For valid domain names, X-IDNA produces the same encoding as IDNA,
   even when these domain names are embedded in other addresses.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 24, 2010.


Teint                  Expires September 24, 2010               [Page 1]

Internet-Draft      Extending IDNA to Other Protocols         March 2010


Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the BSD License.


Table of Contents

   1.  Introduction
     1.1.  Overview
     1.2.  Rationale
     1.3.  Requirements Language
     1.4.  IDNA 2008
   2.  Definitions
     2.1.  Addresses, Normalised Addresses and Labels
     2.2.  ACE Prefix
     2.3.  Address slots
     2.4.  Characters
   3.  Requirements and Applicability
     3.1.  Requirements
     3.2.  Applicability and X-IDNA Profiles
   4.  Address Conversion
     4.1.  Address Input
     4.2.  Conversion To Unicode
     4.3.  Address Normalisation
     4.4.  Unicode Normalisation
     4.5.  Extraction of Labels and Delimiters
     4.6.  A-Label Input
     4.7.  Validation and Character List Testing
     4.8.  Punycode Conversion
     4.9.  Re-Assembly
   5.  Address Validation
     5.1.  Label Validation
       5.1.1.  Hyphen Restrictions
       5.1.2.  Leading Combining Marks
       5.1.3.  Contextual Rules
       5.1.4.  Labels Containing Characters Written Right to Left
       5.1.5.  Successful Punycode Encoding
     5.2.  Other Syntax Restrictions
     5.3.  Local Restrictions
   6.  Acknowledgements
   7.  IANA Considerations
   8.  Security Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Author's Address


1.  Introduction

1.1.  Overview

   X-IDNA works by extracting from the address every string that fits
   the syntax of a valid domain name "label", i.e. strings that roughly
   match the "LDH" syntax for "A-labels" and "U-labels".
   These extracted, putative labels are then put through a conversion
   whose normative part is identical to the normative part of IDNA2008.

   The characters that do not form labels, the separators, are solely
   drawn from the ASCII repertoire (potentially mapped from Unicode
   lookalikes) and thus need no internationalisation.

   Special processing, called address normalisation, ensures that
   addresses considered equivalent in a protocol that allows arbitrary
   "quoting" or "escaping" produce the same "labels".

   X-IDNA Profiles state to which (part of) address specifications
   X-IDNA is applied and what steps have to be taken for address
   normalisation.

1.2.  Rationale

   Unlike other methods for address internationalisation (such as
   allowing UTF-8), X-IDNA, like IDNA, allows the graceful
   introduction of internationalised addresses not only by avoiding
   upgrades to existing infrastructure (such as DNS servers and mail
   transport agents), but also by allowing some limited use of
   internationalised addresses in applications by using the ASCII-
   encoded representation of the labels containing non-ASCII characters.
   While such names are user-unfriendly to read and type, and hence not
   optimal for user input, they can be used as a last resort to allow
   rudimentary usage of internationalised addresses.  For example, they
   might be the best choice for display if it were known that relevant
   fonts were not available on the user's computer.

   For protocols that have been extended to allow Unicode addresses to
   be used directly, X-IDNA also provides a way to "downgrade" the
   addresses that does not require lookups in a database or transmission
   of alternative ASCII addresses.

   When strings covered by one profile of X-IDNA end up in places that
   are covered by a different profile, it does not matter whether the
   labels are converted to A-Labels first and then put into the other
   string or vice versa, provided that due care has been taken in
   defining these profiles.

   The same is true for IDNA and X-IDNA as long as the domain name is
   valid, i.e. as long as it consists entirely of LDH-Labels.  For
   example, if a valid domain name is put into the local-part of an
   email address, the conversion of the domain's U-Labels to A-Labels
   will be identical in IDNA and X-IDNA.

   This property is best demonstrated by the examples in
   [I-D.teint-xidna-zonefile].

1.3.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

1.4.  IDNA 2008

   As X-IDNA essentially piggy-backs on IDNA 2008, the reader ought to
   be familiar with the following specifications:
   [I-D.ietf-idnabis-defs], [I-D.ietf-idnabis-protocol],
   [I-D.ietf-idnabis-mappings], [I-D.ietf-idnabis-tables],
   [I-D.ietf-idnabis-bidi] and [I-D.ietf-idnabis-rationale].


2.  Definitions

2.1.  Addresses, Normalised Addresses and Labels

   An "address" is defined in this document to be a protocol element
   used in addressing network resources, such as domain names, names,
   addresses and similar identifiers.  An address is in a protocol-
   dependent format.  An X-IDNA profile is required to specify which
   protocol elements constitute an address and how to map addresses to
   normalised addresses.
   For the purposes of this specification, the definition of an address
   need not contain the complete "address field" defined in the other
   protocol.  For example, a protocol that has strings consisting of
   locally-assigned names, domain names and numeric addresses may
   specify that the locally-assigned names are covered by X-IDNA whereas
   the domain names are covered by IDNA and the numeric addresses are
   not internationalised.  A different protocol that allows address
   specifications containing both strings used for resource
   identification and free-form text intended for human consumption may
   specify that X-IDNA applies to the resource identification part
   of the address specification whereas the human-readable text is
   covered by a different internationalisation protocol.

   A "normalised address" is defined in this document to be an address
   that is in a format suitable as input to the generic X-IDNA protocol
   defined in this document.  A normalised address consists of one or
   more labels, each separated by one or more separators.  It is
   produced in the normalisation steps (see Section 4.3 to Section 4.4)
   of the X-IDNA protocol defined in this document.
   For the purposes of this document, a normalised address need not be
   suitable for equivalence comparison.

   A "label" is defined in this document to be a part of a normalised
   address that does not contain a separator.  It is produced in the
   label extraction step (see Section 4.5) of the X-IDNA protocol
   defined in this document.
   This definition of "label" is a generalisation of that found in
   [I-D.ietf-idnabis-defs]: the term is applied to strings that are not
   part of a domain name.

   The definitions of "A-Label", "fake A-Label", "U-Label", "LDH-Label",
   "R-LDH-Label" and "NR-LDH-Label" are taken from
   [I-D.ietf-idnabis-defs] but applied to strings that are not part of a
   domain name.

   A "separator" is a (base) character that appears between labels and
   is passed through the X-IDNA process as-is.  Separators are also
   extracted in the label extraction step (see Section 4.5) of the
   X-IDNA protocol defined in this document.

   An "ASCII address" is an address that consists entirely of base
   characters.  It can contain A-Labels, fake A-Labels, R-LDH-Labels and
   NR-LDH-Labels and separators.

   A "Unicode address" may consist of both base and extended characters.
   It can contain A-Labels, fake A-Labels, R-LDH-Labels, NR-LDH-Labels,
   U-Labels and separators.

   An "internationalised address" is either an ASCII address or a
   Unicode address.

2.2.  ACE Prefix

   The "ACE prefix" is defined in this document to be a string of ASCII
   characters "xn--" that appears at the beginning of every A-Label.
   "ACE" stands for "ASCII-Compatible Encoding".

2.3.  Address slots

   An "address slot" is defined in this document to be a protocol
   element or a function argument or a return value (and so on)
   explicitly designated for carrying an "address".  Examples of address
   slots include the email address in the parameter to the SMTP MAIL or
   RCPT commands or in the "From:" field of an email message header; the
   newsgroup name appearing in Netnews; and the domain name in the QNAME
   field of a DNS query.  A string that has the syntax of an address but
   that appears in general text is not in an address slot.  For example,
   an email address appearing in the plain text body of an email message
   is not occupying an address slot.

   An "X-IDNA-aware address slot" is defined to be an address slot
   explicitly designated for carrying an internationalised address as
   defined in profiles based on this document.  The designation may be
   static (for example, in the specification of the protocol or
   interface) or dynamic (for example, as a result of negotiation in an
   interactive session).

   An "X-IDNA-unaware address slot" is defined for this set of documents
   to be any address slot that is not an X-IDNA-aware address slot.
   Obviously, this includes any address slot whose specification
   predates X-IDNA or a profile defined for these types of addresses.
   For domain names, it includes any domain name slot (as defined in
   [I-D.ietf-idnabis-defs], Section 2.3.2.6) that predates IDNA.

   These definitions are generalisations of the "domain name slots"
   defined in [I-D.ietf-idnabis-defs], Section 2.3.2.6.


2.4.  Characters

   A "base character" is a Unicode character in the range
   U+0000..U+007F.  (These Unicode characters correspond to the "ASCII"
   character set and are also known as "ASCII characters".)

   An "extended character" is a Unicode character that is not a base
   character, i.e. a character in the range U+0080 up to the maximum
   Unicode codepoint (U+10FFFF as of [Unicode]).
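
   As a non-normative illustration, the two character classes can be
   expressed as simple range checks.  The sketch below is in Python;
   the function names are chosen for this example and are not defined
   by X-IDNA.

```python
def is_base_character(ch: str) -> bool:
    """True for code points U+0000..U+007F (the ASCII repertoire)."""
    return ord(ch) <= 0x7F


def is_extended_character(ch: str) -> bool:
    """True for code points U+0080..U+10FFFF."""
    return 0x80 <= ord(ch) <= 0x10FFFF


def is_ascii_address(address: str) -> bool:
    """An ASCII address consists entirely of base characters."""
    return all(is_base_character(ch) for ch in address)
```

   With these helpers, "user@xn--bcher-kva.example" is an ASCII address,
   whereas the same address written with U+00FC in its second label is
   not.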


Teint                  Expires September 24, 2010               [Page 8]

Internet-Draft      Extending IDNA to Other Protocols         March 2010


3.  Requirements and Applicability

3.1.  Requirements

   X-IDNA makes the following requirements:

   1.  Whenever an address is put into an X-IDNA-unaware address slot
       (see Section 2.3), it MUST contain only base characters and
       follow the syntax mandated by the specification defining the
       address type.

       If the address type is case-sensitive, any A-Label within the
       address MUST be converted to lowercase.

   2.  A-Labels and U-Labels (see Section 2.1) within an address MUST be
       compared using equivalent forms: either both A-Label forms or
       both U-Label forms.  Because A-Labels and U-Labels can be
       transformed into each other without loss of information, these
       comparisons are equivalent.  A pair of A-Labels MUST be compared
       as case-insensitive ASCII.  U-Labels MUST be compared as-is,
       without case-folding or other intermediate steps.  Note that it
       is not necessary to validate labels in order to compare them and
       that successful comparison does not imply validity.  In many
       cases, not limited to comparison, validation may be important for
       other reasons and SHOULD be performed.

       Separators and labels that are neither A-Labels nor U-Labels
       (e.g., fake A-Labels and NR-LDH-Labels) MUST NOT match any
       A-Label or U-Label; other than that, X-IDNA does not make
       requirements for these (e.g., NR-LDH-Labels may be compared with
       or without case-folding, depending on the address slot type).

   3.  When addresses are not validated, they MUST conform to the
       requirements of Section 4.  When addresses are validated, they
       MUST conform to the requirements of Section 5.
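
   The comparison rules of requirement 2 can be sketched as follows
   (non-normative, Python).  The helper names are hypothetical, the
   built-in "punycode" codec of Python stands in for the Punycode
   algorithm, and the inputs are assumed to be labels that have already
   been extracted.  No validation is performed, in line with the note
   that comparison does not imply validity.

```python
ACE_PREFIX = "xn--"


def has_ace_prefix(label: str) -> bool:
    """True if the label is pure ASCII and starts with "xn--" (any case)."""
    return label.isascii() and label[:4].lower() == ACE_PREFIX


def to_u_form(label: str) -> str:
    """Decode a label carrying the ACE prefix; return other labels as-is."""
    if has_ace_prefix(label):
        try:
            return label[4:].lower().encode("ascii").decode("punycode")
        except UnicodeError:
            pass  # not decodable (fake A-Label): keep the label unchanged
    return label


def labels_match(a: str, b: str) -> bool:
    """Compare two labels using equivalent forms.

    Two A-Labels are compared as case-insensitive ASCII.  Otherwise both
    sides are brought to U-Label form and compared as-is, without
    case-folding.
    """
    if has_ace_prefix(a) and has_ace_prefix(b):
        return a.lower() == b.lower()
    return to_u_form(a) == to_u_form(b)
```

   Under these assumptions, "xn--bcher-kva" matches both "XN--BCHER-KVA"
   and the U-Label it encodes, but a U-Label does not match a variant of
   itself that differs only in case.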

3.2.  Applicability and X-IDNA Profiles

   The application of X-IDNA to any type of address depends on an
   additional specification that provides the details.  These other
   specifications are referred to as X-IDNA Profiles.

   Each definition of an X-IDNA Profile MUST include all of the
   following:

   1.  The protocol or protocols to which X-IDNA is applied and the
       syntax elements of these protocols that represent an X-IDNA-
       unaware address slot.

   2.  The steps necessary to map any syntactically valid address to a
       normalised address (as described in Section 4.3)

   3.  The parts of the protocol that require checking the address for
       validity (as described in Section 5).

   A specification MAY also include:

   o  Syntax extensions for X-IDNA-aware address slots.

   o  Methods for interoperation with other ways of address
      internationalisation.


4.  Address Conversion

   Before an Internationalised Address is put into an X-IDNA-unaware
   slot, it MUST be converted to an ASCII Address using the following
   procedure.

   Although some validity checks are necessary to avoid serious problems
   with the protocol, the tests are permissive and rely on the
   assumption that names that can be successfully used are valid.  That
   assumption is, however, a weak one because the presence of wild cards
   in the receiving system might cause a string that has not been
   explicitly defined and validated to be successfully used as an
   address.

   This procedure is a generalisation of the Domain Name Lookup Protocol
   defined in [I-D.ietf-idnabis-protocol], Section 5.

4.1.  Address Input

   The user supplies a string in the local character set, for example by
   typing it or clicking on, or copying and pasting, a resource
   identifier, e.g., a URI ([RFC3986]) or IRI ([RFC3987]) from which the
   address is extracted.  Alternately, some process not directly
   involving the user may read the string from a file or obtain it in
   some other way.

   Processing in this step and that specified in Section 4.2,
   Section 4.3 and Section 4.4 are local matters, to be accomplished
   prior to actual invocation of X-IDNA.

4.2.  Conversion To Unicode

   The string is converted from the local character set into Unicode, if
   it is not already in Unicode.

   Depending on local needs, this conversion may involve mapping some
   characters into other characters as well as coding conversions.  This
   section defines a general algorithm that applications ought to
   implement in order to produce Unicode code points that will be valid
   under the IDNA protocol.

   An application might implement the full mapping as described below,
   or can choose a different mapping.  In fact, an application might
   want to implement a full mapping that is substantially compatible
   with the original IDNA protocol instead of the algorithm given here.

   The general algorithm that an application (or the input method
   provided by an operating system) ought to use is relatively
   straightforward:

   1.  Upper case characters are mapped to their lower case equivalents
       by using the algorithm for mapping case in Unicode characters.

   2.  Full-width and half-width characters (those defined with
       Decomposition Types <wide> and <narrow>) are mapped to their
       decomposition mappings as shown in the Unicode character
       database.

   3.  The application can also choose to map some or all of the
       following separator characters to ASCII characters which are
       roughly equivalent:

       *  mapped to U+0020 (SPACE):

          +  all characters having a canonical or compatibility
             decomposition to U+0020

       *  mapped to U+0022 (QUOTATION MARK):

          +  U+201C (LEFT DOUBLE QUOTATION MARK)

          +  U+201D (RIGHT DOUBLE QUOTATION MARK)

          +  U+201E (DOUBLE LOW-9 QUOTATION MARK)

          +  U+201F (DOUBLE HIGH-REVERSED-9 QUOTATION MARK)

          +  U+2033 (DOUBLE PRIME)

          +  U+301D (REVERSED DOUBLE PRIME QUOTATION MARK)

          +  U+301E (DOUBLE PRIME QUOTATION MARK)

          +  U+301F (LOW DOUBLE PRIME QUOTATION MARK)

       *  mapped to U+0027 (APOSTROPHE):

          +  U+2018 (LEFT SINGLE QUOTATION MARK)

          +  U+2019 (RIGHT SINGLE QUOTATION MARK)

          +  U+201B (SINGLE HIGH-REVERSED-9 QUOTATION MARK)

          +  U+2032 (PRIME)

       *  mapped to U+002C (COMMA):

          +  U+201A (SINGLE LOW-9 QUOTATION MARK)

          +  U+3001 (IDEOGRAPHIC COMMA)

       *  mapped to U+002E (FULL STOP):

          +  U+3002 (IDEOGRAPHIC FULL STOP)

       *  mapped to U+003C (LESS-THAN SIGN):

          +  U+2039 (SINGLE LEFT-POINTING ANGLE QUOTATION MARK)

       *  mapped to U+003E (GREATER-THAN SIGN):

          +  U+203A (SINGLE RIGHT-POINTING ANGLE QUOTATION MARK)

       *  mapped to U+007C (VERTICAL LINE):

          +  U+00A6 (BROKEN BAR)

       *  mapped to U+007E (TILDE):

          +  U+301C (WAVE DASH)

   Unicode Normalisation ought to be deferred until after the Address
   Normalisation defined in the following section.
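
   The three mapping steps above can be sketched as follows
   (non-normative, Python, using only the standard "unicodedata"
   module).  The names SEPARATOR_MAP, map_wide_narrow, map_separator and
   map_input are hypothetical; the sketch applies all optional separator
   mappings of step 3 and, as recommended above, performs no Unicode
   Normalisation of the result.

```python
import unicodedata

# Step 3: optional mappings of separator lookalikes to ASCII characters.
SEPARATOR_MAP = {
    **dict.fromkeys("\u201C\u201D\u201E\u201F\u2033\u301D\u301E\u301F", '"'),
    **dict.fromkeys("\u2018\u2019\u201B\u2032", "'"),
    **dict.fromkeys("\u201A\u3001", ","),
    "\u3002": ".",
    "\u2039": "<",
    "\u203A": ">",
    "\u00A6": "|",
    "\u301C": "~",
}


def map_wide_narrow(ch: str) -> str:
    """Step 2: replace <wide>/<narrow> characters by their decomposition."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith(("<wide>", "<narrow>")):
        return "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
    return ch


def map_separator(ch: str) -> str:
    """Step 3: map separator lookalikes to their ASCII counterparts."""
    if ch in SEPARATOR_MAP:
        return SEPARATOR_MAP[ch]
    # Characters that decompose (canonically or by compatibility) to SPACE.
    if unicodedata.normalize("NFKD", ch) == " ":
        return " "
    return ch


def map_input(text: str) -> str:
    text = text.lower()                                   # step 1
    text = "".join(map_wide_narrow(ch) for ch in text)    # step 2
    return "".join(map_separator(ch) for ch in text)      # step 3
```

   An application that chooses a different mapping, as permitted above,
   would replace or omit individual steps of map_input.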

4.3.  Address Normalisation

   The string is then normalised as specified in the X-IDNA Profile
   applicable for the type of address slot into which the string is
   intended to be put.  This step maps a syntactically valid address to
   a normalised address, from which labels can be extracted.  It is
   defined by the X-IDNA Profile.

   The rules for normalisations defined by X-IDNA Profiles are:

   o  If the protocol to which the X-IDNA profile applies considers two
      addresses as equivalent, the normalisation procedure MUST produce
      strings that yield exactly the same labels in the following
      step, the label extraction.  (However, the X-IDNA Profile may
      specify a normalisation that does not produce output directly
      suitable for equivalence comparison.)

   o  The runs of label characters produced by the normalisation SHOULD
      be as long as possible.  In particular, the normalisation SHOULD
      treat extended characters as "unreserved", "text", etc.  It SHOULD
      NOT mechanically insert "quote" or "escape" characters between
      characters that can be part of a label.

      These two requirements are usually met if the normalisation
      removes all optional separators ("quotes" or "escape" characters),
      especially those appearing between characters that can be part of
      a label.

      For example, if a protocol allows any character to be "quoted" by
      preceding it with an optional U+002F (SOLIDUS), this character
      needs to be removed if it is followed by any letter, digit, hyphen
      or extended character; if a protocol allows non-significant
      QUOTATION MARKs (U+0022) anywhere in the string, these need to be
      removed if they appear within a label; and so on.

   o  Extended characters SHOULD be either mapped to ASCII characters or
      treated as belonging to the character class that needs no
      "quoting" or "escaping" otherwise defined by the protocol.

      This requires special care for extended characters in
      implementations.  Legacy code may mechanically escape or quote
      them because they have traditionally not been in the category
      defined as "unreserved", "text", etc.

      For example, an extended character may be assigned to the
      character class "atext" defined in [RFC5322] for the purposes of
      the normalisation.

   An application MAY also choose to map invalid addresses to
   syntactically valid addresses, for example by "escaping" or "quoting"
   problematic base characters; or it MAY reject invalid addresses in
   this step.
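
   As a non-normative illustration of these rules, the sketch below
   (Python) normalises addresses of a purely hypothetical protocol in
   which any character may be quoted by a preceding U+002F (SOLIDUS),
   as in the example above.  The protocol and the function names are
   assumptions made for this example; an actual X-IDNA Profile defines
   its own normalisation.

```python
def can_be_in_label(ch: str) -> bool:
    """Letters, digits, hyphen and extended characters can be in a label."""
    return (ch.isascii() and (ch.isalnum() or ch == "-")) or ord(ch) >= 0x80


def normalise(address: str) -> str:
    """Drop optional SOLIDUS quotes in front of potential label characters."""
    out = []
    i = 0
    while i < len(address):
        ch = address[i]
        if ch == "/" and i + 1 < len(address):
            quoted = address[i + 1]
            if can_be_in_label(quoted):
                out.append(quoted)        # the quote is optional: remove it
            else:
                out.append(ch + quoted)   # keep the quote and quoted character
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

   Addresses that this hypothetical protocol considers equivalent, such
   as a label written with and without quotes in front of its letters,
   then normalise to the same string and therefore yield the same
   labels in the label extraction.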

4.4.  Unicode Normalisation

   Depending on local needs, the string can then be mapped using Unicode
   Normalization Form C (NFC).

4.5.  Extraction of Labels and Delimiters

   The normalised address is then split into putative labels and
   separators.  The labels extracted in this step have not been checked
   for conformance and are therefore referred to as "putative".

   Each putative label starts with a letter, digit or extended
   character, which may be followed by a string that consists of zero or
   more letters, digits, extended characters or HYPHEN-MINUS characters
   and ends in a letter, digit or extended character.
   The delimiters (separators) are base characters that are not digits
   or letters.

   The label extraction differs significantly from the procedure
   specified in [I-D.ietf-idnabis-mappings] by defining a number of
   separators in addition to U+002E (FULL STOP).

   The purpose of this procedure is to ensure that all labels are IDNA-
   valid (as defined in [I-D.ietf-idnabis-defs]) regardless of the range
   of non-LDH characters allowed in the address.

   o  The following Unicode codepoints are always part of a label:

      *  U+0030 to U+0039 (DIGIT ZERO to DIGIT NINE)

      *  U+0041 to U+005A (LATIN CAPITAL LETTER A to LATIN CAPITAL
         LETTER Z)

      *  U+0061 to U+007A (LATIN SMALL LETTER A to LATIN SMALL LETTER Z)

      *  U+0080 and above

      That is, all ASCII digits and letters and all non-ASCII
      codepoints.

   o  The following Unicode codepoints are always separators:

      *  U+0000 to U+002C (<control> NULL to COMMA)

      *  U+002E to U+002F (FULL STOP to SOLIDUS)

      *  U+003A to U+0040 (COLON to COMMERCIAL AT)

      *  U+005B to U+0060 (LEFT SQUARE BRACKET to GRAVE ACCENT)

      *  U+007B to U+007F (LEFT CURLY BRACKET to <control> DELETE)

      That is, all ASCII codepoints except for the hyphen, digits and
      letters.

   o  The following Unicode codepoint is part of a label or a separator,
      depending on context:

      *  U+002D (HYPHEN-MINUS)

      A sequence of one or more HYPHEN-MINUS characters is composed of
      separators exactly if the run

      *  appears at the beginning or end of the address, OR

      *  is preceded or followed by a separator.

      That is, hyphens that would start or end a label (and thus violate
      the syntax for LDH-Labels), are separators; other hyphens are part
      of a label.
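
   The classification above can be sketched as follows (non-normative,
   Python).  The function names and the representation of the result as
   a list of ("label", text) and ("sep", text) pairs, with adjacent
   separators grouped into one run, are assumptions made for this
   example.

```python
import string

ASCII_ALNUM = set(string.ascii_letters + string.digits)


def always_label(ch: str) -> bool:
    """ASCII letters and digits and all non-ASCII code points."""
    return ch in ASCII_ALNUM or ord(ch) >= 0x80


def split_address(address: str):
    """Split a normalised address into putative labels and separators."""
    n = len(address)
    # A HYPHEN-MINUS starts out as a separator; runs are decided below.
    in_label = [always_label(ch) for ch in address]
    i = 0
    while i < n:
        if address[i] != "-":
            i += 1
            continue
        j = i
        while j < n and address[j] == "-":
            j += 1
        # The run address[i:j] is part of a label only if it is enclosed
        # by label characters on both sides.
        enclosed = i > 0 and j < n and in_label[i - 1] and in_label[j]
        for k in range(i, j):
            in_label[k] = enclosed
        i = j
    tokens = []
    for ch, flag in zip(address, in_label):
        kind = "label" if flag else "sep"
        if tokens and tokens[-1][0] == kind:
            tokens[-1] = (kind, tokens[-1][1] + ch)
        else:
            tokens.append((kind, ch))
    return tokens
```

   For instance, split_address("a-.b") yields the label "a", the
   separator run "-." and the label "b", because the hyphen is followed
   by a separator; in "user-name" the hyphen is enclosed by letters and
   stays inside the label.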

4.6.  A-Label Input

   If a putative label extracted in the previous step appears to be an
   A-Label (i.e., it starts with "xn--", interpreted case-insensitively,
   and does not contain extended characters), the application MAY
   attempt to convert it to a U-Label, first ensuring that the A-Label
   is entirely in lower case (converting it to lower case if necessary),
   and apply the tests of Section 4.7 and the conversion of Section 4.8
   to that form.

   If the label is converted to Unicode (i.e., to U-Label form) using
   the Punycode decoding algorithm, then the processing specified in
   those two sections MUST be performed.

   If any of these steps fails or rejects the label (i.e., it is a fake
   A-Label), the putative label is handled as follows:

   o  the original MUST be used if the X-IDNA profile allows R-LDH-
      Labels, and

   o  it MAY be rejected if the X-IDNA profile does not allow R-LDH-
      Labels.

   If a label consists entirely of base characters, the following two
   steps (Section 4.7 and Section 4.8) are skipped for this label.
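
   A non-normative sketch (Python) of the optional A-Label decoding is
   given below.  The function name is hypothetical, Python's built-in
   "punycode" codec stands in for the Punycode decoding algorithm, and
   the re-encoding check is one possible way of detecting fake A-Labels.
   The tests of Section 4.7, which MUST follow a successful decoding,
   are not included.

```python
ACE_PREFIX = "xn--"


def try_a_label_to_u_label(putative: str):
    """Return the decoded label, or None if decoding is not applicable.

    None is returned for labels that do not look like A-Labels and for
    labels whose Punycode part cannot be decoded or does not round-trip.
    """
    if not putative.isascii() or putative[:4].lower() != ACE_PREFIX:
        return None
    lowered = putative.lower()   # ensure the A-Label is in lower case
    try:
        decoded = lowered[4:].encode("ascii").decode("punycode")
        reencoded = ACE_PREFIX + decoded.encode("punycode").decode("ascii")
    except UnicodeError:
        return None
    return decoded if reencoded == lowered else None
```

   A caller that receives None keeps or rejects the original putative
   label according to the rules above.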

4.7.  Validation and Character List Testing

   The putative labels extracted in the previous step are checked to
   verify that all characters that appear in them are valid as input to
   X-IDNA processing.  As discussed above, this check is liberal in
   order to allow for compatibility with future extensions.

   Putative labels with any of the following characteristics MUST be
   rejected in this step:

   o  Labels that are not in NFC ([UAX15])

   o  Labels containing "--" (two consecutive hyphens) in the third and
      fourth character positions

   o  Labels whose first character is a combining mark (see Section 2.11
      of [Unicode])

   o  Labels containing prohibited code points, i.e., those that are
      assigned to the "DISALLOWED" category of [I-D.ietf-idnabis-tables]

   o  Labels containing code points that are identified in
      [I-D.ietf-idnabis-tables] as "CONTEXTJ", i.e., requiring
      exceptional contextual rule processing on lookup, but that do not
      conform to those rules.  Note that this implies that a rule needs
      to be defined, not null: a character that requires a contextual
      rule but for which the rule is null is treated in this step as
      having failed to conform to the rule

   o  Labels containing code points that are identified in
      [I-D.ietf-idnabis-tables] as "CONTEXTO", but for which no such
      rule appears in the table of rules.  Applications resolving DNS
      names or carrying out equivalent operations are not required to
      test contextual rules for "CONTEXTO" characters, only to verify
      that a rule is defined (although they MAY make such tests to
      provide better protection or give better information to the user)

   o  Labels containing code points that are unassigned in the version
      of Unicode being used by the application, i.e., in the
      "UNASSIGNED" category of [I-D.ietf-idnabis-tables].
      This requirement means that the application needs to use a list of
      unassigned characters that is matched to the version of Unicode
      that is being used for the other requirements in this section.  It
      is not required that the application know which version of Unicode
      is being used; that information might be part of the operating
      environment in which the application is running.

   In addition, the application SHOULD apply the following test:

   o  Verification that the string is compliant with the requirements
      for right to left characters, specified in
      [I-D.ietf-idnabis-bidi].

   This test may be omitted in special circumstances, such as when the
   lookup application knows that the conditions are enforced elsewhere,
   because an attempt to use such strings as an address will almost
   certainly lead to an error except when wild cards are present on a
   receiving system.  However, applying the test is likely to give an
   earlier detection and much better information about the reason for a
   failure -- information that may be usefully passed to the user when
   that is feasible -- than later failure alone.

   For all other strings, the lookup application MUST rely on the
   protocol using the address to determine the validity of the address
   and of the characters it contains.  If the address can successfully
   be used, it is presumed to be valid; if it cannot, its possible
   validity is not relevant.  While an application may reasonably issue
   warnings about strings it believes may be problematic, applications
   that decline to process a string that conforms to the rules above
   (i.e., do not allow it to be put into an address slot) are not in
   conformance with this protocol.

4.8.  Punycode Conversion

   The string that has now been validated is converted to ACE form by
   applying the Punycode algorithm to the string and then adding the ACE
   prefix.
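
   As an illustration, the conversion step could be sketched in Python
   with the built-in "punycode" codec, which implements the encoder of
   [RFC3492].  The sketch assumes that the ACE prefix is "xn--", as in
   IDNA; the function name is hypothetical.

```python
ACE_PREFIX = "xn--"  # assumption: the same ACE prefix as IDNA


def to_a_label(u_label: str) -> str:
    """Convert an already-validated Unicode label to ACE form."""
    # The codec returns ASCII bytes; decode them back into a str.
    encoded = u_label.encode("punycode").decode("ascii")
    return ACE_PREFIX + encoded
```

   With this sketch, to_a_label("bücher") yields "xn--bcher-kva".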

4.9.  Re-Assembly

   The A-Labels resulting from the conversion in Section 4.8 or supplied
   directly (see Section 4.6) are combined with the delimiters (see
   Section 4.5), in the original order, to form an A-Address.

   The ASCII Address can then be put into an X-IDNA-unaware address slot
   and be used as a normal address.  The use of this address can
   either succeed or fail (resulting in a lookup failure, bounce
   message, etc.).
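
   The re-assembly step could be sketched as follows.  The token
   representation, the function names, the "xn--" prefix and the
   treatment of pure-ASCII labels as "supplied directly" are all
   assumptions made for illustration; an X-IDNA Profile determines how
   labels and delimiters are actually identified.

```python
from typing import List, Tuple

ACE_PREFIX = "xn--"  # assumption: the same ACE prefix as IDNA

# Hypothetical token form: ("label", text) or ("delim", text), kept in
# the order in which they appeared in the original address.
Token = Tuple[str, str]


def to_a_label(label: str) -> str:
    """Return an ASCII label unchanged, otherwise its ACE form."""
    if label.isascii():
        return label  # assumed to be supplied directly
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def reassemble(tokens: List[Token]) -> str:
    """Join converted labels and untouched delimiters in original order."""
    parts = []
    for kind, text in tokens:
        parts.append(to_a_label(text) if kind == "label" else text)
    return "".join(parts)
```

   For the hypothetical token sequence "bücher", "@", "example", ".",
   "com", this sketch produces "xn--bcher-kva@example.com".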


5.  Address Validation

   Whenever an X-IDNA Profile mandates that the addresses be validated,
   the following procedure MUST be followed.

   Addresses ought to be validated whenever an address is defined,
   registered, etc.  An X-IDNA Profile defines when and by whom
   addresses are validated.

5.1.  Label Validation

   In order to validate the individual labels embedded in an address,
   the address is normalised as specified in Section 4.3 and the labels
   are then extracted as specified in Section 4.5.

   The following validation steps apply to the extracted labels.  The
   labels (in the form of a Unicode string, i.e., a string that at least
   superficially appears to be a U-label) are then examined, performing
   tests that require examination of more than one character.  Character
   order is considered to be the on-the-wire order, not the display
   order.

5.1.1.  Hyphen Restrictions

   If the Unicode string contains non-base characters, it MUST NOT
   contain "--" (two consecutive HYPHEN-MINUS characters) in the third
   and fourth character positions.

   Note: The Unicode string will not start or end with a "-" (HYPHEN-
   MINUS) at this point.

5.1.2.  Leading Combining Marks

   The Unicode string MUST NOT begin with a combining mark or combining
   character (see Section 2.11 of [Unicode] for an exact definition).
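
   A minimal sketch of this test, assuming that "combining mark" can be
   approximated by General_Category Mn, Mc or Me and that the Unicode
   data bundled with the Python runtime is acceptable (it is not
   necessarily Unicode 5.2).  The function name is hypothetical.

```python
import unicodedata


def starts_with_combining_mark(label: str) -> bool:
    """True if the first character has General_Category Mn, Mc, or Me."""
    # An empty label has no first character and is reported as False.
    return bool(label) and unicodedata.category(label[0]).startswith("M")
```

   A label for which this function returns True is invalid under this
   section.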

5.1.3.  Contextual Rules

   The Unicode string MUST NOT contain any characters whose validity is
   context-dependent, unless the validity is positively confirmed by a
   contextual rule.  To check this, each code-point marked as CONTEXTJ
   or CONTEXTO in [I-D.ietf-idnabis-tables] MUST have a non-null rule.
   If such a code-point is missing a rule, it is invalid.  If the rule
   exists but the result of applying the rule is negative or
   inconclusive, the proposed label is invalid.
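
   The logic above could be sketched as follows.  The rule table is
   hypothetical and deliberately incomplete: it contains a single
   entry, a reading of the ZERO WIDTH JOINER rule (valid only directly
   after a character with Canonical_Combining_Class Virama), whereas a
   real implementation would take the complete rule set from
   [I-D.ietf-idnabis-tables].  A table value of None models a
   contextual code point whose rule is null.

```python
import unicodedata
from typing import Callable, Dict, Optional

VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"

Rule = Callable[[str, int], bool]


def zwj_rule(label: str, pos: int) -> bool:
    """Allow ZERO WIDTH JOINER only directly after a Virama."""
    return pos > 0 and unicodedata.combining(label[pos - 1]) == VIRAMA_CCC


# Hypothetical demo table: code point -> rule (None means "null rule").
DEMO_RULES: Dict[int, Optional[Rule]] = {0x200D: zwj_rule}


def contextual_rules_ok(label: str, rules: Dict[int, Optional[Rule]]) -> bool:
    """Check every code point listed in rules; others are skipped."""
    for pos, ch in enumerate(label):
        cp = ord(ch)
        if cp not in rules:
            continue  # not a contextual code point in this table
        rule = rules[cp]
        if rule is None or not rule(label, pos):
            return False  # null rule, or validity not confirmed
    return True
```

   Passing the table as a parameter keeps the check independent of any
   particular version of the rules.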


5.1.4.  Labels Containing Characters Written Right to Left

   If the proposed label contains any characters that are written from
   right to left, it MUST meet the BIDI criteria
   [I-D.ietf-idnabis-bidi].

5.1.5.  Successful Punycode Encoding

   The Unicode string MUST be convertible to ACE form using the Punycode
   algorithm ([RFC3492]), i.e., it MUST NOT cause an overflow.

5.2.  Other Syntax Restrictions

   In A-Address form, the address MUST conform to the syntax defined by
   the X-IDNA-unaware specification for the address type.  This MAY
   include length restrictions, syntax restrictions regarding
   separators, etc.

5.3.  Local Restrictions

   In addition to the rules and tests above, there are many reasons why
   a site, registry, or administrator could reject an address.

   The responsible entity is expected to establish or follow policies
   about addresses it wishes to define or register.  Policies are likely
   to be informed by the local languages and the scripts that are used
   to write them and may depend on many factors including what
   characters are in the label (for example, an address may be rejected
   based on other addresses already registered).

   The same considerations as for IDNA registrations apply; see Section
   3.2 of [I-D.ietf-idnabis-rationale] for a discussion and
   recommendations about registry policies.

   While X-IDNA, unlike IDNA 2008, allows addresses that contain fake
   A-Labels and R-LDH-Labels, responsible entities ought to avoid such
   addresses except when backwards-compatibility requires them.


6.  Acknowledgements

   The larger part of this specification was lifted directly from the
   IDNA 2008 specifications and would not have been possible without
   the excellent work that went into them.


7.  IANA Considerations

   This memo includes no request to IANA.

   Instead, the definition of an X-IDNA Profile ought to be coordinated
   with the entity that controls the specification of the address slot
   type to which X-IDNA is applied.


8.  Security Considerations

   X-IDNA shares the Security Considerations for IDNA, which are
   described in [I-D.ietf-idnabis-defs], except for the special issues
   associated with right to left scripts and characters.  The latter are
   discussed in [I-D.ietf-idnabis-bidi].

   In addition, each X-IDNA Profile will require additional Security
   Considerations, which MUST be discussed in the document defining the
   Profile.


9.  References

9.1.  Normative References

   [I-D.ietf-idnabis-bidi]
              Alvestrand, H. and C. Karp, "Right-to-left scripts for
              IDNA", draft-ietf-idnabis-bidi-07 (work in progress),
              January 2010.

   [I-D.ietf-idnabis-defs]
              Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              draft-ietf-idnabis-defs-13 (work in progress),
              January 2010.

   [I-D.ietf-idnabis-mappings]
              Resnick, P. and P. Hoffman, "Mapping Characters in IDNA",
              draft-ietf-idnabis-mappings-05 (work in progress),
              October 2009.

   [I-D.ietf-idnabis-protocol]
              Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol",
              draft-ietf-idnabis-protocol-18 (work in progress),
              January 2010.

   [I-D.ietf-idnabis-rationale]
              Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Background, Explanation, and
              Rationale", draft-ietf-idnabis-rationale-17 (work in
              progress), January 2010.

   [I-D.ietf-idnabis-tables]
              Faltstrom, P., "The Unicode code points and IDNA",
              draft-ietf-idnabis-tables-09 (work in progress),
              January 2010.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
              for Internationalized Domain Names in Applications
              (IDNA)", RFC 3492, March 2003.

   [UAX15]    Davis, M., Whistler, K., and M. Duerst, "Unicode
              Normalization Forms, Revision 31", UAX #15,
              September 2009,
              <http://unicode.org/reports/tr15/tr15-31.html>.

   [Unicode]  Unicode Consortium, "Unicode Standard, Version 5.2",
              December 2009,
              <http://www.unicode.org/versions/Unicode5.2.0/>.

9.2.  Informative References

   [I-D.teint-xidna-zonefile]
              Teint, N., "An X-IDNA Profile for DNS Zone Master Files",
              draft-teint-xidna-zonefile-00 (work in progress),
              March 2010.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
              Identifiers (IRIs)", RFC 3987, January 2005.


Author's Address

   Nick Teint

   Email: nick.teint@googlemail.com

