Internet Draft                                D. Crocker
     draft-crocker-idn-idn-00.txt Brandenburg InternetWorking
     Expires in six months                       23 June 2002
                              
                              
                              
                              
            Internationalized Domain Names (IDN)
     
     
     Status of this Memo
     
     This document is an Internet-Draft and is in full
     conformance with all provisions of Section 10 of
     RFC2026.
     
     Internet-Drafts are working documents of the Internet
     Engineering Task Force (IETF), its areas and its working
     groups. Note that other groups may also distribute
     working documents as Internet-Drafts.
     
     Internet-Drafts are draft documents valid for a maximum
     of six months and may be updated, replaced, or obsoleted
     by other documents at any time. It is inappropriate to
     use Internet-Drafts as reference material or to cite
     them other than as "work in progress."
     
     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt
     
     The list of Internet-Draft Shadow Directories can be
     accessed at http://www.ietf.org/shadow.html.
     
     
     Abstract
     
     Globalization of the Internet requires that domain names
     be able to use characters outside the ASCII repertoire.
     This document specifies internationalized domain names
     (IDNs) and defines initial domain name constructs in
     which IDNs can be used.  IDNs use characters drawn from
     a large repertoire (Unicode).
0.   Document Change Notes --
     
     This is a revision to draft-ietf-idn-idna-09.txt.  It is
     being distributed independently to facilitate
     discussion.
     
     The goal is to gain consensus about revisions to the IDN
     working group document, specifically for the following
     changes:
        
        a. Split the document into two, one for defining
           Internationalized Domain Names (IDN) and the other for
           defining an encoding method of IDNs, namely IDNA using ACE.
        
        b. Distinguish general IDN from its specific use for host
           names (IDN-Host).  Use for host names is specified more
           precisely, in terms of a specific syntax BNF rule from the
           relevant existing DNS specification, so that IDN-Host will
           apply precisely to all DNS record fields and protocol units
           conforming to that BNF.
1.   Introduction
     
     Until now, there has been no standard method for domain
     names to use characters outside the ASCII repertoire.
     This document defines enhancements to the definition of
     domain names, to support internationalized domain names
     (IDN).  The details for doing protocol encoding of IDNs
     are specified separately.
2.   Terminology
     
     The key words "MUST", "SHALL", "REQUIRED", "SHOULD",
     "RECOMMENDED", and "MAY" in this document are to be
     interpreted as described in RFC 2119 [RFC2119].
     
     "ASCII"
          
          means US-ASCII [USASCII], a coded character set
          containing 128 characters associated with code
          points in the range 0..7F. Unicode is an extension
          of ASCII: it includes all the ASCII characters and
          associates them with the same code points.
     
     Code point
          
          refers to an integral value associated with a
          character in a coded character set.
     
     Domain name
          
          is used as a general term for strings conforming to
          [STD13].  [STD13] talks about "domain names" and
          "host names", but many people use the terms
          interchangeably. Further, because [STD13] was not
          terribly clear, many people who are sure they know
          the exact definitions of each of these terms
          disagree on the definitions. This document uses the
          terms separately.
     
     Domain name slot
          
          refers to a protocol element or a function argument
          or a return value (and so on) explicitly designated
          for carrying a domain name. Examples of domain name
          slots include: the QNAME field of a DNS query; the
          name argument of the gethostbyname() library
          function; the part of an email address following
          the at-sign (@) in the From: field of an email
          message header; and the host portion of the URI in
          the src attribute of an HTML <IMG> tag.  General
          text that just happens to contain a domain name is
          not a domain name slot; for example, a domain name
          appearing in the plain text body of an email
          message is not occupying a domain name slot.
     
     Host name
          
          is a domain name conforming to STD13, with the
          naming character set limited to LDH.
     
     Internationalized host name (IDN-Host)
          
          is an IDN conforming to [STD13], except that it
          also supports non-ASCII characters from Unicode.
     
     Internationalized domain name (IDN)
          
          is a domain name that has characters drawn from the
          restricted set of Unicode defined in <??>>
     
     Internationalized label
          
          is a label composed of characters from the Unicode
          character set; note, however, that not every string
          of Unicode characters can be an internationalized
          label.
     
     IDN-native
          
          is a domain name slot specified to hold an
          internationalized domain name. The designation may
          be static (for example, in the specification of the
          protocol or interface) or dynamic (for example, as
          a result of negotiation in an interactive session).
     
     Label
          
          is an individual part of a domain name. Labels are
          usually shown separated by dots; for example, the
          domain name "www.example.com" is composed of three
          labels: "www", "example", and "com". (The zero-
          length root label described in [STD13], which can
          be explicit as in "www.example.com." or implicit as
          in "www.example.com", is not considered a label in
          this specification.) Throughout this document the
          term "label" is shorthand for "text label", and
          "every label" means "every text label". In IDNA,
          not all text strings can be labels.
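
          As an illustration only, the label structure described
          above can be sketched in Python.  The function name
          split_labels is hypothetical and not part of any
          specification; the sketch assumes ASCII dot (U+002E) is
          the only separator and performs no validation of label
          contents.

```python
from typing import List


def split_labels(domain_name: str) -> List[str]:
    """Split a dotted domain name into its text labels.

    A single trailing dot (the explicit root) is dropped, because the
    zero-length root label is not counted as a label here.
    """
    if domain_name.endswith("."):
        domain_name = domain_name[:-1]
    # An empty name (or a bare root ".") has no text labels.
    return domain_name.split(".") if domain_name else []


print(split_labels("www.example.com"))
print(split_labels("www.example.com."))
```

          Both calls yield the same three labels, which mirrors
          the statement that the root label may be explicit or
          implicit.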
     
     LDH code points
          
          is defined to mean the code points associated with
          ASCII letters, digits, and the hyphen-minus; that
          is, U+002D, 30..39, 41..5A, and 61..7A. "LDH" is an
          abbreviation for "letters, digits, hyphen".
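
          A minimal sketch of this definition in Python follows.
          The names LDH_CODE_POINTS and is_ldh are assumptions
          made for the example.  The check covers only the
          character repertoire; it does not enforce other host
          name rules such as length limits or hyphen placement.

```python
# LDH code points: hyphen-minus, digits, uppercase and lowercase letters.
LDH_CODE_POINTS = frozenset(
    [0x2D]
    + list(range(0x30, 0x3A))   # 30..39  digits
    + list(range(0x41, 0x5B))   # 41..5A  uppercase letters
    + list(range(0x61, 0x7B))   # 61..7A  lowercase letters
)


def is_ldh(text: str) -> bool:
    """Return True if every character of text is an LDH code point.

    Note: an empty string trivially returns True; callers that need
    non-empty labels must check length separately.
    """
    return all(ord(ch) in LDH_CODE_POINTS for ch in text)


print(is_ldh("example-1"))   # True
print(is_ldh("ex_ample"))    # False: underscore (5F) is not LDH
```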
     
     Unicode
          
          is a coded character set [UNICODE] containing tens
          of thousands of characters. A single Unicode code
          point is denoted by "U+" followed by four to six
          hexadecimal digits, while a range of Unicode code
          points is denoted by two hexadecimal numbers
          separated by "..", with no prefixes.
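
          The notation can be reproduced with a short Python
          sketch.  The helper names format_code_point and
          format_range are hypothetical, chosen only for this
          illustration.

```python
def format_code_point(cp: int) -> str:
    """Render a code point as "U+" plus four to six hex digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return "U+{:04X}".format(cp)


def format_range(first: int, last: int) -> str:
    """Render a code point range as two hex numbers joined by ".."."""
    return "{:X}..{:X}".format(first, last)


print(format_code_point(0x2D))      # U+002D
print(format_code_point(0x3002))    # U+3002
print(format_range(0x30, 0x39))     # 30..39
```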
3.   Internationalized Domain Names (IDN)
3.1. Data representation
     
     This specification extends the set of values valid in
     domain name labels beyond the restricted ASCII repertoire
     specified in [STD3], to include characters from [UNICODE].
     
     Mechanisms for encoding Unicode values in domain names
     are specified separately.  Hence this specification
     provides no detail for IDNs in "native" binary form
     (IDN-native) or for "encoded" Unicode-based IDNs.
3.2. Dot as label separator
     
     For systems supporting IDN, wherever dot is permitted as
     a label separator, the following characters MUST be
     recognized as dots: U+002E (full stop), U+3002
     (ideographic full stop), U+FF0E (fullwidth full stop),
     U+FF61 (halfwidth ideographic full stop).
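
     One way a system might satisfy this requirement is to map
     the three additional characters to U+002E before splitting
     a name into labels.  The Python sketch below is only an
     illustration under that assumption; the names
     LABEL_SEPARATORS and split_idn_labels are hypothetical.

```python
from typing import List

# The four characters that are to be recognized as dots.
LABEL_SEPARATORS = "\u002E\u3002\uFF0E\uFF61"

# Translation table mapping every separator to ASCII full stop.
_DOT_TABLE = {ord(ch): "." for ch in LABEL_SEPARATORS}


def split_idn_labels(name: str) -> List[str]:
    """Split a name into labels, treating all four dots alike."""
    normalized = name.translate(_DOT_TABLE)
    # Drop one trailing dot: the root label is not a text label.
    if normalized.endswith("."):
        normalized = normalized[:-1]
    return normalized.split(".") if normalized else []


print(split_idn_labels("www\u3002example\uFF0Ecom"))
```

     The example prints the labels "www", "example", and "com",
     the same result as for the all-ASCII form of the name.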
     
          
          << // Are there also multiple Unicode characters
          permitted for at-sign?  What about for slash ("/")?
          
          If not, then why is the domain name lexical
          analyzer now required to look for 4 characters
          rather than only one?
          
          This appears to be a case of putting into the
          protocol something that is, in fact, entirely a
          user-interface issue.  That some user interfaces
          will choose to map U+3002 to ASCII dot does not
          mean that it needs to be in the protocol.  // /Dave
          >>
4.   References
4.1. Normative references
     
     [STD3] Bob Braden, "Requirements for Internet Hosts --
     Communication Layers" (RFC 1122) and "Requirements for
     Internet Hosts -- Application and Support" (RFC 1123),
     STD 3, October 1989.
     
     [STD13] Paul Mockapetris, "Domain names - concepts and
     facilities" (RFC 1034) and "Domain names -
     implementation and specification" (RFC 1035), STD 13,
     November 1987.
4.2. Informative references
     
     [DNSSEC] Don Eastlake, "Domain Name System Security
     Extensions", RFC 2535, March 1999.
     
     [RFC2119] Scott Bradner, "Key words for use in RFCs to
     Indicate Requirement Levels", March 1997, RFC 2119.
     
     [UAX9] Unicode Standard Annex #9, The Bidirectional
     Algorithm.
     
     [UNICODE] The Unicode Standard, Version 3.1.0: The
     Unicode Consortium. The Unicode Standard, Version 3.0.
     Reading, MA, Addison-Wesley Developers Press, 2000. ISBN
     0-201-61633-5, as amended by: Unicode Standard Annex
     #27: Unicode 3.1.
     
     [USASCII] Vint Cerf, "ASCII format for Network
     Interchange", October 1969, RFC 20.
5.   Security Considerations
     
     Security on the Internet partly relies on the DNS. Thus,
     any change to the characteristics of the DNS can change
     the security of much of the Internet.
     
     This memo describes an algorithm that encodes characters
     that are not valid according to STD3 and STD13 into
     octet values that are valid. No security issues such as
     string length increases or new allowed values are
     introduced by the encoding process or the use of these
     encoded values, apart from those introduced by the ACE
     encoding itself.
     
     Domain names are used by users to connect to Internet
     servers. The security of the Internet would be
     compromised if a user entering a single
     internationalized name could be connected to different
     servers based on different interpretations of the
     internationalized domain name.
6.   Authors' Addresses
     
     Patrik Faltstrom
     Cisco Systems
     Arstaangsvagen 31 J
     S-117 43 Stockholm  Sweden
     paf@cisco.com
     
     Paul Hoffman
     Internet Mail Consortium and VPN Consortium
     127 Segre Place
     Santa Cruz, CA  95060  USA
     phoffman@imc.org
     
     Adam M. Costello
     University of California, Berkeley
     idna-spec.amc @ nicemice.net