IETF                                                          A. Freytag
Internet-Draft                                               ASMUS, Inc.
Intended status: Standards Track                              J. Klensin
Expires: October 1, 2017
                                                             A. Sullivan
                                                               Dyn, Inc.
                                                          March 30, 2017


Those Troublesome Characters: A Registry of Unicode Code Points Needing
         Special Consideration When Used in Network Identifiers
                draft-freytag-troublesome-characters-00

Abstract

   Unicode's design goal is to be the universal character set for all
   applications.  The goal entails the inclusion of very large numbers
   of characters.  The sheer size of the repertoire increases the
   possibility of accidental or intentional use of characters that can
   cause confusion among users, particularly where linguistic context is
   ambiguous, unavailable, or impossible to determine.  A registry of
   code points that can sometimes be especially problematic may be
   useful to guide system administrators in setting parameters for
   allowable code points in an identifier system, and to aid
   applications in creating security aids for users.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 1, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.


Freytag, et al.          Expires October 1, 2017                [Page 1]

Internet-Draft            Worrisome Characters                March 2017


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Unicode code points and identifiers . . . . . . . . . . . . .   2
   2.  Techniques already in place . . . . . . . . . . . . . . . . .   3
   3.  A registry of code points . . . . . . . . . . . . . . . . . .   4
     3.1.  Discussion  . . . . . . . . . . . . . . . . . . . . . . .   4
     3.2.  Registry initial contents . . . . . . . . . . . . . . . .   4
   4.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   5
   5.  Informative References  . . . . . . . . . . . . . . . . . . .   5
   Appendix A.  Discussion Venue . . . . . . . . . . . . . . . . . .   5
   Appendix B.  Change History . . . . . . . . . . . . . . . . . . .   5
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .   6

1.  Unicode code points and identifiers

   Unicode [[CREF1: reference goes here; references mostly careless in
   this draft --ajs]] is a coded character set that aims to support
   every writing system.  Writing systems evolve over time, and are
   sometimes influenced by one another.  As a result, Unicode encodes
   many characters that, to a reader, appear to be the same thing but
   that are encoded differently from one another.  This sort of
   difference is usually not important in written texts, because
   competent readers and writers of a language are able to compensate
   for the selection of the "wrong" character when reading or writing.

   Identifiers that are used in a network or, especially, an Internet
   context present three special problems because of the above feature
   of Unicode:

   1.  In many (perhaps most) uses of identifiers, it is either
       practically difficult or impossible to ascertain the correct
       language context in which the identifier is being used.  In the
       case of an internationalized domain name, for instance, each
       label could in principle represent a new locus of control and a
       new language context; moreover, at least some domains (such as
       the root) have an Internet-wide context and therefore do not
       really have a language context as such.  But even in the case of
       email local-parts, where a sender is likely to know at least one
       of the languages of the receiver, the language context that was
       in use at the time the identifier was created is often unknown.

   2.  Identifiers on the network are in general exact-match systems,
       because an ambiguous identifier is problematic.  Sometimes, but
       not always, there are facilities for aliasing such that multiple
       identifiers can be treated as one identity.  Such techniques are
       in any case just an extension of the exact-match approach, and do
       not work the way a competent human reader does when interpreting
       the "right" character upon seeing the "wrong" one.

   3.  Because there are many characters that may appear to be the same
       (or even, that are defined in such a way that they are all but
       guaranteed to be rendered by the same glyphs), it is fairly easy
       to create an identifier either by accident or on purpose that is
       likely to confuse even competent readers and writers of a
       language.
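
   As an illustrative sketch only (it is not part of any protocol), the
   following Python fragment shows how an exact-match system treats two
   strings that differ only in an apostrophe-like character.  The
   identifiers and function names are hypothetical; the code points
   U+02BC and U+2019 are the pair listed in Table 1.

```python
import unicodedata

# Hypothetical identifiers that differ only in the apostrophe-like
# character; most fonts render the two with the same glyph.
with_modifier_letter = "o\u02bcneill"  # U+02BC MODIFIER LETTER APOSTROPHE
with_quotation_mark = "o\u2019neill"   # U+2019 RIGHT SINGLE QUOTATION MARK


def exact_match(a: str, b: str) -> bool:
    """Compare code point by code point, as an identifier system would."""
    return a == b


def match_after_normalization(a: str, b: str, form: str = "NFKC") -> bool:
    """Compare after Unicode normalization of both strings."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(exact_match(with_modifier_letter, with_quotation_mark))
print(match_after_normalization(with_modifier_letter, with_quotation_mark))
```

   Neither comparison treats the two strings as the same identifier,
   because neither code point has a decomposition that normalization
   could use to unify them.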

2.  Techniques already in place

   In the IDNA mechanism for including Unicode code points [RFC5892], a
   code point is only included when it meets the needs of
   internationalizing domain names as explained in the IDNA framework
   [RFC5894].  For identifiers beyond IDNA, the PRECIS framework
   [RFC7564] generalizes the same basic technique.  In both cases, the
   overall approach is to assume that all characters are excluded, and
   then include characters according to properties derived from the
   Unicode character properties.  This general strategy reduces the
   enormous size of the Unicode repertoire somewhat, excluding some
   characters that are inherently unsuited for use in identifiers.

   The mechanism of inclusion by derived property, however, is
   insufficient to guarantee every included character is safe for use in
   identifiers.  Some characters' properties lead them to be included
   even though they are not obviously good candidates.  In other cases,
   individual characters are good for inclusion, but are problematic in
   combination.  Finally, there are cases where two characters or
   sequences are not problematic by themselves, or if used in
   alternation in the same identifier, but become problematic when their
   choice represents the only difference between otherwise identical
   identifiers.  [[CREF2: Do we want examples here? --ajs]]
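
   A minimal sketch of the point, assuming a greatly simplified
   inclusion rule: the category set below is an assumption made for
   illustration and is not the derived-property algorithm of [RFC5892]
   or [RFC7564].  It starts from "everything excluded" and admits a
   character only by its Unicode General_Category.

```python
import unicodedata

# Simplified stand-in for a derived-property rule (an assumption for
# this sketch): admit only these General_Category values.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def allowed_by_property(ch: str) -> bool:
    """Include a character only if its category is on the allow list."""
    return unicodedata.category(ch) in ALLOWED_CATEGORIES


def identifier_allowed(identifier: str) -> bool:
    """An identifier passes only if every character is included."""
    return all(allowed_by_property(ch) for ch in identifier)


for cp in (0x02BC, 0x0338, 0x0259, 0x2019, 0x0021):
    ch = chr(cp)
    category = unicodedata.category(ch)
    print(f"U+{cp:04X} {category} allowed={allowed_by_property(ch)}")
```

   Under this rule U+02BC, U+0338, and U+0259 are all admitted, since
   their categories (Lm, Mn, and Ll) are ordinary letter and mark
   categories, while the punctuation characters are rejected.  A
   property-based filter therefore cannot by itself identify the code
   points that need special consideration.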

   Operators of systems that create identifiers (whether through a
   registry or through a peer-to-peer identifier negotiation system)
   need to make policies for the characters they will permit.  Operators
   of registries, for instance, can help by adopting good registration
   policies: "Users will benefit if registries only permit characters
   from scripts that are well-understood by the registry or its
   advisers." [RFC5894]  The difficulty for many operators, however, is
   that they do not have the writing system expertise to claim any
   character is "well-understood", and they do not really have the time
   to develop that expertise.

   To address the foregoing, a registry of Unicode code points that
   present special issues for network identifiers can guide protocol
   and operational decisions about whether to permit a given code point
   or sequence of code points.

   In the case of registries, it is not always necessary or desirable to
   exclude characters so much as to guarantee that they are used in a
   strictly mutually exclusive way in otherwise identical identifiers.
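
   One way such a mutually exclusive policy might be sketched is shown
   below.  The folding table, the Registry class, and the sample
   identifiers are hypothetical and are not drawn from any deployed
   registration system; the code point pairs are those listed in
   Table 1.

```python
# Hypothetical folding table: each code point maps to one representative
# of its cross-referenced pair.
FOLD = {
    "\u2019": "\u02bc",  # RIGHT SINGLE QUOTATION MARK -> U+02BC
    "\u01dd": "\u0259",  # LATIN SMALL LETTER TURNED E -> U+0259
}


def blocking_key(identifier: str) -> str:
    """Fold every listed code point to its representative."""
    return "".join(FOLD.get(ch, ch) for ch in identifier)


class Registry:
    """Accepts an identifier only if no folded-equal one is registered."""

    def __init__(self) -> None:
        self._by_key = {}  # blocking key -> identifier as registered

    def register(self, identifier: str) -> bool:
        key = blocking_key(identifier)
        if key in self._by_key:
            return False  # differs from an existing entry only by a pair
        self._by_key[key] = identifier
        return True


registry = Registry()
print(registry.register("\u0259li"))  # first registration is accepted
print(registry.register("\u01ddli"))  # differs only by U+01DD vs U+0259
```

   Neither character is excluded outright in this sketch; whichever
   identifier is registered first blocks the otherwise identical one.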

3.  A registry of code points

3.1.  Discussion

   The registry contains three fields.  The first field, called "Code
   Point(s)", is a code point or sequence of code points.  The second,
   called "Cross Reference", contains zero or more cross references to
   related code points.  The third, called "Explanation", is a free-form
   text field that briefly describes the issue.  Long paragraphs are
   discouraged; a code point that needs such discussion should be
   discussed in a separate document.  The explanation field may contain
   references to documents, so long as the reference is stable.

   The registry is updated by Expert Review.  It ought to contain only
   code points that are significant in identifiers and that need special
   policies (including policies of exclusion).

3.2.  Registry initial contents

   [[CREF3: I'm not sure that U+0259 is strictly phonetic any more
   because it does have an uppercase (which is the reason it's been
   disunified from U+01DD) --af]]


   +-----------+------------+------------------------------------------+
   | Code      | Cross      | Explanation                              |
   | Point(s)  | Reference  |                                          |
   +-----------+------------+------------------------------------------+
   | U+02BC    | U+2019     | Character is indistinguishable from a    |
   |           |            | common punctuation mark                  |
   | U+0338    |            | Not intended for use in creating letters |
   | U+0259    | U+01DD     | Phonetic character                       |
   +-----------+------------+------------------------------------------+

   Table 1: Registry of Unicode Code Points for Special Consideration in
                            Network Identifiers
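
   As a non-normative sketch, an application might hold the three
   registry fields in a small record type and scan an identifier for
   listed code points.  The RegistryEntry type, the find_flagged
   function, and the sample identifier are assumptions made for this
   example; the entries are transcribed from Table 1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RegistryEntry:
    code_points: Tuple[int, ...]       # "Code Point(s)" field
    cross_references: Tuple[int, ...]  # "Cross Reference" field
    explanation: str                   # "Explanation" field


REGISTRY = (
    RegistryEntry(
        (0x02BC,), (0x2019,),
        "Character is indistinguishable from a common punctuation mark",
    ),
    RegistryEntry(
        (0x0338,), (),
        "Not intended for use in creating letters",
    ),
    RegistryEntry((0x0259,), (0x01DD,), "Phonetic character"),
)


def find_flagged(identifier: str) -> List[RegistryEntry]:
    """Return entries whose code point sequence occurs in the identifier."""
    hits = []
    for entry in REGISTRY:
        needle = "".join(chr(cp) for cp in entry.code_points)
        if needle in identifier:
            hits.append(entry)
    return hits


for entry in find_flagged("o\u02bcneill"):
    print(entry.explanation)
```

   A match only signals that a special policy may be needed; whether to
   reject, warn, or apply a mutual-exclusion rule remains a decision
   for the operator.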

4.  IANA Considerations

   The IANA Services Operator is hereby requested to create the Registry
   of Unicode Code Points for Special Consideration in Network
   Identifiers, and to populate it with the values in Section 3.2.  The
   registry is to be updated by Expert Review.

5.  Informative References

   [RFC5892]  Faltstrom, P., Ed., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, DOI 10.17487/RFC5892, August 2010,
              <http://www.rfc-editor.org/info/rfc5892>.

   [RFC5894]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Background, Explanation, and
              Rationale", RFC 5894, DOI 10.17487/RFC5894, August 2010,
              <http://www.rfc-editor.org/info/rfc5894>.

   [RFC7564]  Saint-Andre, P. and M. Blanchet, "PRECIS Framework:
              Preparation, Enforcement, and Comparison of
              Internationalized Strings in Application Protocols",
              RFC 7564, DOI 10.17487/RFC7564, May 2015,
              <http://www.rfc-editor.org/info/rfc7564>.

Appendix A.  Discussion Venue

   This Internet-Draft may be discussed on the IAB Internationalization
   public list: i18n-discuss@iab.org.

Appendix B.  Change History

   Note to RFC Editor: this section should be removed prior to
   publication as an RFC.


   00:

      *  Initial version

Authors' Addresses

   Asmus Freytag
   ASMUS, Inc.

   Email: asmus@unicode.org


   John C Klensin
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA  02140
   U.S.A.

   Email: john-ietf@jck.com


   Andrew Sullivan
   Dyn, Inc.
   150 Dow St
   Manchester, NH  03101
   U.S.A.

   Email: asullivan@dyn.com

