IETF                                                          A. Freytag
Internet-Draft                                               ASMUS, Inc.
Intended status: Standards Track                              J. Klensin
Expires: October 1, 2017                                     A. Sullivan
                                                               Dyn, Inc.
                                                          March 30, 2017


    Those Troublesome Characters: A Registry of Unicode Code Points
     Needing Special Consideration When Used in Network Identifiers
                draft-freytag-troublesome-characters-00

Abstract

   Unicode's design goal is to be the universal character set for all
   applications.  The goal entails the inclusion of very large numbers
   of characters.  The sheer size of the repertoire increases the
   possibility of accidental or intentional use of characters that can
   cause confusion among users, particularly where linguistic context
   is ambiguous, unavailable, or impossible to determine.  A registry
   of code points that can sometimes be especially problematic may be
   useful to guide system administrators in setting parameters for
   allowable code points in an identifier system, and to aid
   applications in creating security aids for users.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 1, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

Freytag, et al.
Expires October 1, 2017                                         [Page 1]

Internet-Draft            Worrisome Characters                March 2017

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Unicode code points and identifiers
   2.  Techniques already in place
   3.  A registry of code points
     3.1.  Discussion
     3.2.  Registry initial contents
   4.  IANA Considerations
   5.  Informative References
   Appendix A.  Discussion Venue
   Appendix B.  Change History
   Authors' Addresses

1.  Unicode code points and identifiers

   Unicode [[CREF1: reference goes here; references mostly careless in
   this draft --ajs]] is a coded character set that aims to support
   every writing system.  Writing systems evolve over time, and are
   sometimes influenced by one another.  As a result, Unicode encodes
   many characters that, to a reader, appear to be the same thing but
   are encoded differently from one another.  This sort of difference
   is usually not important in written texts, because competent readers
   and writers of a language are able to compensate for the selection
   of the "wrong" character when reading or writing.
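
   As a rough illustration of characters that look alike but are
   encoded differently, the sketch below compares U+02BC and U+2019
   using Python's standard unicodedata module.  The two labels are
   hypothetical examples and are not drawn from any real registry.

```python
import unicodedata

# Two apostrophe-like characters that many fonts draw with the same glyph.
MODIFIER_APOSTROPHE = "\u02bc"  # MODIFIER LETTER APOSTROPHE
RIGHT_QUOTE = "\u2019"          # RIGHT SINGLE QUOTATION MARK


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Hypothetical labels that differ only in which apostrophe was typed.
label_a = "o" + MODIFIER_APOSTROPHE + "neil"
label_b = "o" + RIGHT_QUOTE + "neil"

if __name__ == "__main__":
    print(describe(MODIFIER_APOSTROPHE))
    print(describe(RIGHT_QUOTE))
    # A code-point-by-code-point comparison treats the labels as distinct,
    # even though a reader would likely see the same text.
    print(label_a == label_b)
```

   A human reader would probably treat both labels as the same name,
   while a comparison of code points does not.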

   Identifiers that are used in a network or, especially, an Internet
   context present three special problems because of the above feature
   of Unicode:

   1.  In many (perhaps most) uses of identifiers, it is either
       practically difficult or impossible to ascertain the correct
       language context in which the identifier is being used.  In the
       case of an internationalized domain name, for instance, each
       label could in principle represent a new locus of control and a
       new language context; moreover, at least some domains (such as
       the root) have an Internet-wide context and therefore do not
       really have a language context as such.  But even in the case of
       email local-parts, where a sender is likely to know at least one
       of the languages of the receiver, the language context that was
       in use at the time the identifier was created is often unknown.

   2.  Identifiers on the network are in general exact-match systems,
       because an ambiguous identifier is problematic.  Sometimes, but
       not always, there are facilities for aliasing such that multiple
       identifiers can be treated as a single identity.  Such
       techniques are in any case just an extension of the exact-match
       approach, and do not work the way a competent human reader does
       when interpreting the "right" character upon seeing the "wrong"
       one.

   3.  Because there are many characters that may appear to be the same
       (or even, that are defined in such a way that they are all but
       guaranteed to be rendered by the same glyphs), it is fairly easy
       to create an identifier, either by accident or on purpose, that
       is likely to confuse even competent readers and writers of a
       language.

2.  Techniques already in place

   In the IDNA mechanism for including Unicode code points [RFC5892], a
   code point is only included when it meets the needs of
   internationalizing domain names as explained in the IDNA framework
   [RFC5894].
   For identifiers beyond IDNA, the PRECIS framework [RFC7564]
   generalizes the same basic technique.  In both cases, the overall
   approach is to assume that all characters are excluded, and then
   include characters according to properties derived from the Unicode
   character properties.  This general strategy reduces the enormous
   size of the Unicode database somewhat, and avoids including some
   characters that are necessarily unsuited for use as identifiers.

   The mechanism of inclusion by derived property, however, is
   insufficient to guarantee that every included character is safe for
   use in identifiers.  Some characters' properties lead them to be
   included even though they are not obviously good candidates.  In
   other cases, individual characters are good for inclusion, but are
   problematic in combination.  Finally, there are cases where two
   characters or sequences are not problematic by themselves, or if
   used in alternation in the same identifier, but become problematic
   when their choice represents the only difference between otherwise
   identical identifiers.  [[CREF2: Do we want examples here? --ajs]]

   Operators of systems that create identifiers (whether through a
   registry or through a peer-to-peer identifier negotiation system)
   need to make policies for characters they will permit.  Operators of
   registries, for instance, can help by adopting good registration
   policies: "Users will benefit if registries only permit characters
   from scripts that are well-understood by the registry or its
   advisers." [RFC5894]  The difficulty for many operators, however, is
   that they do not have the writing system expertise to claim any
   character is "well-understood", and they do not really have the time
   to develop that expertise.
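
   The limits of inclusion by derived property can be sketched in a few
   lines of Python.  Everything below is an illustrative assumption and
   not the IDNA or PRECIS algorithm: ALLOWED_CATEGORIES is a toy rule
   based only on the Unicode General_Category, and EXCLUSIVE_FOLD,
   collision_key and conflicting_registrations are hypothetical names
   for one way an operator might keep two look-alike code points
   mutually exclusive in otherwise identical identifiers.

```python
import unicodedata

# Assumption: a toy stand-in for "inclusion by derived property".  The
# actual IDNA and PRECIS rules are considerably more detailed.
ALLOWED_CATEGORIES = {"Ll", "Lm", "Lo", "Mn", "Nd"}


def included_by_property(ch: str) -> bool:
    """Admit a code point purely on its Unicode General_Category."""
    return unicodedata.category(ch) in ALLOWED_CATEGORIES


# Hypothetical operator policy: U+01DD and U+0259 may both be used, but
# never as the only difference between two identifiers.  Folding one
# onto the other yields a key for detecting such collisions.
EXCLUSIVE_FOLD = {"\u01dd": "\u0259"}


def collision_key(identifier: str) -> str:
    """Map each code point to its representative, if it has one."""
    return "".join(EXCLUSIVE_FOLD.get(ch, ch) for ch in identifier)


def conflicting_registrations(candidate: str, registered: list) -> list:
    """Return registered identifiers that differ from the candidate only
    in the choice between mutually exclusive code points."""
    key = collision_key(candidate)
    return [r for r in registered
            if r != candidate and collision_key(r) == key]


if __name__ == "__main__":
    for cp in ("\u02bc", "\u2019", "\u0338", "\u0259", "\u01dd"):
        print(f"U+{ord(cp):04X}", unicodedata.category(cp),
              included_by_property(cp))
    print(conflicting_registrations("b\u01ddt", ["b\u0259t", "bat"]))
```

   Under this toy rule U+02BC is admitted because it is a modifier
   letter, while the punctuation mark U+2019 that it resembles is not,
   so a property test alone says nothing about the resemblance.  The
   collision check leaves both U+0259 and U+01DD usable and rejects
   only a second identifier that differs from an existing one in that
   choice alone.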

   To help with the foregoing, a registry of Unicode code points that
   present special issues for network identifiers can help guide
   protocol and operating decisions about whether to permit a given
   code point or sequence of code points.  In the case of registries,
   it is not always necessary or desirable to exclude characters so
   much as to guarantee that they are used in a strictly mutually
   exclusive way in otherwise identical identifiers.

3.  A registry of code points

3.1.  Discussion

   The registry contains three fields.  The first field, called "Code
   Point(s)", is a code point or sequence of code points.  The second,
   called "Cross Reference", contains zero or more cross references to
   related code points.  The third, called "Explanation", is a
   free-form text field that briefly describes the issue.  Long
   paragraphs are discouraged; a code point that needs such discussion
   should be discussed in a document somewhere.  The explanation field
   may contain references to documents, so long as the reference is
   stable.

   The registry is updated by Expert Review.  It ought to contain only
   code points that are significant in identifiers and that need
   special policies (including policies of exclusion).

3.2.  Registry initial contents

   [[CREF3: I'm not sure that 0259 is strictly phonetic any more
   because it does have an uppercase (which is the reason it's been
   disunified from U+01DD) --af]]

   +-----------+------------+------------------------------------------+
   | Code      | Cross      | Explanation                              |
   | Point(s)  | Reference  |                                          |
   +-----------+------------+------------------------------------------+
   | U+02BC    | U+2019     | Character is indistinguishable from a    |
   |           |            | common punctuation mark                  |
   | U+0338    |            | Not intended for use in creating letters |
   | U+0259    | U+01DD     | Phonetic character                       |
   +-----------+------------+------------------------------------------+

     Table 1: Registry of Unicode Code Points for Special Consideration
                           in Network Identifiers

4.
IANA Considerations

   The IANA Services Operator is hereby requested to create the
   Registry of Unicode Code Points for Special Consideration in Network
   Identifiers, and to populate it with the values in Section 3.2.  The
   registry is to be updated by Expert Review.

5.  Informative References

   [RFC5892]  Faltstrom, P., Ed., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, DOI 10.17487/RFC5892, August 2010.

   [RFC5894]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Background, Explanation, and
              Rationale", RFC 5894, DOI 10.17487/RFC5894, August 2010.

   [RFC7564]  Saint-Andre, P. and M. Blanchet, "PRECIS Framework:
              Preparation, Enforcement, and Comparison of
              Internationalized Strings in Application Protocols",
              RFC 7564, DOI 10.17487/RFC7564, May 2015.

Appendix A.  Discussion Venue

   This Internet-Draft may be discussed on the IAB Internationalization
   public list: i18n-discuss@iab.org.

Appendix B.  Change History

   Note to RFC Editor: this section should be removed prior to
   publication as an RFC.

   00:

   *  Initial version

Authors' Addresses

   Asmus Freytag
   ASMUS, Inc.

   Email: asmus@unicode.org


   John C Klensin
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA  02140
   U.S.A.

   Email: john-ietf@jck.com


   Andrew Sullivan
   Dyn, Inc.
   150 Dow St
   Manchester, NH  03101
   U.S.A.

   Email: asullivan@dyn.com