Internet Draft                                            Jeffrey Altman
draft-ietf-krb-wg-utf8-profile-00.txt                Columbia University
February 12, 2002
Expires in six months

             Stringprep Profile for Kerberos UTF-8 Strings

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

To view the list of Internet-Draft Shadow Directories, see
http://www.ietf.org/shadow.html.

Abstract

This document describes how to prepare UTF-8 strings in order to
increase the likelihood that name input and name comparison work in ways
that make sense for typical users throughout the world. This is a
profile of the stringprep protocol developed in the IDN working group.

1. Introduction

This document specifies processing rules for the strings that users
enter as Kerberos principal names and as input to cryptographic
string-to-key functions. It is a profile of stringprep [STRINGPREP].

This profile defines the following, as required by [STRINGPREP]:

- The intended applicability of the profile: Kerberos principal names
  and input to string-to-key functions

- The character repertoire that is the input and output to stringprep:
  defined in Section 2

- The list of unassigned code points for the repertoire: defined in
  Appendix F

- The mappings used: defined in Section 3

- The Unicode normalization used: defined in Section 4

- The characters that are prohibited as output: defined in Section 5

1.2 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].
Examples in this document use the notation for code points and names
from the Unicode Standard [Unicode3.1] and ISO/IEC 10646 [ISO10646].
For example, the letter "a" may be represented as either "U+0061" or
"LATIN SMALL LETTER A". In the lists of prohibited characters, the "U+"
is left off to make the lists easier to read. The comments for character
ranges are shown in square brackets (such as "[SYMBOLS]") and do not
come from the standards.

2. Character Repertoire

Unicode 3.1 [Unicode3.1] is the repertoire used in this profile. The
reason Unicode 3.1 was chosen instead of a version of ISO/IEC 10646 is
that ISO/IEC 10646 is expected to be updated soon after this document
becomes an RFC. Unicode 3.1 has the exact repertoire that is expected in
the next version of ISO/IEC 10646, and is therefore used here.

3. Mapping

This profile specifies stringprep mapping using the mapping table in
Appendix D. That table includes all the steps described in this section.

Note that the text in this section describes how Appendix D was formed.
It is there for people who want to understand more, but it should be
ignored by implementors. Implementations of this profile MUST map based
on Appendix D, not based on the descriptions in this section of how
Appendix D was created.

3.1 Mapped out

The following characters are simply deleted from the input (that is,
they are mapped to nothing) because their presence or absence should not
make two strings different.

Some characters are only useful in line-based text, and are otherwise
invisible and ignored.

00AD; SOFT HYPHEN
1806; MONGOLIAN TODO SOFT HYPHEN
200B; ZERO WIDTH SPACE
FEFF; ZERO WIDTH NO-BREAK SPACE

Variation selectors and cursive connectors select different glyphs, but
do not bear semantics.
180B; MONGOLIAN FREE VARIATION SELECTOR ONE
180C; MONGOLIAN FREE VARIATION SELECTOR TWO
180D; MONGOLIAN FREE VARIATION SELECTOR THREE
200C; ZERO WIDTH NON-JOINER
200D; ZERO WIDTH JOINER

3.2 Space Character Conversions

The following Unicode spaces are to be mapped to 0020; SPACE:

00A0; NO-BREAK SPACE
2000; EN QUAD
2001; EM QUAD
2002; EN SPACE
2003; EM SPACE
2004; THREE-PER-EM SPACE
2005; FOUR-PER-EM SPACE
2006; SIX-PER-EM SPACE
2007; FIGURE SPACE
2008; PUNCTUATION SPACE
2009; THIN SPACE
200A; HAIR SPACE
202F; NARROW NO-BREAK SPACE
3000; IDEOGRAPHIC SPACE

4. Normalization

This profile specifies using Unicode normalization form KC, as described
in [UAX15].

NOTE: Some discussion on the mailing list suggested that Unicode NFKC
does not properly handle the composition of normalized Hangul strings.
Following the lead of the IDN working group, the Kerberos working group
will not attempt to second-guess the authors of Unicode 3.1 Annex 15
(formerly Technical Report 15) [UAX15], which specifies the
normalization methods, or the Ideographic Rapporteur Group (IRG), which
is the formal subgroup of ISO/IEC JTC1/SC2/WG2 charged with approving
all CJKV elements of the Unicode standards. Such issues are outside the
working group's charter and its area of expertise.

5. Prohibited Output

This profile specifies using the prohibition table in Appendix E.

Note that the subsections below describe how Appendix E was formed. They
are there for people who want to understand more, but they should be
ignored by implementors. Implementations of this profile MUST prohibit
code points based on Appendix E, not based on the descriptions in this
section of how Appendix E was created.

The collected lists of prohibited code points can be found in Appendix E
of this document. The lists in Appendix E MUST be used by
implementations of this specification. If there are any discrepancies
between the lists in Appendix E and the subsections below, the lists in
Appendix E always take precedence.
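As an informal illustration of Sections 3 and 4, the mapping and
normalization steps might be sketched in Python as shown below. This is
not part of the profile. The tables are built from the descriptive lists
in Sections 3.1 and 3.2 rather than from Appendix D, which is the only
normative source. The names MAPPED_OUT, SPACE_LIKE, MAPPING, and
map_and_normalize are hypothetical. Python's unicodedata module follows
a later Unicode version than 3.1, so its NFKC results may differ from
those of a Unicode 3.1 implementation for some code points.

```python
import unicodedata

# Section 3.1: code points mapped to nothing.
# Illustrative only; Appendix D is the normative table.
MAPPED_OUT = {
    0x00AD, 0x1806, 0x200B, 0xFEFF,          # soft hyphens, zero-width
    0x180B, 0x180C, 0x180D, 0x200C, 0x200D,  # variation selectors, joiners
}

# Section 3.2: Unicode spaces mapped to U+0020 SPACE.
# range(0x2000, 0x200B) covers U+2000 through U+200A.
SPACE_LIKE = {0x00A0, *range(0x2000, 0x200B), 0x202F, 0x3000}

# Translation table for str.translate: None deletes, a string replaces.
MAPPING = {cp: None for cp in MAPPED_OUT}
MAPPING.update({cp: " " for cp in SPACE_LIKE})


def map_and_normalize(text: str) -> str:
    """Apply the Section 3 mappings, then NFKC as named in Section 4."""
    mapped = text.translate(MAPPING)
    return unicodedata.normalize("NFKC", mapped)
```

Mapping is applied before normalization, which follows the order in
which this document presents the two steps.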
Some code points listed in one section would also appear in other
sections. Each code point is only listed once in the tables in
Appendix E.

5.1 Control characters

Control characters (or characters with control function) cannot be seen
and can cause unpredictable results when displayed.

0000-001F; [CONTROL CHARACTERS]
007F; DELETE
0080-009F; [CONTROL CHARACTERS]
070F; SYRIAC ABBREVIATION MARK
180E; MONGOLIAN VOWEL SEPARATOR
2028; LINE SEPARATOR
2029; PARAGRAPH SEPARATOR
206A-206F; [CONTROL CHARACTERS]
FFF9-FFFC; [CONTROL CHARACTERS]
1D173-1D17A; [MUSICAL CONTROL CHARACTERS]

5.2 Private use and replacement characters

Because private-use characters do not have defined meanings, they are
prohibited. The private-use characters are:

E000-F8FF; [PRIVATE USE, PLANE 0]
F0000-FFFFD; [PRIVATE USE, PLANE 15]
100000-10FFFD; [PRIVATE USE, PLANE 16]

The replacement character (U+FFFD) has no known semantic definition in a
name, and is often displayed by renderers to indicate "there would be
some character here, but it cannot be rendered". For example, on a
computer with no Asian fonts, a name with three ideographs might be
rendered with three replacement characters.

FFFD; REPLACEMENT CHARACTER

5.3 Non-character code points

Non-character code points are code points that have been allocated in
ISO/IEC 10646 but are not characters. Because they are already assigned,
they are guaranteed not to later change into characters.
FDD0-FDEF; [NONCHARACTER CODE POINTS]
FFFE-FFFF; [NONCHARACTER CODE POINTS]
1FFFE-1FFFF; [NONCHARACTER CODE POINTS]
2FFFE-2FFFF; [NONCHARACTER CODE POINTS]
3FFFE-3FFFF; [NONCHARACTER CODE POINTS]
4FFFE-4FFFF; [NONCHARACTER CODE POINTS]
5FFFE-5FFFF; [NONCHARACTER CODE POINTS]
6FFFE-6FFFF; [NONCHARACTER CODE POINTS]
7FFFE-7FFFF; [NONCHARACTER CODE POINTS]
8FFFE-8FFFF; [NONCHARACTER CODE POINTS]
9FFFE-9FFFF; [NONCHARACTER CODE POINTS]
AFFFE-AFFFF; [NONCHARACTER CODE POINTS]
BFFFE-BFFFF; [NONCHARACTER CODE POINTS]
CFFFE-CFFFF; [NONCHARACTER CODE POINTS]
DFFFE-DFFFF; [NONCHARACTER CODE POINTS]
EFFFE-EFFFF; [NONCHARACTER CODE POINTS]
FFFFE-FFFFF; [NONCHARACTER CODE POINTS]
10FFFE-10FFFF; [NONCHARACTER CODE POINTS]

The non-character code points are listed in the PropList.txt file from
the Unicode database.

5.4 Surrogate codes

The following code points are permanently reserved for use as surrogate
code values in the UTF-16 encoding, will never be assigned to
characters, and are therefore prohibited:

D800-DFFF; [SURROGATE CODES]

5.5 Inappropriate for plain text

The following characters should not appear in regular text.

FFF9; INTERLINEAR ANNOTATION ANCHOR
FFFA; INTERLINEAR ANNOTATION SEPARATOR
FFFB; INTERLINEAR ANNOTATION TERMINATOR
FFFC; OBJECT REPLACEMENT CHARACTER

5.6 Inappropriate for canonical representation

The ideographic description characters allow different sequences of
characters to be rendered the same way, which makes them inappropriate
for host names that must have a single canonical representation.

2FF0-2FFB; [IDEOGRAPHIC DESCRIPTION CHARACTERS]

5.7 Change display properties

The following characters, some of which are deprecated in ISO/IEC 10646,
can cause changes in display or the order in which characters appear
when rendered.
200E; LEFT-TO-RIGHT MARK
200F; RIGHT-TO-LEFT MARK
202A; LEFT-TO-RIGHT EMBEDDING
202B; RIGHT-TO-LEFT EMBEDDING
202C; POP DIRECTIONAL FORMATTING
202D; LEFT-TO-RIGHT OVERRIDE
202E; RIGHT-TO-LEFT OVERRIDE
206A; INHIBIT SYMMETRIC SWAPPING
206B; ACTIVATE SYMMETRIC SWAPPING
206C; INHIBIT ARABIC FORM SHAPING
206D; ACTIVATE ARABIC FORM SHAPING
206E; NATIONAL DIGIT SHAPES
206F; NOMINAL DIGIT SHAPES

5.8 Tagging characters

The following characters are used for tagging text and are invisible.

E0001; LANGUAGE TAG
E0020-E007F; [TAGGING CHARACTERS]

6. Unassigned Code Points in Internationalized Host Names

This profile lists the unassigned code points for Unicode 3.1 in
Appendix F. The list in Appendix F MUST be used by implementations of
this specification. If there are any discrepancies between the list in
Appendix F and the Unicode 3.1 specification, the list in Appendix F
always takes precedence.

7. Security Considerations

ISO/IEC 10646 has many characters that look similar. In many cases,
users of security protocols might do visual matching, such as when
comparing the names of trusted third parties. This profile does nothing
to map similar-looking characters together.

Principal names and passwords are entered by users and used within the
Kerberos protocol. The security of the Internet would be compromised if
a user entering a single internationalized string could be connected to
different servers or denied access based on different interpretations of
internationalized strings.

8. References

[CharModel] Unicode Technical Report #17, Character Encoding Model.

[Glossary] Unicode Glossary.

[ISO10646] ISO/IEC 10646-1:2000. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) --
Part 1: Architecture and Basic Multilingual Plane.

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.
[STRINGPREP] Paul Hoffman and Marc Blanchet, "Preparation of
Internationalized Strings ("stringprep")", draft-hoffman-stringprep,
work in progress.

[Unicode3.1] The Unicode Standard, Version 3.1.0: The Unicode
Consortium. The Unicode Standard, Version 3.0. Reading, MA,
Addison-Wesley Developers Press, 2000. ISBN 0-201-61633-5, as amended
by: Unicode Standard Annex #27: Unicode 3.1.

[UAX15] Mark Davis and Martin Duerst. Unicode Standard Annex #15:
Unicode Normalization Forms, Version 3.1.0.

A. Acknowledgements

This draft is based upon the work of the IETF IDN Working Group's IDN
Nameprep design team.

B. IANA Considerations

This is a profile of stringprep. When it becomes an RFC, it should be
registered in the stringprep profile registry.

C. Author Contact Information

Jeffrey Altman
jaltman@columbia.edu
Columbia University
612 West 115th Street
New York NY 10025

D. Mapping Tables

The following is the mapping table from Section 3. The table has three
columns:

- the character that is mapped from
- the zero or more characters that it is mapped to
- the reason for the mapping

The columns are separated by semicolons. Note that the second column may
be empty, or it may have one character, or it may have more than one
character, with each character separated by a space.

----- Start Mapping Table -----
... to be filled in ...
----- End Mapping Table -----

E. Prohibited Code Point List

----- Start Prohibited Table -----
... to be filled in ...
----- End Prohibited Table -----

NOTE WELL: Software that follows this specification that will be used to
check names before they are put in authoritative name servers MUST add
all unassigned code points to the list of characters that are
prohibited. See Section 6 of [STRINGPREP] for more details.

F. Unassigned Code Point List

----- Start Unassigned Table -----
... to be filled in ...
----- End Unassigned Table -----
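As an informal illustration of Section 5, the ranges described in
Sections 5.1 through 5.8 might be collected into a single check, as in
the Python sketch below. This is not part of the profile. The ranges are
copied from the descriptive subsections rather than from Appendix E,
which is the only normative source. The names PROHIBITED_RANGES and
first_prohibited are hypothetical. Unassigned code points (Section 6 and
Appendix F) are not covered.

```python
# Ranges taken from the descriptive lists in Sections 5.1 to 5.8.
# Illustrative only; Appendix E is the normative table.
PROHIBITED_RANGES = [
    # 5.1 Control characters
    (0x0000, 0x001F), (0x007F, 0x007F), (0x0080, 0x009F),
    (0x070F, 0x070F), (0x180E, 0x180E), (0x2028, 0x2029),
    (0x206A, 0x206F), (0xFFF9, 0xFFFC), (0x1D173, 0x1D17A),
    # 5.2 Private use and replacement characters
    (0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD),
    (0xFFFD, 0xFFFD),
    # 5.3 Non-character code points (plane 0 block)
    (0xFDD0, 0xFDEF),
    # 5.4 Surrogate codes
    (0xD800, 0xDFFF),
    # 5.6 Ideographic description characters
    (0x2FF0, 0x2FFB),
    # 5.7 Change display properties (206A-206F already listed under 5.1)
    (0x200E, 0x200F), (0x202A, 0x202E),
    # 5.8 Tagging characters
    (0xE0001, 0xE0001), (0xE0020, 0xE007F),
]

# 5.3: the last two code points of each of the 17 planes (0 through 16).
PROHIBITED_RANGES += [
    ((plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF) for plane in range(17)
]


def first_prohibited(text):
    """Return the first prohibited code point in text, or None."""
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in PROHIBITED_RANGES):
            return cp
    return None
```

A caller would typically run this check on the output of mapping and
normalization and reject the string if the result is not None.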