INTERNET-DRAFT Soobok Lee draft-leegim-idn-hangeulchar-00.txt GyeongSeog Gim Expires 2001-Dec-27 2001-Jun-27 Hangeul NAMEPREP considerations version 1.0 Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html Distribution of this document is unlimited. Please send comments to the authors or to the idn working group at idn@ops.ietf.org. Abstract This document suggests Hangeul-specific NAMEPREP recommendations. It defines : - mapping tables for half-width jamo and enclosed jamo - compatibility Hangeul jamo block to be excluded from compatibility decomposition in normalization step - criteria for determining invalid syl-ipf jamo sequence - prohibited hangul filler character in KC norm output. Contents Overview Background: UCS Hangeul Hangeul Canonical Composition Hangeul Compatibility Decomposition Summarized Recommendations Comments on security implication of inter-lingual similarities Security considerations References A1. Acknowledgements A2. Authors A3. the mapping table for enclosed jamo A4. the mapping table for half-width jamo Overview A user can enter a domain name into an application program in a myriad of fashions and the characters entered in the domain name may or may not be those that are allowed in internationalized host names. Thus, there must be a way to normalize the user's input before the name is resolved in the DNS, which is the rationale for NAMEPREP. NAMEPREP design goals are : - to allow users to enter host names in applications and have the highest chance of getting the name correct. The user should not be limited to only entering exactly the characters that might have been used for domain name registration, but be able to enter characters that can be unambiguously normalized to characters in the registered domain name. - to prohibit as few characters as possible that might be used in the future and in the various contexts - to allow the widest possible set of host names as long as those host names do not cause other problems, such as conflict with other standards. The NAMEPREP process to prepare internationalized host names for use in the DNS includes the following stages [NAMEPREP]: - stage1 : mapping characters to other characters, such as to change their case, mapping out some meaningless characters - stage2 : normalizing characters using normalization form KC. KC form consists of two steps detailed in [UTR15] - compatibility decomposition - canonical composition - stage3 : excluding characters that are prohibited from appearing in internationalized host names This draft defines special Hangeul character mappings and exceptions in applying KC normalization. And this draft also defines some prohibited Hangeul characters and sequences so that Hangeul can be used safely in Internet identifiers such as IDN. The content of this draft is subject to change with further discussions and studies. Background : UCS Hangeul Korean Hangeul syllables are formed from a set of Hangeul letters, called jamo in Korean, in a regular fashion. The ISO/IEC 10646 (=Unicode Standard) contains both the complete set of precomposed modern Hangeul syllable blocks and the set of syl-ipf Hangeul jamo (= conjoining jamo in [UNICODE] ). This set of syl-ipf jamo can be used to encode all modern and old syllable blocks. For a description of syl-ipf Hangeul jamo behavior and precomposed Hangeul Syllables, see [UNICODE]. Hangeul jamo are divided into three classes: choseong (leading consonants), jungseong(vowels), and jongseong(trailing consonants). In the following paragraphs, these classes are abbreviated as L (leading consonant), V(vowel), and T (trailing consonant). And for use in composition, two invisible filler characters act as placeholders for choseong or jungseong: U+115f (Hangeul choseong filler) and U+1160 (hangeul jungseong filler). The UCS/Unicode contains a set of Hangeul Compatibility jamo (U+3130~U+318F) which consists of spacing, nonsyl-ipf Hangeul consonants and vowel elements. These characters are provided solely for compatibility with the KS X 1001 (formerly KS C 5601) standard. Unlike the characters found in the Hangeul jamo block (U+1100 .. U+11FF), the compatibility jamo characters have no syl-ipf semantics. The UCS/Unicode Standard also contains 52 half-width modern Hangeul jamo in the halfwidth and fullwidth forms (U+FFA0 .. U+FFDC) block and enclosed Hangeul syllables and jamo in the enclosed CJK letters and months block (U+3200 .. U+32FF). Enclosed ones are consisted of parenthesized jamo and circled jamo. Hangeul canonical composition Modern Hangeul syllables can be expressed with either two or three jamo, either in the form consonant + vowel or in the form consonant + vowel + consonant. There are 19 possible leading (initial) consonants (choseong), 21 vowels (jungseong), and 27 trailing (final) consonants (jongseong). Thus there are 399 possible two-jamo syllables and 10,773 possible three-jamo syllables, for a total of 11,172 modern Hangeul syllables. Each of the Hangeul syllables may be encoded by an equivalent sequence of syl-ipf jamo; however, the converse is not true because thousands of archaic Hangeul syllables may be encoded only as a sequence of syl-ipf jamo. Implementaions that use a syl-ipf jamo encoding are able to represent these archaic Hangeul syllables. The Hangeul syllables can be derived from syl-ipf jamo by a regular process of composition. The algorithm that maps a sequence of syl-ipf jamo to the encoding point for a Hangeul syllable is detailed in [UNICODE]. In canonical composition, the syl-ipf jamo sequence for modern Hangeul syllable is transformed into the modern Hangeul syllable, but the sequence for archaic Hangeul syllable and the invalid jamo sequence (defective combining character sequence) are preserved in this process. In normalization form KC, all input sequence of code points go through this canonical composition [UTR15]. If any invalid jamo sequence is detected after KC normaliation stage, as it is not displayable correctly and distinguishably, the sequence should be prohibited from being an identifier. Whether a syl-ipf jamo sequence is valid or not can be determined according to the criteria detailed in [UNICODE]. Hangeul compatibility decomposition In normalization form KC, all input code sequence go through this compatibility decomposition and then canonical composition. Every Hangeul compatibility jamo and half-width jamo have compatibility equivalent Hangeul syl-ipf jamo defined in [UNICODE_CHART]. But this equivalence does violate the semantics and combining rules for compatibility jamo sequence in [KSC5601] from which UCS compatibility jamo came. In [KSC5601], a valid compatibility jamo sequence should start with a filler followed by choseong,jungseong and jongseong (or filler) to denote a Hangeul syllable. If the sequence does not fulfill this criterion, its jamo should remain unchaged as compatibility jamo. The same for half-width Hangeul jamo. Current compatibility decomposition blindly transforms compatibility jamo sequence even without a leading filler on a jamo by jamo basis. For example, a valid jamo sequence "filler gi-eog a gi-eug" (U+3164 U+3131 U+314F U+3131) denoting a Hangeul syllable "gag"(U+AC01) is errornously transformed into "jungsong_filler chosung_gi-eog jungseong_a chosung_gi-eog" (U+1160 U+1100 U+1161 U+1100) that are canonically composed into "syllable_ga choseong_gi-eog" (U+AC00 U+1100) which are false. If this could be avoided, NAMEPREP should exclude compatibility jamo and half-width jamo from its compatibility decomposition step. And, only valid compatibility jamo sequence should be recognized and transformed into a syl-ipf jamo sequence at the mapping step before KC normalization step in NAMEPREP. Hangeul consonant sequence can be used as abbreviated form of long Hangeul syllables sequence that represent Hangeul business name. And, there may be future need to represent Hangeul syllables in compatibility jamo sequences for alternative syllable writing/ displaying scheme. In NAMEPREP KC normalization and its inner compatibility decomposition, each parethesized Hangeul jamo is transformed into its compatibility equivalent character sequence consisted of one pair of parentheses with inner Hangeul jamo and then that sequence is treated as an invalid domain due to including prohibited parenthses. Each parethesized Hangeul syllable is transformed into its compatibility equivalent character sequence consisted of one pair of parentheses with inner Hangeul syllable and then that sequence is treated as an invalid domain due to prohibited parentheses. So, we have no suggestion on these to-be-prohibited parenthesized jamo and syllables. In NAMEPREP KC normalization and its inner compatibility decomposition, Circled Hangeul jamo is transformed into its compatibility equivalent Hangeul jamo which is not appropriate in IDN context, and preferrably, this NAMEPREP process should map this circled one into the corresponding compatibility Hangeul jamo before KC normalization to bypass this inappropriate compatibility decomposition. Circled Hangeul syllable is transformed into its compatibility equivalent Hangeul Syllable and raises no problem. Summarized Recommendations KC normalization employed in NAMEPREP process does not preserve some Hangeul code semantics and so we recommend the following additional NAMEPREP actions for Hangeul codes: * Stage 1: mapping - circled Hangeul jamo = map into the corresponding Hangeul compatibility jamo code range: U+3160 ~ U+326D mapping table detailed in appendix 3. - half-width Hangeul jamo = map into the corresponding Hangeul compatibility jamo code range: U+FFA1 ~ U+FFDC mapping table detailed in appendix 4. - transform compatibility jamo sequence into syl-ipf jamo sequence with leading filler(U+3164) removed = if and only if the sequence is of filler+ L+ V+ T (or filler) form. = each resulting jamo with intended choseong or jongseong semantics implied in the input sequence * Stage 2: KC normalization - compatibility decomposition = exclude compatibility Hangeul jamo; preserve them code range: U+3130 ~ U+318F * Stage 3: prohibitions - prohibit invalid syl-ipf Hangeul jamo sequences = return error if not meaningful LV or LVT sequence - compatibility Hangeul filler (U+3164) not combined = return error Comments on security implication of inter-lingual similarities We have found many similarities between hangeul jamo and other language scripts like japanese katakana and latin. To list some of them: - hangeul jamo gi-eog and katakana hu - hangeul jamo mi-eum and katakana ro - hangeul jamo i-eung and latin 'o' - hangeul jamo ji-euth and katakana su - hangeul jamo ki-eog and katakana wo - hangeul jamo a and katakana to - hangeul syllable ma and katakana ro-to - hangeul syllable ja and katakana su-to - hangeul syllable ga and katakana hu-to - hangeul syllable i and digits '01' Some hangeul domains similiar to katakana domains can mislead some japanese to believe hangeul hostnames or hangeul email addresses are the japanese ones they trust. To mitigate these inherent security problems, there should be well-prepared registration/dispute resolution policy that can be enforced to every zone masters (including root zone and its lower-level zones) and every email account masters. Of course, whether this is feasible or not is beyond NAMEPREP scope. Security considerations This suggestion improves IDN security by prohibiting/correcting non-displayable or invalid hangeul syllables/sequences in IDN. References [IDNREQ] Requirements of Internationalized Domain Names http://www.ietf.org/internet-drafts/draft-ietf-idn-requirements-08 .txt [UNICODE] The Unicode Consortium, "The Unicode Standard", http://www.unicode.org/unicode/standard/standard.html [UNICODE_CHART] THe Unicode Code Charts http://www.unicode.org/charts/ [IDNA] Patrik Falstrom, Paul Hoffman, "Internationalizing Host Names In Applications (IDNA)", http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-02.txt [NAMEPREP] Paul Hoffman, Marc Blanchet, "Preparation of Internationalized Host Names", Feb 2001, http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-03.txt [UTR15] Mark Davis and Martin Duerst. Unicode Normalization Forms. Unicode Technical Report;15. http://www.unicode.org/unicode/reports/tr15/ [VERSION] M Blanchet "Handling versions of internationalized domain names protocols", http://www.ietf.org/internet-drafts/draft-ietf-idn-version-00.txt [ISO10646] ISO/IEC, Information Technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane, Oct. 2000, with amendments. [KSC5601] Korean Standard KS C 5601- 1987 A1. Acknowledgements Dongman Lee and Yangwoo Ko made valuable contributions to narrowing down the issues of the prohibition and preservation of some hangeul characters. Thank Mark Davis for his advice on useful UNICODE reference documents. A2. Authors Soobok Lee Postel Servies, Inc. http://www.postel.co.kr Tel: +82-11-9774-2737 GyeongSeog Gim Department of Computer Engineering Pusan National University Republic of Korea Tel: +82-51-510-2292 A3. the mapping table for enclosed jamo in the format of [VERSION] version=1.0 3260;1.0;3131 3261;1.0;3134 3262;1.0;3137 3263;1.0;3139 3264;1.0;3141 3265;1.0;3142 3266;1.0;3145 3267;1.0;3147 3268;1.0;3148 3269;1.0;314A 326A;1.0;314B 326B;1.0;314C 326C;1.0;314D 326D;1.0;314E A4. the mapping table for half-width jamo in the format of [VERSION] version=1.0 FFA1;1.0;3131 FFA2;1.0;3132 FFA3;1.0;3133 FFA4;1.0;3134 FFA5;1.0;3135 FFA6;1.0;3136 FFA7;1.0;3137 FFA8;1.0;3138 FFA9;1.0;3139 FFAA;1.0;313A FFAB;1.0;313B FFAC;1.0;313C FFAD;1.0;313D FFAE;1.0;313E FFAF;1.0;313F FFB0;1.0;3140 FFB1;1.0;3141 FFB2;1.0;3142 FFB3;1.0;3143 FFB4;1.0;3144 FFB5;1.0;3145 FFB6;1.0;3146 FFB7;1.0;3147 FFB8;1.0;3148 FFB9;1.0;3149 FFBA;1.0;314A FFBB;1.0;314B FFBC;1.0;314C FFBD;1.0;314D FFBE;1.0;314E FFC2;1.0;314F FFC3;1.0;3150 FFC4;1.0;3151 FFC5;1.0;3152 FFC6;1.0;3153 FFC7;1.0;3154 FFCA;1.0;3155 FFCB;1.0;3156 FFCC;1.0;3157 FFCD;1.0;3158 FFCE;1.0;3159 FFCF;1.0;315A FFD2;1.0;315B FFD3;1.0;315C FFD4;1.0;315D FFD5;1.0;315E FFD6;1.0;315F FFD7;1.0;3160 FFDA;1.0;3161 FFDB;1.0;3162 FFDC;1.0;3163