Internet Draft                                                  Liana Ye
draft-liana-idn-map-00.txt                                       Y&D ISG
Sept. 11, 2001
Expires in six months (Mar. 2002)

                 IDN Code Exchange Mapping Structure

Status of this memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

The client side of IDN [IDN] has to accommodate users of different scripts, with many existing national and international standards and different clients and local servers. The server side of IDN is a proven, stable, US-ASCII-only DNS system. This document describes IDN-map, a tabulated exchange structure for national standards that is based on the international Unicode standard.

Contents

   1. Introduction
   2. IDN Standards Code Exchange Table
      2.1 Structure of IDN Code Exchange Table
      2.2 Access of IDN Code Exchange Table
   3. Version control and Language tags of IDN Code Exchange Table
      3.1 Language Tags
      3.2 Language Tag File Format
      3.3 Identification of a Tag of an Input String
   4. Interface with IDN Code Exchange Map
      4.1 Language Specific Modules
      4.2 Script Specific Canonicalization
      4.3 Language Specific Normalization and Presentation
      4.4 Language Tagged IDN Label Conversions
      4.5 Uniform Idn-label Protocol
   5. Preferred Embodiments of IDN Code Exchange Map

1. Introduction

Users ranging from international travelers, to middle school students on the Tibetan Plateau, to librarians in Washington D.C. have wished for years to have direct access to the Internet from their familiar desktops in their native languages, and the Internet community has been trying to bring that service to users in many locations around the world. Some servers have successfully demonstrated the concept for such a service. For example, http://www.3721.com provides Pinyin [Pinyin] based mnemonic registration for Chinese users and allows click-through on the user's screen from a Chinese URL [URL] window. This document suggests a client-side structure, supported cooperatively by servers, for such direct and speedy universal URL access for all users on the Internet.

1.1 Context

Symbols of natural languages are open sets, for CJK [CJK] as well as for English [ALPHBET]. For example, Chinese continuously discovers characters, "Zi", to add to its character set, which already exceeds 100,000 characters. In the United States, many European symbols appear in American names, so the symbol set exceeds the original 26 letters of English. Combinations of symbols are called "words" in English, "ci" in Chinese, and "strings" in terms of domain names. This document focuses on a mapping structure, called IDN-map, for symbols, which are referred to as UCS [UCS] "Code Points". IDN-map specifies the relationships among various national symbol standards in terms of code points, to support accurate, speedy combinations of symbols for Internet domain name identification.

Because the UCS character set is multi-script and serves multi-language users, IDN-map has to address, besides equally speedy access, three additional issues that arise from the nature of an open symbol set. The first issue is allowing more mixed script use once there is enough experience in dealing with existing mixed script use.
The second issue is allowing new symbols to be added to the table in the future. The third issue is to let deprecated local standards drop out without affecting the international structure and the life expectancy of IDN-map.

IDN-map needs two key mechanisms to accommodate the above issues in addition to the current [nameprep] proposal. The first key mechanism is a traffic signal, called a "Language Tag" [RFC 3066], since users use different spoken languages as defined in [ISO 639]. These languages are expressed with symbols specified in UCS [ISO 10646], as well as in ASCII [ASCII], GB [GB], BIG5 [BIG5], JIS [JIS], KSC [KSC], and ISCII [ISCII]. The users dictate which symbols are to be used and from where in the UCS. Legitimate use exhibits very high locality, and the region used is here called the "Script Range" of a specific language tag. A script range may include more than one code block of UCS, so that it permits the deployment of IDN in multiple stages and allows a script range to be expanded in the future for mixed script use.

The second key mechanism of IDN-map is a two-level symbol switching mechanism, called language tagged ASCII compatible character encoding, or T-ACE for short. The T, the tag part, is the switch between different spoken languages, which may imply various national and international standards including ASCII. The ACE part is the switch among symbols within the same script range. The ACE part of the switch is a massive one for the Chinese tag: it ranges from 2,000 symbols for student readers of "People's Daily" to 50,000 and above for a librarian, with many variants in between, not including Japanese, Korean and other spoken languages. To provide a switching system for such varied use of symbols, each switch in the system needs to be labeled for a human. It needs to be a mnemonic switch, and it needs to be scalable for different user groups too. The proposed ACE is a mnemonic encoding scheme called StepCode [StepCode].
With T-ACE in a multiple standard tabulation, simple uniform keyboard control of a domain name identifier becomes possible.

1.2 Author's Disclaimer

The author is not associated in any way, either as a member or as a consultant, with any of the above mentioned standards, standards bodies, or any other commercially operated entities, and cannot be responsible for any consequence arising from either inclusion or exclusion of any names mentioned herein.

1.3 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and "MAY" in this document are to be interpreted as described in [RFC2119].

Examples in this document use the notation from the Unicode Standard [Unicode3] as well as the ISO 10646 names. For example, the letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER A". Examples also use octet notation from national code exchange standards to represent a Unicode character, such as "5167".

1.4 IDN summary

IDN-Map is a basic international code exchange table to support interoperability across various existing clients and local servers on the Internet. It accommodates existing user requirements, engineering feasibility, and DNS stability and security, and provides a bridge from existing user platforms to new applications based on the table of the Unicode standard.

2. IDN Standards Code Exchange Table

The character set in UCS is a superset of many national code exchange standards and also contains many symbols outside those standards. A vast number of existing applications built on such national code exchange standards are highly crafted to serve large groups of language specific users [UNAME]. While these existing local standards are not compatible with each other, they are compatible with ASCII: any of their symbols can be expressed with ASCII alphanumeric characters. Through such an alphanumeric expression, a mapping from a symbol in a local standard to a code point in UCS is easily achievable.
2.1 Structure of IDN Code Exchange Table

Due to the IDN name preparation requirement [IDN req], many of the symbols used in common names need to be normalized and canonicalized [nameprep] before they can be used as IDN identifiers. Thus the IDN Code Exchange Table has two columns to satisfy this primary requirement, and a third column holding the corresponding T-ACE identifier for each UCS IDN identifier, for the primary language users of those identifiers. The three columns are called Unicode-full-section, Unicode-primary-fold, and ACE-primary tagged, abbreviated as U-s, U-p, and A-p respectively, as in the following example:

   U-s     U-p     A-p
   U+0041  U+0061  a      (Latin Letter A case folding)
   U+2fc2  U+2ee5  yv2    (Han character fish for Chinese case folding)

The three columns define a primary IDN code exchange table, referred to as the "IDN Primary Map" hereafter.

When the same UCS codepoints have users of more than one spoken language, one or more secondary languages are added to the primary map. For example, when the Japanese Kanji "Fish", corresponding to the same UCS code point U+2fc2, is added to the above map, it becomes:

   U-s     U-p     A-p    U-j     A-j
   U+0041  U+0061  a                     (Latin Letter A case folding)
   U+2fc2  U+2ee5  yv2    U+2fc2  uo     (Han character fish)

The U-j column can equally be U-k, for Unicode tagged as Korean, and a Hangul code point may appear there just as well. Alternatively, Korean can be two additional columns added to the secondary map.

2.1.1 IDN-Map that Never Shrinks

It is REQUIRED that an IDN Primary Map contains a column of all permitted symbols used in IDN names, sorted by UCS code point, called the "UCS input codepoint". It is also REQUIRED that an IDN Primary Map contains a column of corresponding IDN identifier symbols, called UCS-folded codepoints, and a column of corresponding ASCII symbols permitted by [STD13] for use in DNS identifiers, called DNS-codepoints. Data items in an IDN Primary Map MUST NOT be removed and MUST NOT be altered in any way once the map is deployed.
It is REQUIRED that after a secondary language is added to an IDN primary map, the items in such an addition MUST NOT be removed and MUST NOT be altered in any way once it is deployed. The additional columns of a secondary language are called an IDN secondary map, and each item in a secondary map MUST correspond to its primary map entry in the associated UCS input codepoints.

2.1.2 Equivalent Symbol Set Mapping

Equivalent symbol sets within a script are common. It is important to identify such equivalence in the context of IDN identifiers, where the same entity may be named with semantically equivalent symbols, especially since IDN offers far more potential for the use of symbols from mixed scripts. IDN-map is a convenient vehicle for carrying equivalent symbol sets by providing more referencing columns, called the Equites Map and abbreviated as U-e, in addition to the IDN Primary Map, as follows:

   U-e     U-s     U-p     A-p    U-j     A-j
   U+0410  U+0041  U+0061  a                     (Latin Letter A case folding)
   U+????  U+2fc2  U+2ee5  yv2    U+2fc2  uo     (Han character fish)

or in IDN Primary Map format:

   U-s   U-p   A-p
   a     a     a
   a'    a     a
   a"    a     a

Access support for the Equites Map is NOT RECOMMENDED for the applications discussed in this document, since the focus here is on easing code exchange for the largest common denominator.

2.2 Access of IDN Code Exchange Table

Several access methods can be supported with the IDN code exchange map. The two described here are universal access and local access, where local access MAY be deprecated in the future when universal access becomes direct global access for everyone in a particular local area.

2.2.1 Universal Access

The IDN Primary Map offers two types of access:

   1) Unicode input through a screen selection or URL buffer, which receives a DNS codepoint in favor of the primary language users; this map is called "idn-umap";
   2) access through a DNS codepoint, which retrieves the corresponding UCS codepoint for display.
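The two access types can be illustrated with a minimal sketch, assuming the two example rows of Section 2.1. The names MapRow, BY_UCS, BY_DNS, ucs_to_dns and dns_to_ucs are hypothetical and not part of this proposal, and returning the folded U-p codepoint for display is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MapRow:
    """One row of a hypothetical IDN Primary Map (columns U-s, U-p, A-p)."""
    u_s: int  # UCS input codepoint (U-s)
    u_p: int  # UCS folded codepoint (U-p)
    a_p: str  # tagged ACE, i.e. the DNS codepoint (A-p)


# The two example rows from Section 2.1.
ROWS: List[MapRow] = [
    MapRow(0x0041, 0x0061, "a"),
    MapRow(0x2FC2, 0x2EE5, "yv2"),
]

# Access type 1: index by UCS input codepoint.
BY_UCS: Dict[int, MapRow] = {row.u_s: row for row in ROWS}
# Access type 2: index by DNS codepoint.
BY_DNS: Dict[str, MapRow] = {row.a_p: row for row in ROWS}


def ucs_to_dns(codepoint: int) -> Optional[str]:
    """Return the DNS codepoint for a UCS input codepoint, if mapped."""
    row = BY_UCS.get(codepoint)
    return row.a_p if row else None


def dns_to_ucs(ace: str) -> Optional[int]:
    """Return the folded UCS codepoint for display (an assumption)."""
    row = BY_DNS.get(ace)
    return row.u_p if row else None
```

Keeping one index per sorted column mirrors the idea that each access map is the same table sorted by a different column.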
IDN maps sorted by the codepoints of a particular column are called IDN access maps. The map accessed through primary DNS compatible codes is called the IDN Primary Access Map, and its subsets are called IDN Tagged Primary Access Maps. For example, the UCS CJK section of the IDN Primary Access Map is called the IDN Chinese Primary Access Map, or "idn-zh-pmap" for short.

It is REQUIRED that the IDN Tagged Primary Access Maps do NOT overlap with each other in terms of UCS codepoints. There is also the potential for over-fragmenting the IDN Primary Map, causing unnecessary processing overhead in machine time and user frustration. Reasonable studies are REQUIRED in defining Primary Access Maps so that different language groups can use the same Primary Access Maps, and so that Primary Access Maps are not fragmented into excessively small maps.

The DNS codepoint access map for a secondary language user is called an IDN Tagged Secondary Universal Access Map. Thus a Korean universal access map is named "idn-kr-amap".

IDN Universal Access Maps MUST be updated when the IDN primary map is updated.

2.2.2 Local Access

Many existing local display standards provide the basic code points in client systems and local server systems. They are limited to a highly efficient set of operations for end users as well as processes for local servers. To give end users speedy IDN access as well as compatibility with existing applications, it is RECOMMENDED that an IDN code exchange table includes the applicable local display standards corresponding to each applicable codepoint in UCS. Taking the example from Section 2.1:

   U-s     U-p     A-p    U-j     A-j
   U+0041  U+0061  a                     (Latin Letter A case folding)
   U+2fc2  U+2ee5  yv2    U+2fc2  uo     (Han character fish)

after including local code standards, it becomes:

   0       2-1     2      2+1    2+2    6-1     6      6+1    (Column number)
   U-s     U-p     A-p    G-p    B-p    U-j     A-j    J-j    (Column header)
   U+0041  U+0061  a                                          (Case folding)
   U+2fc2  U+2ee5  yv2    5167   b3bd   U+2fc2  uo     ???    (Han character fish)

where

   G-p: GB standard in the primary language of codepoint U+2fc2
   B-p: Big5 standard in the primary language of codepoint U+2fc2
   J-j: JIS standard in the Japanese language of codepoint U+2fc2

The column numbers in the first row are identified with a language tag, as discussed in Section 3.1. The columns whose numbers contain "+" are local access maps. They are called idn-zh-lmap-gb, idn-zh-lmap-b5 and idn-ja-lmap-ji respectively, and each column number is an offset index from its tagged ACE column number.

It is RECOMMENDED that when a local display code standard is no longer used for any legitimate reason, it MAY be deprecated from the IDN code exchange table, and that any new application based on the IDN-map SHOULD NOT depend on local access maps.

2.2.3 Summary of IDN Maps

The following is a list of IDN maps using the column headers of the example in Section 2.2.2, where (S) indicates the sorted column that gives the map its name:

Full maps:

              0       2-1  2    2+1  2+2  6-1  6    6+1    (Column number)
   idn-umap   U-s(S)  U-p  A-p  G-p  B-p  U-j  A-j  J-j    (UCS Map)

              0-3   0-2   0-1  0            2         6    (Column number)
   idn-emap   U-e"  U-e'  U-e  U-s(S)  U-p  A-p  U-j  A-j  (Equites Map)

Tagged section maps:

   idn-la-pmap   U-s  U-p  A-p(S)                               (Latin section)
   idn-zh-pmap   U-s  U-p  A-p(S)  G-p  B-p  U-j  A-j  J-j      (Chinese CJK section)
   idn-ja-pmap   U-s  U-p  A-p(S)  G-p  B-p  U-j  A-j  J-j      (Japanese Kana section)
   ...
   idn-ja-amap   U-s  U-p  A-p  G-p  B-p  U-j  A-j(S)  J-j      (Japanese CJK section)
   ...

Local access maps:

   idn-zh-lmap-gb   U-s  U-p  A-p  G-p(S)  B-p  U-j  A-j  J-j   (Chinese GB access)
   idn-zh-lmap-b5   U-s  U-p  A-p  G-p  B-p(S)  U-j  A-j  J-j   (Chinese BIG5 access)
   idn-ja-lmap-ji   U-s  U-p  A-p  G-p  B-p  U-j  A-j  J-j(S)   (Japanese JIS access)
   ...

2.2.4 Syntax of IDN Maps

The syntax of IDN maps MUST conform in full to the definition specified in Section 3 of [Version]. In addition, a third field of the values is specified as the language tagged, [STD13] conforming IDN names, or DNS identifiers.
It is further specified that if any field in a line is empty within a given language tagged code block, a field separator ";" MUST still be used to maintain data field alignment. It is REQUIRED that each line of IDN-map is treated in its entirety in sorting, and its columns MUST be consistent with the column numbers specified in its full map, idn-umap.

A separate text file, proposed to be named "idntag-xy.txt", specifies the particular Unicode blocks applicable to a particular language tag and its data fields or column number definition. The IDNTAG file is discussed further in the next section.

3. Version control and Language tags of IDN Code Exchange Table

The UCS character set is an open set: updates may let in new scripts as well as new individual characters. Certain subsets may also require longer preparation time before deployment, and user demand for mixed script use may increase in the future.

A language tag defined by [ISO639] and [RFC 3066] MUST be used as a flag indicating 1) a language group that is ready to be served, as opposed to an unspecified language group such as mathematical "language"; 2) a script range, in terms of Unicode blocks, that is ready to be served; 3) readiness to find the corresponding mnemonic ACE for a UCS codepoint and vice versa.

3.1 Language Tags

A language tag is defined by [ISO639-2/T] and [RFC 3066], and it MUST be prepended to a DNS name label and followed by a hyphen "-", in the form "xx-". A tag MUST have at least one non-zero Unicode block, R1, as its associated script range, defined by a triple:

   (start-point, end-point, Column# of T-ACE in IDN-Map), or
   (0001, ffff, n),

where start-point <= R1 <= end-point in Unicode code points, and column# MUST be a positive integer, n, where n-1 is the tagged Unicode folded column and n+1, n+2, ..., m are the column numbers of the local display standards of the language tag. The first code block of a script range is the primary range of a language tag.
It is REQUIRED that the primary ranges of language tags do not overlap, to make error checking feasible and the assignment of T-ACE values consistent. It is also RECOMMENDED to test for operational complexity before increasing the number of blocks associated with a tag or expanding its script range.

It is REQUIRED to register a language tag and its associated script range with IANA whenever it is modified. The repertoire of the registered tags and their script ranges is called the IDNTAG file hereafter.

3.2 Language Tag File Format

The IDNTAG file has a consistent format, specified in [Version] Section 3, that is:

   one language tag per line
   lines separated by CR/LF
   each field in the line separated by ";"
   each subfield in the line separated by ","
   the third subfield of the first triple field in a line is a constant for all primary language tags, for ease of maintenance

such that the IDNTAG file takes the form:

   tag-name; version#; block-1; block-2; block-3;...

where each block has three subfields, specifying the starting and ending codepoints of a block in Unicode hexadecimal form, and an integer giving the T-ACE column number in the IDN map. For example:

   tag1;1.0;HHHH,HHHH,2;HHHH,HHHH,6;HHHH,HHHH,5;
   tag2;1.0;HHHH,HHHH,2;HHHH,HHHH,5;
   ...

3.3 Identification of a Tag of an Input String

An IDN address in URL format may be in any mix of scripts, but all the characters of an IDN label MUST be in the script range of one language tag. This conformity ensures correct treatment of an IDN label by any URL parser, and minimizes the confusion of codepoints among different scripts. Using mixed scripts in one IDN label is NOT RECOMMENDED for an early deployment of IDN.
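The file format of Section 3.2 and the single script range rule above can be sketched as follows. This is a minimal illustration, not a definitive parser: the names Block, SAMPLE, parse_idntag and tag_for_label are hypothetical, and the block values in SAMPLE are invented for the example rather than taken from any registered IDNTAG file.

```python
from typing import Dict, List, NamedTuple, Optional


class Block(NamedTuple):
    start: int   # first codepoint of the block
    end: int     # last codepoint of the block (inclusive)
    column: int  # T-ACE column number in the IDN map


# Hypothetical IDNTAG content; lines are separated by CR/LF.
SAMPLE = "zh;1.0;4E00,9FFF,2;\r\nja;1.0;3040,30FF,2;4E00,9FFF,6;\r\n"


def parse_idntag(text: str) -> Dict[str, List[Block]]:
    """Parse 'tag;version;HHHH,HHHH,n;...' lines into tag -> blocks."""
    tags: Dict[str, List[Block]] = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(";") if f.strip()]
        if len(fields) < 3:
            continue  # blank line, or a tag without any block
        name, _version, *blocks = fields
        parsed = []
        for block in blocks:
            start, end, column = block.split(",")
            parsed.append(Block(int(start, 16), int(end, 16), int(column)))
        tags[name] = parsed
    return tags


def tag_for_label(label: str, tags: Dict[str, List[Block]]) -> Optional[str]:
    """Return the first tag whose script range covers every character."""
    if not label:
        return None
    for name, blocks in tags.items():
        if all(any(b.start <= ord(ch) <= b.end for b in blocks)
               for ch in label):
            return name
    return None  # the label mixes script ranges or uses an unlisted one
```

With the sample data, an all-CJK label resolves to the first listed tag, while a label mixing Latin and CJK characters matches no single script range and is rejected.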
3.3.1 IDN Tag File Interface

An IDN label can be an arbitrary byte stream in any of the display code standards permitted by IDN-Map ([ISO10646] and others to be decided). A four-parameter interface to the IDNTAG file is defined as:

   stat = find-tag(input, tag-file, input-std, tag-rec)

where find-tag MUST have four parameters:

   input:     a string, as a byte stream in the input standard,
   input-std: one of the code exchange standards permitted in IDN-map, including (UCS, USASCII, GB, BIG5, JIS, KSC, ISC ...),
   tag-file:  the tag definition file specified in Section 3.2,
   tag-rec:   a buffer for returning triples as defined in Section 3.1,

and

   stat:      the status of the search, including (ERROR, USASCII, UCS, ALPH, CONS, CJK, NO-TAG, LOC), discussed in Section 3.3.3.

The find-tag protocol is REQUIRED before each access to IDN-Map.

3.3.2 IDN Language Tag Identification Protocol

The above find-tag protocol is REQUIRED to include the following actions, performed in the following order:

   1) identify the tag prefix of a DNS label and return the tag's triples;
   2) identify an ASCII DNS label: if it conforms to [STD13], assign USASCII to the tag value and return USASCII as the tag status;
   3) assign a tag if the input standard has a known language tag (for example, input standard JIS implies language tag "ja") and return the tag triples;
   4) default to UCS and check for script range errors. It is RECOMMENDED that at least two of the input Unicode codepoints are checked for more accurate tag identification. If the tag values at the two check points are inconsistent, the more specific value MUST be returned, together with the corresponding tag triples;
   5) assign a language tag status to the protocol: when no applicable tag is found and no prohibited codepoint is encountered, a NO-TAG value MUST be returned.
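The ordered actions above can be sketched as below. This is only an illustration under several assumptions: the result is returned as a small record instead of through a tag-rec buffer; a tagged DNS label is given the USASCII status; "more specific" is read as "the tag whose script range covers both checked codepoints"; and TAG_FILE, STD_TO_TAG and PROHIBITED hold invented sample values. None of these names are defined by this document.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple

Triple = Tuple[int, int, int]      # (start, end, T-ACE column)
TagFile = Dict[str, List[Triple]]  # tag -> script range triples

# Hypothetical sample data, for illustration only.
TAG_FILE: TagFile = {
    "zh": [(0x4E00, 0x9FFF, 2)],
    "ja": [(0x3040, 0x30FF, 2), (0x4E00, 0x9FFF, 6)],
}
STD_TO_TAG = {"JIS": "ja"}         # input standard -> implied tag
PROHIBITED = frozenset({0x0020})   # stand-in for the [nameprep] list
LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


class TagResult(NamedTuple):
    stat: str
    tag: Optional[str]
    triples: List[Triple]


def tags_covering(cp: int, tag_file: TagFile) -> List[str]:
    """List the tags whose script range contains the codepoint."""
    return [tag for tag, blocks in tag_file.items()
            if any(lo <= cp <= hi for lo, hi, _col in blocks)]


def find_tag(label: str, tag_file: TagFile,
             input_std: str = "UCS") -> TagResult:
    if not label or any(ord(ch) in PROHIBITED for ch in label):
        return TagResult("ERROR", None, [])
    # 1) explicit "xx-" tag prefix on a DNS label
    prefix, sep, _rest = label.partition("-")
    if sep and prefix in tag_file:
        return TagResult("USASCII", prefix, tag_file[prefix])
    # 2) untagged ASCII label in letter-digit-hyphen form
    if LDH.fullmatch(label):
        return TagResult("USASCII", None, [])
    # 3) the input standard implies a language tag
    if input_std in STD_TO_TAG:
        tag = STD_TO_TAG[input_std]
        return TagResult("LOC", tag, tag_file[tag])
    # 4) default to UCS; check two codepoints (first and last here)
    first = tags_covering(ord(label[0]), tag_file)
    last = tags_covering(ord(label[-1]), tag_file)
    common = [tag for tag in first if tag in last]
    if common:
        return TagResult("UCS", common[0], tag_file[common[0]])
    if first or last:
        return TagResult("ERROR", None, [])  # script range error
    # 5) nothing applicable and nothing prohibited
    return TagResult("NO-TAG", None, [])
```

For an all-CJK label the first listed tag wins, which in this sample data is "zh"; a Kana plus CJK label resolves to "ja" because only that tag covers both checked codepoints.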
3.3.3 IDN Language Tag Identification Status Protocol

Tag identification is RECOMMENDED to use at least two of the input codepoints, for higher accuracy, and to use a two-step classification as well: one step for the script group, the other for the script within the group.

The first step is to identify the script group. Scripts may be treated in three different groups: alphabet, consonant, and syllabic or character-based systems. The three groups are reflected by the following code blocks in UCS:

              Alphabet Sys.   Consonant Sys.   Character Sys.
   From:      0020            0530             2e80
   to:        052f            1bff             d7af
   include:   Latin           Armenian         CJK
              Greek           Hebrew           Kanji
              Cyrillic        Arabic           Kana
              IPA             Devanagari       Hangul
              Vietnamese      Malayalam        Yi
                              Thai
                              Lao
                              Tibetan
                              ...

Some cultures, such as Japanese, often use two or more scripts within the same group, but rarely use another script, especially one from a different group.

The three groups also reflect different processing considerations. Scripts in the Alphabet group are frequently used by users of different languages, who may mix names from two or more spoken languages using the same script. Also, an alphabet has two semantically equivalent sets of symbols, uppercase and lowercase letters, which can be folded under [nameprep] canonicalization. The main treatment issue is mixed symbol use across language groups; for example, an Azerbaijani user may wish to switch between Latin and Cyrillic with ease.

The majority of scripts in the Consonant group serve one language per script, and many symbols from different scripts look alike but have unrelated values. However, when such a look-alike symbol appears in the context of its own script, its value is unambiguous. If the script is correctly identified, potential symbol confusion is resolved. In this group, more care should be given to language tag identification than for members of other script groups.

Treatment of character based scripts is largely a matter of the uniqueness of characters' indices.
The issue is more contentious if the T-ACE of one character collides with the T-ACE of a different character. Also, due to the sheer number of symbols, the T-ACE index system has to be easily mastered and has to be sortable for fast access [StepCode]. The main issue in IDN-Map is to identify character equivalent sets and to reduce the number of applicable IDN identifiers by 1) limiting the applicable IDN input code points to Plane 0 of the Unicode table, and 2) assigning one IDN identifier from each semantically equivalent character class suggested by [CJK] and [tsconv].

3.3.4 Summary of IDN Language Tag Status Protocol

The three major script groups are given the statuses ALPH, CONS, and CJK, as mentioned in Sections 3.3.1 and 3.3.3. It is suggested that language tags that fall into the same script group MAY be treated with the same language specific normalization and presentation methods, discussed later in Section 4.3 of this document, to reduce implementation complexity.

The IDN Language Tag status also includes:

   NO-TAG:  Unicode input code points without a primary language tag defined,
   ERROR:   prohibited UCS input code points [nameprep],
   LOC:     code points of local standards, other than Unicode, permitted in IDN-map,
   USASCII: [STD13] compliant input string.

4. Interface with IDN Code Exchange Map

A uniform interface to the IDN map is specified for interoperability among different clients and local server systems, and to make it feasible to upgrade the language specific modules associated with an individual language tag. These language tag specific modules are called "language tagged procedures".

4.1 Language Specific Modules

A spoken language is expressed with specific symbols grouped into a corresponding script, which may be scattered across different UCS blocks. Each script has its own methods for manipulating its symbols, for decomposing a symbol into parts, for selecting a symbol from an equivalent symbol set, for combining symbols into a string, and for presenting a string on a screen.
However, each language has its own systematic way of treating its script: some processes can be captured in simple procedures, others have to be treated on an individual basis, and many variations lie in between. It is RECOMMENDED that reasonable study is given to each language to classify its script treatment model, along with a cost vs. benefit analysis, in selecting a long term script specific processing protocol to be embedded in IDN language specific modules. It is RECOMMENDED that processing speed and simplicity of implementation take the highest priority in such a decision.

Two levels of script specific processing are supported by the IDN-Map structure. The lowest level is the language tagged IDN map in favor of the primary users of a script (Section 2.1), where a simple code equivalence from input to an IDN identifier can be assigned; this is referred to as canonicalization in [UTR21], [tsconv], [jpchar], and [hangeul]. The second level is IDN label normalization and presentation.

4.2 Script Specific Canonicalization

The first level, script specific canonicalization, has been addressed in [nameprep], [tsconv], [jpchar], [hangeul], [bidi], [UTR21], and [CJK], where a mechanism of folding, by Domain Name registration services and at the client site, for the purpose of preventing confusing allocations of CJK Domain Names or the like, takes much higher priority in domain name services.

For local server based deployment of IDN, a partial solution for recovering the registered codepoints MAY be achieved by specifying that the presentation of IDN uses the prefolded form for all names. For example, "JOES-Pizza" is folded to "joes-pizza", and recovered as "JOES-PIZZA" when the user so desires.

A complete recovery solution would involve a separate server transport of the original registered form. A supporting mechanism is discussed in [UNAME] and is used in the CJK specific procedures in Sections 4.3.1 and 4.3.2.
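The partial recovery by prefolded presentation can be sketched as below, assuming a hypothetical excerpt of the Latin section in which each uppercase ASCII letter (U-s) folds to its lowercase form (U-p). The names FOLD, UNFOLD, fold and present_prefolded are illustrative only.

```python
from typing import Dict

# Hypothetical (U-s -> U-p) pairs: 'A'..'Z' fold to 'a'..'z'.
FOLD: Dict[str, str] = {chr(c): chr(c + 0x20) for c in range(0x41, 0x5B)}

# Presentation in prefolded form: map each folded letter back to the
# U-s codepoint it was folded from.
UNFOLD: Dict[str, str] = {folded: src for src, folded in FOLD.items()}


def fold(label: str) -> str:
    """Canonicalize a label by folding every mapped codepoint."""
    return "".join(FOLD.get(ch, ch) for ch in label)


def present_prefolded(label: str) -> str:
    """Present a folded label using the prefolded form of each letter."""
    return "".join(UNFOLD.get(ch, ch) for ch in label)
```

Here fold("JOES-Pizza") gives "joes-pizza" and present_prefolded("joes-pizza") gives "JOES-PIZZA". The registered mixed-case form is not recoverable from the folded label alone, which is why the complete solution needs a separate transport of the original.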
The uniform interface to the IDN map has one procedure with 5 parameters:

   idn-folding(input-list, input-std, tag-rec, output-std, output-list);

where

   input-list  is the list of normalized and error checked codepoints [bidi][UAX15],
   input-std   is the code standard of the normalized input label (Sec. 3.3.1),
   tag-rec     is the tag triples returned by the find-tag protocol (Sec. 3.3.1),
   output-std  is the requested code standard, in the same form as input-std,
   output-list is a list of all the codepoints retrieved from the IDN Map in output-std;

and input-std and output-std are pairs of integers in the form (a,b), where the first integer, a, is the input-std (Sec. 3.3.1) and the second integer, b, is the offset in columns from the corresponding T-ACE column number (Sec. 2.2.3).

4.3 Language Specific Normalization and Presentation

The second level of script specific processing has been addressed in [IDNA], [icdn], [UAX15], [UAX9] and [bidi]; its procedures are referred to as normalization procedures and presentation procedures.

Normalization breaks an input string into a list of UCS codepoints in the input code standard. Presentation combines a list of UCS codepoints into a string in the output code standard. Presentation may join certain symbols between UCS codepoints, or render the order in which the UCS codepoints are presented as a string. Normalization MUST reverse all the renderings made to a label string by its corresponding presentation procedure when it breaks the string into a list of UCS code points.

When the input is an ACE string, the similar processes are called "fitting" and "decompose". The relations are:

   input        Processes            output
   UCS      normalize-->fitting      ACE
                \/         /\
   ACE      decompose-->present      UCS

For convenience, these procedures are proposed to be named with the exact language tag defined in the IDNTAG file, such that a language tagged normalization procedure is named "idn-XY-normalize", where "XY" represents the language tag of the associated procedure.
Following the same convention, "idn-XY-present", "idn-XY-fitting" and "idn-XY-decompose" would be the names of the respective IDN name presentation, fitting, and DNS name decompose procedures. For example, "idn-zh-present" is the language tagged IDN label presentation procedure for Chinese.

Two language specific script treatment procedures are REQUIRED for each registered language tag: 1) Normalize and 2) Present. Two additional T-ACE specific script treatment procedures, 3) Fitting and 4) Decompose, are RECOMMENDED for non-alphabet languages. It is also RECOMMENDED that a NO-TAG general compressive ACE [AMC] is registered with IANA as compress and decompress procedures, corresponding to the Fitting and Decompose procedures.

It is REQUIRED that when a language tag is registered with IANA, the associated script specific procedures are registered at the same time.

4.3.1 Language Tagged Normalization and Input Error Checking

The find-tag interface gives the legal search range for the error checking and normalization process, to ensure that all the codepoints in an input IDN label are legal IDN codepoints, which SHOULD NOT be rejected by the IDN Map. The returned list of UCS codepoints MUST be checked for such errors, to prevent illegal IDN codepoints from slipping through and burdening the subsequent search in IDN-Map.

The normalization protocol is:

   stat=idn-XY-normalize(input, input-std, tag-rec, input-list, err-report)

It is REQUIRED that each language tagged normalization procedure perform the following:

   1) check for a disallowed input-std,
   2) check for disallowed codepoints in its script range,
   3) normalize the input string to the input codepoints allowed by IDN-Map,
   4) return input-list with one UCS codepoint per record,
   5) report any errors.
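The five required steps can be sketched as follows. This is an illustration under stated assumptions, not the language tagged procedure itself: Unicode NFKC stands in for the language specific normalization, the status is reduced to a boolean, and ALLOWED_STDS and idn_normalize are hypothetical names.

```python
import unicodedata
from typing import List, Tuple

Triple = Tuple[int, int, int]  # (start, end, T-ACE column)
ALLOWED_STDS = {"UCS"}         # hypothetical set of permitted input-std


def idn_normalize(text: str, input_std: str, tag_rec: List[Triple],
                  input_list: List[int], err_report: List[str]) -> bool:
    """Fill input_list with codepoints; return True when error free."""
    input_list.clear()
    err_report.clear()
    # 1) check for a disallowed input-std
    if input_std not in ALLOWED_STDS:
        err_report.append(f"disallowed input-std: {input_std}")
        return False
    # 2) check every codepoint against the tag's script range
    for ch in text:
        cp = ord(ch)
        if not any(lo <= cp <= hi for lo, hi, _col in tag_rec):
            err_report.append(f"U+{cp:04X} is outside the script range")
    if err_report:
        return False  # 5) errors are reported through err_report
    # 3) normalize (NFKC is an assumption, see the lead-in)
    normalized = unicodedata.normalize("NFKC", text)
    # 4) one UCS codepoint per record
    input_list.extend(ord(ch) for ch in normalized)
    return True
```

The two list arguments play the role of the input-list and err-report buffers in the protocol line above, so the caller can inspect both after the call.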
A similar protocol applies to decomposition:

   stat=idn-XY-decompose(input, USASCII, tag-rec, input-list, err-report)

It is RECOMMENDED that each T-ACE decomposition procedure perform the following:

   1) check the zonefile for a cached IDN label,
   2) check the input string for non-ASCII content caused by transport corruption,
   3) check the label length; if it is at the maximum, request the original registered IDN label from the registrar,
   4) strip the language tag,
   5) decompose the input string into UCS code points permitted by IDN-Map,
   6) return input-list with the ACE for each UCS codepoint per record,
   7) report any errors.

4.3.2 Language Tagged Presentation and Preserving Character Boundary

When the idn-folding protocol returns a list of output UCS codepoints, a presentation process checks the correctness of the output codepoints and combines them into a display string. If the output codepoints contain errors, the presentation procedure SHOULD report an error, request that the original IDN display codepoints be sent, and make its best effort to display the current IDN string.

The presentation protocol is:

   stat=idn-XY-present(output-list, output-string, err-report)

It is RECOMMENDED that each language tagged presentation procedure perform the following:

   1) if a codepoint contains an error, request the original registered IDN label from the original registrar,
   2) reverse the renderings made to a string by the normalization procedure,
   3) arrange the string display order/direction,
   4) concatenate output-list into an output label and return the output label,
   5) report any errors.

A similar protocol,

   stat=idn-XY-fitting(output-list, output-name, err-report)

puts in the separators necessary for easy decomposing, and makes certain that the encoding length fits into the limited label space of 63 octets. If the encoding is over the maximum label length, it SHOULD record both the input string and the T-ACE name in the local zonefile, and compose a DNS identifier from the output-list codepoints.
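The length handling of the fitting protocol can be sketched as below. This is a rough illustration only: the separator, the policy of dropping trailing ACE units until the label fits, and the names MAX_LABEL_OCTETS and idn_fitting are assumptions, and the zonefile is modeled as a plain dictionary.

```python
from typing import Dict, List, Tuple

MAX_LABEL_OCTETS = 63  # label space limit stated above


def idn_fitting(tag: str, aces: List[str], original: str,
                zonefile: Dict[str, Tuple[str, str]],
                sep: str = "") -> str:
    """Compose 'tag-ACE...' and keep it within the label limit."""
    def compose(units: List[str]) -> str:
        return f"{tag}-" + sep.join(units)

    full = compose(aces)
    if len(full.encode("ascii")) <= MAX_LABEL_OCTETS:
        return full
    # Too long: drop trailing ACE units until the label fits.
    kept = list(aces)
    while kept and len(compose(kept).encode("ascii")) > MAX_LABEL_OCTETS:
        kept.pop()
    fitted = compose(kept)
    # Record the input string and the full T-ACE name locally so the
    # original can be found again from the shortened identifier.
    zonefile[fitted] = (original, full)
    return fitted
```

A label that already fits is returned unchanged and nothing is recorded; only over-length names create a zonefile entry.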
It is RECOMMENDED that each T-ACE fitting procedure perform:

   1) check the total code length and, if required, truncate trailing ACE to fit within the label length limit,
   2) when necessary, insert codepoint separators for proper decomposing,
   3) concatenate the ACE of each UCS code point into an output-name,
   4) prepend the language tag to output-name,
   5) report any errors.

4.3.3 Special Attention to Mixed Scripts

A string mixing CJK and Kana is Japanese; a string mixing CJK and Hangul is Korean. However, an all CJK character string MUST be presumed to be in the primary language tag, that is Chinese, and registered as the only IDN name, unless the registrant requests a second and a third language to access the same IDN name. In that case, there could be more than one DNS label to be maintained by the registrant, and IDN-Map becomes an automatic name translation agency.

The tag identification of an arbitrary input string proposed in the find-tag protocol is at best a language indicator. A more careful check should be made in the normalizing and error checking procedure. For example, the Chinese tagged normalizing procedure, idn-zh-normalize, MUST check all input code points to be certain that the value returned from the find-tag procedure is correct, and alter it when necessary. It SHOULD identify a CJK-Kana mix as the Japanese tag, and a CJK-Hangul mix as the Korean tag.

4.4 Language Tagged IDN Label Conversions

The primary IDN label conversions are from UCS to [STD13] and vice versa. Backward compatibility support is also given, as a utility, to a limited set of local standards.

A uniform IDN interface to applications was agreed to at the IETF IDN Working Group session (August 2001, London, England). The protocol SHOULD treat any possible input string with the same procedures, and divert language specific requirements to language tagged procedures at fixed points of the IDN label conversions.
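One possible way to divert language specific work to language tagged procedures at fixed points is a table of procedures keyed by language tag. The sketch below is only an illustration. The struct idn_module, the shared procedure signature, the lookup function idn_find_module and the placeholder procedure idn_copy are all assumptions made for this example. A real installation would register actual idn-XY-normalize, idn-XY-present, idn-XY-fitting and idn-XY-decompose procedures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature shared by all language tagged procedures:
 * returns 0 on success, -1 on error. */
typedef int (*idn_proc)(const char *in, char *out, size_t cap);

/* One entry per registered language tag. */
struct idn_module {
    const char *tag;      /* language tag, e.g. "zh" */
    idn_proc normalize;   /* REQUIRED */
    idn_proc present;     /* REQUIRED */
    idn_proc fitting;     /* RECOMMENDED, may be NULL */
    idn_proc decompose;   /* RECOMMENDED, may be NULL */
};

/* Placeholder standing in for a real language specific procedure:
 * copies the input unchanged, failing if it does not fit. */
static int idn_copy(const char *in, char *out, size_t cap)
{
    size_t n = strlen(in);
    if (n + 1 > cap)
        return -1;
    memcpy(out, in, n + 1);
    return 0;
}

static const struct idn_module idn_modules[] = {
    { "zh", idn_copy, idn_copy, idn_copy, idn_copy },
    { "ja", idn_copy, idn_copy, idn_copy, idn_copy },
    { "kr", idn_copy, idn_copy, idn_copy, idn_copy },
};

/* Find the module for a tag; NULL means the tag is not implemented. */
const struct idn_module *idn_find_module(const char *tag)
{
    assert(tag != NULL);
    for (size_t i = 0; i < sizeof idn_modules / sizeof idn_modules[0]; i++)
        if (strcmp(idn_modules[i].tag, tag) == 0)
            return &idn_modules[i];
    return NULL;
}
```

With such a table, the uniform protocol calls the same four entry points for every language, and adding a language means adding one table row rather than changing the protocol.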
The uniform IDN interface to applications is proposed to be:

   idn-label(input, input-std, tag-file, zonefile, idn-name, output-std);

where

   input:      IDN label in input-std,
   input-std:  any IDN permitted code standard (Sec.3.3.1),
   tag-file:   IANA distributed IDNTAG file (Sec.3.2),
   zonefile:   optional local registered domain name file for servers [UNAME], or cache at a client site,
   idn-name:   output of converted input in requested output-std,
   output-std: requested output form in any IDN permitted code standard.

In addition, a localized zonefile search procedure SHOULD be supplied if a zonefile is applicable.

4.4.1 Code Conversions Supported by IDN-Map

The Idn-label protocol recognizes two code standards by default: UCS and ASCII. Any other permitted code standard MUST be specified as a parameter. The code conversion direction is specified in the following matrix.

Input-std to output-std implementation matrix:

   in\out   U-i     U-p     ACE     ASCII   G       B       J
   U-i      -       fold    DNS     -       disp    disp    disp
   U-p      record  -       DNS     -       disp    disp    disp
   ACE      record  regist  pass    pass    disp    disp    disp
   G        record  fold    DNS     -       -       disp    disp
   B        record  fold    DNS     -       disp    -       disp
   J        record  fold    DNS     -       disp    disp    -
   ASCII    -       -       -       pass    -       -       -

where

   U-i     UCS input
   U-p     UCS folded in primary language
   ACE     T-ACE form
   G,B,J   permitted local code standards
   record  used for registration font or trademark records
   regist  for registration conflict matching
   fold    canonicalization, case folding
   DNS     obtain DNS identifier
   pass    pass by, no processing
   disp    local client backward compatible display
   -       prohibited

The matrix shows that the conversion is determined by the input code standard. If the input and output are both ASCII, then the output is the ASCII input without any further delay, which is compatible with current DNS operation.
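The implementation matrix of this section can be held as a constant lookup table, so that the conversion chosen for any input-std and output-std pair is a single array access. The following C sketch transcribes the matrix. The enum names and the function idn_conversion are hypothetical and are introduced only for illustration.

```c
#include <assert.h>

/* Code standards, in the column order of the matrix. */
enum idn_code { C_UI, C_UP, C_ACE, C_ASCII, C_G, C_B, C_J, C_COUNT };

/* Actions named in the matrix legend; A_NONE is the "-" (prohibited) entry. */
enum idn_action { A_NONE, A_FOLD, A_DNS, A_RECORD, A_REGIST, A_PASS, A_DISP };

/* idn_matrix[in][out], one designated row per input standard. */
static const enum idn_action idn_matrix[C_COUNT][C_COUNT] = {
    /* out:      U-i       U-p       ACE     ASCII   G       B       J     */
    [C_UI]    = { A_NONE,   A_FOLD,   A_DNS,  A_NONE, A_DISP, A_DISP, A_DISP },
    [C_UP]    = { A_RECORD, A_NONE,   A_DNS,  A_NONE, A_DISP, A_DISP, A_DISP },
    [C_ACE]   = { A_RECORD, A_REGIST, A_PASS, A_PASS, A_DISP, A_DISP, A_DISP },
    [C_ASCII] = { A_NONE,   A_NONE,   A_NONE, A_PASS, A_NONE, A_NONE, A_NONE },
    [C_G]     = { A_RECORD, A_FOLD,   A_DNS,  A_NONE, A_NONE, A_DISP, A_DISP },
    [C_B]     = { A_RECORD, A_FOLD,   A_DNS,  A_NONE, A_DISP, A_NONE, A_DISP },
    [C_J]     = { A_RECORD, A_FOLD,   A_DNS,  A_NONE, A_DISP, A_DISP, A_NONE },
};

/* Look up the action for a conversion from input standard in to
 * output standard out. */
enum idn_action idn_conversion(enum idn_code in, enum idn_code out)
{
    assert(in < C_COUNT && out < C_COUNT);
    return idn_matrix[in][out];
}
```

For example, idn_conversion(C_ASCII, C_ASCII) yields A_PASS, the case that stays compatible with current DNS operation, while every other conversion from ASCII input yields A_NONE.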
4.4.2 Input and Output Format Request

The idn-label protocol may be installed at a client site, where the input and output request specification may contain errors due to a variety of inconsistent site configurations. Smooth handling of such errors is an important part of the idn-label protocol.

Input-std to output-std default case matrix:

   in\out   U-t     U-n     ACE     ASCII
   U-t      -       -       ACE     -
   U-n*     -       -       ACE     -
   ACE      UCS     -       -       -
   ASCII    -       -       -       pass

where

   U-t     UCS code with tag identified
   U-n     UCS code with NO-TAG identified,
           *also any input-std error case
   ACE     identified ACE format
   ASCII   [STD13] with no tag, or with "us-" tag added by zone masters
   -       ignored case
   pass    pass by without any processing

It is proposed that the tag "us-" be reserved, as an optional language tag, for a name part which consists exclusively of characters that conform to the hostname requirements in [STD13].

If the input is an all-ASCII label conforming to [STD13], or a name with "us-" prepended, and the output standard is not specified or is specified as USASCII, then the input name MUST NOT be converted at all. This absolute requirement prevents:

   1) double encoding by a client taking user keyboard input and by a server provider;
   2) messing up existing registered domain names;
   3) interfering with registered glyphs with more than one phonetic standard, such as Hanja and Kanji in CJK script.

If the input string consists only of characters that conform to the hostname requirements in [STD13], carries a prefixed language tag, and the output standard is NOT USASCII, the RECOMMENDED output defaults to UCS folded, column #1, which is the universal base support. This recommendation provides a friendly presentation when the end user is unaware of the configuration.

When there is no tag on a non-ASCII input string, the string goes through script identification, prohibited character filtering, canonicalization and case folding, as defined in [nameprep], and is then treated with the find-tag process.
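The MUST NOT convert rule for untagged or "us-" tagged ASCII labels can be sketched in C as follows. This is a minimal illustration under stated assumptions. The names idn_must_pass and is_std13_label and the enum out_std are hypothetical. The flag has_other_tag is assumed to be computed elsewhere (for example by find-tag), since telling a language tag such as a T-ACE prefix apart from ordinary label text requires the tag file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical output standard request values. */
enum out_std { OUT_UNSPECIFIED, OUT_USASCII, OUT_UCS };

/* True if the label is non-empty and uses only letters, digits and
 * hyphen, the hostname characters of [STD13]. */
static bool is_std13_label(const char *s)
{
    if (*s == '\0')
        return false;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                  (c >= '0' && c <= '9') || c == '-';
        if (!ok)
            return false;
    }
    return true;
}

/* True if the label MUST be passed through unconverted: all-ASCII input
 * with no tag or the "us-" tag, and output unspecified or USASCII.
 * has_other_tag is true when a language tag other than "us-" was found. */
bool idn_must_pass(const char *label, bool has_other_tag, enum out_std out)
{
    assert(label != NULL);
    if (!is_std13_label(label))
        return false;            /* non-ASCII input is never passed by */
    if (has_other_tag)
        return false;            /* tagged T-ACE goes to decompose */
    return out == OUT_UNSPECIFIED || out == OUT_USASCII;
}
```

Performing this check before any other processing is what prevents double encoding: a label that has already been converted to ASCII by the client is not converted again by the server.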
If the output-std is not specified, or is specified inconsistently, then USASCII is assigned as the default output-std for any non-ASCII input. All other input and output code standards MUST be explicitly specified for a conversion request to be honoured.

4.5 Uniform Idn-label Protocol

The Idn-label protocol is summarized below in a C language format, with some of the parameters and details omitted.

   idn-label(input, input-std, tag-file, zonefile, idn-name, output-std)
   {
     flag = find-tag(input, tag-file, input-std, tag-rec);
     tag = get-tag(tag-rec);

     /* Part 1: Name preparation, normalization and error checking */
     switch (flag) {
     case ERROR:
       return (ERROR);
     case USASCII:                       /* input ASCII */
       switch (tag) {
       case NIL: return (idn-name = input);   /* ASCII passby */
       case US:  return (idn-name = input);   /* ASCII passby */
       case AMC:                              /* General ACE [AMC] */
         idn-amc-decompress;
         return (idn-name);                   /* Finish */
       case ZH: idn-zh-decompose; break;      /* T-ACE decompose */
       case JA: idn-ja-decompose; break;
       case KR: idn-kr-decompose; break;
       ...
       default: return ("unimplemented tag ERROR");
       }
       if (output-std not permitted)     /* Output request check */
         output-std = UCS;
       break;
     case NO-TAG:                        /* General UCS input */
       switch (flag) {
       case ALPH: idn-alph-normalize; break;
       case CONS: idn-cons-normalize; break;
       case CJK:  idn-cjk-normalize;  break;
       }
       if (output-std ERROR)             /* Output request check */
         output-std = USASCII;
       break;
     default:                            /* script range found */
       switch (tag) {
       case ZH: idn-zh-normalize; break;
       case JA: idn-ja-normalize; break;
       case KR: idn-kr-normalize; break;
       ...
       }
       if (output-std not permitted)     /* Output request check */
         output-std = USASCII;
       break;
     }
     /* Above normalizing protocol:
        stat=idn-XY-normalize(input, input-std, tag-rec,
                              input-list, err-report) */
     if (error(stat)) {                  /* Input error checked */
       fprintf(stderr, "%s %s", input, err-report);
       return (ERROR);
     }

     /* Part 2: Canonicalize and Code exchange */
     idn-folding(input-list, input-std, tag, output-std, output-list);

     /* Part 3: Present and Fitting */
     switch (output-std) {
     case USASCII:                       /* output ACE */
       if (flag == NO-TAG) { stat=idn-AMC-compress; tag=AMC; }
       switch (tag) {
       case ZH: stat=idn-zh-fitting(output-list, idn-name, err-report); break;
       case KR: stat=idn-kr-fitting(output-list, idn-name, err-report); break;
       case XY: stat=idn-XY-fitting(output-list, idn-name, err-report); break;
       }
       concatenate(tag, idn-name);       /* prepend tag to ACE */
       break;
     case UCS:                           /* output UCS */
       switch (tag) {
       case AMC:
         switch (flag) {
         case ALPH: idn-alph-present; break;
         case CONS: idn-cons-present; break;
         case CJK:  idn-cjk-present;  break;
         }
         break;
       case KR: idn-kr-present; break;
       case JA: idn-ja-present; break;
       case XY: stat=idn-XY-present(output-list, idn-name, err-report); break;
       }
       break;
     case other-output: {}               /* output other standard */
     }
   }

5. Preferred Embodiments of IDN Code Exchange Map

Three applications are suggested, for the client, the server and the general public.

5.1 Client Application

The Uniform Idn-label Protocol of Section 4.5 is one of the preferred embodiments of IDN-Map discussed here; it provides a consistent IDN client interface across any language installation. Using the Idn-label interface, a basic URL cut and paste operation may be implemented:

   URL cut and paste, then send:
      Loop for all labels {
         Get IDN label from URL buffer,
         Call Idn-label, receive ACE label,
         replace IDN label with ACE label
         until end of URL
      }
      send URL.

5.2 Server Application

The most important embodiment of IDN-Map is in the IDN domain name registration process, where it is used to check for name conflicts and to search trademarks; trademarks in Han characters are common practice. The following prototype demonstrates such an embodiment.
IDN registration as an example of a server application:

   1) Get wish-name, call Idn-label(wish-name), receive T-ACE-label. Examine T-ACE-label; if bad, go to 1). Send T-ACE-label for DNS match; if bad, go to 1); if good, go to 2).
   2) Call Idn-label(T-ACE-label), receive IDN-name. Examine IDN-name; if bad, go to 1). Send IDN-name for IDN match; if bad, go to 1); if good, go to 3).
   3) Register IDN-name and T-ACE-label in the zonefile [UNAME].

5.3 Implications of Deployment of IDN-Map

IDN-Map is a feasible tool for many purposes. For example, a third application has been suggested: to use IDN-Map as a general input encoding exchange module that can be called from any application. If this is implemented, a librarian may use a keyboard with existing input software to access a particular CJK character, C, in UCS Plane 0, and retrieve a C' from Plane 1, or a C" from Plane 2.

A flexible tool always brings its drawbacks with it. On the technical side, more scrutiny has to be given to mapping each equivalent symbol to its equivalent code point, and each T-ACE has to be checked for mnemonic value and simple logical assignment to ensure consistency and uniqueness. It also introduces more policy decisions. For example, the registrant of an all CJK character trademark may have to register in three languages to ensure the legitimacy of the trademark. After all, a useful tool lets its users make the decisions.

6. Security Considerations

Much of the security of the Internet relies on the DNS. Thus, any change to the characteristics of the DNS can change the security of much of the Internet. IDN-Map makes no changes to the DNS itself.

7. Internationalization Considerations

The international code exchange table will be a convenience for the development of many international applications.

8. Acknowledgements

Special comments which contributed to improving this document were received from Li Ming Tseng, as well as from many other people in the working group.

9. IANA Considerations

This document requires IANA action to make language tags available, and to register each tag together with its associated language specific processing procedures.

10. References

[AMC] Adam M. Costello, "AMC-ACE-Z", draft-ietf-idn-amc-ace-z, Sept. 2001.

[Alphabet] "Repertoires of characters used to write the indigenous languages of Europe", A CEN Workshop Agreement, Version 2.8, TECHNICAL REPORT, Draft: 1998-12-14. http://www.egt.ie/alphabets/#1.3

[ASCII] American National Standards Institute (formerly United States of America Standards Institute), X3.4, 1968, "USA Code for Information Interchange". (ANSI X3.4-1968)

[bidi] Martin Duerst, "Internet Identifiers and Bidirectionality", draft-duerst-iri-bidi-00.txt, July 2001.

[CJK] James SENG, etc. "Han Ideograph (CJK) for Internationalized Domain Names", draft-ietf-idn-cjk-01.txt, Apr 2001.

[GB] China national code exchange standard.

[hangeul] Soobok Lee and GyeongSeog Gim, "Hangeul NAMEPREP recommendation", draft-ietf-idn-hangeulchar, July 2001.

[icdn] Xiang Deng and Yan Fang Wang, "The Implementation of Chinese character in IDN", draft-ietf-idn-icdn-00.txt, July 2001.

[IDN] "IETF Internationalized Domain Names Working Group", idn@ops.ietf.org, James Seng, Marc Blanchet.

[IDNA] Patrik Faltstrom and Paul Hoffman, "Internationalizing Host Names In Applications", draft-ietf-idn-idna-03.txt, July 2001.

[IDNReq] Zita Wenzel and James Seng, "Requirements of Internationalized Domain Names", draft-ietf-idn-requirements, May 2001.

[ISCII] Indian Standard Code for Information Exchange.

[ISO639][ISO639-2/T] ISO/IEC 639-2 2001 Codes for the Representation of Names of Languages.

[ISO10646] ISO/IEC 10646-1:2000 (note that an amendment 1 is in preparation), ISO/IEC 10646-2 (in preparation), plus corrigenda and amendments to these standards.
[JIS] "Japanese Industrial Standards", Information Technology (Terms/Code/Date elements)-99, ISBN 4-542-12976-4.

[jpchar] Yoshiro Yoneya and Yasuhiro Morishita, "Japanese characters in multilingual domain name labels", draft-ietf-idn-jpchar-01, March 2001.

[KSC] Korean national code exchange standard.

[nameprep] Paul Hoffman and Marc Blanchet, "Preparation of Internationalized Host Names", draft-ietf-idn-nameprep, July 2001.

[Pinyin] "Scheme for the Chinese Phonetic Alphabet", Shangwu Publishing House, 1979, United Book# 9017.810.

[RFC2277] H. Alvestrand, "IETF Policy on Character Sets and Languages", RFC 2277, January 1998.

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate Requirement Levels", RFC 2119, March 1997.

[RFC2231] Email tag.

[RFC 3066] H. Alvestrand, "Tags for the Identification of Languages", RFC 3066.

[STD13] Paul Mockapetris, "Domain names - implementation and specification", STD 13 (RFC 1035), November 1987.

[StepCode] Liana Ye, "StepCode - A Mnemonic Internationalized Domain Name Encoding", to be submitted.

[tsconv] XiaoDong LEE, etc. "Traditional and Simplified Chinese Conversion", draft-ietf-idn-tsconv-00.txt, June 2001.

[UAX9] Mark Davis, "The Bidirectional Algorithm", Unicode Standard Annex #9, March 2001. http://www.unicode.org/unicode/reports/tr9

[UAX15] Mark Davis and Martin Duerst, "Unicode Normalization Forms", Unicode Standard Annex #15, Version 3.1.0.

[UCS] "Universal Multiple-Octet Coded Character Set", ISO/IEC 10646-1:1993, ISBN 0-201-61633-5.

[UNAME] Li Ming TSENG, etc. "Internationalized Domain Names and Unique Identifiers/Names", draft-ietf-idn-uname-01.txt, Jul 2001.

[UTR21] Mark Davis, "Case Mappings", Unicode Technical Report #21.

[UNICODE] The Unicode Consortium, "The Unicode Standard". Described at http://www.unicode.org/unicode/standard/versions/.

[UNICODE3] The Unicode Consortium, "The Unicode Standard -- Version 3.0", ISBN 0-201-61633-5. Same repertoire as ISO/IEC 10646-1:2000. Described at http://www.unicode.org/unicode/standard/versions/Unicode3.0.html.

[URL] Roy Fielding et al., "Uniform Resource Identifiers: Generic Syntax", RFC 2396, August 1998; Robert Hinden et al., "IPv6 Literal Addresses in URL's", RFC 2732, December 1999.

[version] Marc Blanchet, "Handling versions of internationalized domain names protocols", draft-ietf-idn-version.

11. Authors' Contact Information

   Liana Ye
   Y&D ISG
   2607 Read Ave.
   Belmont, CA 94002, USA.
   (650) 592-7092
   liana.ydisg@juno.com