Network Working Group                           Jean-Francois C. Morfin
Internet-Draft                                                  Intlnet
Intended status: Independent submission               September 9, 2009
Expires: March 10, 2010

                 WG-IDNABIS/LC comments and responses
                    draft-iucg-wgidnabislc-01.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on March 10, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info). Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Morfin                    Expires March 10, 2010                [Page 1]
Internet-Draft                   wgidnalc                 September 2009

Abstract

   The IDNA is a key issue for the IUCG as a paradigm for the future of the Internet. There is therefore a need to make sure that the document set describing it reflects a full consensus of the IETF and of users. To help with this, this memo keeps track of the WG-IDNABIS/LC requested and received answers. The IAB Draft on IDN has been added because some important remarks have been made about it. The author is quoted if the comment is not from IUCG.

Table of Contents

   1. Introduction
   2. General appreciation
   3. IDN - IAB
   4. IDNA Definitions
   5. IDNA Rationale
   6. IDNA Mapping
   7. IDNA Protocol
   8. IDNA BIDI
   9. IDNA Tables

Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

1. Introduction

   The IDNA is a key issue for the IUCG as a paradigm for the future of the Internet. There is therefore a need to make sure that the document set describing it reflects a full consensus of the IETF and of users.
   To help with this, this memo keeps track of the WG-IDNABIS/LC requested and received answers. The IAB Draft on IDN has been added because some remarks have been made which are important to the background and to the future of the Multilingual Internet. The author is quoted if the comment is not from IUCG.

   This compilation is updated up to the answer of Harald Alvestrand on 08/29.

2. General appreciation

   * The division of the material into documents seems adequate. However, even if the Mapping memo was not a part of the IDNA (why?) document set, it is more than logical and enlightening to have it read prior to the Protocol parts.

   * The documents are rather confusing because it is impossible to decide whether:

      * they consider IDNA as a part or not as a part of the DNS (we may also be influenced by the ML-DNS pile we work on).

      * they differentiate (which) between characters and codepoints.

      * they use NFKC or NFC, and what their differences are, intrinsically and from an IDNA point of view.

      * they are meant to be a complete set of standards or a partial set of suggestions.

     This results from:

      * non-normative forms being used in places that one would deem normative.

      * the constant discussion of Registries' capacities/obligations and the lack of documentation on the tools for executing them and managing the related registration/coding metadata and rules.

   * (Martin Duerst): - Use only one name for talking about the document collection. Currently:

      * 'collection' (Abstract, 1.1)

      * 'series' (1.1)

      * 'set' (1.3)

      * 'and the associated ones' (1.1.1)

      * 'these documents' (2.1; very unclear when reading whether that phrase indeed refers to the document collection or to Unicode documents or what, similar again in 2.2)

     This variability is confusing.

3. IDN - IAB

   * (Martin Duerst): Mentioning ISO-2022-JP for encoding Japanese domain names raises some suspicion.
     ISO-2022-JP may well be (or have been) used in the DNS or a similar system, but such use would be atypical, and should be documented by a reference. Based on the general "division of labor" of the three classical Japanese encodings (ISO-2022-JP, EUC-JP, Shift_JIS), one would expect EUC-JP or Shift_JIS rather than ISO-2022-JP in such a case. [Among the three, ISO-2022-JP makes it easiest to explain the "heuristic encoding detection" scenario described at the end of Section 1.1. But without a reference, it may look to some as if ISO-2022-JP was a made-up example.]

   * 1. "An Internationalized Domain Name (IDN) is a name that contains one or more non-ASCII characters." What is a "name" vs. a domain name?

   * 1. "When an IDN is encoded with Punycode, it is prefixed with "xn--"," Is an IDN a label?

   * 1. "it is prefixed with "xn--", which assumes that ASCII names do not start with this prefix." Isn't the whole thing supposed to use "xn--" ASCII labels instead of non-ASCII entries?

   * 1. "reversible Unicode-to-Punycode conversion .... reversible Punycode-to-Unicode conversion". Unicode is a table, Punycode is an algorithm. Punycode is not reversible, but its use can be restricted to a set of codepoints, which in turn permits it to operate reversibly.

   * 1. "UTF-8 [RFC3629] is a mechanism for encoding a Unicode character in a variable number of 8-bit octets, where an ASCII character is preserved as-is." Characters belong to scripts that may or may not be supported by ASCII and Unicode encoding tables.

   * 1.1. (Martin Duerst): For the bulleted list at the end of Section 1.1, it should be pointed out that UTF-8 can be detected, and distinguished from other 8-bit encodings, with much higher precision than just "a byte in the string has the 8th bit set". For details, please see http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.

   * 1.1.
     (Martin Duerst): The heuristic for punycode that is given in Section 1.1 is "starts with xn--". However, on the level of getaddrinfo, we are dealing with domain names, not single labels, and something like www.xn--foo.jp should definitely be punycode even if it doesn't start with xn--.

   * 1.1. "When used with a DNS resolver library, IDNA is inserted as a "shim" between the application and the resolver library": that shim should be located in Figure 2.

   * 1.1. "Again this assumes that ASCII names never start with "xn--", and also that UTF-8 strings never contain an ESC character". It may be worth documenting whether "IDN", "names", "labels", and "strings" are or are not the same thing.

   * 2. "Applications that convert an IDN to Punycode before calling getaddrinfo() will result in name resolution failures if the Punycode name is directly used in such protocols." Why? Is that not actually due to the reason that was given in 3? "Applications that convert an IDN to Punycode before calling getaddrinfo() will result in name resolution failures if the name is actually registered in a private name space in some other encoding (e.g., UTF-8)."

   * 3. "While implementations of the DNS protocol must not place any restrictions on the labels that can be used, applications that use the DNS are free to impose whatever restrictions they like, and many have." Wouldn't these two rules contradict the proposed WG-IDNABIS charter change? Wouldn't they permit the support of cases such as Tatweel, Tamil figures, and French majuscules?

   * 3.4. "The DNS resolver will append suffixes in the suffix search list". Where is the "suffix search list" documented?

   * 4. "even when DNS is used, the conversion to Punycode should be done only by an entity that knows which name space will be used." Fundamental. Yet, this is not considered by IDNA rationale or protocol.

   * 4.1. "Indeed the choice of conversion within the resolver libraries is consistent with the quote from [RFC3490] section 6.2 stating that Punycode conversion "might be performed inside these new versions of the resolver libraries". - "the recommendation is that a resolver library be more liberal in what it would accept from an application would mean that such a name would be accepted and re-encoded as needed". These recommended architectures (such as ML-DNS) are not considered in the IDNA rationale. Will IDNA be interoperable with these recommended architectures?

   * 4. "encoding conversion between Punycode and UTF-8 is unambiguous". (?). This could lead to stabilization through punycode and A-labels, in turn making A-labels the DNS referent entry for UTF support?

   * 5. Security considerations. This kind of consideration is already posing a problem for TM protection. This is the "Babel Names" case. This occurs when someone trademarks the U-label corresponding to a protected Roman script TM. When that U-label displays under its ASCII label form, it infringes on the Roman script TM rights. Ex.: "xn--coca-cola" or "xn--vint-cerf".

   * (Martin Duerst): The solution that the document seems to be pushing most is heuristic detection, i.e. an API where strings in different encodings are fed in and the API sorts things out heuristically, converting if necessary. To some extent, this may be an unavoidable evil, but it would be good if the document were pushing more for clear encoding identification (for which I think GetAddrInfoW() (UTF-16) would be an example).

   * (Martin Duerst): It may be a good idea to also look into the issue of escaped forms of domain names being fed into resolver APIs.
     One form of escaping is (UTF-8-based) %-encoding in URIs (and IRIs), which is allowed in URIs according to RFC 3986, is the only way to encode non-ASCII in the host part of a URI where punycode isn't appropriate, and may be the result of a conversion from an IRI to a URI. For further background and discussion, please see http://lists.w3.org/Archives/Public/public-iri/2009Aug/0012.html and http://lists.w3.org/Archives/Public/public-iri/2009Aug/0024.html and the followup discussion. Another potential kind of escaping is HTML/XML numeric character references (of the form "& #xABCD ;"), although I expect them to be less of a problem because they are used higher up in the application and usually removed early on.

4. IDNA Definitions

   * Information on Unicode is scattered throughout the document. Wouldn't it be much better to describe a clear sequence?

      * what an IDNA is,

      * what IDNs are,

      * what IDNA labels are,

      * what they are made of,

      * how Unicode supports them, including NFC in the same 2.1. section,

      * how a zone manager may impose profiling rules (description, enforcement).

   * Most of the new terms are discussed before being defined. This starts with the confusing "looking them up" in part 1.1.1. (which means resolving, and not just asking about validity or existence), as opposed to "registering". IDNs are introduced in 2.3.2.3, etc. This certainly reflects how difficult the work is in defining all these terms, but it is still quite confusing. For example, it is advisable to begin with part 4.4.

   * The different classes of domain names that are discussed only seem to be related to IDNA without an exhaustive presentation of the DNS domain name context. The names are somewhat confusing. The drafts are certainly clear, but they do not reflect a progressive logic of discovery of the nature of a name/label that could be ported to programming functions.
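
     To illustrate what such a "progressive logic of discovery" could look like once ported to a function, here is a minimal sketch in Python. It is not taken from the drafts: the names classify_label and classify_name and the class strings are hypothetical, and the checks are assumptions based only on the label classes discussed in this memo (LDH labels, R-LDH labels with "--" in the third and fourth positions, the "xn--" prefix, and the requirement that an A-label decode to a non-ASCII string and convert back to itself). It uses the "punycode" codec of the Python standard library and performs none of the IDNA validity checks (permitted code points, contextual rules, Bidi, NFC), so it can only report "candidates".

```python
import re

ACE_PREFIX = "xn--"
# LDH label: letters, digits, hyphens; 1 to 63 octets; no leading or trailing hyphen.
LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def classify_label(label: str) -> str:
    """Return a coarse class for one label (hypothetical helper, not from the drafts)."""
    if not label:
        return "empty"
    if not label.isascii():
        # Only a candidate: real U-label status needs the full IDNA validity rules.
        return "non-ASCII (U-label candidate)"
    if not LDH.fullmatch(label):
        return "non-LDH ASCII"
    if label[2:4] != "--":
        return "NR-LDH"
    if label[:4].lower() != ACE_PREFIX:
        return "R-LDH (reserved, not IDNA)"
    # XN-label: lowercase first, then require a Punycode round trip.
    payload = label[4:].lower()
    try:
        decoded = payload.encode("ascii").decode("punycode")
        reencoded = decoded.encode("punycode").decode("ascii")
    except ValueError:  # UnicodeError is a subclass of ValueError
        return "XN-label (not decodable)"
    if decoded.isascii() or reencoded != payload:
        return "XN-label (fails round trip)"
    return "XN-label (A-label candidate)"


def classify_name(name: str) -> list[str]:
    """Classify every label of a dotted name; the prefix test is per label."""
    return [classify_label(label) for label in name.split(".")]
```

     Working label by label means that classify_name("www.xn--tda.jp") reports the second label as an A-label candidate even though the name as a whole does not start with the prefix, which is the point raised about the "starts with xn--" heuristic in the comments on the IAB draft.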

   * References to the lower/uppercase image can be understood by DNS old-timers, but are confusing to newcomers, as it does not reflect the same functionality and because U-label/A-label lower/uppercase treatment is not the same.

   * Different keyboards and encodings are discussed, stressing that a DNS resolution calls for a U-label conversion, but nothing obliges local applications to transcode user entries to Unicode when they interoperate at a layer other than DNS. However, these applications may want to canonicalize these entries in their own way. Interplus supports the idea that an application layer may use an intermediate non-Unicode and non-ASCII coding. Among others, this facilitates interoperability with UTF-8 that Microsoft supports within private nets: the user interface may be common and the underlying machinery either IDNA or UTF-8.

   * 1.1.1 (Martin Duerst): Audiences - "what names are permitted in DNS zone files," -> "what names are permitted in DNS zones," (whether these are files or whatever is implementation-dependent). This section is very important, and would be much more effective with less circumlocution. Just use straightforward terms for the people/functions that everybody else names directly, such as 'registries', 'registrars', 'administrators creating subdomains', and so on, and then say that this list isn't exclusive. That will have the additional benefit of bringing the document up in more of the relevant searches. The second paragraph is also overly circumlocutory. Using "the one containing explanatory material" to refer to Rationale is a strong disservice to every reader, even if, strictly speaking, it may be preferable to a forward reference. Please use [] style references, or labels such as "Rationale" with a short sentence pointing to 1.3, or move the "who should read what" info to 1.3 with a general pointer from 1.1.1.
     But please stop talking around stuff that can be easily expressed more directly (this general comment applies in many other places, too).

   * 1.1.1 (Martin Duerst): should be 1.2, and 1.1.2 should be 1.3, and 1.3 should be 1.4, to simplify structure.

   * 2.1 (Martin Duerst): Say that 0x means hexadecimal (first para)

   * 2.3.1 (Martin Duerst): title: This looks as if this section defines one term, "LDH-Label". Change the title to something more general, such as "Definitions for ASCII-only Labels".

   * 2.3.1. (Paul Hoffman): Anchor 10 above Figure 1 indicates that the figure might be shrunk. I propose instead that the four footnotes simply be moved to immediately after the figure, which makes the figure itself fit on one page.

   * 2.3.1. (Paul Hoffman) In Figure 2, the terms "Binary Label (including high bit on)" and "Bit String Label" are not defined and are confusing without definition. Do we need this figure at all any more?

   * 2.3.1. (Martin Duerst): general (but most urgently 3rd para): Make sure that the terms defined stick out, at least the same way as in 2.1 (one para per def, defined word is first word of para). Move clear and simple definition to front, and rationale, relationships,... to the end of the paragraph.

   * 2.3.1. (Martin Duerst): Move normative text to Protocol ("those labels MUST NOT be processed as ordinary LDH-labels by IDNA-conforming programs and SHOULD NOT be mixed with IDNA-labels in the same zone")

   * 2.3.1. (Martin Duerst): 3rd para: "but which otherwise conform to LDH-label rules" -> "but otherwise conform to LDH-label rules"

   * 2.3.1. (Martin Duerst): 3rd para: "case-independent" -> "case-insensitive"

   * 2.3.1. (Martin Duerst): 3rd para: "divided in" -> "divided into"

   * 2.3.1.
     (Martin Duerst): "for future extensions that use extensions based on the same "prefix and encoding" model": a) 'extensions' is repeated; b) the IETF is great at not talking about future eventualities and describing general models that never may be used. In this and other sections, such stuff should also be cut out.

   * 2.3.1. (Martin Duerst): anchor10: I do not understand why we need the (1)..(4) notes. Either the definitions are clear enough, or they should be fixed. Something like "NON-RESERVED LDH LABELS (NR-LDH-labels) NR-LDH LABELS" is also total overkill. The only thing that's necessary is "NR-LDH labels", with exactly the same capitalization and hyphenation as in the definition.

   * 2.3.1. (Martin Duerst): Fig. 2: I'm somewhat confused here. Note (5) seems to suggest that U-labels have a fixed binary encoding (e.g. UTF-8) and are used directly in the DNS. Otherwise, the note doesn't make sense.

   * 2.3.2.1. (Martin Duerst): "While that constraint may be tested in any of several ways, an A-label must be capable of being produced by conversion from a U-label and a U-label must be capable of being produced by conversion from an A-label.": This puts the cart before the horse. Change to "An A-label must be capable of being produced by conversion from a U-label and a U-label must be capable of being produced by conversion from an A-label. There are several ways in which this constraint may be tested."

   * 2.3.2.1. (Andrew Sullivan): Para. 2.3.2.1: An "A-label" is the ASCII-Compatible Encoding (ACE, see Section 2.3.2.5) form of an IDNA-valid string. It must be a complete label: IDNA is defined for labels, not for parts of them and not for complete domain names. This means, by definition, that every A-label will begin with the IDNA ACE prefix, "xn--" (see Section 2.3.2.5), followed by a string that is a valid output of the Punycode algorithm [RFC3492] and hence a maximum of 59 ASCII characters in length.
     The prefix and string together must conform to all requirements for a label that can be stored in the DNS including conformance to the rules for the preferred form described in RFC 1034, RFC 1035, and RFC 1123. A string meeting the above requirements is still not an A-label unless it can be decoded into a U-label.

     So, to be less vague: the section is supposed to define certain terms, and that bullet ought to define "A-label". It does not. It tells us the necessary conditions for being an A-label, but not the sufficient. This could be remedied if the last sentence said instead, "If and only if a string meeting the above requirements can be decoded into a U-label, then it is an A-label." But I'm no longer sure that's true, given that we've lived with the I-D definition so long and yet not had it fully operationalized. Is there anything else? If there is, it needs to be added. These definitions, I say, must be completely operationalized (or else we have no excuse to call this document the definitions document). Since people have to write code on the basis of these definitions, they must be completely unambiguous.

     Author's current answer: These changes, with Paul's suggested modifications, have been tentatively accepted and incorporated in the document. Anyone who objects should say so quickly.

   * 2.3.2.1. (Martin Duerst): "Among other things, this implies that both U-labels and A-labels must be strings in Unicode NFC [Unicode-UAX15] normalized form.": A-labels are by definition in NFC, because they are ASCII-only. If you want to say that they must *represent* labels that are in NFC, that would be fine, but I think mentioning NFC here isn't really necessary.

   MAJOR!!!!!

   * 2.3.2.1. (Martin Duerst): says: "Any rules or conventions that apply to DNS labels in general, such as rules about lengths of strings, apply to whichever of the U-label or A-label would be more restrictive.
     For the U-label, constraints imposed by existing protocols and their presentation forms make the length restriction apply to the length in octets of the UTF-8 form of those labels (which will always be greater than or equal to the length in code points)."

     Now this is TOTALLY NEW to me. There sure is a restriction to 63 octets in the DNS itself, but because U-labels don't enter the DNS as such (neither as UTF-8 nor as UTF-16 or whatever), an arbitrary UTF-8-based length restriction seems totally unjustified. I'm not at all aware of such a restriction in IDNA2003. Indeed, punycode was explicitly designed, among other things, to perform well for scripts with few characters. For small scripts that need 3 bytes per character in UTF-8 (all Indic scripts, Georgian, Sinhala, Thai, Lao, Tibetan, Myanmar, Ethiopic, Cherokee, Unified Canadian Aboriginal Syllabics, Khmer, ...), this restriction would mean a drastic reduction of the number of characters usable in a label.

     To give an example, when at W3C, I created some IRI tests (http://www.w3.org/2001/08/iri-test/). The tests use Hiragana (http://www.???????.w3.mag.keio.ac.jp and http://???????.???????.???????.w3.mag.keio.ac.jp), which is atypical in that Hiragana-only Japanese is rarely used except in children's books, but it is typical in that punycode is able to represent 41 Hiragana (123 octets in UTF-8) in 58 octets. Hiragana overall contains about 80 letters in a single block; punycode efficiency will vary with the size of the script (more efficient for smaller scripts, less efficient for larger scripts) as well as of course with every individual label. Currently, all (on Windows) of IE7, Mozilla Firefox, Safari, and Opera pass both length tests (single label and multiple labels).
     It would be very counterproductive if IDNA2008 required further artificial restrictions which essentially disfavor languages and cultures that haven't been lucky to get short encodings for their scripts in UTF-8. (I'd be fine if the Security section warns about the potential of some protocols or implementations not having appropriate space, but that's on a completely different level.)

   * 2.3.2.1. (Andrew Sullivan): "A "U-label" is an IDNA-valid string of Unicode characters, in normalization form NFC and including at least one non-ASCII character, expressed in a standard Unicode Encoding Form (in an Internet transmission context this will normally be UTF-8)." The parenthetical remark, I think, encourages implementers not to recognise as U-labels strings that come in as (say) UTF-32, but that are otherwise perfectly valid. Who cares what is normal in an Internet transmission context, when we're defining terms? Why does that matter?

     Author's current answer: The comment was made because there is no requirement at all in IDNA (either 2003 or 2008) that UTF-8 be used; many applications on particular operating systems actually use something else (UTF-16 is most common). But I dropped the additional text. It now just says "(such as UTF-8)" as Paul suggested. Again, anyone who doesn't like this should speak up.

   * 2.3.2.1. (Andrew Sullivan): "To be valid, U-labels and A-labels must obey an important symmetry constraint. While that constraint may be tested in any of several ways, an A-label must be capable of being produced by conversion from a U-label and a U-label must be capable of being produced by conversion from an A-label. Among other things, this implies that both U-labels and A-labels must be strings in Unicode NFC [Unicode-UAX15] normalized form. These strings MUST contain only characters specified elsewhere in this document series, and only in the contexts indicated as appropriate."
     This passage nowhere actually says that _the_ A-label produced by conversion from a particular U-label must in turn produce, by the application of the algorithm, the _same_ U-label. There is a symmetry (though not an obvious one) in U[1] being convertible to A[2] which is convertible to U[2] which is convertible to A[1], for instance. I have no idea whether such is possible, but there's no reason our formal definitions need to allow for it. This can be fixed so:

        To be valid, U-labels and A-labels must obey an important symmetry constraint. While that constraint may be tested in any of several ways, an A-label A' must be capable of being produced by conversion from a U-label U', and that U-label U' must be capable of being produced by conversion from A-label A'. Among other things, this implies that both U-labels and A-labels must be strings in Unicode NFC [Unicode-UAX15] normalized form. These strings MUST contain only characters specified elsewhere in this document series, and only in the contexts indicated as appropriate.

     I don't care about the notation, as long as it is unambiguously clear that we're always talking about the "very same" label on both sides of the transformation. We could go so far as to define IDNA-equivalent A-labels and U-labels formally. I think this would do it:

        A-label1 and U-label1 are equivalent if and only if all the following four conditions are true:

        1. The encoding of A-label1 according to [RFC3492] results in U-label1.

        2. The decoding of U-label2 according to [RFC3492] results in A-label2.

        3. A-label1 is equivalent to A-label2 according to DNS matching rules for labels.

        4. U-label1 is bitstring equivalent to U-label2.

     Some may reject the above as a bit of needless formalism, or want to reduce some steps. I argue that this is the most basic and therefore most clear (but admittedly inelegant) formulation.
     As usual, however, I'm utterly prepared to admit that I've actually got the rule incorrect. But if I have, that amounts to a hint of trouble with the document, because I've managed to misunderstand it (and though I be dim, I have been following this effort).

   * 2.3.2.1 (Wil Tan): In Protocol section 5.3 (A-label Input), add the lowercasing step prior to using the Punycode decoding algorithm. The section on the symmetry constraint (-defs-10, section 2.3.2.1) should also have similar wording.

   * 2.3.1. Para5 (Wil Tan): Labels within the class of R-LDH labels that are not prefixed with "xn--" are also not valid IDNA-labels. To allow for future use of mechanisms similar to IDNA, those labels MUST NOT be processed as ordinary LDH-labels by IDNA-conforming programs and SHOULD NOT be mixed with IDNA-labels in the same zone.

     Author's current response: Unless, in the moving around of text, we have slipped up, it is important to note that the restriction here applies _only_ to IDNA-aware applications. That prevents it from being a restriction on the DNS generally. However, for IDNA-aware applications, it is a precaution against possible future prefix-altering changes as well as something of a mechanism for making it harder for bad guys to game future changes. If any non-IDNA arrangements come along that use "??--" label encodings, they will of course have to be coordinated with each other and with IDNA; in the interim, this provision keeps IDNA out of their way (i.e., avoids preempting such approaches). And, yes, the WG did discuss this at great length.

     "I may have missed it, but don't recall any discussions about restricting the processing of other tagged domains. Is this the right draft to prescribe restrictions on how non-XN-Labels are processed?"

     Author's current response: IMO, we are much too tied up in special definitions and confusing terminology already.
     Please let's not make it worse by introducing more unnecessary terminology in the form of "tagged domains". And it is the right place for defining how IDNA-aware applications handle R-LDH labels that are not valid A-labels, at least IMO and in the opinion of the mailing list the last two or three times we went through that topic.

   * 2.3.2.2. (Martin Duerst): NR-LDH-label and Internationalized Label: The section doesn't say anything about "Internationalized Label"s, although this term appears in the title. (the definition is in 2.3.2.3)

   * 2.3.2.3. (Martin Duerst): SRV record labels are not Internationalized labels, and therefore domain names used for SRV aren't IDNs. That's fine by me, but it should nevertheless be made clear (here or elsewhere) that IDNs can be used with SRV,... (this seems to be done at the end of 2.3.2.6, so this should be okay)

   * 2.3.2.4. (Martin Duerst): This seems to say that there is no equivalence between an all-lowercase A-label and an otherwise equal label where some letters (maybe accidentally) have been upper-cased. I think the cause of the problem is (as often in this document) the lack of consistent language. Instead of "and then testing for an exact match between the A-labels", say "and then testing for equivalence between the A-labels [using normal DNS matching rules]". If that's not what's intended, then some more background may be appropriate.

   * 2.3.2.5. (Martin Duerst): "a string of ASCII characters" -> "the string of ASCII characters"

   * 2.3.3. (Martin Duerst): "Because IDN labels may contain characters that are read, and preferentially displayed, from right to left,": Remove 'preferentially'. This maybe refers to some hopelessly broken systems, or to the fact that Arabic Braille is LTR, or something else, but is totally irrelevant and potentially misleading in this context.

   * 2.3.3.
     (Martin Duerst): Why doesn't this paragraph just refer to 'logical' representation, a term that people who know bidi are familiar with and that's widely used in Unicode?

   * 2.3.3. (Andrew Sullivan): Also, as an aside, it would be helpful if defs pointed out more explicitly in its Para. 2.3.3 that there are BIDI-only terms defined in the BIDI document and not in defs.

     Author's current answer: I moved it to the beginning. I don't think there's anything bidi-specific in -defs.

   * 2.3.4. (Martin Duerst): "There has been some confusion about whether a "Punycode string" does or does not include the ACE prefix and about whether it is required that such strings could have been the output of the ToASCII operation": a) The combination of 'required' and 'could' doesn't make ANY sense to me. b) It is unclear what "such strings" refers to (with ACE prefix? without ACE prefix?)

   * 2.3.4. (Martin Duerst): "much more clear" -> "much clearer"

   * 4. (Martin Duerst): There should be a very short paragraph saying that this section provides an overview and pointers into the security sections of the other documents. (or whatever else exactly the relationships are)

   * 4.1. (Martin Duerst): "In addition to characters that are permitted by IDNA2003 and its mapping conventions": Does this mean "In addition to characters that are permitted by (IDNA2003 and its mapping conventions)" or "In addition to characters that are permitted by IDNA2003 and [in addition to] its mapping conventions"? Please clarify.

   * 4.1. (Martin Duerst): "problems that might raise" -> "problems that might arise"

   * 4.1. "Security on the Internet partly relies on the DNS. Thus, any change to the characteristics of the DNS can change the security of much of the Internet." This sentence seems extremely confusing, as IDNA does not affect (change the characteristics of) the DNS but is rather built on the fact that they will not be changed.

   * 4.1.
Similarly: "The security of the Internet is compromised if a user entering a single internationalized name is connected to different servers based on different interpretations of the internationalized domain name." The security of the Internet is not compromised; however, trust in the IDNA proposition might be.
* 4.2. (Martin Duerst): "these specifications"? The IDNA2008 collection of specifications? Or the specifications for the local character sets?
* 4.2. (Martin Duerst): "(or different versions of one application)" -> "(or different versions or parts of one application)" (yes, this can and does happen)
* 4.4. (Martin Duerst): "comparisons be done properly, as specified in the Requirements section of [IDNA2008-Protocol]": If comparisons are dealt with in Protocol, what's the purpose of 2.3.2.4? And what's the purpose of trying to explain it all again just after the quoted sentence?
* 4.5. (Martin Duerst): "Despite that prohibition, there are a significant number of files and databases on the Internet in which domain name strings appear in native-character form;": This makes it appear as if such files and databases are in violation of some spec. But they may simply contain IRIs instead of URIs. I would simply start the subsection with something like "As long as IDNA2003 labels have been kept in A-label form, the only differences in interpretation arise for characters whose ..." and then, in a new paragraph, continue "For IDNA2003 labels that have been kept in native encoding,..."
* 4.7. The Summary might be considered adventurous? Corporations such as Nominum propose services that are supposed to protect the DNS. One of the purposes of ML-DNS is precisely to permit an architectural protection.
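The comparison question raised in the comments on 2.3.2.4 and 4.4 above can be illustrated with a minimal sketch. It assumes, as the comment on 2.3.2.4 suggests, that A-labels are compared with the normal DNS rule (ASCII case-insensitive matching). The function name a_labels_equivalent is hypothetical and does not come from any of the IDNA documents.

```python
def a_labels_equivalent(a: str, b: str) -> bool:
    """Compare two A-labels case-insensitively (assumed DNS matching rule)."""
    # A-labels are pure ASCII; refuse anything else rather than guess.
    if not (a.isascii() and b.isascii()):
        raise ValueError("A-labels must be ASCII")
    # On ASCII-only input, str.lower() only folds A-Z to a-z.
    return a.lower() == b.lower()


print(a_labels_equivalent("xn--bcher-kva", "XN--Bcher-KVA"))  # True
print(a_labels_equivalent("xn--bcher-kva", "xn--bcher-kvb"))  # False
```

Under this assumption an accidentally upper-cased A-label still matches its all-lowercase form, which is the behavior the comment on 2.3.2.4 asks the text to state explicitly.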
* Acknowledgments (Andrew Sullivan): "As is usual with IETF specifications, while the document represents rough consensus, it should not be assumed that all participants and contributors agree with all provisions," I don't feel comfortable with starting to make the Acknowledgements section a platform for disclaimers about WG consensus. I object pretty strongly to this addition. I don't think we're served well by trying to state in any document how rough the rough consensus is: the document either has to stand through the IETF process, or not. Besides, this evaluation is a prerogative of the Chair, the ADs, and the IESG. If this sort of disclaimer is needed, it ought to be added by the IESG (and even then I would object). I would like the sentence to be removed.

5. IDNA Rationale

* 1.1 (Andrew Sullivan): bugs me: "Traditionally, DNS labels are matched case-insensitively [RFC1034][RFC1035]. That convention was preserved in IDNA2003 by a case-folding operation that generally maps capital letters into lower-case ones. However, if case rules are enforced from one language, another language sometimes loses the ability to treat two characters separately. Case-sensitivity is treated slightly differently in IDNA2008." It makes it sound as though there's just a kooky tradition in the DNS, and we could fix that up. But that's not true: the matching rules are _defined_ to be case insensitive, so changing that would be a protocol change to DNS. Also, the text slides a little too easily between what are different contexts of "label" here, and I think it could be a source of confusion. I suggest this instead:
  The DNS matching rules for DNS labels are case-insensitive [RFC1034][RFC1035]. That convention was preserved for internationalized labels in IDNA2003 by a case-folding operation that generally maps capital letters into lower-case ones. However, if case rules are enforced from one language, another language sometimes loses the ability to treat two characters separately.
Case-sensitivity is treated slightly differently in IDNA2008.
* 1.3.1. DNS "Name" Terminology. "" would be better read as "orthotypographic" as an orthographic error that can be a way to lose some special semantics differences due to orthotypographic conventions.
* 1.3.2. "IDNA-landr" typo?
* 1.4. "Reduce the dependency on mapping, in order that the pre-mapped forms (which are not valid IDNA labels) tend to appear less often in various contexts, in favor of valid A-labels." calls for the Charter to be revised. Alternatively, it could say: remove dependence on mapping as per a mapping document, in which case this document would include a section on the various ways to ensure DNS security and the barring of some U+codes in some presentations.
* 1.5. "This model has served the existing applications well, but it requires, with or without internationalized domain names, that users know the exact spelling of the domain names that are to be typed into applications such as web browsers and mail user agents. The introduction of the larger repertoire of characters potentially makes the set of misspellings larger, especially given that in some cases the same appearance, for example on a business card, might visually match several Unicode code points or several sequences of code points." may be read as if the users of these languages were more prone to errors than users of ASCII-based languages.
* "If an application wants to use non-ASCII characters in public DNS domain names, IDNA is the only currently-defined option." IDNA is not a DNS option. It is an application-level way to transcode Unicode domain names into LDH domain names for the convenience of ASCII-oriented international managers. The idea is to attain the adherence of local users and managers to IDNA and not to impose ASCII on them. DNS is UTF-8 compatible.
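The point made in the last comment above, that IDNA is an application-level transcoding of Unicode labels into LDH labels rather than a DNS feature, can be illustrated with a minimal sketch. It assumes Python's built-in "punycode" codec (an implementation of RFC 3492) and the "xn--" ACE prefix. The function name to_a_label is hypothetical, and the sketch deliberately performs none of the validation or mapping steps discussed in the IDNA documents.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Transcode a Unicode label to its ASCII-compatible form (sketch only)."""
    # Pure-ASCII labels are left untouched in this sketch.
    if u_label.isascii():
        return u_label
    # The punycode codec does not add the ACE prefix; we prepend it here.
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


print(to_a_label("bücher"))   # xn--bcher-kva
print(to_a_label("example"))  # example
```

The DNS only ever sees the resulting LDH string; everything else happens in the application.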
* "IDNA2008 divides all possible Unicode code-points into four categories: PROTOCOL-VALID, CONTEXTUAL RULE REQUIRED, DISALLOWED and UNASSIGNED. 3.1.1. PROTOCOL-VALID Characters identified as "PROTOCOL-VALID" (often abbreviated "PVALID") are permitted in IDNs." Are we talking of code-points or of characters?
* 3.1.2.1 Not in the TOC
* 3.1.3 Disallowed "various HEART symbols" - is U+38FA also disallowed? or U+3966?
* 3.1.3. This is the first time anyone has spoken of NFKC. In IDNA Defs and other cases, it is NFC. Shouldn't both of them be documented? Shouldn't someone explain in which specific case one is used?
* "The character is an upper-case form or some other form that is mapped to another character by Unicode casefolding." This seems to create a very large mapping scheme that depends on a non-documented Unicode system needing correction (at least when it does not specifically support majuscules). Moreover, are we dealing with characters (that are orthogonal to Unicode) or with codepoints that represent characters and that are subject to Unicode casefolding? The proposition is to: (1) clarify the character/codepoint issue, (2) explain what Unicode case folding is and its limitations, (3) move them to CONTEXTO when these codepoints are both used as upper-cases and as majuscules, (4) explain that majuscules that are supported by upper-cases will be transcoded by punycode.
* 4.4. Case mapping. One may regret that the current Unicode support of French majuscules, which is perfectly adequate in other circumstances yet inadequate in this case, is not discussed. This would explain the upgrade above.
  Author's current answer: First, you want certain characters treated differently for some languages that use a given script than for others that use the same script.
That is nearly impossible to think about, just because there is, in general, no way to know what language a particular label is supposed to be associated with, nor is there a way to know what top-level domain has the label in one of its subtrees (even if one could reliably associate top-level domains with languages). -- Second, if I understand your latest note correctly, you would like to have those characters treated via some contextual rule ("CONTEXTO"). But the contextual rules yield either "valid" or "invalid" based on adjacent or nearby characters -- they do not provide different mappings, nor different rules for different languages (the latter at least partially for the reasons above). -- And, finally, your suggestion requires treating capital letters (or at least some capital letters) as distinct from their lower-case forms, which would create massive inconsistencies with IDNA2003 (not just the two characters of inconsistency with which we have had such extensive debates) as well as inconsistencies with DNS and host table practices that go back to the 1970s. No matter how strong your justification, and even if it were not also tied to differential treatment for a particular language, I cannot imagine the WG (or the IETF more broadly) agreeing to that change.
* 4.5. "Examples of this are Yiddish, written with an extended Hebrew script, and Dhivehi (the official language of Maldives) which is written in the Thaana script (which is, in turn, derived from the Arabic script)" It seems that some explanation about Yiddish would be welcome so that the language will obtain the same support as Dhivehi and Thaana.
* 5. "Conversely, lookup applications are expected to reject labels that clearly violate global (protocol) rules (no one has ever seriously claimed that being liberal in what is accepted requires being stupid)."
The remark between the parentheses is confusing: it possibly qualifies as "stupid" a behavior that is not recommended, but that is acceptable by the document set.
* "Application implementors should be aware that where DNS wildcards are used, the ability to successfully resolve a name does not guarantee that it was actually registered." In which terms is this specific to IDNA?
* 7.6. The Symbol Question. That part actually discusses the Unicode originated difficulties. Yet, the choice of Unicode has not yet been discussed.
* 8.2. (Andrew Sullivan) the reference for DNSSEC should probably refer to DNSSECbis, which is usually [RFC4033], [RFC4034], [RFC4035], since RFC2535 is obsolete. One can make a strong argument for also including NSEC3 [RFC5155], but I don't feel too strongly about that.
* 9. "Adding languages (or similar context) to IDNs generally, or to DNS matching in particular, would imply context dependent matching in DNS, which would be a very significant change to the DNS protocol itself". This sentence seems confusing. Natural languages are quoted throughout this IDNs document.
* 12. (Andrew Sullivan): Here is where I make my now-familiar plea for the removal of the following sentence "As is usual with IETF specifications, while the document represents rough consensus, it should not be assumed that all participants and contributors agree with all provisions."

6. IDNA Mapping

* (Andrew Sullivan): In general, I think the document is in reasonably good shape. I was hoping, however, for it to have some advice to registries on what sort of considerations are worth taking into account when formulating policies around what will and will not be registered. In particular, I think it would be very helpful to outline what it would mean for characters to be possibly mappable, and also to outline the different strategies that are available for preventing different alternative mappings from ending up with a different resolution.
The basic advice is that registries should try to detect whether a candidate label may be expected to be the result of some (plausible) application of a mapping appropriate in the probable user community, and similarly that when faced with a candidate IDN, registries should consider the probable user community and consider the plausible applications of mappings appropriate to that community. If the registry does not have the expertise to evaluate the probable user community for a given code point, then it should simply reject the code point outright. (This is, in effect, the advice that you should have a policy, it should be a policy based on knowledge of the use cases, and that it should default closed.) After the registry has detected whether the candidate label may be the result of such a mapping, there are two basic strategies it might follow:
  1. Detect and reserve. In this case, the registry detects potential mappings, and reserves other candidate labels that might be the result of such mappings. This reservation takes the form of preventing registration of that label.
  2. Detect and bundle. In this case, the registry detects the potential mappings, and creates identical entries in the registry conforming to those "alternative forms" of the candidate label. There is the potential for a very large number of these bundled labels.
* (Andrew Sullivan): There's also a great deal of material at the end of rationale that would be more appropriate in this document (or else it should just go away, I think).
* (Andrew Sullivan): I think the document is ready to go, assuming this is what we want to say. Yet its content is a little flabby as recommendations go. I can imagine a reader being a little surprised at this advice, for example: "These are a minimal set of mappings that an application should strongly consider doing. Of course, there are many others that might be done." That boils down to, "You might want to do this.
Or not. Or something else. Up to you." I know why we're saying this, but I would not be surprised if people object to such thin advice. If that's all the advice we want to offer, however, this is the right document and it should go ahead.
* Not sure that the terminology of "make sense" is adequate or clear.
* 1. Introduction - This document is supposed to be separated from the IDNA document set. It should then document what the IDNA protocol is. It seems that the IDNA2008 protocols boil down to "DNS domain names are to be expressed in LDH form. IDNA is a commonly agreed upon convention wherein if they are entered by the user in another form, applications are advised to convert them to UTF in order to filter and map them, as is discussed in the present document, as well as to transcode them by using the Punycode algorithm. Depending on the Registry policy, their registration can be carried out in the UTF and/or the transcoded ASCII form."
* 2.3. NFC is confirmed, NFKC is not discussed.

7. IDNA Protocol

* As a general comment:
  * The SHOULD/MUST chains may be somewhat awkward. MUSTs are used in a protocol procedure and then an alternative to that procedure is pragmatically considered. It could be of interest to draft a MUST tree to consider which cases are, or are not, covered.
  * There is some confusion as to what the "string" is compared to the label and domain name, in which "Label" may be used instead of "U-Label" or sometimes "A-Label". Wouldn't it be better to review the text, qualifying the "labels" in order to be certain that all the cases are clearly covered?
* 1.para 1: (Martin Duerst): I missed the term IDNA2003 in Defs, it would have been useful. I didn't complain because I thought it had been 'deprecated'. Seeing it here, I think it should go back to Defs, and be actively used in Defs and Protocol at least, to simplify and clarify prose.
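The NFC/NFKC question raised in the comments above (Mapping 2.3, and earlier Rationale 3.1.3) is easier to discuss with a concrete case. The following is a minimal sketch using Python's standard unicodedata module; the two sample strings are illustrative choices, not taken from the IDNA documents.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI

# NFC composes canonically equivalent sequences into one code point.
print(unicodedata.normalize("NFC", decomposed) == "\u00e9")  # True

# NFC leaves compatibility characters such as the ligature alone...
print(unicodedata.normalize("NFC", ligature) == ligature)    # True

# ...whereas NFKC also applies compatibility mappings.
print(unicodedata.normalize("NFKC", ligature) == "fi")       # True
```

NFC only changes how a character is encoded, while NFKC can replace one character with different ones, which is why it matters which of the two a given step is specified to use.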
* 1.para 2: (Martin Duerst): "does not changes" -> "does not change"
* 1.para 2: (Martin Duerst): "IDNA does not depend on any changes to DNS servers, resolvers, or protocol elements" -> "IDNA does not depend on any changes to DNS servers, resolvers, or DNS protocol elements" or "IDNA does not depend on any changes to DNS (servers, resolvers, or protocol elements)" (Otherwise, it's possible to understand 'protocol elements' as being not limited to DNS.)
* 1.para 4: (Martin Duerst): ", that share some terminology, reference data and operations." -> "These two protocols share terminology, reference data and operations."
* 2.para 1: (Martin Duerst): "Terminology used in IDNA, but also in Unicode or other character set standards and the DNS, appears in [IDNA2008-Defs]." -> "Terminology used in IDNA appears in [IDNA2008-Defs]." (where else these terms are used, or where they are from, can be explained in Defs where necessary, but is absolutely irrelevant here)
* 2.para 1: (Martin Duerst): "Terminology that is required as part of the IDNA definition, including the definitions of "ACE", appears in that document as well." -> remove (first, the word 'definition' is used with two slightly different meanings, and second, I don't see the point of singling out "ACE".)
* 3.1. Requirement 2: (Martin Duerst): Equivalence is already defined in Defs. Please make sure there is only a single definition.
* 3.1. Requirement 2: (Martin Duerst): Why is it a MUST that U-Labels are compared without case-folding (even for ASCII chars?) or other steps?
* 3.1. Requirement 2: (Martin Duerst): "In many cases, validation may be important for other reasons and SHOULD be performed.": Is this restricted to when trying to compare? Or in general?
* 3.1. Requirement 3: (Martin Duerst): This does double duty, and should be removed. The alternative is to convert 3.1 into a conformance section as usual e.g.
for ISO standards, but then a lot more rewriting will be necessary in all of 3.1.
* 3.2. "It does not apply to domain name slots which do not use the Letter/Digit/Hyphen (LDH) syntax rules." Confusing. Would some of the DN slots not accept both?
* 3.2. para 1: (Martin Duerst): "IDNA applies to": What does "applies to" mean?
* 3.2. para 2: (Martin Duerst): "Because it uses the DNS, IDNA applies" -> "Because IDNA uses the DNS, it applies", or even better "Because IDNA uses the DNS, IDNA applies" (repetitions don't hurt in standards, reference before referent does)
* 3.2. para 2: "unless those protocols and implementations of them" -> "unless those protocols and their implementations"
* 3.2. para 2: (Martin Duerst): "be aware of IDNs in Unicode" -> "be aware of IDNs" (whether they are in Unicode or not is irrelevant here)
* 3.2.1. The word CLASS only appears in the whole document set in two sentences: "IDNA applies only to domain names in the NAME and RDATA fields of DNS resource records whose CLASS is IN. See RFC 1034 [RFC1034] for precise definitions of these terms. The application of IDNA to DNS resource records depends entirely on the CLASS of the record, and not on the TYPE except as noted below." What about internationalized domain names in a non-IN CLASS?
  Author's answer: I have received no further input on this and will assume that the current text is ok unless I do.
  Additional comment: My reading of that text was that it was a _restriction_, not a claim of fact. In other words, I interpreted that text as saying that, if a new class is invented, IDNA (if it were specified) would need to be specified separately for that class. By definition, the only place domain name labels can appear is in the NAME and RDATA fields of resource records. It's important to remember that the RDATA field can have subfields (as it does in the SOA record).
I think that's clear enough from the rest of the discussion, and because the documents already say, "You need to understand DNS too."
* 3.2.1. (Paul Hoffman): Section 3.2.1 says: "IDNA applies only to domain names in the NAME and RDATA fields of DNS resource records whose CLASS is IN." It would be good for the DNS-centric folks in the WG to verify that they think that this restriction is correct. Are there really no other fields where domain labels would appear?
* 3.2.1. para 1: (Martin Duerst): The first paragraph reads as if IDNA applied to domain names in e.g. TXT records in CLASS IN. I think it would help here to say exactly what is meant by "IDNA applies". In some sense, IDNA applies nowhere in DNS records, they are all just ASCII. In some sense (labels starting with xn-- are presumed to be IDNA labels; you can add an IDN (or a label thereof) to a DNS record by using A-labels), IDNA applies.
* 3.2.1. para 2: (Martin Duerst): The SRV discussion has significant overlap with Defs, please reduce.
* 4. "This section defines the procedure for registering an IDN. The procedure is implementation independent; any sequence of steps that produces exactly the same result for all labels is considered a valid implementation." A procedure does provide but does not define a result?
* 4. para 1: (Martin Duerst): "defines *the* procedure" ... : This would work better if there were really only one procedure, and it were written as a procedure. However, there are often variations, and different, often non-procedural ways in which things are expressed (e.g. 'labels must ...' instead of 'if a label doesn't satisfy x, abort')
* 4. para 2: (Martin Duerst): "the registration and lookup protocols (Section 5)" -> "the registration protocol (this section) and the lookup protocol (Section 5)" (shortcuts are the enemies of specifications)
* 4. para 2: (Martin Duerst): "while ...
are very similar in most respects, they are different" -> "while ... are very similar in most respects, they are not identical"
* 4. para 2: (Martin Duerst): "follow the appropriate steps": appropriate appeals to value judgement, which isn't adequate here.
* 4.1. The obligation chain reads: "By the time a string enters the IDNA registration process [], it is expected to be in Unicode []", yet "registries [] SHOULD avoid any possible ambiguity by accepting registrations only for A-labels []."
* 4.1. (Andrew Sullivan): "By the time a string enters the IDNA registration process as described in this specification, it is expected to be in Unicode and MUST be in Unicode Normalization Form C (NFC [Unicode-UAX15])." The "expected to be" part is redundant, since we have a subsequent MUST.
* 4.1. (Andrew Sullivan): I found this slightly confusing: "The registry SHOULD permit submission of labels in A-label form and is encouraged to accept both the A-label form and the U-label one. If it does so,". The "does so" reference there is ambiguous: is it the submission of A-labels or the A-label+U-label case. The subsequent text suggests it's the former.
* 4.1. title: (Martin Duerst): Why suddenly "Process" instead of "Procedure"? Why not just "Input"? And why singular in the title, and then plural in the first line of the text?
* 4.1. (Martin Duerst): "are outside the scope of these protocols": How many protocols are there? Only one that's relevant here.
* 4.1. (Martin Duerst): Why is NFC a condition on the input? Please make it a validation step afterwards, to streamline things.
* 4.1.
(Martin Duerst): "Entities responsible for zone files ("registries") are expected to accept only the exact string for which registration is requested, free of any mappings or local adjustments.": It's clear to me what we want here, but it's much better to write this as a condition on the later processing, rather than on input, something like: "Entities responsible for zone files ("registries") MUST NOT apply any mappings or local adjustments of any kind to the exact string for which registration is requested."
* 4.1. (Martin Duerst): "They SHOULD avoid any possible ambiguity by accepting registrations only for A-labels, possibly paired with the relevant U-labels so that they can verify the correspondence." This has to be improved. First, the SHOULD doesn't belong on the reason, and the reason, if anywhere, belongs at the end. Second, there are three possible ways input can come in, so let's list things up:
  "Entities responsible for zone files ("registries") MAY accept input in any of three forms:
  1) As a pair of A-label and U-label
  2) As an A-label only
  3) As an U-label only.
  1) and 2) are RECOMMENDED because the use of A-labels avoids any possibility for ambiguity."
  (the first sentence in 4.2.1 can then be removed)
* 4.2.1. (Martin Duerst): This is a complex jungle of conditions on input, conversions,... What should be done is: a) extract the 'raw' (without any preconditions) U-label->A-label and A-label->U-label 'functions' into subsections e.g. in Section 3; these will serve as building blocks both in Section 4 and Section 5. b) As the first step of the registration procedure, make sure we have both an A-label and an U-label. One way to write this is:
  "4.2.1: Preprocessing
  1) If the input contained an A-label and a U-label, check that they are equivalent (or whatever that was called; the conditions are somewhere in Defs). If the check fails, abort registration.
  2) If the input contained an A-label, but no U-label, calculate the U-label according to @@@.
  3) If the input contained an U-label, but no A-label, calculate the A-label according to @@@."
  The above makes sure we have both an A-label and an U-label from here on. Checking on these can be performed independently (e.g. length check on A-label, NFC check on U-label). Conversion to punycode is no longer needed in 4.4, because we simply put the A-label we have now into the zone (assuming we have passed all the checks up to here, of course).
* 4.2.1. (Martin Duerst): (probably not needed anyway) "both the A-label form and the U-label one" -> "both the A-label form and the U-label form"
* 4.2.1. (Martin Duerst): Word the A-label checks more clearly, and create section "4.2.2 A-label Validation"
* 4.2.3.2. "a combining mark or combining character" -> "a combining character" (combining marks are a special case of combining characters, and as such irrelevant here)
* 4.2.3.3. (Martin Duerst): "To check this, each code-point marked as CONTEXTJ and CONTEXTO in [IDNA2008-Tables] MUST have a non-null rule." Is this a requirement on Tables? Are there "null rules"? What purpose do they serve, what's the difference between them and DISALLOWED?
* 4.2.3.4. (Martin Duerst): What are "characters written from right to left"? Either we define this clearly here, or we leave it (or put it) in Bidi, but then we have to rewrite the sentence here (just requiring conformance to the conditions in Bidi).
* 4.2.4. (Martin Duerst): This is totally unnecessary, please remove. If we need a summary for what's essentially just about a page of text, we better give up.
* 4.3. Registry restriction inheritance is not alluded to.
* 4.3.
(Martin Duerst): "Policies are likely to be informed by the local languages" -> "Policies are likely to be informed by the local scripts and languages" (IDNs are mostly a script issue, much less a language issue. ICANN has fixed their documents to avoid only talking about languages (they still could move a bit further to scripts), so let's not commit the same mistake here again.)
* 4.3. (Martin Duerst): "or the application of special restrictions to others": like what? Like that such a label can only be resolved on Tuesdays?
* 4.4. (Martin Duerst): The generic parts of the conversion need to go somewhere else (Section 3?). The actual conversion (or checking) needs to go at the start of 4.2. Then this section is empty and can be removed.
* 4.5. (Martin Duerst): "The A-label is registered in the DNS by insertion into a zone." -> "The label is registered in the DNS by inserting the A-label into a zone." (distinguish registration of the abstract thing from insertion of the concrete thing)
* 5. Does this repetition (already in IDNA Rationale) "the presence of wild cards in the DNS might cause a string that is not actually registered in the DNS to be successfully looked up." reflect what the BIDI document states slightly differently: "Wildcards create the odd situation where a label is "valid" (can be looked up successfully) without the zone owner knowing that this label exists. So an owner of a zone whose name starts with a digit and contains a wildcard has no way of controlling whether or not names with RTL labels in them are looked up in his zone."?
* 5. (Martin Duerst): para 2: "The two steps described in Section 5.2 are required.": Superfluous. Make sure there's a MUST at the right place in that section. (Looking at 5.2, I have no clue what the two steps should be. This shows that indirect requirements like the above are rather unhelpful.)
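The "4.2.1: Preprocessing" procedure proposed in the comment on 4.2.1 above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in "punycode" codec for the raw conversions, assumes the U-label is already lowercase and otherwise valid, and treats "equivalent" as "the U-label converts to the lowercased A-label". The names u_to_a, a_to_u and preprocess are hypothetical; the "@@@" placeholders in the comment are not resolved here.

```python
from typing import Optional, Tuple

ACE_PREFIX = "xn--"


def u_to_a(u_label: str) -> str:
    """Raw U-label -> A-label conversion (no validation)."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def a_to_u(a_label: str) -> str:
    """Raw A-label -> U-label conversion (no validation)."""
    lowered = a_label.lower()
    if not lowered.startswith(ACE_PREFIX):
        raise ValueError("label does not carry the ACE prefix")
    return lowered[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def preprocess(a_label: Optional[str], u_label: Optional[str]) -> Tuple[str, str]:
    """Return an (A-label, U-label) pair for any of the three input forms."""
    if a_label is not None and u_label is not None:
        # Form 1: both given; check that they correspond, else abort.
        if u_to_a(u_label) != a_label.lower():
            raise ValueError("A-label and U-label do not correspond")
        return a_label.lower(), u_label
    if a_label is not None:
        # Form 2: A-label only; calculate the U-label.
        return a_label.lower(), a_to_u(a_label)
    if u_label is not None:
        # Form 3: U-label only; calculate the A-label.
        return u_to_a(u_label), u_label
    raise ValueError("no label supplied")


print(preprocess(None, "bücher"))         # ('xn--bcher-kva', 'bücher')
print(preprocess("XN--BCHER-KVA", None))  # ('xn--bcher-kva', 'bücher')
```

After this step both forms are available, so the later checks (length on the A-label, NFC on the U-label) can be written independently of how the input arrived.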
* 5.1. (Martin Duerst): first paragraph: Although IDNs will often get extracted from IRIs or URIs, there are many cases where these constructs are not involved. Examples would be telnet or ping commands, and so on. So IRIs and URIs should be deemphasized more.
* 5.1. (Martin Duerst): "Processing in this step and the next two are local matters, to be accomplished prior to actual invocation of IDNA.": Again, which steps? Before, we supposedly had two steps in 5.2, now it looks as if we are talking about 5.2 and 5.3 as two steps. -> Create a subsection such as "Input preparation" or similar, where all the preliminary material goes. Alternatively, talk about subsections, with subsection numbers for clear identification.
* 5.2. The case of a character that is not supported by Unicode is not discussed.
* 5.2. (Martin Duerst): "is not already Unicode" -> "is not already in Unicode" (in parallel to 'into' in the line before)
* 5.2. (Martin Duerst): "A Unicode string may require normalization as discussed in Section 4.1.": There is no "discussion" in 4.1 (and no need for discussion). Express the requirements here independently of Section 4.
* 5.3. (Wil Tan): The A-label Input section (5.3) should add the lowercasing step prior to using the Punycode decoding algorithm. The section on the symmetry constraint (-defs-10, section 2.3.2.1) should also have similar wording.
* 5.3. (Martin Duerst): (just checking) "See the Name Server Considerations section of [IDNA2008-Rationale] for additional discussion on this topic.": From the context, Name Server doesn't look related (we are client-side here).
* 5.3. (Martin Duerst): "That conversion and testing SHOULD": Replace 'That' with something clearer and more precise.
* 5.3. (Martin Duerst): para 2: List up the alternatives that are possible. Avoid mishmash textual paragraphs.
* 5.4. The use of "U-Labels" in this part instead of "Labels" would probably clarify it.
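The lowercasing step requested in the comment on 5.3 above, together with the symmetry constraint and the NFC requirement it mentions, can be sketched as follows. This is a minimal illustration, assuming Python's built-in "punycode" codec and the standard unicodedata module; the function name check_a_label_input is hypothetical and the sketch omits all table-based checks.

```python
import unicodedata

ACE_PREFIX = "xn--"


def check_a_label_input(label: str) -> str:
    """Return the U-label for an A-label input, or raise ValueError."""
    # Lowercase first, so that "XN--..." and "xn--..." are handled alike.
    lowered = label.lower()
    if not lowered.startswith(ACE_PREFIX):
        raise ValueError("missing ACE prefix")
    try:
        u_label = lowered[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except UnicodeError as exc:
        raise ValueError("not valid Punycode") from exc
    # The decoded label is expected to be in NFC.
    if unicodedata.normalize("NFC", u_label) != u_label:
        raise ValueError("decoded label is not in NFC")
    # Symmetry: converting back must reproduce the lowercased input.
    if ACE_PREFIX + u_label.encode("punycode").decode("ascii") != lowered:
        raise ValueError("label does not round-trip")
    return u_label


print(check_a_label_input("XN--Bcher-KVA"))  # bücher
```

Without the initial lowercasing, the final round-trip comparison would reject an upper-cased but otherwise correct A-label.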
* 5.4. (Paul Hoffman): Section 5.4 assumes that an application knows the version of Unicode that is being used in the application. We should state that assumption in 5.4 or maybe further up near the beginning of section 5.
  Author's answer: It assumes that either the application or the operating system or library support keeps the two consistent. That range of options is the reason why Section 5.4 is not more explicit about which particular software elements or modules know what. If this is to be changed, I need suggestions about textual fixes that do not imply that the knowledge must be in the application itself.
  Further comment: The first bullet in 5.4 is the first time that "version of Unicode" is mentioned, so the note is probably most effective right there. I propose adding: This requirement means that the application must use a list of unassigned characters that is matched to the version of Unicode that is being used for the other requirements in this section. It is not required that the application know which version of Unicode is being used; that information might be part of the operating environment in which the application is running.
* 5.4. (Paul Hoffman) The paragraph in section 5.4 that starts "This test may..." is out of date because the rules in the Bidi document no longer do inter-label checking. The whole paragraph can be removed.
* 5.4. (Paul Hoffman) In the light of this, does the WG want to change the requirement level for checking Bidi on lookup from SHOULD to MUST? Given the above, I see no reason why not.
* 5.4. "applying the test is likely to give much better information about the reason for a lookup failure -- information that may be usefully passed to the user when that is feasible -- than DNS resolution failure information alone" Might this lead to the idea that they could also be carried in case of the failure to better document it?
* 5.4.
"For all other strings, the lookup application MUST rely on the presence or absence of labels in the DNS to determine the validity of those labels and the validity of the characters they contain". Is it correct to assume that the first labels stand for "A-Label" and the second one stands for "their corresponding U-Labels"? * 5.4. (Martin Duerst): para 1: Mishmash again. Most of this para is best removed. * 5.4. (Martin Duerst): para 1: "Putative labels": Both in Section 4 and 5, labels are for the most part putative, because they don't conform to the definitions unless checked. Either before section 4, or once at the start (Input subsection) of both section 4 and section 5, say that for the most part, we are dealing with putative labels, but 'putative' isn't repeated all the time to make the text easier to read. * 5.4. (Martin Duerst): page 12: Finally a bullet list. I almost thought that the author didn't know how to create bullet lists, or was of the opinion that bullet lists don't have a place in spec. Quite to the contrary, please make sure there are much more bullet lists. It will make everything much easier to read and clearer. * 5.4. (Martin Duerst): "Labels that are not in NFC form as defined in [Unicode-UAX15].": There is only one definition of NFC, but the sentence suggests there are several. Please change to "Labels that are not in NFC [Unicode-UAX15]." * 5.4. (Martin Duerst): Please move bullet 1 (UNASSIGNED) and bullet 4 (DISALLOWED) and all the other table-related bullets together. I think it's best to put UNASSIGNED last (and mention that this is the category most subject to change). * 5.4. (Martin Duerst): Streamline the wording used to refer to Tables and a category. Currently, we have: "in the UNASSIGNED category of [IDNA2008-Tables]" - "in the "DISALLOWED" category in the permitted character table [IDNA2008-Tables]" that are identified in [IDNA2008-Tables] as "CONTEXTJ" * 5.4. 
(Martin Duerst): "Labels whose first character is a combining mark (see Section 4.2.3.2).": Refer directly to the relevant Unicode definition, rather than to section 4.2.3.2 (which contains a MUST, which is already implicit here).

* 5.4. (Martin Duerst): "In any event, lookup applications should avoid attempting to resolve labels that are invalid under that test.": Remove. We already have a SHOULD, no need for a should on top of that.

* 5.4. (Martin Duerst): last para: I assume this is e.g. about labels with mixed scripts,... What it essentially seems to say is that a browser may warn users if it detects mixed scripts, but if the user still wants to see the page, s/he is entitled to it. In such a context, the word 'validity' seems quite a bit out of place; it would be better to speak about 'other tests' or some such in a more general way.

* 5.5. (Martin Duerst): para 1: "using the Punycode algorithm (with the ACE prefix added)": The parenthetical seems to suggest that addition or not of the ACE prefix is an (optional) part of the Punycode algorithm, but RFC 3492 does not define the prefix, nor is the addition of the prefix part of the Punycode algorithm. -> Convert parenthetical to a clause or sentence ("... and then adding the ACE prefix." or so).

* 5.5. (Martin Duerst): rest from second sentence in para 1: As said in my comments on Section 4, a summary is unnecessary. Also, it has nothing to do with Punycode conversion. In addition, the second bullet point is confusing, because an A-label (checked or not) cannot be Punycode-converted again. -> remove

* 5.6. (Martin Duerst): "That ... string" -> "The string resulting from the conversion in Section 5.5"

* 5.6. (Martin Duerst): "That lookup" -> "The lookup"

* 5.7.
(Martin Duerst): What about (streamlined): "Security Considerations for this version of IDNA are described in [IDNA2008-Defs], except for the special issues associated with right to left scripts and characters, which are discussed in [IDNA2008-BIDI]."

* 7. IANA Considerations - There is no commitment from Unicode not to update those Unicode documents that are accepted as normative in the IDNA documentation set. Should a copy of them, as of the time of publication of this set, not be stored by the IANA?

* 8. (Andrew Sullivan): I suggest "This second-generation version would not have been possible without the work that went into that first version, due to its authors." Or something like that.

* 8./9. (Martin Duerst): These should be merged. The text explains it all.

* 8. (Martin Duerst): "Hoffman and Costello ... should not be held responsible for any errors or omissions.": Remove, this is implicitly clear; in the end it's the WG and the IETF that are responsible. Similar for "As is usual with IETF specifications, while the document represents rough consensus, it should not be assumed that all participants and contributors agree with all provisions."

* 9. (Andrew Sullivan): I object very strongly to the inclusion of the sentence, "As is usual with IETF specifications, while the document represents rough consensus, it should not be assumed that all participants and contributors agree with all provisions." Rough consensus is always rough on everyone, but if you are a participant who urges this sentence on the product of the WG, I ask you to reconsider. It is unworthy of your effort and the efforts of your colleagues. It would be better to have an outright flamewar on the mailing list than to have that sort of not-with-a-bang-but-a-whimper remark live forever in the WG output.
If we as a WG really have such deep disagreements that we have to send drafts with this sort of disclaimer to the IESG, I feel pretty uncomfortable that the WG has in fact reached consensus.

* References. (Martin Duerst): [Unicode-RegEx], [Unicode-Scripts], [Unicode-UAX15] (and maybe others): Unicode data files don't have explicit authors, but Unicode TRs (and similar documents) have authors/editors, same as RFCs. Please don't drop this information.

8. IDNA BIDI

* (Paul Hoffman): This draft is not yet ready for publication. The text is still very confusing about the relationship between this document and RFC 3454. In many places, the wording makes it sound like this new algorithm is a replacement for that in RFC 3454, which it is not: it is the algorithm to use with IDNA2008. I would be happy to do a thorough edit to remove this ambiguity and make it clear that, while the algorithm is an improvement on the old one, it does not "change" or "fix" or "replace" the old one.

  Author's current answer: It changes the old definition in exactly the same meaning of the word "change" as the way IDNA2008 changes IDNA2003.

  Author's current answer: I'll send you the XML under separate cover; if you can make the changes you feel are needed, I can see if I agree with them.

* (Paul Hoffman) The terminology section needs to define "the end of the label". Tests 3 and 6 of section 2 are confusing without some definition.

  Author's current answer: Will add "beginning" and "end" to the paragraph that says "In this memo, we use "network order" to describe the sequence of characters as transmitted on the wire or stored in a file; the terms "first", "next", "previous", "before" and "after" are used to refer to the relationship of characters and labels in network order."

* Abstract, and potentially elsewhere. (Martin Duerst): Avoid the word 'new'. RFCs are archival documents.

* 1.1.
Is it advisable to specify "when U-labels" instead of "labels"?

  Author's current answer: The first paragraph says that the document is about U-labels. This should not need repeating.

* 1.1. (Martin Duerst): para 2: "When labels satisfy the rule, and when certain other conditions are satisfied, they can be used with a minimal chance of these labels being displayed in a confusing way by a bidirectional display algorithm.": "they" .. "these labels" is confusing. What about "When labels satisfy the rule, and when certain other conditions are satisfied, there is only a minimal chance that these labels will be displayed in a confusing way by a bidirectional display algorithm."

* 1.1. (Martin Duerst): "A bidirectional display algorithm": How many of them do we have? (I only know one, the Unicode one (with some minor variants)). How many of them have been used for testing/verification?

* 1.1. (Martin Duerst): para 3: what exactly is a "right-to-left character"?

* 1.2. (Andrew Sullivan): "While the document proposes completely new text, most reasonable labels that were allowed under the old criterion will also be allowed under the new criterion, so the operational impact of the rule change is limited." It would be nice here, I suggest, to offer some definition of what labels fall outside "most reasonable labels". The description sounds too much like, "The labels we picked when we wrote this," which is indubitably not the impression anyone intended.

  Author's current answer: I added some examples (mixtures of numerals, and AN inside LTR labels). Hope it helps.

* 1.2. (Martin Duerst): This section ideally should also be moved to after Section 2.

* 1.2. (Martin Duerst): para 1: "The IDNA specification "Stringprep"": change to something like "Stringprep, part of IDNA2003". Otherwise, it's not clear that this is an old spec.

* 1.2.
(Martin Duerst): para 4: "However, this makes certain words" -> "However, this made certain words" (past tense)

* 1.2. (Martin Duerst): para 7: "While the document specifies rules" -> "While this document specifies rules"

* 1.2. (Martin Duerst): para 7: "(the most important being label that mix Arabic and European digits (AN and EN) inside an RTL label, and labels that use AN in an LTR label)": Very weird. Such cases may not be completely impossible, but they are much less frequent than e.g. Arabic numbers inside Arabic letters, European numbers inside Arabic letters, and so on. There was even a strong movement to prohibit number mixing at the protocol level; this would never have happened if such mixing had been deemed to be "most important". Also, after looking at the actual conditions, we either have an RTL label, which by condition 4 excludes mixing EN and AN, or we have an LTR label, which by condition 5 excludes AN and therefore the mixture of EN and AN.

* 1.3. (Martin Duerst): title: "Layout" -> "Structure" or "Organization"

* 1.3. (Martin Duerst): para 1: Change from "bidi test" to "bidi rule". (or unify otherwise)

* 1.3. (Martin Duerst): para 1: ", that" -> ", which"

* 1.3. (Martin Duerst): para 1: "no matter what the direction of the label is": What does this mean? It could either mean that you can apply the test forwards or backwards, or it could mean that it doesn't depend on what directionality the characters in the label have, or whatever. In the latter case, I'd write e.g.: "This test [->rule, see above and below] can be applied to any kind of label, but becomes trivial if the input is guaranteed to contain only LTR characters."

* 1.3. (Martin Duerst): "The primary initial use of that test": "that test" -> "this test" (this sentence talks about relationship with other documents, so it's the test in this document, not the test in that other section)

* 1.3. (Martin Duerst): para 2: "a BIDI rule" -> "the BIDI rule"

* 1.3.
(Martin Duerst): para 3: "new rule proposed here" -> "new rule proposed" (we are talking about document organization, so it's "the rule in that other section over there", so "here" doesn't fit)

* 1.3. (Martin Duerst): para 4: "Section 5 to Section 9 describe" -> "Section 5 to Section 7 describe": Section 8 is IANA considerations.

* 1.4. (Mark Davis) "An RTL label is a label that contains at least one character of type R or AL." I believe you should also add "AN". There are cases where it affects ordering. What I mean is that if you had AN + L in a label (not necessarily in that order), you wouldn't even count it as a BIDI domain name, and thus none of the bidi doc would apply (according to the text). Yet such labels would be legal according to protocol, and I think they can cause reordering, and could thus cause the kind of visual confusion that BIDI is supposed to prevent.

  Author's current answer: Good point. I'll modify.

* 1.4. BIDI properties come from Unicode. They might not be complete or could be completed in the future. What then?

  Author's current answer: See section 7.2, "This memo does not propose a solution for this problem".

* 1.4. (Andrew Sullivan): There are some terms defined in Para 1.4. I think it would be very helpful to a naive reader to be directed to defs at the beginning of this section first, and then to say "there are specific BIDI-only terms also defined here". So I'd move the reference that's at the end of this section to the beginning.

* 1.4. (Andrew Sullivan): The third paragraph now ends with a comma, so it looks like something was supposed to be added and wasn't. Or is this just a typo?

  Author's current answer: Typo, fixed.

* 1.4. (Andrew Sullivan): I find this peculiar: A "Bidi domain name" is a domain name that contains at least one RTL label.
If a domain name is RTL.RTL.RTL, it qualifies under this definition, even though there is no bidirectionality (all labels have the same directionality). Explaining why this is still "bidi" would leave me less confused.

  Author's current answer: Added some text to say "adding a separate RTL-name category would just make the spec more complicated".

* 1.4. (Martin Duerst): "non spacing" -> "nonspacing"

* 1.4. (Martin Duerst): "The directionality of such examples" -> "The display order of such examples"

* 1.4. (Martin Duerst): "it means ..., approximately" -> "it approximately means"

* 1.4. (Martin Duerst): "An RTL label": This seems to be the definition that Protocol might want to refer to.

* 1.4. (Martin Duerst): 'Having a separate category of "RTL domain names" would not make this specification simpler, so has not been done.' -> 'Providing a separate category of "RTL domain names" would not make this specification simpler.'

* 2. A replacement for the RFC 3454 BIDI rule: it would probably be good to indicate the order in which the conditions apply.

  Author's current answer: The 6 conditions can be checked in any order. All must be satisfied in order to make the test pass; different implementations may find that different checking orders make the code more or less efficient.

* 2. (Martin Duerst): (title), and elsewhere: Both "Bidi rule" and "Bidi test" are used, that's confusing. The term is always in singular. The document works that way in general, but "The following test" at the start of Section 2 is confusing, because the only 'tests' that one can see are the ones labeled 1. to 6. Maybe use something like "In order to pass the BIDI test, the following conditions 1. to 6. must all be satisfied."

* 2. (Martin Duerst): conditions 2/4: Why are BN (control characters) allowed in RTL but not in LTR?

* 2.1. (Andrew Sullivan): Rule 1 in Para.
2 says, "The first character must be a character with BIDI property L, R or AL." I can't tell whether that must is a requirement or a statement of fact that is entailed by other IDNA rules. If it's a requirement, it presumably ought to be a 2119 MUST; even if not, it seems that we have to know what to do in case the first character doesn't match this rule. If it's an entailment, it'd help to make that plain, which could be done by restating it, "The first character will be a character with BIDI property L, R, or AL due to [reason reference]."

  Author's current answer: I'm not sure what an entailment is - but this is a rule. In order to execute the test, you must look at the first character and check its BIDI property. This document isn't using 2119 (wow); I used to have the reference way back when, but it didn't seem to help any, so I removed it.

* 3. (Andrew Sullivan):

     "One specific requirement was thought to be problematic, but turned out to be satisfied by a string that obeys the proposed rules:

     * The Character Grouping requirement should be satisfied when directional controls (LRE, RLE, RLO, LRO, PDF) are used in the same paragraph (outside of the labels). Because these controls affect presentation order in non-obvious ways, by affecting the "sor" and "eor" properties of the Unicode BIDI algorithm, the conditions above require extra testing in order to figure out whether or not they influence the display of the domain name. Testing found that for the strings allowed under the rule presented in this document, directional controls do not influence the display of the domain name."

  comes after the discussion of things considered and rejected. This leaves me confused about whether the text above is in fact a requirement or not. If it is a requirement, then I'd move this segment to the part before the rejected requirements.
  Author's current answer: Added a little more text to explain the status of the requirement - it's a "nice to know".

* 3. (Martin Duerst): "A requirement" -> "The requirement" (see above)

* 3. (Martin Duerst): para 2: As this restricts things to the Unicode bidi algorithm, please say this earlier. (see above)

* 3. (Martin Duerst): para 3: "requirements proposed" -> "requirements" (we are working on finalizing this document, we are no longer in the proposal stage)

* 3. (Martin Duerst): requirement 2: Is the choice of 'characters delimiting the labels' open, is this only the ASCII dot, or is this a small set? (I'm interested in this both for spec clarity and because the answer might strongly affect draft-duerst-iri-bis.)

* 3. (Martin Duerst): 'possible requirement' related to directionality controls: "(outside of the labels)" -> "(outside, but potentially directly adjacent to the labels)" (does this include cases with directionality controls inside a domain name, i.e. before/after a dot?) "the conditions above require extra testing" -> "the conditions above required extra testing"

* 3. (Martin Duerst): 'Delimiterchars': FULL STOP not allowed in domain names?

* 4.1. (Martin Duerst): Thaana 'Computer' example: "UBIUFILI" -> "UBUFILI"

* 4.2. (Martin Duerst): This section could be shortened considerably. "Greater latitude here than ... Dhivehi." is irrelevant; as long as a significant part of a language's words cannot be used in IDN, there's a problem. The subsection is interesting for people interested in Yiddish, but the average reader of the spec will try to find something relevant for the algorithm, and mostly be more confused than enlightened.

* 4.2.3.4 (James Mitchell): If the proposed label contains any characters that are written from right to left it MUST meet the "bidi" criteria [IDNA2008-BIDI].
The above implies that the label must meet the BIDI criteria; however, the BIDI criteria are applied to a BIDI domain name. From draft-ietf-idnabis-bidi-04, "The following test has been developed for labels in BIDI domain names" and "A "Bidi domain name" is a domain name that contains at least one RTL label."

  I hesitate to provide alternative text until the following question has been answered. The protocol states that if the proposed label contains any characters written from right to left it MUST meet the bidi criteria. It does not impose such requirements on labels containing no right to left characters. Consider the registration of label 123abc in a zone containing labels written right to left. The label does not contain any right to left characters, and therefore does not have to meet the BIDI criteria. However, this name is a BIDI domain name, yet such a name would fail as the first label (LTR) does not begin with BIDI property L. Is the label intended to be valid for registration given a right to left zone?

  Author's current answer: In your specific case, registering "123abc" in the RTL zone "ABC" (usual convention applies) will lead to a domain name (network order: 123abc.ABC) that, on its own, will display as "123abc.CBA" in an RTL context, but if prepended by "DEF:", forming the network order string DEF:123abc.ABC, will display as "CBA.abc123:FED" - which may be surprising to some. (I'm only 90% confident on this - people more used to bidi in practice may be more confident). The WG has rejected inter-label tests, therefore all tests defined by the protocol as normative (MUST or SHOULD) apply only to one label at a time.
  Given the WG decision, I tried to make the BIDI document quite clear that certain properties can only be guaranteed for domain names where all the labels meet the test, but this is a case where people have to read the warnings and do something reasonable, rather than having the rules define that being unreasonable is forbidden. "Warning: Contains hot liquids".

  I am not concerned about the display ordering of the name in question. The issue is a mismatch between the registration and lookup protocols. The registration protocol asks the question of the label '123abc', which, being left-to-right, is not required to satisfy BIDI. However, the lookup protocol says that one SHOULD apply the BIDI test (on, I assume, the name). Applying the BIDI test to this name will fail and the name will not be looked up.

  As a registry, should I allow registration of the name 123abc.RTL?

  Author's current answer: My recommendation is that you should (as a registry) establish policy that says "it is not allowed". The protocol does not require you to do so.

  As an application, should I look up the name 123abc.RTL?

  Author's current answer: The protocol does not say that you can't. For the obvious reasons, I think it's a legitimate implementor decision to decide not to.

* 4.2.3.4 (James Mitchell): I note also the inconsistent use of the term BIDI. In draft-ietf-idnabis-protocol-14 section 4.2.3.4. it is quoted and in lower-case, whereas draft-ietf-idnabis-bidi-04 uses the upper-case version extensively. Also, within draft-ietf-idnabis-bidi-04 the term "Bidi domain name" in Section 1.4 is inconsistent with BIDI domain names in Section 2, and the term Bidi rule in Section 10 is inconsistent with the several other occurrences in the document.

  Author's current answer: Thanks, I tried to normalize it to uppercase in an earlier round, but didn't remain consistent in later edits. Will fix!

* 4.3.
(Martin Duerst): "(with the 5 being considered right-to-left because of the leading ALEF)": No, the 5 itself is never right-to-left. Change to "(the overall directionality being right-to-left because of the leading ALEF)"

* 4.3. (Martin Duerst): "but barring them both seems to require justification" -> "but barring them both seems unnecessary" or "but barring them both turned out to be unnecessary"

* 5. (Martin Duerst): "Even if a label is registered under a "safe" label,": 'under' should be explained more clearly (I assume this refers to the hierarchical relationship in the DNS)

* 5. (Martin Duerst): last paragraph: It would be better to change this into a SHOULD, such as "Where implementations see a way to avoid ..., they SHOULD avoid". That will bring this issue onto the radar screen of implementers, whereas it currently will just be glossed over.

* 6. (Mark Davis) "Rules can also be specified at the protocol level, but while the example above involves right-to-left characters, this is not inherently a BIDI problem." I think the issue is that the word "can" was appropriate for when this was a proposal, but the situation is different in heading for release; the "can" should be changed according to what you mean.

     * If you are referring to the situation as of when these are all released, then the rules either "are specified" or they are not (in which case the statement is removed).

     * If you are referring to a future time, then the "can" becomes "could" (or for clarity, adding "in a future version" or some such language).

     * If you are referring to what could have been done, then "could also have been" would be appropriate.

  Because I can't tell what you want to say, I don't know which of these you would mean.
  Author's current answer: There's no guarantee of synchronicity in further updates, so the situation isn't really all that different (BIDI doesn't place any constraint on future -tables), but I'll change to "are".

* 6. (Martin Duerst): first paragraph: "All other issues with these scripts": What scripts???

* 6. (Martin Duerst): "wishes to create rules for the mixing of digits" -> "wishes to create rules against the mixing of digits" or "wishes to restrict the mixing of digits"

* 6. (Martin Duerst): "Rules are also specified at the protocol level, but while the example above involves right-to-left characters, this is not inherently a BIDI problem." -> "This example is not inherently a BIDI problem, so such restrictions are not specified at the protocol level." ("Rules are also specified at the protocol level" is inherently vague; it seems to mean "Some rules against mixing digits are also specified at the protocol level, but only when this is necessary to avoid a BIDI problem.")

* 6. (Martin Duerst): "It is unrealistic to expect that applications will display domain names using embedded formatting codes between their labels (for one thing, no reliable algorithms for identifying domain names in running text exist);": Please add that it is also unrealistic that formatting codes are removed before IDNA processing, and that allowing formatting codes could lead to many kinds of 'mischief' that would go against the two requirements in section 3.

* 6. (Martin Duerst): "which might surprise someone expecting to see labels displayed in hierarchical order.": Please add that this may not be such a big problem to general users familiar with BIDI, because they are used to seeing/reading a sequence of RTL units (e.g. words) from right to left. (for wording alternatives, see http://tools.ietf.org/html/rfc3987#section-4.4, first para, *second para*, ...)

* 7.
Does that restriction mean that telephone numbers cannot be registered in BIDI zones?

  Author's current answer: If the registry desires that domain names behave sensibly, yes; if the registry only desires that domain names pass the test, no. There are no inter-label tests.

* 7.1 (Martin Duerst): Bullet points 1 and 2 are major, whereas bullet point 3 is really farfetched (not impossible just because there is no guarantee against weird implementations). It would be good to indicate that somehow. (this includes the paragraph following bullet point 3)

* 7.1. (Martin Duerst): "The editors believe": change to something less specific; this is a WG document, we either have rough consensus or we don't. (I for one fully agree with this point)

* 7.2. (Martin Duerst): This should be slightly reworded to more clearly send the message that changes to Unicode bidi properties, while not totally impossible, are expected to be rare, and to affect mostly symbols and the like, which will limit their effect on what the BIDI rule(/test) allows and what not.

* 8. IANA considerations. Same remark as in the Protocol case.

  Author's current answer: The Unicode Consortium does not make changes to published versions of its standards; I believe we can trust them to keep version 5.1 available for a while.

  Moreover, the section above then states: "the determination of validity for any string depends on the Unicode BIDI property values, which are not declared immutable by the Unicode Consortium."

  Author's current answer: See section 7.2.

* 8. (Martin Duerst): "It is possible that differences in the interpretation of the specification": Wrong. There are no differences in interpretation for the old spec. There are no differences in the interpretation of the new spec. There are differences in the specs themselves.

9. IDNA Tables

* disorder in paragraphs.
  Author's current answer: The exact order of the sections will be decided upon (and sections will potentially be moved around) at the time of publication by the RFC Editor while doing other formatting changes. I have been in contact with the other document editors, and our suggestion to our wg chair is to *NOT* risk destroying the actual content of the messages by moving things around at the time of last call (both wg and IETF).

* (Gihan Dias): Tamil digits.

  John Klensin: "I see a considerable difference between, e.g.,

     - "exclude Tamil numerals" and
     - "this character looks like that character, so exclude one of them".

  The latter is clearly part of a case-by-case character analysis. The former, whatever it might be, is a decision about a class of characters, whether Unicode's selection of properties identifies it as a class or not."

  The Sri Lanka IDN Task Force considered the document draft-ietf-idnabis-tables-06.txt and has an issue with the inclusion of the Tamil Digits as valid IDNA characters.

     0BE6..0BEF ; PVALID # TAMIL DIGIT ZERO..TAMIL DIGIT NINE

  However, we consider that the potential for confusion is sufficient that they be disallowed in the protocol, and request that they be excluded.

  Author's current answer: Given the current rules, and the properties in the Unicode Database, the only way to treat 0BE6..0BEF as DISALLOWED is to add them to the exceptions table one by one. I.e. add them to section "2.6. Exceptions (F)" with the explicit value DISALLOWED.

* 1. Introduction. "In particular, some combinations of allowed code points are not advisable for use in IDNs due to rules specific to a script or class of characters" introduces the concept of a "class of characters", but does not document it.
IDNA Rationale 7.1.3 states "Maintain IDNA and Unicode tables that are consistent with regard to versions, i.e., unless the application actually executes the classification rules in [IDNA2008-Tables]", yet the only place where "classification rules" appears in IDNA Tables is in "4. Code points", as "The Categories and Rules defined in Section 2 and Section 3 apply to all Unicode code points. The table in Appendix B shows, for illustrative purposes, the consequences of the categories and classification rules, and the resulting property values." What is a "class of characters"?

* 1. Introduction ends with "This document is part of a series that, together, constitute a proposal for updating the IDNA standards to resolve issues uncovered in recent years, cover a broader range of scripts, and provide for migration to newer versions of Unicode. See [IDNA2008-rationale] for a broader discussion." Should this not be removed or edited?

* 2.1. "For more information, see section 4.5 of The Unicode Standard [Unicode5]." Is it also the case in Unicode 5.1? Shouldn't this document be stored by the IANA?

* 2.2. NFKC or NFC?

* 2.2. (Paul Hoffman) Section 2.2 uses NFKC, but the protocol itself uses NFC. I think it is useful to make a note to this effect in Section 2.2.

* 2.8. (Andrew Sullivan):

     "JoinControl (H)

     H: Join_Control(cp) = True

     This category consists of Join Control characters (i.e., they are not in LetterDigits (Section 2.1)) but are still required in IDN labels under some circumstances. They require extended special treatment in Lookup and Resolution."

  I think we previously agreed just to call the action "lookup". Strictly, all the special treatment is part of the lookup process, but not the resolution process (which is a straight DNS activity that happens to be using an A-label as its QNAME).
As I've argued before, I want the documents to stay very far away from any suggestion that they are changing the operation of the DNS as such.

* 2.10. "It should be noted that Unicode distinguishes between 'unassigned code points' and 'unassigned characters'". Can the differences (in nature and in relation to IDNA) between characters and code points be explained here?

* 5. IANA consideration. It is suggested that IANA should retain online copies of the versions of external documents that are normatively referenced in IETF documents.

* 7. (Andrew Sullivan): "As is usual with IETF specifications, while the document represents rough consensus, it should not be assumed that all participants and contributors agree with all provisions." I'll spare participants my speech on why this is a bad thing this time.

* "A table from which that registry can be initialized, and some further discussion, appears in Appendix A." - Who is to decide on and maintain the table, and according to which rules/procedures?

* Appendix A. As a comment, we do not understand, from the kind of logic presented, why:

  * Tamil digits cannot be made subject to a rule and added to CONTEXTO?
  * The same for French majuscules?
  * The same for any zone-specific restriction?

  It seems implied that the logic should be the same on the sending and receiving end? The receiving end only decodes what the sending end chose to encode in its own context. That context needs to be considered and supported. If my application is in Tamil or French, it knows it and can be required to proceed accordingly.

* Appendix A (Andrew Sullivan): paragraph "Note that "Before" and "After" do not refer to the visual display order of the character in a label, which may be reversed or otherwise modified by the bidirectional algorithm for labels including characters from scripts written right-to-left."
might benefit from the addition of another sentence, "Instead, 'Before' and 'After' refer to the network order of the character in the label."

* Appendix A (Andrew Sullivan): "Appendix A.7. KATAKANA MIDDLE DOT

     Code point: U+30FB

     Overview: Note that the Script of Katakana Middle Dot is not any of "Hiragana", "Katakana" or "Han". The effect of this rule is to require at least one character in the label to be in one of those scripts. ...."

  There is no "End For" as called for in the pseudocode definition.

Author's address

   Jean-Francois C. Morfin
   INTLNET
   23 rue Saint Honore
   Versailles 78000 Versailles
   France

   Phone: (33.1) 39 50 05 10
   Email: jefsey@jefsey.com
   URI:   http://intlnet.org