Network Working Group Jean-Francois C. Morfin
Internet-Draft Intlnet
Intended status: Independent submission September 9, 2009
Expires: March 10, 2010
WG-IDNABIS/LC comments and responses
draft-iucg-wgidnabislc-01.txt
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79. This document may contain material
from IETF Documents or IETF Contributions published or made publicly
available before November 10, 2008. The person(s) controlling the
copyright in some of this material may not have granted the IETF
Trust the right to allow modifications of such material outside the
IETF Standards Process. Without obtaining an adequate license from
the person(s) controlling the copyright in such materials, this
document may not be modified outside the IETF Standards Process, and
derivative works of it may not be created outside the IETF Standards
Process, except to format it for publication as an RFC or to
translate it into languages other than English.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on March 10, 2010.
Copyright Notice
Copyright (c) 2009 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Morfin Expires March 10, 2010 [Page 1]
Internet-Draft wgidnalc September 2009
Provisions Relating to IETF Documents in effect on the date of
publication of this document (http://trustee.ietf.org/license-info).
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document.
Abstract
IDNA is a key issue for the IUCG as a paradigm for the future of
the Internet. There is therefore a need to make sure that the
document set describing it reflects a full consensus of both the
IETF and users. To help with this, this memo keeps track of the
answers requested and received for WG-IDNABIS/LC. The IAB Draft on
IDN has been added because some of the remarks made are important.
The author is quoted if the comment is not from IUCG.
Table of Contents
1. Introduction
2. General appreciation
3. IDN - IAB
4. IDNA Definitions
5. IDNA Rationale
6. IDNA Mapping
7. IDNA Protocol
8. IDNA BIDI
9. IDNA Tables
Requirements notation
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
1. Introduction
IDNA is a key issue for the IUCG as a paradigm for the future of
the Internet. There is therefore a need to make sure that the
document set describing it reflects a full consensus of both the
IETF and users. To help with this, this memo keeps track of the
answers requested and received for WG-IDNABIS/LC. The IAB Draft on
IDN has been added because some of the remarks made are important
to the background and to the future of the Multilingual Internet.
The author is quoted if the comment is not from IUCG. This
compilation is updated up to the answer of Harald Alvestrand on
08/29.
2. General appreciation
* The division of the document set seems adequate. However, even
if the Mapping memo is not a part of the IDNA document set (why?),
it is more than logical and enlightening to read it prior to the
Protocol parts.
* The documents are rather confusing because it is impossible to
decide whether:
* they consider IDNA to be a part of the DNS or not (we may also
be influenced by the ML-DNS pile we work on).
* they differentiate between characters and codepoints, and if
so, how.
* they use NFKC or NFC, and how the two differ, both
intrinsically and from an IDNA point of view.
* they are meant to be a complete set of standards or a partial
set of suggestions. This uncertainty results from:
* non-normative forms being used in places that one would
deem normative;
* the constant discussion of Registries'
capacities/obligations and the lack of documentation on the
tools for executing them and managing the related
registration/coding metadata and rules.
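Illustration of the NFC/NFKC question raised above: the following
Python sketch uses the standard unicodedata module to show one way
in which the two normalization forms differ. The helper name
normalization_forms and the two sample strings are chosen here for
illustration only; they do not come from the IDNA documents, and
the sketch says nothing about which form those documents intend.

```python
import unicodedata


def normalization_forms(text):
    """Return the (NFC, NFKC) forms of text as a tuple."""
    return (unicodedata.normalize("NFC", text),
            unicodedata.normalize("NFKC", text))


# "e" followed by COMBINING ACUTE ACCENT (U+0301): both forms
# compose the pair into the single code point U+00E9.
decomposed = "e\u0301"

# LATIN SMALL LIGATURE FI (U+FB01): NFC leaves it untouched, while
# NFKC replaces it with the two ASCII letters "f" and "i".
ligature = "\ufb01"

for sample in (decomposed, ligature):
    nfc, nfkc = normalization_forms(sample)
    print([hex(ord(c)) for c in sample],
          [hex(ord(c)) for c in nfc],
          [hex(ord(c)) for c in nfkc])
```

From an IDNA point of view, the second sample is the interesting
one: under NFKC a compatibility character silently becomes plain
ASCII, whereas under NFC it survives as a distinct code point.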
* (Martin Duerst): - Use only one name for talking about the
document collection. Currently:
* 'collection' (Abstract, 1.1)
* 'series' (1.1)
* 'set' (1.3)
* 'and the associated ones' (1.1.1)
* 'these documents' (2.1; very unclear when reading whether that
phrase indeed refers to the document collection or to Unicode
documents or what, similar again in 2.2)
This variability is confusing.
3. IDN - IAB
* (Martin Duerst): Mentioning ISO-2022-JP for encoding Japanese
domain names raises some suspicion. ISO-2022-JP may well be (or
have been) used in the DNS or a similar system, but such use would
be atypical, and should be documented by a reference. Based on the
general "division of labor" of the three classical Japanese
encodings (ISO-2022-JP, EUC-JP, Shift_JIS), one would expect
EUC-JP or Shift_JIS rather than ISO-2022-JP in such a case. [Among
the three, ISO-2022-JP makes it easiest to explain the "heuristic
encoding detection" scenario described at the end of Section 1.1.
But without a reference, it may look to some as if ISO-2022-JP was
a made-up example.]
* 1. "An Internationalized Domain Name (IDN) is a name that contains
one or more non-ASCII characters.". What is a "name" vs. a domain
name?
* 1. "When an IDN is encoded with Punycode, it is prefixed with
"xn--",". Is an IDN a label?
* 1. "it is prefixed with "xn--", which assumes that ASCII names do
not start with this prefix." Isn't the whole thing supposed to use
"xn--" ASCII labels instead of non-ASCII entries?
* 1. "reversible Unicode-to-Punycode conversion .... reversible
Punycode-to-Unicode conversion". Unicode is a table; Punycode is
an algorithm. Punycode is not reversible as such, but its input
can be restricted to codepoints for which it does perform
reversibly.
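Illustration of the "xn--" prefix and reversibility comments above:
the following Python sketch relies on the "punycode" codec of the
Python standard library. The helper names to_ace and from_ace are
hypothetical, and lower() is only a stand-in for a real IDNA
mapping step. The sketch shows one possible reading of the
reversibility remark (the raw transformation round-trips, while a
mapping applied before it does not); it is not a statement of what
the IAB draft or IDNA specifies.

```python
ACE_PREFIX = "xn--"


def to_ace(label):
    """Punycode-encode one label and add the ACE prefix.

    No IDNA validity checks are made; this is only the raw
    RFC 3492 transformation.
    """
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace(ace_label):
    """Strip the ACE prefix and Punycode-decode the remainder."""
    if not ace_label.lower().startswith(ACE_PREFIX):
        raise ValueError("label does not carry the ACE prefix")
    encoded = ace_label[len(ACE_PREFIX):]
    return encoded.encode("ascii").decode("punycode")


# A lowercase label ("b", U+00FC, "cher") survives the round trip.
plain = "b\u00fccher"
assert from_ace(to_ace(plain)) == plain

# If a mapping step runs before encoding, the spelling that was
# originally typed cannot be recovered from the encoded form.
typed = "B\u00fccher"
assert from_ace(to_ace(typed.lower())) != typed
```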
* 1. "UTF-8 [RFC3629] is a mechanism for encoding a Unicode
character in a variable number of 8-bit octets, where an ASCII
character is preserved as-is." Characters belong to scripts that
may or may not be supported by ASCII and Unicode encoding tables.
* 1.1. (Martin Duerst): For the bulleted list at the end of Section
1.1, it should be pointed out that UTF-8 can be detected, and
distinguished from other 8-bit encodings, with much higher
precision than just "a byte in the string has the 8th bit set".
For details, please see
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
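Illustration of the UTF-8 detection comment above: the following
Python sketch contrasts the coarse "8th bit set" test with a
stricter well-formedness test. Strict decoding with the standard
"utf-8" codec is used here as an assumed stand-in for a more
precise detector; the paper referenced above may describe a
different method. The function names are hypothetical.

```python
def has_high_bit(data):
    """Coarse test: some octet has its 8th bit set."""
    return any(octet & 0x80 for octet in data)


def is_well_formed_utf8(data):
    """Stricter test: the octets decode as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def looks_like_utf8(data):
    """Non-ASCII content that is also well-formed UTF-8."""
    return has_high_bit(data) and is_well_formed_utf8(data)


# The same name ("b", U+00FC, "cher") in two 8-bit encodings.
utf8_name = "b\u00fccher".encode("utf-8")
latin1_name = "b\u00fccher".encode("latin-1")

# Both have a high bit set, but only one is well-formed UTF-8.
print(has_high_bit(utf8_name), looks_like_utf8(utf8_name))
print(has_high_bit(latin1_name), looks_like_utf8(latin1_name))
```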
* 1.1. (Martin Duerst): The heuristic for punycode that is given in
Section 1.1 is "starts with xn--". However, on the level of
getaddrinfo, we are dealing with domain names, not single labels,
and something like www.xn--foo.jp should definitely be punycode
even if it doesn't start with xn--.
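Illustration of the comment above: a minimal Python sketch of a
prefix test applied to each label of a domain name rather than to
the name as a whole. The function name ace_labels is hypothetical,
and the sketch only looks at the prefix; it does not check that
the rest of the label is valid Punycode.

```python
ACE_PREFIX = "xn--"


def ace_labels(domain_name):
    """Return the labels of domain_name that carry the ACE prefix.

    The comparison is case-insensitive, and a trailing root dot is
    ignored.
    """
    labels = domain_name.rstrip(".").split(".")
    return [label for label in labels
            if label.lower().startswith(ACE_PREFIX)]


name = "www.xn--foo.jp"
# The whole-name test misses the encoded label ...
print(name.startswith(ACE_PREFIX))
# ... while the per-label test finds it.
print(ace_labels(name))
```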
* 1.1. "When used with a DNS resolver library, IDNA is inserted as a
"shim" between the application and the resolver library": that
shim should be located in Figure 2.
* 1.1. "Again this assumes that ASCII names never start with "xn--",
and also that UTF-8 strings never contain an ESC character". It
may be worth documenting whether "IDN", "names", "labels", and
"strings" are or are not the same thing.
* 2. "Applications that convert an IDN to Punycode before calling
getaddrinfo() will result in name resolution failures if the
Punycode name is directly used in such protocols." Why? Is that
not actually due to the reason that was given in 3? "Applications
that convert an IDN to Punycode before calling getaddrinfo() will
result in name resolution failures if the name is actually
registered in a private name space in some other encoding (e.g.,
UTF-8)."
* 3. "While implementations of the DNS protocol must not place any
restrictions on the labels that can be used, applications that use
the DNS are free to impose whatever restrictions they like, and
many have." Wouldn't these two rules contradict the proposed
WG-IDNABIS charter change? Wouldn't they permit the support of
cases such as Tatweel, Tamil figures, and French majuscules?
* 3.4. "The DNS resolver will append suffixes in the suffix search
list". Where is the "suffix search list" documented?
* 4. "even when DNS is used, the conversion to Punycode should be
done only by an entity that knows which name space will be used."
Fundamental. Yet, this is not considered by IDNA rationale or
protocol.
* 4.1. "Indeed the choice of conversion within the resolver
libraries is consistent with the quote from [RFC3490] section 6.2
stating that Punycode conversion "might be performed inside these
new versions of the resolver libraries"." - "the recommendation is
that a resolver library be more liberal in what it would accept
from an application would mean that such a name would be accepted
and re-encoded as needed". These recommended architectures (such
as ML-DNS) are not considered in the IDNA rationale. Will IDNA be
interoperable with these recommended architectures?
* 4. "encoding conversion between Punycode and UTF-8 is
unambiguous". (?). This could lead to stabilization through
punycode and A-labels, in turn making A-labels the DNS referent
entry for UTF support?
* 5. Security considerations. This kind of consideration is already
posing a problem for TM protection. This is the "Babel Names"
case. This occurs when someone trademarks the U-label
corresponding to a protected Roman script TM. When that U-label
displays under its ASCII label form, it infringes on the Roman
script TM rights. Ex.: "xn--coca-cola" or "xn--vint-cerf".
* (Martin Duerst): The solution that the document seems to be
pushing most is heuristic detection, i.e. an API where strings in
different encodings are fed in and the API sorts things out
heuristically, converting if necessary. To some extent, this may
be an unavoidable evil, but it would be good if the document were
pushing more for clear encoding identification (for which I think
GetAddrInfoW() (UTF-16) would be an example).
* (Martin Duerst): It may be a good idea to also look into the issue
of escaped forms of domain names being fed into resolver APIs. One
form of escaping is (UTF-8-based) %-encoding in URIs (and IRIs),
which is allowed in URIs according to RFC 3986, is the only way to
encode non-ASCII in the host part of a URI where punycode isn't
appropriate, and may be the result of a conversion from an IRI to
a URI. For further background and discussion, please see
http://lists.w3.org/Archives/Public/public-iri/2009Aug/0012.html
and
http://lists.w3.org/Archives/Public/public-iri/2009Aug/0024.html
and the followup discussion.
Another potential kind of escaping is HTML/XML numeric character
references (of the form "& #xABCD ;"), although I expect them to
be less of a problem because they are used higher up in the
application and usually removed early on.
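Illustration of the escaping comments above: the following Python
sketch undoes both kinds of escaping before any conversion to
"xn--" form takes place. It uses html.unescape and
urllib.parse.unquote from the Python standard library. The helper
names unescape_host and to_ace_name are hypothetical, the
%-encoding is assumed to be UTF-8-based as in the comment, and no
IDNA mapping or validation is attempted.

```python
import html
from urllib.parse import unquote


def unescape_host(host):
    """Remove numeric character references, then %-encoding.

    References are removed first, on the assumption that they are
    handled higher up in the application than %-encoding is.
    """
    without_refs = html.unescape(host)
    return unquote(without_refs, encoding="utf-8", errors="strict")


def to_ace_name(host):
    """Unescape host, then Punycode-encode each non-ASCII label."""
    labels = []
    for label in unescape_host(host).split("."):
        if label.isascii():
            labels.append(label)
        else:
            encoded = label.encode("punycode").decode("ascii")
            labels.append("xn--" + encoded)
    return ".".join(labels)


# The same host written with %-encoding and with a hexadecimal
# numeric character reference.
print(to_ace_name("b%C3%BCcher.example"))
print(to_ace_name("b&#xFC;cher.example"))
```

A resolver API that received the escaped form unchanged would look
up a different, all-ASCII name, which is the risk that the
comments point at.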
4. IDNA Definitions
* Information on Unicode is scattered throughout the document.
Wouldn't it be much better to describe a clear sequence?
* what an IDNA is,
* what IDNs are,
* what IDNA labels are,
* what they are made of,
* how Unicode supports them, including NFC in the same 2.1.
section,
* how a zone manager may impose profiling rules (description,
enforcement).
* Most of the new terms are discussed before being defined. This
starts with the confusing "looking them up" in part 1.1.1 (which
means resolving, and not just asking about validity or existence,
as opposed to "registering"). IDNs are introduced in 2.3.2.3, etc.
This certainly reflects how difficult it is to define all these
terms, but it is still quite confusing. For example, it would be
advisable to begin with part 4.4.
* The different classes of domain names that are discussed only seem
to be related to IDNA without an exhaustive presentation of the
DNS domain name context. The names are somewhat confusing. The
drafts are certainly clear, but they do not reflect a progressive
logic of discovery of the nature of a name/label that could be
ported to programming functions.
* References to the lower/uppercase image can be understood by DNS
old-timers, but are confusing to newcomers, as the image does not
reflect the same functionality and because U-label/A-label
lower/uppercase treatment is not the same.
* Different keyboards and encodings are discussed, stressing that a
DNS resolution calls for a U-label conversion, but nothing obliges
local applications to transcode user entries to Unicode when they
interoperate at a layer other than DNS. However, these
applications may want to canonicalize these entries in their own
way. Interplus supports the idea that an application layer may use
an intermediate non-Unicode and non-ASCII coding. Among other
things, this facilitates interoperability with the UTF-8 that
Microsoft supports within private nets: the user interface may be
common and the underlying machinery either IDNA or UTF-8.
* 1.1.1 (Martin Duerst): Audiences - "what names are permitted in
DNS zone files," -> "what names are permitted in DNS zones,"
(whether these are files or whatever is implementation-dependent).
This section is very important, and would be much more effective
with less circumlocution. Just use the straightforward terms for
people/functions that everybody else uses directly, such as
'registries', 'registrars', 'administrators creating subdomains',
and so on, and then say that this list isn't exclusive. That will
have the additional benefit of bringing the document up in more of
the relevant searches.
The second paragraph is also overly circumlocutory. Using "the
one containing explanatory material" to refer to Rationale is a
strong disservice to every reader, even if, strictly speaking, it
may be preferable to a forward reference. Please use [] style
references, or labels such as "Rationale" with a short sentence
pointing to 1.3, or move the "who should read what" info to 1.3
with a general pointer from 1.1.1. But please stop talking around
stuff that can be easily expressed more directly (this general
comment applies in many other places, too).
* 1.1.1 (Martin Duerst): should be 1.2, and 1.1.2 should be 1.3, and
1.3 should be 1.4, to simplify structure.
* 2.1 (Martin Duerst): Say that 0x means hexadecimal (first para)
* 2.3.1 (Martin Duerst): title: This looks as if this section
defines one term, "LDH-Label". Change the title to something more
general, such as "Definitions for ASCII-only Labels".
* 2.3.1. (Paul Hoffman): Anchor 10 above Figure 1 indicates that the
figure might be shrunk. I propose instead that the four footnotes
simply be moved to immediately after the figure, which makes the
figure itself fit on one page.
* 2.3.1. (Paul Hoffman) In Figure 2, the terms "Binary Label
(including high bit on)" and "Bit String Label" are not defined
and are confusing without definition. Do we need this figure at
all any more?
* 2.3.1. (Martin Duerst): general (but most urgently 3rd para): Make
sure that the terms defined stick out, at least the same way as in
2.1 (one para per def, defined word is first word of para). Move
clear and simple definition to front, and rationale,
relationships,... to the end of the paragraph.
* 2.3.1. (Martin Duerst): Move normative text to Protocol ("those
labels MUST NOT be processed as ordinary LDH-labels by
IDNA-conforming programs and SHOULD NOT be mixed with IDNA-labels
in the same zone")
* 2.3.1. (Martin Duerst): 3rd para: "but which otherwise conform to
LDH-label rules" -> "but otherwise conform to LDH-label rules"
* 2.3.1. (Martin Duerst): 3rd para: "case-independent" ->
"case-insensitive"
* 2.3.1. (Martin Duerst): 3rd para: "divided in" -> "divided into"
* 2.3.1. (Martin Duerst): "for future extensions that use extensions
based on the same "prefix and encoding" model": a) 'extensions'
is repeated; b) the IETF is great at not talking about future
eventualities and describing general models that never may be
used. In this and other sections, such stuff should also be cut
out.
* 2.3.1. (Martin Duerst): anchor10: I do not understand why we need
the (1)..(4) notes. Either the definitions are clear enough, or
they should be fixed. Something like "NON-RESERVED LDH LABELS
(NR-LDH-labels) NR-LDH LABELS" is also total overkill. The only
thing that's necessary is "NR-LDH labels", with exactly the same
capitalization and hyphenation as in the definition.
* 2.3.1. (Martin Duerst): Fig. 2: I'm somewhat confused here. Note
(5) seems to suggest that U-labels have a fixed binary encoding
(e.g. UTF-8) and are used directly in the DNS. Otherwise, the note
doesn't make sense.
* 2.3.2.1. (Martin Duerst): "While that constraint may be tested in
any of several ways, an A-label must be capable of being produced
by conversion from a U-label and a U-label must be capable of
being produced by conversion from an A-label.": This puts the
cart before the horse. Change to "An A-label must be capable of
being produced by conversion from a U-label and a U-label must be
capable of being produced by conversion from an A-label. There are
several ways in which this constraint may be tested."
* 2.3.2.1. (Andrew Sullivan): Para. 2.3.2.1: An "A-label" is the
ASCII-Compatible Encoding (ACE, see Section 2.3.2.5) form of an
IDNA-valid string. It must be a complete label: IDNA is defined
for labels, not for parts of them and not for complete domain
names. This means, by definition, that every A-label will begin
with the IDNA ACE prefix, "xn--" (see Section 2.3.2.5), followed
by a string that is a valid output of the Punycode algorithm
[RFC3492] and hence a maximum of 59 ASCII characters in length.
The prefix and string together must conform to all requirements
for a label that can be stored in the DNS including conformance to
the rules for the preferred form described in RFC 1034, RFC 1035,
and RFC 1123. A string meeting the above requirements is still not
an A-label unless it can be decoded into a U-label.
So, to be less vague: the section is supposed to define certain
terms, and that bullet ought to define "A-label". It does not. It
tells us the necessary conditions for being an A-label, but not
the sufficient. This could be remedied if the last sentence said
instead, "If and only if a string meeting the above requirements
can be decoded into a U-label, then it is an A-label." But I'm no
longer sure that's true, given that we've lived with the I-D
definition so long and yet not had it fully operationalized. Is
there anything else? If there is, it needs to be added. These
definitions, I say, must be completely operationalized (or else we
have no excuse to call this document the definitions document).
Since people have to write code on the basis of these definitions,
they must be completely unambiguous.
Author's current answer: These changes, with Paul's suggested
modifications, have been tentatively accepted and incorporated
in the document. Anyone who objects should say so quickly.
* 2.3.2.1. (Martin Duerst): "Among other things, this implies that
both U-labels and A-labels must be strings in Unicode NFC
[Unicode-UAX15] normalized form.": A-labels are by definition in
NFC, because they are ASCII-only. If you want to say that they
must *represent* labels that are in NFC, that would be fine, but I
think mentioning NFC here isn't really necessary.
MAJOR!!!!!
* 2.3.2.1. (Martin Duerst): says: "Any rules or conventions that
apply to DNS labels in general, such as rules about lengths of
strings, apply to whichever of the U-label or A-label would be
more restrictive. For the U-label, constraints imposed by existing
protocols and their presentation forms make the length restriction
apply to the length in octets of the UTF-8 form of those labels
(which will always be greater than or equal to the length in code
points)."
Now this is TOTALLY NEW to me. There sure is a restriction to 63
octets in the DNS itself, but because U-labels don't enter the DNS
as such (neither as UTF-8 nor as UTF-16 or whatever), an arbitrary
UTF-8-based length restriction seems totally unjustified. I'm not
at all aware of such a restriction in IDNA2003.
Indeed, punycode was explicitly designed, among other things, to
perform well for scripts with few characters. For small scripts
that need 3 bytes per character in UTF-8 (all Indic scripts,
Georgian, Sinhala, Thai, Lao, Tibetan, Myanmar, Ethiopic,
Cherokee, Unified Canadian Aboriginal Syllabics, Khmer,...), this
restriction would mean a drastic reduction of the number of
characters usable in a
label. To give an example, when at W3C, I created some IRI tests
(http://www.w3.org/2001/08/iri-test/).
The tests use Hiragana (http://www.???????.w3.mag.keio.ac.jp and
http://???????.???????.???????.w3.mag.keio.ac.jp), which is
atypical in that Hiragana-only Japanese is rarely used except in
children's books, but it is typical in that punycode is able to
represent 41 Hiragana (123 octets in UTF-8) in 58 octets. Hiragana
overall contains about 80 letters in a single block; punycode
efficiency will vary with the size of the script (more efficient
for smaller scripts, less efficient for larger scripts) as well as
of course with every individual label.
Currently, all (on Windows) of IE7, Mozilla Firefox, Safari, and
Opera pass both length tests (single label and multiple labels).
It would be very counterproductive if IDNA2008 required further
artificial restrictions which essentially disfavor languages and
cultures that haven't been lucky to get short encodings for their
scripts in UTF-8. (I'd be fine if the Security section warns about
the potential of some protocols or implementations not having
appropriate space, but that's on a completely different level.)
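Illustration of the length argument above: the following Python
sketch measures one label in code points, in UTF-8 octets, and in
octets of its "xn--" form, using the "punycode" codec of the
Python standard library. The helper names label_lengths and
fits_dns_limit are hypothetical, and the sample label (four
Hiragana letters) is not one of the labels from the IRI tests
mentioned above. The 63-octet limit is the DNS restriction cited
in the comment.

```python
def label_lengths(u_label):
    """Return (code points, UTF-8 octets, A-label octets)."""
    encoded = u_label.encode("punycode").decode("ascii")
    a_label = "xn--" + encoded
    return (len(u_label),
            len(u_label.encode("utf-8")),
            len(a_label))


def fits_dns_limit(u_label, limit=63):
    """True if the encoded form fits the DNS label limit."""
    return label_lengths(u_label)[2] <= limit


# Four Hiragana letters (hi, ra, ga, na), written as escapes.
sample = "\u3072\u3089\u304c\u306a"
code_points, utf8_octets, ace_octets = label_lengths(sample)

# Every Hiragana letter takes three octets in UTF-8, so a limit
# counted in UTF-8 octets is reached three times sooner than a
# limit counted in code points.
print(code_points, utf8_octets, ace_octets)
```

Counting the limit on the encoded form, as fits_dns_limit does, is
the alternative that the comment argues for; counting it on the
UTF-8 form would reject a 41-letter Hiragana label (123 octets)
outright.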
* 2.3.2.1. (Andrew Sullivan): "A "U-label" is an IDNA-valid string
of Unicode characters, in normalization form NFC and including at
least one non-ASCII character, expressed in a standard Unicode
Encoding Form (in an Internet transmission context this will
normally be UTF-8)." The parenthetical remark, I think, encourages
implementers not to recognise as U-labels strings that come in as
(say) UTF-32, but that are otherwise perfectly valid. Who cares
what is normal in an Internet transmission context, when we're
defining terms? Why does that matter?
Author's current answer: The comment was made because there is
no requirement at all in IDNA (either 2003 or 2008) that UTF-8
be used; many applications on particular operating systems
actually use something else (UTF-16 is most common). But I
dropped the additional text. It now just says "(such as UTF-8)"
as Paul suggested. Again, anyone who doesn't like this should
speak up.
* 2.3.2.1. (Andrew Sullivan): "To be valid, U-labels and A-labels
must obey an important symmetry constraint. While that constraint
may be tested in any of several ways, an A-label must be capable
of being produced by conversion from a U-label and a U-label must
be capable of being produced by conversion from an A-label. Among
other things, this implies that both U-labels and A-labels must be
strings in Unicode NFC [Unicode-UAX15] normalized form. These
strings MUST contain only characters specified elsewhere in this
document series, and only in the contexts indicated as
appropriate."
This passage nowhere actually says that _the_ A-label produced by
conversion from a particular U-label must in turn produce, by the
application of the algorithm, the _same_ U-label. There is a
symmetry (though not an obvious one) in U[1] being convertible to
A[2] which is convertible to U[2] which is convertible to A[1],
for instance. I have no idea whether such is possible, but there's
no reason our formal definitions need to allow for it. This can be
fixed so:
To be valid, U-labels and A-labels must obey an important symmetry
constraint. While that constraint may be tested in any of several
ways, an A-label A' must be capable of being produced by
conversion from a U-label U', and that U-label U' must be capable
of being produced by conversion from A-label A'. Among other
things, this implies that both U-labels and A-labels must be
strings in Unicode NFC [Unicode-UAX15] normalized form. These
strings MUST contain only characters specified elsewhere in this
document series, and only in the contexts indicated as
appropriate.
I don't care about the notation, as long as it is unambiguously
clear that we're always talking about the "very same" label on
both sides of the transformation. We could go so far as to define
IDNA-equivalent A-labels and U-labels formally. I think this would
do it:
A-label1 and U-label1 are equivalent if and only if all the
following four conditions are true:
1. The encoding of A-label1 according to [RFC3492] results in
U-label1.
2. The decoding of U-label2 according to [RFC3492] results in
A-label2.
3. A-label1 is equivalent to A-label2 according to DNS matching
rules for labels.
4. U-label1 is bitstring equivalent to U-label2.
Some may reject the above as a bit of needless formalism, or want
to reduce some steps. I argue that this is the most basic and
therefore most clear (but admittedly inelegant) formulation. As
usual, however, I'm utterly prepared to admit that I've actually
got the rule incorrect. But if I have, that amounts to a hint of
trouble with the document, because I've managed to misunderstand
it (and though I be dim, I have been following this effort).
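Illustration of the symmetry discussion above: a minimal Python
sketch of a "very same label" test, using the "punycode" codec and
the unicodedata module of the Python standard library. The
function name symmetric_u_label is hypothetical. Lowercasing
stands in for DNS matching rules, and only the NFC and round-trip
conditions are checked; the character and context rules of the
IDNA documents are out of scope, so a non-None result does not
mean that the label is IDNA-valid.

```python
import unicodedata

ACE_PREFIX = "xn--"


def symmetric_u_label(a_label):
    """Return the U-label candidate for a_label, or None.

    A candidate is returned only if it contains a non-ASCII
    character, is in NFC, and re-encodes to the same A-label.
    """
    lowered = a_label.lower()
    if not lowered.startswith(ACE_PREFIX):
        return None
    try:
        encoded = lowered[len(ACE_PREFIX):].encode("ascii")
        u_label = encoded.decode("punycode")
    except UnicodeError:
        return None
    if u_label.isascii():
        return None
    if unicodedata.normalize("NFC", u_label) != u_label:
        return None
    re_encoded = u_label.encode("punycode").decode("ascii")
    if ACE_PREFIX + re_encoded != lowered:
        return None
    return u_label


# A string that decodes to "u" plus COMBINING DIAERESIS (U+0308)
# is not in NFC, so it is rejected even though it decodes.
not_nfc = "xn--" + "u\u0308ber".encode("punycode").decode("ascii")
print(symmetric_u_label("xn--bcher-kva"), symmetric_u_label(not_nfc))
```

Writing the test as one function that starts from the A-label and
comes back to it makes explicit that both directions of the
conversion refer to the same pair of labels.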
* 2.3.2.1 (Wil Tan): Protocol section 5.3 (A-label Input) should add
a lowercasing step prior to using the Punycode decoding
algorithm. The section on the symmetry constraint (-defs-10,
section 2.3.2.1) should also have similar wording.
* 2.3.1. Para5 (Wil Tan): Labels within the class of R-LDH labels
that are not prefixed with "xn--" are also not valid IDNA-labels.
To allow for future use of mechanisms similar to IDNA, those
labels MUST NOT be processed as ordinary LDH-labels by
IDNA-conforming programs and SHOULD NOT be mixed with IDNA-labels
in the same zone.
Author's current response: Unless, in the moving around of
text, we have slipped up, it is important to note that the
restriction here applies _only_ to IDNA-aware applications.
That prevents it from being a restriction on the DNS generally.
However, for IDNA-aware applications, it is a precaution
against possible future prefix-altering changes as well as
something of a mechanism for making it harder for bad guys to
game future changes. If any non-IDNA arrangements come along
that use "??--" label encodings, they will of course have to be
coordinated with each other and with IDNA; in the interim, this
provision keeps IDNA out of their way (i.e., avoids preempting
such approaches).
And, yes, the WG did discuss this at great length.
"I may have missed it, but don't recall any discussions about
restricting the processing of other tagged domains. Is this the
right draft to prescribe restrictions on how non-XN-Labels are
processed?"
Author's current response: IMO, we are much too tied up in
special definitions and confusing terminology already. Please
let's not make it worse by introducing more unnecessary
terminology in the form of "tagged domains". And it is the
right place for defining how IDNA-aware applications handle
R-LDH labels that are not valid A-labels, at least IMO and in
the opinion of the mailing list the last two or three times we
went through that topic.
* 2.3.2.2. (Martin Duerst): NR-LDH-label and Internationalized
Label: The section doesn't say anything about "Internationalized
Label"s, although this term appears in the title. (the definition
is in 2.3.2.3)
* 2.3.2.3. (Martin Duerst): SRV record labels are not
Internationalized labels, and therefore domain names used for SRV
aren't IDNs. That's fine by me, but it should nevertheless be made
clear (here or elsewhere) that IDNs can be used with SRV,... (this
seems to be done at the end of 2.3.2.6, so this should be okay)
* 2.3.2.4. (Martin Duerst): This seems to say that there is no
equivalence between an all-lowercase A-label and an otherwise
equal label where some letters (maybe accidentally) have been
upper-cased. I think the cause of the problem is (as often in this
document) the lack of consistent language. Instead of "and then
testing for an exact match between the A-labels", say "and then
testing for equivalence between the A-labels [using normal DNS
matching rules]". If that's not what's intended, then some more
background may be appropriate.
* 2.3.2.5. (Martin Duerst): "a string of ASCII characters" -> "the
string of ASCII characters"
* 2.3.3. (Martin Duerst): "Because IDN labels may contain characters
that are read, and preferentially displayed, from right to left,":
Remove 'preferentially'.
This maybe refers to some hopelessly broken systems, or to the
fact that Arabic Braille is LTR, or something else, but is totally
irrelevant and potentially misleading in this context.
* 2.3.3. (Martin Duerst): Why doesn't this paragraph just refer to
'logical' representation, a term that people who know bidi are
familiar with and that's widely used in Unicode?
* 2.3.3. (Andrew Sullivan): Also, as an aside, it would be helpful
if defs pointed out more explicitly in its Para. 2.3.3 that there
are BIDI-only terms defined in the BIDI document and not in defs.
Author's current answer: I moved it to the beginning. I don't
think there's anything bidi-specific in -defs.
* 2.3.4. (Martin Duerst): "There has been some confusion about
whether a "Punycode string" does or does not include the ACE
prefix and about whether it is required that such strings could
have been the output of the ToASCII operation":
a) The combination of 'required' and 'could' doesn't make ANY
sense to me.
b) It is unclear what "such strings" refers to (with ACE prefix?
without ACE prefix?)
* 2.3.4. (Martin Duerst): "much more clear" -> "much clearer"
* 4. (Martin Duerst): There should be a very short paragraph saying
that this section provides an overview and pointers into the
security sections of the other documents. (or whatever else
exactly the relationships are)
* 4.1. (Martin Duerst): "In addition to characters that are
permitted by IDNA2003 and its mapping conventions": Does this mean
"In addition to characters that are permitted by (IDNA2003 and its
mapping conventions)" or "In addition to characters that are
permitted by IDNA2003 and [in addition to] its mapping
conventions"? Please clarify.
* 4.1. (Martin Duerst): "problems that might raise" -> "problems
that might arise"
* 4.1. "Security on the Internet partly relies on the DNS. Thus, any
change to the characteristics of the DNS can change the security
of much of the Internet." This sentence seems extremely confusing,
as IDNA does not affect (change characteristics) the DNS but is
rather built on the fact that they will not be changed.
* 4.1. The same : "The security of the Internet is compromised if a
user entering a single internationalized name is connected to
different servers based on different interpretations of the
internationalized domain name." The security of the Internet is
not compromised, however, trust in the IDNA proposition might be.
* 4.2. (Martin Duerst): "these specifications"? The IDNA2008
collection of specifications? Or the specifications for the local
character sets?
* 4.2. (Martin Duerst): "(or different versions of one application)"
-> "(or different versions or parts of one application)" (yes,
this can and does happen)
* 4.4. (Martin Duerst): "comparisons be done properly, as specified
in the Requirements section of [IDNA2008-Protocol]": If
comparisons are dealt with in Protocol, what's the purpose of
2.3.2.4? And what's the purpose of trying to explain it all again
just after the quoted sentence?
* 4.5. (Martin Duerst): "Despite that prohibition, there are a
significant number of files and databases on the Internet in which
domain name strings appear in native-character form;": This makes
it appear as if such files and databases are in violation of some
spec. But they may simply contain IRIs instead of URIs. I would
simply start the subsection with something like "As long as
IDNA2003 labels have been kept in A-label form, the only
differences in interpretation arise for characters whose ..." and
then, in a new paragraph, continue "For IDNA2003 labels that have
been kept in native encoding,..."
* 4.7. The Summary might be considered adventurous? Corporations
such as Nominum propose services that are supposed to protect the
DNS. One of the purposes of ML-DNS is precisely to permit an
architectural protection.
* Acknowledgments (Andrew Sullivan): "As is usual with IETF
specifications, while the document represents rough consensus, it
should not be assumed that all participants and contributors agree
with all provisions," I don't feel comfortable with starting to
make the Acknowledgements section a platform for disclaimers about
WG consensus. I object pretty strongly to this addition. I don't
think we're served well by trying to state in any document how
rough the rough consensus is: the document either has to stand
through the IETF process, or not. Besides, this evaluation is a
prerogative of the Chair, the ADs, and the IESG. If this sort of
disclaimer is needed, it ought to be added by the IESG (and even
then I would object). I would like the sentence to be removed.
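Illustrative note on the 2.3.4 comment above (whether a "Punycode string" includes the ACE prefix): the difference can be shown with a minimal sketch. It assumes Python's built-in "punycode" codec as a stand-in for the Punycode algorithm; the helper names punycode_only and with_ace_prefix are hypothetical, and no IDNA2008 validation of any kind is performed here.

```python
# Sketch only: shows the raw Punycode output versus the same output
# with the ACE prefix prepended. The helper names are hypothetical.
ACE_PREFIX = "xn--"


def punycode_only(label: str) -> str:
    """Return the bare Punycode encoding of a label (no ACE prefix)."""
    return label.encode("punycode").decode("ascii")


def with_ace_prefix(label: str) -> str:
    """Return the Punycode encoding with the ACE prefix prepended."""
    return ACE_PREFIX + punycode_only(label)


def decode_punycode(encoded: str) -> str:
    """Decode a bare Punycode string (the ACE prefix must be stripped first)."""
    return encoded.encode("ascii").decode("punycode")


raw = punycode_only("b\u00fccher")    # label with U+00FC (u with diaeresis)
ace = with_ace_prefix("b\u00fccher")
print(raw)
print(ace)
```

Under this reading, only the second form could appear in a zone; the first is an intermediate result of the encoding step. Which of the two the term "Punycode string" denotes is exactly what the comment asks the text to settle.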
5. IDNA Rationale
* 1.1 (Andrew Sullivan): bugs me: "Traditionally, DNS labels are
matched case-insensitively [RFC1034][RFC1035]. That convention was
preserved in IDNA2003 by a case-folding operation that generally
maps capital letters into lower-case ones. However, if case rules
are enforced from one language, another language sometimes loses
the ability to treat two characters separately. Case-sensitivity
is treated slightly differently in IDNA2008." It makes it sound as
though there's just a kooky tradition in the DNS, and we could fix
that up. But that's not true: the matching rules are _defined_ to
be case insensitive, so changing that would be a protocol change
to DNS. Also, the text slides a little too easily between what are
different contexts of "label" here, and I think it could be a
source of confusion. I suggest this instead:
The DNS matching rules for DNS labels are case-insensitive
[RFC1034][RFC1035]. That convention was preserved for
internationalized labels in IDNA2003 by a case-folding operation
that generally maps capital letters into lower-case ones. However,
if case rules are enforced from one language, another language
sometimes loses the ability to treat two characters separately.
Case-sensitivity is treated slightly differently in IDNA2008.
* 1.3.1. DNS "Name" Terminology. "" would be better read as
"orthotypographic" as an orthographic error that can be a way to
lose some special semantics differences due to orthotypographic
conventions.
* 1.3.2. "IDNA-landr" typo?
* 1.4. "Reduce the dependency on mapping, in order that the
pre-mapped forms (which are not valid IDNA labels) tend to appear
less often in various contexts, in favor of valid A-labels." calls
for the Charter to be revised. Alternatively, it could say,
remove dependence on mapping as per a mapping document, in which
this document would include a section on the various ways to
ensure DNS security and the barring of some U+codes in some
presentations.
* 1.5. "This model has served the existing applications well, but it
requires, with or without internationalized domain names, that
users know the exact spelling of the domain names that are to be
typed into applications such as web browsers and mail user agents.
The introduction of the larger repertoire of characters
potentially makes the set of misspellings larger, especially given
that in some cases the same appearance, for example on a business
card, might visually match several Unicode code points or several
sequences of code points." may be read as if the users of these
languages were more prone to errors than users of ASCII-based languages.
* "If an application wants to use non-ASCII characters in public DNS
domain names, IDNA is the only currently-defined option." IDNA is
not a DNS option. It is an application way to transcode Unicode
domain names in LDH domain names for the convenience of ASCII
oriented international managers. The idea is to attain the
adherence of local users and managers to IDNA and not to impose
ASCII on them. DNS is UTF-8 compatible.
* "IDNA2008 divides all possible Unicode code-points into four
categories: PROTOCOL-VALID, CONTEXTUAL RULE REQUIRED, DISALLOWED
and UNASSIGNED.
3.1.1. PROTOCOL-VALID
Characters identified as "PROTOCOL-VALID" (often abbreviated
"PVALID") are permitted in IDNs." Are we talking of code-points or
of characters?
* 3.1.2.1 Not in the TOC
* 3.1.3 Disallowed "various HEART symbols" - is U+38FA also
disallowed? or U+3966?
* 3.1.3. This is the first time anyone has spoken of NFKC. In IDNA
Defs and other cases, it is NFC. Shouldn't both of them be
documented? Shouldn't someone explain in which specific case one
is used?
* "The character is an upper-case form or some other form that is
mapped to another character by Unicode casefolding." this seems to
create a very large mapping scheme that depends on a
non-documented Unicode system needing correction (at least when it
does not specifically support majuscules). Moreover, are we
dealing with characters (that are orthogonal to Unicode) or with
codepoints that represent characters and that are subject to
Unicode casefolding. The proposition is to: (1) clarify the
character/codepoint issue, (2) explain what Unicode case folding
is and its limitations, (3) move them to CONTEXTO when these
codepoints are both used as upper-cases and as majuscules, (4)
explain that majuscules that are supported by upper-cases will be
transcoded by punycode.
* 4.4. Case mapping. One may regret that the French majuscules
current support of Unicode, which is perfectly adequate in other
circumstances yet inadequate in this case, is not discussed. This
would explain the upgrade above.
Author's current answer: First, you want certain characters
treated differently for some languages that use a given script
than for others that use the same script. That is nearly
impossible to think about, just because there is, in general,
no way to know what language a particular label is supposed to
be associated with, nor is there a way to know what top-level
domain has the label in one of its subtrees (even if one could
reliably associate top-level domains with languages).
-- Second, if I understand your latest note correctly, you
would like to have those characters treated via some contextual
rule ("CONTEXTO"). But the contextual rules yield either
"valid" or "invalid" based on adjacent or nearby characters --
they do not provide different mappings, nor different rules for
different languages (the latter at least partially for the
reasons above).
-- And, finally, your suggestion requires treating capital
letters (or at least some capital letters) as distinct from
their lower-case forms, which would create massive
inconsistencies with IDNA2003 (not just the two characters of
inconsistency with which we have had such extensive
debates) as well as inconsistencies with DNS and host table
practices that go back to the 1970s. No matter how strong your
justification, and even if it were not also tied to
differential treatment for a particular language, I cannot
imagine the WG (or the IETF more broadly) agreeing to that
change.
* 4.5. "Examples of this are Yiddish, written with an extended
Hebrew script, and Dhivehi (the official language of Maldives)
which is written in the Thaana script (which is, in turn, derived
from the Arabic script)" It seems that some explanation about
Yiddish would be welcome so that the language will obtain the same
support as Dhivehi and Thaana.
* 5. "Conversely, lookup applications are expected to reject labels
that clearly violate global (protocol) rules (no one has ever
seriously claimed that being liberal in what is accepted requires
being stupid)." The remark between the parentheses is confusing:
it possibly qualifies as "stupid" a behavior that is not
recommended, but that is acceptable by the document set.
* "Application implementors should be aware that where DNS wildcards
are used, the ability to successfully resolve a name does not
guarantee that it was actually registered." In which terms is this
specific to IDNA?
* 7.6. The Symbol Question. That part actually discusses the Unicode
originated difficulties. Yet, the choice of Unicode has not yet
been discussed.
* 8.2. (Andrew Sullivan) the reference for DNSSEC should probably
refer to DNSSECbis, which is usually [RFC4033], [RFC4034],
[RFC4035], since RFC2535 is obsolete. One can make a strong
argument for also including NSEC3 [RFC5155], but I don't feel too
strongly about that.
* 9. "Adding languages (or similar context) to IDNs generally, or to
DNS matching in particular, would imply context dependent matching
in DNS, which would be a very significant change to the DNS
protocol itself". This sentence seems confusing. Natural languages
are quoted throughout this IDNs document.
* 12. (Andrew Sullivan): Here is where I make my now-familiar plea
for the removal of the following sentence "As is usual with IETF
specifications, while the document represents rough consensus, it
should not be assumed that all participants and contributors agree
with all provisions."
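Illustrative note on the 1.1 and 3.1.3 comments above (case folding, and NFC versus NFKC): the effects under discussion can be observed with a minimal sketch. It assumes Python's unicodedata module and str.casefold(), which implement the Unicode normalization forms and Unicode full case folding; these are not the IDNA2003 mapping tables themselves, and the sample strings are chosen here for illustration only.

```python
import unicodedata


def nfc(s: str) -> str:
    """Canonical composition (NFC)."""
    return unicodedata.normalize("NFC", s)


def nfkc(s: str) -> str:
    """Compatibility composition (NFKC)."""
    return unicodedata.normalize("NFKC", s)


ligature = "\ufb01"        # LATIN SMALL LIGATURE FI
decomposed = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT
sharp_s = "stra\u00dfe"    # contains LATIN SMALL LETTER SHARP S
final_sigma = "\u03c2"     # GREEK SMALL LETTER FINAL SIGMA

# NFC leaves the compatibility ligature alone; NFKC replaces it.
nfc_keeps_ligature = nfc(ligature) == ligature
nfkc_expands_ligature = nfkc(ligature) == "fi"

# NFC composes a base letter and combining mark into one code point.
nfc_composes = nfc(decomposed) == "\u00e9"

# Full case folding merges characters that some languages keep distinct.
folded_sharp_s = sharp_s.casefold()
folded_final_sigma = final_sigma.casefold()

print(nfc_keeps_ligature, nfkc_expands_ligature, nfc_composes)
print(folded_sharp_s, folded_final_sigma == "\u03c3")
```

The last two results show how a folding rule applied uniformly removes a distinction that matters in a particular language, which is the concern behind "another language sometimes loses the ability to treat two characters separately".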
6. IDNA Mapping
* (Andrew Sullivan): In general, I think the document is in
reasonably good shape. I was hoping, however, for it to have some
advice to registries on what sort of considerations are worth
taking into account when formulating policies around what will and
will not be registered.
In particular, I think it would be very helpful to outline what it
would mean for characters to be possibly mappable, and also to
outline the different strategies that are available for preventing
different alternative mappings from ending up with a different
resolution.
The basic advice is that registries should try to detect whether a
candidate label may be expected to be the result of some
(plausible) application of a mapping appropriate in the probable
user community, and similarly that when faced with a candidate
IDN, registries should consider the probable user community and
consider the plausible applications of mappings appropriate to
that community. If the registry does not have the expertise to
evaluate the probable user community for a given code point, then
it should simply reject the code point outright. (This is, in
effect, the advice that you should have a policy, it should be a
policy based on knowledge of the use cases, and that it should
default closed.)
After the registry has detected whether the candidate label may be the result of such a mapping, there
are two basic strategies it might follow:
1. Detect and reserve. In this case, the registry detects
potential mappings, and reserves other candidate labels that might
be the result of such mappings. This reservation takes the form of
preventing registration of that label.
2. Detect and bundle. In this case, the registry detects the
potential mappings, and creates identical entries in the registry
conforming to those "alternative forms" of the candidate label.
There is the potential for a very large number of these bundled
labels.
* (Andrew Sullivan): There's also a great deal of material at the
end of rationale that would be more appropriate in this document
(or else it should just go away, I think).
* (Andrew Sullivan): I think the document is ready to go, assuming
this is what we want to say. Yet its content is a little flabby as
recommendations go. I can imagine a reader being a little
surprised at this advice, for example: "These are a minimal set of
mappings that an application should strongly consider doing. Of
course, there are many others that might be done." That boils down
to, "You might want to do this. Or not. Or something else. Up to
you." I know why we're saying this, but I would not be surprised
if people object to such thin advice. If that's all the advice we
want to offer, however, this is the right document and it should
go ahead.
* Not sure that the terminology of "make sense" is adequate or
clear.
* 1. Introduction - This document is supposed to be separated from
the IDNA document set. It should then document what the IDNA
protocol is. It seems that the IDNA2008 protocols boil down to
"DNS domain names are to be expressed in LDH form. IDNA is a
commonly agreed upon convention wherein if they are entered by the
user in another form, applications are advised to convert them to
UTF in order to filter and map them, as is discussed in the
present document, as well as to transcode them by using the
punycode algorithm. Depending on the Registry policy, their
registration can be carried out in the UTF and/or the transcoded
ASCII form."
* 2.3. NFC is confirmed, NFKC is not discussed.
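Illustrative note on the "detect and reserve" and "detect and bundle" strategies described above: the two can be contrasted with a minimal sketch. Everything in it is an assumption made for illustration: the VARIANTS table, the alternative_forms function, and the Registry class are hypothetical and do not describe any registry's actual policy or any table in the IDNA document set.

```python
from itertools import product

# Hypothetical variant table: a character mapped to the strings that a
# plausible local mapping might produce instead of it.
VARIANTS = {"\u00df": ["ss"], "\u03c2": ["\u03c3"]}


def alternative_forms(label: str) -> set:
    """Detect step: every other label reachable through the variant table."""
    choices = [[ch] + VARIANTS.get(ch, []) for ch in label]
    forms = {"".join(parts) for parts in product(*choices)}
    forms.discard(label)
    return forms


class Registry:
    def __init__(self, strategy: str):
        if strategy not in ("reserve", "bundle"):
            raise ValueError("strategy must be 'reserve' or 'bundle'")
        self.strategy = strategy
        self.registered = {}   # label -> owner
        self.reserved = {}     # blocked label -> label that caused the block

    def register(self, label: str, owner: str) -> bool:
        alts = alternative_forms(label)
        taken = lambda l: l in self.registered or l in self.reserved
        # Default closed: refuse on any conflict with an existing entry.
        if taken(label) or any(taken(a) for a in alts):
            return False
        self.registered[label] = owner
        for alt in alts:
            if self.strategy == "bundle":
                self.registered[alt] = owner   # identical entry, same owner
            else:
                self.reserved[alt] = label     # blocked from registration
        return True


reserve = Registry("reserve")
first = reserve.register("stra\u00dfe", "alice")
second = reserve.register("strasse", "bob")

bundle = Registry("bundle")
bundle.register("stra\u00dfe", "alice")
```

Because alternative_forms takes a product over per-character choices, the number of alternative labels grows multiplicatively with the number of variant characters in the label, which is the "very large number of these bundled labels" noted above. The table here is one-directional; a real policy would also have to detect the reverse case, which this sketch does not attempt.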
7. IDNA Protocol
* As a general comment:
* The SHOULD/MUST chains may be somewhat awkward. MUSTs are used
in a protocol procedure and then an alternative to that
procedure is pragmatically considered. It could be of interest
to draft a MUST tree to consider which cases are, or are not,
covered.
* there is some confusion as to what the "string" is compared to
the label and domain name, in which "Label" may be used instead
of "U-Label" or sometimes "A-Label". Wouldn't it be better to
review the text in qualifying the "labels" in order to be
certain that all the cases are clearly covered?
* 1.para 1: (Martin Duerst): I missed the term IDNA2003 in Defs, it
would have been useful. I didn't complain because I thought it had
been 'deprecated'. Seeing it here, I think it should go back to
Defs, and be actively used in Defs and Protocol at least, to
simplify and clarify prose.
* 1.para 2: (Martin Duerst): "does not changes" -> "does not change"
* 1.para 2: (Martin Duerst): "IDNA does not depend on any changes to
DNS servers, resolvers, or protocol elements" -> "IDNA does not
depend on any changes to DNS servers, resolvers, or DNS protocol
elements" or "IDNA does not depend on any changes to DNS (servers,
resolvers, or protocol elements)" (Otherwise, it's possible to
understand 'protocol elements' as being not limited to DNS.)
* 1.para 4: (Martin Duerst): ", that share some terminology,
reference data and operations." -> "These two protocols share
terminology, reference data and operations."
* 2.para 1: (Martin Duerst): "Terminology used in IDNA, but also in
Unicode or other character set standards and the DNS, appears in
[IDNA2008-Defs]." -> "Terminology used in IDNA appears in
[IDNA2008-Defs]." (where else these terms are used, or where they
are from, can be explained in Defs where necessary, but is
absolutely irrelevant here)
* 2.para 1: (Martin Duerst): "Terminology that is required as part
of the IDNA definition, including the definitions of "ACE",
appears in that document as well." -> remove (first, the word
'definition' is used with two slightly different meanings, and
second, I don't see the point of singling out "ACE".)
* 3.1. Requirement 2: (Martin Duerst): Equivalence is already
defined in Defs. Please make sure there is only a single
definition.
* 3.1. Requirement 2: (Martin Duerst): Why is it a MUST that
U-Labels are compared without case-folding (even for ASCII chars?)
or other steps?
* 3.1. Requirement 2: (Martin Duerst): "In many cases, validation
may be important for other reasons and SHOULD be performed.": Is
this restricted to when trying to compare? Or in general?
* 3.1. Requirement 3: (Martin Duerst): This does double duty, and
should be removed. The alternative is to convert 3.1 into a
conformance section as usual e.g. for ISO standards, but then a
lot more rewriting will be necessary in all of 3.1.
* 3.2. "It does not apply to domain name slots which do not use the
Letter/Digit/Hyphen (LDH) syntax rules." Confusing. Would some of
the DN slots not accept both?
* 3.2. para 1: (Martin Duerst): "IDNA applies to": What does
"applies to" mean?
* 3.2. para 2: (Martin Duerst): "Because it uses the DNS, IDNA
applies" -> "Because IDNA uses the DNS, it applies", or even
better "Because IDNA uses the DNS, IDNA applies" (repetitions
don't hurt in standards, reference before referent does)
* 3.2. para 2: "unless those protocols and implementations of them"
-> "unless those protocols and their implementations"
* 3.2. para 2: (Martin Duerst): "be aware of IDNs in Unicode" -> "be
aware of IDNs" (whether they are in Unicode or not is irrelevant
here)
* 3.2.1. The word CLASS only appears in the whole document set in
two sentences: "IDNA applies only to domain names in the NAME and
RDATA fields of DNS resource records whose CLASS is IN. See RFC
1034 [RFC1034] for precise definitions of these terms. The
application of IDNA to DNS resource records depends entirely on
the CLASS of the record, and not on the TYPE except as noted
below."
What about internationalized domain name in a non IN CLASS?
Author's answer: I have received no further input on this and
will assume that the current text is ok unless I do.
Additional Comment: My reading of that text was that it was a
_restriction_, not a claim of fact. In other words, I
interpreted that text as saying that, if a new class is
invented, IDNA (if it were specified) would need to be
specified separately for that class.
By definition, the only place domain name labels can appear is
in the NAME and RDATA fields of resource records. It's
important to remember that the RDATA field can have subfields
(as it does in the SOA record). I think that's clear enough
from the rest of the discussion, and because the documents
already say, "You need to understand DNS too."
* 3.2.1. (Paul Hoffman): Section 3.2.1 says: "IDNA applies only to
domain names in the NAME and RDATA fields of DNS resource records
whose CLASS is IN." It would be good for the DNS-centric folks in
the WG to verify that they think that this restriction is correct.
Are there really no other fields where domain labels would appear?
* 3.2.1. para 1: (Martin Duerst): The first paragraph reads as if
IDNA applied to domain names in e.g. TXT records in CLASS IN. I
think it would help here to say exactly what is meant by "IDNA
applies". In some sense, IDNA applies nowhere in DNS records, they
are all just ASCII. In some sense (labels starting with xn-- are
presumed to be IDNA labels; you can add an IDN (or a label
thereof) to a DNS record by using A-labels), IDNA applies.
* 3.2.1. para 2: (Martin Duerst): The SVR discussion has significant
overlap with Defs, please reduce.
* 4. "This section defines the procedure for registering an IDN. The
procedure is implementation independent; any sequence of steps
that produces exactly the same result for all labels is considered
a valid implementation." A procedure does provide but does not
define a result?
* 4. para 1: (Martin Duerst): "defines *the* procedure" ... : This
would work better if there were really only one procedure, and it
were written as a procedure. However, there are often variations,
and different, often non-procedural ways in which things are
expressed (e.g. 'labels must ...' instead of 'if a label doesn't
satisfy x, abort')
* 4. para 2: (Martin Duerst): "the registration and lookup protocols
(Section 5)" -> "the registration protocol (this section) and the
lookup protocol (Section 5)" (shortcuts are the enemies of
specifications)
* 4. para 2: (Martin Duerst): "while ... are very similar in most
respects, they are different" -> "while ... are very similar in
most respects, they are not identical"
* 4. para 2: (Martin Duerst): "follow the appropriate steps":
appropriate appeals to value judgement, which isn't adequate here.
* 4.1. The obligation chain reads: "By the time a string enters the
IDNA registration process [], it is expected to be in Unicode []",
yet "registries [] SHOULD avoid any possible ambiguity by
accepting registrations only for A-labels []."
* 4.1. (Andrew Sullivan): "By the time a string enters the IDNA
registration process as described in this specification, it is
expected to be in Unicode and MUST be in Unicode Normalization
Form C (NFC [Unicode-UAX15])." The "expected to be" part is
redundant, since we have a subsequent MUST.
* 4.1. (Andrew Sullivan): I found this slightly confusing: "The
registry SHOULD permit submission of labels in A-label form and is
encouraged to accept both the A-label form and the U-label one. If
it does so,". The "does so" reference there is ambiguous: is it
the submission of A-labels or the A-label+U-label case. The
subsequent text suggests it's the former.
* 4.1. title: (Martin Duerst): Why suddenly "Process" instead of
"Procedure"? Why not just "Input"? And why singular in the title,
and then plural in the first line of the text?
* 4.1. (Martin Duerst): "are outside the scope of these protocols":
How many protocols are there? Only one that's relevant here.
* 4.1. (Martin Duerst): Why is NFC a condition on the input? Please
make it a validation step afterwards, to streamline things.
* 4.1. (Martin Duerst): "Entities responsible for zone files
("registries") are expected to accept only the exact string for
which registration is requested, free of any mappings or local
adjustments.": It's clear to me what we want here, but it's much
better to write this as a condition on the later processing,
rather than on input, something like: "Entities responsible for
zone files ("registries") MUST NOT apply any mappings or local
adjustments of any kind to the exact string for which registration
is requested."
* 4.1. (Martin Duerst): "They SHOULD avoid any possible ambiguity by
accepting registrations only for A-labels, possibly paired with
the relevant U-labels so that they can verify the correspondence."
This has to be improved. First, the SHOULD doesn't belong on the
reason, and the reason, if anywhere, belongs at the end. Second,
there are three possible ways input can come in, so let's list
things up: "Entities responsible for zone files ("registries") MAY
accept input in any of three forms:
1) As a pair of A-label and U-label
2) As an A-label only
3) As an U-label only.
1) and 2) are RECOMMENDED because the use of A-labels avoids any
possibility for ambiguity." (The first sentence in 4.2.1 can then
be removed)
* 4.2.1. (Martin Duerst): This is a complex jungle of conditions on
input, conversions,... What should be done is:
a) extract the 'raw' (without any preconditions) U-label->A-label
and A-label->U-label 'functions' into subsections e.g. in Section
3; these will serve as building blocks both in Section 4 and
Section 5.
b) As the first step of the registration procedure, make sure we
have both an A-label and an U-label. One way to write this is:
"4.2.1: Preprocessing
1) If the input contained an A-label and a U-label, check that
they are equivalent (or whatever that was called; the conditions
are somewhere in Defs). If the check fails, abort registration.
2) If the input contained an A-label, but no U-label, calculate
the U-label according to @@@.
3) If the input contained an U-label, but no A-label, calculate
the A-label according to @@@."
The above makes sure we have both an A-label and an U-label from
here on. Checking on these can be performed independently (e.g.
length check on A-label, NFC check on U-label). Conversion to
punycode is no longer needed in 4.4, because we simply put the
A-label we have now into the zone (assumed we have passed all the
checks up to here, of course).
* 4.2.1. (Martin Duerst): (probably not needed anyway) "both the
A-label form and the U-label one" -> "both the A-label form and
the U-label form"
* 4.2.1. (Martin Duerst): Word the A-label checks more clearly, and
create section "4.2.2 A-label Validation"
* 4.2.3.2. "a combining mark or combining character" -> "a combining
character" (combining marks are a special case of combining
characters, and as such irrelevant here)
* 4.2.3.3. (Martin Duerst): "To check this, each code-point marked
as CONTEXTJ and CONTEXTO in [IDNA2008-Tables] MUST have a non-null
rule." Is this a requirement on Tables? Are there "null rules"?
What purpose do they serve, what's the difference between them and
DISALLOWED?
* 4.2.3.4. (Martin Duerst): What are "characters written from right
to left"? Either we define this clearly here, or we leave it (or
put it) in Bidi, but then we have to rewrite the sentence here
(just requiring conformance to the conditions in Bidi).
* 4.2.4. (Martin Duerst): This is totally unnecessary, please
remove. If we need a summary for what's essentially just about a
page of text, we better give up.
* 4.3. Registry restriction inheritance is not alluded to.
* 4.3. (Martin Duerst): "Policies are likely to be informed by the
local languages" -> "Policies are likely to be informed by the
local scripts and languages" (IDNs are mostly a script issue, much
less a language issue. ICANN has fixed their documents to avoid
only talking about languages (they still could move a bit further
to scripts), so let's not commit the same mistake here again.)
* 4.3. (Martin Duerst): "or the application of special restrictions
to others": like what? Like that such a label can only be resolved
on Tuesdays?
* 4.4. (Martin Duerst): The generic parts of the conversion need to
go somewhere else (Section 3?). The actual conversion (or
checking) needs to go at the start of 4.2. Then this section is
empty and can be removed.
* 4.5. (Martin Duerst): "The A-label is registered in the DNS by
insertion into a zone." -> "The label is registered in the DNS by
inserting the A-label into a zone." (distinguish registration of
the abstract thing from insertion of the concrete thing)
* 5. Does this repetition (already in IDNA Rationale) "the presence
of wild cards in the DNS might cause a string that is not actually
registered in the DNS to be successfully looked up." reflect what
the BIDI document states slightly differently: "Wildcards create the odd
situation where a label is "valid" (can be looked up successfully)
without the zone owner knowing that this label exists. So an owner
of a zone whose name starts with a digit and contains a wildcard
has no way of controlling whether or not names with RTL labels in
them are looked up in his zone."
* 5. (Martin Duerst): para 2: "The two steps described in Section
5.2 are required.": Superfluous. Make sure there's a MUST at the
right place in that section. (Looking at 5.2, I have no clue what
the two steps should be. This shows that indirect requirements
like the above are rather unhelpful.)
* 5.1. (Martin Duerst): first paragraph: Although IDNs will often get
extracted from IRIs or URIs, there are many cases where these
constructs are not involved. Examples would be telnet or ping
commands, and so on. So IRIs and URIs should be deemphasized more.
* 5.1. (Martin Duerst): "Processing in this step and the next two are
local matters, to be accomplished prior to actual invocation of
IDNA.": Again, which steps? Before, we supposedly had two steps in
5.2, now it looks as if we are talking about 5.2 and 5.3 as two
steps. -> Create a subsection such as "Input preparation" or what
where all the preliminary stuff goes in. Alternatively, talk about
subsections, with subsection numbers for clear identification.
* 5.2. The case of a character that is not supported by Unicode is
not discussed.
* 5.2. (Martin Duerst): "is not already Unicode" -> "is not already
in Unicode" (in parallel to 'into' in the line before)
* 5.2. (Martin Duerst): "A Unicode string may require normalization
as discussed in Section 4.1.": There is no "discussion" in 4.1
(and no need for discussion). Express the requirements here
independently of Section 4.
* 5.3. (Wil Tan): section 5.3, A-label Input section to add the
lowercasing step prior to using the Punycode decoding algorithm.
The section on symmetry constraint (-defs-10, section 2.3.2.1)
should also have similar wordings.
* 5.3. (Martin Duerst): (just checking) "See the Name Server
Considerations section of [IDNA2008-Rationale] for additional
discussion on this topic.": From the context, Name Server doesn't
look related (we are client-side here).
* 5.3. (Martin Duerst): "That conversion and testing SHOULD":
Replace 'That' with something clearer and more precise.
* 5.3. (Martin Duerst): para 2: List up the alternatives that are
possible. Avoid mishmash textual paragraphs.
* 5.4. The use of "U-Labels" in this part instead of "Labels" would
probably clarify it.
* 5.4. (Paul Hoffman): Section 5.4 assumes that an application knows
the version of Unicode that is being used in the application. We
should state that assumption in 5.4 or maybe further up near the
beginning of section 5.
Author's answer: It assumes that either the application or the
operating system or library support keeps the two consistent.
That range of options is the reason why Section 5.4 is not more
explicit about which particular software elements or modules
know what. If this is to be changed, I need suggestions about
textual fixes that do not imply that the knowledge must be in
the application itself.
Further comment: The first bullet in 5.4 is the first time that
"version of Unicode" is mentioned, so the note is probably most
effective right there. I propose adding:
This requirement means that the application must use a list of
unassigned characters that is matched to the version of Unicode
that is being used for the other requirements in this section.
It is not required that the application know which version of
Unicode is being used; that information might be part of the
operating environment in which the application is running.
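As an illustration of the point that the Unicode version may be a
property of the operating environment rather than of the application,
the sketch below reads the version from the Python runtime. The helper
name is hypothetical, and General_Category Cn is used as a simplified
stand-in for the UNASSIGNED category; the definition in
[IDNA2008-Tables] may be narrower.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this runtime.
# The application does not choose it; the environment supplies it.
UNICODE_VERSION = unicodedata.unidata_version

def is_unassigned_here(cp: int) -> bool:
    """True if the runtime's Unicode data gives cp General_Category Cn.

    Simplification: Cn is only an approximation of UNASSIGNED.
    """
    return unicodedata.category(chr(cp)) == "Cn"
```

Because both the version string and the category lookup come from the
same library, the list of unassigned code points stays matched to the
Unicode version in use without the application naming that version.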
* 5.4. (Paul Hoffman) The paragraph in section 5.4 that starts "This
test may..." is out of date because the rules in the Bidi document
no longer do inter-label checking. The whole paragraph can be
removed.
* 5.4. (Paul Hoffman) In the light of this, does the WG want to
change the requirement level for checking Bidi on lookup from
SHOULD to MUST? Given the above, I see no reason why not.
* 5.4. "applying the test is likely to give much better information
about the reason for a lookup failure -- information that may be
usefully passed to the user when that is feasible -- than DNS
resolution failure information alone" might this lead to the idea
that they could also be carried in case of the failure to better
document it?
* 5.4. "For all other strings, the lookup application MUST rely on
the presence or absence of labels in the DNS to determine the
validity of those labels and the validity of the characters they
contain". Is it correct to assume that the first labels stand for
"A-Label" and the second one stands for "their corresponding
U-Labels"?
* 5.4. (Martin Duerst): para 1: Mishmash again. Most of this para is
best removed.
* 5.4. (Martin Duerst): para 1: "Putative labels": Both in Section 4
and 5, labels are for the most part putative, because they don't
conform to the definitions unless checked. Either before section
4, or once at the start (Input subsection) of both section 4 and
section 5, say that for the most part, we are dealing with
putative labels, but 'putative' isn't repeated all the time to
make the text easier to read.
* 5.4. (Martin Duerst): page 12: Finally a bullet list. I almost
thought that the author didn't know how to create bullet lists, or
was of the opinion that bullet lists don't have a place in spec.
Quite to the contrary, please make sure there are many more bullet
lists. It will make everything much easier to read and clearer.
* 5.4. (Martin Duerst): "Labels that are not in NFC form as defined
in [Unicode-UAX15].": There is only one definition of NFC, but the
sentence suggests there are several. Please change to "Labels that
are not in NFC [Unicode-UAX15]."
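A lookup-side check for "Labels that are not in NFC" could look like
the sketch below. It assumes Python's standard unicodedata module; the
function name is_nfc is hypothetical.

```python
import unicodedata

def is_nfc(label: str) -> bool:
    """True if normalizing the label to NFC leaves it unchanged."""
    return unicodedata.normalize("NFC", label) == label

composed = "caf\u00e9"      # e with acute as one code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT
```

Here is_nfc(composed) holds while is_nfc(decomposed) does not, which
is the case the bullet is meant to reject.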
* 5.4. (Martin Duerst): Please move bullet 1 (UNASSIGNED) and bullet
4 (DISALLOWED) and all the other table-related bullets together. I
think it's best to put UNASSIGNED last (and mention that this is
the category most subject to change).
* 5.4. (Martin Duerst): Streamline the wording used to refer to
Tables and a category. Currently, we have: "in the UNASSIGNED
category of [IDNA2008-Tables]" - "in the "DISALLOWED" category in
the permitted character table [IDNA2008-Tables]" that are
identified in [IDNA2008-Tables] as "CONTEXTJ"
* 5.4. (Martin Duerst): "Labels whose first character is a combining
mark (see Section 4.2.3.2).": Refer directly to the relevant
Unicode definition, rather than to section 4.2.3.2 (which contains
a MUST, which is already implicit here).
* 5.4. (Martin Duerst): "In any event, lookup applications should
avoid attempting to resolve labels that are invalid under that
test.": Remove. We already have a SHOULD, no need for a should on
top of that.
* 5.4. (Martin Duerst): last para: I assume this is e.g. about
labels with mixed scripts,... What it essentially seems to say is
that a browser may warn users if it detects mixed scripts, but if
the user still wants to see the page, s/he is entitled to it. In
such a context, the word 'validity' seems quite a bit out of
place; it would be better to speak about 'other tests' or some
such in a more general way.
* 5.5. (Martin Duerst): para 1: "using the Punycode algorithm (with
the ACE prefix added)": The parenthetical seems to suggest that
addition or not of the ACE prefix is an (optional) part of the
Punycode algorithm, but RFC 3492 does not define the prefix, nor
is the addition of the prefix part of the punycode algorithm. ->
Convert parenthetical to a clause or sentence ("... and then
adding the ACE prefix." or so).
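The distinction drawn in this comment, that the ACE prefix is added
after Punycode conversion rather than by it, can be sketched as two
separate steps. The example assumes Python's standard "punycode"
codec; both function names are hypothetical.

```python
ACE_PREFIX = "xn--"

def punycode_encode(u_label: str) -> str:
    """Punycode conversion only (RFC 3492); no prefix is involved."""
    return u_label.encode("punycode").decode("ascii")

def to_a_label(u_label: str) -> str:
    """Separate step: prepend the ACE prefix to the Punycode output."""
    return ACE_PREFIX + punycode_encode(u_label)

encoded = punycode_encode("b\u00fccher")
a_label = to_a_label("b\u00fccher")
```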
* 5.5. (Martin Duerst): rest from second sentence in para 1: As said
in my comments on Section 4, a summary is unnecessary. Also, it
has nothing to do with punycode conversion. In addition, the
second bullet point is confusing, because an A-label (checked or
not) cannot be punycode-converted again. -> remove
* 5.6. (Martin Duerst): "That ... string" -> "The string resulting
from the conversion in Section 5.5"
* 5.6. (Martin Duerst): "That lookup" -> "The lookup"
* 5.7. (Martin Duerst): What about (streamlined): Security
Considerations for this version of IDNA are described in
[IDNA2008-Defs], except for the special issues associated with
right to left scripts and characters, which are discussed in
[IDNA2008-BIDI].
* 7. IANA Considerations - There is no commitment from UNICODE to
not update those Unicode documents that are accepted as normative
in the IDNA documentation set. Should their copy at the time of
the publication of this set not be stored by the IANA?
* 8. (Andrew Sullivan): I suggest "This second-generation version
would not have been possible without the work that went into that
first version, due to its authors." Or something like that.
* 8./9. (Martin Duerst): These should be merged. The text explains it
all.
* 8. (Martin Duerst): "Hoffman and Costello ... should not be held
responsible for any errors or omissions.": Remove, this is
implicitly clear, in the end it's the WG and the IETF that's
responsible. Similar for "As is usual with IETF specifications,
while the document represents rough consensus, it should not be
assumed that all participants and contributors agree with all
provisions."
* 9. (Andrew Sullivan): I object very strongly to the inclusion of
the sentence, "As is usual with IETF specifications, while the
document represents rough consensus, it should not be assumed that
all participants and contributors agree with all provisions."
Rough consensus is always rough on everyone, but if you are a
participant who urges this sentence on the product of the WG, I
ask you to reconsider. It is unworthy of your effort and the
efforts of your colleagues. It would be better to have an outright
flamewar on the mailing list than to have that sort of
not-with-a-bang-but-a-whimper remark live forever in the WG
output. If we as a WG really have such deep disagreements that we
have to send drafts with this sort of disclaimer to the IESG, I
feel pretty uncomfortable that the WG has in fact reached
consensus.
* References. (Martin Duerst): [Unicode-RegEx], [Unicode-Scripts],
[Unicode-UAX15] (and maybe others): Unicode data files don't have
explicit authors, but Unicode TRs (and similar stuff) have
authors/editors, same as RFCs. Please don't drop this information.
8. IDNA BIDI
* (Paul Hoffman): This draft is not yet ready for publication. The
text is still very confusing about the relationship between this
document and RFC 3454. In many places, the wording makes it sound
like this new algorithm is a replacement for that in RFC 3454,
which it is not: it is the algorithm to use with IDNA2008. I would
be happy to do a thorough edit to remove this ambiguity and make
it clear that, while the algorithm is an improvement on the old
one, it does not "change" or "fix" or "replace" the old one.
Author's current answer: It changes the old definition in
exactly the same meaning of the word "change" as the way
IDNA2008 changes IDNA2003.
Author's current answer: I'll send you the XML under separate
cover; if you can make the changes you feel are needed, I can
see if I agree with them.
* (Paul Hoffman) The terminology section needs to define "the end of
the label". Tests 3 and 6 of section 2 are confusing without some
definition.
Author's current answer: Will add "beginning" and "end" to the
paragraph that says "<t>In this memo, we use "network order" to
describe the sequence of characters as transmitted on the wire
or stored in a file; the terms "first", "next", "previous",
"before" and "after" are used to refer to the relationship of
characters and labels in network order.</t>"
* Abstract, and potentially elsewhere. (Martin Duerst): Avoid the
word 'new'. RFCs are archival documents.
* 1.1. Advisable or not to specify "when U-labels" instead of
"labels" ?
Author's current answer: The first paragraph says that the
document is about U-labels. This should not need repeating.
* 1.1. (Martin Duerst): para 2: "When labels satisfy the rule, and
when certain other conditions are satisfied, they can be used with
a minimal chance of these labels being displayed in a confusing
way by a bidirectional display algorithm.": "they" .. "these
labels" is confusing. What about "When labels satisfy the rule,
and when certain other conditions are satisfied, there is only a
minimal chance that these labels will be displayed in a confusing
way by a bidirectional display algorithm."
* 1.1. (Martin Duerst): "A bidirectional display algorithm": How
many of them do we have? (I only know one, the Unicode one (with
some minor variants)). How many of them have been used for
testing/verification?
* 1.1. (Martin Duerst): para 3: what exactly is a "right-to-left
character"?
* 1.2. (Andrew Sullivan): "While the document proposes completely
new text, most reasonable labels that were allowed under the old
criterion will also be allowed under the new criterion, so the
operational impact of the rule change is limited." It would be
nice here, I suggest, to offer some definition of what labels fall
outside "most reasonable labels". The description sounds too much
like, "The labels we picked when we wrote this," which is
indubitably not the impression anyone intended.
Author's current answer: I added some examples (mixtures of
numerals, and AN inside LTR labels). Hope it helps.
* 1.2. (Martin Duerst): This section ideally should also be moved to
after Section 2.
* 1.2. (Martin Duerst): para 1: "The IDNA specification
"Stringprep"": change to something like "Stringprep, part of
IDNA2003". Otherwise, it's not clear that this is an old spec.
* 1.2. (Martin Duerst): para 4: "However, this makes certain words"
-> "However, this made certain words" (past tense)
* 1.2. (Martin Duerst): para 7: "While the document specifies rules"
-> "While this document specifies rules"
* 1.2. (Martin Duerst): para 7: "(the most important being label
that mix Arabic and European digits (AN and EN) inside an RTL
label, and labels that use AN in an LTR label)": Very weird. Such
cases may not be completely impossible, but they are much less
frequent than e.g. Arabic numbers inside Arabic letters, European
numbers inside Arabic letters, and so on. There was even a strong
movement to prohibit number mixing at the protocol level; this
would never have happened if such mixing would have been deemed to
be "most important". Also, after looking at the actual conditions,
we either have an RTL label, which by condition 4 excludes mixing
EN and AN, or we have an LTR label, which by condition 5 excludes
AN and therefore the mixture of EN and AN.
* 1.3. (Martin Duerst): title: "Layout" -> "Structure" or
"Organization"
* 1.3. (Martin Duerst): para 1: Change from "bidi test" to "bidi
rule". (or unify otherwise)
* 1.3. (Martin Duerst): para 1: ", that" -> ", which"
* 1.3. (Martin Duerst): para 1: "no matter what the direction of the
label is": What does this mean? It could either mean that you can
apply the test forwards or backwards, or it could mean that it
doesn't depend on what directionality the characters in the label
have, or whatever. In the latter case, I'd write e.g.: "This test
[->rule, see above and below] can be applied to any kind of label,
but becomes trivial if the input is guaranteed to contain only LTR
characters."
* 1.3. (Martin Duerst): "The primary initial use of that test":
"that test" -> "this test" (this sentence talks about relationship
with other documents, so it's the test in this document, not the
test in that other section)
* 1.3. (Martin Duerst): para 2: "a BIDI rule" -> "the BIDI rule"
* 1.3. (Martin Duerst): para 3: "new rule proposed here" -> "new
rule proposed" (we are talking about document organization, so
it's "the rule in that other section over there", so "here"
doesn't fit)
* 1.3. (Martin Duerst): para 4: "Section 5 to Section 9 describe" ->
"Section 5 to Section 7 describe": Section 8 is IANA
consideration.
* 1.4. (Mark Davis) "An RTL label is a label that contains at least
one character of type R or AL." I believe you should also add
"AN". There are cases where it affects ordering. What I mean is
that if you had AN + L in a label (not necessarily in that order), you
wouldn't even count it as a BIDI domain name, and thus none of the
bidi doc would apply (according to the text). Yet such labels
would be legal according to protocol, and I think they can cause
reordering, and could thus cause the kind of visual confusion that
BIDI is supposed to prevent.
Author's current answer: Good point. I'll modify.
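The effect of adding AN to the definition of an RTL label can be
sketched as below. The example assumes Python's unicodedata module for
BIDI property lookup; the function name and the count_an switch are
hypothetical and exist only to compare the two definitions.

```python
import unicodedata

def is_rtl_label(label: str, count_an: bool = True) -> bool:
    """True if the label contains a character of type R or AL,
    and also AN when count_an is set (the change suggested above)."""
    rtl_types = {"R", "AL"} | ({"AN"} if count_an else set())
    return any(unicodedata.bidirectional(ch) in rtl_types for ch in label)

an_plus_l = "abc\u0661"  # Latin letters plus ARABIC-INDIC DIGIT ONE (AN)
```

With the original definition (count_an=False) the AN + L label is not
counted as an RTL label, so none of the BIDI document would apply to
it; with AN included it is.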
* 1.4. BIDI properties come from Unicode. They might not be complete
or could be completed in the future. What then?
Author's current answer: See section 7.2, "This memo does not
propose a solution for this problem".
* 1.4. (Andrew Sullivan): There are some terms defined in Para 1.4.
I think it would be way helpful to a naive reader to be directed
to defs at the beginning of this section first, and then to say
"there are specific BIDI-only terms also defined here". So I'd
move the reference that's at the end of this section to the
beginning.
* 1.4. (Andrew Sullivan): The third paragraph now ends with a comma,
so it looks like something was supposed to be added and wasn't. Or
is this just a typo?
Author's current answer: Typo, fixed.
* 1.4. (Andrew Sullivan): I find this peculiar: A "Bidi domain name"
is a domain name that contains at least one RTL label. If a domain
name is RTL.RTL.RTL, it qualifies under this definition, even
though there is no bidirectionality (all labels have the same
directionality). Explaining why this is still "bidi" would leave
me less confused.
Author's current answer: Added some text to say "adding a
separate RTL-name category would just make the spec more
complicated".
* 1.4. (Martin Duerst): "non spacing" -> "nonspacing"
* 1.4. (Martin Duerst): "The directionality of such examples" ->
"The display order of such examples"
* 1.4. (Martin Duerst): "it means ..., approximately" -> "it
approximately means"
* 1.4. (Martin Duerst): "An RTL label": This seems to be the
definition that Protocol might want to refer to.
* 1.4. (Martin Duerst): 'Having a separate category of "RTL domain
names" would not make this specification simpler, so has not been
done.' -> 'Providing a separate category of "RTL domain names"
would not make this specification simpler.'
* 2. A replacement for the RFC 3454 BIDI rule: it would probably be
good to indicate the applying order.
Author's current answer: The 6 conditions can be checked in any
order. All must be satisfied in order to make the test pass;
different implementations may find that different checking
orders make the code more or less efficient.
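The order independence described in this answer can be sketched as a
plain conjunction of predicates. The sketch is deliberately partial:
it models only the three conditions that are quoted or paraphrased in
these comments (1, 4 and 5), not the full set of six in the BIDI
draft. All function names are hypothetical, labels are assumed to be
non-empty, and a label is treated as RTL when its first character is
R or AL, which is an assumption of this sketch.

```python
import unicodedata
from itertools import permutations

def bidi_types(label):
    return [unicodedata.bidirectional(ch) for ch in label]

def first_is_l_r_al(label):
    # Condition 1 as quoted in the 2.1 comment.
    return bidi_types(label)[0] in {"L", "R", "AL"}

def rtl_does_not_mix_en_an(label):
    # Paraphrase of condition 4 (from the 1.2 comment).
    types = bidi_types(label)
    if types[0] not in {"R", "AL"}:
        return True
    return not ("EN" in types and "AN" in types)

def ltr_has_no_an(label):
    # Paraphrase of condition 5 (from the 1.2 comment).
    types = bidi_types(label)
    if types[0] != "L":
        return True
    return "AN" not in types

CONDITIONS = (first_is_l_r_al, rtl_does_not_mix_en_an, ltr_has_no_an)

def passes(label, conditions=CONDITIONS):
    # All conditions must hold; the order of evaluation is irrelevant.
    return all(check(label) for check in conditions)

def same_in_every_order(label):
    results = {passes(label, order) for order in permutations(CONDITIONS)}
    return len(results) == 1
```

Since each predicate depends only on the label, an implementation may
evaluate the cheapest or most frequently failing condition first
without changing the result.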
* 2. (Martin Duerst): (title), and elsewhere: Both "Bidi rule" and
"Bidi test" are used, that's confusing. The term is always in
singular. The document works that way in general, but "The
following test" at the start of Section 2 is confusing, because
the only 'tests' that one can see are the ones labeled 1. to 6.
Maybe use something like "In order to pass the BIDI test, the
following conditions 1. to 6. must all be satisfied."
* 2. (Martin Duerst): conditions 2/4: Why are BN (control
characters) allowed in RTL but not in LTR?
* 2.1. (Andrew Sullivan): Rule 1 in Para. 2 says, "The first
character must be a character with BIDI property L, R or AL." I
can't tell whether that must is a requirement or a statement of
fact that is entailed by other IDNA rules. If it's a requirement,
it presumably ought to be a 2119 MUST; even if not, it seems that
we have to know what to do in case the first character doesn't
match this rule. If it's an entailment, it'd help to make that
plain, which could be done by restating it, "The first character
will be a character with BIDI property L, R, or AL due to [reason
reference]."
Author's current answer: I'm not sure what an entailment is -
but this is a rule. In order to execute the test, you must look
at the first character and check its BIDI property. This
document isn't using 2119 (wow); I used to have the reference
way back when, but it didn't seem to help any, so I removed it.
* 3. (Andrew Sullivan): "One specific requirement was thought to be
problematic, but turned out to be satisfied by a string that obeys
the proposed rules:
* The Character Grouping requirement should be satisfied when
directional controls (LRE, RLE, RLO, LRO, PDF) are used in the
same paragraph (outside of the labels). Because these controls
affect presentation order in non-obvious ways, by affecting the
"sor" and "eor" properties of the Unicode BIDI algorithm, the
conditions above require extra testing in order to figure out
whether or not they influence the display of the domain name.
Testing found that for the strings allowed under the rule
presented in this document, directional controls do not
influence the display of the domain name."
comes after the discussion of things considered and rejected. This
leaves me confused about whether the text above is in fact a
requirement or not. If it is a requirement, then I'd move this
segment to the part before the rejected requirements.
Author's current answer: Added a little more text to explain
the status of the requirement - it's a "nice to know".
* 3. (Martin Duerst): "A requirement" -> "The requirement" (see
above)
* 3. (Martin Duerst): para 2: As this restricts things to the
Unicode bidi algorithm, please say this earlier. (see above)
* 3. (Martin Duerst): para 3: "requirements proposed" ->
"requirements" (we are working on finalizing this document, we are
no longer in the proposal stage)
* 3. (Martin Duerst): requirement 2: Is the choice of 'characters
delimiting the labels' open, is this only the ASCII dot, is this a
small set (I'm interested in this both for spec clarity and
because the answer might strongly affect draft-duerst-iri-bis).
* 3. (Martin Duerst): 'possible requirement' related to
directionality controls: "(outside of the labels)" -> "(outside,
but potentially directly adjacent to the labels)" (does this
include cases with directionality controls inside a domain name,
i.e. before/after a dot?) "the conditions above require extra
testing" -> "the conditions above required extra testing"
* 3. (Martin Duerst): 'Delimiterchars': FULL STOP not allowed in
domain names?
* 4.1. (Martin Duerst): Thaana 'Computer' example: "UBIUFILI" ->
"UBUFILI"
* 4.2. (Martin Duerst): This section could be shortened
considerably. "Greater latitude here than ... Dhivehi." is
irrelevant; as long as a significant part of a language's words
cannot be used in IDN, there's a problem. The subsection is
interesting for people interested in Yiddish, but the average
reader of the spec will try to find something relevant for the
algorithm, and mostly be more confused than enlightened.
* 4.2.3.4 (James Mitchell): If the proposed label contains any
characters that are written from right to left it MUST meet the
"bidi" criteria [IDNA2008-BIDI].
The above implies that the label must meet the BIDI criteria;
however, the BIDI criteria are applied to a BIDI domain name. From
draft-ietf-idnabis-bidi-04: "The following test has been developed
for labels in BIDI domain names" and "A 'Bidi domain name' is a
domain name that contains at least one RTL label." I hesitate to
provide alternative text until the following question has been
answered.
The protocol states that if the proposed label contains any
characters written from right to left it MUST meet the bidi
criteria. It does not impose such requirements on labels
containing no right to left characters. Consider the registration
of label 123abc in a zone containing labels written right to left.
The label does not contain any right to left characters, therefore
does not have to meet the BIDI criteria. However this name is a
BIDI domain name, yet such a name would fail as the first label
(LTR) does not begin with BIDI property L. Is the label intended
to be valid for registration given a right to left zone?
Author's current answer: In your specific case, registering
"123abc" in the RTL zone "ABC" (usual convention applies) will
lead to a domain name (network order: 123abc.ABC) that, on its
own, will display as "123abc.CBA" in an RTL context, but if
prepended by "DEF:", forming the network order string
DEF:123abc.ABC, will display as "CBA.abc123:FED" - which may be
surprising to some. (I'm only 90% confident on this - people
more used to bidi in practice may be more confident).
The WG has rejected inter-label tests, therefore all tests
defined by the protocol as normative (MUST or SHOULD) apply
only to one label at a time.
Given the WG decision, I tried to make the BIDI document quite
clear that certain properties can only be guaranteed for domain
names where all the labels meet the test, but this is a case
where people have to read the warnings and do something
reasonable, rather than having the rules define that being
unreasonable is forbidden.
"Warning: Contains hot liquids".
I am not concerned about the display ordering of the name in
question. The issue is a mismatch between the registration and
lookup protocols.
The registration protocol asks the question of the label '123abc',
which, being left-to-right, is not required to satisfy BIDI. However,
the lookup protocol says that one SHOULD apply the BIDI test (on, I
assume, the name). Applying the BIDI test to this name will fail
and the name will not be looked up.
As a registry, should I allow registration of the name 123abc.RTL?
Author's current answer: My recommendation is that you should
(as a registry) establish policy that says "it is not allowed".
The protocol does not require you to do so.
As an application, should I lookup the name 123abc.RTL?
Author's current answer: The protocol does not say that you
can't. For the obvious reasons, I think it's a legitimate
implementor decision to decide not to.
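The registration/lookup mismatch raised in this exchange can be made
concrete for the label "123abc". The sketch assumes Python's
unicodedata module and reads "characters written from right to left"
as BIDI types R and AL; both helper names are hypothetical.

```python
import unicodedata

def has_rtl_character(label):
    """Registration-side trigger: any character of type R or AL."""
    return any(unicodedata.bidirectional(ch) in {"R", "AL"} for ch in label)

def first_character_type(label):
    """BIDI property of the first character, as condition 1 examines."""
    return unicodedata.bidirectional(label[0])

label = "123abc"
needs_bidi_at_registration = has_rtl_character(label)  # False
first_type = first_character_type(label)               # "EN"
```

The label contains no right-to-left character, so registration does
not require the BIDI criteria, yet its first character has type EN
rather than L, R or AL, so the test fails if a lookup application
applies it to this label inside a BIDI domain name.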
* 4.2.3.4 (James Mitchell): I note also the inconsistent use of the
term BIDI. In draft-ietf-idnabis-protocol-14 section 4.2.3.4. it
is quoted and in lower-case, whereas the
draft-ietf-idnabis-bidi-04 uses the upper-case version
extensively. Also, within draft-ietf-idnabis-bidi-04 the term
"Bidi domain name" in Section 1.4 is inconsistent with BIDI domain
names in Section 2, and the term Bidi rule in Section 10 is
inconsistent with the several other occurrences in the document.
Author's current answer: Thanks, I tried to normalize it to
uppercase in an earlier round, but didn't remain consistent in
later edits. Will fix!
* 4.3. (Martin Duerst): "(with the 5 being considered right-to-left
because of the leading ALEF)": No, the 5 itself is never
right-to-left. Change to "(the overall directionality being
right-to-left because of the leading ALEF)"
* 4.3. (Martin Duerst): "but barring them both seems to require
justification" -> "but barring them both seems unnecessary" or
"but barring them both turned out to be unnecessary"
* 5. (Martin Duerst): "Even if a label is registered under a "safe"
label,": 'under' should be explained more clearly (I assume this
refers to the hierarchical relationship in the DNS)
* 5. (Martin Duerst): last paragraph: It would be better to change
this into a SHOULD, such as "Where implementations see a way to
avoid ..., they SHOULD avoid". That will bring this issue on the
radar screen of implementers, whereas it currently will just be
glossed over.
* 6. (Mark Davis): "Rules can also be specified at the protocol level,
but while the example above involves right-to-left characters,
this is not inherently a BIDI problem." I think the issue is that
the word "can" was appropriate for when this was a proposal, but
the situation is different in heading for release; the "can"
should be changed according to what you mean.
* If you are referring to the situation as of when these are all
released, then the rules either "are specified" or they are not
(in which case the statement should be removed).
* If you are referring to a future time, then the "can" becomes
"could" (or for clarity, adding "in a future version" or some
such language).
* If you are referring to what could have been done, then "could
also have been" would be appropriate.
Because I can't tell what you want to say, I don't know which of
these you would mean.
Author's current answer: There's no guarantee of synchronicity in
further updates, so the situation isn't really all that
different (BIDI doesn't place any constraint on future
-tables), but I'll change to "are".
* 6. (Martin Duerst): first paragraph: "All other issues with these
scripts": What scripts???
* 6. (Martin Duerst): "wishes to create rules for the mixing of
digits" -> "wishes to create rules against the mixing of digits"
or "wishes to restrict the mixing of digits"
* 6. (Martin Duerst): "Rules are also specified at the protocol
level, but while the example above involves right-to-left
characters, this is not inherently a BIDI problem." -> "This
example is not inherently a BIDI problem, so such restrictions are
not specified at the protocol level." ("Rules are also specified
at the protocol level" is inherently vague; it seems to mean "Some
rules against mixing digits are also specified at the protocol
level, but only when this is necessary to avoid a BIDI problem.")
* 6. (Martin Duerst): "It is unrealistic to expect that applications
will display domain names using embedded formatting codes between
their labels (for one thing, no reliable algorithms for
identifying domain names in running text exist);": Please add that
it is also unrealistic that formatting codes are removed before
IDNA processing, and that allowing formatting codes could lead to
many kinds of 'mischief' that would go against the two
requirements in section 3.
* 6. (Martin Duerst): "which might surprise someone expecting to see
labels displayed in hierarchical order.": Please add that this may
not be such a big problem to general users familiar with BIDI,
because they are used to seeing/reading a sequence of RTL units
(e.g. words) from right to left. (for wording alternatives, see
http://tools.ietf.org/html/rfc3987#section-4.4, first para,
*second para*, ...)
* 7. Does that restriction mean that telephone numbers cannot be
registered in BIDI zones?
Author's current answer: If the registry desires that domain
names behave sensibly, yes; if the registry only desires that
domain names pass the test, no. There are no inter-label tests.
* 7.1 (Martin Duerst): Bullet points 1 and 2 are major, whereas
bullet point 3 is really farfetched (not impossible just because
there is no guarantee against weird implementations). It would be
good to indicate that somehow. (this includes the paragraph
following bullet point 3)
* 7.1. (Martin Duerst): "The editors believe": change to something
less specific; this is a WG document, we either have rough
consensus or we don't. (I for one fully agree with this point)
* 7.2. (Martin Duerst): This should be slightly reworded to more
clearly send the message that changes to Unicode bidi properties,
while not totally impossible, are expected to be rare, and to
affect mostly symbols and the like, which will limit their effect
on what the BIDI rule(/test) allows and what not.
* 8. IANA considerations. Same remark as in the Protocol case.
Author's current answer: The Unicode Consortium does not make
changes to published versions of its standards; I believe we
can trust them to keep version 5.1 available for a while.
Moreover, the section above then states: "the determination of
validity for any string depends on the Unicode BIDI property
values, which are not declared immutable by the Unicode
Consortium."
Author's current answer: See section 7.2.
* 8. (Martin Duerst): "It is possible that differences in the
interpretation of the specification": Wrong. There are no
differences in interpretation for the old spec. There are no
differences in the interpretation of the new spec. There are
differences in the specs themselves.
9. IDNA Tables
* disorder in paragraphs.
Author's current answer: The exact order of the sections will
be decided upon (and sections will potentially be moved around)
at the time of publication by the RFC Editor while doing other
formatting changes.
I have been in contact with the other document editors, and our
suggestion to our wg chair is to *NOT* risk destroying the
actual content of the messages by moving things around at the
time of last call (both wg and IETF).
* (Gihan Dias): Tamil digits. John Klensin: "I see a considerable
difference between, e.g.,
- "exclude Tamil numerals"
and
- "this character looks like that character, so exclude one of
them".
The latter is clearly part of a case-by-case character analysis.
The former, whatever it might be, is a decision about a class of
characters, whether Unicode's selection of properties identifies
it as a class or not."
The Sri Lanka IDN Task Force considered the document
draft-ietf-idnabis-tables-06.txt and has an issue with the inclusion
of the Tamil Digits as valid IDNA characters.
0BE6..0BEF ; PVALID # TAMIL DIGIT ZERO..TAMIL DIGIT NINE
However, we consider that the potential for confusion is
sufficient that they be disallowed in the protocol, and request
that they be excluded.
Author's current answer: Given the current rules, and the
properties in the Unicode Database, the only way to treat
0BE6..0BEF as DISALLOWED is to add them to the exceptions table
one by one. I.e. add them to section "2.6. Exceptions (F)" with
the explicit value DISALLOWED.
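The mechanism described in this answer, listing the ten code points in
the exceptions table with an explicit value, can be sketched as a
lookup that is consulted before the general rules. Everything here is
hypothetical: the table contents reflect the request, not the
published draft, and the general rules are reduced to a stand-in that
models decimal digits only.

```python
import unicodedata

# Hypothetical exceptions table: Tamil digits forced to DISALLOWED.
EXCEPTIONS = {cp: "DISALLOWED" for cp in range(0x0BE6, 0x0BEF + 1)}

def derived_property(cp: int) -> str:
    """Simplified derived-property lookup (assumption, not the draft)."""
    if cp in EXCEPTIONS:  # exceptions are consulted first
        return EXCEPTIONS[cp]
    # Stand-in for the general rules: only decimal digits are modeled.
    if unicodedata.category(chr(cp)) == "Nd":
        return "PVALID"
    return "NOT-MODELED"

tamil_zero = derived_property(0x0BE6)
ascii_seven = derived_property(ord("7"))
```

Without the exceptions entry, U+0BE6 would fall through to the digit
rule and come out PVALID, which matches the table line quoted above.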
* 1. Introduction. "In particular, some combinations of allowed code
points are not advisable for use in IDNs due to rules specific to
a script or class of characters" introduces the concept of a
"class of characters", but does not document it. IDNA Rationale
7.1.3 states "Maintain IDNA and Unicode tables that are consistent
with regard to versions, i.e., unless the application actually
executes the classification rules in [IDNA2008-Tables]" yet the
only time "classifications (rules)" appears in IDNA Tables is in
"4. Code points" as "The Categories and Rules defined in Section 2
and Section 3 apply to all Unicode code points. The table in
Appendix B shows, for illustrative purposes, the consequences of
the categories and classification rules, and the resulting
property values."
What is a "class of characters"?
* 1. Introduction ends with "This document is part of a series
that, together, constitute a proposal for updating the IDNA
standards to resolve issues uncovered in recent years, cover a
broader range of scripts, and provide for migration to newer
versions of Unicode. See [IDNA2008-rationale] for a broader
discussion." Should this not be removed or edited?
* 2.1. "For more information, see section 4.5 of The Unicode
Standard [Unicode5]." Is it also the case in Unicode 5.1?
Shouldn't this document be stored by the IANA?
* 2.2. NFKC or NFC?
* 2.2. (Paul Hoffman) Section 2.2 uses NFKC, but the protocol itself
uses NFC. I think it is useful to make a note to this effect in
Section 2.2.
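The practical difference between the two forms can be seen with
Python's standard unicodedata module. The sketch below assumes that
the Section 2.2 test has the form cp != NFKC(casefold(NFKC(cp))); the
helper names are hypothetical.

```python
import unicodedata


def nfc(text):
    """Canonical composition only (the form the comment attributes to the protocol)."""
    return unicodedata.normalize("NFC", text)


def nfkc(text):
    """Compatibility decomposition followed by canonical composition."""
    return unicodedata.normalize("NFKC", text)


def is_unstable(text):
    """Assumed form of the Section 2.2 test: NFKC, case fold, NFKC again."""
    return text != nfkc(nfkc(text).casefold())


if __name__ == "__main__":
    ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
    print(nfc(ligature) == ligature)   # NFC leaves the ligature alone
    print(nfkc(ligature))              # NFKC maps it to the letters "fi"
    print(is_unstable("A"), is_unstable("a"), is_unstable(ligature))
```

A compatibility character such as the ligature is untouched by NFC but
rewritten by NFKC, which is why a note on the two forms seems useful.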
* 2.8. (Andrew Sullivan): "JoinControl (H)
H: Join_Control(cp) = True
This category consists of Join Control characters (i.e., they are
not in LetterDigits (Section 2.1)) but are still required in IDN
labels under some circumstances. They require extended special
treatment in Lookup and Resolution."
I think we previously agreed just to call the action "lookup".
Strictly, all the special treatment is part of the lookup process,
but not the resolution process (which is a straight DNS activity
that happens to be using an A-label as its QNAME). As I've argued
before, I want the documents to stay very far away from any
suggestion that they are changing the operation of the DNS as
such.
* 2.10. "It should be noted that Unicode distinguishes between
'unassigned code points' and 'unassigned characters'". Can the
difference between the two, both in nature and in its relevance
to IDNA, be explained here?
* 5. IANA Considerations. It is suggested that IANA should retain
online copies of the versions of external documents that are
normatively referenced in the IETF documents.
* 7. (Andrew Sullivan): "As is usual with IETF specifications,
while the document represents rough consensus, it should not be
assumed that all participants and contributors agree with all
provisions." I'll spare participants my speech on why this is a
bad thing this time.
* "A table from which that registry can be initialized, and some
further discussion, appears in Appendix A." Who is to decide on
and maintain the table, and according to which rules and procedures?
* Appendix A. As a comment, we do not understand, given the logic
presented, why:
* Tamil digits cannot be made subject to a rule and added to
CONTEXTO;
* the same cannot be done for French majuscules;
* the same cannot be done for any zone-specific restriction.
It seems to be implied that the logic must be the same at the
sending and receiving ends. However, the receiving end only decodes
what the sending end chose to encode in its own context. That
context needs to be considered and supported. If my application
operates in Tamil or French, it knows this and can be required to
proceed accordingly.
* Appendix A (Andrew Sullivan): paragraph "Note that "Before" and
"After" do not refer to the visual display order of the character
in a label, which may be reversed or otherwise modified by the
bidirectional algorithm for labels including characters from
scripts written right-to-left." might benefit from the addition of
another sentence, "Instead, 'Before' and 'After' refer to the
network order of the character in the label."
* Appendix A (Andrew Sullivan): "Appendix A.7. KATAKANA MIDDLE DOT
Code point:
U+30FB
Overview:
Note that the Script of Katakana Middle Dot is not any of
"Hiragana", "Katakana" or "Han". The effect of this rule is to
require at least one character in the label to be in one of those
scripts. ...." There is no "End For", as called for in the
pseudocode definition.
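The effect described in the quoted overview, i.e. that at least one
character of the label must be in one of the three scripts, can be
sketched in Python as an explicit loop with a clearly closed body. The
function names are hypothetical, and the script test uses simplified
code point ranges as an assumption; a real implementation would use
the Unicode Script property.

```python
KATAKANA_MIDDLE_DOT = 0x30FB

# Simplified, assumed ranges standing in for the Unicode Script property.
SCRIPT_RANGES = (
    (0x3041, 0x3096),  # Hiragana letters
    (0x30A1, 0x30FA),  # Katakana letters (U+30FB itself is excluded)
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (Han)
)


def in_required_script(cp):
    """True if cp falls in one of the assumed Hiragana/Katakana/Han ranges."""
    return any(low <= cp <= high for low, high in SCRIPT_RANGES)


def middle_dot_context_ok(label):
    """Start from False; any character in a required script makes it True."""
    result = False
    for ch in label:          # "For All Characters"
        if in_required_script(ord(ch)):
            result = True
    # the loop body ends here, which is what "End For" would mark
    return result


if __name__ == "__main__":
    print(middle_dot_context_ok("\u30a2\u30fb\u30a4"))  # Katakana around the dot
    print(middle_dot_context_ok("a\u30fbb"))            # Latin only around the dot
```

Because U+30FB is outside the three ranges, a label consisting of the
middle dot alone does not satisfy the check.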
Author's address
Jean-Francois C. Morfin
INTLNET
23 rue Saint Honore
Versailles
78000 Versailles
France
Phone: (33.1) 39 50 05 10
Email: jefsey@jefsey.com
URI: http://intlnet.org