Network Working Group                                         J. Klensin
Internet-Draft                                             February 2003
Expires August 2003

User Interface Evaluation and Filtering of Internet Addresses and Locators
draft-klensin-name-filters-00.txt

Status of this Memo

This document is an Internet-Draft and is subject to all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

Abstract

Many Internet applications have been designed to deduce top-level domains (or other domain name labels) from partial information. Whether or not this practice is desirable from an overall network standpoint, the designers of the applications believe that it leads to a better and more responsive user experience. The introduction of new top-level domains, especially non-country-code ones, has exposed flaws in some of the methods used by these applications. These flaws make it more difficult, or impossible, for users of the applications to access the full Internet. This memo discusses the techniques used and gives some guidance for minimizing their negative impact as the domain name environment evolves.

1. Introduction

Designers of user interfaces to Internet applications have often found it useful to examine user-provided values for validity before passing them to the Internet tools themselves.
This type of test, most commonly involving syntax checks or application of other rules to domain names, email addresses, or "web addresses" (URLs or, occasionally, extended URI forms), may enable better-quality diagnostics for the user than might be available from the protocol itself. Such tests are also thought to improve the efficiency of back-office processing programs and to reduce load on the protocols themselves. Certainly they are consistent with the well-established principle that it is better to detect errors as early as possible.

The tests must, however, be made correctly or at least safely. If criteria are applied that do not match the protocols, users will be inconvenienced, addresses and sites will effectively become inaccessible to some groups, and business and communications opportunities will be lost. Experience in recent years indicates that syntax tests are often performed incorrectly, perhaps by assuming that the syntax rules are the same for email addresses and URLs, and that tests for top-level domain names are applied using obsolete lists and conventions. We assume that most of these incorrect tests result from an inability to conveniently locate exact definitions of the criteria to be applied. This document draws summaries of the applicable rules together in one place and supplies references to the actual standards. It does not add anything to those standards; it merely draws the information together into a form that may be more accessible.

Many experts on Internet protocols believe that tests and rules of these sorts should be avoided in applications and that the tests in the protocols and back-office systems should be relied on instead. Certainly implementations of the protocols cannot assume that the data passed to them will be valid. Unless the standards specify particular behavior, this document takes no position on whether or not the testing is desirable. It only identifies the correct tests to be made if tests are to be applied.

The sections that follow discuss domain names, email addresses, and URLs.

2. Restrictions on domain (DNS) names

The authoritative definitions of the format and syntax of domain names appear in RFCs 1035, 1123, and 2181 ([RFC1035], [RFC1123], [RFC2181]).

Any character, or combination of bits, is permitted in DNS names. However, there is a preferred form that is required by most applications. That form has been the only form permitted in TLD names, and in most second-level names registered in TLDs. It is known as the "LDH rule", after the characters that it permits (letters, digits, and hyphen). The LDH rule, as updated, provides that the labels (words or strings separated by periods) that make up a domain name must consist only of the ASCII [ASCII] alphabetic and numeric characters, plus the hyphen. No other symbols or punctuation characters are permitted, nor is blank space. If the hyphen is used, it is not permitted to appear at either the beginning or end of a label. There is an additional rule that essentially requires that top-level domain names not be all-numeric.

Internet protocols are designed to work well only when given "fully-qualified" domain names, i.e., ones that include all of the labels leading to the root, including the TLD name. Consequently, purported DNS names to be used in applications and to locate resources generally must contain at least one period (".") character. Those that do not are either invalid or require the application to supply additional information. Of course, this principle does not apply when the purpose of the application is to process or query TLD names themselves.

There is a long history of applications moving beyond the "one or more periods" test to trying to verify that a valid TLD name is actually present. They have done this either by applying some heuristics to the form of the name or by consulting a local list of valid names. The heuristics are no longer effective.
If one is to keep a local list, much more effort must be devoted to keeping it up-to-date than was the case several years ago.

The heuristics were based on the observation that, since the DNS was first deployed, all top-level domain names were two, three, or four characters in length. All two-character names were associated with "country code" domains, with the specific labels (with a few early exceptions) drawn from the ISO list of codes for countries and similar entities [IS3166]. The three-letter names were "generic" TLDs, whose function was not country-specific. And there was exactly one four-letter TLD, the infrastructure domain "ARPA." [RFC1591]. These length-dependent rules were, however, conventions, rather than anything on which the protocols depended.

Before the mid-1990s, lists of valid top-level domain names changed infrequently. New country codes were gradually, and then more rapidly, added as the Internet expanded, but the list of generic domains did not change at all between the establishment of the "INT." domain and ICANN's allocation of new generic TLDs in 2000. Some application developers responded by assuming that any two-letter domain name could be valid as a TLD, but that the list of generic TLDs was fixed and could be kept locally and tested. Several of these assumptions changed as ICANN started to allocate new top-level domains: one two-letter domain that does not appear in the ISO 3166 table was tentatively approved, and new domains were created with three-, four-, and even six-letter codes.

As of the first quarter of 2003, the list of valid, non-country, top-level domains was .aero, .biz, .com, .coop, .edu, .gov, .info, .int, .mil, .museum, .name, .net, .org, .pro, and .arpa. ICANN is expected to expand that list at regular intervals, so the list that appears here should not be used in testing.
Instead, systems that filter by testing top-level domain names should regularly check the current list of TLDs (both "generic" and country-code-related) published by IANA at http://www.iana.org/domain-names.htm.

The better strategy has probably now become to apply the "at least one period" test, to verify LDH conformance (including verification that the apparent TLD name is not all-numeric), and then to use the DNS to determine domain name validity, rather than trying to maintain a local list of valid TLD names.

3. Restrictions on email addresses

Reference documents: RFC 2821, RFC 2822

Contemporary email addresses consist of a "local part" separated from a "domain part" by an at-sign ("@"). The syntax of the domain part corresponds to that in the previous section, and the same concerns about filtering and lists of names apply. The domain name can also be replaced by an IP address in square brackets, but that form is strongly discouraged except for testing and troubleshooting purposes.

The local part may appear using the quoting conventions described below. The quoted forms are rarely used in practice, but are required for some legitimate purposes and should not be rejected. Subject to the quoting constraints, any ASCII character, including control characters, may appear quoted, or in a quoted string. The backslash character is used to quote the following character. For example

   Abc\@def@example.com

is a valid form of an email address. Blank spaces may also appear, as in

   Fred\ Bloggs@example.com

The backslash character may be used to quote itself, e.g.,

   Joe.\\Blow@example.com

Conventional double-quote characters may be used to surround strings. For example

   "Abc@def"@example.com

   "Fred Bloggs"@example.com

are alternate forms of the first two examples above. The quoted forms are rarely recommended, and are uncommon in practice, except insofar as they are needed for transitions from other systems and contexts, but those transitional requirements still arise.

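
The quoting conventions above, together with the domain checks recommended in section 2, can be illustrated with a minimal Python sketch. It is not a complete RFC 2821/2822 address parser: the names split_address and passes_domain_checks are hypothetical, the bracketed IP address form and a trailing root period are deliberately not handled, and no DNS lookup is attempted.

```python
import re

# One LDH label: ASCII letters, digits, and hyphens, with no hyphen at
# either end of the label.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def split_address(address):
    """Split an address at the first "@" that is not quoted.

    Returns (local_part, domain_part). Raises ValueError when no
    unquoted "@" is found.
    """
    in_quotes = False
    i = 0
    while i < len(address):
        ch = address[i]
        if ch == "\\":
            i += 2  # a backslash quotes the character that follows it
            continue
        if ch == '"':
            in_quotes = not in_quotes  # entering or leaving a quoted string
        elif ch == "@" and not in_quotes:
            return address[:i], address[i + 1:]
        i += 1
    raise ValueError("no unquoted '@' in address: %r" % (address,))


def passes_domain_checks(domain):
    """Apply the syntax checks of section 2; validity is left to the DNS."""
    labels = domain.split(".")
    if len(labels) < 2:  # the "at least one period" test
        return False
    if not all(_LDH_LABEL.fullmatch(label) for label in labels):
        return False
    return not labels[-1].isdigit()  # apparent TLD must not be all-numeric


if __name__ == "__main__":
    for addr in (r"Abc\@def@example.com", '"Fred Bloggs"@example.com'):
        local, domain = split_address(addr)
        print(repr(local), domain, passes_domain_checks(domain))
```

The sketch accepts any quoted local part without further inspection, in keeping with the principle that valid forms should not be rejected; a domain that passes these syntax checks would still need to be looked up in the DNS.
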
Without quotes, local-parts may consist of any combination of alphabetic characters, digits, or any of the special characters

   ! # $ % & ' * + - / = ? ^ _ ` { | } ~

Period (".") may also appear, but may not be used to start or end the local part, nor may two or more consecutive periods appear. Forms such as

   user+mailbox@example.com

   customer/department=shipping@example.com

   $A12345@example.com

   !def!xyz%abc@example.com

   _somename@example.com

are valid and are seen fairly regularly, but any of the characters listed above is permitted. In the context of local parts, apostrophe ("'") and grave accent ("`") are ordinary characters, not quoting characters. Some of these characters are used in conventions about routing or other types of special handling by some receiving hosts. But, since there is no way to know whether the remote host is using those conventions or just treating these characters as normal text, sending programs (and programs evaluating address validity) must simply accept the strings and pass them on.

4. URLs

4.1 URL syntax definitions and issues

The syntax for URLs (Uniform Resource Locators) is specified in RFC 1738 [RFC1738]. The syntax for the more general "URI" (Uniform Resource Identifier) is specified in RFC 2396 [RFC2396]. Programs that require syntax checks should use the general syntax rules of RFC 2396, which are the rules summarized below.

<>

4.2 Guessing domain names in web contexts

Several web browsers have adopted a practice that permits an incomplete domain name to be used as input instead of a complete URL. This has, for example, permitted users to type "microsoft" and have the browser interpret the input as "http://www.microsoft.com/". Other browser versions have gone even further, trying to build DNS names up through a series of heuristics, testing each variation in turn to see if it appears in the DNS, and accepting the first one found as the intended domain name.
If this approach is to be used, it is often critical that the browser recognize the complete list of TLDs. If an incomplete list is used, complete domain names may not be recognized as such and the system may try to turn them into completely different names. For example, "example.aero" is a fully-qualified name, since "aero." is a TLD name. But, if the system doesn't recognize "aero." as a TLD name, it is likely to try to look up "example.aero.com" and "www.example.aero.com" (and then fail or find the wrong host), rather than simply looking up the user-supplied name.

As discussed in section 2 above, there are dangers associated with software that attempts to "know" the list of top-level domain names locally and take advantage of that knowledge. These name-guessing heuristics are another example of that situation: if the lists are up-to-date and used carefully, the systems in which they are embedded may provide an easier, and more attractive, experience for at least some users. But finding the wrong host, or being unable to find a host even when its name is precisely known, constitute bad experiences by any measure.

5. Implications of internationalization

6. Summary

<>

7. Security considerations

Since this document merely summarizes the requirements of existing standards, it does not introduce any new security issues. However, many of the techniques that motivate the document raise important security issues of their own. Rejecting valid forms of domain names, email addresses, or URIs often denies service to the user of those entities. Worse, guessing at the user's intent when an incomplete address, or other string, is given can result in compromises to privacy or accuracy of reference if the wrong target is found and returned. From a security standpoint, the optimum behavior is probably to never guess, but, instead, to force the user to specify exactly what is wanted.
When that position involves a tradeoff with an acceptable user experience, good judgment should be used and the fact that it is a tradeoff recognized.

8. References

8.1 Normative References

[ASCII]    American National Standards Institute (formerly United States of America Standards Institute), X3.4, 1968, "USA Code for Information Interchange". ANSI X3.4-1968 has been replaced by newer versions with slight modifications, but the 1968 version remains definitive for the Internet.

[RFC1035]  Mockapetris, P.V., "Domain names - implementation and specification", RFC 1035 and STD 13, November 1987.

[RFC1123]  Braden, R., Ed., "Requirements for Internet Hosts - Application and Support", RFC 1123 and STD 3, October 1989.

[RFC1738]  Berners-Lee, T., L. Masinter, and M. McCahill, "Uniform Resource Locators (URL)", RFC 1738, December 1994.

[RFC2181]  Elz, R. and R. Bush, "Clarifications to the DNS Specification", RFC 2181, July 1997.

[RFC2396]  Berners-Lee, T., R. Fielding, and L. Masinter, "Uniform Resource Identifiers (URI): Generic Syntax", RFC 2396, August 1998.

8.2 Non-normative References

[IS3166]

[RFC1591]  Postel, J., "Domain Name System Structure and Delegation", RFC 1591, March 1994.

9. Acknowledgements

<>

10. Author's Address

   John C Klensin
   1770 Massachusetts Ave, #322
   Cambridge, MA 02140
   USA

   john-ietf@jck.com