INTERNET-DRAFT
Larry Masinter
Xerox Corporation
Martin Duerst
W3C/Keio University
draft-masinter-url-i18n-02
August 30, 1998
Expires in 6 months

Representing non-ASCII Characters in URIs and Extended URIs

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''

To view the entire list of current Internet-Drafts, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

This document is not a product of any working group, but may be
discussed on the mailing list url-i18n@unicode.org.

Abstract

URIs are defined as sequences of characters chosen from a limited
subset of the repertoire of ASCII characters, both for transmission
in network protocols and representation in spoken and written human
communication. This document defines a uniform way of representing
non-ASCII scripts in URIs and in an extended 8-bit form (8URI), so
these identifiers can be used for the world's languages. The document
gives guidelines for the use and deployment of these forms in various
elements of software that deal with URIs.

1. Introduction

URIs [RFC 2396] are defined as sequences of characters chosen from a
limited subset of the repertoire of ASCII characters.
The characters in URIs are frequently used for representing English
words and phrases; unfortunately, this leaves out most of the world,
who do not write merely with the letters A-Z.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.

2. Syntax

This document defines two ways of representing non-ASCII characters
in resource identifiers: a URI syntax which is compatible with the
definition of URI syntax [RFC 2396], and a new syntax which is usable
in contexts where resource identifiers are transported within "8-bit"
environments. This new syntax is called an "8URI"; it is upward
compatible with the URI syntax, but is defined as a sequence of 8-bit
octets.

2.1 URI syntax

The standard definition of URIs [RFC 2396] requires that URIs be
represented with a very limited repertoire of characters which are a
subset of those characters representable in ASCII. URIs are defined
as a sequence of characters (since URIs may be written on paper or
read out loud) which may be represented as a sequence of 7-bit bytes.

Character sequences that include non-ASCII characters must be
transcribed to represent them in URIs. The transcription to be
applied to a character sequence before it is included in an element
of a URI (path, etc.) SHOULD be performed by:

1) representing the characters as a sequence of ISO 10646
   characters.

2) "normalizing" the character sequence to reduce ambiguity. [UNI15]
   defines several normalization forms; for the purpose of
   representing characters in URIs, "Normalization Form CC" is used.

3) encoding the result with the UTF-8 character encoding [RFC 2279].

4) using %HH hex-encoding [RFC 2396] to encode any octet that does
   not correspond to an allowed, non-reserved character.
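
As an illustration only, steps 1) through 4) might be sketched in
Python as follows. The function name transcribe_to_uri_element is
hypothetical, Unicode's NFC (from the standard unicodedata module) is
assumed as a stand-in for the normalization form named in step 2),
and the characters left unencoded are assumed to be the "unreserved"
set of [RFC 2396].

```python
import unicodedata

# "Unreserved" characters of RFC 2396: alphanumerics plus the mark set.
# Iterating a bytes object yields integers, so this is a set of octets.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789"
    b"-_.!~*'()"
)


def transcribe_to_uri_element(text: str) -> str:
    """Hypothetical helper: transcribe one URI element (e.g. a path segment)."""
    # Steps 1 and 2: a Python str is already a sequence of Unicode
    # (ISO 10646) characters. NFC is an ASSUMPTION standing in for the
    # normalization form the draft names.
    normalized = unicodedata.normalize("NFC", text)
    # Step 3: encode the normalized characters as UTF-8 octets.
    octets = normalized.encode("utf-8")
    # Step 4: %HH-encode every octet that is not an unreserved character.
    return "".join(
        chr(octet) if octet in UNRESERVED else "%{:02X}".format(octet)
        for octet in octets
    )


if __name__ == "__main__":
    # Precomposed and decomposed spellings of the same name.
    print(transcribe_to_uri_element("D\u00fcrst"))
    print(transcribe_to_uri_element("Du\u0308rst"))
```

Under these assumptions a name containing a u-umlaut yields the same
hex-encoded element whether the letter is entered precomposed or as
"u" followed by a combining diaeresis, which is the ambiguity that
step 2) is meant to remove.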

This syntax is consistent with the definition of the generic URI
syntax [RFC 2396], the URN syntax [RFC 2141], as well as recent URL
scheme definitions [RFC 2192], [RFC 2384].

2.2 8URI syntax

This specification defines a new protocol element, called an '8URI'.
An 8URI is similar to a URI in its use, but is different in that it
is solely for use in network protocols that allow the transport of
octets outside of the range allowed within URIs. An 8URI MAY have
8-bit octets within it.

An 8URI is represented using the same methods (1-4) defined in
section 2.1, but in step (4), octets with the leading bit on need not
be encoded; all characters outside of those explicitly disallowed in
RFC 2396 (reserved, delimiters, white space, unwise special
characters) MAY be represented directly by their UTF-8 encoding. An
'8URI' for characters outside of the ASCII range will use
considerably less space than the corresponding hex-encoded URI.

Even within 8URIs, any octet sequence which would likely yield
ambiguous or incorrect results when printed or displayed and then
subsequently typed by a user SHOULD be hex-encoded.

Internet protocols that currently allow the designation of a URI may
be extended at some point to allow 8URIs as well as URIs, but this
extension must be done explicitly. Section 4 lays out some of the
software guidelines that will allow the deployment of 8URIs in
existing Internet Protocols.

3. Software Requirements and Upgrade Strategy

Supporting URIs for non-ASCII characters requires cooperation from
the providers of several different components of URI software:
software that allows users to enter URIs, software that generates
URIs, software that displays URIs, and software that interprets URIs.

3.1 URI entry

One component of software that deals with URIs allows users to enter
a URI, e.g., by typing or dictation.
For example, a person viewing a visual representation of a URI (as a
sequence of glyphs, in some order, in some visual display) might use
a keyboard entry method for keys in that language to create the URI.
For ASCII characters with standard English keyboards, the process is
simple, since there is generally a simple correspondence between
letters represented, keys pressed, and internal system
representation, but for other languages the process is much more
complex.

If the visual representation contains only those characters that are
allowed by the standard syntax of URIs [RFC 2396], the transcription
is simple. However, for all other sequences of characters, it is
RECOMMENDED that the entry result in characters in logical order from
the ISO 10646 character repertoire, encoded using the UTF-8 method
[RFC 2279], and then subsequently encoded as necessary using the URI
hex-encoding. The set of octets that require encoding depends on
whether the result is a URI or an 8URI.

The characters the user has entered should be normalized according to
the rules in [RFC-DUERST]; for example, all accented characters
should be translated into their combined form, no extraneous BIDI
(bidirectional) marks should be left in the resulting stream, and
characters that are intended to represent Western European letters
should be transcribed into their ISO-8859-1 equivalents and not, for
example, as double-wide characters.

Whether URI entry should result in a URI or an 8URI will depend on
the capability of the protocol or software to which the result will
be submitted.

3.2 URI generation

Systems that are offering resources through the Internet, where those
resources have logical names, sometimes offer the ability to generate
URIs for the resources they offer. For example, some HTTP servers
offer the ability to generate a 'directory listing' for file
directories under their purview, and then to respond to the generated
URIs with the files.
If the names of the files consist solely of US-ASCII characters the
transcription is simple, but other file systems offer a wider variety
of characters. Many currently deployed systems do not transform the
local character representation of the underlying system before
generating URIs.

For maximum interoperability, systems that generate resource
identifiers SHOULD translate the local encoding to UTF-8 and
hex-encode the result as appropriate for the URI or 8URI. Whether the
generated identifier should result in a URI or an 8URI depends on the
capability of the protocol or software to which the result will be
submitted.

This recommendation applies to HTTP servers as well as those systems
that generate and interpret URLs for FTP, gopher and the like.

3.3 Display of URIs

Many systems contain software that presents URIs to users as part of
their user interface (sometimes presenting 'friendly' URIs). This
section applies to this presentation, as well as to the strategy for
printing URIs in magazines, newspapers, or reading them over the
radio.

Software that displays identifiers to users should follow a general
principle: "Don't display something to a user that the user would not
be able to enter." The consequences of this principle require
judgement about the availability of software that implements the
entry methods described in section 3.1.

a) In situations where a viewer is not likely to have software that
   implements non-ASCII character entry as described in section 3.1,
   any octet not representable by a character allowed in [RFC 2396]
   SHOULD be displayed as if it were hex-encoded.

b) In situations where a viewer _is_ likely to have such software,
   sequences of octets MAY be displayed directly as the non-ASCII
   character sequence they represent in UTF-8. Sequences of %HH
   escapes which correspond to non-ASCII characters MAY be displayed
   directly without decoding, or MAY be interpreted as hex-encoded
   UTF-8 and displayed as the characters they represent.
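
The choice between cases a) and b) might be sketched as follows. This
is a rough Python illustration, not a definitive implementation: the
function name display_form and the flag viewer_can_enter_non_ascii
are hypothetical, the identifier is taken as a sequence of octets so
that both URIs and 8URIs can be passed in, and the ASCII portion is
assumed to be valid URI syntax already.

```python
import re

# Matches %HH escapes whose value is 0x80 or above, i.e. escapes that
# can only be part of a non-ASCII character. Escapes of ASCII octets
# (such as %2F) are deliberately left alone.
_HIGH_ESCAPE = re.compile(rb"%([89A-Fa-f][0-9A-Fa-f])")


def display_form(identifier: bytes, viewer_can_enter_non_ascii: bool) -> str:
    """Hypothetical helper: choose a display string for a URI or 8URI."""
    if viewer_can_enter_non_ascii:
        # Case b): turn high %HH escapes back into octets and show the
        # UTF-8 characters they represent.
        raw = _HIGH_ESCAPE.sub(
            lambda m: bytes([int(m.group(1), 16)]), identifier
        )
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            pass  # Not valid UTF-8: fall through to the hex-encoded form.
    # Case a): show every octet with the leading bit on as %HH.
    return "".join(
        chr(octet) if octet < 0x80 else "%{:02X}".format(octet)
        for octet in identifier
    )


if __name__ == "__main__":
    print(display_form(b"/People/D%C3%BCrst/", True))
    print(display_form("/People/D\u00fcrst/".encode("utf-8"), False))
```

Falling back to the hex-encoded form when the octets are not valid
UTF-8 is a design choice of this sketch; it keeps the displayed
string something a user could still type.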

3.4 Interpretation of URIs

Software that interprets URIs as the names of local resources SHOULD
accept multiple renditions of the URIs in the case where those
resource names might have non-ASCII representations; this includes
accepting both the URI syntax of section 2.1 and the 8URI form in
section 2.2.

Just as allowing case-insensitive file names makes URIs more robust
(because the person viewing the URI might type the case differently
than it is displayed), similarly, URI-interpreting software should be
generous in allowing all of the possible representations that might
result from the recommendations in section 3.1. In addition, it is
useful if unaccented characters are accepted, when possible, as
aliases for accented characters, and if other equivalences are made.

For example, a URI which contains a string in Japanese might actually
arrive with a variety of encodings, due to the variety of
interpretations of deployed systems. While this recommendation
specifies a canonical encoding of Japanese using %HH-encoded UTF-8,
in practice many URIs will be presented which contain characters
encoded using Shift-JIS or EUC-JP, either with %HH encoding or not.
Thus, to transition to the new regime, URI-interpreting software for
Japanese should accept all three of the EUC-JP, Shift-JIS and UTF-8
encodings.

4. Upgrading

As this recommendation places further constraints on software for
which many instances are already deployed, it is important to
introduce upgrades carefully.

4.1 Upgrade sequence

The deployment strategy (for both hex-encoded and 8URIs) is in the
following sequence:

   Interpret --> Generation
       |
       +-> Entry --> Display

Initially, it is most important to upgrade the URI interpreting
software according to the recommendations of section 3.4.

The upgrade of generating software to use UTF-8 (instead of a local
encoding) should happen only after the service is upgraded to accept
such URIs.
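
The first step of the sequence, interpreting software that accepts
several renditions of a name as recommended in section 3.4, might
look roughly like the following Python sketch. The names
CANDIDATE_ENCODINGS, candidate_names and resolve are hypothetical,
the order in which encodings are tried (UTF-8 first) is an assumption
of this sketch, and NFC is assumed as the normalization applied
before comparison.

```python
import unicodedata
from typing import Iterable, List, Optional
from urllib.parse import unquote_to_bytes

# ASSUMPTION: UTF-8 is tried first, then the legacy Japanese encodings
# named in section 3.4. The order is a choice of this sketch.
CANDIDATE_ENCODINGS = ("utf-8", "shift_jis", "euc_jp")


def candidate_names(identifier: bytes) -> List[str]:
    """Return the distinct names an identifier could stand for."""
    # Undo %HH encoding; raw 8-bit octets (the 8URI form) pass through.
    octets = unquote_to_bytes(identifier)
    names: List[str] = []
    for encoding in CANDIDATE_ENCODINGS:
        try:
            name = unicodedata.normalize("NFC", octets.decode(encoding))
        except UnicodeDecodeError:
            continue  # Not a valid octet sequence in this encoding.
        if name not in names:
            names.append(name)
    return names


def resolve(identifier: bytes, local_names: Iterable[str]) -> Optional[str]:
    """Return the first candidate that names a local resource, if any."""
    known = set(local_names)
    for name in candidate_names(identifier):
        if name in known:
            return name
    return None


if __name__ == "__main__":
    local = ["\u65e5\u672c\u8a9e"]  # One resource with a Japanese name.
    for encoding in CANDIDATE_ENCODINGS:
        octets = local[0].encode(encoding)
        hex_encoded = "".join("%{:02X}".format(o) for o in octets)
        print(encoding, resolve(hex_encoded.encode("ascii"), local) == local[0])
```

Matching candidates against the set of names that actually exist,
rather than trusting the first encoding that happens to decode,
limits the damage when an octet sequence is valid in more than one
encoding. Section 5 discusses the aliasing risks that remain.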

Similarly, 8URIs should only be generated when the service accepts
8URIs and the intervening infrastructure and protocol is known to
transport them safely.

Similarly, once interpreting software has been modified to accept
alternative encodings, then the entry software can also transition.

Display software should be upgraded only after upgraded entry
software has been widely deployed to the population that will see the
displayed result.

These recommendations, when taken together, will allow for the
extension of URIs to handle scripts other than ASCII while minimizing
interoperability problems.

4.2 Examples: upgrading URIs within various contexts

4.2.1 URIs within HTTP

The HTTP protocol [RFC HTTP] includes the URI of the resource being
accessed as the 'Request-URI' in the request line. Most deployed HTTP
servers that access resources with localized non-ASCII naming do not
translate the Request-URI's character encoding to a local form, and
will need to be upgraded to accept such aliases. Most deployed HTTP
servers do not restrict the octets allowed in the protocol, and so an
upgrade from URI to 8URI will not be difficult.

4.2.2 URIs within HTML and XML

Within an HTML [HTML4] or XML [XML1] document the primary difficulty
for the use of 8URIs is that the document itself may be represented
and labelled with a charset other than UTF-8. In such situations, the
document as a whole might be transcoded into another encoding.
However, the hex-encoded URIs following the recommendations of this
document should pass from the recipient of the document back into the
URI interpreting infrastructure without change.

4.2.3 URIs within email and text/plain

E-mail messages are frequently transmitted as text/plain; the use of
octets outside of US-ASCII requires an encoding of the message using
quoted-printable or base64. In addition, text messages that arrive
with charset=utf-8 may be transcoded into a local character
representation before storage or display.
Thus, URIs within email messages should likely remain within the
limited repertoire rather than the 8URI representation.

However, it is now common for email software to recognize embedded
URIs within email messages and present them specially, e.g., as
hypertext links. Within such systems, it is reasonable to upgrade the
email display software to present URIs as the natural characters they
represent, as long as the entry software in the same system has been
upgraded.

5. Security Considerations

If URI entry software is upgraded to normalize the characters
entered, but the URI interpreting software has not been upgraded to
treat multiple forms as equivalent, this introduces the possibility
of "spoofing": having different resources whose URIs look the same
but are not the same.

For example, if "abc" and "def" are different encodings of the same
visual characters, "http://a.com/abc" and "http://a.com/def" might
look the same to users, might display the same, and different URI
entry software components might generate different ones; e.g.,
EUC-JP-based Japanese URI entry software might generate one encoding,
while UTF-8-based software would generate another one.

In this case, if "a.com" allows multiple users to establish different
areas, it might be possible for someone other than the owner of
"http://a.com/abc" to put different content at "http://a.com/def" and
"spoof" the results.

Conceptually, this is no different from the problems surrounding the
use of case-insensitive web servers. For example, a popular web page
with a mixed case name (http://big.site/PopularPage) might be
"spoofed" by someone who obtains access to
(http://big.site/popularpage). However, the introduction of the
Unicode canonicalization rules in conjunction with mapping from
multiple possible native encodings might result in aliasing which is
difficult to determine in advance.
Administrators of large sites which allow independent users to create
subareas may need to be careful that the aliasing rules do not create
such conflicts.

6. Acknowledgements

Thanks to Francois Yergeau, Chris Wendt, Yaron Goland, Graham Klyne,
Roy Fielding and many others for help with this document.

7. Copyright

Copyright (C) The Internet Society, 1997. All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

8. Authors' addresses

Larry Masinter
Xerox Corporation
3333 Coyote Hill Road
Palo Alto, CA 94304
masinter@parc.xerox.com
http://www.parc.xerox.com/masinter
Fax: +1 650 812-4333

Martin J. Duerst
W3C/Keio University
5322 Endo, Fujisawa
252-8520 Japan
duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170

9. References

[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate
           Requirement Levels", March 1997.

[RFC 2279] F. Yergeau, "UTF-8, a transformation format of ISO
           10646", January 1998.

[RFC 2396] T. Berners-Lee, R. Fielding, L. Masinter, "Uniform
           Resource Identifiers (URI): Generic Syntax", August 1998.

[UNI15]    M. Davis, "Unicode Normalization Forms", Draft Unicode
           Technical Report #15, August 1998.

[RFC HTTP] R. Fielding, J. Gettys, et al, "Hypertext Transfer
           Protocol -- HTTP/1.1", .

[RFC 2141] R. Moats, "URN Syntax", May 1997.

[RFC 2192] C. Newman, "IMAP URL Scheme", September 1997.

[RFC 2384] R. Gellens, "POP URL Scheme", August 1998.

[RFC FTP]  B. Curtis, "Internationalization of the File Transfer
           Protocol", .

[HTML4]    "HTML 4.0", World Wide Web Consortium, .

[XML1]     "XML 1.0", World Wide Web Consortium Recommendation, .