INTERNET-DRAFT                         Larry Masinter, Xerox Corporation
draft-masinter-url-i18n-01                                 March 9, 1998
Expires in 6 months

         Using UTF8 for non-ASCII Characters in Extended URIs

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

This document is not a product of any working group, but may be
discussed on the mailing list url-i18n@unicode.org.

Abstract

URIs are defined as sequences of characters chosen from a limited
subset of the repertoire of ASCII characters, both for transmission
in network protocols and representation in spoken and written human
communication. This document defines a uniform way of representing
non-ASCII scripts in URIs and in an Extended URI, so these
identifiers can be used for the world's languages.

1. Introduction

URIs [RFC-URI-SYNTAX] are defined as sequences of characters chosen
from a limited subset of the repertoire of ASCII characters. The
characters in URIs are frequently used for representing English words
and phrases; unfortunately, this leaves out most of the world, who do
not write merely with the letters A-Z.

2. Syntax

This memo defines two ways of representing non-ASCII characters
within URIs:

1) Within traditional URIs: To be compatible with [RFC-URI-SYNTAX],
   non-ASCII characters SHOULD be transcribed in URIs by first
   representing the characters with the UTF-8 character encoding
   [RFC 2044], and then using the hex-encoding defined in
   [RFC-URI-SYNTAX] to encode any octet that does not correspond to
   an allowed, non-reserved character.

2) Within a new object, an 8-bit URI (8URI): for a more compact and
   natural representation, an 8URI consists of a sequence of octets
   in the UTF-8 encoding; all characters are represented directly by
   their UTF-8 encoding, except those disallowed in [RFC-URI-SYNTAX]
   (reserved, delimiters, white space, unwise special characters),
   which MUST be hex-encoded. Any octet sequence which would likely
   yield ambiguous or incorrect results when printed or displayed and
   then subsequently typed by a user SHOULD be hex-encoded. (See
   [RFC-DUERST] for details.)

3. Software Requirements

Supporting URIs for non-ASCII characters requires cooperation from
the providers of several different components of URI software:

3.1 Requirements for URI entry

One component of software that deals with URIs allows users to type
in the URIs. A human transcribes a visual representation of a URI (as
a sequence of glyphs, in some order, in some visual display) using
some entry method that will result in a URI. If the visual
representation contains only those characters that are allowed in the
[RFC-URI-SYNTAX] standard syntax of URIs, the transcription is
simple. However, for all other sequences of characters, it is
desirable that the entry results in characters, in logical order,
from the ISO 10646 character repertoire, encoded using the UTF-8
method [RFC 2044], and then subsequently encoded as necessary using
the URI hex-encoding (the set of octets that require encoding depends
on whether the result is a URI or an 8URI).
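
As a non-normative illustration of the two encodings named above, the
following Python sketch hex-encodes the text of a single URI
component. The function names to_uri and to_8uri and the UNRESERVED
set are assumptions made for this example, not definitions from this
memo: UNRESERVED is one reading of the "allowed, non-reserved"
characters (letters, digits, and the marks - _ . ! ~ * ' ( )), and
the input is assumed to be component text in which reserved
characters are data rather than delimiters. The sketch does not
attempt the additional hex-encoding of sequences that would be
ambiguous when displayed.

```python
# Assumed reading of "allowed, non-reserved" characters; not taken
# verbatim from this memo.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789"
    b"-_.!~*'()"
)


def to_uri(text: str) -> str:
    """Traditional URI form: UTF-8 encode the text, then hex-encode
    every octet that is not in UNRESERVED."""
    out = []
    for octet in text.encode("utf-8"):
        if octet in UNRESERVED:
            out.append(chr(octet))
        else:
            out.append("%{:02X}".format(octet))
    return "".join(out)


def to_8uri(text: str) -> bytes:
    """8URI form: non-ASCII characters stay as raw UTF-8 octets; only
    ASCII octets outside UNRESERVED are hex-encoded."""
    out = bytearray()
    for octet in text.encode("utf-8"):
        if octet >= 0x80 or octet in UNRESERVED:
            out.append(octet)
        else:
            out.extend("%{:02X}".format(octet).encode("ascii"))
    return bytes(out)
```

Under these assumptions, the name "r\u00e9sum\u00e9 1.txt" (where
\u00e9 is e-acute) becomes r%C3%A9sum%C3%A9%201.txt as a URI. As an
8URI, only the space is hex-encoded, and each e-acute is carried
directly as the two octets C3 A9.
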

Care must be taken in the identification of the characters and
character sequence: all accented characters should be translated into
their combined form, no extraneous BIDI (bidirectional) marks should
be left in the resulting stream, and characters that are intended to
represent Western European letters should be transcribed into their
ISO-8859-1 equivalents and not, for example, as double-wide
characters. See [RFC-DUERST] for more complete rules.

3.2 Requirements for URI generation and interpretation

Systems that are offering resources through the Internet, where those
resources have logical names, sometimes offer the ability to generate
URIs for the resources they offer. For example, some HTTP servers
offer the ability to generate a 'directory listing' for file
directories under their purview, and then to respond to the generated
URIs with the files. If the names of the files consist solely of
US-ASCII characters, the transcription is simple, but other file
systems offer a wider variety of characters. For maximum
interoperability, the file names in generated directory listings
SHOULD be encoded in UTF-8, and the results hex-encoded as
appropriate for the URI or 8URI. This requirement applies to HTTP
servers, FTP servers, gopher servers, and the like.

3.3 Requirements for display of URIs

Software that displays URIs to users (or any other kind of
transcription, e.g., deciding what to print in a magazine) should
follow a general principle: "Don't display a URI that the viewer
wouldn't be able to type!" Applying this principle requires judgement
about the availability of software that implements the character
input method described in section 3.1.

a) In situations where most viewers would not have the capability of
   typing non-ASCII characters, any octet not allowed in the
   [RFC-URI-SYNTAX] definition of URIs SHOULD be displayed as if it
   were hex-encoded.

b) In situations where the viewer is likely to have software for
   non-ASCII character entry as described in section 3.1, sequences
   of octets MAY be displayed directly as the non-ASCII character
   sequence they represent in UTF-8. In addition, %HH-encoded
   sequences that correspond to non-ASCII characters MAY be displayed
   directly as those characters, or MAY be left as hex-encoded UTF-8,
   showing the encoding in ASCII.

3.4 Requirements for interpretation of URIs

Software that interprets URIs as the names of local resources SHOULD
accept multiple renditions of the URIs in the case where those
resource names might have non-ASCII representations. Allowing
case-insensitive file names makes URIs more robust, because the
person viewing the URI might type the case differently than it is
displayed. Similarly, URI-interpreting software should be generous in
allowing all of the possible representations that might result from
the recommendations in section 3.1. In addition, it is useful if
unaccented characters are accepted, when possible, as aliases for
accented characters, and if other equivalences are made.

Summary

These recommendations, when taken together, will allow for the
extension of URIs to handle scripts other than ASCII while minimizing
interoperability problems.

Acknowledgements

Many thanks to Martin Duerst and others for help with this draft.

References

[RFC 2044]

[RFC-URI-SYNTAX]  draft-fielding-url-syntax

[RFC-DUERST]      draft-duerst-url-???