Internet Engineering Task Force                              M. Nilsson
INTERNET DRAFT                                        17th January 1999
Document: draft-nilsson-latin1-http-uri-00.txt
Expires 17th July 1999


                 8 bit latin1 characters in HTTP URIs


Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   To view the entire list of current Internet-Drafts, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), ftp.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ftp.ietf.org (US East Coast), or
   ftp.isi.edu (US West Coast).

   This memo provides information for the Internet community. This memo
   does not specify an Internet standard of any kind. Distribution of
   this memo is unlimited.

Abstract

   The recent growth in Internet users outside the US has increased the
   demand for 8 bit characters in URIs. The lack of a recommended
   character map has led to several incompatible implementations. This
   document suggests the use of ISO-8859-1 to represent all the
   characters present in that character set.

1. The problem

   The definition of a URI to be used in HTTP/1.0 and HTTP/1.1, as
   described in RFC 1945 and RFC 2068, includes "International
   characters" among the allowed characters. It is further stated at
   the end of section 3.2.1 that:

      "The BNF above includes national characters not allowed in valid
      URLs as specified by RFC 1738, since HTTP servers are not
      restricted in the set of unreserved characters allowed to
      represent the rel_path part of addresses, and HTTP proxies may
      receive requests for URIs not defined by RFC 1738."

   But nothing is said about the representation of these 8 bit
   characters. Since different applications use different character
   maps to represent 8 bit URIs, the following problem, and several
   similar ones, can occur:

   1. An HTML page is authored and published on a UNIX system. The page
      contains a link to another page which has an 8 bit name, but
      neither page has any embedded information regarding its character
      encoding.

   2. A Macintosh user is looking at the first page with a web browser.
      All characters on the page are displayed as intended, including
      the 8 bit ones in the link when using 'display source', as the
      browser assumes ISO-8859-1, which happens to be the character set
      used on the authoring system.

   3. When the user tries to follow the link to the document with an
      8 bit URI, the server returns only a "Not found" message, because
      the browser encodes its URIs with the Macintosh character set.

2. Solutions

   First it must be recognised that URIs are on the whole very
   difficult to internationalise and that a complete
   internationalisation is not possible. At the same time, we want as
   much internationalisation as possible within the constraints given
   by the URI definition in the HTTP/1.0 and HTTP/1.1 standards.
   Considering this, suggesting that ISO-8859-1 should be the default
   mapping is not a solution.

   A solution that might look like a good one at first glance is to use
   the same encoding as that of the document in which the link was
   found. E.g. if the document was encoded in ISO-8859-1 all links
   should be treated as ISO-8859-1, and if the document was encoded in
   ISO-8859-2 all links should be treated as ISO-8859-2. This is a
   solution for characters present in only one character set, but it
   does not solve the problem described in section 1.

   The solution suggested by this document is to encode all characters
   present in ISO-8859-1 with ISO-8859-1 and let all other characters
   remain in their present encoding, if possible.
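
   The mismatch described in section 1 and the encoding rule suggested
   above can be sketched in Python as follows. This is only an
   illustration under stated assumptions: the function name
   encode_path, its page_charset parameter, and the example link are
   hypothetical and not defined by this document, and "present
   encoding" is assumed to mean the character encoding of the page in
   which the link was found.

```python
# Illustrative sketch only. encode_path and page_charset are hypothetical
# names introduced for this example; the document does not define them.


def encode_path(path: str, page_charset: str = "iso-8859-1") -> bytes:
    """Return the octets to send for a URI path.

    Characters present in ISO-8859-1 are sent as their ISO-8859-1 octet.
    Any other character keeps the encoding of the page it came from
    (assumption: that is what "present encoding" refers to).
    """
    octets = bytearray()
    for ch in path:
        try:
            octets += ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            # Not in ISO-8859-1: fall back to the page's own charset.
            # This raises UnicodeEncodeError if that is not possible either.
            octets += ch.encode(page_charset)
    return bytes(octets)


# Made-up link "/sidor/räksmörgås.html", written with escapes so the
# example does not depend on the encoding of this source text.
link = "/sidor/r\u00e4ksm\u00f6rg\u00e5s.html"

# The mismatch from section 1: the same link in two platform character sets.
as_latin1 = link.encode("iso-8859-1")  # octets the UNIX server expects
as_mac = link.encode("mac_roman")      # octets a Macintosh browser might send
print(as_latin1 == as_mac)             # False, hence the "Not found" reply

# With the suggested rule the octets no longer depend on the platform.
print(encode_path(link) == as_latin1)  # True
```

   In this sketch a character outside ISO-8859-1 is left to the page's
   own encoding: for example, encode_path("/\u0142", "iso-8859-2")
   yields the single ISO-8859-2 octet for that character after the
   slash.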
   ISO-8859-1 is chosen since it is the default encoding of HTML
   documents, which means that URIs from pages without any character
   encoding information can be used without modification. While this
   isn't a complete solution, it is very straightforward and solves
   most of the problems, since the ISO-8859-1 characters are the ones
   present in most character sets (though at different positions).

3. Suggested solution

   All characters present in ISO-8859-1 should be represented with
   their ISO-8859-1 encoding.

4. Security considerations

   Since this document does not suggest any technical changes to the
   URI definition (such as adding or removing valid URI characters),
   the author does not see any security issues.

5. References

   [HTTP1.0]    T. Berners-Lee, R. Fielding, H. Frystyk, "Hypertext
                Transfer Protocol -- HTTP/1.0", RFC 1945, May 1996

   [HTTP1.1]    R. Fielding, J. Gettys, J. Mogul, H. Frystyk,
                T. Berners-Lee, "Hypertext Transfer Protocol --
                HTTP/1.1", RFC 2068, January 1997

   [ISO-8859-1] ISO/IEC DIS 8859-1. 8-bit single-byte coded graphic
                character sets, Part 1: Latin alphabet No. 1. Technical
                committee / subcommittee: JTC 1 / SC 2

6. Full Copyright Statement

   Copyright (C) The Internet Society (1998). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

7. Author's Address

   Martin Nilsson
   Rydsvägen 246 C. 30
   S-584 34 Linköping
   Sweden

   Email: nilsson@id3.org