Network Working Group C. Weber Internet-Draft Casaba Security Intended status: Standards Track July 11, 2011 Expires: January 12, 2012 Guidelines for Implementers of Internationalized Resource Identifiers (IRIs) draft-weber-iri-guidelines-01 Abstract Some members of the implementation community have expressed confusion about the rules and algorithms for processing Internationalized Resource Identifiers (IRIs). This document aims to clarify these matters and improve interoperability around IRI processing by summarizing the steps required to prepare and parse arbitrary Unicode strings as Internationalized Resource Identifiers. Further goals of this document are to define limited scheme-specific rules around IRI processing and to define the steps required for producing the canonical form of an IRI. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on January 12, 2012. Copyright Notice Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents Weber Expires January 12, 2012 [Page 1] Internet-Draft IRI Guidelines July 2011 carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 3. Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 4. Pre-processing Arbitrary Unicode Strings . . . . . . . . . . . 4 5. Parsing Unicode Strings into IRI Components . . . . . . . . . 5 5.1. Identify the scheme . . . . . . . . . . . . . . . . . . . 5 5.2. Identify the authority . . . . . . . . . . . . . . . . . . 6 5.2.1. Identify the userinfo . . . . . . . . . . . . . . . . 6 5.2.2. Identify the host . . . . . . . . . . . . . . . . . . 6 5.2.3. Identify the port . . . . . . . . . . . . . . . . . . 7 5.3. Identify the path . . . . . . . . . . . . . . . . . . . . 7 5.4. Identify the query . . . . . . . . . . . . . . . . . . . . 8 5.5. Identify the fragment . . . . . . . . . . . . . . . . . . 8 6. Scheme-Specific Processing . . . . . . . . . . . . . . . . . . 8 6.1. http . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 6.2. javascript . . . . . . . . . . . . . . . . . . . . . . . . 9 6.3. mailto . . . . . . . . . . . . . . . . . . . . . . . . . . 9 6.4. data . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 7. IRI Canonicalization . . . . . . . . . . . . . . . . . . . . . 9 7.1. Producing a valid URI from an IRI . . . . . . . . . . . . 9 8. Security Considerations . . . . . . . . . . . . . . . . . . . 10 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 10 11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 10 11.1. Informative References . . . . . . . . . . . . . . . . . . 10 11.2. Normative References . . . . . . . . . . . . . . . . . . . 11 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 11 Weber Expires January 12, 2012 [Page 2] Internet-Draft IRI Guidelines July 2011 1. Introduction Internationalized Resource Identifiers (IRIs) extend the Uniform Resource Identifier (URI) specification [RFC3986] by opening up the authority, path, query, and fragment components to the character space available in Unicode/ISO 10646. Arbitrary Unicode strings may be prepared and parsed into IRI sub-components which map directly to the same URI sub-components. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. 2. Terminology This section defines terminology used in this document. Unicode characters are referred to using the notation described in section 3 of [RFC5137]. A Unicode code point is represented as U+NNNN where the NNNN string consists of the code point's hexadecimal numbers. reference-string The original and unprocessed input string being considered as an IRI reference. pre-processed-reference-string The reference-string that has been through the pre-processing steps. relative reference The term "relative reference" used in this document can be interpreted according to the rules in section 4.2 of [RFC3986] using the additionally allowed characters that [RFC3987] permits in each component. percent-encode Convert each octet of a sequence to %HH, where HH is the hexadecimal notation of the octet value. 3. Sources This document makes reference to the following sources of information about the parsing of IRIs: Weber Expires January 12, 2012 [Page 3] Internet-Draft IRI Guidelines July 2011 1: Internationalized Resource Identifiers (IRIs) as specified in http://tools.ietf.org/html/draft-ietf-iri-3987bis-05 [RFC3987] 2: Parsing URLs for Fun and Profit, http://tools.ietf.org/html/draft-abarth-url-01 [2] 3: HTML Living Standard: URLs, http://www.whatwg.org/specs/web-apps/ current-work/multipage/urls.html#urls [3] 4: Change proposal for ISSUE-56, http://lists.w3.org/Archives/Public/public-html/2010Jul/0036.html [4] 5: URIs, URLs, and URNs: Clarifications and Recommendations 1.0, http://www.w3.org/TR/uri-clarification/ [5] 6: HTML CHANGE PROPOSAL; change definition of URL to normative reference to IRIBIS, http://lists.w3.org/Archives/Public/public-html/2010Feb/0882.html [6] 7: BIDI URL Display, https://docs.google.com/document/d/ 1c8-svx7og0qBUfGBobw7LYfOcNeDVPYbNVMNpSqYCFo/edit?hl=en [7] 4. Pre-processing Arbitrary Unicode Strings This section describes the pre-processing steps required to prepare an arbitrary Unicode reference-string for later parsing into IRI sub- components. 1. Remove leading and trailing instances of ASCII whitespace characters U+0020 SPACE, U+000D CARRIAGE RETURN (CR), U+000A LINE FEED (LF), and U+0009 CHARACTER TABULATION from the string. Note that Unicode has many more characters that are considered whitespace, none of which are affected or considered in this rule. Reference: Section 7.2 of [RFC3987] for details. 2. If more than one reference is allowed, split the string into substrings on blocks of contiguous U+0020 SPACE characters. Each of one of these substrings is an independent reference-string and will be processed individually. If more than one reference is not allowed, either remove blocks of contiguous whitespace or replace each U+0020 SPACE with a single percent-encoded U+0020 SPACE, written as "%20", depending on what is required for the current context. Reference: Mentioned in [4]. 3. If the current string is not already in a Unicode encoding, then transcode the string to the a Unicode encoding such as UTF-8, UTF-16, or UTF-32. Reference: Section 3.1 of [RFC3987]. 4. TODO: Should numerical character references be replaced with their corresponding character? e.g. would become during this step. Or should this step of unescaping be limited to the set? This is the pre-processed-reference-string ready for parsing. 5. Parsing Unicode Strings into IRI Components With an arbitrary IRI string that has been through pre-processing, referred to as the "pre-processed-reference-string", this section describes the subsequent process of parsing the string into its five major IRI sub-components using rules defined by [RFC3896] (using an algorithm equivalent to Appendix B of [RFC3986]) but with updated ABNF of [RFC3987]. These rules are summarized here. Reference: Section 3.2 of [RFC3987]. 5.1. Identify the scheme If the current string does not contain a ":" U+003A then the string does not contain a scheme and the pre-processed-reference-string may be handled as a relative reference according to the rules in section 4.2 of [RFC3986] using the additionally allowed characters that [RFC3987] permits. o Continue to "Identify the path" o Abort further scheme processing If the first character of the string is not an then this is not a valid scheme and the pre-processed-reference-string may be handled as a relative reference. o Continue to "Identify the path" o Abort further scheme processing Consume all characters up to but not including the first occurrence of ":" U+003A. If the consumed substring contains any characters other than < ALPHA / DIGIT / "+" / "-" / "." > then it is not a valid scheme and the pre-processed-reference-string may be handled as a relative reference. o Continue to "Identify the path" o Abort further scheme processing The consumed substring at this point is the scheme. Skip over the ":" U+003A character. Continue parsing the remaining string as an authority part. Weber Expires January 12, 2012 [Page 5] Internet-Draft IRI Guidelines July 2011 5.2. Identify the authority The URI authority component may contain userinfo, a host, and a port. If the current string does not begin with the two characters "//" U+002F U+002F then the string is not an authority and may be handled as a path sub-component. o Continue to "Identify the path" o Abort further authority processing Consume up to the first occurrence of any one of the authority's terminating characters "/" U+002F, "?" U+003F, "#" U+0023, or the end of the string. This is the authority, also known as the iauthority under IRI RFC3987. Continue further parsing of the authority to identify the userinfo, host, and port parts. 5.2.1. Identify the userinfo The userinfo may come in the form of a username and password rendered as "user:password". The userinfo part may be parsed according to the rules of RFC3986 Section 3.2.1 with the updated ABNF for iuserinfo in RFC3987. If the authority does not contain a "@" U+0040 then the string does not contain a userinfo part and the authority may be parsed for a host and port part. o Continue to "Identify the host" o Abort further userinfo processing From the beginning of the authority, consume each character up to but not including the first occurrence of "@" U+0040. This is the userinfo. Further parsing of the userinfo is scheme-specific. Skip over the first occurrence of "@" U+0040 following the userinfo, and continue parsing the remaining authority to identify the host. 5.2.2. Identify the host The host part of authority may contain an IP-literal, IPv4address, or a reg-name according to the ABNF rules of RFC3986 updated to support Unicode characters in the ireg-name as described in RFC3987. Consume all characters up to but not including the last ":" U+003A character or the end of authority. If this substring is determined to be an IP-literal or IPv4address, Weber Expires January 12, 2012 [Page 6] Internet-Draft IRI Guidelines July 2011 then the consumed characters are the host. o If the authority did contain a ":" then continue parsing the remaining authority including the ":" character according to the "Identify the port" section o Abort further host processing Else the substring is determined to be an ireg-name according to the ABNF naming convention from RFC3987. This is the host. o If the authority did contain a ":" then the remaining authority including the ":" character will be processed according to the "Identify the port". o Continue processing the host The host SHOULD be processed according to the rules of IDNA2008, but MAY be processed according to UTS46 or IDNA2003. If the host is in DNS Internet dot-notaion then it's labels SHOULD be converted to punycode. This is the host. TODO: Error handling. Mention leaving the host name in pure Unicode form for intranet/local name scenarios that don't use DNS, e.g. WINS? 5.2.3. Identify the port Further processing SHOULD skip the first occurrence of ":" U+003A and consume the remaining characters. If these characters are not *DIGIT then the port is invalid. o Continue to "Identify the path" o Abort further scheme processing Else this is the port. 5.3. Identify the path Consume the remaining pre-processed-reference-string up to but not including the first occurrence of a terminating character "?" U+003F, "#" U+0023, or the end of the string. Percent-encode all characters present from the ucschar list. If the path contains any characters not allowed by the ABNF of RFC3987 Section 2.2 or Section 7.2 then replace those characters with their percent-encoding. This is the path. Weber Expires January 12, 2012 [Page 7] Internet-Draft IRI Guidelines July 2011 If the terminating character was "?" then process the remaining string including the leading "?" according to "Identify the query". If the terminating character was "#" then process the remaining string including the leading "#" according to "Identify the fragment". TODO: Handling of special characters "/" and "\". 5.4. Identify the query Consume the remaining string starting with the leading "?" and up to but not including the first occurrence of "#" or the end of the string. Percent-encode all characters present from the ucschar list. If the path contains any characters not allowed by the ABNF of [RFC3987] Section 2.2 or the lists in Section 7.2 then replace those characters with their percent-encoding. This is the query component. If the terminating character was "#" then process the remaining string including the leading "#" according to "Identify the fragment". TODO: Handling of special characters "&", "?", "=", and "/" 5.5. Identify the fragment Consume the remaining string starting with the leading "#" and to the end of the string. Percent-encode all characters present from the ucschar list. This is the fragment. TODO: Handling of special characters "?" and "/" 6. Scheme-Specific Processing TODO Apply limited scheme-specific rules. Reference [RFC4395] Weber Expires January 12, 2012 [Page 8] Internet-Draft IRI Guidelines July 2011 6.1. http (NOTE: Taken directly from [RFC3987]) For compatibility with existing deployed HTTP infrastructure, the following special case applies for schemes "http" and "https" and IRIs whose origin has a document charset other than one which is UCS-based (e.g., UTF-8 or UTF-16). In such a case, the "query" component of an IRI is mapped into a URI by using the document charset rather than UTF-8 as the binary representation before pct-encoding. This mapping is not applied for any other scheme or component. Reference: Section 3.5 and 7.2 of [RFC3987]. 6.2. javascript TODO reference? http://tools.ietf.org/html/draft-hoehrmann-javascript-scheme-03 6.3. mailto TODO reference [RFC6068] 6.4. data TODO reference [RFC2397] 7. IRI Canonicalization To follow. [NOTE: Call out Special Case here or earlier? The U+005C should be either percent-encoded or converted to U+002F. Of course, folks who want to name such files will want to use the escaped form of \ in order for their site to work in other browsers. 7.1. Producing a valid URI from an IRI For each character which is not allowed anywhere in a valid URI, apply the following steps. Reference: Section 3.5 of [RFC3987]. o Convert the IRI to the UTF-8 encoding, i.e., convert the character to a sequence of one or more octets using UTF-8 [RFC3629]. TODO: What about they query and fragment components? Should the o For each IRI component, percent-encode each UTF-8 octet representing each character that is not allowed in the same URI component. In general this will include the set of characters specified by ucschar of [RFC3987]. Weber Expires January 12, 2012 [Page 9] Internet-Draft IRI Guidelines July 2011 8. Security Considerations To follow. 9. IANA Considerations This document has no actions for the IANA. 10. Acknowledgements Many thanks to Mykyta Yevstifeyev, Addison Phillips, Mark Davis, Anne van Kesteren, Adam Barth, Martin Duerst, and Julian Reschke for their feedback. 11. References 11.1. Informative References [2] Barth, A., "How Browsers Process URLs", draft-abarth-url-01 (work in progress), April 2011, . [3] Hickson, I., "HTML Living Standard: URLs", WHATWG ?, 2011, . [4] Fielding, R., "Change proposal for ISSUE-56", July 2010, < http://lists.w3.org/Archives/Public/public-html/2010Jul/ 0036.html>. [5] W3C, "URIs, URLs, and URNs: Clarifications and Recommendations 1.0", W3C Note 21 September 2001, September 2001, . [6] Masinter, L., "HTML CHANGE PROPOSAL; change definition of URL to normative reference to IRIBIS", February 2010, . [7] Davis, M., "Revision of UBA for improved display of URL/ IRIs", May 2011, . [8] Davis, M. and M. Duerst, "URLs", April 2003, . 11.2. Normative References [RFC3987] Duerst, M., Suignard, M., and L. Masinter, "Internationalized Resource Identifiers (IRIs)", draft-ietf-iri-3987bis-05 (work in progress), March 2011, . [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifier (URI)", Internet-Standard rfc3986, January 2005, . [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 10646", Internet-Standard rfc3629, November 2003, . [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. Author's Address Chris Weber Casaba Security 16625 Redmond Wa, Suite M348 Redmond, WA 98052 USA Email: chris@lookout.net Weber Expires January 12, 2012 [Page 11]