iri A. Barth Internet-Draft Google, Inc. Intended status: Informational November 9, 2010 Expires: May 13, 2011 How Browsers Process URLs draft-abarth-url-00 Abstract This document contains a precise specification of how browsers process URLs. The behavior specified in this document might or might not match any particular browser, but browsers might be well-served by adopting the behavior defined herein. Barth Expires May 13, 2011 [Page 1] Internet-Draft How Browsers Process URLs November 2010 Editorial Note (To be removed by RFC Editor) If you have suggestions for improving this document, please send email to . Further Working Group information is available from . Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on May 13, 2011. Copyright Notice Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the BSD License. Barth Expires May 13, 2011 [Page 2] Internet-Draft How Browsers Process URLs November 2010 Table of Contents 1. Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 5 3. Parsing a URL . . . . . . . . . . . . . . . . . . . . . . . . 6 3.1. Finding the scheme . . . . . . . . . . . . . . . . . . . . 6 3.2. Finding the authority, path, query, and fragment . . . . . 7 3.3. Finding the user information, host, and port . . . . . . . 8 3.4. Find the user name and password . . . . . . . . . . . . . 8 4. Resolving a string relative to a base URL . . . . . . . . . . 9 4.1. Resolving a string as a relative URL . . . . . . . . . . . 9 4.2. Resolving a string as a scheme-relative URL . . . . . . . 10 4.3. Resolving a string as an authority-relative URL . . . . . 10 4.4. Resolving a string as a query-relative URL . . . . . . . . 11 4.5. Resolving a string as a fragment-relative URL . . . . . . 11 4.6. Resolving a string as a path-relative URL . . . . . . . . 11 Appendix A. Acknowledgements . . . . . . . . . . . . . . . . . . 12 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 13 Barth Expires May 13, 2011 [Page 3] Internet-Draft How Browsers Process URLs November 2010 1. Open Issues Browsers parse URLs differently depending on which operating system they're running on. The problem is that they want to do sensible things for file paths, but file paths look different on Windows and Unix systems. How should we handle cases where browsers disaggree with the regular expression in RFC 3986? Currently, this document aims to describe how browsers behave, but we'll likely need to compare that to RFC 3986 at some point. Some specific differences that have been brought up on the mailing list: o http:///example.com/ o http://example.com; Barth Expires May 13, 2011 [Page 4] Internet-Draft How Browsers Process URLs November 2010 2. Definitions A /control character/ is a character whose value is less than or equal to U+0020 (" "). A /slash character/ is either U+???? ("/") or U+???? ("\"). An /authority terminating characters/ is either a slash charcter, U+???? ("?"), U+???? ("#"), or U+???? (";"). During a parsing algorithm, the /remaining string/ are the characters of the input that have not yet been consumed. Barth Expires May 13, 2011 [Page 5] Internet-Draft How Browsers Process URLs November 2010 3. Parsing a URL Given a string of characters, find the scheme, as described in Section ??. If the URL is invalid: -> Abort these steps. If the scheme is a single upper or lower case ASCII character: -> TODO: Windows drive specs! If the scheme is a ASCII case-insensitive match for "file": -> TODO: File URLs! If the scheme is a ASCII case-insensitive match for "mailto": -> TODO: I think mailto URLs are special, but more testing is required. If the scheme is hierarchical: -> In the after-scheme, if any, find the authority, path, query, and fragment, as decribed in Section ??. -> In the authority, if any, find the user information, host, and port, as described in Section ??. -> In the user-info, if any, find the user name and password, as described in Section ??. -> Abort these steps. The remaining string is the /path/. 3.1. Finding the scheme Consume all leading and trailing control characters. If the remaining string does not contain a ":" character: -> The URL is invalid. -> Abort these steps. Consume characters up to, but not including, the first ":" character. Barth Expires May 13, 2011 [Page 6] Internet-Draft How Browsers Process URLs November 2010 These characters are the /scheme/. Consume the ":" character. The reamining characters are the /after-scheme/. 3.2. Finding the authority, path, query, and fragment Consume any number of slash characters. If the remaining string does not contain any authority terminating characters: -> The remaining string is the /authority/. -> Abort these steps. Consume characters up to, but not including, the first authority terminating character. The consumed characters are /authority/. If the remaining string does not contain a "?" character or a "# character: -> The remaining string is the /path/. -> Abort these steps. Consume characters up to, but not including, the first "?" or "#" charcter. The consumed characters are the /path/. If the first character of the remaining string is a "?" character: -> Consume the "?" character. -> If the remaining string does not contain a "#" character: -> The remaining string is the /query/. -> Abort these steps. -> Consume characters up to, but not including, the first "#" charcter. The consumed characters are the /query/. Consume the "#" character. The remaining string is the /fragment/. Barth Expires May 13, 2011 [Page 7] Internet-Draft How Browsers Process URLs November 2010 3.3. Finding the user information, host, and port If the remaining string contains an "@" character: -> Consume characters up to, but not including the *last* "@" character. The consumed characters are the /user-info/. -> Consume the "@" character. If the remaining string does not contain an ":" character: -> The remaining string is the /host/. -> Abort these steps. If the first character of the remaining string is a "[" character, the remaining string contains a "]" character, and the last ":" character in the remaining string occurs before the last "]" character in the remaining string: -> The remaining string is the /host/. -> Abort these steps. Consume characters up to, but not including, the last ":" character. The consumed characters are the /host/. Consume the ":" character. The remaining string is the /port/. 3.4. Find the user name and password If the remaining string does not contain a ":" character: -> The remaining string is the /user/. -> Abort these steps. Consume characters up to, but not including, the first ":" character. The consumed characters are the /user/. Consume the ":" character. The remaining string is the /password/. Barth Expires May 13, 2011 [Page 8] Internet-Draft How Browsers Process URLs November 2010 4. Resolving a string relative to a base URL Given a string /spec/ and a ParsedURL /base-url/, find the scheme of spec. TODO: We probably need to trim leading and trailing control characters. If spec is an invalid URL: -> The resolved URL is spec resolved as relative URL. -> Abort these steps. If spec's scheme contains any characters which are not "valid scheme characters" (TODO: Define valid scheme characters): -> The resolved URL is spec resolved as relative URL. -> Abort these steps. If base-url's scheme is an ASCII case insensitive match for spec's scheme and the shared scheme is hierarchical: -> The resolved URL is spec's after-scheme resolved as a relative URL. -> Abort these steps. The resolved URL is spec parsed as an absolute URL. 4.1. Resolving a string as a relative URL Given a string /spec/ and a ParsedURL /base-url/... TODO: If base-url's scheme is not hierarchical, we can't resolve as a relative URL. We'll probably want to return an invalid URL. Check what happens when resolving an empty string as a relative URL with a non-hierarchical base. If spec is empty: -> The resolved URL is identical to base-url, with the fragment, if any, removed. -> Abort these steps. If the first character of spec is a slash character: Barth Expires May 13, 2011 [Page 9] Internet-Draft How Browsers Process URLs November 2010 -> If spec has at least two characters and the second character is also a slash character: -> The resolved URL is spec resolved as a scheme-relative URL. Otherwise: -> The resolved URL is spec resolved as an authority-relative URL. -> Abort these steps. If the first character of spec is a "?" character: -> The resolved URL is spec resolved as a query-relative URL. -> Abort these steps. If the first character of spec is a "#" character: -> The resolved URL is spec resolved as a fragment-relative URL. -> Abort these steps. The resolved URL is spec resolved as a path-relative URL. 4.2. Resolving a string as a scheme-relative URL Given a string /spec/ and a ParsedURL /base-url/, let resolved-url be o base-url's scheme o concatenated with ":", o concatenated with spec. The resolved URL is resolved-url parsed as an absolute URL. 4.3. Resolving a string as an authority-relative URL Given a string /spec/ and a ParsedURL /base-url/, let resolved-url be o base-url's scheme o concatenated with "://", o concatenated with base-url's authority, Barth Expires May 13, 2011 [Page 10] Internet-Draft How Browsers Process URLs November 2010 o concatenated with spec. The resolved URL is resolved-url parsed as an absolute URL. 4.4. Resolving a string as a query-relative URL Given a string /spec/ and a ParsedURL /base-url/, let resolved-url be o base-url's scheme o concatenated with "://", o concatenated with base-url's authority, o concatenated with base-url's path, o concatenated with spec. The resolved URL is resolved-url parsed as an absolute URL. 4.5. Resolving a string as a fragment-relative URL Given a string /spec/ and a ParsedURL /base-url/, let resolved-url be o base-url's scheme o concatenated with "://", o concatenated with base-url's authority, o concatenated with base-url's path, o concatenated with "?", o concatenated with base-url's query, o concatenated with spec. The resolved URL is resolved-url parsed as an absolute URL. 4.6. Resolving a string as a path-relative URL TODO: Handle path-relative URLs. This requires a bunch of path and dot semantics. Barth Expires May 13, 2011 [Page 11] Internet-Draft How Browsers Process URLs November 2010 Appendix A. Acknowledgements TODO Barth Expires May 13, 2011 [Page 12] Internet-Draft How Browsers Process URLs November 2010 Author's Address Adam Barth Google, Inc. Email: ietf@adambarth.com URI: http://www.adambarth.com/ Barth Expires May 13, 2011 [Page 13]