Network Working Group                                       B. Hoehrmann
Internet-Draft                                        September 20, 2006
Intended status: Informational
Expires: March 24, 2007


               The application/www-form-urlencoded format
                     draft-hoehrmann-urlencoded-00

Status of This Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on March 24, 2007.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This memo defines a compact data format that encodes data sets of
   name-value pairs. It is designed to be suitable for use as query
   strings in Internationalized Resource Identifiers and as a
   standalone format.

Hoehrmann                Expires March 24, 2007                 [Page 1]
Internet-Draft  application/www-form-urlencoded format    September 2006

Table of Contents

   1.  Introduction
   2.  Terminology and Conformance
   3.  Encoding algorithm
   4.  Decoding algorithm
   5.  Examples
   6.  Fragment identifiers
   7.  Application-specific processing
   8.  Compatibility considerations
   9.  Security considerations
   10. IANA Considerations
   11. Media type registration
   12. References
       12.1.  Normative References
       12.2.  Informative References

1.  Introduction

   RFC 1866 [RFC1866] introduced the application/x-www-form-urlencoded
   media type to facilitate the encoding and transmission of form data
   sets. Formats based on RFC 1866 continued to use this media type as
   the default encoding format, and other standards adopted the type
   for similar purposes, in disregard of the fact that use of
   experimental media types is discouraged [RFC4288].

   The format itself has a number of shortcomings. As noted in RFC
   1866, the use of the ampersand character as separator makes
   inclusion of resource identifiers created based on this encoding in
   HTML/XML documents unnecessarily difficult as the character has to
   be escaped; the set of characters that have to be escaped is overly
   broad, making resulting resource identifiers difficult to read and
   use; and issues of internationalization are not addressed.

   Attempts have been made to address some of these issues in specific
   domains, such as allowing use of alternate separator characters,
   allowing use of characters outside the ASCII repertoire, and use of
   a charset parameter along with the media type; while these changes
   had a positive effect in their respective domains, they have also
   led to proliferation of the format.

   This specification introduces application/www-form-urlencoded, a new
   format intended to address these problems and to obsolete the legacy
   application/x-www-form-urlencoded format. It provides the following
   benefits:

   o  The format relies on the UTF-8 character encoding [RFC3629];
      applications do not have to specify or guess the encoding, and
      non-ASCII characters do not have to be escaped, which allows for
      significant space saving when textual data is encoded.

   o  The ';' character is used as separator; resource identifiers
      constructed based on this format can be copied and pasted into
      HTML and XML documents easily since no additional escaping has to
      be applied.

   o  Common characters are not escaped; encoded data sets look like
      "url=http://example.org/", not "url=http%3A%2F%2Fexample.org%2F"
      as would be the case with application/x-www-form-urlencoded.

   o  Encoded data sets can be labeled with a registered media type in
      the standards tree; applications do not have to rely on
      unregistered and experimental types whose use is discouraged.

   o  The format is fully defined in this document and compatible with
      application/x-www-form-urlencoded, which eases transition to the
      new format in environments where compatibility is needed.

   Migrating existing applications to application/www-form-urlencoded
   is expected to be straightforward in many cases; refer to section 8
   for details.

2.  Terminology and Conformance

   A character string is a sequence of Unicode code points suitable for
   use with the UTF-8 character encoding. An octet string is a sequence
   of octets.

   A character string conforms to this specification if and only if
   encoding it using the UTF-8 character encoding yields an octet
   string that conforms to this specification.

   An octet string conforms to this specification if and only if it is,
   after replacing all sequences that match pct-encoded [RFC3986] by
   the corresponding octets, a valid UTF-8 sequence.

   A software module that encodes data sets into character strings
   conforms to this specification if and only if it does so as
   described in section 3. Character strings that have been produced by
   means of such a software module are said to be in the canonical
   form, canonical encoding, or simply canonical.

   A software module that decodes character strings or octets into data
   sets conforms to this specification if and only if it does so as
   described in section 4.

   A specification conforms to this specification if and only if it
   describes handling of application/www-form-urlencoded and does not
   contradict this specification. Software that implements such a
   specification will conform to this specification as applicable.

      Note: Specifications defining application/x-www-form-urlencoded
      processing sometimes include provisions to control, for example,
      separator characters, character encoding, or which characters
      have to be escaped. Such provisions cannot apply to the format
      defined in this document.

   Specifications are encouraged to specify Unicode normalization of
   data sets before passing them to the encoding algorithm defined in
   this specification, to specify CRLF newline normalization, and to
   take advantage of the notion of undefined values.

   The algorithms defined in this document are considered equivalent to
   any and all algorithms that map the same input to the same results.

3.  Encoding algorithm

   This section describes the transformation of an ordered list of
   (<name>, <value>) pairs into a compact character string. A list of
   pairs is empty, has at least one defined <value>, or one <name> of
   non-zero length. A <name> is a character string; a <value> is either
   a character string or undefined.

      Note: This definition rules out the possibility of a list with
      only one pair where the <name> is an empty string and the <value>
      undefined; such a list would be transformed into the empty string
      and therefore be indistinguishable from an empty list. All other
      lists can be encoded.

   For the purposes of the following algorithm, a character is escaped
   by representing the character as a sequence of octets using the
   UTF-8 character encoding, and percent-encoding [RFC3986] the octets
   using uppercase hexadecimal digits. The algorithm is as follows:

   1.  Replace all space characters (U+0020) by a plus sign (U+002B)
       and escape the character '%' in each <name> and <value>.

   2.  Escape the characters ';', '&', '=' and any character that
       cannot occur in an iquery [RFC3987] in each <name> and <value>.

   3.  Replace each pair by <name> if the <value> is undefined and by
       <name> followed by '=' followed by <value> otherwise.

   4.  Concatenate the character strings in order, using ';' as
       separator.

   The following is the set of characters that have to be escaped in
   the second step of the algorithm above. Refer to the algorithm for
   the normative definition of this set, and to [RFC3987] section 2.2
   for more information about the notation. The range U+D800-U+DFFF has
   been included for reasons of completeness; character strings do not
   include code points in this range.

        %x00000-00020 / %x00022-00023 / %x00026-00026 / %x0003B-0003D
      / %x0003E-0003E / %x0005B-0005E / %x00060-00060 / %x0007B-0007D
      / %x0007F-0009F / %x0D800-0DFFF / %x0FDD0-0FDEF / %x0FFF0-0FFFF
      / %x1FFFE-1FFFF / %x2FFFE-2FFFF / %x3FFFE-3FFFF / %x4FFFE-4FFFF
      / %x5FFFE-5FFFF / %x6FFFE-6FFFF / %x7FFFE-7FFFF / %x8FFFE-8FFFF
      / %x9FFFE-9FFFF / %xAFFFE-AFFFF / %xBFFFE-BFFFF / %xCFFFE-CFFFF
      / %xDFFFE-E0FFF / %xEFFFE-EFFFF / %xFFFFE-FFFFF / %x10FFFE-10FFFF

   The resulting string matches the iquery production of [RFC3987] and
   is thus suitable for use as query string in IRIs.
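
   The four encoding steps above can be sketched in Python as follows.
   This is a minimal illustration under stated assumptions, not a
   reference implementation: the names Pair, encode, _encode_field and
   _must_escape are hypothetical, an undefined <value> is modelled as
   None, and the escape ranges are transcribed from the list above.
   Note that the list does not contain U+002B, so the sketch leaves a
   literal '+' unescaped, exactly as the algorithm is written.

```python
from typing import List, Optional, Tuple

# Hypothetical data model: a pair is (name, value); None stands for "undefined".
Pair = Tuple[str, Optional[str]]

# Inclusive code point ranges escaped in step 2, transcribed from the list above.
# The "last two code points of each plane" entries are handled in _must_escape.
_ESCAPE_RANGES = [
    (0x00, 0x20), (0x22, 0x23), (0x26, 0x26), (0x3B, 0x3D), (0x3E, 0x3E),
    (0x5B, 0x5E), (0x60, 0x60), (0x7B, 0x7D), (0x7F, 0x9F),
    (0xD800, 0xDFFF), (0xFDD0, 0xFDEF), (0xFFF0, 0xFFFF),
    (0xDFFFE, 0xE0FFF),
]


def _must_escape(cp: int) -> bool:
    """Return True if the code point has to be percent-encoded."""
    if cp == 0x25:  # '%' is escaped in step 1
        return True
    if any(lo <= cp <= hi for lo, hi in _ESCAPE_RANGES):
        return True
    # U+xFFFE and U+xFFFF in planes 1 to 16
    return cp > 0xFFFF and (cp & 0xFFFF) >= 0xFFFE


def _encode_field(text: str) -> str:
    """Apply steps 1 and 2 to a single name or value."""
    out = []
    for ch in text:
        if ch == " ":
            out.append("+")  # step 1: space becomes a plus sign
        elif _must_escape(ord(ch)):
            # UTF-8 octets, percent-encoded with uppercase hex digits;
            # str.encode rejects lone surrogates, which are not valid input.
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


def encode(pairs: List[Pair]) -> str:
    """Apply steps 3 and 4: build 'name' or 'name=value', join with ';'."""
    parts = []
    for name, value in pairs:
        if value is None:
            parts.append(_encode_field(name))
        else:
            parts.append(_encode_field(name) + "=" + _encode_field(value))
    return ";".join(parts)
```

   With these definitions, encode([("Cipher", "c=(m^e)%n")]) yields
   "Cipher=c%3D(m%5Ee)%25n". The sketch does not check whether the
   input list is one that the format can represent.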

   Specifications taking advantage of this feature are expected to
   define processing in terms of IRIs and in accordance with [RFC3987];
   construction of query strings suitable for use in Uniform Resource
   Identifiers is thus considered out of scope of this specification.

4.  Decoding algorithm

   This section describes the transformation of a character or octet
   string that conforms to this specification into an ordered list of
   (<name>, <value>) pairs. The purpose of this definition is to
   establish which data set a given string corresponds to; it is not
   meant to constrain how applications process the resulting data set.
   The algorithm is as follows:

   1.  If the input is a character string, encode the sequence using
       the UTF-8 character encoding.

   2.  If the octet sequence has a length of zero, the result is an
       empty list of pairs; otherwise proceed with the algorithm.

   3.  Split the octet sequence into an ordered list at each occurrence
       of the octets 0x3B (';') and 0x26 ('&'). If neither octet is
       part of the sequence, the list contains this sequence only.

   4.  Split each of the octet sequences into a (<name>, <value>) pair
       at the first occurrence of the octet 0x3D ('='); if this octet
       is not part of the sequence, the <name> is the whole sequence
       and the <value> is undefined.

   5.  Replace each occurrence of a sequence that matches pct-encoded
       [RFC3986] by the corresponding octet and each occurrence of the
       octet 0x2B ('+') by the octet 0x20 (' ') in each <name> and
       <value>.

   6.  Decode all octet sequences using the UTF-8 character encoding.

   Strings that do not conform to this specification cannot be decoded
   by means of this algorithm as they represent invalid UTF-8
   sequences. Handling of invalid strings is subject to [RFC3629] and
   otherwise unconstrained by this specification. Avoidance of any form
   of error recovery when processing application/www-form-urlencoded
   entities is encouraged.

5.  Examples

   This section provides a number of examples that illustrate encoding
   and decoding of data sets as defined in this specification. At the
   beginning of each example is the data set under consideration; it is
   followed by the representation of the data set in the canonical
   encoding (==) and then one or more alternate strings that represent
   the same data set (~~) or a different data set (!!).

      Note: Implementations of this specification only produce strings
      in the canonical form. Decoders however will typically also have
      to handle strings not in the canonical encoding, for example,
      when the string had to be transformed for use in a context that
      does not support the canonical form, such as query strings in
      URIs.

   Character strings are delimited by '"' characters and may include
   "<U+HHHH>" sequences which refer to the Unicode code point U+HHHH.
   The examples are to be understood in the context of section 7; the
   equivalence rules provided here hold in the general case, but some
   applications might apply additional equivalence rules.

   The following example illustrates handling of space characters. In
   the canonical encoding space characters are replaced by the '+'
   character; the decoding algorithm is more tolerant: it also accepts
   spaces encoded as "%20" and space characters that occur literally.
   Also note that leading and trailing space characters are preserved.

      ((" a b c ", " 1  3 "))
        == "+a+b+c+=+1++3+"
        ~~ "%20a%20b%20c%20=%201%20%203%20"
        ~~ " a b c = 1  3 "

   Newline characters are not normalized by the encoder or decoder:

      (("Text", "Line1<U+000A>Line2"))
        == "Text=Line1%0ALine2"
        ~~ "Text=Line1<U+000A>Line2"
        !! "Text=Line1%0D%0ALine2"
        !! "Text=Line1%0A%0DLine2"

   Characters outside the repertoire that can be encoded using US-ASCII
   typically do not have to be escaped. As with newlines, the encoding
   and decoding algorithms do not normalize or otherwise change the
   content.

      (("Chevron3", "Bo<U+00F6>tes"))
        == "Chevron3=Bo<U+00F6>tes"
        ~~ "Chevron3=Bo%C3%B6tes"
        !! "Chevron3=Bootes"

   The character U+0000 can occur in data sets, and encoders and
   decoders have to be prepared to handle it unless applications that
   employ them guarantee otherwise. While escaped in the canonical
   encoding, the character might also occur literally. It is incorrect
   to truncate the data set at the first occurrence of such a
   character.

      (("Lookup", "<U+0000>,,"))
        == "Lookup=%00,,"
        ~~ "Lookup=<U+0000>,,"
        !! "Lookup=,,"
        !! "Lookup="

   The following example illustrates handling of percent-encoding by
   the encoding and decoding algorithms. Certain sequences will never
   occur in the canonical form, but the decoding algorithm is designed
   to be highly robust and properly decode encoded data sets that
   contain them anyway.

      (("Cipher", "c=(m^e)%n"))
        == "Cipher=c%3D(m%5Ee)%25n"
        ~~ "Cipher=c=(m%5Ee)%25n"
        ~~ "Cipher=c=(m^e)%n"
        ~~ "%43%69%70%68%65%72=%63%3d%28%6D%5E%65%29%25%6e"
        !! "Cipher%3Dc%3D(m%5Ee)%25n"
        !! "Cipher=c=(m^e)"
        !! "Cipher=c"

   The following six examples illustrate handling of empty name fields,
   empty value fields, and undefined value fields. The special case
   (("", undefined)) is not part of this list as it is the only data
   set the encoding defined in this document cannot encode.

      (("", undef), ("", undef))  == ";"
      (("", undef), ("", ""))     == ";="
      (("", ""), ("", undef))     == "=;"
      (("", ""), ("", ""))        == "=;="
      (("", undef))               == ""
      (("", ""))                  == "="

   The separator characters ";" and "&" can both be used in encoded
   data sets; they always separate pairs if not escaped, even if both
   of them occur in a single string.

      (("a&b", "1"), ("c", "2;3"), ("e", "4"))
        == "a%26b=1;c=2%3B3;e=4"
        ~~ "a%26b=1&c=2%3B3&e=4"
        ~~ "a%26b=1;c=2%3B3&e=4"
        ~~ "a%26b=1&c=2%3B3;e=4"
        !! "a&b=1;c=2%3B3;e=4"
        !! "a%26b=1&c=2;3&e=4"

   Applications may take advantage of the notion of undefined values in
   encoded data sets; these are useful, for example, for boolean
   values, to distinguish between empty strings and "null" values, or
   queries. For instance, a data set used to control columns in product
   lists could look as follows. A more conventional way to encode the
   same information would be, e.g., "c1=img&c2=avail&c3=name&c4=price".

      (("img",undef), ("avail",undef), ("name",undef), ("price",undef))
        == "img;avail;name;price"

   The following examples do not conform to this specification, because
   they use Unicode code points not suitable for use with the UTF-8
   character encoding or because they contain sequences that do not
   represent valid UTF-8 sequences.

      * "Lookup="
      * "Lookup=%ED%AD%80%ED%B1%BF"
      * "Lookup=%FE%83%9E%AB%9B%BB%AF"
      * "Lookup=%C0%80"
      * "Lookup=%C3"
      * "Chevron3=Bo%F6tes"

6.  Fragment identifiers

   Fragment identifiers [RFC3987] of application/www-form-urlencoded
   resources identify a subset of the data set encoded by the resource.
   The characters '(', ')', and ',' are reserved and have to be escaped
   unless they are used as described in this section.

   Fragment identifiers that include '(' or ')' are reserved for future
   versions of this specification and have undefined semantics. It is
   incorrect to ignore or otherwise attempt to process them. The comma
   separates names.

   The fragment identifier syntax is a comma-separated list of names,
   encoded using the UTF-8 character encoding and percent-escaped as
   required by [RFC3986] or [RFC3987].

   Applications that generate fragment identifiers are encouraged to
   avoid escaping of characters that do not need to be escaped. The
   percent-encoding has to be reversed before the fragment identifier
   can be processed successfully.

   The fragment identifier acts as a filter over the data set: only the
   fields whose name matches a name in the list of names as encoded in
   the fragment identifier are identified. The order of names in the
   fragment identifier is insignificant; the original order is
   retained.

      data:application/www-form-urlencoded,a=1;b;a=2;=3;d,;f#e,,a,%64%2C

   Here the fragment identifier identifies the original data set with
   all fields that are not named in the fragment identifier removed:

      (("a", "1"), ("a", "2"), ("", "3"), ("d,", undefined))

   Applications supporting this fragment identifier syntax are expected
   to process the resource identifier given above essentially as if it
   were the following:

      data:application/www-form-urlencoded,a=1;a=2;=3;d,

   Fragment identifiers including the character '(' or ')' have
   undefined semantics as specified above.

7.  Application-specific processing

   This specification constrains the behavior of applications only in
   so far as they act as encoders and decoders. It does not define a
   protocol or other environment in which such processing takes place.
   This section provides a few examples to illustrate the implications.

   This specification does not define how applications obtain a list of
   pairs suitable for application/www-form-urlencoded encoding. Such a
   definition is considered part of the processing environment of the
   application, and if such an environment fails to define the process,
   applications trying to operate in the environment might fail to
   interoperate.

   This specification defines which data set conforming strings in a
   non-canonical encoding correspond to. For some applications it might
   be reasonable to assume that they will only have to process strings
   in the canonical encoding, to consider the occurrence of
   non-canonical strings an indication of a malfunction of the system,
   and to abort processing.

   This specification considers the order of pairs significant, treats
   strings case-sensitively, and allows multiple pairs to use the same
   name. Some applications might post-process the data set,
   transforming it into a set with unique uppercase names, and use that
   for further processing.

   Applications will typically be able to understand only a limited set
   of names and values, require them to be in a specific order, or
   place other constraints on the data set. In consequence, such
   applications might have to process data sets that do not meet the
   application's constraints. This specification makes no attempt to
   define how such applications should handle this kind of problem.

   It is expected that specifications of processing environments taking
   advantage of the format defined in this specification address their
   respective interoperability needs by restricting application
   behavior as needed and appropriate for their environment.

8.  Compatibility considerations

      Editorial note: This section is terrible. Replacements for this
      section are very much welcome.

   Most applications will have to be changed to take advantage of, or
   achieve compatibility with, the format defined in this document; in
   particular, applications encoding data sets and applications which
   depend on intrinsic knowledge of this format's media type require
   updating.

   This section discusses differences between the format defined in
   this document and application/x-www-form-urlencoded by means of the
   following HTML form.


   Submitting this form will result in a request as follows. Note that
   the character U+00F6 has been escaped only due to limitations of the
   underlying protocol.

      GET /t?url=http://example.org/Ragnar%C3%B6k/;lang=de HTTP/1.1
      ...

   Using method='post' instead, the request would be as follows. Note
   that here the character U+00F6 is not escaped but represented as a
   sequence of octets in the UTF-8 character encoding, since the format
   is not constrained by the protocol in this case.

      POST /t HTTP/1.1
      Content-Type: application/www-form-urlencoded
      ...

      url=http://example.org/Ragnarök/;lang=de

   Using the legacy format application/x-www-form-urlencoded instead
   would result in the following requests for method='get' and 'post'
   respectively. Note that it is being assumed that here, too, the
   UTF-8 character encoding is used when encoding the data set. Many
   existing applications use some other encoding; such applications are
   considered out of scope of this section.

      GET /t?url=http%3A%2F%2Fexample.org%2FRagnar%C3%B6k%2F&lang=de ...
      ...

   and

      POST /t HTTP/1.1
      Content-Type: application/x-www-form-urlencoded
      ...

      url=http%3A%2F%2Fexample.org%2FRagnar%C3%B6k%2F&lang=de

   The HTML specification recommends that HTML implementations ignore
   the enctype='application/www-form-urlencoded' attribute specified in
   the example form above when they do not understand the value.
   Implementations that follow this recommendation will then use the
   default value enctype='application/x-www-form-urlencoded' when
   submitting the form data set. As such, applications will typically
   have to handle both formats.
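
   Handling both formats with a single decoder can be sketched in
   Python along the lines of the decoding steps in section 4. This is
   an illustration under stated assumptions rather than a definitive
   implementation: the names decode and _unescape are hypothetical, an
   undefined value is modelled as None, and invalid UTF-8 is reported
   by letting the strict decoder raise UnicodeDecodeError instead of
   attempting any error recovery.

```python
import re
from typing import List, Optional, Tuple, Union

# Hypothetical data model: a pair is (name, value); None stands for "undefined".
Pair = Tuple[str, Optional[str]]

# Matches pct-encoded: '%' followed by two hexadecimal digits.
_PCT = re.compile(rb"%([0-9A-Fa-f]{2})")


def _unescape(octets: bytes) -> str:
    """Steps 5 and 6 for a single name or value."""
    # '+' is replaced before percent-decoding so that %2B stays a plus sign.
    octets = octets.replace(b"+", b" ")
    octets = _PCT.sub(lambda m: bytes([int(m.group(1), 16)]), octets)
    return octets.decode("utf-8")  # strict: invalid UTF-8 raises


def decode(data: Union[str, bytes]) -> List[Pair]:
    """Decode a character or octet string into an ordered list of pairs."""
    # Step 1: character strings are first encoded as UTF-8.
    octets = data.encode("utf-8") if isinstance(data, str) else bytes(data)
    # Step 2: the empty sequence is the empty list.
    if not octets:
        return []
    pairs = []
    # Step 3: split at every ';' and '&'.
    for field in re.split(rb"[;&]", octets):
        # Step 4: split at the first '='; without one the value is undefined.
        name, sep, value = field.partition(b"=")
        pairs.append((_unescape(name), _unescape(value) if sep else None))
    return pairs
```

   Under these assumptions, the two request bodies shown above decode
   to the same list of pairs, whichever of the two formats the
   submitting implementation used.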

   The decoding algorithm defined in this specification can properly
   decode any UTF-8-based application/x-www-form-urlencoded encoded
   data set and as such, if the 't' script can handle the form as
   specified above, it will also be able to handle submissions from
   HTML implementations lacking proper application/www-form-urlencoded
   support, with one caveat: if method='post' is used, it will also
   have to recognize the application/x-www-form-urlencoded media type.

   The 't' script might also have been designed for use with the legacy
   application/x-www-form-urlencoded format; in this case, the script
   would have to be able to accept characters like '/' and ':' in their
   unescaped form, accept ';' as separator, and in case of POST, accept
   the application/www-form-urlencoded media type.

   Such applications typically have no problems handling unescaped
   characters; accepting ';' as separator character is recommended by
   the HTML specification and many applications either follow this
   recommendation or can easily be configured to do so. Applications
   will typically not be able to recognize the new media type, and as
   such using the 'POST' method is unlikely to work with such legacy
   applications.

9.  Security considerations

   None not already inherent to the processing of the UTF-8 character
   encoding [RFC3629] and the handling of percent-encoded sequences
   [RFC3986]. Depending on how the format defined in this document is
   being used, the security considerations of the aforementioned RFCs,
   [RFC3987], and [RFC3875] might inform security decisions.

10.  IANA Considerations

   This document registers a new media type as defined in the following
   section.

11.  Media type registration

   Type name:  application

   Subtype name:  www-form-urlencoded

   Required parameters:  none

   Optional parameters:  none

      Note: The media type does not have a 'charset' parameter; it is
      incorrect to specify one and to associate any significance to it
      if specified. The character encoding is always UTF-8. The Unicode
      encoding form signature is not supported; a leading U+FEFF
      character will be considered part of a <name>.

   Encoding considerations:  8bit

   Security considerations:  See section 9.

   Interoperability considerations:  None, except as noted in other
      sections of this document.

   Published specification:  RFC XXXX

   Applications that use this media type:  Systems that interchange
      data sets of name-value pairs.

   Additional information:

      Magic number(s):  n/a

      File extension(s):  n/a

      Macintosh file type code(s):  TEXT

      Fragment identifiers:  See section 6.

   Person & email address to contact for further information:  See
      Author's Address section.

   Intended usage:  COMMON

   Restrictions on usage:  n/a

   Author:  See Author's Address section.

   Change controller:  The IESG.

12.  References

12.1.  Normative References

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
              Identifiers (IRIs)", RFC 3987, January 2005.

12.2.  Informative References

   [RFC1866]  Berners-Lee, T. and D. Connolly, "Hypertext Markup
              Language - 2.0", RFC 1866, November 1995.

   [RFC3875]  Robinson, D. and K. Coar, "The Common Gateway Interface
              (CGI) Version 1.1", RFC 3875, October 2004.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [RFC4288]  Freed, N. and J. Klensin, "Media Type Specifications and
              Registration Procedures", BCP 13, RFC 4288, December
              2005.

Author's Address

   Bjoern Hoehrmann
   Weinheimer Strasse 22
   Mannheim  D-68309
   Germany

   EMail: mailto:bjoern@hoehrmann.de
   URI:   http://bjoern.hoehrmann.de

Full Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
   INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgement

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).