INTERNET DRAFT C. Feather Expires: 18 July 2002 Thus plc 18 January 2002 The Hashed URI draft-feather-hashed-uri-02.txt 1. Status of this Document This document is an Internet-Draft and is in full conformance with Section 10 of RFC 2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or made obsolete by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accesses at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft shadow directories can be accessed at http://www.ietf.org/shadow.html. This section will be updated with the appropriate verbiage from RFC 2223 when this document has been found ready for publication as an RFC. 2. Abstract There are situations where it is desirable to determine whether two URIs are identical without revealing the value of one or the other if they are not. For example, the publisher of filtering software may want to provide a set of URLs to be filtered without identifying the actual resources in question. This problem has traditionally been done using cryptographically secure hashes. The two items are both hashed and the resulting values then compared. If they are equal it can be assumed that the two original items were equal as well; if not, the hash does not reveal anything about the original item. This technique is eminently suitable for this situation. This document describes a "hashed" URI naming scheme for representing the hashed form of a URI [RFC2396] as another URI. Feather [Page 1] INTERNET DRAFT The Hashed URI 18 January 2002 3. Formal definitions Formal definitions in this document use augmented Backus-Naur syntax [RFC2234]. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. 4. The hashed URI scheme 4.1. Syntax A hashed URI is formally described as follows: hashed-uri = hashed-scheme ":" hash-data *flag hash-data = hash-algorithm "=" hash-value hashed-scheme = "hashed" hash-algorithm = identifier hash-value = 1*HEXDIG identifier = 1*(ALPHA / DIGIT / "." / "-") flag = "+frag" / "+query" ALPHA = %x41-5A / %x61-7A ; A-Z / a-z DIGIT = %x30-39 ; 0-9 HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F" The entire URI is case-insensitive. 4.2. Identification of hash algorithms Each hashing algorithm has an identifier for the purposes of the hashed URI scheme. Names beginning with "x-" are available for private use. The Internet Assigned Numbers Authority (IANA) will maintain a registry of hash algorithms and the associated identifier. The initial contents of this registry will be: Identifier Definition "md5" Message Digest algorithm 5 ("MD5") [RFC1321] "sha1" Secure Hash Algorithm ("SHA-1") [FIPS-180-1] A hashed URI MUST only use an identifier that is in the registry or that is available for private use. An application that interprets hashed URIs MUST accept "sha1" and SHOULD accept all identifiers in the registry. Feather [Page 2] INTERNET DRAFT The Hashed URI 18 January 2002 4.3. Generating a hashed URI A hashed URI is generated from an initial URI in the following five steps. (1) Any #fragment or ?query component is removed if appropriate. When generating hashes for publication, whether or not this is appropriate will depend on the purpose and possibly on the particular URI. When generating a hash to compare against a hashed URI, the flags on the latter can be used to determine which to retain. (2) The URI is reduced to its equivalent canonical form. To do this, the scheme name is converted to lowercase and a relative URI is converted to the equivalent absolute URI; any other changes depend on the specific scheme used. Except where done as part of a canonicalization algorithm for the specific scheme, no encoding or decoding of escaped characters is done. A possible canonicalization algorithm for hierarchical URIs is given in Appendix A. (3) The resulting string is passed through the appropriate hash algorithm. The string does not include any trailing zero bytes, a length, delimiting angle brackets, or any other ornamentation. This produces a sequence of octets (16 octets for MD5, 20 octets for SHA-1). If the algorithm requires the data to be a multiple of a particular size and does not specify a padding rule, the string is padded with zero bytes to that size (MD5 and SHA-1 both specify their own padding rules and therefore no padding is done at this stage). (4) Each octet is converted to two hexadecimal digits in the normal manner and these are written in the same order as the corresponding octets. (5) If a #fragment or ?query was retained in step 1, the "+frag" or "+query" flag is appended. 4.4. Examples Initial: mailto:clive@demon.net Hashed: hashed:md5=cd933f3b87ee60e58917448c9678ee32 Initial: http://www.thus.net Hashed: hashed:md5=be96302a0468481aca954431cf293e05 or: hashed:sha1=87ed28009f511ed7ef630180a229ba257180a481 Initial: http://www.thus.net#end Hashed: hashed:sha1=87ed28009f511ed7ef630180a229ba257180a481 or: hashed:sha1=31a79a96e0404778a3b95cb426b04b6114a604a5+frag Initial: HTTP://www.Thus.net Hashed: hashed:md5=be96302a0468481aca954431cf293e05 Initial: http://10.20.30.40/a%62c Hashed: hashed:md5=754a2c63a13aa2e8153d027066e31f8f Feather [Page 3] INTERNET DRAFT The Hashed URI 18 January 2002 5. IANA considerations Section 4.2 requires IANA to maintain a registry of hash algorithms and identifiers. Since implementers will want to handle as many hashes as possible, adding new algorithms places a significant burden on them. Therefore new algorithms should only be added when they are widely accepted in the community. The IETF Consensus process [RFC2434] is recommended. 6. Security Considerations The security considerations for URIs in [RFC2396] continue to apply. The security of the hashed URI depends completely on that of the hash algorithm used. Where the initial URI is limited to a sufficiently small set of possibilities, an exhaustive search can be used to determine the initial URI from the hashed URI. Given H hashed URIs and P possible initial URIs, the time take to do this search is O(H+P) and not O(HP). In particular, the set of URIs of the form , where the domain is in a major domain registry, may be small enough for such a search to be feasible. Because (subject to the above) the initial URI cannot be determined from a hashed URI, it is necessary to trust the originator of the latter as to whether the resource has any of the properties claimed for it. 7. References [FIPS-180-1] "Secure Hash Standard", National Institute of Standards and Technology, U.S. Department Of Commerce, April 1995. Also known as: 59 Fed Reg 35317 (1994). [RFC1321] Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321, April 1992. [RFC2119] Bradner, S., "Keywords for use in RFCs to Indicate Requirement Levels", RFC 2119. [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", RFC 2234, November 1997. [RFC2396] Berners-Lee, T., R. Fielding and L. Manister, "Uniform Resource Identifiers (URI): Generic Syntax", RFC 2396, August 1998. [RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", RFC 2434, October 1998. Feather [Page 4] INTERNET DRAFT The Hashed URI 18 January 2002 8. Acknowledgements The assistance of Richard Clayton in preparing this document is greatly appreciated. I am also grateful to Ian Cooper for his comments. 9. Change history Changes in version -01: * The "sha-1" algorithm identifier changed to "sha1". * The "+frag" and "+query" flags added. * The handling of fragments and queries clarified. * Appendix A altered to eliminate empty path_segments in variant P. * Some further examples added and others altered. Changes in version -02: * None. 10. Author's Address Clive Feather Thus plc 322 Regents Park Road London N3 2QQ United Kingdom Email: clive@demon.net or: clive@davros.org Tel: +44 20 8371 1138 Fax: +44 20 8371 4037 This document expires 18 July 2002. Feather [Page 5] INTERNET DRAFT The Hashed URI 18 January 2002 A Canonicalization of hierarchical URIs The following algorithm is proposed as a way of canonicalizing hierarchical URIs; that is, those that use the production "hier_part" from [RFC2396] appendix A. In the real world it is not always possible to say whether two URIs address the same resource, because this will depend on the actual implementation deployed on network servers. Therefore the algorithm has two variants - "N" and "P" - which respectively tend towards false negatives (the URIs are assumed to be different) and false positives (they are assumed to be the same). The choice of variant will depend on the application domain. (1) The changes described in section 4.3 steps (1) and (2) are made. (2) The URI is decomposed into scheme, authority, and a list of path_segments. If the scheme uses user/host/port triples as authorities, the authority is further split into the three parts. Any of these parts may be empty. (3) If the port is the default port for the scheme, it is deleted. (4) Variant N: the scheme and host (if a domain name) are lowercased. Variant P: all the components are lowercased. (5) The host (if an IP address) and port have all leading zeroes removed. (6) Variant P only: all empty path_segments are removed. (7) Variant P only: if the last path_segment ends with one of the strings in the first column, that string is replaced by the one in the second column: .html .htm .jpeg .jpg .text .txt .ram .ra (8) Scheme "http" only: if the last path_segment is "index.htm" or "index.html", it is removed. (9) The URI is recomposed, including only those components that remain and are not empty strings. That is, it is the concatenation of: [always] scheme ":" [if there is an authority] "//" authority [if there is a user and host] "//" user "@" host [if there is only a host] "//" host [if there is a port] ":" port [for each path_segment in order] "/" path_segment Feather [Page 6] INTERNET DRAFT The Hashed URI 18 January 2002 Examples: Original URI: "http://Fred@WWW.Thus.net:0112/Test/index.html" Variant N: "http://Fred@www.thus.net:112/Test/" Variant P: "http://fred@www.thus.net:112/test/" Original URI: "http://111.011.101.001/test//Image.jpeg" Variant N: "http://111.11.101.1/test//Image.jpeg" Variant P: "http://111.11.101.1/test/image.jpg" This document expires 18 July 2002. Feather [Page 7]