INTERNET DRAFT C. Feather Expires: 8 December 2001 Thus plc 8 June 2001 The Hashed URI draft-feather-hashed-uri-00.txt 1. Status of this Document This document is an Internet-Draft and is in full conformance with Section 10 of RFC 2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or made obsolete by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accesses at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft shadow directories can be accessed at http://www.ietf.org/shadow.html. This section will be updated with the appropriate verbiage from RFC 2223 when this document has been found ready for publication as an RFC. 2. Abstract There are situations where it is desirable to determine whether two URIs are identical without revealing the value of one or the other if they are not. For example, the publisher of filtering software may want to provide a set of URLs to be filtered without identifying the actual resources in question. This problem has traditionally been done using cryptographically secure hashes. The two items are both hashed and the resulting values then compared. If they are equal it can be assumed that the two original items were equal as well; if not, the hash does not reveal anything about the original item. This technique is eminently suitable for this situation. This document describes a "hashed" URI naming scheme for representing the hashed form of a URI [RFC2396] as another URI. Feather [Page 1] INTERNET DRAFT The Hashed URI 8 June 2001 3. Formal definitions Formal definitions in this document use augmented Backus-Naur syntax [RFC2234]. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. 4. The hashed URI scheme 4.1. Syntax A hashed URI is formally described as follows: hashed-uri = hashed-scheme ":" hash-algorithm "=" hash-value hashed-scheme = "hashed" hash-algorithm = identifier hash-value = 1*HEXDIG identifier = 1*(ALPHA / DIGIT / "." / "-") ALPHA = %x41-5A / %x61-7A ; A-Z / a-z DIGIT = %x30-39 ; 0-9 HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F" The entire URI is case-insensitive. 4.2. Identification of hash algorithms Each hashing algorithm has an identifier for the purposes of the hashed URI scheme. Names beginning with "x-" are available for private use. The Internet Assigned Numbers Authority (IANA) will maintain a registry of hash algorithms and the associated identifier. The initial contents of this registry will be: Identifier Definition "md5" Message Digest algorithm 5 ("MD5") [RFC1321] "sha-1" Secure Hash Algorithm ("SHA-1") [FIPS-180-1] A hashed URI MUST only use an identifier that is in the registry or that is available for private use. An application that interprets hashed URIs MUST accept "sha-1" and SHOULD accept all identifiers in the registry. Feather [Page 2] INTERNET DRAFT The Hashed URI 8 June 2001 4.3. Generating a hashed URI A hashed URI is generated from an initial URI in the following three steps. (1) The initial URI is reduced to its equivalent canonical form. To do this, any #fragment or ?query component is removed, a relative URI is converted to the equivalent absolute URI, and the scheme name is converted to lowercase; any other changes depend on the specific scheme used. Except where done as part of a canonicalization algorithm for the specific scheme, no encoding or decoding of escaped characters is done. A possible canonicalization algorithm for hierarchical URIs is given in Appendix A. (2) The resulting string is passed through the appropriate hash algorithm. The string does not include any trailing zero bytes, a length, delimiting angle brackets, or any other ornamentation. This produces a sequence of octets (16 octets for MD5, 20 octets for SHA-1). If the algorithm requires the data to be a multiple of a particular size and does not specify a padding rule, the string is padded with zero bytes to that size (MD5 and SHA-1 both specify their own padding rules and therefore no padding is done at this stage). (3) Each octet is converted to two hexadecimal digits in the normal manner and these are written in the same order as the corresponding octets. (4) Any #fragment or ?query component is restored. 4.4. Examples Initial URI Hashed URI mailto:clive@demon.net hashed:md5=cd933f3b87ee60e58917448c9678ee32 http://www.thus.net hashed:md5=be96302a0468481aca954431cf293e05 HTTP://www.Thus.net hashed:md5=be96302a0468481aca954431cf293e05 http://10.20.30.40/a%62c hashed:md5=754a2c63a13aa2e8153d027066e31f8f 5. IANA considerations Section 4.2 requires IANA to maintain a registry of hash algorithms and identifiers. Since implementers will want to handle as many hashes as possible, adding new algorithms places a significant burden on them. Therefore new algorithms should only be added when they are widely accepted in the community. The IETF Consensus process [RFC2434] is recommended. Feather [Page 3] INTERNET DRAFT The Hashed URI 8 June 2001 6. Security Considerations The security considerations for URIs in [RFC2396] continue to apply. The security of the hashed URI depends completely on that of the hash algorithm used. Where the initial URI is limited to a sufficiently small set of possibilities, an exhaustive search can be used to determine the initial URI from the hashed URI. Given H hashed URIs and P possible initial URIs, the time take to do this search is O(H+P) and not O(HP). In particular, the set of URIs of the form , where the domain is in a major domain registry, may be small enough for such a search to be feasible. Because (subject to the above) the initial URI cannot be determined from a hashed URI, it is necessary to trust the originator of the latter as to whether the resource has any of the properties claimed for it. 7. References [FIPS-180-1] "Secure Hash Standard", National Institute of Standards and Technology, U.S. Department Of Commerce, April 1995. Also known as: 59 Fed Reg 35317 (1994). [RFC1321] Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321, April 1992. [RFC2119] Bradner, S., "Keywords for use in RFCs to Indicate Requirement Levels", RFC 2119. [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", RFC 2234, November 1997. [RFC2396] Berners-Lee, T., R. Fielding and L. Manister, "Uniform Resource Identifiers (URI): Generic Syntax", RFC 2396, August 1998. [RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", RFC 2434, October 1998. 8. Acknowledgements The assistance of Richard Clayton in preparing this document is greatly appreciated. Feather [Page 4] INTERNET DRAFT The Hashed URI 8 June 2001 9. Author's Address Clive Feather Thus plc 322 Regents Park Road London N3 2QQ United Kingdom Email: clive@demon.net or: clive@davros.org Tel: +44 20 8371 1138 Fax: +44 20 8371 4037 This document expires 8 December 2001. Feather [Page 5] INTERNET DRAFT The Hashed URI 8 June 2001 A Canonicalization of hierarchical URIs The following algorithm is proposed as a way of canonicalizing hierarchical URIs; that is, those that use the production "hier_part" from [RFC2396] appendix A. In the real world it is not always possible to say whether two URIs address the same resource, because this will depend on the actual implementation deployed on network servers. Therefore the algorithm has two variants - "N" and "P" - which respectively tend towards false negatives (the URIs are assumed to be different) and false positives (they are assumed to be the same). The choice of variant will depend on the application domain. (1) The changes described in section 3.3 step (1) are made. (2) The URI is decomposed into scheme, authority, and a list of path_segments. If the scheme uses user/host/port triples as authorities, the authority is further split into the three parts. Any of these parts may be empty. (3) If the port is the default port for the scheme, it is deleted. (4) Variant N: the scheme and host (if a domain name) are lowercased. Variant P: all the components are lowercased. (5) The host (if an IP address) and port have all leading zeroes removed. (6) Variant P only: if the last path_segment ends with one of the strings in the first column, that string is replaced by the one in the second column: .html .htm .jpeg .jpg .text .txt .ram .ra (7) Scheme "http" only: if the last path_segment is "index.htm" or "index.html", it is removed. (8) The URI is recomposed, including only those components that remain and are not empty strings. That is, it is the concatenation of: [always] scheme ":" [if there is an authority] "//" authority [if there is a user and host] "//" user "@" host [if there is only a host] "//" host [if there is a port] ":" port [for each path_segment in order] "/" path_segment Feather [Page 6] INTERNET DRAFT The Hashed URI 8 June 2001 Examples: Original URI: "http://Fred@WWW.Thus.net:0112/Test/index.html" Variant N: "http://Fred@www.thus.net:112/Test/" Variant P: "http://fred@www.thus.net:112/test/" Original URI: "http://111.011.101.001/test/Image.jpeg" Variant N: "http://111.11.101.1/test/Image.jpeg" Variant P: "http://111.11.101.1/test/image.jpg" This document expires 8 December 2001. Feather [Page 7] -- Clive D.W. Feather | Internet Expert | Work: Tel: +44 20 8371 1138 | Demon Internet | Home: Fax: +44 20 8371 4037 | Thus plc | Web: Written on my laptop; please observe the Reply-To address