Network Working Group                                        P. Thiemann
Internet-Draft                                       Freiburg University
Category: Informational                                     20 June 2003
Expires: December 20, 2003

          A URN Namespace For Content-Based Unique Identifiers
                    draft-thiemann-cbuid-urn-00.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 31, 2003.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

This document describes a URN namespace to identify immutable octet-stream resources using content-based unique identifiers. The naming scheme relies on an algorithm that computes identifiers from media types and cryptographic hashes without a central authority.

1. Conventions used in this document

The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY" in this document are to be interpreted as defined in "Key words for use in RFCs to Indicate Requirement Levels" [RFC2119].

Thiemann                                                        [Page 1]
Internet-Draft      Content-Based Unique Identifiers        30 June 2003

2. Introduction

A URN serves as a unique name for a resource [RFC1630]. Most URN namespaces involve a central authority to ensure uniqueness of assigned names.
This approach has its merits but it requires organizational structures for processing requests for naming and for bookkeeping about used names. Thus, acquiring a URN becomes an involved task not to be undertaken on a day-to-day basis.

The cbuid proposal enables using and creating URNs on a day-to-day basis for storing and retrieving immutable octet-stream resources. It relies on a decentralized, algorithmic assignment of identifiers by exploiting the uniqueness guarantees of (cryptographic) hashes. In fact, this document contains the assignment algorithm so that everyone can generate identifiers in this namespace.

The namespace has two subspaces, an untyped one and a typed one. The untyped subspace considers the referenced resource as a sequence of octets and computes the hash directly from that representation. The typed subspace relies on the media type of the resource to compute one or more hashes depending on the structure of the resource. At present, a special typed subspace is only defined for the media types application/octet-stream and message/rfc822.

This namespace specification is for a formal namespace. The specification adheres to the guidelines given in "Uniform Resource Names (URN) Namespace Definition Mechanisms" [RFC3406].

3. Specification Template

Namespace ID:

   "cbuid" requested.

Registration Information:

   Registration Version Number: 1
   Registration Date: 2003-06-??

Declared registrant of the namespace:

   The CBUID Project
   Institut fuer Informatik
   Universitaet Freiburg
   Georges-Koehler-Allee 079
   D-79110 Freiburg
   Germany

   Contact: Peter Thiemann
   info@www.cbuid.org

Declaration of syntactic structure:

The Namespace Specific Strings (NSS) of all URNs assigned by the schema described in this document will conform to the syntax defined in section 2.2 of RFC2141 [RFC2141].
The formal syntax of the NSS is defined by the following normative ABNF [RFC2234] rules for <cbuid-nss>:

   cbuid-nss   = type-spec ":" hash *1(":" type-specific-extension)
   type-spec   = "*" / media-type *parameter
   parameter   = ";" "mode" "=" 1*DIGIT / ";" token "=" token
   token       = 1*(ALPHA / DIGIT)
   hash        = hash-scheme ":" hash-value *("/" hash-value)
   hash-scheme = "md5" / "sha1" / "hash127" / hash-token
   hash-token  = token
   hash-value  = "*" / 1*HEXDIG

The following are comments and restrictions not captured by the above grammar.

A <media-type> is any MIME media type [RFC2046] which is registered in the appropriate IANA registry [IANA-MT].

A <parameter> is a pair of a parameter name and a parameter value. If the parameter name is "mode" then the parameter value is a string of digits.

A <token> is an alphanumeric token.

A <hash-value> may be "*" (unspecific) or it is a non-empty sequence of hexdigits whose value as a sequence of bits must be a valid hash for the specified hash-scheme. If hash-scheme is "md5" or "hash127", then each specified hash-value consists of 32 HEXDIG. If hash-scheme is "sha1", then each specified hash-value consists of 40 HEXDIG.

The "mode" parameter MUST specify the number of additional <hash-value>s. Its value defaults to 0 if the parameter is omitted. That is, mode=0 (or mode parameter omitted) means that there is one <hash-value>, for mode=1 there are two <hash-value>s, and so on.

If the type specification is "*" (unspecific), then the hash part MUST specify exactly one <hash-value>, the "mode" parameter MUST be 0, and there MUST NOT be a <type-specific-extension>.

If a type is specified, then more than one <hash-value> MAY be given, the "mode" parameter MUST specify the number of extra <hash-value>s, and a <type-specific-extension> MAY be allowed.

If an identifier contains only one <hash-value>, then this MUST NOT be "*" (unspecific).
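The grammar and the restrictions listed above can be checked mechanically. The following Python fragment is a minimal sketch only, not part of the specification. The function name parse_cbuid_nss and the returned dictionary layout are hypothetical, the check against the IANA media type registry is left out, and the hash-scheme is folded to lower case as an assumption.

```python
import re

# Hex-digit counts stated in the text for the predefined hash schemes.
HASH_LENGTHS = {"md5": 32, "hash127": 32, "sha1": 40}
TOKEN = re.compile(r"[A-Za-z0-9]+")
HEXDIG = re.compile(r"[0-9A-Fa-f]+")


def parse_cbuid_nss(nss):
    """Split a cbuid NSS into its parts and enforce the stated restrictions.

    Hypothetical helper; the media type is NOT checked against the registry.
    """
    parts = nss.split(":", 3)
    if len(parts) < 3:
        raise ValueError("expected type-spec ':' hash-scheme ':' hash-value(s)")
    type_spec, scheme = parts[0], parts[1].lower()
    values = parts[2].split("/")
    extension = parts[3] if len(parts) == 4 else None

    media_type, *params = type_spec.split(";")
    if not media_type:
        raise ValueError("empty type-spec")

    mode = 0  # default when the "mode" parameter is omitted
    for param in params:
        name, sep, value = param.partition("=")
        if not (sep and TOKEN.fullmatch(name) and TOKEN.fullmatch(value)):
            raise ValueError(f"malformed parameter: {param!r}")
        if name.lower() == "mode":
            if not value.isdigit():
                raise ValueError("mode must be a string of digits")
            mode = int(value)
        # Parameters other than "mode" are accepted and ignored.

    if not TOKEN.fullmatch(scheme):
        raise ValueError("hash-scheme must be an alphanumeric token")
    if len(values) != mode + 1:
        raise ValueError("mode must equal the number of additional hash-values")

    expected = HASH_LENGTHS.get(scheme)  # None for schemes not predefined
    for value in values:
        if value == "*":
            continue
        if not HEXDIG.fullmatch(value):
            raise ValueError(f"hash-value is not hexadecimal: {value!r}")
        if expected is not None and len(value) != expected:
            raise ValueError(f"{scheme} hash-values need {expected} hexdigits")

    if len(values) == 1 and values[0] == "*":
        raise ValueError("a single hash-value must not be unspecific")
    if media_type == "*" and (params or extension is not None):
        raise ValueError("untyped identifiers take no parameters or extension")

    return {"type": media_type, "mode": mode, "scheme": scheme,
            "values": values, "extension": extension}
```

For instance, parse_cbuid_nss("*:md5:*") is rejected because a lone hash-value must be specific, whereas the same call with a 32-digit md5 value is accepted with mode 0.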
Examples:

   urn:cbuid:*:md5:5307d294b6ccd9854f2deed8c1628b72

   urn:cbuid:*:sha1:7660c8efbe7f656ce7612636c83a138c085bad3f

   urn:cbuid:message/rfc822:md5:5307d294b6ccd9854f2deed8c1628b72

   urn:cbuid:message/rfc822;mode=1:md5:*
      /d97a43ed7125019c363b00bd27411fa7

   urn:cbuid:message/rfc822;mode=1:md5
      :b260fb53d7ec3b530e5a6332763a2bfb
      /d97a43ed7125019c363b00bd27411fa7

In the last two examples, the linebreak and whitespace before the ":" and the "/" are inserted for editorial reasons; they are not part of the URN.

Relevant ancillary documentation:

None as yet.

Identifier uniqueness considerations:

Each identifier contains at least one cryptographic hash value for the referenced resource. The probability that two different resources have the same hash value depends on the hash function. For the md5 hash, where the hash value has 128 bits, it is conjectured [RFC1321] that the probability of a collision is on the order of 1/2^64. For the sha1 hash, where the hash value has 160 bits, this probability should be even smaller. In addition, if a procedure for creating collisions with one of these hash functions is found, then it is easy to extend the "cbuid" NSS with hash functions that provide smaller collision probabilities.

Identifier persistence considerations:

The binding between the identifier and the referenced resource is permanently established by the assignment algorithm that computes the identifier from the resource.

Process of identifier assignment:

Assignment is completely open, following the algorithm below.
The inputs of the algorithm are

   - the name of a hash function
   - a media type for <type-spec>
   - a mode parameter (0 if omitted)
   - a resource (a sequence of octets)

The algorithm applies the hash function to the resource, converts the resulting bit sequence into a sequence of hexdigits, and constructs the URN from the type-spec, the mode, the hash-scheme, and the sequence of hexdigits using the syntax described above.

Algorithms for the predefined hash functions are defined in the following references:

   md5      [RFC1321]
   sha1     [RFC3174]
   hash127  [HASH127]

The conversion of a <hash-value> to a string proceeds as follows. The bits in the <hash-value> are converted from most significant to least significant bit, four bits at a time, to their ASCII representation. Each sequence of four bits is represented by its hexadecimal digit from "0123456789abcdef". That is, binary 0000 is represented by the character '0', 0001 by '1', and so on up to the representation of 1111 as 'f'.

For a resource of type "message/rfc822", the mode parameter MAY be 1. In this case two hash values are requested. The algorithm computes separate hash-values for the header and the body of the message and constructs the identifier by concatenating the <type-spec>, the mode parameter, the <hash-scheme>, and the two <hash-value>s (hash of header, hash of body) according to the syntax described above. [Note: no unspecific hash-values are generated by the algorithm; they are only used in queries.]

At present, mode MUST be 0 for any other type.

Process of identifier resolution:

The cbuid namespace is intended for communication between a group of clients and a repository (for example, an email client and an email store in which identifiers would replace the UID mechanism of IMAP [RFC2060]). Since the repository keeps all available resources, it implements the mechanism outlined below for resolution. In the present stage, global resolution is not intended.
However, this might become sensible in a later revision, after experience has been gained.

A local resource repository maintains a mapping from <hash-scheme> and <hash-value> to resources. From that information it evaluates identifiers as follows.

If an identifier contains only one (specific) hash-value, then it refers to the entire resource determined by the repository's mapping.

The media types "application/octet-stream" and "message/rfc822" have fixed type-specific interpretations.

The media type "application/octet-stream" is equivalent to "*" (unspecific). It can only have one <hash-value> and refers to the entire resource determined by the repository's mapping.

If the media type is "message/rfc822", then the mode parameter may be used to select an alternative resolution. The permitted values for mode are 0 and 1. Mode=0 selects the standard resolution as for "application/octet-stream". If mode=1, then two <hash-value>s must be supplied. The first is the result of applying the hash function to the header of the message (including the CRLF terminating the last field but without the blank line separating header from body). It MAY be unspecific. The second is the result of applying the hash function to the body of the message. It MUST be specific. (See [RFC2822] for the precise definitions.) If both hash-values are specific, then one specific message is referenced. If only the body hash is specific, then the body is returned with a minimal anonymized header where all fields that may contain personal information are blanked out. More specifically, the Date field remains, the From, To, and Subject fields remain but are blanked out, the MIME header fields remain unchanged [RFC2045], and all further header fields are removed from the header.

For the media type "message/rfc822", a <type-specific-extension> MAY be present for selecting parts of a message. In this case, the syntax for <type-specific-extension> is given by the non-terminal from the specification of the IMAP URL scheme [RFC2192].
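The assignment algorithm, including the header/body split used for mode=1, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions and not a reference implementation: the names assign_cbuid, split_rfc822 and hex_hash are hypothetical, hash127 is left out because Python's hashlib does not provide it, and a message without a blank line is treated as a header with an empty body.

```python
import hashlib

# Only schemes available in Python's hashlib are covered; hash127 is omitted.
HASHERS = {"md5": hashlib.md5, "sha1": hashlib.sha1}


def hex_hash(scheme, octets):
    """Hash the octets and render the result as lower-case hexdigits,
    most significant four bits first."""
    return HASHERS[scheme](octets).hexdigest()


def split_rfc822(message):
    """Split a message into (header, body).

    The header keeps the CRLF that terminates its last field; the blank
    line separating header from body belongs to neither part.
    Assumption: without a blank line, the whole message is the header.
    """
    idx = message.find(b"\r\n\r\n")
    if idx < 0:
        return message, b""
    return message[: idx + 2], message[idx + 4:]


def assign_cbuid(resource, scheme="md5", media_type="*", mode=0):
    """Compute a cbuid URN for a resource given as a bytes object."""
    if scheme not in HASHERS:
        raise ValueError(f"hash scheme not supported by this sketch: {scheme}")
    if mode == 0:
        # One hash over the entire resource; mode=0 is left implicit.
        type_spec = media_type
        hashes = [hex_hash(scheme, resource)]
    elif mode == 1 and media_type == "message/rfc822":
        # Two hashes: header first, then body.
        header, body = split_rfc822(resource)
        type_spec = media_type + ";mode=1"
        hashes = [hex_hash(scheme, header), hex_hash(scheme, body)]
    else:
        raise ValueError("mode 1 is only defined for message/rfc822")
    return "urn:cbuid:" + type_spec + ":" + scheme + ":" + "/".join(hashes)
```

Under these assumptions, assign_cbuid(b"abc") yields urn:cbuid:*:md5:900150983cd24fb0d6963f7d28e17f72. The algorithm never emits an unspecific ("*") hash-value; those appear only in queries.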
For illustration, here is a review and interpretation of the examples given above.

   urn:cbuid:*:md5:5307d294b6ccd9854f2deed8c1628b72

   urn:cbuid:*:sha1:7660c8efbe7f656ce7612636c83a138c085bad3f

Both are references to an untyped resource using the hash value of the entire resource. Their mode parameter is (implicitly) 0.

   urn:cbuid:message/rfc822:md5:5307d294b6ccd9854f2deed8c1628b72

A reference to an entire email message.

   urn:cbuid:message/rfc822;mode=1:md5:*
      /d97a43ed7125019c363b00bd27411fa7

A reference to an anonymized message. Mode=1 indicates that there are two hash values present. [Note: the linebreak and whitespace before the "/" are inserted for editorial reasons; they are not part of the URN.]

   urn:cbuid:message/rfc822;mode=1:md5
      :b260fb53d7ec3b530e5a6332763a2bfb
      /d97a43ed7125019c363b00bd27411fa7

A reference to an entire email message by header and body hashes. [Note: the linebreak and whitespace before the ":" and the "/" are inserted for editorial reasons; they are not part of the URN.]

Rules for Lexical Equivalence:

Lexical equivalence is identity after normalization. An identifier in the cbuid URN namespace is normalized by

   1. converting all characters to lower case
   2. eliminating the parameter "mode" if its value is 0
   3. eliminating all parameters which are not "mode" parameters.

Conformance with URN Syntax:

There are no additional characters reserved.

Validation mechanism:

Each identifier in the namespace MUST conform with the syntax specified above. Unknown parameters MUST be ignored.

Scope:

The namespace is global and public.

4. IANA Considerations

This document includes a URN namespace registration that is to be entered into the IANA registry for URN NIDs.

5. Namespace Considerations

Most URN namespaces are assigned to organizations and rely on a centralized registry to achieve uniqueness and persistence. In contrast, the cbuid namespace is not tied to any organization.
Assignment of identifiers can be performed and verified individually, while uniqueness is still preserved (with a probability close to 1). Moreover, the naming scheme is extensible with respect to the addressing requirements of new media types and with respect to improved hash functions. The transition between hash functions does not mean that the "old" identifiers lose their meaning (cf. the discussion in BCP 66 [RFC3406]). However, each resolver implementation will prefer one hash scheme over the other and will resolve identifiers based on the preferred scheme more quickly.

The resources managed by the cbuid namespace are immutable octet streams. The only meaningful services are storage and retrieval, as provided by a repository. There is no update operation.

6. Community Considerations

We expect to make public implementations of tools available that manage repositories of email messages and other immutable resources. Addressing of resources in these repositories will rely on cbuid URNs.

The main impact that we expect from the use of cbuid URNs is sharing of resources among the users of a single repository. Once a distribution scheme has been devised (perhaps also on the basis of the URNs), repositories may be replicated and the benefit of sharing may be exploited further.

7. Security Considerations

As long as access to a repository is regulated by a suitable form of access control and transport encryption, there are no security considerations inherent in the use of cbuid URNs.

If unregulated access (e.g., without authentication) to a repository is allowed, then only anonymized accesses should be permitted (as defined for the message/rfc822 type) to protect the privacy of the registered clients of the repository.

Normative References

[HASH127] Bernstein, D. J., "Floating-Point Arithmetic and Message Authentication", http://cr.yp.to/papers/hash127.ps, March 2000.

[RFC1321] Rivest, R.
L., "The MD5 Message-Digest Algorithm", RFC 1321, April 1992.

[RFC2045] Freed, N., and Borenstein, N., "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", RFC 2045, November 1996.

[RFC2046] Freed, N., and Borenstein, N., "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", RFC 2046, November 1996.

[RFC2119] Bradner, S., "Key Words for Use in RFCs to Indicate Requirement Levels", RFC 2119, March 1997.

[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.

[RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

[RFC2234] Crocker, D., Editor, and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", RFC 2234, November 1997.

[RFC2822] Resnick, P., Editor, "Internet Message Format", RFC 2822, April 2001.

[RFC3174] Eastlake, D., and Jones, P., "US Secure Hash Algorithm 1 (SHA1)", RFC 3174, September 2001.

Informational References

[IANA-MT] IANA Registry of Media Types: ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/

[RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW", RFC 1630, June 1994.

[RFC2060] Crispin, M., "Internet Message Access Protocol - Version 4rev1", RFC 2060, December 1996.

[RFC3406] Daigle, L., van Gulik, D.W., Iannella, R., and Faltstrom, P., "Uniform Resource Names (URN) Namespace Definition Mechanisms", RFC 3406, October 2002.
Contributors

   Stephanie Kollenz
   Matthias Neubauer

Author's Address

   Peter Thiemann
   Institut fuer Informatik
   Universitaet Freiburg
   Georges-Koehler-Allee 079
   D-79110 Freiburg
   Germany

   Phone: +49 761 203 8051
   EMail: thiemann@acm.org
   URL: http://www.informatik.uni-freiburg.de/~thiemann

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works.
However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assignees.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.