Network Working Group                                       P. Thiemann
Internet-Draft                                      Freiburg University
Category: Informational                                    20 June 2003
Expires: December 20, 2003


          A URN Namespace For Content-Based Unique Identifiers
                    draft-thiemann-cbuid-urn-00.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on December 31, 2003.

Copyright Notice

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

   This document describes a URN namespace to identify immutable octet-
   stream resources using content-based unique identifiers.  The naming
   scheme relies on an algorithm that computes identifiers from media
   types and cryptographic hashes without a central authority.

1. Conventions used in this document

   The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY"
   in this document are to be interpreted as defined in "Key words for
   use in RFCs to Indicate Requirement Levels" [RFC2119].



Thiemann                                               [Page 1]

Internet-Draft      Content-Based Unique Identifiers        30 June 2003


2. Introduction

   A URN serves as a unique name for a resource [RFC1630].  Most URN
   namespaces involve a central authority to ensure uniqueness of
   assigned names.  This approach has its merits but it requires
   organizational structures for processing requests for naming and for
   bookkeeping about used names.  Thus, acquiring a URN becomes an
   involved task not to be undertaken on a day-to-day basis.

   The cbuid proposal enables using and creating URNs on a day-to-day
   basis for storing and retrieving immutable octet-stream resources.
   It relies on a decentralized, algorithmic assignment of identifiers
   by exploiting the uniqueness guarantees of (cryptographic) hashes.
   In fact, this document contains the assignment algorithm so that
   everyone can generate identifiers in this namespace.

   The namespace has two subspaces, an untyped one and a typed one.  The
   untyped subspace considers the referenced resource as a sequence of
   octets and computes the hash directly from that representation.  The
   typed subspace relies on the media type of the resource to compute
   one or more hashes depending on the structure of the resource.  At
   present, a special typed subspace is only defined for the media types
   application/octet-stream and message/rfc822.

   This namespace specification is for a formal namespace.  The
   specification adheres to the guidelines given in "Uniform Resource
   Names (URN) Namespace Definition Mechanisms" [RFC3406].

3. Specification Template

   Namespace ID:

         "cbuid" requested.

   Registration Information:

         Registration Version Number: 1

         Registration Date: 2003-06-??

   Declared registrant of the namespace:

         The CBUID Project
         Institut fuer Informatik
         Universitaet Freiburg
         Georges-Koehler-Allee 079
         D-79110 Freiburg
         Germany



      Contact:
         Peter Thiemann
         info@www.cbuid.org

   Declaration of syntactic structure:

         The Namespace Specific Strings (NSS) of all URNs assigned by
         the schema described in this document will conform to the
         syntax defined in section 2.2 of RFC2141 [RFC2141].  The formal
         syntax of the NSS is defined by the following normative ABNF
         [RFC2234] rules for <cbuid-nss>:

   cbuid-nss   = type-spec ":" hash *1(":" type-specific-extension)
   type-spec   = "*" / media-type *parameter
   parameter   = ";" "mode" "=" 1*DIGIT / ";" token "=" token
   token       = 1*(ALPHA / DIGIT)
   hash        = hash-scheme ":" hash-value *("/" hash-value)
   hash-scheme = "md5" / "sha1" / "hash127" / hash-token
   hash-token  = token
   hash-value  = "*" / 1*HEXDIG

         The following are comments and restrictions not captured by the
         above grammar.

         A <media-type> is any MIME media type [RFC2046] which is
         registered in the appropriate IANA registry [IANA-MT].

         A <parameter> is a pair of a parameter name and a parameter
         value.  If the parameter name is "mode" then the parameter
         value is a string of digits.

         A <hash-scheme> is an alphanumeric token.  A <hash-value> may
         be "*" (unspecific) or it is a non-empty sequence of hexdigits
         whose value as a sequence of bits must be a valid hash for the
         specified hash-scheme.  If hash-scheme is "md5" or "hash127",
         then each specified hash-value consists of 32 HEXDIG.  If hash-
         scheme is "sha1", then each specified hash-value consists of 40
         HEXDIG.

         The "mode" parameter MUST specify the number of additional
         <hash-value>s.  Its value defaults to 0 if the parameter is
         omitted.  That is, mode=0 (or mode parameter omitted) means
         that there is one <hash-value>, for mode=1 there are two <hash-
         value>s, and so on.

         If the type specification is "*" (unspecific), then the hash
         part MUST specify exactly one <hash-value>, the "mode"
         parameter MUST be 0, and there MUST NOT be a <type-specific-
         extension>.  If a type is specified, then more than one <hash-
         value> MAY be given, the "mode" parameter MUST specify the
         number of extra <hash-value>s, and a <type-specific-extension>
         MAY be present.

         If an identifier contains only one <hash-value>, then this
         <hash-value> MUST NOT be "*" (unspecific).

         Examples:

           urn:cbuid:*:md5:5307d294b6ccd9854f2deed8c1628b72

           urn:cbuid:*:sha1:7660c8efbe7f656ce7612636c83a138c085bad3f

           urn:cbuid:message/rfc822:md5:5307d294b6ccd9854f2deed8c1628b72

           urn:cbuid:message/rfc822;mode=1:md5:*
                                      /d97a43ed7125019c363b00bd27411fa7

           urn:cbuid:message/rfc822;mode=1:md5
                                      :b260fb53d7ec3b530e5a6332763a2bfb
                                      /d97a43ed7125019c363b00bd27411fa7


         In the last two examples, the line break and whitespace before
         the ":" and the "/" are inserted for editorial reasons; they
         are not part of the URN.
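
         The grammar and the restrictions above can be checked
         mechanically.  The following Python sketch is illustrative
         only and is not part of the specification.  The names
         check_cbuid, NSS_RE, and HEX_LENGTHS are hypothetical, the
         <media-type> is approximated as "type/subtype" without
         consulting the IANA registry, and parameters after "*" are
         tolerated as long as the mode is 0.

```python
import re

# Hex lengths for the predefined hash schemes, as stated in the text above.
HEX_LENGTHS = {"md5": 32, "hash127": 32, "sha1": 40}

# Assumption: media-type is approximated as type "/" subtype.
NSS_RE = re.compile(
    r"^(?P<type>\*|[A-Za-z0-9][A-Za-z0-9.+-]*/[A-Za-z0-9][A-Za-z0-9.+-]*)"
    r"(?P<params>(?:;[A-Za-z0-9]+=[A-Za-z0-9]+)*)"
    r":(?P<scheme>[A-Za-z0-9]+)"
    r":(?P<hashes>(?:\*|[0-9A-Fa-f]+)(?:/(?:\*|[0-9A-Fa-f]+))*)"
    r"(?::(?P<ext>.+))?$"
)


def check_cbuid(urn: str) -> bool:
    """Return True if urn passes the syntax rules sketched here."""
    prefix = "urn:cbuid:"
    if not urn.lower().startswith(prefix):
        return False
    m = NSS_RE.match(urn[len(prefix):])
    if m is None:
        return False

    # The mode defaults to 0 when the parameter is omitted.
    mode = 0
    for param in m["params"].split(";"):
        if not param:
            continue
        name, value = param.split("=", 1)
        if name.lower() == "mode":
            if not value.isdigit():
                return False
            mode = int(value)

    hashes = m["hashes"].split("/")
    # mode counts the additional hash values beyond the first one.
    if len(hashes) != mode + 1:
        return False
    # A single hash value must not be unspecific.
    if len(hashes) == 1 and hashes[0] == "*":
        return False
    # The untyped subspace allows neither extra hashes nor an extension.
    if m["type"] == "*" and (mode != 0 or m["ext"] is not None):
        return False
    # Length check for the predefined schemes; other schemes are not checked.
    expected = HEX_LENGTHS.get(m["scheme"].lower())
    for h in hashes:
        if h != "*" and expected is not None and len(h) != expected:
            return False
    return True
```

         With this sketch, all five example identifiers above are
         accepted, while "urn:cbuid:*:md5:*" is rejected because its
         only hash value is unspecific.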

   Relevant ancillary documentation:

         None as yet.

   Identifier uniqueness considerations:

         Each identifier contains at least one cryptographic hash value
         for the referenced resource.  The probability that two
         different resources have the same hash value depends on the
         hash function.  For the md5 hash, where the hash value has 128
         bits, it is conjectured [RFC1321] that finding two resources
         with the same hash value takes on the order of 2^64 operations.
         For the sha1 hash, where the hash value has 160 bits, the
         effort required should be even greater.  In addition, if a
         procedure for creating collisions with one of these hash
         functions is found, then it is easy to extend the "cbuid" NSS
         with hash functions that provide smaller collision
         probabilities.

   Identifier persistence considerations:




         The binding between the identifier and the referenced resource
         is permanently established by the assignment algorithm that
         computes the identifier from the resource.

   Process of identifier assignment:

         Assignment is completely open, following the algorithm below.

         The inputs of the algorithm are
            - the name <hash-scheme> of a hash function
            - a media type for <type-spec>
            - a mode parameter (0 if omitted)
            - a resource (a sequence of octets)

         The algorithm applies the hash function to the resource,
         converts the resulting bit sequence into a sequence of
         hexdigits, and constructs the URN from the type-spec, the mode,
         the hash-scheme, and the sequence of hexdigits using the syntax
         described above.  Algorithms for the predefined hash functions
         are defined in the following references:

            md5      [RFC1321]
            sha1     [RFC3174]
            hash127  [HASH127]

         The conversion of a <hash-value> to a string proceeds as
         follows.  The bits in the <hash-value> are converted from most
         significant to least significant bit, four bits at a time, to
         their ASCII representation.  Each sequence of four bits is
         represented by its hexadecimal digit from "0123456789abcdef".
         That is, binary 0000 gets represented by the character '0',
         0001, by '1', and so on up to the representation of 1111 as
         'f'.
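
         As an illustration, the assignment for mode 0 and the
         conversion to hexdigits might be sketched in Python as
         follows.  The function names bits_to_hex and assign_cbuid are
         hypothetical, and the sketch covers only md5 and sha1 because
         hash127 is not available in the Python standard library.

```python
import hashlib

HEXDIGITS = "0123456789abcdef"


def bits_to_hex(digest: bytes) -> str:
    """Convert a hash value to hexdigits, most significant bits first."""
    out = []
    for octet in digest:
        out.append(HEXDIGITS[octet >> 4])    # upper four bits
        out.append(HEXDIGITS[octet & 0x0F])  # lower four bits
    return "".join(out)


def assign_cbuid(resource: bytes, hash_scheme: str = "md5",
                 type_spec: str = "*") -> str:
    """Build a mode-0 cbuid URN for an octet sequence (sketch only)."""
    if hash_scheme not in ("md5", "sha1"):
        raise ValueError("this sketch only covers md5 and sha1")
    digest = hashlib.new(hash_scheme, resource).digest()
    return f"urn:cbuid:{type_spec}:{hash_scheme}:{bits_to_hex(digest)}"
```

         For the empty octet sequence, assign_cbuid(b"") yields
         urn:cbuid:*:md5:d41d8cd98f00b204e9800998ecf8427e.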

         For a resource of type "message/rfc822", the mode parameter MAY
         be 1.  In this case two hash values are requested.  The
         algorithm computes separate hash-values for the header and the
         body of the message and constructs the identifier by
         concatenating the <type-spec>, the mode parameter, the <hash-
         scheme>, and the two <hash-value>s (hash of header, hash of
         body) according to the syntax described above.  [Note: no
         unspecific hash-values are generated by the algorithm; they are
         only used in queries.]
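
         A possible reading of the mode=1 case is sketched below in
         Python.  The function name assign_rfc822_mode1 is hypothetical.
         The sketch assumes that the header ends with the CRLF of its
         last field, that the blank separator line belongs to neither
         part, and that a message without a blank line has an empty
         body.

```python
import hashlib


def assign_rfc822_mode1(message: bytes, hash_scheme: str = "md5") -> str:
    """Build a mode=1 cbuid URN from separate header and body hashes."""
    sep = message.find(b"\r\n\r\n")
    if sep < 0:
        # Assumption: no blank line means the message has no body.
        header, body = message, b""
    else:
        header = message[:sep + 2]  # keep the CRLF ending the last field
        body = message[sep + 4:]    # skip the blank separator line
    header_hash = hashlib.new(hash_scheme, header).hexdigest()
    body_hash = hashlib.new(hash_scheme, body).hexdigest()
    return (f"urn:cbuid:message/rfc822;mode=1:{hash_scheme}:"
            f"{header_hash}/{body_hash}")
```

         Two messages with identical bodies but different headers then
         share the second <hash-value>, which is what makes the
         unspecific header form useful in queries.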

         At present, mode MUST be 0 for any other type.

   Process of identifier resolution:




         The cbuid namespace is intended for communication between a
         group of clients and a repository (for example, an email client
         and an email store in which identifiers would replace the UID
         mechanism of IMAP [RFC2060]).  Since the repository keeps all
         available resources, it implements the mechanism outlined below
         for resolution.

         At the present stage, global resolution is not intended.
         However, this might become sensible in a later revision, after
         experience has been gained.

         A local resource repository maintains a mapping from <hash-
         scheme> and <hash-value> to resources.  From that information
         it evaluates identifiers as follows.

         If an identifier contains only one (specific) hash-value, then
         it refers to the entire resource determined by the repository's
         mapping.
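
         The repository mapping for identifiers with a single specific
         hash-value could look like the following Python sketch.  The
         class name Repository and its methods are hypothetical, only
         md5 and sha1 are indexed, and identifiers with more than one
         hash-value are deliberately left out.

```python
import hashlib
from typing import Dict, Optional, Tuple


class Repository:
    """In-memory mapping from (hash-scheme, hash-value) to resources."""

    SCHEMES = ("md5", "sha1")

    def __init__(self) -> None:
        self._by_hash: Dict[Tuple[str, str], bytes] = {}

    def store(self, resource: bytes) -> None:
        # Index the resource under every supported hash scheme.
        for scheme in self.SCHEMES:
            value = hashlib.new(scheme, resource).hexdigest()
            self._by_hash[(scheme, value)] = resource

    def resolve(self, urn: str) -> Optional[bytes]:
        # Expected parts: urn, cbuid, type-spec, hash-scheme, hash-value
        parts = urn.lower().split(":")
        if len(parts) != 5 or parts[:2] != ["urn", "cbuid"]:
            return None
        scheme, value = parts[3], parts[4]
        if "/" in value or value == "*":
            return None  # multi-hash forms are outside this sketch
        return self._by_hash.get((scheme, value))
```

         Because every stored resource is indexed under each scheme, the
         same resource can be retrieved through its md5 or its sha1
         identifier.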

         The media types "application/octet-stream" and "message/rfc822"
         have fixed type-specific interpretations.  The media type
         "application/octet-stream" is equivalent to "*" (unspecific).
         It can only have one <hash-value> and refers to the entire
         resource determined by the repository's mapping.

         If the media type is "message/rfc822", then the mode parameter
         may be used to select an alternative resolution.  The permitted
         values for mode are 0 and 1.  Mode=0 selects the standard
         resolution as for "application/octet-stream".  If mode=1, then
         two <hash-value>s must be supplied.  The first <hash-value> is
         the result of applying the hash to the header of the message
         (including the CRLF terminating the last field but without the
         blank line separating header from body).  It MAY be unspecific.
         The second <hash-value> is the result of applying the hash
         function to the body of the message.  It MUST be specific.
         (See [RFC2822] for the precise definitions.)  If both hash-
         values are specific, then one specific message is referenced.
         If only the body hash is specific, then the body is returned
         with a minimal anonymized header where all fields that may
         contain personal information are blanked out.  More
         specifically, the Date field remains, the From, To, and Subject
         fields remain but are blanked out, the MIME header fields
         remain unchanged [RFC2045], but all further header fields are
         removed from the header.
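
         One way to build the minimal anonymized header is sketched
         below in Python.  The function name anonymize_header is
         hypothetical.  The sketch assumes that "blanked out" means the
         field name is kept with an empty field body, and that the MIME
         header fields are MIME-Version and all fields whose names start
         with "Content-"; neither detail is fixed by this document.

```python
def anonymize_header(header: bytes) -> bytes:
    """Return a minimal anonymized copy of an RFC 2822 header block."""
    # Collect fields, joining folded continuation lines to their field.
    fields = []
    for line in header.split(b"\r\n"):
        if not line:
            continue
        if line[:1] in (b" ", b"\t") and fields:
            fields[-1] += b"\r\n" + line
        else:
            fields.append(line)

    kept = []
    for field in fields:
        raw_name = field.split(b":", 1)[0]
        name = raw_name.strip().lower()
        if (name == b"date" or name == b"mime-version"
                or name.startswith(b"content-")):
            kept.append(field)            # remains unchanged
        elif name in (b"from", b"to", b"subject"):
            kept.append(raw_name + b":")  # remains, but blanked out
        # all further header fields are removed
    return b"".join(field + b"\r\n" for field in kept)
```

         A repository could prepend the result, followed by a blank
         line, to the body selected by the second <hash-value>.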

         For the media type "message/rfc822", a <type-specific-
         extension> MAY be present for selecting parts of a message.  In
         this case, the syntax for <type-specific-extension> is given by
         the non-terminal <enc_section> from the specification of the
         IMAP URL scheme [RFC2192].

   For illustration, here is a review and interpretation of the examples
   given above.

           urn:cbuid:*:md5:5307d294b6ccd9854f2deed8c1628b72
           urn:cbuid:*:sha1:7660c8efbe7f656ce7612636c83a138c085bad3f

         Both are references to an untyped resource using the hash value
         of the entire resource.  Their mode parameter is (implicitly)
         0.

           urn:cbuid:message/rfc822:md5:5307d294b6ccd9854f2deed8c1628b72

         A reference to an entire email message.

           urn:cbuid:message/rfc822;mode=1:md5:*
                                      /d97a43ed7125019c363b00bd27411fa7

         A reference to an anonymized message.  Mode=1 indicates that
         there are two hash values present.  [Note: the line break and
         whitespace before the "/" are inserted for editorial reasons;
         they are not part of the URN.]

           urn:cbuid:message/rfc822;mode=1:md5
                                      :b260fb53d7ec3b530e5a6332763a2bfb
                                      /d97a43ed7125019c363b00bd27411fa7

         A reference to an entire email message by header and body
         hashes.  [Note: the line break and whitespace before the ":"
         and the "/" are inserted for editorial reasons; they are not
         part of the URN.]


   Rules for Lexical Equivalence:

         Lexical equivalence is identity after normalization.  An
         identifier in the cbuid URN namespace is normalized by

            1. converting all characters to lower case
            2. eliminating the parameter "mode" if its value is 0
            3. eliminating all parameters which are not "mode"
               parameters.
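
         The three normalization steps can be sketched in Python as
         follows.  The function names normalize_cbuid and equivalent are
         hypothetical, and the sketch assumes that a non-zero mode value
         is kept exactly as written.

```python
def normalize_cbuid(urn: str) -> str:
    """Normalize a cbuid URN for lexical comparison (sketch only)."""
    urn = urn.lower()                              # step 1
    prefix = "urn:cbuid:"
    if not urn.startswith(prefix):
        raise ValueError("not a cbuid URN")
    type_spec, rest = urn[len(prefix):].split(":", 1)
    media_type, *params = type_spec.split(";")
    kept = []
    for param in params:
        name, _, value = param.partition("=")
        # Step 2 drops mode=0; step 3 drops every non-mode parameter.
        if name == "mode" and value.isdigit() and int(value) != 0:
            kept.append(f"mode={value}")
    return prefix + ";".join([media_type] + kept) + ":" + rest


def equivalent(a: str, b: str) -> bool:
    """Lexical equivalence is identity after normalization."""
    return normalize_cbuid(a) == normalize_cbuid(b)
```

         Under this sketch, an identifier written in upper case with an
         explicit mode=0 and an unknown parameter is equivalent to its
         lower-case form without parameters.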

   Conformance with URN Syntax:

         There are no additional characters reserved.



   Validation mechanism:

         Each identifier in the namespace MUST conform to the syntax
         specified above.

         Unknown parameters MUST be ignored.

   Scope:

         The namespace is global and public.

4. IANA Considerations

   This document includes a URN namespace registration that is to be
   entered into the IANA registry for URN NIDs.

5. Namespace Considerations

   Most URN namespaces are assigned to organizations and rely on a
   centralized registry to achieve uniqueness and persistence.  In
   contrast, the cbuid namespace is not tied to any organization.
   Assignment of identifiers can be performed and verified individually,
   while uniqueness is still preserved (with a probability close to 1).
   Moreover, the naming scheme is extensible with respect to the
   addressing requirements of new media types and with respect to
   improved hash functions.  The transition between hash functions does
   not mean that the "old" identifiers lose their meaning (cf. the
   discussion in BCP 66 [RFC3406]).  However, each resolver
   implementation will prefer one hash scheme over the other and will
   resolve identifiers based on the preferred scheme more quickly.

   The resources managed by the cbuid namespace are immutable octet
   streams.  The only meaningful services are storage and retrieval, as
   provided by a repository. There is no update operation.

6. Community Considerations

   We expect to make public implementations of tools available that
   manage repositories of email messages and other immutable resources.
   Addressing of resources in these repositories will rely on cbuid
   URNs.

   The main impact that we expect from the use of cbuid URNs is sharing
   of resources among the users of a single repository.  Once a
   distribution scheme has been devised (perhaps also on the basis of
   the URNs), repositories may be replicated and the benefit of sharing
   may be exploited further.



7. Security Considerations

   As long as access to a repository is regulated by a suitable form of
   access control and transport encryption, there are no security
   considerations inherent in the use of cbuid URNs.  If unregulated
   access (e.g., without authentication) to a repository is allowed,
   then only anonymized accesses should be permitted (as defined for the
   message/rfc822 type) to protect the privacy of the registered clients
   of the repository.

Normative References

   [HASH127] Bernstein, D. J., "Floating-Point Arithmetic and Message
   Authentication", http://cr.yp.to/papers/hash127.ps, March 2000.

   [RFC1321] Rivest, R. L., "The MD5 Message-Digest Algorithm", RFC
   1321, April 1992.

   [RFC2045] Freed, N., and Borenstein, N., "Multipurpose Internet Mail
   Extensions (MIME) Part One: Format of Internet Message Bodies", RFC
   2045, November 1996.

   [RFC2046] Freed, N., and Borenstein, N., "Multipurpose Internet Mail
   Extensions (MIME) Part Two: Media Types", RFC 2046, November 1996.

   [RFC2119] Bradner, S., "Key Words for Use in RFCs to Indicate
   Requirement Levels", RFC 2119, March 1997.

   [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.

   [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

   [RFC2234] Crocker, D., Editor, and P. Overell, "Augmented BNF for
   Syntax Specifications: ABNF", RFC 2234, November 1997.

   [RFC2822] Resnick, P., Editor, "Internet Message Format", RFC 2822,
   April 2001.

   [RFC3174] Eastlake, D., and Jones, P., "US Secure Hash Algorithm 1
   (SHA1)", RFC 3174, September 2001.

Informational References

   [IANA-MT] IANA Registry of Media Types: ftp://ftp.isi.edu/in-
   notes/iana/assignments/media-types/

   [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW",
   RFC 1630, June 1994.



   [RFC2060] Crispin, M., "Internet Message Access Protocol - Version
   4rev1", RFC 2060, December 1996.

   [RFC3406] Daigle, L., van Gulik, D.W., Iannella, R., and Faltstrom,
   P., "Uniform Resource Names (URN) Namespace Definition Mechanisms",
   RFC 3406, October 2002.

Contributors

         Stephanie Kollenz

         Matthias Neubauer

Author's Address

         Peter Thiemann
         Institut fuer Informatik
         Universitaet Freiburg
         Georges-Koehler-Allee 079
         D-79110 Freiburg
         Germany

         Phone: +49 761 203 8051
         EMail: thiemann@acm.org
         URL: http://www.informatik.uni-freiburg.de/~thiemann

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights.  Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11.  Copies of
   claims of rights made available for publication and any assurances of
   licenses to be made available, or the result of an attempt made to
   obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification can
   be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard.  Please address the information to the IETF Executive
   Director.




Full Copyright Statement

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assignees.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.


















