Internet-Draft                                             J. C. Mallery
draft-mallery-urn-pdi-00.txt                                      M.I.T.
Expires in six months                                  November 10, 1997

                    Persistent Document Identifiers

                 Filename: draft-mallery-urn-pdi-00.txt

Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as ``work in
   progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

Abstract

   This document specifies the syntax and semantics of the Persistent
   Document Identifier (PDI) namespace within the URN framework defined
   by RFC 2141 [17].  PDIs provide a means to refer to digital objects
   and fragments that does not depend on their storage location or the
   protocol used to access them.  Since 1994, several large-scale
   applications with these requirements have used PDIs [12] [21].

   PDIs are intended primarily as permanent identifiers for archival
   reference to long-lived documents.  PDIs have a fragment syntax to
   allow permanent references to parts of documents (within specific
   formats) as well as a citation syntax to allow references to
   appearances of such fragments in composite documents.

   PDIs are most useful for any document series that is distributed via
   multiple protocols, is available from multiple sources, migrates to
   new locations, needs fragment references, or participates in
   distributed assertion semantics related to collaboration or access
   control.

1. Namespace Syntax

1.2 Design Goals

   Persistent Document Identifiers provide a means to refer to digital
   objects and fragments that does not depend on their storage location
   or the protocol used to access them.  PDIs offer the following
   capabilities:

   * Multisourcing: The same resource can be stored in different
     locations yet retrieved by virtue of a shared identifier.

   * Multiple Protocols: Identifiers are not tied to specific transport
     protocols.

   * Persistence: PDIs persist across relocation of a digital object to
     different storage sites.  The longevity of a PDI is not limited by
     the lifetime of a directory, a domain name, or even a transport
     protocol.

   * Organizational Delegation: PDIs define a hierarchical encoding of
     the issuing authority that allows delegation in a manner analogous
     to names in the Domain Name System but more akin to X.400.

   * Chronological Delegation: PDIs incorporate a time hierarchy that
     allows delegation of identifiers with different time ranges to
     different authorities or to different resolution regimes.

   * Fragment Syntax: PDIs offer an extensible syntax for referring to
     part of a resource.  This evolutionary approach allows different
     schemes according to media type as well as multiple schemes per
     media type.  Longevity of reference is sought by defining fragment
     schemes that are independent of machine representation.
     Referential consistency is guaranteed by monotonic commitment of
     versioned PDIs to immutable resource representations.

   * Citation Syntax: PDIs include a syntax for referring to
     appearances of document fragments as quoted in other composite
     documents.  This makes fragment quotations first-class objects,
     about which assertions can be made.

   * User Friendly: PDIs carry a relatively simple syntax with some
     mnemonics so that, if need be, people can type them to access a
     resource.

   A guiding design principle for PDIs is to minimize the document
   semantics carried within the identifier.  Most semantics is better
   encoded by assertions about PDIs.
   Not only is overloading of the identifier avoided, but assertions
   can also be modified without recourse to changing the identifier.

1. Namespace Syntax

   Consistent with the URN syntax specification in RFC 2141 [17], each
   namespace must specify syntax-related information that is specific
   to that namespace.  This section provides these specifications for
   the PDI namespace.  The PDI grammar below uses ABNF [6].

   A URN using the Persistent Document Identifier namespace has the
   form:

      = "urn:" pdi                   ; Encoding in URN syntax

1.1. Namespace Identifier (NID)

   The Namespace Identifier for this namespace is "pdi", which is case
   insensitive.

      = "pdi" ":" nss                ; Persistent Document Identifier

1.2. Namespace Specific String (NSS)

   The Namespace Specific String for this namespace is:

      = resource-identifier [(citation-specifier / fragment-specifier)]

1.2.1 Resource Identifier

      = "//" document-series "/" iso-date "/" specifier
      = component *["." component] "." iso-country
      = alpha-hyphen-digits
      = 2*alpha                      ; See ISO Standard 3166 [10]
      = year "/" month "/" day
      = 4*digit / wildcard
      = 2*digit / wildcard
      = 2*digit / wildcard
      = unique-id ["." format ["." version]] ;versions require formats
      = daily-serial-number / encapulated-unique-id / digits / wildcard
      = digits
      = unique-id-chars
      = alpha / digit / other / "%" hex hex
      = media-type-token / wildcard
      = "text" / "html" / extension-token
      = alpha-hyphen
      = digits / wildcard
      = "*"

1.2.2 Citation Specifier

      = "@" origin-position "=" pdi
      = position

1.2.3 Fragment Specifier

      = "#" [fragment-scheme "="] position [*("," position)]
      = "char" / "elt" / "name" / "rect" / "msec" / "sec" / "crop" /
        "byte" / ext-fragment-scheme
      = alpha-hyphen
      = char-position / element-position / element-name /
        2-dim-coordinate / frame-number / time / byte-position /
        ext-position
      = position-specifier /
        "(" position-specifier *["," position-specifier] ")"
      = alphadigits

1.2.4 Supporting Definitions

      = %x41-5A / %x61-7A            ; A-Z / a-z
      = alphas / digits
      = alpha-hyphen / digits
      = alpha / "-"
      = *alpha-hyphen
      = *ALPHA
      = %x30-39                      ; 0-9
      = *DIGIT
      = trans / "%" hex hex          ;RFC 2141
      = alpha / digit / other / reserved
      = digit / "A" / "B" / "C" / "D" / "E" / "F" /
        "a" / "b" / "c" / "d" / "e" / "f"
      = "(" / ")" / "-" / ":" / ";" / "$" / "_" / "!" / "'"
      = "%" / "." / "," / "/" / "#" / "*" / "@" / "=" / "?" / "+"

1.2.5 Reserved Characters

   are used as special characters in the PDI grammar.  They MUST be
   encoded according to the character escaping method described in RFC
   2141 [17].

2 Discussion

2.1 Minting PDIs

   PDIs are issued by the authority named in .  is intended to look
   like a domain name for easy parsing, but there is no requirement to
   serve the name via the Domain Name System (DNS) nor to assure that
   the name is not assigned for other purposes by DNS.

   The encoded date in is the date when the identifier is minted.  This
   date is based on Greenwich Mean Time.  The encoded date bears no
   relationship to dates associated with the resource that the PDI
   denotes, even if there may be proximity between the time when the
   resource issues and the time when the PDI is minted.

   The PDI namespace is monotonic; PDIs cannot be retracted.
   If a new version of the same document issues, it MUST increment the
   version number for the previously issued PDI.  This requirement
   assures that any machine representation (byte sequence) associated
   with formats of a versioned PDI never changes.  Byte equivalence for
   all resource formats denoted by a specific PDI version ensures that
   digital signatures associated with a PDI verify correctly for any
   uncorrupted resource.  More significantly, byte equivalence enables
   reliable, efficient fragment references for many media types.  It
   eliminates the potentially difficult problem of rolling fragment
   references forward as a target resource is modified.

2.2 Issuing Authority

   The issuing authority controls the name in a document series.  These
   names are hierarchical so that administration can be delegated
   within authority domains.  Unlike domain names, the rightmost
   component of a MUST be a two-letter ISO 3166 country code [10],
   indicating the country in which the issuing organization resides.

   In most cases, a SHOULD add a term to the issuing authority in order
   to differentiate the series from other document sets that the
   authority might issue.  By specializing the document series below
   the issuing authority, identifiers reflect the chain of delegation.
   Additionally, it becomes easier to obsolete an entire document
   series, if that becomes necessary.

   For wide use of PDIs, an issuing authority will need to issue
   top-level authority names to organizations wishing to mint PDIs in
   their own document series.  Once a top-level document series name
   has been obtained, an organization may issue PDIs itself or delegate
   subseries.  A subseries is delegated by adding a name component to
   the left of .

   The accretion of components on a document series MAY utilize
   existing organizational names or acronyms whenever feasible in order
   to preserve mnemonics in the document series name.
   Additionally, dropping components from the left SHOULD lead to ever
   more general issuing authorities in terms of organizational scope.

   Delegation SHOULD follow de jure organizational structure.  Issuing
   authority SHOULD NEVER be delegated outside the organization unless
   the external agent is acting directly on behalf of the document
   series owner.  When organizational boundaries are crossed, a new
   document series top level SHOULD be acquired.

   Within an organization, issuing authority SHOULD be delegated to the
   level where responsibility for content resides.  This facilitates
   contact with document originators.  More importantly, it reduces
   administrative scope and thus encourages more uniform document
   management policies for a particular document series.

2.3 Hierarchical Date

   of a PDI MUST be assigned when the identifier is minted.  The
   calendar date MUST correspond to Greenwich Mean Time.

   Inclusion of the ISO date conveys the time when the identifier was
   minted.  Beyond making it easier to guarantee identifier uniqueness,
   hierarchicalization by date enables reference to ranges of
   identifiers issued within specific time intervals.  Use of ISO dates
   also ensures that lexical sorts of identifiers produce a
   chronological ordering of PDIs, making various listings (e.g.,
   directory lists) automatically appear in a meaningful order.

   Moreover, different administrative policies MAY be applied to any
   particular time interval.  For example, when responsibility for
   resolving PDIs shifts to a different administrative authority,
   intervals covered by the new policy are readily specified and
   conveyed.  Similarly, different intervals may be delegated to
   different URN resolvers and these delegations recorded with relevant
   URN discovery systems.

   Operations may be applied to identifiers within an interval.  For
   example, a browser can provide a directory list of all the documents
   in a year, in a month, or on a day.
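   The date hierarchy can be illustrated with a minimal Python sketch.
   It is only an illustration under stated assumptions: the helper
   names date_prefix and in_interval and the sample identifiers are
   hypothetical and not part of this specification, and the interval
   test uses a plain prefix comparison rather than the wildcard syntax
   of the grammar.

```python
def date_prefix(series, year, month=None, day=None):
    """Build the prefix shared by every PDI minted in a date interval.

    Hypothetical helper: a year, a year and month, or a full date
    selects progressively narrower intervals.
    """
    parts = [year]
    if month is not None:
        parts.append(month)
        if day is not None:
            parts.append(day)
    return "pdi://{}/{}/".format(series, "/".join(parts))


def in_interval(pdi, series, year, month=None, day=None):
    """Return True when the PDI was minted inside the given interval."""
    return pdi.startswith(date_prefix(series, year, month, day))


# Sample identifiers (assumed for illustration only).
pdis = [
    "pdi://oma.eop.gov.us/1997/09/01/1.text.1",
    "pdi://oma.eop.gov.us/1994/10/20/1.html.1",
    "pdi://oma.eop.gov.us/1997/08/15/1.text.1",
]

# Fixed-width year/month/day fields make a lexical sort chronological.
ordered = sorted(pdis)

# Directory-style listings for a year and for a single month.
listing_1997 = [p for p in ordered if in_interval(p, "oma.eop.gov.us", "1997")]
listing_sept = [p for p in ordered
                if in_interval(p, "oma.eop.gov.us", "1997", "09")]
```

   The same prefix could serve as the key under which a policy or a
   resolver delegation for that interval is recorded.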
   More generally, assertions can be made about identifiers within an
   interval, such as where to find a resolver.

2.4 Daily Unique ID

   An application may use a mnemonic name or a serial number as the .
   The only requirement is that MUST be a unique sequence of for and .

   If the unique ID is a , serial numbers SHOULD start from 1 and
   SHOULD be incremented by 1 as each new PDI is minted.  When the
   calendar day is incremented at midnight GMT, the unique ID of the
   day SHOULD be reset to start at 1 on the new day.  This prevents
   daily unique IDs from growing very large as it enforces date
   semantics on the identifier.

2.4.1 Encapsulation of Foreign Identifiers

   The specification of this field has been left open so that foreign
   document identifiers MAY be incorporated within a PDI as the daily
   unique ID.  For our purposes, a foreign identifier is any identifier
   used by other naming or reference regimes.  Examples of foreign
   identifiers include serial numbers, invoice numbers, URIs, URLs, or
   other application-specific identifiers.

   When encapsulating a foreign identifier, is required and MUST use a
   that identifies the media type of the resource and format of the
   encapsulated identifier.  The media type token is required in order
   to allow unambiguous interpretation by applications aware of the
   identifier semantics.  All other applications MUST treat the unique
   ID as opaque.

2.5 Format

   Format should use standard, controlled terms that indicate the media
   type [3] of the resource to which the identifier refers or, in the
   case of encapsulated identifiers, indicate the type of the
   encapsulated identifier.  is case insensitive.

   The standards for MIME content types [10] do not as yet provide a
   single controlled term per media type that can be used as a file
   extension or here as a PDI format.  Below we provide a rule for
   constructing the .  These tokens are created from the registered
   media types [10] by using the if it is unique, or otherwise,
   concatenating the and .
   These tokens are case insensitive and MUST encode any reserved
   characters () for PDIs.

      = major-type "/" minor-type [* (";" parameter ["=" value])]
      = minor-type / (major-type "+" minor-type)
      = alpha-hyphen-digits
      = alpha-hyphen-digits

   There are two media types for which is not :

      Token     Content Type
      text      text/plain
      header    message/header      ;RFC 822 message headers

   is always required when:

   * A PDI is minted and assigned to a specific resource.

   * A foreign document ID is encapsulated in .

   * References to resource fragments are made.

   * A client requests a resource in a specific format.

   The format indicates how to interpret encapsulated identifiers and
   MUST be supplied whenever foreign document identifiers are
   encapsulated.  For example, if an HTTP URL was encapsulated, the PDI
   might look like:

      pdi://oma.eop.gov.us/1994/10/20/http%3a%2f%2fwww%2ewhitehouse%2egov%2f.html.1

   This PDI encapsulates the URL http://www.whitehouse.gov/ and denotes
   its content on October 20, 1994, when the site was unveiled.

   When a PDI contains fragment syntax, a format MUST be provided in
   order to convey the media type of the resource to which the fragment
   reference applies.

   A server may store any subset of formats for a resource.  It may
   compute unstored formats on demand.  A client can specify the
   desired format by using a PDI with the appropriate format field.

   If format is omitted, the identifier refers to the generic resource
   denoted by the PDI.  Assertions about the generic resource apply to
   all the instantiations in the various media types indicated by the
   universe of formats in which the resource is available.

2.6 Version

   The PDI is an optional component indicating a specific version of a
   resource.  is a positive integer greater than 0.  When is omitted,
   it defaults to version 1.  Version numbers refer to the generic
   resource and not the specific format, but a resource cannot have a
   version without having at least one format.
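   The layout of unique ID, format, and version described in Sections
   2.4.1 through 2.6 can be sketched in Python as follows.  This is a
   hedged illustration, not a reference implementation.  The function
   names escape_unique_id, encapsulate, and split_pdi are hypothetical.
   The sketch assumes that every character other than an ASCII letter
   or digit is escaped with lowercase hex digits, which reproduces the
   encapsulated URL example above but escapes more than the reserved
   set strictly requires.  It also assumes a PDI that carries both
   format and version and has no fragment or citation part.

```python
import string
from urllib.parse import unquote

# Assumption: only ASCII letters and digits are left unescaped.
UNESCAPED = set(string.ascii_letters + string.digits)


def escape_unique_id(foreign_id):
    """Percent-escape a foreign identifier for use as a unique ID."""
    out = []
    for byte in foreign_id.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in UNESCAPED else "%{:02x}".format(byte))
    return "".join(out)


def encapsulate(series, iso_date, foreign_id, fmt, version=1):
    """Assemble a PDI string around an escaped foreign identifier."""
    return "pdi://{}/{}/{}.{}.{}".format(
        series, iso_date, escape_unique_id(foreign_id), fmt, version)


def split_pdi(pdi):
    """Split a fully specified PDI into its named parts.

    Works because "/" and "." inside the unique ID are escaped, so the
    remaining separators are unambiguous.
    """
    series, year, month, day, specifier = pdi[len("pdi://"):].split("/")
    unique_id, fmt, version = specifier.split(".")
    return {
        "series": series,
        "date": (year, month, day),
        "unique_id": unquote(unique_id),
        "format": fmt,
        "version": int(version),
    }


example = encapsulate("oma.eop.gov.us", "1994/10/20",
                      "http://www.whitehouse.gov/", "html", 1)
parts = split_pdi(example)
```

   An application that does not understand the encapsulated identifier
   would skip the unquote step and keep the unique ID opaque.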
   When a resource is changed in any format, version numbers for all
   formats MUST be updated.

   In general, when a resource changes significantly, applications
   SHOULD generate new PDIs.  When changes are small or incremental,
   applications SHOULD increment the version.  Any change in the byte
   count of a resource for a specific is a change and the version
   SHOULD be incremented.  Addition of a new with the same semantics as
   an existing for the PDI is not a change and does not require the
   version to be incremented.

   Consequently, if an HTML document issues under

      pdi://oma.eop.gov/1997/09/01.html.1

   and later the HTML is converted to text, the PDI for the text
   version is

      pdi://oma.eop.gov/1997/09/01.text.1

   However, if a spelling mistake is corrected later, whether or not it
   changes the byte count in any format, the version number is
   incremented.

      pdi://oma.eop.gov/1997/09/01.text.2

   An editing application MAY write internal versions of a document in
   progress and only commit to the versioned PDI at a point when the
   editing is completed and the document is ready for release.

   Version numbers MUST be included when:

   * PDIs are minted and associated with specific resources.

   * PDIs contain a fragment reference.

   * PDIs contain a fragment citation.

   Inclusion of a in a fragment reference ensures that the fragment
   reference is resolved against a consistent machine representation of
   the resource.

3 Fragment Syntax

3.1 Motivation

   The PDI namespace provides an extensible syntax for referring to
   parts of resources.  Fragment syntax must be extensible because:

   * There are too many existing media types.

   * Some media types require highly technical fragment syntax (e.g.,
     multidimensional points, multiresolution channels).

   * New media types are coming into existence all the time.

   The approach adopted here is to allow additional RFCs to extend
   fragment syntax by adding fragment specifiers as they are needed.
   The availability of a syntax for referring to resource fragments
   raises the problem of referring to citations of fragments by
   composite resources.  The PDI namespace provides a fragment citation
   syntax to address this issue.

3.2 Philosophy

3.2.1 Media Representations

   A fragment syntax SHOULD differentiate the media representation from
   the machine representation.  If fragment schemes for a particular
   media type use a media representation, they can be retargeted at new
   or different machine representations.  Otherwise, fragment schemes
   may become unresolvable in the future when machine representations
   change.  Consequently, although a byte fragment specifier is
   provided below, it SHOULD be used only for short-term purposes when
   alternatives are unavailable.

3.2.2 Immediate Fragments

   URNs require a fragment syntax because the alternative of interning
   every fragment PDI in a URN namespace does not scale.  Interning
   requires the resolver to store potentially all possible permutations
   of the fragment specifier for every resource.

   Immediate fragments require the fragment syntax to be part of the
   identifier.  With immediate fragments, resolvers need only store
   those fragment PDIs for which there are assertions beyond the
   binding to the resource subset.  Additionally, immediate fragments
   enhance privacy by not storing all references to resource subsets.
   They also conserve storage and reduce computation on resolvers.

3.2.3 Fragment Conjunctions

   The fragment syntax does not support conjunctions of fragments
   because this introduces a source of ambiguity when assertions are
   made about PDIs.  Conjunctive fragments SHOULD be handled by
   creating a new PDI and asserting that it is the conjunction of some
   fragments.  In this way, the set is explicitly represented and
   ambiguous references are excluded from the syntax.

3.2.4 Decoupling from Reference Mechanics

   Fragment reference could be accomplished by providing a program
   that, given a resource, returns the specified part.
   This is not the approach advocated here.  The fragment scheme MUST
   be a minimal set of parameters required for a program to extract the
   relevant part.  Additionally, these parameters SHOULD be specified
   in the order of importance for extracting the referent.  This
   increases the probability of finding a referent if an identifier is
   accidentally truncated.

   In general, new fragment specifiers SHOULD minimize the syntax,
   invariants, and parameters they require.

3.3 Fragment Scheme

   The indicates the position syntax used in .  A default position
   scheme should be defined for each Content Type token used in PDIs.
   For example, text/plain uses character positions as the default.

   The MAY be omitted when it is the default position scheme for the
   content type indicated by .  In all other circumstances, MUST be
   supplied in order to ensure unambiguous interpretation of position
   specifiers.  Position schemes are case insensitive.

3.4 Fragment Specifiers

   The following position reference schemes have been defined:

3.4.1 Text Fragment Specifier

   Text fragments are defined for the MIME Content Type text/*.  Each
   text fragment is an interval bounded by two character positions in
   the resource.  The fragment is the set of characters from up to but
   excluding .  The first character position starts with 0.

   Character positions are relative to the canonical, CRLF-encoded text
   for the resource.  Therefore, all text/* resources MUST be CRLF
   encoded to ensure correct fragment references.

   The PDI for text/plain is "text" and is the default position
   specifier for the media type.

      = "#" ["char" "="] start-char "," end-char
      = digits
      = digits

   Although widespread encodings for many alphabets use a single 8 bit
   byte (e.g., ISO-8859 [15]), other encodings (e.g., Unicode) employ
   multi-byte encodings.  Consequently, a server MUST be aware of the
   character set used to encode a text resource.  For 8 bit character
   sets, char fragment resolution reduces to byte position.
   However, multi-byte character sets require the server to perform
   appropriate translation from the stored data representation.

   The following PDI refers to the text starting at character 37 and
   continuing up to but excluding character 51.

      pdi://oma.eop.gov.us/1997/09/01/1.text.1#char=37,51

   Since the default fragment specifier for text is , the following PDI
   is equivalent:

      pdi://oma.eop.gov.us/1997/09/01/1.text.1#37,51

   When a text/plain content type uses a multi-byte character set, MUST
   be the character set token as defined by the IANA Character Set
   Registry [18].

3.4.2 HTML Fragment Specifier

   Fragments may be specified for the MIME Content Type text/html using
   character fragment specifiers.  The PDI for text/html is "html".
   The default position specifier for text/html is "char" because it
   simplifies serving fragments.

   Although character references are simple and effective for HTML
   document fragments, it is often more convenient to use HTML elements
   to delimit an interval within a document.  Specific HTML elements
   can be identified using the name parameter value or the position of
   the tag in the document.  In either case, the fragment consists of
   all text and HTML tags from to and including .  References to HTML
   containers are facilitated by use of a closed interval, but it can
   be awkward for tags that are not explicitly closed, especially if
   they are implicitly closed (e.g., ).

   Tag positions are counted from the start of the resource, with the
   first being assigned 0.  An refers to the first element whose name
   parameter value is equal to , which must be encoded according to URN
   syntax [17], but decoded for case-sensitive equality testing.

      = "#" start-element "," end-element
      = char-fragment-scheme / element-fragment-scheme /
        named-fragment-scheme
      = "elt"
      = element-position / element-name
      = element-position / element-name
      = digits
      = "name"
      = urn-chars

   Char, elt, and name position references MUST use the same position
   scheme for and an HTML fragment reference.

   HTML fragments may depend on surrounding context that is not part of
   the fragment.  HTML rendition without this containing context may
   produce different effects or incorrect HTML.  Responsibility for
   assuring legal and felicitous HTML must reside with the user or
   application creating the fragment reference because document authors
   cannot be expected to anticipate all possible citations.  Therefore,
   the user or application creating the fragment citation MUST NOT
   create illegal HTML fragments.

   When fragments require context, the user or application MAY create
   an intermediate document that uses fragment references to extract
   both the relevant context and the target fragment.  This
   intermediate document SHOULD be legal HTML capable of standing
   alone.

3.4.3 SGML & XML Fragment Specifier

   The element and char fragment schemes can be applied to the more
   general Standard Generalized Markup Language (SGML) [14] and
   Extensible Markup Language (XML) [4] markup languages, of which it
   is a subset.  The BNF below gives the fragment specification for
   SGML and any subsets, such as XML.

      = "#" sgml-start-element "," sgml-end-element
      = element-fragment-scheme / char-fragment-scheme
      = element-position
      = element-position

   The default fragment specifier for SGML and SGML subsets is "char".
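   Because "char" is the default scheme for text, HTML, and SGML
   resources alike, a small Python sketch of character fragment
   resolution may be useful.  It is an illustration only, under these
   assumptions: the function names to_canonical_crlf,
   parse_char_fragment, and char_fragment are hypothetical, the
   resource is already decoded into a character string (so positions
   count characters rather than bytes), and any bare CR or LF is
   normalized to CRLF before positions are applied.

```python
def to_canonical_crlf(text):
    """Normalize every line ending to CRLF (assumed canonical form)."""
    return text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")


def parse_char_fragment(fragment):
    """Parse '#char=37,51' or the default form '#37,51' into two ints."""
    if not fragment.startswith("#"):
        raise ValueError("fragment must start with '#'")
    body = fragment[1:]
    if "=" in body:
        scheme, body = body.split("=", 1)
        # Position schemes are case insensitive.
        if scheme.lower() != "char":
            raise ValueError("not a char fragment: " + scheme)
    start_text, end_text = body.split(",")
    return int(start_text), int(end_text)


def char_fragment(text, start, end):
    """Return characters from start up to but excluding end (0-based)."""
    canonical = to_canonical_crlf(text)
    if not 0 <= start <= end <= len(canonical):
        raise ValueError("positions outside the resource")
    return canonical[start:end]


start, end = parse_char_fragment("#char=5,12")
piece = char_fragment("line one\nline two\n", start, end)
```

   The half-open interval means that adjacent fragments such as 0,5
   and 5,12 never share a character.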
   The following content tokens are defined:

      text/sgml    sgml
      text/xml     xml

   The context caveats for HTML fragments should be extended pari passu
   to SGML and XML fragments.

3.4.4 Image Fragment Specifier

   Image media types use a variety of encoding schemes and some include
   multiple frames.  Fragment reference for image/* uses a
   two-dimensional Cartesian coordinate system with the origin (0, 0)
   being in the upper left-hand corner.  The scale of the coordinate
   system is the pixel-level scale of the containing image.  References
   to subrectangles are made by specifying for the image fragment the
   as the upper-leftmost point and as the lower-rightmost point.  These
   x and y coordinates are in the coordinate system of the containing
   image.

   When multiple frames are present in an image, the reference frame is
   specified by providing , which is 0 based and defaults to 0 when
   omitted.

   is the default fragment specifier for the media types image/*.

      = "#" ["rect" "="] start-coordinate "," end-coordinate ["," frame]
      = digits
      = 2-dim-coordinate
      = 2-dim-coordinate
      <2-DIM-COORDINATE> = "(" x-coordinate "," y-coordinate ")"
      = digits
      = digits

   The example below refers to an image fragment whose origin is x=5,
   y=10 and extends to x=25, y=30.  This yields the maximal rectangle
   including the coordinates (5,10), (24,10), (24,29), (5,29).  Note
   that the zero-based coordinate system does not include the point
   denoted by the .

      pdi://images.satellite.nasa.gov.us/1997/09/30/1234.gif#(5,10),(25,30)

   Since frame is unspecified, it defaults to zero and this PDI is
   equivalent to

      pdi://images.satellite.nasa.gov.us/1997/09/30/1234.gif#(5,10),(25,30),0

   The next PDI refers to the third frame in an animated GIF.  As it
   simplifies array references, the zero-based index shifts references
   to the left by 1.

      pdi://images.satellite.nasa.gov.us/1997/09/30/1234.gif#(5,10),(25,30),2

3.4.5 Audio Fragment Specifier

   Audio media types use various encoding schemes (including variable
   quality) that make byte ranges problematic for fragment references.
   Start and end times provide a coordinate scheme that can be resolved
   for any audio media type.  A fragment reference includes data from
   and including up to and excluding .

   The position scheme for the temporal reference gives the time units.
   Two time position schemes are defined: "msec" is milliseconds and
   "sec" is seconds.  Temporal position schemes MUST NOT be intermixed.
   The default time position scheme for audio/* is "sec".

   is the default fragment specifier for audio/*.

      = "#" time-position-scheme "=" start-time "," end-time
      = "msec" / "sec" / ext-time-position-scheme
      = time
      = time