Internet-Draft: draft-kunze-ark-00.txt J. Kunze ARK Identifier Scheme 22 February 2001 Expires 22 August 2001 The ARK Persistent Identifier Scheme (http://www.ietf.org/internet-drafts/draft-kunze-ark-00.txt) Status of this Document This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ``work in progress.'' The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Distribution of this document is unlimited. Please send comments to jak@ckm.ucsf.edu. Copyright (C) The Internet Society (2001). All Rights Reserved. 1. Abstract This document introduces the ARK (Archival Resource Key) scheme for persistent naming, describes the ARK syntax, and shows how ARK services work. These services are defined, for now, over the HTTP protocol, given its current reign in networked information retrieval. The first step is to make an ARK actionable by giving it a semantically inert (for identity comparison) hostport combination as a prefix. This requires discovering a Name Mapping Authority (an ARK service provider) based on the Name Assigning Authority that created the ARK. Two methods for discovery are proposed: one file based, the other based on the DNS NAPTR record. The resulting string is then submitted to a web browser as the basis for the three core ARK services underpinning credible claims of persistence: access to objects, to their descriptions, and to the commitments made regarding their preservation. Different services are activated by a set of conventions for appending a `?' followed by an ARK "request". Thus, J. Kunze 1. Abstract [Page 1] Internet Draft ARK Identifier Scheme February 2001 upon release an ARK identifies not one, but three information units (data, metadata, and policy), each unit accessible under the one identifier. 2. Introduction This document describes a scheme for the high-quality naming of information resources. The scheme, called the Archival Resource Key (ARK), is ideally suited to long-term access and identification for any information resources that accommodate reasonably regular electronic description. This includes digital documents, databases, and software, as well as physical objects (such as books, bones, and bottles) and abstract objects (chemicals, diseases, vocabulary terms, performances). Hereafter the term "object" refers to an information resource. The term ARK itself refers both to the scheme and to any single identifier that conforms to it. Schemes for persistent identification of network-accessible objects are not new. In the early 1990's, the design of the Uniform Resource Name [URN], responding to the failure rate of URLs in practice, recognized the promise of indirect, non-hostname-based naming and the need for responsible name management. Meanwhile, promoters of the Digital Object Identifier [DOI] succeeded in building a community of providers around a mature software system that supports name management. The Persistent Uniform Resource Locator [PURL] was a third scheme that had the unique advantage of working with unmodified web browsers. The ARK scheme is a new approach. A founding principle of the ARK is that persistence is purely a matter of service. Persistence is neither inherent in an object nor conferred on it by a particular naming syntax. Rather, persistence is achieved through a provider's successful stewardship of objects and their identifiers. The highest level of persistence will be girded by a provider's efforts to develop contingency, redundancy, and succession strategies. It is further safeguarded to the extent that a provider's mission is shielded from marketplace and political instability. 2.1. Three Reasons to Use ARKs The first requirement of an ARK is to give users a link from an object to a promise of stewardship for it, as made by an identified service provider. No one can tell if successful stewardship will take place because no one can predict the future. Reasonable conjecture, however, may be based on past performance. There must be a way to tie a promise of persistence to a provider's demonstrated or perceived ability -- its reputation -- in that arena. Provider reputations would then rise and fall as promises are observed variously to be kept and broken. This is perhaps the best way we have for gauging the strength of any persistence promise. The second requirement of an ARK is to give users a link from an J. Kunze 2.1. ARK Reasons [Page 2] Internet Draft ARK Identifier Scheme February 2001 object to a description of it; possession of an identifier without a description does not constitute positive identification. Identifiers common today are relatively opaque, though some may contain ad hoc clues pertaining to fleeting lifecyle events, such as entry into a filesystem hierarchy. Possession of both an identifier and an object is some improvement, but positive identification may still be elusive since the object itself need not include a matching identifier or be transparent enough to reveal its identity without significant scrutiny and background research. In either case, what is called for is a record bearing witness to the identifier's association with the object, as supported by a recorded set of object characteristics. This descriptive record can act as a kind of information "receipt" that allows users and archivists briefly to inspect a retrieved object for plausible match with some recorded characteristics, for example, its title and size. (MD5 checksums as identifiers permit an easy computation to verify its association with an object, provided the object remains bit for bit unchanged over time.) The final requirement of an ARK is to give users a link to the object itself (or to a copy) if at all possible. Persistent access is the central duty of an ARK, with persistent identification playing a vital but supporting role. Object access may not be feasible for various reasons, such as catastrophic loss of the object, or licensing agreements that keep an archive "dark" for a period of years. But attempts to simplify the persistence problem by decoupling access from identification and concentrating first on the latter are of questionable utility. A perfect system for assigning forever unique identifiers might be created, but if it did so without reducing access failure rates, no one would be interested. The central issue -- which may be summed up as the "HTTP 404 Not Found" problem -- would not have been addressed. 2.2. Organizing Support for ARKs Co-location of persistent access and identification services is natural. Any organization undertaking persistent identification and description is in an advantaged position to undertake persistent access, and vice versa. The former task becomes all the easier if the organization controls, owns, or otherwise has clear access to the objects. Similarly, the latter task (persistent access) cannot be managed without at least internal support for the former task. Thus, organizing ARK services under one roof tends to make sense. ARK support is not for everybody, as the bar for participation is high. By requiring specific, revealed commitments to preservation, object access, and description, ARK services are not cheap. On the other hand, it would be hard to grant credence to a persistence promise from an organization that could not muster the minimum ARK services. Still, there is a business model for building an ARK-like, description-only service on top of another organization's full complement of ARK services; for example, there might be competition at the description level for abstracting and indexing scientific literature archived in an open pre-print repository. Such a business J. Kunze 2.2. Support for ARKs [Page 3] Internet Draft ARK Identifier Scheme February 2001 model, however, would benefit from persistence without directly supporting it. 2.3. A New Definition for Identifier Talking about persistent identifiers can be very confusing because the term "identifier" has not been carefully defined. The best working definition thus far comes as a side effect of defining the Uniform Resource Identifier in [RFC2396]. Identifier (RFC2396): a sequence of characters with a restricted syntax, that can act as a reference to something that has identity The term works for that document, but things break down when it is employed for persistence. Troubling phrases arise, such as "we want an identifier that does not break..." An identifier seems to be a sequence of characters, yet breakage isn't really about them. If it is the reference that breaks, fingers point to a wildly diverse group of suspects: browser manufacturers, the maintainer of the page where the link was found, the syntax of the link itself, the DNS administrator, the firewall, the educational system, etc. The following new definition is proposed. An identifier is an _association_ between a string (sequence of characters) and an information resource. That association is made manifest by a record (e.g., a cataloging record) that binds the identifier string to a set of identifying resource characteristics. The identifier (association) is now vouched for by a metadata record, and that association is made public in so far as the record is shared; for example, an internal identifier from an organization has limited value to outsiders since the nature of the association has not been disclosed. Without evidence of that association, a sequence of characters may not be recognizable as an identifier. The metadata record can act as a kind of association "receipt" or "declaration". Best of all, it now makes sense to speak of an identifier (an association) that "breaks". 3. ARK Anatomy J. Kunze 3. ARK Anatomy [Page 4] Internet Draft ARK Identifier Scheme February 2001 An ARK is represented by a sequence of characters (a string) that always begins with the prefix "ark:". Here is a diagrammed example. ark:foobar.zaf.org/12025/654xz321 \__/\____________/\_____/\______/ | (optional) | | ARK Prefix | | Name (assigned by the NAA) | | Name Mapping Authority Name Assigning Authority Hostport (NMAH) Number (NAAN) 3.1. The Name Mapping Authority Hostport (NMAH) After the prefix may appear an optional Name Mapping Authority Hostport (NMAH) that is a temporary address where ARK service requests may be sent. It consists of an internet hostname or host- port combination having the same format and semantics as the host- port part of a URL. The most important thing about the NMAH is that it is semantically inert from the point of view of object identification. In other words, ARKs that differ only in the optional NMAH part identify the same object. Thus, for example, the following ARKs are equivalent: ark:foobar.zaf.org/12025/654xz321 ark:sneezy.dopey.com/12025/654xz321 ark:/12025/654xz321 The NMAH makes it easy to derive an identifier that is actionable in today's web browsers (i.e., a URL). This amounts to replacing the "ark:" prefix with "http://". The first example ARK above thus becomes http://foobar.zaf.org/12025/654xz321 The NMAH part is temporary, disposable, and replaceable. Over time the NMAH will likely stop working and have to be replaced with a currently active service provider. This relies on one of the NMAH discovery processes outlined in a later section. Meanwhile, a carefully chosen NMAH is as valid as any internet domain name, and may last for a decade or longer. Users should be prepared, however, to replace the NMAH because the one found in an ARK may have stopped working at any time. The above method for creating an actionable identifier from an ARK (replacing "ark:" with "http://") is also temporary. Assuming that the reign of HTTP in information retrieval will end, one day ARKs will have to be converted into new kinds of actionable identifiers. If ARKs see widespread use, web browsers would presumably evolve to perform this simple transformation automatically. J. Kunze 3.1. ARK Hostport Part [Page 5] Internet Draft ARK Identifier Scheme February 2001 3.2. The Name Assigning Authority Number (NAAN) The next part of the ARK is the Name Assigning Authority Number (NAAN) enclosed in `/' (slash) characters. This part is always required, as it identifies the organization that originally assigned the Name of the object. It is used to discover a currently valid NMAH and it also provides top-level partitioning of the space of all ARKs. NAANs are registered in a manner similar to URN NIDs, but they are pure numbers consisting of 5 digits or 9 digits. The first 100,000 NAAs fit compactly into the 5 digits, and if growth warrants, the next billion fit into the 9 digit form. In either case the fixed odd number of digits leaves clues to reduce confusing them with, say, 4-digit dates. 3.3. The Name Part The final part of the ARK is the Name assigned by the NAA, and is also required. The Name is a string of printable ASCII characters less than 128 bytes in length, but the characters `/', `.', and `?' are reserved and must not be used. The length restriction leaves room to append ARK requests to an ARK and transport them within HTTP GET requests. The creation of names that include linguistically based constructs is strongly discouraged. Such names do not age or travel well, and names that look more or less like numbers avoid common problems that defeat persistence and international acceptance. The use of digits is highly recommended, mixed in with non-vowel alphabetic characters if compact names are desired. Hyphens are always ignored in ARKs. Hyphens may be added to an ARK for readability, or during the formatting and wrapping of text lines, but they must be treated as if they were not present. Like the NMAH, hyphens are semantically inert in comparing ARKs for equivalence. Thus, for example, the following ARKs are equivalent: ark:/12025/65-4-xz-321 ark:sneezy.dopey.com/12025/654--xz32-1 ark:/12025/654xz321 3.4. Naming Considerations Names must be chosen with great care. Poorly chosen and managed names will devastate any persistence strategy, and they do not discriminate based on naming scheme. Whether a mistakenly re- assigned identifier is a URN, DOI, PURL, URL, or ARK, the damage -- failed access -- is not mitigated more in one scheme than in another. Conversely, properly managed names will go much further towards safeguarding persistence than any choice of naming scheme or its underlying protocols. Similarly, hostnames appearing in any identifier meant to be persistent must be chosen with extreme care. While this certainly J. Kunze 3.4. ARK Naming Considerations [Page 6] Internet Draft ARK Identifier Scheme February 2001 affects names that are based on URLs and PURLs, it should be of concern in selecting names even for explicitly disposable entities such as the NMAHs that serve as temporary "booster rockets" to help make ARKs actionable. There is no excuse for a provider that manages its internal names impeccably not to apply the same skill to choosing an exceptionally durable internet name (e.g., a generic name in the ".org" domain), especially if it would form the prefix for all the provider's URL-based external names. Dubious persistence speculation should be avoided. For example, there are really no obvious reasons why the organizations registering DNS names, URN NIDs, and DOI publisher IDs should have among them one that is intrinsically more fallible than the next. Moreover, it is a misconception that the demise of DNS and of HTTP need adversely affect the persistence of URLs. Certainly URLs from the present day would not then be actionable by our present-day mechanisms, but there is no more stable a namespace than one that is dead and frozen, and resolution systems for future non-actionable URLs are no harder to imagine than for currently non-actionable URNs and DOIs. The important point, however, is that just because hostnames have been carelessly used in their brief history does not mean that they are unsuitable in NMAHs (and URLs) intended for use in situations that demand the highest levels of persistence. A well-considered naming strategy is everything. 4. Assigners of ARKs A Name Assigning Authority (NAA) is an organization that creates (or delegates creation of) long-term associations between identifiers and information objects. Examples of NAAs include national libraries, national archives, and publishers. An NAA may arrange with an external organization for identifier assignment. The US Library of Congress, for example, allows OCLC (the Online Computer Library Center, a major world cataloger of books) to create associations between Library of Congress call numbers (LCCNs) and the books that OCLC processes. An NAA does not so much create an identifier as create an association. The NAA first draws an identifier from its namespace, which is the set of all identifiers under its control. It then records the assignment of the identifier to an information object having sundry witnessed characteristics, such as a particular author and modification date. A namespace is usually reserved for an NAA by agreement with recognized community organizations (such as IANA and ISO) that all names beginning with a particular string be under its control. In the ARK an NAA is represented by the Name Assigning Authority Number (NAAN). The ARK namespace reserved for an NAA is the set of names bearing its particular NAAN. For example, all strings beginning with "ark:/12025/" are under control of the NAA registered under 12025, which might be the US National Library of Medicine. Each NAA is free J. Kunze 4. ARK Creators [Page 7] Internet Draft ARK Identifier Scheme February 2001 to assign names from its namespace (or delegate assignment) according to its own policies. These policies must be documented in a manner similar to the declarations required for URN NID registration. [ The details of ARK NAA registration have not been worked out. The next section has what might pass for informal registration. ] 5. Finding a Name Mapping Authority In order to derive an actionable identifier from an ARK, a host-port (hostname or hostname-port combination) for a working Name Mapping Authority (NMA) must be found. An NMA is a service that is able to respond to the three basic ARK service requests. NMAs make known which NAAs' identifiers they are willing to service. Upon encountering an ARK, a user (or client software) looks inside it for the optional NMAH part. If it contains an NMAH, and if the user trusts it, and if the service works, the NMAH discovery step may be skipped. Otherwise, the client looks inside the ARK again for the NAAN (Name Assigning Authority Number). Then, using a global database, it uses the NAAN to look up all current NMAHs that service ARKs issued by the identified NAA. In the interests of long-term persistence, ARK mechanisms are defined in high-level, protocol-independent terms so that mechanisms may evolve and be replaced over time without compromising fundamental service objectives. Such is the case then for mapping authority discovery and the two specific NMAH lookup methods outlined below. Either or both methods may eventually be supplanted by better methods since, by design, the ARK scheme does not depend on a particular method, but only on having some method to locate an active NMAH. At the time of issuance, at least one NMAH for an ARK should be prepared to service it. That NMA may or may not be administered by the Name Assigning Authority (NAA) that created it. Consider the following hypothetical example of providing long-term access to a cancer research journal. The publisher needs to recover costs and the National Library of Medicine needs to preserve the scholarly record. By agreement the publisher would act as the NAA and the national library would archive the journal issue when it appears, but without providing direct access for the first six months. During the first six months of peak commercial viability, the publisher would retain exclusive delivery rights and would charge access fees. Again, by agreement, both the library and the publisher would act as NMAs, but during that initial period the library would redirect requests for issues less than six months old to the publisher. At the end of the waiting period, the library would then begin servicing requests for issues older than six months by tapping directly into its own collection. Meanwhile, the publisher might redirect incoming requests for old back issues to the library. Long-term access is thereby preserved, and so is the commercial incentive to publish content. J. Kunze 5. NMAH Discovery [Page 8] Internet Draft ARK Identifier Scheme February 2001 There is never a requirement that an NAA also run an NMA service, although it seems not an unlikely scenario. Over time NAAs and NMAs will come and go. One NMA will succeed another, and there may be many NMAs serving the same ARKs simultaneously. These may be mirrored NMAs, competing NMAs, or asymmetric but coordinated NMAs, as in the library-publisher example above. 5.1. Looking Up NMAHs in a Globally Accessible File This section describes a way to look up NMAHs using a simple text file. For efficient access the file may be stored in a local filesystem, but it should be reloaded periodically to incorporate updates. It is not expected that the size of the file or frequency of update should impose an undue maintenance burden any time soon. It is modeled on the /etc/hosts file mechanism that supported internet host address lookup for several years before the advent of the Domain Name System (DNS). A copy of the current file (at the time of writing) appears in an appendix and is available on the web. It looks like this, with comment lines (lines that begin with `#') explaining the format. # # Name Assigning Authority / Name Mapping Authority Lookup Table # Last change: 22 February 2001 # Reload from: http://xxx.xxx.xxx/etc/naa2nma.txt # # Each NAA appears at the beginning of a line with the NAA Number # first, then a colon and the name of the NAA organization # # All the NMA Host-ports that service an NAA are each listed, # one per line, indented, after the corresponding NAA line. # 121: US Library of Congress foobar.zaf.org sneezy.dopey.com # 122: US National Library of Medicine lhc.nlm.nih.gov:8080 foobar.zaf.org sneezy.dopey.com # 123: US National Agriculture Library foobar.zaf.gov:80 # # The enclosed Perl script takes an NAA as argument and prints # the NMAs in this file listed under the first matching NAA. # # my $naa = shift; # while (<>) { # next if (! /^$naa:/); # while (<> && /^) { print "$10 if (/^; } # } # end of file J. Kunze 5.1. NMAH Discovery [Page 9] Internet Draft ARK Identifier Scheme February 2001 5.2. Looking up NMAHs Distributed via DNS NAPTR Records This section sketches a way to look up NMAHs using a method very similar to the method, described in [RFC2168], for locating URN resolvers. It relies on querying the DNS system already installed in the background infrastructure of most networked computers. A query is submitted asking for a list of resolvers that match a given NAAN. The query is distributed via normal DNS channels to the particular DNS servers that can best provide the answer, unless the answer is already in a local DNS cache as a side-effect of a recent query. Responses come back inside Name Authority Pointer (NAPTR) records. The result is zero or more candidate NMAHs. [ Details to follow in a revised draft. ] 6. Generic ARK Service Definition Again, ARK services must be couched in protocol-independent terms if persistence is to outlive today's infrastructural assumptions. The high-level ARK service definitions listed below are followed in the next section by a concrete method (one of many possible methods) for delivering these services with today's technology. An ARK request's output is delivered information, such as the object itself, a policy declaration (e.g., a promise of support), a descriptive metadata record, or an error message. 6.1. General ARK Access Service (access, location) Returns (a copy of) the object or a redirect to the same. A sensible object surrogate may be substituted; for example, a table of contents for a large complex object, a home page for a web site hierarchy, or a rights clearance challenge for protected data. May also return a discriminated list of alternate object locators. If access is denied, returns an explanation of the object's current (perhaps permanent) inaccessibility. 6.2. General Policy Service (permanence, naming, addressing) Returns declarations of policy and support commitments for given ARKs. Declarations are returned in either a structured metadata format or a human readable text format (one format may serve both purposes). Policy areas may be returned separately, but covered areas should including object permanence, object naming, object fragment addressing, and operational service support. The permanence declaration for an object is a rating defined with respect to the identified permanence guarantor, an includes the following aspects. (a) "object availability" -- whether and how access to the object is supported (e.g., online 24x7, or offline only), J. Kunze 6.2. Generic Policy [Page 10] Internet Draft ARK Identifier Scheme February 2001 (b) "identifier validity" -- under what conditions the ARK will be or has been re-assigned, (c) "content invariance" -- under what conditions the content of the object is subject to change, and (d) "link stability" -- whether and how the hypertext links within and among parts of the object are subject to change. Naming policy for an object includes an historical description of the NAA's (and its successor NAA's) policies regarding differentiation of objects. It includes the following aspects. (a) "similarity" -- the set of criteria, defined by the NAA, at which point two similar objects become dissimilar enough to warrant separate ARKs, and (b) "granularity" -- the limit, defined by the NAA, to the level of object subdivision below which sub-objects do not warrant separately assigned ARKs. Addressing policy for an object includes a description of how, during access, object components (e.g., paragraphs, sections) or views (e.g., image conversions) may or may not be "addressed", in other words, how the NMA permits arguments (or parameters) to modify the object delivered as the result of an ARK request. These sorts of operations, if allowed, would support things like byte-ranged fragment delivery and open-ended format conversion too numerous to list, let alone to identify with many separately assigned ARKs. Support policy includes a description of general operational aspects of the NMA service, such as after hours staffing and trouble reporting procedures. 6.3. General Description Service Returns a description of the object. Descriptions are returned in either a structured metadata format or a human readable text format (one format may serve both purposes). A description must at a minimum answer the who, what, when, and where questions concerning an expression of the object. A description must always be accompanied by the identity of the description's source and the its modification date. May also return discriminated lists of ARKs that are related to the given ARK. 7. Overview of the HTTP Key Mapping Protocol (HKMP) The HTTP Key Mapping Protocol (HKMP) is a way of taking a key (an identifier) and asking such questions as what information does this identify and how permanent is it? HKMP is in fact one specific method for delivering ARK services. It runs over HTTP in order to exploits HTTP's simplicity and the pre-eminence of the web browser J. Kunze 7. HKMP Overview [Page 11] Internet Draft ARK Identifier Scheme February 2001 user interfaces that rely on it. The method is designed so that an asker (a person or client program) can enter ARK commands directly into the location field of current browsers. The asker starts with an identifier, such as an ARK (or a URL). The identifier reveals to the asker (or should allow the asker to infer) the internet host name and port number of a server system that responds to questions. This is just the NMAH that is obtained by inspection, and possibly lookup based on the ARK's NAAN. The asker then sets up an HTTP session with the server system, sends a question via an HKMP request (contained within an HTTP request), receives an answer via an HKMP response (contained within an HTTP response), and closes the session. That concludes the connected portion of the protocol. An HKMP request is a string of characters that is appended to the identifier string after first adding a `?' (question mark). The resulting string is sent as an argument to HTTP's GET command. Request strings too long for GET may be sent using HTTP's POST command. An HKMP response is contained in the headers and message body of the HTTP response. The headers convey a few parameters relevant to the HKMP process and the HTTP message body consists of a set of one or more records. Records themselves tend to look very similar in format to HTTP response headers (also very similar to email headers), with records separated by empty lines. Precise record formatting rules and semantics are defined as for Electronic Resource Citations [ERC]. [ Although ERCs are still work in progress, a sense of how they work may be obtained from the examples below. ] Here's an example using an HTML document associated with a key given by the hypothetical URL, http://foo.bar.org:8080/12025/543cb The asker prepares to request that a copy of the information resource associated with the key be returned by appending to it the HKMP request, ?get(it) The entire client-server session looks as follows, and can easily be conducted by keyboard using an ordinary [TELNET] program. C: [opens session] C: GET http://foo.bar.org:8080/12025/543cb?get(it) HTTP/1.1 C: S: HTTP/1.1 200 OK S: Content-Type: text/html S: HKMP-Status: 200 OK S: S: J. Kunze 7. HKMP Overview [Page 12] Internet Draft ARK Identifier Scheme February 2001 S: ... (text of document) ... S: S: [closes session] The first and last lines correspond to the client's steps to start a [TCP] session with the server (e.g., starting up a TELNET program directed at the server's host and port) and the server's steps to end that session, respectively. Although they are needed for each session, to keep things simple future examples will omit them. The first two server response lines above are typical of HTTP. The next line is peculiar to HKMP, and indicates a normal return status. The remainder of the response is the stream of HTML that comprises the returned document. The particular request serviced in this example is so common that it is taken as the default request in the event that the asker appends nothing at all to the identifier string. Thus naked identifiers carried over HKMP end up requesting straightforward object access. In asking for a citation for the information associated with a key, the following session might be conducted. Here we assume that the client initiates the session and the server terminates it. The appended HKMP request is simply "?get(cite)" and the returned data is a single ERC record. Because this request is so common, the asker can simply append the HKMP request "?" to get equivalent behavior. C: GET http://foo.bar.org:8080/12025/543cb?get(cite) HTTP/1.1 C: S: HTTP/1.1 200 OK S: Content-Type: text/plain S: HKMP-Status: 200 OK S: S: erc: S: who: Dr. Seuss S: what: Green Eggs and Ham S: when: 1962 S: where: http://nma.seussfans.org/geah.txt S: erc-about: S: what: poultry products | pork products S: | fear of the unknown S: erc-from: S: who: NLM S: what: ui12345678 S: when: 2000 11 09 S: where: http://nma.nih.gov/ui12345678?%{ S: db = books S: & param1 = 259 S: & xyz = tt25_90 S: & type = basic_cite + topics S: %} Several ERC features are worth noting. If there were more than one record, each would be separated from the next by an empty line. Each J. Kunze 7. HKMP Overview [Page 13] Internet Draft ARK Identifier Scheme February 2001 ERC itself may be organized in different segments according to the different stories each segment tells. In the above example, the first story concerns an expression of an object that happens to be a children's book. The second story (containing but one element) concerns what the object is about. The third and final story concerns the provenance (or origin) of the ERC record itself. The ERC also has an appealing set of uniform lexical features. Long elements may be continued on subsequent indented lines and there is a standard way to indicate subelement boundaries. Especially long, dense strings (e.g., some URLs) can be leavened with whitespace (newlines, spaces, tabs) through an encoding trick that makes them easier for human being to read and edit; of course, decoding restores the string to its original form. The date format accommodates lists and ranges, and any element may use a special marker syntax to indicate that its data value has been inverted to create a sequence of bytes that forms a better sorting point; the marker syntax can be used to indicate how to recover the natural sequence of bytes. As a final example, here is a session in which an asker requests a permanence declaration associated with a key. The returned data is in the form of an ERC because the HKMP request is "?get(permanence)". If the request were instead "?show(permanence)", the server would be instructed to return a declaration formatted for human display; one server might still return an ERC, but another might return an HTTP Content-Type of text/html followed by an HTML document that describes the policy declaration in natural language. C: GET http://a.b.org/92284/6xz21?get(permanence) HTTP/1.1 C: S: HTTP/1.1 200 OK S: Content-Type: text/plain S: HKMP-Status: 200 OK S: S: erc-permanence: S: who: LC S: what: (:psc) Permanent, Stable Content S: when: 1997 12 04 S: identifier_validity: Permanent S: content_invariance: Subject to Correction S: resource_availability: Always Available S: erc-from: S: who: NLM S: what: ui9876543 S: when: 1999 03 24 S: where: http://nma.nih.gov/ui12345678?pstmt 8. Advice for Web Clients This section offers some advice to web clients. It is hard to write about because it tries to anticipate a series of events that may (or may not) lead to native web browser support for ARKs. J. Kunze 8. ARKs in Web Clients [Page 14] Internet Draft ARK Identifier Scheme February 2001 ARKs are envisaged to appear wherever durable object references are planned. Library cataloging records, literature citations, and bibliographies are important examples. In many of these places URLs (Uniform Resource Locators) currently stand in, and URNs, DOIs, and PURLs have been proposed as alternatives. The strings representing ARKs are thus also envisaged to appear in some of the places where URLs appear: inside hypertext links where they are not normally shown to users and as manifest text where the ARK itself may be of interest (e.g., printed in a document footer, or in search results). In many cases both the effect of a hypertext link and the manifest link are of interest, as is typically found in the results from the internet search engines. A normal URL-based HTML link looks like Click Here The same link with an ARK instead of a URL looks like Click Here Web browsers would in general require a small modification to convert this kind of ARK into a specified-host ARK, as in Click Here whence a trivial browser modification is required to treat this as Click Here No browser modification is required if the prefix is omitted, as in Click Here because the prefix "http://" is generally assumed. An NAA will typically make known the associations it creates by publishing them in catalogs, actively advertizing them, or passively leaving them on web sites for visiting indexing spiders. A valuable technique for provision of persistent objects is to try to have the identifier appear on, with, or near its retrieved object. An object could then easily be traced back to its metadata, to alternate versions, to updates, etc. This has seen reasonably widespread success, for example, in software distributions. 9. Security Considerations The ARK naming scheme poses no direct risk to computers and networks. Implementors of ARK services need to be aware of security issues when querying networks and filesystems for Name Mapping Authority services, and the concomitant risks from spoofing and obtaining J. Kunze 9. Security Considerations [Page 15] Internet Draft ARK Identifier Scheme February 2001 incorrect information. These risks are no greater for ARK mapping authority discovery than for other processes of service discovery. For example, recipients of ARKs with a specified hostport (NMAH) should treat it like a URL and be aware that the identified ARK service may no longer be operational. Similarly, ARK clients and servers subject themselves to all the risks that accompany normal operation of the protocols (e.g., HTTP, Z39.50) underlying mapping services. As specializations of such protocols, a concrete ARK service may actually limit exposure to their usual risks. Indeed, ARK services may enhance security by helping users identify long-term reliable references to information objects. On the other hand, due to extreme age of some ARK identifiers, the chances may be higher of coming across systems far enough out of the mainstream that spoofing will be harder to detect. 10. Acknowledgements The ARK scheme would not have come to maturity without many long and in-depth conversations with Rick Rodgers. The generous support of the US National Library of Medicine (NLM) is gratefully acknowledged, as are the very stimulating discussions I had with members of NLM's Permanence Working Group. 11. Author's Address John A. Kunze Center for Knowledge Management University of California, San Francisco 530 Parnassus Ave, Box 0840 San Francisco, CA 94143-0840, USA Fax: +1 415-476-4653 EMail: jak@ckm.ucsf.edu 12. References [RFC2168] Daniel, R. and M. Mealling, "Resolution of Uniform Resource Identifiers using the Domain Name System", RFC 2168, June 1997. [RFC2169] Daniel, R., "A Trivial Convention for using HTTP in URN Resolution", RFC 2169, June 1997. [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. [RFC1737] Sollins, K. and L. Masinter, "Functional Requirements for Uniform Resource Names", RFC 1737, December 1994. [RFC2396] T. Berners-Lee, R. Fielding, L. Masinter, "Uniform J. Kunze 12. References [Page 16] Internet Draft ARK Identifier Scheme February 2001 Resource Identifiers (URI): Generic Syntax", RFC 2396, August 1998. [RFC2616] R. Fielding, J. Gettys, J. Mogul, et al, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. [ERC] J. Kunze, "Electronic Resource Citations", work in progress 13. Appendix: Current /etc/arkna File # # Name Assigning Authority / Name Mapping Authority Lookup Table # Last change: 22 February 2001 # Reload from: http://xxx.xxx.xxx/etc/naa2nma.txt # # Each NAA appears at the beginning of a line with the NAA Number # first, then a colon and the name of the NAA organization # # All the NMA Host-ports that service an NAA are each listed, # one per line, indented, after the corresponding NAA line. # 121: US Library of Congress foobar.zaf.org sneezy.dopey.com # 122: US National Library of Medicine lhc.nlm.nih.gov:8080 foobar.zaf.org sneezy.dopey.com # 123: US National Agriculture Library foobar.zaf.gov:80 # # The enclosed Perl script takes an NAA as argument and prints # the NMAs in this file listed under the first matching NAA. # # my $naa = shift; # while (<>) { # next if (! /^$naa:/); # while (<> && /^) { print "$10 if (/^; } # } # end of file 14. Copyright Notice Copyright (C) The Internet Society (2001). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it J. Kunze 14. Copyright Notice [Page 17] Internet Draft ARK Identifier Scheme February 2001 or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director. Expires 22 August 2001 J. Kunze 14. Copyright Notice [Page 18] Internet Draft ARK Identifier Scheme February 2001 Table of Contents Status of this Document ........................................... 1 1. Abstract ...................................................... 1 2. Introduction .................................................. 2 2.1. Three Reasons to Use ARKs ................................... 2 2.2. Organizing Support for ARKs ................................. 3 2.3. A New Definition for Identifier ............................. 4 3. ARK Anatomy ................................................... 4 3.1. The Name Mapping Authority Hostport (NMAH) .................. 5 3.2. The Name Assigning Authority Number (NAAN) .................. 6 3.3. The Name Part ............................................... 6 3.4. Naming Considerations ....................................... 6 4. Assigners of ARKs ............................................. 7 5. Finding a Name Mapping Authority .............................. 8 5.1. Looking Up NMAHs in a Globally Accessible File .............. 9 5.2. Looking up NMAHs Distributed via DNS NAPTR Records .......... 10 6. Generic ARK Service Definition ................................ 10 6.1. General ARK Access Service (access, location) ............... 10 6.2. General Policy Service (permanence, naming, addressing) ..... 10 6.3. General Description Service ................................. 11 7. Overview of the HTTP Key Mapping Protocol (HKMP) .............. 11 8. Advice for Web Clients ........................................ 14 9. Security Considerations ....................................... 15 10. Acknowledgements ............................................. 16 11. Author's Address ............................................. 16 12. References ................................................... 16 13. Appendix: Current /etc/arkna File ........................... 17 14. Copyright Notice ............................................. 17 J. Kunze 14. Copyright Notice [Page 19]