Internet Engineering Task Force M. Castelan Castro Internet-Draft 17beta Intended status: Informational April 28, 2020 Expires: October 30, 2020 The ARK URI scheme draft-ark-uri-scheme-00 Abstract This specification defines the (ARK) URI scheme that is especially suitable for persistent identifiers. Persistent identifiers for latest version of this document: . Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on October 30, 2020. Copyright Notice Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of Castelan Castro Expires October 30, 2020 [Page 1] Internet-Draft The ARK URI scheme April 2020 the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. 1. The ARK (Archival Resource Key) identifier scheme is flexible, dereferenceable and especially suitable for persistent identifiers. 
A founding principle of the design of the ARK scheme is that persistence is a matter of service, not something conferred by any particular identifier scheme; ARK is designed to ease the task of achieving persistence.

This document specifies the technical details of the ARK system as a URI and IRI scheme and does not elaborate at length on the design rationale of the ARK system; for that see [Kunze_ARK].

2. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [req_words].

The terms "identifier", "resource", "representation", "information resource" and "non-information resource" are used as described in [webarch]. For conciseness we use the term "referent" to mean the resource identified by an identifier. Note that identifiers are strings of characters, representations are strings of octets paired with an interpretation, and resources are things such as the book "Alice's Adventures in Wonderland" by Lewis Carroll or Zermelo-Fraenkel set theory.

The notation used to describe syntax is that described in [ABNF] extended as follows: A literal preceded by " " matches any string that is equivalent when corresponding uppercase and lowercase codepoints in the range to are taken as equivalent. The syntax is augmented with set difference indicated by the operator " " whose precedence is between Alternative and Concatenation. All the syntactic terms defined in [ABNF] are referenced here.

3. "ARK" stands for Archival Resource Key. The URI scheme defined in this document is named "ARK scheme". Every identifier that uses this scheme is called an "ARK". The ARK scheme is designed to ease the creation and maintenance of persistent and dereferenceable resource identifiers. An ARK may be used to identify either an information resource or a non-information resource.
There are 3 forms of ARKs. The following is an example of a Basic ARK: . " " is the URI scheme, " " is the NAAN and " " is the Name. The Embedded ARK corresponds to the above Basic ARK; being a web URI, it can potentially be accessed by any web browser without need for specific support for the ARK scheme.

A founding principle of the design of ARK is that persistence is a matter of the service provided by the resolver servicing a persistent identifier, not something conferred by the identifier scheme itself. Users MUST NOT automatically assume that any published ARK is a persistent identifier. Publishers of ARKs that are committed to keeping an ARK persistent SHOULD make this clear to the reader. For example, a publisher MAY state "Please use the persistent identifier to reference this page".

Every piece of information included in an identifier may become invalid or obsolete with time. An opaque identifier is one that includes no manifest information about what resource it identifies. When a NAA allocates ARKs that are intended as persistent identifiers, those ARKs are RECOMMENDED to be opaque. The URI that an ARK resolves to (if any) MAY be non-opaque.

3.1. ARKs are assigned by NAAs (Name Assigning Authority). Each NAA has a unique NAAN (Name Assigning Authority Number), a string of characters that is included within all the ARKs it allocates. The NAAN is included in ARKs to partition the ARK namespace and avoid collisions between ARKs assigned by different NAAs. The NAA assigns the Name of an ARK that identifies an individual resource.

An agent that wants to obtain a NAAN in order to assign ARK identifiers MUST register with the ARK Maintenance Agency [ARK_agency]. A NAA MAY transfer control over its NAAN space to a successor organization to take care of the ARKs it assigned.
The NAAN is reserved in perpetuity to be used in examples and NAAN is reserved in perpetuity for invalid ARKs. Applications that handle ARKs SHOULD NOT handle in any special way and SHOULD recognize NAAN as being invalid; the same rationale as in [resrv_domains] applies.

The ARK Maintenance Agency keeps the authoritative public registry of all registered NAAs along with relevant associated data. See [registry]. As of the time of writing of the present document, the official registry is in a simple plain-text-based format described in a comment at the beginning of the registry itself.

Once a NAAN is assigned to a real organization that requested it (as opposed to an assignment that was made through a technical mistake or a misunderstanding), this assignment is permanent and MUST be kept in the registry for as long as the ARK system operates, which is to say, into the indefinite future.

Every NAA MUST contact the ARK Maintenance Agency when necessary to keep the prefix of its authoritative resolver up to date and in the same way SHOULD keep the other information in the registry up to date. A NAA MUST notify the ARK Maintenance Agency if it expects to stop existing or to stop operating a resolver for its NAAN indefinitely (i.e., notification is not required for temporary downtime of its resolver).

3.2. An ARK MAY include a qualifier after the Name. The qualifier can have a ComponentPath and a VariantPath.

The ComponentPath is delimited by slashes and indicates hierarchical structure. For example, the ARK has VariantPath .

Publishing an ARK with a ComponentPath implies that the ARK obtained by truncating the last segment of the ComponentPath and the preceding slash is a container for the untruncated ARK. This semantic implication extends recursively down to the ARK with no ComponentPath.
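The truncation rule just described can be illustrated with a minimal sketch. It assumes, for illustration only, that an ARK has the shape "ark:" NAAN "/" Name followed by zero or more slash-delimited ComponentPath segments and carries no VariantPath; the function name container_chain and the sample identifier are hypothetical and not taken from this specification.

```python
def container_chain(ark: str) -> list[str]:
    """List an ARK's implied containers, most general first, ending with the ARK itself.

    Assumed shape (an illustration, not the normative syntax):
    "ark:" NAAN "/" Name *( "/" segment ), with no VariantPath.
    """
    scheme, colon, rest = ark.partition(":")
    if not colon or scheme.lower() != "ark":
        raise ValueError("not an ARK: " + ark)
    parts = rest.split("/")
    # A NAAN and a Name are both required, and no segment may be empty.
    if len(parts) < 2 or any(part == "" for part in parts):
        raise ValueError("ARK needs a NAAN and a Name: " + ark)
    # parts[:2] is NAAN/Name (no ComponentPath); each extra segment adds one level.
    return [scheme + ":" + "/".join(parts[:n]) for n in range(2, len(parts) + 1)]
```

For the hypothetical input "ark:12345/x54xz321/s3/f8" the sketch lists "ark:12345/x54xz321", then "ark:12345/x54xz321/s3", then the input itself; the chain stops at the ARK with no ComponentPath because the implied semantics do not extend further.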
Thus in the above example the hierarchical structure (from most general resource to least general) is

These implied hierarchical semantics do not extend beyond the ARK with no ComponentPath. A string with no Name part like is not an ARK. If a NAA wants to allocate an ARK to refer to itself, it MUST do so by allocating a Name under its NAAN as for any other resource.

The VariantPath is delimited by periods and indicates a language version, media type version, or similar variant of the Basic ARK obtained by stripping the VariantPath. The order of components within a VariantPath is meaningless. ARKs that differ only in the order of VariantPath components identify the same resource.

Example: The following ARKs are variants of :

This specification does not define the concrete semantics of the VariantPath; a NAA SHOULD document the semantics of the ARKs it assigns and SHOULD make this documentation accessible to users that dereference the ARK (for example, by including a hyperlink that points to the policy of the NAA in assigning ARKs, which in turn describes the semantics of the VariantPath).

The use of qualifiers is entirely optional. It is RECOMMENDED that a Basic ARK without qualifiers be used to identify a generic resource (independent of media type and perhaps language). A qualifier MAY be used to identify specific variants that could be short-lived as the preferred media types and languages change in the span of decades and centuries.

3.3. A Name Mapping Authority is an agent that provides a dereference service for a set of ARKs; this service MAY be an ARK resolver operated by the NMA (see below) or it MAY be any other suitable means. Example: A NMA could operate a library that lends physical copies of the books identified by the ARKs it services.

Ideally all NMAs that service a given ARK would provide the same service.
This could fail to be the case in practice because of technical limitations or political reasons, for example:

3.4. Inflections are variations of a Basic ARK obtained by adding a URI query. The question mark that introduces the URI query is part of the inflection. If there is no URI query, the inflection is considered the empty string.

The inflections of an ARK are meant to provide information and services related to its referent. The ARK system reserves the inflection to request metadata about the referent of an ARK and the association of the ARK with its referent, including any relevant persistence statement.

Example: If is the ARK of a PDF document then and can be expected by the user to resolve to a web page with metainformation about the PDF document: title, author, date of creation and last modification, a statement of persistence (if applicable) by the NAA or NMAH, and others. The latter ARK can be entered in a web browser by a user seeking an assurance that the ARK will resolve indefinitely to this PDF document in order to use the ARK to cite the document in print.

A NMAH SHOULD implement the inflection with the semantics described in this document. MUST NOT be used for a different purpose. If a NMAH provides additional inflections, it SHOULD publicly document what they are and their meanings and make this information available to the users of its ARKs.

and are reserved for a possible standard meaning in a future revision of this specification. In prior drafts the inflection was recommended to provide metadata about the referent of the ARK and the inflection was recommended to provide information about the assignment of the ARK to its referent including a persistence statement (if applicable).
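As a minimal sketch of the definition of an inflection given in this section, the following separates an ARK from its inflection. The function name split_inflection and the sample query "?example" are hypothetical; any URI fragment is simply discarded here, which is a simplification made by this sketch rather than a rule of this specification.

```python
def split_inflection(ark: str) -> tuple[str, str]:
    """Split an ARK into (ARK without URI query, inflection).

    The inflection includes the leading question mark; it is the empty
    string when the ARK carries no URI query.
    """
    # A URI query ends where the fragment begins, so drop the fragment first.
    without_fragment = ark.split("#", 1)[0]
    base, mark, query = without_fragment.partition("?")
    # mark is "?" when a query is present and "" otherwise.
    return base, mark + query
```

With the hypothetical input "ark:12345/x54xz321?example" the sketch returns the pair ("ark:12345/x54xz321", "?example"); without a query the second element is the empty string.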
It is RECOMMENDED that an ARK resolver give the inflections and the same semantics as unless and until it is redefined in a future version.

4. The syntax of the ARK scheme is described in the context of Internationalized Resource Identifiers (IRIs). For ARKs as URIs the only difference is that only ASCII characters are allowed.

Basic ARKs are the core of the ARK system. Extended ARKs are the set of all strings allowed under the IRI scheme. Every Basic ARK is an Extended ARK. The syntax of the URI and IRI systems allows identifiers with the scheme that do not meet the production rule; those character strings are not ARKs of any type; we call them pseudo-ARKs.

The production rules , , , and are taken from [IRIs].

It is RECOMMENDED that applications do not generate Extended ARKs longer than 255 Unicode codepoints. Where a Basic ARK or an Extended ARK is expected, applications MUST NOT impose a limit on length of less than 255 codepoints (that is, Basic ARKs and Extended ARKs of 255 codepoints or shorter MUST NOT be rejected by any conforming application on the basis of length). Applications MAY support only URIs and therefore reject Extended ARKs that include non-ASCII characters.

4.1. Extended ARKs MAY be embedded as a part of another IRI (URIs are a subset of IRIs). The main application is to couple an Extended ARK with an HTTP resolver to make the ARK dereferenceable by ordinary web tools without any additional requirement on the part of the user. Embedded ARKs MUST match the production rule below.

The production rules , , , and are taken from [IRIs].

An Extended ARK combined with the of an ARK resolver is an Embedded ARK. Other specifications MAY extend the set of Embedded ARKs. The set of Embedded ARKs (as defined by the aggregate of all specifications of the Internet) is thus open-ended.

5. As an IRI scheme, the ARK scheme allows for non-ASCII Unicode characters.
It is RECOMMENDED that ARKs minted for new resources use only ASCII characters. Security issues related to Unicode are mentioned in Section 8.

ARK normalization always percent-encodes non-ASCII characters, thus leaving a longer identifier. For example, ARK normalization maps the ARK to .

5.1. Percent-encoded characters have long been allowed in the ARK system. Internationalized Resource Identifiers (IRIs) allow non-ASCII characters to be used transparently in IRIs via a mapping to percent-encoded characters. Applications widely implement this mapping; in particular, most web browsers do. Forbidding non-ASCII characters in ARKs would have been a moot point because browsers would still allow non-ASCII characters in pseudo-ARKs via the transparent IRI-to-URI mapping. Thus, a decision was made to allow non-ASCII characters in this specification and recommend against them.

The only way to reliably disallow non-ASCII characters where ARKs are expected would have been to forbid percent-encoded characters outside ASCII so that the IRI-to-URI mapping always yields invalid ARKs. However, this would have broken backward compatibility with previous versions of the ARK scheme, which allow percent-encoded characters without restriction.

5.2. It may be desirable to avoid Latin characters in a text written in a different script. In principle, resources in a fixed language that uses a script other than the Latin script could be assigned an opaque persistent identifier with characters in their native script. For example, a scientific journal that publishes articles in Russian could assign persistent identifiers like to its articles. This comes with an inconvenience for users of non-Cyrillic scripts; they will have more difficulty manually entering this ARK. Therefore, it is RECOMMENDED to avoid this practice.
Instead it is RECOMMENDED that a NAA that wants to avoid Latin characters in its identifiers mints ARKs from only decimal characters (" "-" "). Decimal characters are present in most keyboard layouts and are familiar to people around the world more so than the Latin script.

The Latin characters in the " " substring at the start of ARKs are unavoidable as long as IRIs are used; IRIs do not allow for non-ASCII characters in the scheme. In hypertext, ARKs can be published with the "ARK" part transliterated into the native script, with the rest of the identifier linked to an Embedded ARK. For example, in Russian one can write "АРК: " where is a hyperlink to .

6. Normalization is defined for Extended ARKs. Given an Extended ARK, the following algorithm produces an Extended ARK in normal form. The domain of this algorithm is only Extended ARKs as described in this specification. This algorithm is explicitly undefined for strings other than Extended ARKs.

Note that the normalization algorithm decodes all percent-encoded instances of " " in the step of URI syntax-based normalization. Those hyphens are subsequently removed. Therefore, Basic ARKs that differ only by insertion or removal of " " are equivalent.

ARKs are said to be equivalent if they have the same normal form. ARK equivalency is an equivalence relation. An Extended ARK is a Basic ARK if and only if its normal form is a Basic ARK.

The set of Extended ARKs that have the same normal form identify exactly the same resource (this is part of the ARK system independent of any NAA-specific policy in assigning ARKs). Agents MUST NOT declare conflicting assignments for equivalent ARKs; doing so is an error.

7. A resolver is an application accessible under a dereferenceable URI scheme that provides a suitable representation for ARKs under its scope. Resolvers MAY use any suitable URI scheme.
This specification only describes HTTP resolvers. Other specifications MAY describe additional methods to resolve an ARK. A resolution request is the process of using an ARK resolver to dereference an ARK.

Every NAA MUST declare at time of registration at least one prefix under which it intends to run an authoritative ARK resolver for its NAAN. Every NAA MUST send a request to the ARK Maintenance Agency [ARK_agency] when necessary to keep the set of its authoritative resolvers up to date. The official list of allocated NAANs and their authoritative resolvers is [registry].

Any method of ARK resolution SHOULD be able to distinguish whether the representation obtained is a representation of the resource identified by the ARK or a representation related to the resource identified by the ARK. This distinction is made because it is necessary for resources referenced in the Semantic Web. See [cool_URIs].

7.1. An HTTP resolver is one that is accessible through the or scheme. The of an HTTP resolver MUST match the production rule. An HTTP resolver MUST serve HTTP requests for URIs beginning with its corresponding .

Given the semantics of the HTTP protocol, resolution is only directly applicable to Extended ARKs with no URI fragment. The fragment, if present, has semantics given by the media type of the response obtained (if any) for resolving the corresponding ARK without the fragment.

ARKs that contain non-ASCII characters must be percent-encoded before resolution because the in the HTTP protocol only allows URIs (not proper IRIs). The constraints on length limitations apply to the URI resulting after this percent-encoding.

URI queries are used for inflections; their semantics and requirements are described in Section 3.4.

Clients of the HTTP resolver MUST set in the HTTP request to a Basic ARK. Servers MAY respond with an error status code for requests with a that is not a Basic ARK.
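The percent-encoding requirement above can be sketched as follows. The sketch encodes each non-ASCII character as the percent-encoded octets of its UTF-8 form, which is the core of the IRI-to-URI mapping in [IRIs]; it performs no Unicode normalization and leaves ASCII characters untouched. The function names, the resolver prefix "https://resolver.example/" and the assumption that an Embedded ARK is formed by plain concatenation are all hypothetical illustrations.

```python
def percent_encode_non_ascii(iri: str) -> str:
    """Replace every non-ASCII character by its percent-encoded UTF-8 octets."""
    encoded = []
    for ch in iri:
        if ord(ch) < 128:
            # ASCII characters, including existing "%XX" triplets, pass through.
            encoded.append(ch)
        else:
            encoded.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(encoded)


def embedded_request_uri(prefix: str, extended_ark: str) -> str:
    """Form the URI sent to an HTTP resolver (assumes plain concatenation)."""
    return prefix + percent_encode_non_ascii(extended_ark)
```

For instance, the hypothetical Extended ARK "ark:12345/é" paired with the hypothetical prefix above yields "https://resolver.example/ark:12345/%C3%A9".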
An HTTP ARK resolver MUST treat equally all resolution requests for Extended ARKs with the same normal form, with the exception that it MAY reject some Extended ARKs on the basis that they are too long.

The official resolver for the ARK system has equal to and is operated by [ARK_agency].

7.1.1. The following ARKs MUST NOT be rejected on the basis that they are too long:

When an HTTP ARK resolver declines to serve a request for resolution on the basis of length it MUST reply with the HTTP status code 414.

Note that the length limit is with respect to the length of Extended ARKs, not the Embedded ARKs used to query an ARK resolver.

Internal processing may differ provided these constraints are satisfied.

Example: A resolution request for must be treated the same as if it were for or .

An HTTP ARK resolver MAY return an error code for requests to resolve something that is not an Extended ARK.

7.1.2. If the request is for an Embedded ARK with no inflection, the reply of the resolver is to be interpreted according to the semantics of HTTP with the considerations specific to the ARK system described in this section. Note that these considerations do not apply in the case of an inflected ARK because then the request is not for the referent of the ARK, but for associated metadata instead, as described in Section 3.4.
When resolution of an ARK results in a chain of redirects (HTTP status codes 301, 302, 303, 307 and 308 MUST be recognized as redirects) followed by a success response which is not a redirect (HTTP status codes 200, 204, 206, 226 and 304 MUST be recognized as success), if a redirection has status code 303, then the resource at the final location is considered related to the ARK resolved; otherwise the resource at the final location is the referent of the ARK resolved and the representation obtained is a representation of this referent.

When a chain of redirects is followed by an error (HTTP status codes 400-599 MUST be recognized as errors) this specification does not specify any semantics; therefore, it is unspecified whether the error is of the referent of the ARK or of the resolution of the ARK.

Additional responses MAY be recognized as redirect and success or handled the same way as HTTP status code 303 provided this is consistent with the relevant specifications.

This specification does not define any semantics for HTTP requests with a URI corresponding to an HTTP ARK resolver that is not an Embedded ARK. ARK resolvers MAY provide other services under request URIs that are not Embedded ARKs.

7.1.3. The following algorithm MAY be used to resolve an ARK using an HTTP resolver. Other algorithms (whether custom or described in a specification) MAY be used instead. If a standard defines an additional resolution procedure it SHOULD follow the same intent as the reference resolution algorithm, changing only technical details necessary to adapt to the respective protocols it employs.

The reference algorithm presented below is designed to distinguish between information resources and non-information resources identified by an ARK by making use of HTTP status codes as described in [cool_URIs].

The description of the follows.
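Separately from the step-by-step description, the interpretation rules of Section 7.1.2 can be sketched in a few lines. This is an illustration under stated assumptions, not the reference algorithm: it assumes that a 303 anywhere in the chain is what matters, it takes the already-collected list of status codes as input instead of performing HTTP requests, and the function name and the result labels "referent", "see-other", "unspecified" and "unrecognized" are names invented for this sketch.

```python
REDIRECT_CODES = {301, 302, 303, 307, 308}
SUCCESS_CODES = {200, 204, 206, 226, 304}


def interpret_chain(statuses: list[int]) -> str:
    """Classify a resolution from its HTTP status codes, in the order received.

    Every code except the last must be a redirect; the last is the final reply.
    """
    if not statuses:
        raise ValueError("at least one response is required")
    *redirects, final = statuses
    if any(code not in REDIRECT_CODES for code in redirects):
        raise ValueError("only the final response may be a non-redirect")
    if 400 <= final <= 599:
        # No semantics are specified for a chain that ends in an error.
        return "unspecified"
    if final not in SUCCESS_CODES:
        # Includes a chain that ends on a redirect without being followed.
        return "unrecognized"
    # Assumption: one 303 anywhere in the chain is enough to mean that the
    # final resource is not taken to be the referent itself.
    return "see-other" if 303 in redirects else "referent"
```

For example, the chain 302, 200 is classified as "referent", the chain 301, 303, 200 as "see-other", and the chain 302, 404 as "unspecified".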
Set to the Embedded ARK formed with the prefix of the ARK resolver specified and the Extended ARK to be resolved. Set to the number specified by the user. Set to the symbol or as specified by the user. Set state to the symbol . Then while is 0 or more:

If the above loop ends because reached a negative value, return failure.

7.1.4. The ARK Maintenance Agency [ARK_agency] operates an ARK HTTP resolver at . This resolver can resolve any ARK that is globally resolvable by redirecting to the local ARK resolver as stated in [registry].

8. General security considerations of communication within computer networks apply. Ideally resolvers SHOULD be reachable via a secure means. For the case of HTTP resolvers this means using HTTP over TLS. The possibility of connecting securely to an HTTP resolver SHOULD be announced by using the URI scheme in the NMAH. If the resolver is also available under plain HTTP directly over TCP then it SHOULD use HTTP Strict Transport Security (see [HSTS]) to direct users to contact the server securely in the future.

The ARK system allows for resolution of identifiers. Many of the security implications of DNS apply. As with any resolution system, a malicious agent can operate an ARK resolver and return undesired responses. Using any ARK resolver requires trust that it will return an honest answer or error message and not a malicious answer analogous to DNS hijacking.

Using the ARK system in any way requires some trust in the ARK Maintenance Agency. There is little additional trust required in using the official ARK resolver which is operated by the ARK Maintenance Agency. It is RECOMMENDED that users use the official ARK resolver to resolve ARKs for which there is no particular reason to use another resolver.

8.1. The ARK scheme allows non-ASCII Unicode characters in the part assigned by NAAs.
See [Unicode_security] and Section 8 in [IRIs] for security implications. The NAAN is always limited to ASCII characters. If a NAA allows a non-trusted party to assign ARKs under its NAAN it SHOULD limit the character set allowed to avoid homoglyph attacks and misplaced formatting characters.

An application that displays ARKs can avoid most Unicode-related security problems by displaying ARKs in normalized form, which only uses ASCII characters. Applications that expect an ARK and allow non-ASCII characters MUST be prepared for inputs with control or formatting characters inserted maliciously and either reject the input or percent-encode the problematic characters.

The production rules of IRIs forbid characters in the range - , - which are control characters.

8.1.1. The IRI specification states in prose ([IRIs], p. 18): "IRIs MUST NOT contain bidirectional formatting characters (LRM, RLM, LRE, RLE, LRO, RLO, and PDF).". The set of bidirectional formatting characters is open-ended; therefore it is not possible to forbid all future bidirectional formatting characters in a fixed syntax other than by forbidding unallocated codepoints. For example, U+2066 (left-to-right isolate) and U+2067 (right-to-left isolate) were added in Unicode 6.3.0 after the IRI standard was written.

Applications MUST avoid passing characters with unknown semantics to other applications. For example, a program with a command-line interface that handles IRIs should avoid sending unescaped bidi formatting characters in IRIs to the terminal because they can garble the following text, unrelated to the IRI. Web software MAY place IRIs that can potentially contain formatting characters inside an XHTML element to limit the effect of bidi formatting characters to the IRI.

9. permanent Existing ARK resolvers including the central resolver . Existing NAAs registered in [registry].
Mario Xerxes Castelan Castro (Ksenia) regarding this specification; The ARK Maintenance Agency [ARK_agency] regarding the ARK system in general. ARK Maintenance Agency [ARK_agency]. This document.

10. References

[ABNF] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications", 2008.

[ARK_agency] Agency, A. M., "ARK Maintenance Agency web site", .

[cool_URIs] W3C, "Cool URIs for the Semantic Web", 2008, .

[HSTS] Hodges, J., Jackson, C., and A. Barth, "HTTP Strict Transport Security (HSTS)", 2012.

[IRIs] Duerst, M. and M. Suignard, "Internationalized Resource Identifiers (IRIs)", 2005.

[Kunze_ARK] Kunze, K., "The ARK Identifier Scheme", 2008, .

[registry] Agency, A. M., "Name Assigning Authority Number (NAAN) Registry", .

[req_words] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", 1997.

[resrv_domains] Cheshire, S. and M. Krochmal, "Special-Use Domain Names", 2013.

[Unicode_security] Davis, M. and M. Suignard, "Unicode Technical Report #36: Unicode Security Considerations, revision 15", 2014, .

[URIs] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", 2005.

[webarch] W3C, "Architecture of the World Wide Web, Volume One", 2004, .

Author's Address

Mario Xerxes Castelan Castro (Ksenia)
17beta
Email: ksenia@17beta.top