Internet Engineering Task Force                           E. Zierau, Ed.
Internet-Draft                                      Royal Danish Library
Intended status: Informational                         September 5, 2019
Expires: March 8, 2020


            A Persistent Web IDentifier (PWID) URN Namespace
                    draft-pwid-urn-specification-09

Abstract

   This document specifies a Uniform Resource Name (URN) for Persistent
   Web IDentifiers for web material in web archives using the 'pwid'
   namespace identifier.

   The main purpose of the standard is to support specification of
   references that are not covered by other reference techniques, in
   particular references to material in web archives with restricted
   access.  Furthermore, it supports persistent technology agnostic
   references to web archives in general, in a form that can work as an
   algorithmic basis for finding web archive resources.  An additional
   important benefit is that the standard can be used for specifying
   web collections, which can then form a persistent computational
   basis for extraction of the archived collection parts.

   The PWID URN is designed to meet the requirements for proper
   referencing needed by researchers.  It is therefore designed to
   provide general, global, sustainable, human readable, technology
   agnostic, persistent and precise references to web materials in web
   archives.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 8, 2020.





Zierau                    Expires March 8, 2020                 [Page 1]

Internet-DraftA Persistent Web IDentifier (PWID) URN NamesSeptember 2019


Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   5
   2.  Namespace Registration Template . . . . . . . . . . . . . . .   5
   3.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  19
   4.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  19
     4.1.  Normative References  . . . . . . . . . . . . . . . . . .  19
     4.2.  Informative References  . . . . . . . . . . . . . . . . .  20
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .  22

1.  Introduction

   The PWID URN is a supplement to existing reference standards.  It
   supports references to web archives, including an area that is not
   supported today: references to material in web archives with
   restricted access.  Furthermore, the PWID URN enables technology
   agnostic references to web archives in general.  These can be
   needed, for instance, for references to dynamic web material with
   frequent updates (e.g. a news site) or to a specific version of web
   material (e.g. a specific version of the DOI handbook).

   The PWID URN is in a form which can work as an algorithmic basis for
   finding the resource.  This also enables computation of archived web
   parts to a collection from one or more web archives, if the
   collection parts are specified by PWID URNs.

   Furthermore, the PWID URN includes information about the resource
   which makes it possible to find alternative resources, in cases where
   the original precise resource has become unavailable.

   The PWID URN is designed to be a persistent reference that is
   general, global and technology agnostic in order to enhance its
   chances of being sustainable.  Furthermore, it is designed to be
   human readable and to make it possible to specify precisely what
   the referenced web archive resource covers.  This design enables a
   PWID URN to:

   o  be used in technical solutions, e.g. to make them resolvable

   o  cover references to materials from all sorts of web archives

   The motivation for defining a PWID namespace is the growing number
   of challenges in referencing archived web resources, many of which
   the PWID URN can help to overcome.  The standard is needed in order
   to reference web materials with a precision and persistency on par
   with traditional references to analogue material.  Furthermore, it
   is needed in order to address web archive resources that are not
   freely available online.  The PWID URN covers both referencing of
   web resources from research papers and definition of web
   collections/corpora.  In detail, the challenges are:

   o  Persistent Identifier systems (like DOI [DOI]) will only cover
      registered resources.  In general, citation guidelines do not
      cover general and persistent referencing techniques for web
      resources that are not registered.  However, an increasing number
      of references point to resources that only exist on the web, e.g.
      blogs that turn out to have a historical impact.  In order to
      obtain persistency for a reference, the target needs to be stable.
      For non-registered web resources, the common rule is that the
      resource will change, since the live-web is constantly changing.
      Persistency can only be obtained by referring to something stable,
      i.e. an archived version of the resource from the web.  The PWID
      URN is therefore focused on referencing archived web material in a
      technology agnostic way (research documented in [IPRES2016] and
      [ResawRef]).

   o  References to materials which only exist in web archives (i.e. no
      longer on the live web) are not well supported, especially not for
      materials that only exist in archives with restricted access.
      There are many new initiatives for web archive referencing, most
      of which are centralized solutions offering harvesting and
      referencing, but these cannot be used for materials that only
      exist in web archives.  The PWID URN can be used for all web
      archives, including web archives with restricted access.

   o  One of the referencing initiatives for open web archives uses URLs
      which depend on the current setup of the web archive's access
      platform.  These URLs are usually technology and placement
      dependent, and therefore such a reference style is not suited for
      references that are important to retrace for a long period.  The
      PWID URN can be used for such reference purposes, since it is
      technology agnostic.

   o  Another referencing initiative for open web archives omits
      specification of the web archive where the resource was found.
      This strategy is used in order to open the possibility of using
      alternatives from other archives.  However, this also adds a risk
      of imprecision, since different archives tend to have different
      versions even when harvesting at the same time.  Therefore, such a
      reference style is not suited for references where it is important
      that the reference is precisely the verified reference.  The PWID
      URN can provide an exact reference for where the reference was
      validated.  Additionally, the PWID URN contains the information
      needed to search for an alternative resource, if required.

   o  For web collections/corpora (possibly across different web
      archives), recent research has found that various legal and
      sustainability issues have led to a need for a collection
      definition consisting of references to the collection's web
      parts.  Furthermore, there is a need for a similar persistent
      referencing of all parts for calculation and sustainability
      reasons.  So far, there has been no stable standard for definition
      of such collection parts.  The PWID URN can be used for such
      definitions in order to fulfil these requirements (research
      documented in [ResawColl]).

   The PWID URN is especially useful for web material where precision is
   in focus and/or there are references to materials from web archives
   requiring special permissions in order to gain access.  The precision
   regards two aspects.  Firstly, pointing out the archive where the
   resource was found and validated against its purpose (other archived
   versions in other web archives may differ both regarding completeness
   and contents even within short time periods).  Secondly, specifying
   whether the referred resource is a web page or a part in form of one
   file.

   The possibility of specifying the part/file precision enables the
   PWID URN to be used in specification of contents of a web collection.
   Definitions of web collections are often needed for extraction of
   data used in production of research results, e.g. for future
   evaluations.  Current practices are not persistent, as they often
   rely on some CDX version, which varies between implementations.

   Strict syntax is needed for the PWID URN, in order to ensure that it
   can act as a reference which can be used for computational purposes.
   This is especially relevant for automatic extraction of parts from
   web collection definitions.  Furthermore, today's readers of research
   papers expect to be able to access a referenced resource by clicking
   an actionable URI.  A similar possibility will therefore be expected
   for references to available archived web material, and this is
   possible with a strict syntax.  A prototype for resolving PWID URNs
   has been developed for the Danish web archive data and for open web
   archives with standard patterns for the current technologies.
   Implementations for resolution of PWID URNs for other web archives
   may be developed.

   The purpose of the PWID URN is also to express a web archive
   reference as simply as possible while still meeting the
   requirements for sustainability, usability and scope.  Therefore, the
   PWID URN is focused on having only the minimum information required
   to make a precise identification of a resource in an arbitrary web
   archive.  Recent research has shown that this can be obtained with
   the following information [ResawRef]:

   o  Identification of web archive

   o  Identification of source:

      *  Archived URI or identifier

      *  Archival timestamp

   o  Intended precision (page, part/file)

   The PWID URN represents this information in a human readable way as
   well as a well-defined way that enables technical solutions to
   interpret the URN.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Namespace Registration Template

   Namespace Identifier:

      PWID

   Version:

      1

   Date:

      2019-09-06

   Registrant:

      Eld Maj-Britt Olmuetz Zierau
      Royal Danish Library
      Soeren Kierkegaards Plads 1
      1219 Copenhagen
      Denmark
      ph: +45 9132 4690
      email: elzi@kb.dk


   Purpose:

      The PWID URN is a supplement to existing reference standards.  It
      supports references to web archives, including an area that is
      not supported today: references to material in web archives with
      restricted access.  Furthermore, the PWID URN enables technology
      agnostic references to web archives in general.  These can be
      needed, for instance, for references to dynamic web material with
      frequent updates (e.g. a news site) or to a specific version of
      web material (e.g. a specific version of the DOI handbook).

      The PWID URN is in a form which can work as an algorithmic basis
      for finding the resource.  This also enables computation of
      archived web parts to a collection from one or more web archives,
      if the collection parts are specified by PWID URNs.

      Furthermore, the PWID URN includes information about the resource
      which makes it possible to find alternative resources, in cases
      where the original precise resource has become unavailable.

      The PWID URN is designed to be a persistent reference that is
      general, global and technology agnostic in order to enhance its
      chances of being sustainable.  Furthermore, it is designed to be
      human readable and to make it possible to specify precisely what
      the referenced web archive resource covers.  This design enables
      a PWID URN to:

      *  be used in technical solutions, e.g. to make them resolvable

      *  cover references to materials from all sorts of web archives

      The motivation for defining a PWID namespace is the growing
      number of challenges in referencing archived web resources, many
      of which the PWID URN can help to overcome.  The standard is
      needed in order to reference web materials with a precision and
      persistency on par with traditional references to analogue
      material.  Furthermore, it is needed in order to address web
      archive resources that are not freely available online.  The PWID
      URN covers both referencing of web resources from research papers
      and definition of web collections/corpora.  In detail, the
      challenges are:

      *  Persistent Identifier systems (like DOI [DOI]) will only cover
         registered resources.  In general, citation guidelines do not
         cover general and persistent referencing techniques for web
         resources that are not registered.  However, an increasing
         number of references point to resources that only exist on the
         web, e.g. blogs that turn out to have a historical impact.  In
         order to obtain persistency for a reference, the target needs
         to be stable.  For non-registered web resources, the common
         rule is that the resource will change, since the live-web is
         constantly changing.  Persistency can only be obtained by
         referring to something stable, i.e. an archived version of the
         resource from the web.  The PWID URN is therefore focused on
         referencing archived web material in a technology agnostic way
         (research documented in [IPRES2016] and [ResawRef]).

      *  References to materials which only exist in web archives (i.e.
         no longer on the live web) are not well supported, especially
         not for materials that only exist in archives with restricted
         access.  There are many new initiatives for web archive
         referencing, most of which are centralized solutions offering
         harvesting and referencing, but these cannot be used for
         materials that only exist in web archives.  The PWID URN can be
         used for all web archives, including web archives with
         restricted access.

      *  One of the referencing initiatives for open web archives uses
         URLs which depend on the current setup of the web archive's
         access platform.  These URLs are usually technology and
         placement dependent, and therefore such a reference style is
         not suited for references that are important to retrace for a
         long period.  The PWID URN can be used for such reference
         purposes, since it is technology agnostic.

      *  Another referencing initiative for open web archives omits
         specification of the web archive where the resource was found.
         This strategy is used in order to open the possibility of using
         alternatives from other archives.  However, this also adds a
         risk of imprecision, since different archives tend to have
         different versions even when harvesting at the same time.
         Therefore, such a reference style is not suited for references
         where it is important that the reference is precisely the
         verified reference.  The PWID URN can provide an exact
         reference for where the reference was validated.
         Additionally, the PWID URN contains the information needed to
         search for an alternative resource, if required.

      *  For web collections/corpora (possibly across different web
         archives), recent research has found that various legal and
         sustainability issues have led to a need for a collection
         definition consisting of references to the collection's web
         parts.  Furthermore, there is a need for a similar persistent
         referencing of all parts for calculation and sustainability
         reasons.  So far, there has been no stable standard for
         definition of such collection parts.  The PWID URN can be used
         for such definitions in order to fulfil these requirements
         (research documented in [ResawColl]).

      The PWID URN is especially useful for web material where precision
      is in focus and/or there are references to materials from web
      archives requiring special permissions in order to gain access.
      The precision regards two aspects.  Firstly, pointing out the
      archive where the resource was found and validated against its
      purpose (other archived versions in other web archives may differ
      both regarding completeness and contents even within short time
      periods).  Secondly, specifying whether the referred resource is
      a web page or a part in form of one file.

      The possibility of specifying the part/file precision enables the
      PWID URN to be used in specification of contents of a web
      collection.  Definitions of web collections are often needed for
      extraction of data used in production of research results, e.g.
      for future evaluations.  Current practices are not persistent, as
      they often rely on some CDX version, which varies between
      implementations.

      Strict syntax is needed for the PWID URN, in order to ensure that
      it can act as a reference which can be used for computational
      purposes.  This is especially relevant for automatic extraction of
      parts from web collection definitions.  Furthermore, today's
      readers of research papers expect to be able to access a
      referenced resource by clicking an actionable URI.  A similar
      possibility will therefore be expected for references to
      available archived web material, and this is possible with a
      strict syntax.  A prototype for resolving PWID URNs has been
      developed for the Danish web archive data and for open web
      archives with standard patterns for the current technologies.
      Implementations for resolution of PWID URNs for other web archives
      may be developed.

      The purpose of the PWID URN is also to express a web archive
      reference as simply as possible while still meeting the
      requirements for sustainability, usability and scope.  Therefore,
      the PWID URN is focused on having only the minimum information
      required to make a precise identification of a resource in an
      arbitrary web archive.  Recent research has shown that this can
      be obtained with the following information [ResawRef]:

      *  Identification of web archive

      *  Identification of source:

         +  Archived URI or identifier

         +  Archival timestamp

      *  Intended precision (page, part/file)

      The PWID URN represents this information in a human readable way
      as well as a well-defined way that enables technical solutions to
      interpret the URN.

   Syntax:

      The syntax of the PWID URN is specified below in Augmented Backus-
      Naur Form (ABNF) [RFC5234] and conforms to URN syntax defined in
      [RFC8141].  The syntax definition of the PWID URN is:

        pwid-urn = "urn:" pwid-NID ":" pwid-NSS

        pwid-NID = "pwid"
        pwid-NSS = archive-domain ":" archival-time ":" precision-spec
                              ":" archived-uri

        archival-time = utc-date ["T" utc-time] "Z"
        utc-date   = utc-year "-" utc-month "-" utc-day
        utc-year   = 4DIGIT
        utc-month  = 2DIGIT  ; 01-12
        utc-day    = 2DIGIT  ; 01-28, 01-29, 01-30, 01-31 based on
                             ; month/year in UTC time
        utc-time   = utc-hour ":" utc-minute [":" utc-second [secfrac]]
        utc-hour   = 2DIGIT  ; 00-23
        utc-minute = 2DIGIT  ; 00-59
        utc-second = 2DIGIT  ; 00-58, 00-59, 00-60 based on leap second
                             ; rules
        secfrac    = "." 1*9DIGIT

        precision-spec = "part" / "page"

      where

      *  All parts of the pwid-NSS are case insensitive, except where
         the archived-uri represents a URI with case sensitive parts.
         According to [RFC8141] (section 3.1), this means that PWID
         URNs are in general case insensitive, except in cases where
         they include a case sensitive archived URI.

      *  'archive-domain' is defined as in [RFC1034] (section 3.5).
         The 'archive-domain' must identify the web archive by the
         archive's domain, which leads to descriptions of how to access
         (or apply for access to) materials in the archive.  (This way
         of identifying the web archive is described in the
         "Assignment" section and discussed in the "Additional
         information" section.)

      *  'archival-time' is a UTC timestamp which conforms to the W3C
         profile of [ISO8601] [W3CDTF] and to a subset of the date-time
         specified in [RFC3339] (except for allowing partial time
         specification).
         The 'archival-time' may be specified at any of the levels of
         granularity, as long as it reflects exactly the granularity of
         the timestamp recorded in the archive (which is in accordance
         with the WARC standard [ISO28500]).

      *  'DIGIT' is defined as in [RFC5234].

      *  'archived-uri' is defined as 'URI' in [RFC3986] but where
         occurrences of "[", "]", "?", "#" and "%" are %-encoded in
         order not to clash with URN reserved characters [RFC8141] as
         well as having unambiguous use of "%".
         The 'archived-uri' must be the URI for the archived source.
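
      As a non-normative illustration, the syntax and the constraints
      listed above can be checked mechanically.  The following Python
      sketch is not part of the registered syntax.  The names Pwid,
      PWID_RE, BAD_URI_RE and parse_pwid are hypothetical, and the
      'archive-domain' check is a simplification of [RFC1034].  The
      example URN is assembled from the archive.org example values
      used in the "Assignment" section.

```python
import datetime
import re
from typing import NamedTuple


class Pwid(NamedTuple):
    archive_domain: str
    archival_time: str
    precision_spec: str
    archived_uri: str


# Mirrors the ABNF: urn:pwid:<domain>:<time>:<precision>:<uri>.
# The domain pattern is a simplified stand-in for RFC 1034 names.
PWID_RE = re.compile(
    r"""
    ^urn:pwid:
    (?P<domain>[a-z0-9](?:[a-z0-9.-]*[a-z0-9])?)
    :
    (?P<time>
        (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})
        (?:T(?P<hour>\d{2}):(?P<minute>\d{2})
            (?::(?P<second>\d{2})(?:\.\d{1,9})?)?
        )?
        Z
    )
    :
    (?P<precision>part|page)
    :
    (?P<uri>.+)
    $
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Raw "[", "]", "?" and "#" must not occur in archived-uri, and "%"
# must introduce a two-digit hexadecimal escape.
BAD_URI_RE = re.compile(r"[\[\]?#]|%(?![0-9A-Fa-f]{2})")


def parse_pwid(urn: str) -> Pwid:
    """Split a PWID URN into its four parts, or raise ValueError."""
    m = PWID_RE.match(urn)
    if m is None:
        raise ValueError(f"not a PWID URN: {urn!r}")
    # datetime.date raises ValueError for dates such as 2019-02-30.
    datetime.date(int(m["year"]), int(m["month"]), int(m["day"]))
    # 60 is accepted for seconds because of leap seconds.
    limits = (("hour", 23), ("minute", 59), ("second", 60))
    for name, limit in limits:
        if m[name] is not None and int(m[name]) > limit:
            raise ValueError(f"{name} out of range in {urn!r}")
    if BAD_URI_RE.search(m["uri"]):
        raise ValueError(f"archived-uri is not %-encoded: {urn!r}")
    # Only the archived URI may be case sensitive, so it is kept as is.
    return Pwid(m["domain"].lower(), m["time"].upper(),
                m["precision"].lower(), m["uri"])


example = parse_pwid(
    "urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk"
)
```

      The sketch normalizes the case insensitive parts and leaves the
      archived URI untouched, which is one possible reading of the
      case sensitivity rule above.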

      The precision specification expresses the intended precision of
      the reference, which is needed for specification of

      *  precise coverage of the reference
         e.g. to an html file, since what the reference covers can
         otherwise be ambiguous (the html file itself? the web page it
         renders to?), or of the precise web parts in a collection
         specification.

      *  degree of how precise the reference is with respect to what can
         be viewed in the future
         The html file itself will be the same.  However, for web pages
         there is interpretation involved, which means that the result
         of rendering them in the web archive can change over time.
         This may happen if the web archive changes its algorithm for
         calculating which archived web parts to use for the web page.
         It may also happen if the web page refers to parts which are
         added to the web archive later, and which therefore give
         another expression than the originally referenced expression.

      The following valid precision-spec values exist:

      *  'page'
         Meaning that an application like Wayback calculates a resulting
         web page based on calculated referenced web parts (display
         templates, images etc.).  For example, an html page displaying
         an image will need both the html and the referred image.

      *  'part'
         Meaning the single archived file/web part harvested from the
         specified URI.  For references to web pages with html code
         (i.e. pages where there is an option to "View page source"),
         this will mean the actual file with the html code.  It is
         relevant to refer to web pages this way in case the reference
         is part of a collection specification, or in case it is the
         html that is of interest (e.g. JavaScript or hidden links that
         are not visible when rendering the web page).
         For all other types of files, the URI will be for a single
         file, to be interpreted as a file.

   Assignment:

      The PWID URNs do not have to be assigned by an authority, as they
      are based on the information created at the time of archiving.  In
      other words: a PWID URN is created independently, but following an
      algorithm which ensures that the referred item can be found if it
      is still available.  A prerequisite for assignment of a PWID is
      that the web archive can be identified (with a domain describing
      the web archive) and that it has registered metadata about the
      archived URI and the time of archiving (also discussed in section
      "Additional Information").

      A PWID URN is created by finding the relevant information of the
      syntax parts of the PWID:

        "urn:pwid:" archive-domain ":" archival-time ":" precision-spec
                               ":" archived-uri

      The PWID URN for an archived item at hand can be constructed by
      replacing the unspecified PWID parts with the relevant
      information, as explained in the following:

      *  archive-domain (identification of web archive):
         This must be the domain of the web archive, which serves as
         identification of the web archive (e.g. archive.org for
         Internet Archive's open web archive and netarkivet.dk for the
         Danish web archive with restricted access).  The web archive
         domain is chosen as identification because most web archives
         have a web archive domain page that leads to a description of
         how to access the web archive, e.g. by online access or by
         applying for access grants.  Furthermore, it is more precise
         than e.g. the name of the archive, since there may be more than
         one installation of web archives at the same organization, e.g.
         archive.org and archive-it.org are both covered by Internet
         Archive.
         A more precise and persistent identification would require a
         formal registry of web archives, but such a registry does not
         yet exist.

      *  archival-time (archival timestamp):
         The archival time for the archived item must be specified with
         as much granularity as possible in order to make sure it
         uniquely identifies the resource at hand.  The archival time
         may be displayed along with the archived item, but there are
         different implementations.  It is important to be aware of
         whether a more precise timestamp can be found, and whether the
         correct timestamp is used.  In many Wayback implementations,
         the precise timestamp can be found as part of the URI used for
         viewing the archived item.  For example, in the archive URI
         https://web.archive.org/web/20160122100823/https://www.dr.dk
         for an archived resource viewable via the Internet Archive's
         Wayback installation, the number 20160122100823 represents the
         archival time 2016-01-22T10:08:23Z.  In other installations,
         the most precise timestamp may be found in the URI from a
         search result leading to the resource (which usually redirects
         on basis of a call to the underlying archive index).
         Especially for web pages with frames, there may be cases where
         the actual time is not displayed with the source, since only
         the times for the contents of the frames are displayed.

      *  precision-spec (part or page):
         The precision specification specifies how the referred item
         should be regarded.  A typical PWID URN reference in a paper
         would be 'page', where a tool will be needed to render the web
         page.  Alternatively, the precision-spec can be 'part', which
         is the most precise reference since it references a specific
         file where no additional calculations are needed (e.g. as part
         of a collection, a specific html file with hidden links or to
         indicate that a single image is referenced).  To see whether a
         viewed browser page is a computed web page or a single file,
         browsers offer a "View page source" function, which is not
         available for single files.

      *  archived-uri (archived URI):
         The URI that was harvested by the web archive for the
         referenced resource.
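
      The construction steps above can be sketched in Python as
      follows.  This is a non-normative illustration under stated
      assumptions: the function names are hypothetical, and the
      "/web/<14 digits>/" URL layout is generalized from the single
      Internet Archive example above and may not hold for other
      Wayback installations.  The archive domain is passed in
      explicitly, since the host of the viewing URI (web.archive.org)
      differs from the identifying domain (archive.org).

```python
import re

# Assumed Wayback layout: <scheme>://<host>/web/<YYYYMMDDhhmmss>/<uri>
WAYBACK_RE = re.compile(r"^https?://[^/]+/web/(\d{14})/(.+)$")


def wayback_time_to_archival_time(ts: str) -> str:
    """Turn a 14-digit Wayback timestamp into an archival-time."""
    if not re.fullmatch(r"\d{14}", ts):
        raise ValueError(f"expected 14 digits, got {ts!r}")
    return (f"{ts[0:4]}-{ts[4:6]}-{ts[6:8]}"
            f"T{ts[8:10]}:{ts[10:12]}:{ts[12:14]}Z")


def encode_archived_uri(uri: str) -> str:
    """%-encode "%", "[", "]", "?" and "#" in the archived URI."""
    # "%" is handled first so that the escapes added for the other
    # characters are not encoded a second time.
    for ch in "%[]?#":
        uri = uri.replace(ch, f"%{ord(ch):02X}")
    return uri


def make_pwid(archive_domain: str, archival_time: str,
              precision_spec: str, archived_uri: str) -> str:
    """Assemble a PWID URN from its four parts."""
    if precision_spec not in ("part", "page"):
        raise ValueError(f"invalid precision-spec: {precision_spec!r}")
    return (f"urn:pwid:{archive_domain}:{archival_time}:"
            f"{precision_spec}:{encode_archived_uri(archived_uri)}")


def pwid_from_wayback_url(url: str, archive_domain: str,
                          precision_spec: str = "page") -> str:
    """Build a PWID URN from a Wayback viewing URI."""
    m = WAYBACK_RE.match(url)
    if m is None:
        raise ValueError(f"unrecognized Wayback URI: {url!r}")
    return make_pwid(archive_domain,
                     wayback_time_to_archival_time(m.group(1)),
                     precision_spec, m.group(2))


print(pwid_from_wayback_url(
    "https://web.archive.org/web/20160122100823/https://www.dr.dk",
    "archive.org"))
```

      The sketch does not verify that the timestamp found in the
      viewing URI is the one recorded in the archive, so the caveats
      about timestamp precision given above still apply.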

      A much easier way to construct PWID URNs is to use tools that
      construct them.  Currently, there is a prototype of a SOLR-Wayback
      tool (source at https://github.com/netarchivesuite/solrwayback)
      [PWIDprovider], which can assist in finding the most precise
      reference to an archived web page.  This Wayback version can
      provide all PWID URNs belonging to a shown page.  The PWID URN
      for the page itself is provided at the top of the PWID URN list
      with 'part' precision.  The page PWID URN can thus be obtained by
      replacing 'part' with 'page', or all provided PWID URNs can be
      taken and used, e.g. in a collection definition.

   Security and Privacy:

      Security and privacy considerations are restricted to accessible
      web resources in web archives.  Resolution of PWID URNs will
      usually only be possible through the web archives' access tools,
      in which case security and privacy are handled by these tools.

      It should be noted that an archived web page or part could be just
      as dangerous as a "live" page or part; for instance, it could
      include insecure scripts, malware, trackers, etc.  Furthermore, an
      archived page can in fact be more dangerous, because it could
      include outdated scripts with known vulnerabilities that can never
      be patched because the script is archived for all time in a
      vulnerable state.

   Interoperability:

      This is covered by comments in the Syntax description:

      *  the PWID URN conforms to the URI standard as defined in
         [RFC3986] and the URN standard [RFC8141]

      *  the 'archival-time' of the PWID URN conforms to the UTC
         timestamp as described in the W3C profile of ISO 8601 [ISO8601]
         [W3CDTF] and is in accordance with the WARC standard ISO 28500
         [ISO28500].

      *  for 'archived-uri', this URI conforms to the URI standard as
         defined in [RFC3986], with %-encodings of "[", "]", "#", "?"
         and "%" in order to conform to the URN standard [RFC8141] as
         well as having unambiguous use of "%"
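
      The %-encoding of the archived URI mentioned in the last item can
      be sketched as follows.  This is an illustration only: the
      function names are hypothetical, and the sketch encodes exactly
      the five characters listed above and nothing else.  Replacing all
      five characters in a single pass ensures that a "%" produced by
      the encoding is not itself encoded again.

```python
import re

# The five characters named in the text: "[", "]", "#", "?" and "%".
_TO_ENCODE = re.compile(r"[\[\]#?%]")
# Their %-encoded forms, matched case-insensitively when decoding.
_TO_DECODE = re.compile(r"%(5B|5D|23|3F|25)", re.IGNORECASE)


def encode_archived_uri(uri: str) -> str:
    """%-encode the five listed characters in one pass."""
    return _TO_ENCODE.sub(lambda m: f"%{ord(m.group()):02X}", uri)


def decode_archived_uri(encoded: str) -> str:
    """Reverse encode_archived_uri, again in one pass."""
    return _TO_DECODE.sub(lambda m: chr(int(m.group(1), 16)), encoded)


print(encode_archived_uri("http://example.org/a?b=[1]#c"))
# http://example.org/a%3Fb=%5B1%5D%23c
```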

   Resolution:

      The information in a PWID URN can be used for locating a web
      archive resource, for any kind of web archive.  It includes the
      minimum information for web archive materials, which enables
      resolvability, manually or by a resolver.  Resolution of a PWID
      URN is the primary motivation for making a formal URN definition,
      instead of just a textual representation of the needed parts of a
      PWID.

      Resolution is done based on the PWID parts.  This can be done
      manually by using information from the PWID parts to look up the
      web archive and use the web archive's tools to search for the
      resource.  It can also be done automatically by using the
      information from the PWID parts to construct a URI that locates
      the archived resource on the internet (for online web archives) or
      on a local restricted network (for web archives with access
      restrictions).  The relevant information from the PWID parts is:

      *  Web archive domain for web archive holding referred resource
         The domain name for the web archive.  For the manual solution,
         this domain is used to find a description of how to access the
         web archive's materials.  For example, "archive.org" is the
         domain name leading to the Internet Archive's interface to
         their online web collection, and "netarkivet.dk" is the domain
         name leading to the website for the Danish web archive with
         information about how to apply for access permission to the web
         collections.  For an automatic solution, the domain will be
         used to identify how to calculate the pattern for the URI for
         the resource.

      *  Archived URI of archived resource
         For the manual solution, the archived URI must be used in the
         search for the resource.  For the automatic solution, it is
         used as a parameter for construction of the URI for the
         resource.

      *  Date and time associated with the archived item
         The archival date and time must be used in manual search for
         the resource or as parameter to automatic construction of the
         URI for the resource.

      *  Precision of what is referred
         The precision contributes to the guidance of how to view the
         referred item.  If the precision is 'page', the resource must
         be browsed using the web archive browsing tool, which computes
         all parts needed for browsing of the page.  If the precision is
         'part', the "View page source" browser function can be used for
         pages to get the referred resource.  If the resource is a
         single file, this option is not activated, since the full
         resource is already shown.  The 'part' precision can also be an
         indicator for tools (e.g. a collection extraction tool) that
         they can fetch the contents by fetching the file pointed to.

      In the following, the different resolution techniques are
      explained (manual as well as via a service) using the following
      PWID URN as an example:

      urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk

      In this example the information from the PWID URN parts is:

      *  "archive.org"
         Currently known identifier in form of the Internet Archive
         domain name for their open access web archive.

      *  "2016-01-22T10:08:23Z"
         UTC date and time associated with the archived URI

      *  "page"
         Clarification that the reference covers the full web page with
         all its inherited parts selected by the web archive

      *  "https://www.dr.dk"
         archived URI of the referenced resource
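
      The split into parts shown above can be sketched with a regular
      expression.  This is a simplified illustration, not the formal
      syntax of this specification: the names Pwid and parse_pwid_urn
      are hypothetical, and the pattern only accepts timestamps in the
      full-second UTC form used in the example.  Because both the
      timestamp and the archived URI contain ":", the URN cannot simply
      be split on every colon.

```python
import re
from typing import NamedTuple


class Pwid(NamedTuple):
    archive_id: str
    archival_time: str
    precision_spec: str
    archived_uri: str


# Simplified pattern (assumption): only the "urn:pwid" prefix is matched
# case-insensitively, and only timestamps like 2016-01-22T10:08:23Z are
# accepted.
_PWID_PATTERN = re.compile(
    r"^(?i:urn:pwid):"
    r"(?P<archive_id>[^:]+):"
    r"(?P<archival_time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z):"
    r"(?P<precision_spec>part|page):"
    r"(?P<archived_uri>.+)$"
)


def parse_pwid_urn(urn: str) -> Pwid:
    """Split a PWID URN into its four parts, or raise ValueError."""
    match = _PWID_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a PWID URN in the simplified form: {urn!r}")
    return Pwid(**match.groupdict())


parts = parse_pwid_urn(
    "urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk"
)
print(parts.archive_id, parts.archival_time, parts.precision_spec, parts.archived_uri)
# archive.org 2016-01-22T10:08:23Z page https://www.dr.dk
```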

      A manual resolution technique would be to go through the following
      steps using the specified web archive's search interface (which
      will work for both open web archives and web archives with
      restricted access onsite):

      *  Browse the web archive domain "archive.org"
         In this case, the domain leads directly into a page where you
         can search for archived URIs (in other cases there may be a
         need for additional clicks to get to the search interface or
         descriptions of how to apply for access).

      *  Enter the archived URI "https://www.dr.dk" in the search field
         and make a search, which will result in an overview of the
         different times that the resource was archived.

      *  Use the archival time "2016-01-22T10:08:23Z" to select the
         correct resource

      The "page" information is used in verification that the right
      precision level is reached.  In case the precision-spec had been
      'part', it would require an extra step selecting "View page
      source" on the resulting page.

      It is also noteworthy that the information in the PWID can help in
      finding an alternative resource, in case the original referred
      resource is no longer available.  The archived URI can be searched
      in other web archives, where the date and time can help to find
      the best match, e.g. via Memento [MEMENTO] (for some open web
      archives) or via possible future web archive infrastructures.

      Alternative resolution (automatically or manually) of this PWID
      URN can be deduced based on the current (2019) knowledge of
      Internet Archive's open Wayback access web interface, which has
      the pattern:

         https://web.archive.org/web/<time>/<uri>

      Using this pattern (where only the digits from the timestamp are
      included), it is possible to deduce the online https URI:

         https://web.archive.org/web/20160122100823/https://www.dr.dk

      The same recipe can be used for other Wayback platforms for open
      web archives.  For web archives with restricted access, there may
      be similar recipes, but it may also require special applications
      to extract the local URI for the resource (e.g. for Netarkivet, it
      is constructed using an API which uses the local CDX to generate
      the correct local URI for the resource).
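
      The Wayback recipe above can be sketched as follows.  This is an
      illustration only: the function name and its parameters are
      hypothetical, the base address is the 2019 Internet Archive
      pattern quoted above, and the sketch assumes that the archived
      URI contains none of the %-encoded characters.

```python
def wayback_uri(archival_time: str, archived_uri: str,
                base: str = "https://web.archive.org/web") -> str:
    """Fill in the <base>/<time>/<uri> pattern for a Wayback platform."""
    # Keep only the digits of the timestamp, e.g.
    # 2016-01-22T10:08:23Z -> 20160122100823.
    digits = "".join(ch for ch in archival_time if ch.isdigit())
    return f"{base}/{digits}/{archived_uri}"


print(wayback_uri("2016-01-22T10:08:23Z", "https://www.dr.dk"))
# https://web.archive.org/web/20160122100823/https://www.dr.dk
```

      For another open Wayback platform, only the base parameter would
      need to change, provided that the platform follows the same
      pattern.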

      A resolving service is currently available in the form of code for
      a prototype which runs at the Royal Danish Library [PWIDresolver]
      and is planned to be more widely available.  This service
      currently
      covers both the Danish web archive (with the proper rights) and
      open web archives with access services based on a pattern
      including archive, archival time and archived URI.  In other
      words, for open web archives it covers conversion of PWID URNs
      for: archive.org, archive-it.org, arquivo.pt, bibalex.org,
      nationalarchives.gov.uk, stanford.edu and vefsafn.is.  For the
      Danish web archive with restricted access, the prototype works
      locally accessing the CDX of the library, and providing access via
      a local proxy to a restricted environment.  The source code for
      this prototype is available from
      https://github.com/netarchivesuite/NAS-research/releases/
      tag/0.0.6.

   Documentation:

      None relevant

   Additional Information:

      Background:
      The PWID was originally suggested as a URI, based on joint
      research by a computer science researcher with knowledge of web
      archiving and researchers from humanities subjects (History and
      Literature).  This resulted in the paper "Persistent Web
      References - Best Practices and New Suggestions" [IPRES2016] from
      the iPres 2016 conference.  In this paper, the PWID is referred to
      as WPID.  However, feedback was received expressing a concern that
      WPID could be interpreted as a PID related to a PID-system, e.g.
      the DOI.  Although the definition of a PID does not contradict the
      name "WPID", there would still be a danger of confusing it with
      PID-systems, which is not the intention.  Consequently, this
      specification uses the name PWID instead.

      Comments on the drafted PWID URI ([DraftPwidUri]) have suggested
      that it should be a URN rather than a URI, which is why the PWID
      URN is defined here.

      Interest in the PWID has been expressed on several occasions
      where it has been presented (iPRES 2016 [IPRES2016] paper, RESAW
      2017 [ResawRef][ResawColl] papers, iPRES 2018 [IPRES2018] best
      poster, iDCC 2019 [IDCC2019] poster).  In particular, web
      researchers from digital humanities have expressed a strong
      interest in the PWID, since it will fill a gap and make it
      possible for the researchers to make the necessary references.


      Limitations to when a PWID URN can be created:
      It can be argued that the PWID URN should not have any
      restrictions on which material it can be applied to.  However, in
      order to make a standardized general way to identify material,
      there needs to be an assumed set of information that can be used
      for identification.

      The limits made can also be argued to be essential for material
      that is to be referenced on a long-term basis: such a reference
      needs information about which archive holds the material, when it
      was archived and what was archived.  (See also discussion of web
      archive identification below).


      Discussion of the web archive identification:
      Using the domain for a web archive as an identifier of the web
      archive is not ideal, but it is workable.  There are a number of
      cases where the domain may not work in the future:

         The web archive no longer exists

         The web archive has been merged into another web archive

         The web archive has changed the domain it uses

      In the first case, the precise material has been lost, which would
      be the same situation if the web archive had been identified in
      any other way.  It is however advisable for any user of
      references to evaluate the likely sustainability of a web archive
      before using the reference, e.g. by evaluating the probability of
      continued funding for the web archive.  In any case, the PWID
      contains information about archived URI and archival time, which
      makes it possible to search for an alternative (possibly less
      precise) reference in other web archives, e.g. by using Memento.

      In the second and the third case, identification of the resource
      will require that the new domain is found.  The likelihood of
      finding such information is rather high for well-established web
      archives, in one of two ways.  One way is to search online for
      information about the domain change (if the transition is
      described at the web archive's new domain).  Another way is to
      search other web archives for the last harvests of the old archive
      domain, which may contain information about the forthcoming
      transition (many web archives harvest each other's domain home
      pages, e.g. all the web archives mentioned in this document can be
      found in both archive.org and arquivo.pt).

      It would of course be ideal to have a registry that has exactly
      one identifier for a web archive, with different domains/patterns
      for online material for different periods if there have been
      changes.


      Possible extensions to be investigated:
      This first version of the PWID only contains a basic definition,
      which means that it does not include all of the possible
      extensions which have been suggested at different conferences.
      The reason is that these suggestions are not mature enough to be
      included at this stage.  The extensions suggested so far have
      been:

      *  Having web archives identified by registered identifiers.
         An update to the PWID URN will be considered if a workable
         solution can be found, e.g. a registry maintained by IANA.

      *  Having the possibility to use PWID for other web material than
         archived URIs, e.g. snapshots and collections

      *  Various possibilities for specifying the identified material,
         e.g. snapshot

      *  Discussion of how to extend use of PWID URNs via plugins in
         browsers, a standardized way to ask web archives for a resource
         specified as a PWID URN, and access via future web research
         infrastructure

   Revision Information:

      This is the first version of PWID as a URN.

3.  Acknowledgements

   A special thanks to Caroline Nyvang and Thomas Kromann who have
   contributed to the research identifying the minimum information
   required in a persistent web reference, and to Bolette Jurik who
   contributed with supplementary research concerning requirements for
   web collection/corpora definitions.  Also thanks to everybody who has
   contributed to this work with the research parts and with reviewing
   of this RFC.

4.  References

4.1.  Normative References

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987,
              <https://www.rfc-editor.org/info/rfc1034>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3339]  Klyne, G. and C. Newman, "Date and Time on the Internet:
              Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002,
              <https://www.rfc-editor.org/info/rfc3339>.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005,
              <https://www.rfc-editor.org/info/rfc3986>.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [RFC8141]  Saint-Andre, P. and J. Klensin, "Uniform Resource Names
              (URNs)", RFC 8141, DOI 10.17487/RFC8141, April 2017,
              <https://www.rfc-editor.org/info/rfc8141>.

4.2.  Informative References

   [DOI]      International DOI Foundation, "The DOI System", 2016,
              <https://web.archive.org/web/20161020222635/
              https:/www.doi.org/>.

              urn:pwid:archive.org:2016-10-20T22:26:35Z:page:https://www.
              doi.org/

   [DraftPwidUri]
              Zierau, E., "DRAFT: Scheme Specification for the pwid URI,
              Version 4", June 2018, <https://datatracker.ietf.org/doc/
              draft-pwid-uri-specification/>.

   [IDCC2019]
              Zierau, E., "Web References Meeting Requirements for
              Proper Referencing Principles", February 2019,
              <http://www.dcc.ac.uk/sites/default/files/documents/IDCC19
              /222_Web%20References%20Meeting%20Requirements%20for%20Pro
              per%20Referencing%20Principles.pdf>.

              Poster at 14th International Digital Curation Conference
              (iDCC) 2019

   [IPRES2016]
              Zierau, E., Nyvang, C., and T. Kromann, "Persistent Web
              References - Best Practices and New Suggestions", October
              2016, <http://www.ipres2016.ch/frontend/organizers/media/
              iPRES2016/_PDF/
              IPR16.Proceedings_4_Web_Broschuere_Link.pdf>.

              In: proceedings of the 13th International Conference on
              Preservation of Digital Objects (iPres) 2016, pp. 237-246

   [IPRES2018]
              Zierau, E., "Precise and Persistent Web Archive References
              - Status, context and expected progress of the PWID",
              September 2018, <https://osf.io/u5w3q/>.

              In: proceedings of the 15th International Conference on
              Preservation of Digital Objects (iPres) 2018, DOI:
              10.17605/OSF.IO/U5W3Q

   [ISO28500]
              International Organization for Standardization,
              "Information and documentation -- WARC file format", 2017,
              <https://www.iso.org/standard/68004.html>.

   [ISO8601]  International Organization for Standardization, "Data
              elements and interchange formats -- Information
              interchange -- Representation of dates and times", 2004,
              <https://www.iso.org/standard/40874.html>.

   [MEMENTO]  Memento Development Group, "About the Memento Project",
              January 2015, <http://mementoweb.org/about/>.

              urn:pwid:archive.org:2018-11-
              01T15:26:28Z:page:http://mementoweb.org/about/

   [PWIDprovider]
              Royal Danish Library (Netarkivet), "SolrWayback 3.1",
              2018, <https://github.com/netarchivesuite/solrwayback>.

              urn:pwid:archive.org:2018-06-
              11T02:00:05Z:page:https://github.com/netarchivesuite/
              solrwayback

   [PWIDresolver]
              Royal Danish Library (Netarkivet), "NAS-research version
              0.0.6", 2018, <https://github.com/netarchivesuite/NAS-
              research/releases/tag/0.0.6>.

              urn:pwid:archive.org:2018-07-
              16T06:53:51Z:page:https://github.com/netarchivesuite/NAS-
              research/releases/tag/0.0.6

   [ResawColl]
              Jurik, B. and E. Zierau, "Data Management of Web archive
              Research Data", 2017,
              <https://archivedweb.blogs.sas.ac.uk/files/2017/06/
              RESAW2017-JurikZierau-
              Data_management_of_web_archive_research_data.pdf>.

              In: proceedings of the RESAW 2017 Conference, DOI:
              10.14296/resaw.0002

   [ResawRef]
              Nyvang, C., Kromann, T., and E. Zierau, "Capturing the Web
              at Large - a Critique of Current Web Referencing
              Practices", 2017,
              <https://archivedweb.blogs.sas.ac.uk/files/2017/06/
              RESAW2017-NyvangKromannZierau-
              Capturing_the_web_at_large.pdf>.

              In: proceedings of the RESAW 2017 Conference, DOI:
              10.14296/resaw.0004

   [W3CDTF]   W3C, "Date and Time Formats: note submitted to the W3C. 15
              September 1997", 1997,
              <http://www.w3.org/TR/NOTE-datetime>.

              urn:pwid:archive.org:2017-04-
              03T03:37:42Z:page:http://www.w3.org/TR/NOTE-datetime

Author's Address

   Eld Maj-Britt Olmuetz Zierau (editor)
   Royal Danish Library
   Soeren Kierkegaards Plads 1
   Copenhagen  1219
   Denmark

   Phone: +45 9132 4690
   Email: elzi@kb.dk
