Network Working Group                                           E. Wilde
Internet-Draft                                                ETH Zurich
Expires: December 10, 2005                                   Jun 8, 2005


         URI Fragment Identifiers for the text/plain Media Type
                      draft-wilde-text-fragment-04

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on December 10, 2005.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   This memo defines URI fragment identifiers for text/plain MIME
   entities.  These fragment identifiers make it possible to refer to
   parts of a text MIME entity, identified by character count or range,
   line count or range, or a regular expression.  These identification
   methods can be combined to identify more than one sub-resource of a
   text/plain MIME entity.  Fragment identifiers may also contain hash
   information to make them more robust.


Wilde                   Expires December 10, 2005               [Page 1]

Internet-Draft       text/plain Fragment Identifiers            Jun 2005


Table of Contents

   1.  Introduction
     1.1   What is text/plain?
       1.1.1   Line Endings in text/plain MIME Entities
     1.2   What is a URI Fragment Identifier?
     1.3   Why text/plain Fragment Identifiers?
     1.4   Incremental Deployment
   2.  Fragment Identification Methods
     2.1   Fragment Identification Schemes
       2.1.1   Principles
       2.1.2   Combining the Principles
       2.1.3   Regular Expressions
       2.1.4   Combining Fragment Identification Scheme Parts
     2.2   Fragment Identifier Robustness
   3.  Fragment Identification Syntax
     3.1   Non-ASCII Characters in Regular Expressions
     3.2   Hash Sums
   4.  Fragment Identifier Processing
     4.1   Handling of position Values
     4.2   Handling of Hash Sums
     4.3   Syntax Errors in Fragment Identifiers
   5.  Examples
   6.  Security Considerations
   7.  Change Log
     7.1   From -03 to -04
     7.2   From -02 to -03
     7.3   From -01 to -02
     7.4   From -00 to -01
   8.  Open Issues
     8.1   To Do
     8.2   Open Questions
   9.  References
     9.1   Normative References
     9.2   Non-Normative References
       Author's Address
   A.  POSIX BRE Syntax
   B.  Where to send Comments
   C.  Acknowledgements
       Intellectual Property and Copyright Statements


1.  Introduction

   Compliant software MUST follow this specification.  The capitalized
   key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

1.1  What is text/plain?

   Internet Media Types as defined in RFC 2045 [2] and RFC 2046 [3] are
   used to identify different types and sub-types of media.  RFC 2046
   [3] and RFC 3676 [4] specify the text/plain media type, which is used
   for simple, unformatted text.  Quoting from RFC 2046 [3]: "Plain text
   does not provide for or allow formatting commands, font attribute
   specifications, processing instructions, interpretation directives,
   or content markup.  Plain text is seen simply as a linear sequence of
   characters, possibly interrupted by line breaks or page breaks."

   The text/plain media type does not restrict the character encoding;
   any character encoding may be used.  In the absence of an explicit
   character encoding declaration, US-ASCII is assumed as the default
   character encoding.  This variability of the character encoding makes
   it impossible to count characters in a text/plain MIME entity without
   taking the character encoding into account, because there are many
   character encodings using more than one octet per character.

   The biggest advantage of text/plain MIME entities is their ease of
   use and their portability among different platforms.  As long as they
   use popular character encodings (such as US-ASCII), they can be
   displayed and processed on virtually every computer system.

1.1.1  Line Endings in text/plain MIME Entities

   RFC 2046 [3] and RFC 3676 [4] specify that line endings in text/plain
   MIME entities are represented by CR+LF character sequences.  In
   implementation practice, however, text/plain MIME entities use
   different conventions, for example depending on the operating system
   they have been created with (in most cases, Unix uses LF, MacOS uses
   CR, and Windows uses CR+LF).  Because of this diversity of
   conventions, implementations interpreting text/plain fragment
   identifiers MUST take different line ending conventions into account.

   Line endings in text/plain MIME entities MAY be represented by
   characters or character sequences other than CR+LF, specifically CR,
   LF, NEL, and CR+NEL.  All of these MUST be interpreted as line
   endings, and this interpretation MUST be applied when evaluating
   text/plain fragment identifiers.  All representations of line endings
   (CR+LF, CR, LF, NEL, and CR+NEL) MUST be treated as a single
   character in character counts.  For the purpose of regular expression
   matching, all representations of line endings MUST be treated as
   single LF characters.  The reason for this is that fragment
   identifiers should not be broken by converting a file from one line
   ending convention to another.

   In general, the line ending conventions used in text/plain MIME
   entities depend on the character encoding of the MIME entity.
   Implementations SHOULD attempt to be as accurate as possible in
   recognizing line endings specific to particular character encodings,
   and MUST treat all these line endings as one character in character
   counts, and as single LF characters for regular expression matching.
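
   As an illustration only, the normalization described above could be
   sketched in Python as follows.  The function name
   normalize_line_endings is hypothetical, and the sketch assumes that
   the MIME entity has already been decoded into a character string and
   that only CR+LF, CR+NEL, CR, LF, and NEL need to be recognized;
   line endings specific to other character encodings are not covered.

```python
import re

# The two-character forms (CR+LF, CR+NEL) are listed first so that they
# are matched as one line ending rather than as two separate ones.
_LINE_ENDING = re.compile("\r\n|\r\x85|\r|\n|\x85")


def normalize_line_endings(text: str) -> str:
    """Replace every recognized line ending with a single LF."""
    return _LINE_ENDING.sub("\n", text)


sample = "one\r\ntwo\rthree\nfour\x85five\r\x85six"
normalized = normalize_line_endings(sample)
print(repr(normalized))  # 'one\ntwo\nthree\nfour\nfive\nsix'
print(len(normalized))   # 27
```

   After such a normalization, every line ending contributes exactly
   one character to a character count, and a regular expression sees a
   single LF regardless of the convention used in the MIME entity.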

1.2  What is a URI Fragment Identifier?

   URIs are the identification mechanism for resources on the Web.  The
   URI syntax specified in RFC 3986 [5] includes a fragment identifier
   as part of a URI.  The obsoleted URI specification, RFC 2396,
   described the fragment identifier as something that "consists of
   additional reference information to be interpreted by the user agent
   after the retrieval action has been successfully completed.  As such,
   it is not part of a URI, but is often used in conjunction with a URI.
   The semantics of a fragment identifier is a property of the data
   resulting from a retrieval action, regardless of the type of URI used
   in the reference.  Therefore, the format and interpretation of
   fragment identifiers is dependent on the media type of the retrieval
   result."  RFC 3986 now treats the fragment identifier as part of the
   URI, but its format and interpretation still depend on the media
   type of the retrieved representation.

   The most popular fragment identifier is defined for text/html
   (defined in RFC 2854 [10]), and makes it possible to refer to a
   specific element (identified by a 'name' or 'id' attribute) of an
   HTML document.

1.3  Why text/plain Fragment Identifiers?

   Referring to specific parts of a resource can be very useful, because
   it enables users and applications to create more specific references.
   Rather than pointing to a whole resource, users can create references
   to the part they really are interested in or want to talk about.
   Even though it is suggested that fragment identification methods are
   specified in a media type's MIME registration, many media types do
   not have fragment identification methods associated with them.

   Fragment identifiers are only useful if supported by the client,
   because they are only interpreted by the client.  Therefore, a new
   fragment identification method will require some time to be adopted
   by clients, and older clients will not support it.  However, because
   the URI still works even if the fragment identifier is not supported
   (the resource is retrieved, but the fragment identifier is not
   interpreted), rapid adoption is not highly critical to ensure the
   success of a new fragment identification method.

   Fragment identifiers for text/plain make it possible to refer to
   specific parts of a text MIME entity, using concepts of positions and
   ranges, which may be applied to characters and lines.  They also
   support locating a fragment by using a regular expression for
   searching for a specific character sequence.  Thus, text/plain
   fragment identifiers enable users to exchange information more
   specifically, thereby reducing time and effort that is necessary to
   manually search for the relevant part of a text/plain MIME entity.

1.4  Incremental Deployment

   As long as support for text/plain fragment identifiers is not
   implemented by all programs, it is important to consider the
   implications of incremental deployment.  Clients (for example, Web
   browsers) not supporting the text/plain fragment identifier described
   in this memo will work with URI references to text/plain MIME
   entities, but they will fail to locate the sub-resource identified by
   the fragment identifier.  This is a reasonable fallback behavior, and
   in general users should take into account the possibility that a
   program interpreting a given URI will fail to interpret the fragment
   identifier part.  Since fragment identifier evaluation is local to
   the client (and happens after retrieving the MIME entity), there is
   no way for a server to determine whether a requesting client is using
   a URI containing a fragment identifier.

2.  Fragment Identification Methods

   The identification of fragments of text/plain MIME entities can be
   based on different foundations.  Since it is not possible to insert
   explicit, invisible identifiers into a text/plain MIME entity (as for
   example used in HTML documents, implemented through special
   attributes), fragment identification has to rely on certain inherent
   criteria of the MIME entity.  This memo specifies fragment
   identification using six different methods, which are character
   positions and ranges, line positions and ranges, regular expression
   matching, and a mechanism for improving the robustness of fragment
   identifiers (entity hashes).

   When interpreting character or line numbers, implementations MUST
   take the character encoding of the MIME entity into account, because
   character count and octet count may differ for the character encoding
   being used.  For example, a MIME entity using UTF-16 encoding (as
   specified in RFC 2781 [11]) uses two or four octets per character,
   and it may have a leading BOM (Byte-Order Mark), which does not count
   as a character and thus also affects the mapping from a simple octet
   count to a character count.
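
   The difference between octet count and character count can be made
   concrete with a small Python sketch.  The function name
   decode_entity is hypothetical, the charset is assumed to come from
   the MIME entity's declared character encoding (US-ASCII by default),
   and the sketch relies on Python's built-in codecs rather than on
   anything defined in this memo.

```python
import codecs


def decode_entity(octets: bytes, charset: str = "us-ascii") -> str:
    """Decode a MIME entity body and drop a leading BOM, if any."""
    text = octets.decode(charset)
    # Some codecs (such as "utf-16") consume the BOM themselves; others
    # (such as "utf-16-le") leave it in place as U+FEFF.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text


payload = codecs.BOM_UTF16_LE + "h\u00e9llo".encode("utf-16-le")
print(len(payload))                           # 12 octets
print(len(decode_entity(payload, "utf-16")))  # 5 characters
```

   Character and line positions are then counted on the decoded string,
   never on the octets.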

2.1  Fragment Identification Schemes

   Fragment identification can be done using regular expressions or
   combining two orthogonal principles, which are positions and ranges,
   and characters and lines.  The following section describes the
   principles themselves, while Section 2.1.2 describes the combination
   of the principles.

2.1.1  Principles

2.1.1.1  Positions and Ranges

   A position does not identify an actual fragment of the MIME entity,
   but a position inside the MIME entity, which could be regarded as a
   fragment of zero length.  The use case for positions is to provide
   pointers for applications which may use them to implement
   functionalities such as "insert some text here", which needs a
   position rather than a fragment.  Positions are counted from zero
   (position zero being before the first character or line of a text/
   plain MIME entity), so that a text/plain MIME entity having one
   character has two positions, one before the first character (position
   0), and one after the first character (position 1).

   Since positions are fragments of length zero, applications SHOULD use
   other methods than highlighting to indicate positions, the most
   obvious way being the positioning of a cursor (if the application
   supports the concept of a cursor).

   Ranges, on the other hand, identify fragments of a MIME entity that
   have a length that may be greater than zero.  As a general principle
   for ranges, they specify both a lower and an upper bound.  The start
   or the end of a range specification may be omitted, defaulting to the
   first or last position of the MIME entity, respectively.  The ending
   position of a range must have a value greater than or equal to the
   lower position (consequently, a range with identical lower and upper
   positions is legal, and identifies a range of length 0, which is
   equivalent to a position).  Counting for ranges uses positions, so
   that a fragment containing one entity is specified by using a range
   with two adjacent positions.

   Since ranges are fragments with a length greater than zero,
   applications SHOULD use methods like highlighting to indicate ranges
   (if the application supports the concept of highlighting).

   For positions and ranges it is implicitly assumed that if a number is
   greater than the actual number of elements in the MIME entity, then
   it is referring to the last element of the MIME entity (see Section 4
   for the processing model).

2.1.1.2  Characters and Lines

   The concept of positions and ranges may be applied to characters and
   lines.  In both cases, positions indicate points between entities,
   while ranges identify zero or more entities by indicating positions.

   Character positions are numbered starting with zero (ignoring initial
   BOM marks or similar concepts that are not part of the actual textual
   content of a text/plain MIME entity), and counting each character
   separately, with the exception of line endings, which are always
   counted as one character (Section 1.1.1 describes how line endings
   MUST be identified).

   Line positions are numbered starting with zero (with line position
   zero always being identical with character position zero), with
   Section 1.1.1 describing how line endings MUST be identified.
   Fragments identified by lines include the line endings, so
   applications identifying line-based fragments MUST include the line
   endings in the fragment identification they are using (eg, the
   highlighted selection).  If a MIME entity does not contain any line
   endings, then it consists of a single (the first) line.

2.1.2  Combining the Principles

   In the following sections, the principles described in the preceding
   section (positions/ranges and characters/lines) are combined,
   resulting in four use cases.

2.1.2.1  Character Position

   Using the char scheme followed by a single number, it is possible to
   point to a character position (ie, a fragment of length zero between
   two characters).  Rather than identifying a fragment consisting of a
   number of characters, this method identifies a position between two
   characters (or before the first or after the last character).
   Character position counting starts with 0, so the character position
   before the first character of a text/plain MIME entity has the
   character position 0, and a MIME entity containing n distinct
   characters has n+1 distinct character positions, the last one having
   the character position n.

2.1.2.2  Character Range

   If it is necessary to identify a fragment of one or more characters
   using character counting, this can be done by using a character
   range, using the char scheme followed by a range specification.  A
   character range is a consecutive region of the MIME entity that
   extends from the starting character position of the range to the
   ending character position of the range.

2.1.2.3  Line Position

   Using the line scheme followed by a single number, it is possible to
   point to a line position (ie, a fragment of length zero between two
   lines).  Rather than identifying a fragment consisting of a number of
   lines, this method identifies a position between two lines (or before
   the first or after the last line).  Line position counting starts
   with 0, so the line position before the first line of a text/plain
   MIME entity has the line position 0, and a MIME entity containing n
   distinct lines has n+1 distinct line positions, the last one having
   the line position n.

2.1.2.4  Line Range

   If it is necessary to identify a fragment of one or more lines using
   line counting, this can be done by using a line range, using the line
   scheme followed by a range specification.  A line range is a
   consecutive region of the MIME entity that extends from the starting
   line position of the range to the ending line position of the range.
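
   One possible way to map line positions onto character offsets is
   sketched below in Python.  The function names are hypothetical, the
   text is assumed to be decoded already with all line endings
   normalized to LF, and treating an empty MIME entity as having a
   single line position is a choice made by this sketch.

```python
from typing import List


def line_position_offsets(text: str) -> List[int]:
    """Return the character offset of every line position."""
    offsets = [0]
    for index, char in enumerate(text):
        if char == "\n":
            offsets.append(index + 1)
    # Add the position after a final line that lacks a line ending.
    if offsets[-1] != len(text):
        offsets.append(len(text))
    return offsets


def line_range(text: str, start: int, end: int) -> str:
    """Return the lines between two line positions, endings included."""
    offsets = line_position_offsets(text)
    last = len(offsets) - 1
    return text[offsets[min(start, last)]:offsets[min(end, last)]]


text = "first\nsecond\nthird"
print(line_position_offsets(text))   # [0, 6, 13, 18]
print(repr(line_range(text, 1, 2)))  # 'second\n'
```

   A MIME entity with three lines yields four line positions, and the
   range from line position 1 to line position 2 selects the second
   line together with its line ending.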

2.1.3  Regular Expressions

   A common problem with fragment identifiers is their robustness (to
   changes in the MIME entity), and character and line counts can break
   very easily.  A more robust way of identifying a fragment is by
   searching for a specific pattern (another way of making fragment
   identifiers more robust is described in Section 2.2 about including
   entity hash sums in the fragment identifier).  Thus, it is possible
   to use a Basic Regular Expression (BRE) as defined by ISO 9945-2 [6]
   (the POSIX standard) as a fragment identifier (Appendix A contains a
   short summary of the POSIX BRE syntax).

2.1.4  Combining Fragment Identification Scheme Parts

   While in most cases only one fragment identification scheme part will
   be used, it is possible to combine them.  By simply concatenating
   different fragment identification scheme parts, separated by a
   semicolon, the whole fragment identifier refers to the union of all
   fragments of the text/plain MIME entity identified by the individual
   fragment identification scheme parts.  This way, it is possible to
   identify disjoint ranges, such as multiple line ranges.
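
   The union of several scheme parts can be pictured as a merge of
   intervals.  The following Python sketch assumes that each scheme
   part has already been resolved to a (start, end) pair of character
   positions; the function name merge_fragments is hypothetical, and
   merging ranges that merely touch is a choice of this sketch rather
   than a requirement of this memo.

```python
from typing import List, Tuple

Fragment = Tuple[int, int]


def merge_fragments(fragments: List[Fragment]) -> List[Fragment]:
    """Return the union of the fragments as sorted, disjoint ranges."""
    merged: List[Fragment] = []
    for start, end in sorted(fragments):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


print(merge_fragments([(10, 20), (0, 5), (15, 30)]))
# [(0, 5), (10, 30)]
```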

   It should be noted that regular expressions by themselves may
   identify disjoint fragments, which is true in any case where the
   regular expression matches more than one occurrence in the MIME
   entity.

   Since disjoint fragments can be identified, implementations SHOULD
   make sure that these fragments are appropriately marked, for example
   by highlighting the fragment (rather than only scrolling to some
   line, which only identifies a single position in the MIME entity).
   If an implementation cannot mark disjoint fragments, it MAY resort
   to marking only the first of the disjoint fragments.  However, the
   exact method of how implementations deal with disjoint fragments
   depends on the application and interface, and is beyond the scope of
   this memo.

2.2  Fragment Identifier Robustness

   While regular expressions (as described in Section 2.1.3) may make
   fragment identifiers more robust than character or line counts, it is
   still possible that modifications of the resource will break the
   fragment identifier.  If applications want to create more robust
   fragment identifiers, they may do so by adding hash sums to fragment
   identifiers.  These hash sums are used to detect a change in the
   resource, so that applications may warn users about the possibility
   that a fragment identifier might have been broken by a modification
   of the resource.  Since fragment identifiers are interpreted by
   clients, hash sums are defined on MIME entities rather than the
   resource itself, and as such are specific to a certain representation
   of the resource; in the case of text/plain resources, they are
   specific to the character encoding of the MIME entity.

   Hash sums may specify the character encoding that has been used when
   creating the hash sums, and if such a specification is present,
   clients MUST check whether the character encoding specified for the
   hash sum and the character encoding of the retrieved MIME entity are
   equal, and clients MUST NOT check the hash sum if these values
   differ.

3.  Fragment Identification Syntax

   The syntax for the fragment identifiers is straightforward.  The
   syntax defines four schemes, 'char', 'line', 'match', and hash (which
   can either be 'length' or 'md5').  The 'char' and 'line' schemes can
   be used in two different variants, either the position variant (with
   a single number), or the range variant (with two comma-separated
   positions, one of which may be omitted).  The 'match' scheme has a
   regular expression as parameter, which must be specified as a string
   with escaped semicolons (because the semicolon is used to concatenate
   multiple fragment identification scheme parts).  The hash scheme can
   either use the 'length' or the 'md5' scheme to specify a hash value.

   The following syntax definition uses ABNF as defined in RFC 2234 [7].


   text-fragment =  text-scheme 0*( ";" text-scheme) 0*( ";" hash-scheme)
   text-scheme   =  ( char-scheme / line-scheme / match-scheme )
   hash-scheme   =  ( length-scheme / md5-scheme ) [ "," charenc ]
   char-scheme   =  "char=" ( position / range )
   line-scheme   =  "line=" ( position / range )
   match-scheme  =  "match=" regex
   position      =  number
   range         =  (position "," [ position ]) / ("," position )
   number        =  1*( DIGIT )
   regex         =  StringWithEscapedSemicolon
   length-scheme =  "length=" number
   md5-scheme    =  "md5=" md5-value
   md5-value     =  32( hexdigit )
   hexdigit      =  ( DIGIT / "a" / "A" / "b" / "B" / "c" / "C" / "d" /
                      "D" / "e" / "E" / "f" / "F" )
   charenc       =  StringWithEscapedSemicolon

   The StringWithEscapedSemicolon is a string where all characters may
   appear literally (except the characters which are required by the URI
   syntax to be escaped), with the exception of a semicolon.  A
   semicolon that should be part of the regular expression must be
   escaped with a leading backslash, and implementations MUST make sure
   to properly interpret regular expressions, properly dereferencing all
   escape mechanisms that apply (ie, URI encoding, semicolon escaping,
   and BRE escaping, as well as any additional escaping that may be
   present due to the context of the URI).
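
   The semicolon escaping can be illustrated with a short Python
   sketch.  The function name split_scheme_parts is hypothetical, and
   the sketch assumes that URI percent-decoding has already been
   applied and that a semicolon counts as escaped exactly when it is
   immediately preceded by a backslash.  It does not validate the
   values against the ABNF above.

```python
import re
from typing import List, Tuple


def split_scheme_parts(fragment: str) -> List[Tuple[str, str]]:
    """Split a fragment identifier into (scheme, value) pairs."""
    parts = []
    # Split only at semicolons that are not preceded by a backslash.
    for raw in re.split(r"(?<!\\);", fragment):
        name, separator, value = raw.partition("=")
        if not separator:
            raise ValueError("scheme part without '=': %r" % raw)
        # Resolve the semicolon escaping; BRE escaping is left alone.
        parts.append((name, value.replace("\\;", ";")))
    return parts


print(split_scheme_parts(r"line=,1;match=hello\;;length=42"))
# [('line', ',1'), ('match', 'hello;'), ('length', '42')]
```

   A value returned for a hash scheme still contains the optional
   character encoding after the comma, which would need to be split
   off separately.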

3.1  Non-ASCII Characters in Regular Expressions

   RFC 3986 [5] only allows a subset of ASCII as characters in URIs.
   Consequently, it is not possible to use non-ASCII characters in URIs.
   However, using Internationalized Resource Identifiers (IRI) as
   defined by RFC 3987 [8], it is possible to use non-ASCII characters,
   using the encoding defined by IRI.  Thus, using IRIs it is possible
   to use non-ASCII characters in regular expressions, and
   implementations MUST make sure to correctly handle any non-ASCII
   characters in regular expressions, if they accept IRI-encoded text/
   plain fragment identifiers.

3.2  Hash Sums

   A hash sum can either specify a MIME entity's length, or its MD5
   fingerprint.  In both cases, it can optionally specify the character
   encoding that was used when calculating the hash sum, so that
   clients interpreting the fragment identifier may check whether they
   are using the same character encoding for their calculations.  The
   length of a text/plain MIME entity is calculated by using the
   principles defined in Section 2.1.1.2.  The MD5 fingerprint of a
   text/plain MIME entity is calculated by using the algorithm presented
   in [9], encoding the result in 32 hexadecimal digits (using uppercase
   or lowercase letters) as a representation of the 128 bits that are
   the result of the MD5 algorithm.
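
   As a rough illustration, both hash values could be computed in
   Python as shown below.  The function names are hypothetical.  The
   sketch assumes that the length is the character count of the
   decoded text with every line ending counted once, and that the MD5
   fingerprint is taken over the octets of the MIME entity as
   retrieved; this memo does not spell out the latter, so it should be
   read as an assumption.

```python
import hashlib
import re

_LINE_ENDING = re.compile("\r\n|\r\x85|\r|\n|\x85")


def length_value(text: str) -> int:
    """Character count with each line ending counted as one character."""
    return len(_LINE_ENDING.sub("\n", text))


def md5_value(octets: bytes) -> str:
    """MD5 fingerprint as 32 lowercase hexadecimal digits."""
    return hashlib.md5(octets).hexdigest()


print(length_value("ab\r\ncd"))  # 5
print(len(md5_value(b"abc")))    # 32
```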

4.  Fragment Identifier Processing

4.1  Handling of position Values

   If any position value (as a position or inside a range) is greater
   than the value for the actual MIME entity, then it identifies the
   last character or line position of the MIME entity.  If the first
   position value in a range is not present, then the range extends from
   the start of the MIME entity.  If the second position value in a
   range is not present, then the range extends to the end of the MIME
   entity.  If a range scheme's positions are not properly ordered (ie,
   the first number is greater than the second), then this scheme part
   MUST be ignored.
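
   These rules can be summarized in a small Python sketch.  The
   function name resolve_range and the parameter last_position (the
   highest character or line position of the MIME entity) are
   hypothetical, and checking the order on the values as written,
   before clamping, is an assumption of this sketch.

```python
from typing import Optional, Tuple


def resolve_range(first: Optional[int], second: Optional[int],
                  last_position: int) -> Optional[Tuple[int, int]]:
    """Apply defaulting, ordering, and clamping to a range."""
    if first is not None and second is not None and first > second:
        return None  # improperly ordered: the scheme part is ignored
    start = 0 if first is None else min(first, last_position)
    end = last_position if second is None else min(second, last_position)
    return (start, end)


print(resolve_range(10, 20, 15))     # (10, 15)
print(resolve_range(10, 20, 5))      # (5, 5)
print(resolve_range(None, 1, 100))   # (0, 1)
print(resolve_range(20, 10, 100))    # None
```

   A single position can be handled the same way by clamping it to
   last_position.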

4.2  Handling of Hash Sums

   If a fragment identifier contains a hash sum, and a client retrieves
   a MIME entity and detects that the hash sum has changed (observing
   the character encoding specification, if present), then the client
   MUST NOT interpret any other text/plain fragment identifier scheme
   part.  A client MAY signal this situation to the user.
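
   The decision a client has to make for an 'md5' hash sum might look
   like the following Python sketch.  The function name
   may_interpret_fragment is hypothetical.  Comparing character
   encoding names case-insensitively, hashing the octets of the MIME
   entity, and continuing to interpret the other scheme parts when the
   hash sum cannot be checked are all assumptions of this sketch.

```python
import hashlib
from typing import Optional


def may_interpret_fragment(entity: bytes, entity_charset: str,
                           expected_md5: str,
                           hash_charset: Optional[str] = None) -> bool:
    """Return False only if the hash sum was checked and has changed."""
    if (hash_charset is not None
            and hash_charset.lower() != entity_charset.lower()):
        # Different character encodings: the hash sum is not checked.
        return True
    actual = hashlib.md5(entity).hexdigest()
    # The syntax allows uppercase and lowercase hexadecimal digits.
    return actual == expected_md5.lower()


digest = "900150983CD24FB0D6963F7D28E17F72"  # MD5 of b"abc"
print(may_interpret_fragment(b"abc", "us-ascii", digest))  # True
print(may_interpret_fragment(b"abd", "us-ascii", digest))  # False
```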

4.3  Syntax Errors in Fragment Identifiers

   If a fragment identifier contains a syntax error (i.e., does not
   conform to the syntax specified in Section 3), then it MUST be
   ignored by clients.  Clients SHOULD NOT make any attempt to correct
   or guess fragment identifiers.  Syntax errors MAY be reported by
   clients.

5.  Examples

   The following examples show some usages for the fragment identifiers
   defined in this memo.


   http://example.com/text.txt#char=100

   This URI identifies the position after the 100th character of the
   text.txt MIME entity.  It should be noted that it is not clear which
   octet(s) of the MIME entity this will be without retrieving the MIME
   entity and thus knowing which character encoding it is using (in case
   of HTTP, this information will be given in the response's
   Content-Type header).  If the MIME entity has fewer than 100
   characters, the URI identifies the position after the MIME entity's
   last character.


   http://example.com/text.txt#line=10,20

   This URI identifies lines 11 to 20 of the text.txt MIME entity.  If
   the MIME entity has fewer than 11 lines, it identifies the position
   after the last line.  If the MIME entity has fewer than 20 but at
   least 11 lines, it identifies the lines 11 to the last line of the
   MIME entity.


   http://example.com/text.txt#match=searchterm

   This URI identifies all occurrences of the regular expression
   'searchterm' in the MIME entity, ie all occurrences of the string
   'searchterm'.  If there is more than one occurrence, then this URI
   identifies a disjoint fragment, consisting of all of these
   occurrences.  If there is no occurrence of the search term, the URI
   does not identify a fragment.


   http://example.com/text.txt#line=,1;match=searchterm

   This URI identifies the first line and all occurrences of the regular
   expression 'searchterm' in the MIME entity.  If there is an
   occurrence of 'searchterm' outside of the first line, then this URI
   identifies a disjoint fragment.


   http://example.com/text.txt#match=hello\;

   This URI identifies all occurrences of the regular expression
   'hello;' in the MIME entity.  The semicolon with the leading
   backslash has to be interpreted as a literal semicolon inside of the
   BRE, treating the '\;' as an escaped ';', so that the actual regular
   expression is 'hello;'.  If there is more than one occurrence of this
   regular expression, then this URI identifies a disjoint fragment,
   consisting of all of these occurrences.


   ...


   (more complex example...)

6.  Security Considerations

   Regular expression matching code is notoriously vulnerable to buffer
   overflow security holes, so any implementation supporting text/plain
   fragment identifiers SHOULD make sure that the code being used has
   been tested against buffer overflow attacks.

7.  Change Log

   This section will not be part of the final RFC text; it serves as a
   container to collect the history of the individual draft versions.

7.1  From -03 to -04

   o  URIs are now defined by RFC 3986 [5], so the text and the
      references have been updated.  In particular, RFC 3986 defines a
      fragment identifier to be part of the URI, whereas in the
      obsoleted RFC 2396 URI specification, it was not part of a URI as
      such, but of a "URI reference".

   o  IRIs are now defined by RFC 3987 [8], so the text and the
      references have been updated.

   o  Changed IPR clause from RFC 3667 to RFC 3978 (updated version of
      RFC 3667)


7.2  From -02 to -03

   o  Replaced most occurrences of 'resource' with 'MIME entity',
      because the result of dereferencing a URI is not the resource
      itself, but some MIME entity (in our case of type text/plain)
      representing it.  Thanks to Sandro Hawke for pointing this out.

   o  Moved Section 8 to the very back of the document.

   o  Added Section 4 to define the processing model for fragment
      identifiers (moved Section 4.1 from Section 3 to Section 4).

   o  Added hash scheme to make fragment identifiers more robust
      (Section 2.2).

   o  Changed IPR clause from RFC 2026 to RFC 3667 (updated version of
      RFC 2026).


7.3  From -01 to -02

   o  Fundamental change in semantics: counts turn into positions
      (between characters or lines), so in order to identify a character
      or line, ranges must be used (which now use positions to specify
      the upper and lower bounds of the range).

   o  Made the first value of a range optional as well, so that line=,5
      is also legal, identifying everything from the start of the MIME
      entity to the 5th line.

   o  Changed the syntax from parenthesis-style to a more traditional
      style using equals-signs.


7.4  From -00 to -01

   o  Made the second count value of ranges optional, so that something
      like line(10,) is legal and properly defined.

   o  Added non-normative reference to Internet draft about non-ASCII
      characters in search strings.

   o  Added Section 1.4 about incremental deployment.

   o  Added more elaborate examples.

   o  Added text about regex buffer overflow problems in Section 6.

   o  Added Section 1.1.1 about line endings in text/plain resources.

   o  Added Section 8 to collect open issues regarding this memo (will
      be deleted in final RFC text).


8.  Open Issues

   This section will not be part of the final RFC text; it serves as a
   container to collect to-dos (Section 8.1) and open questions
   (Section 8.2) regarding this memo.

8.1  To Do

   o  Allow negative numbers for positions, which are interpreted as
      counting backwards from the MIME entity's end.

   o  Provide more complex example(s).


   o  Provide short BRE syntax and description in Appendix A (by
      inclusion or by reference).

   o  Add some text about the importance of having fragment
      identification capabilities for out-of-line linking methods such
      as XLink to Section 1.3.


8.2  Open Questions

   o  Escaping a semicolon in a regex (so that it is interpreted
      literally) is now done by using a leading backslash.  Shouldn't
      that be changed to the more URI-style way of percent-encoding it,
      if it should appear literally?

   o  Should regex ranges be allowed (i.e., a fragment ranging from one
      regex match to another regex match)?

   o  Should a more sophisticated regex mechanism than BREs be used?

   o  Regexes by themselves may identify disjoint sub-resources.  Should
      there be a mechanism to say something like "the 5th appearance of
      the following regex"?  Or are users responsible for composing
      regexes which do not need this kind of additional mechanism?

   o  Is the concatenation of scheme parts (Section 2.1.4) and its
      semantics of joining the individual fragments a good thing?  Or a
      bad thing?

   o  Should there be more schemes?  Or fewer?

   o  Is it necessary to mention that applications must be able to
      transcode characters, because the text file and the fragment
      identifier may use different character encodings?  What about
      character normalization?  Should that be addressed or at least
      mentioned as being out of scope?

   o  MD5 values are now specified as 32 hex digits.  An alternative
      would be the representation as specified by [12], which defines
      base64 encoding for the 128 bits of the checksum.  Should both
      forms be allowed (hex and base64) or is one enough?  If only one,
      is hex the right choice?
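
   To make the last question concrete, the following non-normative
   sketch derives both candidate representations from the same 128-bit
   MD5 checksum.  The function name md5_forms and the sample input are
   hypothetical; the sketch only illustrates the difference in length
   between the two encodings and does not define fragment identifier
   syntax.

```python
import base64
import hashlib

def md5_forms(data: bytes):
    """Return the MD5 checksum of data as (hex string, base64 string)."""
    digest = hashlib.md5(data).digest()   # 16 bytes, i.e. 128 bits
    hex_form = digest.hex()               # 32 hexadecimal digits
    b64_form = base64.b64encode(digest).decode("ascii")  # 24 chars, '=' padded
    return hex_form, b64_form

# Hypothetical sample input
hex_form, b64_form = md5_forms(b"hello\n")
```

   Both strings carry the same 16 bytes: the hex form takes 32
   characters, and the base64 form takes 24 characters including its
   trailing '=' padding.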


9.  References


9.1  Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", RFC 2119, March 1997.

   [2]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
        Extensions (MIME) Part One: Format of Internet Message Bodies",
        RFC 2045, November 1996.

   [3]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
        Extensions (MIME) Part Two: Media Types", RFC 2046,
        November 1996.

   [4]  Gellens, R., "The Text/Plain Format and DelSp Parameters",
        RFC 3676, February 2004.

   [5]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
        Resource Identifier (URI): Generic Syntax", RFC 3986,
        January 2005.

   [6]  International Organization for Standardization, "Information
        technology - Portable Operating System Interface (POSIX) - Part
        2: Shell and Utilities", ISO 9945-2, 1993.

   [7]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
        Specifications: ABNF", RFC 2234, November 1997.

   [8]  Duerst, M. and M. Suignard, "Internationalized Resource
        Identifiers (IRI)", RFC 3987, January 2005.

   [9]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
        April 1992.

9.2  Non-Normative References

   [10]  Connolly, D. and L. Masinter, "The 'text/html' Media Type",
         RFC 2854, June 2000.

   [11]  Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646",
         RFC 2781, February 2000.

   [12]  Myers, J. and M. Rose, "The Content-MD5 Header Field",
         RFC 1864, October 1995.

   [13]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
         June 1999.


Author's Address

   Erik Wilde
   ETH Zurich
   ETH-Zentrum
   8092 Zurich
   Switzerland

   Phone: +41-1-6325132
   Email: net.dret@dret.net
   URI:   http://dret.net/netdret/

Appendix A.  POSIX BRE Syntax

   This section contains a short (and non-normative) summary of the
   POSIX BRE syntax defined in ISO 9945-2 [6].  The definition of BRE
   syntax in ISO 9945-2 [6] is the normative reference, and the
   following summary is for informative purposes only.

   (tbd - is there some rfc that could be referenced instead?)

Appendix B.  Where to send Comments

   Please send all comments and questions concerning this document to
   Erik Wilde.

Appendix C.  Acknowledgements

   This document has been prepared using the IETF document DTD described
   in RFC 2629 [13].

   Thanks for comments and suggestions provided by Dan Kohn, John Cowan,
   Benja Fallenstein, Henrik Levkowetz, Sandro Hawke, Marcel Baschnagel,
   and Martin Duerst.


Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Copyright Statement

   Copyright (C) The Internet Society (2005).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.


Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.

