DAV Searching and Locating                                March 1998

INTERNET-DRAFT                                              S. Reddy
draft-reddy-dasl-requirements-02.txt           Microsoft Corporation
March, 1998                                                 J. Slein
Expires July, 1998                                 Xerox Corporation

Requirements for DAV Searching and Locating

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working information as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and can be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress".

To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts shadow
directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East coast), or
ftp.isi.edu (US West coast). Further information about the IETF can be
found at URL: http://www.ietf.org/

Distribution of this document is unlimited. Editorial comments should
be sent to the author (saveenr@microsoft.com).

Abstract

The Distributed Authoring and Versioning protocol [WEBDAV] defines
simple mechanisms to assign and retrieve values for properties. This
document presents requirements for a WEBDAV extension to support
efficient searching for resources based on WEBDAV properties and
content. These requirements are intended to be the basis for the DAV
Searching and Locating (DASL) protocol.

1 Introduction

1.1 Existing DAV searching mechanisms

WEBDAV and HTTP provide support for client-side search, but not
server-side search.
The GET method defined in [HTTP] allows clients to retrieve a
resource's content; the PROPFIND method defined in [WEBDAV] allows
clients to retrieve a resource's properties. Having retrieved a
resource's properties and/or content, the client can compare them to
its search criteria to determine whether the resource is of interest.

1.2 Limitations of Client-side Searching

Client-side searching requires no modifications to the server.
However, simplicity for the server comes at a cost:

(1) It makes inefficient use of network resources. Clients must
    retrieve properties and content for each resource under
    consideration.

(2) It does not take advantage of server intelligence. Servers capable
    of searching can use sophisticated mechanisms to generate results:
    internal caching of intermediate search results, content-indexing,
    etc.

Even simple, common queries may expose these limitations. Consider the
query "find all text files modified during the last week." When a
large number of clients issue such a query against a single server,
the limitations become more apparent. Client-side searching has
difficulty scaling in these cases.

1.3 Server-side Searching

DASL allows for server-side searching. Server-side searching allows
the client to formulate a query and have the server perform the task
of selecting the resources that fit the criteria. This overcomes both
of the limitations of client-side searching described above. The
benefit is a searching solution that scales; the cost is that the
server software becomes more complex.

2 Terminology

2.1 DASL Terms

2.1.1 Search Criteria

Search criteria are an expression against which each resource in the
search scope is evaluated. Those resources for which the expression
evaluates to True are included in the result set.
2.1.2 Search Expression

A Search Expression is a Search Term, the negation of an expression
(using the Boolean NOT operator), or two expressions joined by one of
the Boolean operators AND or OR. An expression evaluates to True,
False, or Unknown.

2.1.3 Search Term

A Search Term is an assertion about a resource. The term may assert
that:

(1) a property has a relationship to some value,

(2) a property exists, or

(3) the content of a resource has a relationship to some value.

2.1.4 Result Set

The Result Set is the response to a search request. It is a set of
result records, one record for each resource that matches the search
criteria.

2.1.5 Result Record Definition

The Result Record Definition is the set of properties that the client
asks the server to return for each resource that matches the search
criteria.

2.1.6 Result Record

A Result Record is a unit of information in the result set that
corresponds to a resource that matches the search criteria. The record
consists of those properties listed in the Result Record Definition.

2.1.7 Search Scope

The Search Scope is the set of resources to be searched.

Comparison Operator

A comparison operator is a function used in a search term that
evaluates the relationship between two values. Examples of comparison
operators are <, <=, >=, >, ==, and !=.

2.1.8 Sort Specification

A sort specification tells the server how to sort the result set.

2.1.9 Search Attribute

A Search Attribute is an instruction that governs the execution of the
query but is not part of the search scope, the result record
definition, the search criteria, or the sort specification. An example
of a search attribute is one that controls how much time the server
can spend on the query before giving a response.

2.1.10 Query

The Query is the combination of search criteria, search scope, result
record definition, sort specification, and search attributes.
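The terms defined in 2.1.1 through 2.1.3 can be illustrated with a
minimal Python sketch. This document does not say how Unknown combines
under NOT, AND, and OR; the sketch assumes the common three-valued
(Kleene, SQL-style) rules. All names in it (Tri, Term, Not, And, Or,
evaluate, result_set) are hypothetical and are not part of any DASL
specification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Union

Resource = Dict[str, object]  # hypothetical model: property name -> value


class Tri(Enum):
    """The three values an expression can evaluate to (2.1.2)."""
    TRUE = "True"
    FALSE = "False"
    UNKNOWN = "Unknown"


@dataclass(frozen=True)
class Term:
    """A search term: an assertion about a single resource (2.1.3)."""
    test: Callable[[Resource], Tri]


@dataclass(frozen=True)
class Not:
    operand: "Expression"


@dataclass(frozen=True)
class And:
    left: "Expression"
    right: "Expression"


@dataclass(frozen=True)
class Or:
    left: "Expression"
    right: "Expression"


Expression = Union[Term, Not, And, Or]


def evaluate(expr: Expression, resource: Resource) -> Tri:
    """Evaluate an expression; Unknown handling is an assumption (Kleene)."""
    if isinstance(expr, Term):
        return expr.test(resource)
    if isinstance(expr, Not):
        value = evaluate(expr.operand, resource)
        if value is Tri.UNKNOWN:
            return Tri.UNKNOWN
        return Tri.FALSE if value is Tri.TRUE else Tri.TRUE
    left = evaluate(expr.left, resource)
    right = evaluate(expr.right, resource)
    if isinstance(expr, And):
        if Tri.FALSE in (left, right):
            return Tri.FALSE
        return Tri.UNKNOWN if Tri.UNKNOWN in (left, right) else Tri.TRUE
    # Remaining case: Or
    if Tri.TRUE in (left, right):
        return Tri.TRUE
    return Tri.UNKNOWN if Tri.UNKNOWN in (left, right) else Tri.FALSE


def result_set(criteria: Expression, scope: List[Resource]) -> List[Resource]:
    """Keep only resources for which the criteria evaluate to True (2.1.1)."""
    return [r for r in scope if evaluate(criteria, r) is Tri.TRUE]
```

Under these assumed rules, a resource that lacks the property a term
refers to yields Unknown, so it is excluded both by the term and by its
negation.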
2.2 Additional Terms

In addition to the terms defined above, this document uses terminology
consistent with [HTTP] and [WEBDAV].

3 Query Semantics

3.1 General Requirements

3.1.1 Simple Searches on Content

It must be possible to perform simple searches on content of any media
type.

Searching for specific content inside a resource is a common
operation. DASL must provide a mechanism for searching the content of
a resource.

3.1.2 Variants

It must be possible for searches to occur across multiple variants of
a resource and to target specific variants.

The WEBDAV working group is addressing the standardization of
mechanisms for authors to use when submitting variants to the server.
DASL must provide mechanisms that can intelligently query on those
variants.

3.1.3 Versioning

It must be possible for searches to occur across multiple versions of
a resource and to target specific versions.

The WEBDAV working group is addressing the standardization of
mechanisms for authors to use when submitting versions to the server.
DASL must provide mechanisms that can intelligently query on those
versions.

3.2 Result Record Definition

The client must be able to identify the properties or content to be
returned in the result records.

Search criteria and search result records are not required to overlap.
For example, a query might ask for "the authors of those documents
under 10K in size". In this case, the criterion relates only to the
size, but the desired result record contains only the author.

3.3 Scope

3.3.1 Scope Identification & Multiple Scopes

It must be possible for the client to specify a number of different,
unrelated URIs over which the search is to range.

3.3.2 Resource-Based Scopes

It should be possible to perform scoping within a resource. For
example, one may wish to limit a search to a single chapter within a
document.
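As an illustration of 3.2 and 3.3.1, the following Python sketch runs
the "authors of those documents under 10K in size" query over two
unrelated scope URIs. It is a sketch under stated assumptions, not a
proposed protocol: the Query class, the run_query function, the
in-memory store, scoping by URI prefix, reading "10K" as 10 * 1024
bytes, and including the URI in each result record are all
hypothetical choices.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Resource = Dict[str, object]  # hypothetical model: property name -> value


@dataclass(frozen=True)
class Query:
    scopes: Sequence[str]                 # URIs to search under (3.3.1)
    criteria: Callable[[Resource], bool]  # which resources match
    record_definition: Sequence[str]      # which properties to return (3.2)


def run_query(query: Query, store: Dict[str, Resource]) -> List[Dict[str, object]]:
    """Return one result record per matching resource, ordered by URI."""
    records = []
    for uri, props in sorted(store.items()):
        # Assumption: a resource is in scope if its URI starts with a scope URI.
        if not any(uri.startswith(scope) for scope in query.scopes):
            continue
        if query.criteria(props):
            record: Dict[str, object] = {"uri": uri}
            # Only the requested properties are returned, even though the
            # criteria may have examined other properties.
            for name in query.record_definition:
                record[name] = props.get(name)
            records.append(record)
    return records


def under_10k(props: Resource) -> bool:
    """Criterion on size only; 10K is assumed to mean 10 * 1024 bytes."""
    size = props.get("size")
    return isinstance(size, int) and size < 10 * 1024


authors_of_small_documents = Query(
    scopes=["http://foo/bar/", "http://blah/mumble/"],
    criteria=under_10k,
    record_definition=["author"],
)
```

The criterion reads only the size property, while each result record
carries only the author, which is the non-overlap described in 3.2.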
3.3.3 Depth

It must be possible for the client to specify the "depth" of a search
for a search scope URI.

Users often intend either to limit a search to the immediate children
of a container or to extend the search recursively to all of the
container's descendants. Furthermore, depth control is needed to
prevent servers from performing unnecessary work.

3.4 Search Criteria

3.4.1 Simple Terms

3.4.1.1 Exact Matching

A query term must be able to compare the entire value of a property to
some constant value.

3.4.1.2 Regular Expression Matching

A query term must be able to compare a property to an expression with
the expressive power of regular expressions.

The power and frequent use of the UNIX utility GREP highlight the
value of regular expressions for searching large bodies of content.

3.4.1.3 Property Comparisons

It must be possible to specify "equal to" and "not equal to" criteria
for all property values that can be compared.

It must be possible to support relative comparison operators (>, >=,
<=, and <) on those properties that can be ordered (for example, those
having numerical values).

Many common searches involve such comparisons. For example, a
stereotypical query might ask for "those documents under 10K in size"
or "those text files authored by Saveen".

DASL must support the ability to compare property values against
literal values, other property values, and expressions.

3.4.1.4 Content Comparisons

It must be possible to specify searches using content-based operators
such as NEAR, IN, CONTAINS, and LIKE.

It must be possible to specify how linguistic stemming, phonetic
searching, truncation, keyword expansion, and case-sensitivity will
play a role in the search.

It must be possible to specify the relevance and ranking criteria for
content-based searches.
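The simple terms of 3.4.1.1 through 3.4.1.3 can be sketched in Python
as small functions that each build one search term. This is an
illustration only: the function names (compare, compare_properties,
matches), the use of None to stand for Unknown, and the choice of
Python's re module as the regular expression dialect are assumptions,
not requirements of this document.

```python
import operator
import re
from typing import Callable, Dict, Optional

Resource = Dict[str, object]  # hypothetical model: property name -> value
Term = Callable[[Resource], Optional[bool]]  # None stands for Unknown

# The comparison operators listed in the terminology section.
COMPARISONS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def _apply(op: str, left: object, right: object) -> Optional[bool]:
    try:
        return bool(COMPARISONS[op](left, right))
    except TypeError:
        return None  # the two values cannot be ordered against each other


def compare(prop: str, op: str, literal: object) -> Term:
    """Term comparing the entire value of a property with a literal."""
    def term(resource: Resource) -> Optional[bool]:
        if prop not in resource:
            return None  # undefined property: result is Unknown
        return _apply(op, resource[prop], literal)
    return term


def compare_properties(left_prop: str, op: str, right_prop: str) -> Term:
    """Term comparing one property value with another property value."""
    def term(resource: Resource) -> Optional[bool]:
        if left_prop not in resource or right_prop not in resource:
            return None
        return _apply(op, resource[left_prop], resource[right_prop])
    return term


def matches(prop: str, pattern: str) -> Term:
    """Term testing a string property against a regular expression."""
    compiled = re.compile(pattern)

    def term(resource: Resource) -> Optional[bool]:
        value = resource.get(prop)
        if not isinstance(value, str):
            return None
        return compiled.search(value) is not None
    return term
```

For example, compare("size", "<", 10 * 1024) expresses "those documents
under 10K in size" (again reading 10K as 10 * 1024 bytes), and
compare("author", "==", "Saveen") expresses "authored by Saveen".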
3.4.1.5 Existence Assertions

It must be possible to test for the existence or non-existence of a
property.

3.4.2 Complex Expressions

3.4.2.1 Logical Boolean Operators

It must be possible to use the logical Boolean operators (AND, OR,
NOT) in the search criteria to combine search expressions.

Criteria often involve the evaluation of several conditions
simultaneously. For example, a stereotypical query might ask for
"those documents modified by user X within some period of time Y."
Boolean operations are necessary to support these criteria.

3.4.2.2 Undefined properties and values

The behavior of a query when properties or their values are undefined
must be specified.

Undefined properties are those that do not exist. Their role in query
evaluation needs to be specified. Undefined values can occur when
properties are calculated from expressions like "x/y" where y=0.

3.4.2.3 Sort Order

DASL must define a mechanism to allow clients to specify a sort order
for the result set.

3.4.3 Other Query Attributes

3.4.3.1 Maximum Result Set

It must be possible to indicate that the search result must not exceed
some fixed number of records.

3.4.3.2 Paged Results

It must be possible to request paged results.

3.5 Query Syntax

3.5.1 Standard Query Grammar

The DASL extensions must define a query grammar that provides simple
searching functionality.

For the sake of interoperability, DASL servers are expected to offer a
basic set of searching capabilities. Likewise, clients need a
standard, simple syntax by which to access those capabilities.

3.5.2 Support for Other Query Grammars

DASL extensions must allow servers to support other grammars.

A particular query grammar may not expose all of the useful searching
functionality of a server.
Clients should be allowed to query a server using any grammar that
takes advantage of those special server capabilities.

3.5.3 Natural Language Queries

It must be possible to support natural language queries.

3.6 Results

3.6.1 Standard format

DASL must define a standard format for search results.

For the sake of interoperability, it is desirable that server result
formats be standardized so that, regardless of the query syntax used,
clients are guaranteed to understand the results of a query.

3.6.2 Paged Results

DASL search results must be conducive to paged retrieval.

Paged retrieval is necessary if result sets are very large and clients
must also present a responsive interface to a user. In this scenario
clients need to access portions of the search result at specific
times. DASL search results must be defined so that paged search
results are possible.

3.7 Discovery Mechanisms

3.7.1 Grammar Discovery

It must be possible for clients to discover which query grammars a
server supports.

If a server is capable of supporting several search grammars, the
client needs to determine which grammars are supported.

3.7.2 Operator Discovery

It must be possible for a client to discover which operators are
available for a given query grammar.

3.7.3 Scope Information Discovery

It should be possible for a client to determine searching information
about a scope, if that information is available.

Examples of such information include which properties can be searched
in a scope, indexing statistics for the scope, etc.

3.8 Redirecting a Query

It must be possible for the server to refer the client to other
resources in order to continue a search.

For example, a client may ask the resource http://ren/stimpy to
perform a search over http://foo/bar and http://blah/mumble.
However, http://ren/stimpy may not be able to perform the search
itself and so will need to be able to inform the client that it should
submit its search request directly to http://foo/bar and
http://blah/mumble.

3.9 Hit Highlighting

DASL must define a mechanism to allow clients to request and receive
"hit highlighting".

Hit highlighting allows clients to provide visual cues to a user to
identify the segments in a text resource that cause it to match a
content-based query.

4 Authentication

The DASL specification should state how the DASL extensions to WEBDAV
interoperate with existing authentication schemes, and should make
recommendations for using those schemes.

5 Access Control

The DASL specification should state how the DASL extensions to WEBDAV
interoperate with the ACL mechanisms supported by WEBDAV, and should
make recommendations for using those mechanisms.

6 Internationalization

DASL extensions must describe how to perform searches on
internationalized content and properties.

Information intended for user comprehension must conform to the IETF
Character Set Policy [CHAR].

7 Related Work

Z39.50: "Information Retrieval (Z39.50): Application Service
Definition and Protocol Specification".
http://lcweb.loc.gov/z3950/agency/

Z39.50 Profile for Simple Distributed Search and Ranked Retrieval
http://lcweb.loc.gov/z3950/agency/profiles/zdsr.html

The STARTS Protocol
http://www-db.stanford.edu/~gravano/starts.html

The Harvest Information Discovery and Access System
http://mordor.transarc.com/afs/transarc.com/public/trg/Harvest/

8 References

[CHAR] H.T. Alvestrand, "IETF Policy on Character Sets and Languages",
June 1997, internet-draft, work-in-progress,
draft-alvestrand-charset-policy-02.txt.

[HTTP] R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, and T.
Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2068, U.C.
Irvine, DEC, MIT/LCS, January 1997.

[WEBDAV] Y. Y. Goland, E. J. Whitehead, Jr., A. Faizi, S. R. Carter,
D. Jensen, "Extensions for Distributed Authoring and Versioning on the
World Wide Web", October, 1997, internet-draft, work-in-progress,
draft-ietf-webdav-protocol-04.txt.

9 Authors' Addresses

Saveen Reddy
Microsoft Corporation
One Microsoft Way
Redmond WA, 9085-6933
EMail: saveenr@microsoft.com

Judith Slein
Xerox Corporation
800 Phillips Road 105-50C
Webster, NY 14580
EMail: slein@wrc.xerox.com