IETF NNTP Working Group                            N. Ballou, Microsoft
Internet Draft                                    B. Hernacki, Netscape
Document: draft-ballou-nntpsrch-04.txt                  September, 1997

                    NNTP Full-text Search Extension

Status of this Memo

   This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts.

   Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress".

   To learn the current status of any Internet-Draft, please check the 1id-abstracts.txt listing contained in the Internet-Drafts Shadow Directories on ds.internic.net, nic.nordu.net, ftp.isi.edu, or munnari.oz.au.

   A revised version of this draft document will be submitted to the RFC editor as a Proposed Standard for the Internet Community. Discussion and suggestions for improvement are requested. This document will expire before March 1998. Distribution of this draft is unlimited.

1. Abstract

   This document describes a set of enhancements to the Network News Transfer Protocol [NNTP-977] that allow full-text searching of news articles in multiple newsgroups. The proposed SEARCH command supports functionality similar to the [IMAP4] SEARCH command, minus user-specific search keys (i.e., ANSWERED, DRAFT, FLAGGED, KEYWORD, NEW, OLD, RECENT, SEEN) and minus search keys based on headers that do not exist in news (i.e., CC, BCC, TO).

   The availability of the extensions described here will be advertised by the server using the extension negotiation mechanism described in the new NNTP protocol specification currently being developed [NNTP-NEW].

2. Conventions used in this document

   In examples, "C:" and "S:" indicate lines sent by the client and server respectively.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC-2119].

3. Introduction

   The NNTP SEARCH command is sent from the client to the server to specify and initiate a full-text search on articles in one or more newsgroups. The NNTP SEARCH command is similar to the [IMAP4] SEARCH command, except that the user property and mail-specific header search keys are not present in NNTP SEARCH. The result of an NNTP SEARCH is OVER data, as specified in [NNTP-NEW], for each article that satisfies the search criteria.

   In addition, the PAT command is extended so that it can be used to full-text search articles within a single newsgroup. Both the headers and the body of the articles are searched.

3.1. New and Enhanced NNTP Commands

   This document defines one new NNTP command, two new options to the existing LIST command, and an enhancement to one existing command:

      * SEARCH
      * LIST SRCHFIELDS
      * LIST SEARCHABLE
      * PAT

   The SEARCH command runs a one-time search, returning overview-like data.

   The LIST SRCHFIELDS command returns the fields that the server allows in full-text searches.

   The LIST SEARCHABLE command allows the client to determine which newsgroups are full-text searchable.

   The PAT command allows the pseudo-header ":TEXT". This specifies a full-text (headers and body) search of the articles in a single newsgroup.

4. Use of NNTP Extension Mechanism

   The NNTP extension mechanism allows a server to describe its capabilities. The following extensions advertise the capabilities defined in this document.

4.1. SEARCH Extension

   The SEARCH extension means that the server supports the following commands: SEARCH, LIST SEARCHABLE, LIST SRCHFIELDS.

4.2. PATTEXT Extension

   The PATTEXT extension means that the server supports the :TEXT header in the PAT command, as described by this document.

5. SEARCH Command

   Arguments:
      optional newsgroup specification
      searching criteria (one or more)

   Responses:
      224 overview information follows
      412 no newsgroup selected
      462 error performing search
      480 authentication required
      501 command syntax error
      502 no permission

   The SEARCH command searches the newsgroups for articles that match the given searching criteria. Searching criteria consist of one or more search keys.

   If there are articles that match the search criteria, the server responds with code 224 and returns OVER data for each matching article in a format similar to that described in [NNTP-NEW], with one exception: the article number field is changed to a format that supports searches over multiple newsgroups. The article ID field for SEARCH OVER data uses the format newsgroup:art-ID rather than just an article number as defined in [NNTP-NEW] (note: this is the same format used by the Xref header).

   A response of 501 indicates a syntax error in the search criteria. A response of 502 indicates that the user does not have permission to search one or more of the specified newsgroups. If the search criteria did not specify a newsgroup, and there is no current newsgroup (i.e., set using the NNTP GROUP command), then the server returns the error code 412, indicating that no newsgroup has been specified. A response of 462 indicates that the server encountered an error when processing the search.

   When multiple keys are specified, the result is the intersection (AND function) of all the messages that match those keys. For example, the criteria FROM "SMITH" SINCE 1-Feb-1994 refers to all articles from Smith that were placed in the newsgroup since February 1, 1994.

   A search key may also be a parenthesized list of one or more search keys (e.g. for use with the OR and NOT keys).
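
   As an illustration of the 224 response described above, the following Python sketch splits one response line into its newsgroup, article number, and remaining overview fields. The names SearchHit and parse_search_line are hypothetical and not part of this specification. The sketch assumes that fields are separated by tab characters, as in OVER data, and that art-ID is a decimal article number; it does not check how many fields follow the first one or what they contain.

```python
from typing import List, NamedTuple


class SearchHit(NamedTuple):
    """One line of a 224 SEARCH response (hypothetical helper type)."""
    newsgroup: str
    article_number: int
    fields: List[str]  # remaining overview fields, in the order sent


def parse_search_line(line: str) -> SearchHit:
    # Assumption: overview fields are separated by tab characters.
    parts = [part.strip() for part in line.rstrip("\r\n").split("\t")]
    # The first field is newsgroup:art-ID; split at the last colon.
    newsgroup, colon, number = parts[0].rpartition(":")
    if not colon or not newsgroup or not (number.isascii() and number.isdigit()):
        raise ValueError(f"expected newsgroup:art-ID, got {parts[0]!r}")
    return SearchHit(newsgroup, int(number), parts[1:])


# Abbreviated sample line, for illustration only.
hit = parse_search_line("comp.object:573\tRE: object-oriented langs\t4080\t33")
print(hit.newsgroup, hit.article_number, hit.fields)
```

   Splitting at the last colon rather than the first keeps the sketch independent of whatever characters a server permits in newsgroup names.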

   Server implementations MAY exclude [MIME-1] body parts with terminal content types other than TEXT and MESSAGE from consideration in SEARCH matching.

   The optional newsgroup specification consists of the word "IN" followed by either a wildcard character "*" - indicating a search over all newsgroups - or a comma-separated list of newsgroup names. A newsgroup name can end with the wildcard string ".*", indicating a search over a sub-hierarchy of the newsgroup name space. If no newsgroup specification is given, the search is over the current newsgroup. If there is no current newsgroup, the server returns the 412 error code.

   The ON, BEFORE, and SINCE search criteria use the same date as used in the NNTP NEWNEWS command in [NNTP-NEW] - the date the article arrived on the server. A server indicates support for the ON, BEFORE, and SINCE search criteria by listing :Date in the LIST SRCHFIELDS response.

   The defined search keys are as follows. Refer to the Formal Syntax section for the precise syntactic definitions of the arguments.

   <range>
      Articles with article numbers corresponding to the specified range.

   ALL
      All articles in the current newsgroup; the default initial key for ANDing.

   BEFORE
      Articles whose server arrival date is earlier than the specified date.

   BODY
      Articles that contain the specified string in the body of the message.

   FROM
      Articles that contain the specified string in the article structure's FROM field.

   HEADER
      Articles that have a header with the specified field-name (as defined in [RFC-822]) and that contains the specified string in the [RFC-822] field-body.

   LARGER
      Articles with a size larger than the specified number of octets.

   NOT
      Articles that do not match the specified search key.

   ON
      Articles whose server arrival date is within the specified date.

   OR
      Articles that match either search key.

   SENTBEFORE
      Articles whose [RFC-822] Date: header is earlier than the specified date.

   SENTON
      Articles whose [RFC-822] Date: header is within the specified date.

   SENTSINCE
      Articles whose [RFC-822] Date: header is within or later than the specified date.

   SINCE
      Articles whose server arrival date is within or later than the specified date.

   SMALLER
      Articles with a size smaller than the specified number of octets.

   SUBJECT
      Articles that contain the specified string in the envelope structure's SUBJECT field.

   TEXT
      Articles that contain the specified string in the header or body of the message.

   Example:

      C: SEARCH FROM "Smith" SINCE 1-Feb-1994
      S: 224 overview information follows
      S: comp.object:573 \t RE: object-oriented langs \t \
         "John Smith" \t Sun, 03 Nov 1996 \
         14:25:05 -0800 \t <01cbc9d5f3c70$eab9a2cd@xyz.com> \
         \t 4080 \t 33
      S: .

   Note: the fields in the OVER response are separated by tabs, shown as \t in the example above. A trailing \ indicates that the line is continued here for presentation only.

5.1.1. Search Formal Syntax

   The search query syntax is derived from the search syntax defined for the IMAP4 protocol. It is somewhat different because of the way international character sets need to be encoded.

   The following syntax specification uses the augmented Backus-Naur Form (BNF) as described in [ABNF].

   Except as noted otherwise, all alphabetic characters are case-insensitive. The use of upper or lower case characters to define token strings is for editorial clarity only. Implementations MUST accept these strings in a case-insensitive fashion.
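
   The way search keys combine into a command line can be sketched in Python as follows. The helper names (quote, all_of, either, negate, search_command) are hypothetical and not part of this specification. The sketch assumes that keys are joined by single spaces, that OR and NOT are written in prefix form, and that a quoted string escapes the double quote and the backslash with a backslash; it does not handle MIME-encoded strings.

```python
def quote(text: str) -> str:
    # Backslash-escape the two quoted specials, then wrap in double quotes.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def all_of(*keys: str) -> str:
    # A parenthesized list of keys; every key must match (AND).
    return "(" + " ".join(keys) + ")"


def either(left: str, right: str) -> str:
    # OR takes exactly two search keys.
    return f"OR {left} {right}"


def negate(key: str) -> str:
    # NOT takes exactly one search key.
    return f"NOT {key}"


def search_command(*keys: str, newsgroups=None) -> str:
    # Top-level keys are ANDed; newsgroups, if given, become the IN clause.
    if not keys:
        raise ValueError("at least one search key is required")
    scope = f"IN {','.join(newsgroups)} " if newsgroups else ""
    return "SEARCH " + scope + " ".join(keys)


# Reproduces the command from the example above:
# SEARCH FROM "Smith" SINCE 1-Feb-1994
print(search_command("FROM " + quote("Smith"), "SINCE 1-Feb-1994"))

# A search restricted to one sub-hierarchy, using OR and NOT.
print(search_command(
    either("SUBJECT " + quote("parser"), "BODY " + quote("parser")),
    negate("FROM " + quote("Smith")),
    newsgroups=["comp.lang.*"],
))
```

   Keywords are emitted in upper case for readability only; a conforming server accepts them in any case.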

   astring         ::= atom / string

   atom            ::= 1*ATOM_CHAR

   ATOM_CHAR       ::=

   atom_specials   ::= "," / "(" / ")" / SPACE / CTL / "*" /
                       quoted_specials

   CHAR            ::=

   CTL             ::=

   date            ::= date_text / <"> date_text <">

   date_day        ::= 1*2digit
                       ;; Day of month

   date_month      ::= "Jan" / "Feb" / "Mar" / "Apr" / "May" / "Jun" /
                       "Jul" / "Aug" / "Sep" / "Oct" / "Nov" / "Dec"

   date_text       ::= date_day "-" date_month "-" date_year

   date_year       ::= 4digit

   digit           ::= "0" / digit_nz

   digit_nz        ::= "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" /
                       "9"

   header_fld_name ::= sstring

   mstring         ::= A MIME encoded string surrounded by double quotes

   newsgroup       ::= atom [ ".*"]

   newsgroups      ::= "*" / newsgroup_list

   newsgroup_list  ::= newsgroup [ "," newsgroup_list]

   number          ::= 1*digit
                       ;; Unsigned 32-bit integer
                       ;; (0 <= n < 4,294,967,296)

   nz_number       ::= digit_nz *digit
                       ;; Non-zero unsigned 32-bit integer
                       ;; (0 < n < 4,294,967,296)

   QUOTED_CHAR     ::= / "\" quoted_specials

   quoted_specials ::= <"> / "\"

   range           ::= nz_number / nz_number "-" [ nz_number ]
                       ;; Identifies a range of Articles.

   search          ::= "SEARCH" SPACE ["IN" SPACE newsgroups SPACE]
                       1#search_key

   search_key      ::= "ALL" /
                       "BODY" SPACE sstring /
                       "FROM" SPACE sstring /
                       "ON" SPACE date /
                       "SINCE" SPACE date /
                       "BEFORE" SPACE date /
                       "SUBJECT" SPACE sstring /
                       "TEXT" SPACE sstring /
                       "HEADER" SPACE header_fld_name SPACE sstring /
                       "LARGER" SPACE number /
                       "NOT" SPACE search_key /
                       "OR" SPACE search_key SPACE search_key /
                       "SENTBEFORE" SPACE date /
                       "SENTON" SPACE date /
                       "SENTSINCE" SPACE date /
                       "SMALLER" SPACE number /
                       range /
                       "(" 1#search_key ")"

   SPACE           ::= 1*

   sstring         ::= astring / mstring

   string          ::= <"> *QUOTED_CHAR <">

   TEXT_CHAR       ::=

5.2. LIST SRCHFIELDS Command

   Arguments:
      none

   Responses:
      224 data follows

   The LIST SRCHFIELDS command returns a list of the fields that can be specified in full-text search queries on the server. The response is a list of searchable fields, one per line. A "." on its own line terminates the list.

   The fields are either article headers or non-header fields supported by the query syntax. The three currently defined non-header fields are ":Body", ":Text", and ":Date".

   ":Text" means all the searchable text in the article, and indicates that the "TEXT" keyword is supported in the search query language.

   ":Body" means the body of the article, excluding the headers, and indicates that the "BODY" keyword is supported in the search query language.

   ":Date" means the date at which an article arrived on a server - similar to the date used in the NNTP NEWNEWS command - and indicates that the "ON", "SINCE", and "BEFORE" keywords are supported in the search query language.

   The "TEXT" and "BODY" search query fields are optional, but the server must indicate whether they are supported or not in the LIST SRCHFIELDS response.

   Example:

      C: LIST SRCHFIELDS
      S: 224 Data follows.
      S: From
      S: Date
      S: Subject
      S: :Text
      S: .

5.3. LIST SEARCHABLE Command

   Arguments:
      none

   Responses:
      224 Data Follows

   The LIST SEARCHABLE command returns a list of strings that define which newsgroups are being indexed by the news server and are thus available for searching. In addition, the character sets allowed for each group are returned.

   When newsgroups are indexed, the server returns 224, followed by each portion of the tree that is indexed. If all groups are indexed, a line with "*" is returned. If only some parts of the newsgroup hierarchy are indexed, they are identified in the form .*. Clients should not assume that these will always be top level hierarchies. A "." on its own line terminates the list.

   Example:

      C: LIST SEARCHABLE
      S: 224 Data follows.
      S: alt.*
      S: comp.lang.*
      S: mcom.*
      S: .

5.4. PAT Command Enhancement

   Arguments:
      header range| [pat [pat...]]

   Responses:

   The PAT command is enhanced in a simple way: the new value ":TEXT" is supported as a header when invoking the command.
   The :TEXT header requests a full-text search of the body and all headers of the specified articles. Other than adding a new header name, the PAT command arguments are the same as specified in [NNTP-NEW].

   If :TEXT is not specified as the header, the response is the same as it always has been for PAT, with each result line containing the article number and the value of the header that matched the pattern.

   If the :TEXT header is specified, the constant string "TEXT" is returned in place of the value of the header that matched the pattern.

   Example:

      C: PAT :TEXT 1000-2000 searchtext
      S: 221 Header follows
      S: 1021 TEXT
      S: 1024 TEXT
      S: .

6. Security Considerations

   The search commands must be implemented in a way that does not allow access to articles in newsgroups that a client is otherwise restricted from reading due to access control rules.

9. References

   [ABNF]      DRUMS working group, Dave Crocker, Editor, "Augmented BNF for Syntax Specifications: ABNF", draft-drums-abnf-02.txt (work in progress), Internet Mail Consortium, April 1997.

   [IMAP4]     Crispin, M., "Internet Message Access Protocol - Version 4rev1", RFC 2060, December 1996.

   [MIME-1]    Borenstein, N., and N. Freed, "MIME (Multipurpose Internet Mail Extensions) Part One: Format of Internet Message Bodies", RFC 2045, November 1996.

   [NNTP-977]  Kantor, B., and P. Lapsley, "Network News Transfer Protocol", RFC 977, February 1986.

   [NNTP-NEW]  Barber, S., "Network News Transfer Protocol", Internet Draft, draft-ietf-nntpext-base-02.txt, September 1997.

   [RFC-2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", RFC 2119, Harvard University, March 1997.

10. Acknowledgments

   TBD

11. Authors' Addresses

   Nathaniel Ballou
   Microsoft
   One Microsoft Way
   Redmond, WA 98052
   Phone: +1 425-703-0574
   Email: NatBa@Microsoft.com

   Brian Hernacki
   Netscape Communications
   501 E. Middlefield Rd.
   Mountain View, CA 94043
   Phone: (650) 937-6738
   Email: bhern@netscape.com