INTERNET-DRAFT                                             B. Hernacki
Expires: April 4, 1997                                         B. Polk
                                         Netscape Communications, Inc.
                                                       October 4, 1996


                  NNTP Full-text Search Enhancements


1. Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ds.internic.net (US East Coast), nic.nordu.net (Europe),
ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

2. Abstract

This document describes a set of enhancements to the Network News
Transfer Protocol [NNTP-977] that allows full-text searching of news
articles across multiple newsgroups. This new search mechanism also
allows search criteria to be saved into search profiles. Articles
arriving on the server are checked against the profiles, and the
articles that match are collected together for the client.

The availability of the extensions described here will be advertised by
the server using the extension negotiation mechanism described in the
new NNTP protocol specification currently being developed [NNTP-NEW].

3. Introduction

The new SEARCH NNTP command is sent from the client to specify and
initiate a full-text search. The server constructs a "virtual
newsgroup" consisting of articles that matched the search criteria. The
virtual newsgroup acts in most ways like a normal newsgroup, allowing
access through the standard NNTP commands.

The new PROFILE command makes a virtual newsgroup permanent, and saves
the search criteria that generated the newsgroup. The server will show
newly arrived articles that match the search criteria as new articles
in the virtual newsgroup. This can be implemented on the server by
reexecuting the search periodically or by using a profile mechanism
that checks each article as it arrives.

Because the virtual newsgroup usually consists of articles from many
other newsgroups, clients might want to display it differently than a
non-virtual newsgroup. For example, clients may want to display the
source newsgroup of each article. To make this easier, and to resolve
some of the longstanding problems with XOVER, the OVER command is
introduced.

To control the headers returned by the OVER command, and to allow the
client and server to communicate information that does not fit through
other channels, the SET and GET commands have been added. SET allows
the client to send an attribute/value pair to the server. GET allows
the client to retrieve an attribute/value pair by attribute name.

In addition, the XPAT command is extended so that it can be used to
full-text search articles within a single newsgroup. Both the headers
and the body of the articles are searched.

3.1. New and Enhanced NNTP Commands

There are five new NNTP commands, three new options to the existing
LIST command, and enhancements to one existing command.

     * GET
     * SET
     * OVER
     * SEARCH
     * PROFILE
     * LIST SRCHHEADERS
     * LIST SEARCHES
     * LIST XACTIVE
     * XPAT

The GET and SET commands communicate per-session information between
the client and server.

The OVER command returns specific headers requested by the client. This
command functions much like the widely implemented XOVER command.

The SEARCH command runs a one-time search.

The PROFILE command converts search results into saved profiles and
manipulates them.

The LIST SRCHHEADERS command returns the headers that the server allows
in full-text searches.

The LIST XACTIVE command functions in most ways like the LIST ACTIVE
command. It differs in that it can be made to return information about
a single newsgroup, and it supports new newsgroup flags for the virtual
newsgroups. It can also return multiple newsgroup flags per newsgroup.

The LIST SEARCHES command allows the client to determine which
newsgroups are full-text indexed. Only these newsgroups are full-text
searchable.

The XPAT command has a simple extension to allow the header "TEXT".
This specifies a full-text (headers and body) search of the articles in
a single newsgroup.

4. Use of NNTP Extension Mechanism

The NNTP extension mechanism allows a server to describe its
capabilities. The following extensions are used to advertise the
capabilities described in this document.

4.1. SETGET Extension

The SETGET extension means that the server supports the SET and GET
commands.

4.2. OVER Extension

The OVER extension means that the server supports the OVER command. In
addition, any server that supports the OVER extension must also support
the SETGET extension, and must explicitly include SETGET in the list of
extensions it supports.

4.3. SEARCH Extension

The SEARCH extension means that the server supports the following
commands: SEARCH, LIST SEARCHES, LIST SRCHHEADERS, LIST XACTIVE. In
addition, any server that supports the SEARCH extension must also
support the OVER and SETGET extensions, and must explicitly include
OVER and SETGET in the list of extensions it supports.

4.4. PROFILE Extension

The PROFILE extension means that the server supports the PROFILE
command. In addition, any server that supports the PROFILE extension
must also support the SEARCH, OVER, and SETGET extensions, and must
explicitly include these extensions in the list of extensions it
supports.

4.5. XPATTEXT Extension

The XPATTEXT extension means that the server supports the TEXT header
in the XPAT command, as described by this document.

5. Command Descriptions

5.1. GET command

     GET [ATTRIBUTE [ATTRIBUTE]...]

GET allows the client to retrieve session-specific state information
from the server.

The only characters allowed in attributes or values are uppercase and
lowercase letters, numbers, and the characters "-_:". Case is not
significant in the attribute names. This information must not be
preserved by the client across server sessions. If no ATTRIBUTE is
specified, all of the attributes are returned by the server.

5.2. Responses

The server will either return the values (209), indicate a syntax error
(501), or indicate that the attribute was not recognized (409).

     209 values follow
     501 command syntax error
     409 unknown attribute

5.3. Example

     C: GET
     S: 209 values follow
     S: OVERFIELDS Subject:Newsgroups:From:References:Lines:Bytes:
     S: .

5.4. OVER command

     OVER [range]

The optional range argument may be any of the following:

     an article number
     an article number followed by a dash to indicate all following
     an article number followed by a dash followed by another article
        number

If no argument is specified, the information for the current article is
returned.

Successful responses start with a 224 response, followed by a line
listing the headers, followed by the overview information for all
matched messages. Once the output is complete, a period is sent on a
line by itself. If a newsgroup has not been selected, a 412 error
response is returned. If no articles are in the range specified, a 420
error response is returned.
If the client only has permission to transfer articles, a 502 response
will be returned.

By default, the headers returned are as specified in the OVERVIEW.FMT
file, and will therefore be the same as the server would return for an
XOVER command.

The SET command may be used to specify what headers are returned and in
what order. The SET attribute OVERFIELDS is used to specify the names
of the headers to return, with the headers concatenated together,
including the terminating ":". This use of SET for the OVERFIELDS
attribute must be supported. The server must honor this request and
return only the headers specified in subsequent OVER commands in that
session.

The number of lines in the article is available in the Lines: field.
The number of bytes in the article is available in the Bytes: field.

5.5. Responses

     224 data follows
     412 not in group
     420 no articles in range
     501 command syntax error
     502 no permission

5.6. Example

     C: SET OVERFIELDS Subject:From:Lines:
     S: 209 OK
     C: OVER
     S: 224 data follows
     S: Subject:From:Lines:
     S: Re: Long running subjects/tfrequent-poster@somewhere.com/t593
     S: .

5.7. SEARCH command

     SEARCH <query>

The specified query is executed, and the name of the resulting virtual
newsgroup is returned.

Search result virtual newsgroups are not permanent. The server must
keep them for at least ten minutes after the last client access to the
newsgroup, but after that time the server is free to remove them. This
ten minute period must be observed even if the client terminates its
session with the server. "Access to the newsgroup" is defined to mean
any command executed while the virtual newsgroup was the current
newsgroup.

The query is the full-text search criteria expressed in the syntax
described below.

5.7.1. Search Syntax

The search query syntax is derived from the search syntax defined for
the IMAP4 protocol. It is somewhat different because of the way
international character sets need to be encoded.
See RFC 1730 [IMAP4] for the IMAP4 search syntax.

One exception defined by this document to the 7bit character set
restriction for commands in [NNTP-977] is that the 8bit ISO-8859-1
character set is allowed in unencoded form in search strings. This is
allowed because it simplifies handling this widely used character set,
without requiring support of arbitrary binary data.

Here is a semi-formal definition of the search query syntax.

     query           = HEADER Newsgroups <group_pat> <search_term>
                       [<search_term>...]

     group_pat       = "<group_specifier>[,<group_specifier>...]"

     group_specifier = Either a single * for all searchable groups, a
                       full newsgroup name, or a part of the news
                       hierarchy, suffixed with .*.

     search_term     = TEXT <search_string>
                     | BODY <search_string>
                     | HEADER <header> <search_string>
                     | SENTBEFORE date
                     | SENTON date
                     | SENTAFTER date
                     | NOT <search_term>
                     | OR <search_term> <search_term>
                     | ( <search_term> [<search_term>...] )

     search_string   = "<simple_string>" | "<MIME-2String>"

     date            = Date in DD-MMM-YYYY form.

     simple_string   = US-ASCII or ISO-8859-1 text.

     MIME-2String    = A MIME-2 encoded string.

The double quotes are always required around the group pattern and the
search strings.

BODY requests a search through the body of the article, excluding the
headers.

TEXT requests a search through all indexed parts of the article,
including the body and all indexed headers.

If multiple search_terms are listed without being prefixed by the OR
operator, they are ANDed together.

SENTBEFORE, SENTON, and SENTAFTER may only be used if the Date: header
is indexed, as specified by the LIST SRCHHEADERS command.

The searches should be case insensitive.

5.7.2. Query Examples

     SEARCH HEADER Newsgroups "comp.*, alt.*" BODY "nntp"
            SENTAFTER 25-DEC-1995

     SEARCH HEADER Newsgroups "comp.*" HEADER From "Salz"
            NOT HEADER From "Bob"

     SEARCH HEADER Newsgroups "*" BODY "Election"
            ( OR TEXT "Bob" TEXT "Bill" )

     SEARCH HEADER Newsgroups "comp.lang.c++"
            TEXT "=?ISO-8859-1?Q?QPtext?="

5.8. Responses

A successful search returns the name of a newsgroup in which the server
has placed the results.
This newsgroup can then be treated like any other non-postable
newsgroup. If no articles matched the search criteria, an error (460)
is returned.

     260 groupname
     460 no matches found
     462 error performing search
     501 command syntax error

5.9. Example

     C: SEARCH header newsgroups "*" TEXT "internet"
     S: 260 virtual.group.temp5423

5.10. PROFILE command

     PROFILE NEW [profilenamehint] | RET | DEL

The PROFILE subcommands specify what operation to perform:

     NEW creates a new profile from the current search result.
     RET returns the search criteria of a profile.
     DEL deletes a profile.

5.10.1. NEW Subcommand

NEW converts a SEARCH result group into a profile. The profilenamehint
is used by the server as part of the name of the newsgroup. The client
must not make any assumptions that any part of the name hint will be
used. The name hint must be 32 characters or less, and consist of valid
newsgroup name characters, except that no "."s are allowed in the
profilenamehint.

5.10.2. RET Subcommand

RET retrieves the QUERY field stored on the server for the current
profile newsgroup.

5.10.3. DEL Subcommand

DEL deletes the current profile newsgroup. This command also indicates
that the group should be deleted, although the server does not have to
delete it immediately. The server must clear the current group context,
so that no commands that require a group context can be done.

5.11. NEW Subcommand Responses

If the profile newsgroup is created, the 260 response is returned,
including the name of the new newsgroup. If there's no current
newsgroup, the error response 412 is returned. If the current newsgroup
isn't a search result virtual newsgroup, the 461 error response is
returned.

5.12. RET Subcommand Responses

If the PROFILE RET is successful, the 261 response is returned,
including the criteria. If there's no current newsgroup, the error
response 412 is returned.
If the current newsgroup isn't a profile virtual newsgroup, the 461
error response is returned.

5.13. DEL Subcommand Responses

If the PROFILE DEL is successful, the 260 response is returned,
including the name of the deleted virtual newsgroup. If there's no
current newsgroup, the error response 412 is returned. If the current
newsgroup isn't a profile virtual newsgroup, the 461 error response is
returned.

5.14. Responses

     260 groupname
     261 returned search criteria
     412 not in group
     461 current group is not a correct virtual newsgroup
     462 profile error
     501 command syntax error

5.15. Example 1 - Create New Profile

     C: SEARCH header newsgroups "comp.*" TEXT "fortran"
     S: 260 virtual.search.temp3254
     C: GROUP virtual.search.temp3254
     S: 211 103 402 504 virtual.search.temp3254
     C: PROFILE NEW myprofile
     S: 260 virtual.profile.myprofile

5.16. Example 2 - Return Profile

     C: GROUP virtual.profile.myprofile
     S: 211 103 402 504 virtual.profile.myprofile
     C: PROFILE RET
     S: 261 TEXT searchstring

5.17. Example 3 - Delete Profile

     C: GROUP virtual.profile.myprofile
     S: 211 103 402 504 virtual.profile.myprofile
     C: PROFILE DEL
     S: 260 virtual.profile.myprofile deleted

5.18. SET command

     SET ATTRIBUTE <value> [ATTRIBUTE <value> ...]

SET allows the client to set session-specific state information. This
might include things like what language it wants to use, what version
of the protocol it wants, what type of authentication it will be using,
or optional article compressions. The only characters allowed in
attributes or values are uppercase and lowercase letters, numbers, and
the characters "-_:". Case is not significant in the attribute names.
This information must not be preserved by the server across client
sessions. If multiple attributes are specified and the server does not
recognize one or more of them, it must return an error and not set any
of them.

5.19. Responses

The server will either return that it set the value (209), return a
syntax error (501), or indicate that one or more of the attributes was
not recognized (409).

     209 OK
     501 command syntax error
     409 unknown attribute

5.20. Example

     C: SET LANG USEnglish
     S: 209 OK

5.21. LIST SRCHHEADERS

     LIST SRCHHEADERS

Returns a list of the headers that can be specified in full-text search
queries on the server.

5.22. Responses

Returns a list of headers, one per line. A "." on its own line
terminates the list.

5.23. Example

     C: LIST SRCHHEADERS
     S: 215 Data follows.
     S: From:
     S: Date:
     S: Subject:
     S: .

5.24. LIST SEARCHES

     LIST SEARCHES

Returns a list of strings that define which newsgroups are being
indexed by the news server and are thus available for searching. In
addition, the character sets allowed for each group are returned.

5.25. Responses

When there are newsgroups indexed, the server returns 215, followed by
each portion of the tree that is indexed. If all groups are indexed, a
line with "*" is returned. If only some parts of the newsgroup
hierarchy are indexed, each is identified by a hierarchy name suffixed
with ".*" (for example, comp.lang.*). Clients should not assume that
these will always be top level hierarchies. A "." on its own line
terminates the list.

The character sets allowed in full-text searches for each entry are
also returned. The character sets are identified by the name as defined
in [MIME-1].

5.26. Example

     C: LIST SEARCHES
     S: 215 Data follows.
     S: alt.* US-ASCII
     S: comp.lang.* US-ASCII ISO-8859-1 ISO-8859-2
     S: mcom.* ISO-8859-1
     S: .

5.27. LIST XACTIVE

     LIST XACTIVE [newsgroup]

The LIST XACTIVE command functions in most respects like the LIST
ACTIVE command. It differs in the following ways:

First, multiple flags may be returned. The flags are concatenated
together.

Second, LIST XACTIVE allows two new flags to be returned, "s" or "p",
indicating a search results virtual newsgroup or profile virtual
newsgroup, respectively. In both these cases the "n" or "y" flag is
also set, indicating whether the virtual group can be posted to. So the
flag field in the response line for a search result virtual group that
cannot be posted to will be "ns".

Third, other flags may be added in the future. Clients must ignore
flags they do not recognize.

5.28. Responses

The responses are exactly the same as the LIST ACTIVE command, except
for the new flags.

5.29. Example

     C: LIST XACTIVE virtual.guest.temp3453
     S: 215 Newsgroups in form "group high low flags".
     S: virtual.guest.temp3453 0000000000 0000000001 ns
     S: .

5.30. XPAT command enhancement

     XPAT header range|<message-id> pat [pat...]

The XPAT command is enhanced in a simple way: the new value TEXT will
be supported as a header when invoking the command. The TEXT header
requests a full-text search of the body and all headers of the
specified articles.

When TEXT is specified for the header, only a single "pat" is allowed,
and it must be a full word to search for, rather than a wildmat pattern
as allowed otherwise.

5.31. Responses

If TEXT isn't specified as the header, the response is the same as it
always has been for XPAT, with each result line containing the article
number and the value of the header that matched the pattern.

If the TEXT header is specified, the constant string "TEXT" is returned
in place of the value of the header that matched the pattern.

5.32. Example

     C: XPAT TEXT 1000-2000 searchtext
     S: 221 Header follows
     S: 1021 TEXT
     S: 1024 TEXT
     S: .

6. Security Considerations

The search and profile commands must be implemented in a way that does
not allow access to articles in newsgroups that a client is otherwise
restricted from reading due to access control rules.
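
One way a server might meet the access control requirement above is to
intersect the requested group pattern with the set of groups the client
may read before any search is run. The Python sketch below is
illustrative only and is not part of this specification. The function
names, the precomputed set of readable groups, and the choice that
"comp.*" does not match a group named exactly "comp" are all
assumptions made for the example.

```python
from typing import Iterable, List, Set


def matches_specifier(group: str, specifier: str) -> bool:
    """Return True if a newsgroup name matches one group_specifier.

    A specifier is "*" (all searchable groups), a hierarchy suffixed
    with ".*", or a full newsgroup name, as in Section 5.7.1.
    """
    specifier = specifier.strip()
    if specifier == "*":
        return True
    if specifier.endswith(".*"):
        # Keep the trailing dot so that "comp.*" matches "comp.lang.c"
        # but not "computers.misc".
        return group.startswith(specifier[:-1])
    return group == specifier


def searchable_groups(group_pat: str,
                      indexed_groups: Iterable[str],
                      readable_groups: Set[str]) -> List[str]:
    """Expand a group pattern to the groups this client may search.

    readable_groups is a hypothetical, server-maintained set of the
    groups the client is permitted to read.
    """
    specifiers = [s for s in group_pat.split(",") if s.strip()]
    return sorted(
        g for g in indexed_groups
        if g in readable_groups
        and any(matches_specifier(g, s) for s in specifiers)
    )
```

With this approach a restricted group is never searched, so it cannot
contribute articles to a search result or profile virtual newsgroup,
even when the client asks for "*".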

Clients will in some cases want to control access to virtual newsgroups
or profiles. No means to support this kind of protection is defined in
this document, as it requires access control infrastructure that is not
currently defined for NNTP.

The OVER command should be treated the same as the XOVER command for
access control and security purposes.

The other commands do not introduce any new security issues.

7. Bibliography

[NNTP-977] Kantor, B., and P. Lapsley, "Network News Transfer
     Protocol", RFC 977, February 1986.

[NNTP-NEW] Barber, S., "Network News Transfer Protocol",
     Internet-Draft, September 1996.

[IMAP4] Crispin, M., "Internet Message Access Protocol - Version 4",
     RFC 1730, December 1994.

[MIME-1] Borenstein, N., and N. Freed, "MIME (Multipurpose Internet
     Mail Extensions) Part One: Mechanisms for Specifying and
     Describing the Format of Internet Message Bodies", RFC 1521,
     Bellcore, Innosoft, September 1993.

[MIME-2] Moore, K., "MIME (Multipurpose Internet Mail Extensions) Part
     Two: Message Header Extensions for Non-ASCII Text", RFC 1522,
     University of Tennessee, September 1993.

8. Authors' Addresses

     Brian Hernacki
     Netscape Communications, Inc.
     685 W. Middlefield Road
     Mountain View, CA 94043
     USA

     Phone: +1 415-937-6738
     Email: bhern@netscape.com

     Ben Polk
     Netscape Communications, Inc.
     685 W. Middlefield Road
     Mountain View, CA 94043
     USA

     Phone: +1 415-937-3686
     Email: bpolk@netscape.com

This Internet Draft expires April 4, 1997.
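
As a non-normative illustration of the query syntax in Section 5.7.1
and the SEARCH responses in Section 5.8, the Python sketch below shows
how a client might assemble a SEARCH command line and interpret the
reply. All function names are hypothetical. The sketch covers only the
BODY, TEXT, and SENTAFTER terms, and it rejects search strings that
contain double quotes because this document does not define an escaping
rule for them; that restriction is an assumption of the example, not a
protocol requirement.

```python
from datetime import date
from typing import Optional, Sequence

# Fixed English month abbreviations, so the output does not depend on
# the process locale.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def format_date(d: date) -> str:
    """Format a date in the DD-MMM-YYYY form used by SENTAFTER etc."""
    return f"{d.day:02d}-{MONTHS[d.month - 1]}-{d.year:04d}"


def quote(value: str) -> str:
    """Wrap a group pattern or search string in the required quotes."""
    if any(ch in value for ch in '"\r\n'):
        raise ValueError("double quotes and line breaks are not supported")
    return f'"{value}"'


def build_search(groups: Sequence[str],
                 body: Optional[str] = None,
                 text: Optional[str] = None,
                 sent_after: Optional[date] = None) -> str:
    """Build a SEARCH command; the listed terms are implicitly ANDed."""
    parts = ["SEARCH", "HEADER", "Newsgroups", quote(",".join(groups))]
    if body is not None:
        parts += ["BODY", quote(body)]
    if text is not None:
        parts += ["TEXT", quote(text)]
    if sent_after is not None:
        parts += ["SENTAFTER", format_date(sent_after)]
    return " ".join(parts)


def parse_search_reply(line: str) -> Optional[str]:
    """Return the virtual newsgroup name, or None if nothing matched."""
    code, _, rest = line.strip().partition(" ")
    if code == "260" and rest:
        return rest.split()[0]
    if code == "460":
        return None
    raise RuntimeError(f"search failed: {line.strip()}")
```

A client would send the string returned by build_search, pass the first
reply line to parse_search_reply, and then select the returned group
with the standard GROUP command.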