INTERNET DRAFT EXPIRES AUGUST 1998 INTERNET DRAFT American University in Bulgaria Peter Lazarov Lakov The Keyword Protocol (KP) Status of This Memo This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet- Drafts Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast). Distribution of this document is unlimited. i) Summary ii) Introduction iii) Overview of the job done by the KP server iv) Communication with the server. iv.i) Communication with remote computers (wanders, robots) iv.ii) Communication with local computers (users updating the information about their pages) v) Actions taken for each command vi) Messages i) Summary This document provides a proposal for a new Internet protocol, which should improve the relevance of results of search requests for information in the WWW. It contains a description of the basic model and the minimum set of commands necessary to run properly. It is neither meant to nor should be regarded as a definitive version, but rather as a suggestion to a new way of looking at the relations between the search engines on one hand and hosts running WEB servers on the other. ii) Introduction The purpose of this protocol is to improve the results of search requests sent to search engines like Alta-Vista, Lycos, Infoseek, etc. Whenever a user browsing the WEB sends a search request to any of the search engines, there is a high probability that the result will contain many irrelevant entries. This happens because of the way WEB wanderers, robots and spiders fill in and update their databases. Currently their algorithms for finding information is visiting different sites, browsing the file and taking out keywords which will be used in subsequent queries. The keywords taken out will depend on the algorithms of the robots, but generally they will be either too few, or too many. If robots parse only the title of a file, then surely many keywords, useful for finding the file, will not be available, for it is not possible to include them all in the title. If, on the other hand, the whole file is parsed, surely many keywords, which have nothing to do with the topic of the file, will be included, thus enabling this file to appear in search results for completely different topics. In any case, the owner of the file (or the WEB server supervisor) - the main person interested in assuring that the information will reach the target audience - has no power in assisting this process. His/her actions are reduced to the very passive role of only providing access to the data. The HTML tags “contents” and “keywords” can only partially alleviate this problem, because many of the HTML files, linked in a document already parsed by a robot, do not contain those tags. In this case the robot should decide whether keywords should be extracted from those files, or they should be disregarded. A search engine may extracts keywords from a file which was not intended to provide such - in this case it will fill its database with worthless data. Similarly, it may neglect files with some important data. In either cases, the owner of the files has very little power in communicating the precise information to the robots. The KP protocol will give the WEB server supervisor and each single user an opportunity for very close control of the information they provide for public access. Each user will be able to edit the exact keywords necessary to describe his/her files. The cornerstone of the suggestion is that all descriptions of the files will be handled by a central and well-known server, which will both increase accuracy and decrease the time necessary to browse a WEB server. iii) Overview of the job done by the KP server. The KP server operates on the client-server paradigm through a reliable TCP/IP byte stream using the ASCII character set. The server performs a listen on a well-known port and when a client requests a connection to that port, the server accepts the connection. Once it is created, the client starts sending commands to the server, which performs the action and returns a response and (when applicable) data. The response of the server may be of either predictable, or unpredictable line length. In case the an unpredictable line length answer, the last line contains only a full stop. Commands and replies are terminated by a new line character (more on the command syntax in part 4.) >From the point of view of the server, there are two types of objects which can contact the server - users and robots (from now on, till the end of the document, I shall refer to a robot as program written only to contact the KP server and update the databases of search engines. Do not confuse with previously mentioned robots, web wanders, crawlers.) The difference between the two is that first, users need to supply password, while the robot does not. Actually, the robot does not present any identification whatsoever, so any person without a user permission could login as a robot. There is no harm taken in this, because the information is meant for public use anyway. Second, users have permission to edit some of the information, while robots have only read only permissions. The KP server needs to handle a single copy of each of following files with the following proposed fields: File: DATA Fields: , , , File: PASSWORD Fields: , File: PATCH Fields: , File: USED_PATCH Fields: and many of the following tables: File: P_1, P_2, P_3, …, P_N Fields: , , , where the meaning of the fields for each table is as follows: DATA a unique record identifier (same for the other tables) the name of the user, to whom the file belongs the actual name of the file a string with keywords, separated by comma PASSWORD same as in table DATA the password for , used at login time. PATCH the name of the previous to last patch file the name of the last patch file USED the name of the patch files, already sent to robots. P_1, P_2, …P_N same as in DATA action to be taken when the patch file is merged into the database. N is for new (add new record with , and like those in the current table), D is for delete (Delete the records with equal to of the current table) same as in DATA same as in DATA File DATA contains information concerning all available files and their respective keywords. When a robot contacts KP server for the first time, it should first download the file DATA using the GETALL command (which sends back all the records of file DATA. Then, the robot can send the command NEXTPATCH a number of times until it records all the changes done to file DATA. The rule for generating a new patch file is simple: whenever a robot visits the last patch file, create a new patch file and use it to store all changes thereafter. Changes are made only by users (see above,) only with the commands ADDFILE and DELETEFILE. Whenever one of these two commands is used, the action taken is stored to the last, unvisited by robots, patch file. Each user can change only the files, which are referred to by DATA as his/her username. iv) Communication with the server. Each command should be terminated with CRLF characters. The space left blank between the commands and the parameters should be considered as white space. CRLF characters and white spaces are not shown explicitely in the description of the commands lest they become too overburdened. iv.i) Communication with remote computers (wanders, robots.) GETALL the server sends to the client all the records with the files and keywords. ACTION is N for all the records. GETPATCH the server sends to the client only the records from the file . NEXTPATCH the server sends to the client only the name of the next patch file. No records from the patch are actually transferred. If is empty, then the return value is the first patch of the whole database. iv.ii) Communication with local computers (those updating the files.) USER the client sends to the server the username of the person who wants to update the database. Username robot is reserved for robots, WEB wanderers and staff. PASS the client sends to the server the password of the user. ADDFILE the client sends to the server a line containing a filename (possibly the URL) and the keywords which should get in the search engines’ databases for that file. In case there is already an entry in the server’s database for that file the keywords should be replaced with the new ones. DELETEFILE the server deletes the entry for this file from its database. A user would typically want to do this operation if the file is deleted or moved to a new position. If the last patch file has been sent to at least one robot/wander (or there are no patch files yet), the server should create a new patch file and add the entry in it. LISTLIKE the server sends to the client a list of files matching the specified condition. If the is empty, the server sends all the files LISTMINE the server sends to the client only the files belonging to the user currently logged in. EXACT switch exact string comparison ON/OFF. When exact mode is ON, a string is equal to another only when they have the same sequence of characters. When exact is OFF, a string is equal to another when it is a sub-string of the second. All comparison is case-sensitive. When exact mode is OFF, HELP the server sends a short help message to the client about the command specified. If no command is specified, the server sends the list of all the commands. QUIT request that the connection with the server be terminated. v) Actions taken for each command. USER 1) Check if username is “robot”. If yes, then this is a robot. Let it in without asking for password and apply only the commands for robots. If it enters other commands, then send a message 205. 2) If the name is not "robot", check for it in the table password. If the name is not found, send a message 210. Else send message 101. PASS 1) Check whether user has already logged in. If yes, send a message 204. 2) If the user hasn’t logged in yet, check the password sent against the one stored in file password for that user. If different, send 207. Else send 102. ADDFILE 1) Add the record to table DATA. 2) Send message 103. 3) Check if there is already a patch file. * If no patch file exists yet, * add field to table PATCH with fields: DATA, P_1. * Create patch file P_1 and add the field to it. * If a patch file exists, *locate the last patch *If it has been sent to robots, * Add field to table PATCH with: P, P *Create patch file P and add the field to it. *If it hasn’t been sent to robots: * Add the field to the last patch file. 4) Send confirmation message 103 DELETEFILE 1) Locate the file and check that it belongs to the user. If the file is not present in the database, send message 208. If the file does not belong to the user, send message 209. In either of the two cases, goto step 6) 2) Delete the record from table DATA. 3) Send message 104. 4) Follow the same steps, as for ADDFILE step 3. 5) Send confirmation message. 6) End of DELETEFILE command. LISTLIKE 1) Send the user the files matching the pattern. LISTMINE 1) Send the user only the files belonging to him/her. EXACT 1) Change the comparison mode. 2) Send the user the message 105 or 106. HELP 1) Send to the user a help for the command. If is empty, send a list of all available commands. GETALL 1) Send all records of file DATA. GETPATCH 1) Send all records of file . NEXTPATCH 2) Locate for next patch file in PATCH table. 3) Send message 107. 4) Send name of next patch file. vi) Messages. 101 +OK Enter password. 102 +OK Welcome to KP version 1.0. 103 +OK Your file has been added to the database. 104 +OK The file has been deleted from the database. 105 +OK Exact mode is ON. 106 +OK Exact mode is OFF. 107 +OK The next patch file is: 201 -ERR Unknown command. 202 -ERR Command USER expected. 203 -ERR Command PASS expected. 204 -ERR You have already logged in. 205 -ERR Command not allowed for your class. 206 -ERR No patch file with this name. 207 -ERR Password incorrect. Try again. 208 -ERR File ID not found. 209 -ERR You have no write permission for this file. 210 -ERR User unknown. Author's Contact Informationa Peter Lakov lakov@wizcom.bg INTERNET DRAFT EXPIRES AUGUST 1998 INTERNET DRAFT