Network Working Group P. Faltstrom Internet-Draft Cisco Systems Inc Expires: May 15, 2002 November 14, 2001 Lookup, Search and Sort draft-faltstrom-i18n-sorting-00.txt Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http:// www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on May 15, 2002. Copyright Notice Copyright (C) The Internet Society (2001). All Rights Reserved. Abstract Lookup, searching and sorting is somewhat defined when using a limited set of characters, and especially only one or a few scripts. When operating in a more complex environment, the algorithms get more complex, and in many cases need information about locale. This document describes some of the problems which one encounter in an international environment. Faltstrom Expires May 15, 2002 [Page 1] Internet-Draft Lookup, Search and Sort November 2001 1. Introduction Characters mentioned in this document are identified by their position or code point in the Unicode character set [UCS]. The notation U+12AB, for example, indicates the character at the position 12AB (hexadecimal) in the [UCS]. It is strongly recommended that a [UCS] table is available for reference for the ideograph described. In various applications the communication involves a requester and a servant of information. The requester ask for something, and the servant send back the result of the operation which takes place on the servant side of the communication link. The communication between the requester and the servant is done according to the definitions of the protocol which is used. If the same request is coming from two different requesters, the result given back from the servant is supposed to be the same, or else the requester will be surprised. For example, if a requester is accessing the same servant from two different places in the world, the requester is supposed to get the same result back in both cases. Also, if two different requesters send the same request to the servant, they are supposed to get the same result back. The request itself include some definition created by the requester on what information he wants from the servant. This request is in the case of textual based data expressed through a textual string, which of course can be encoded according to the transport encoding mechanisms defined in the protocol used. This gives that the requester compose a request in his local environment, pass that request in the application protocol which is used to the servant, which is parsing the request, computes a result which is passed back to the requester. The problem with lookups, searches and sorting is that the request is created in the local environment of the requester, and the computation which leads to the result set which is passed back from the servant is computed in the local environment of the servant. In a global environment these local environments might be different. Examples of differences are character sets, scripts, and locale dependent issues like language and geographical locality. The environmental variables which can change the computational algorithm at the servent include: - Character set - Language - Geographical locality - Sort order definition(s) - Equality definition(s) - Case folding definition(s) Faltstrom Expires May 15, 2002 [Page 2] Internet-Draft Lookup, Search and Sort November 2001 2. Lookup A lookup is done when the requester know the exact key or value for the data which is requested. If the key is not exactly the right one, no data is returned. When using a distributed system, this implies also knowing what server to send the query to. In turn this implies that the key must include enough information to know where to send the query, and what the query should be (normally the key itself). Example of a system which is using lookup is the Domain Name System. 2.1 Problems match is a relativly simple operation, as the computation at the servent involves making a decision about what is "equal". This gives that the requester and the servant needs to agree on what is equal. Equality for text is defined by comparison of characters, and can be defined being case depenent or independent. A classical example of equivalence definition problems inside one script can be illustrated using the character 'A' with a ring above. This character can be encoded in Unicode in three ways: 1) U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE 2) U+0041 LATIN CAPITAL LETTER A followed by U+030A COMBINING RING ABOVE 3) U+212B ANGSTROM SIGN The equivalence between 1) and 3) is a singleton equivalence. The equivalence between 1) and 2) is a precomposed/decomposed equivalence, where 1) is the precomposed representation, and 2) is the decomposed representation. In all three cases, it is supposed to look the same for the reader. Case folding gives yet some other problems. In many parts of the world the character 'i' (U+0069, LATIN SMALL LETTER I) has as upper case equivalent the character 'I' (U+0049, LATIN CAPITAL LETTER I). That is not the case in Turkey. There they have a character U+0131 (LATIN SMALL LETTER DOTLESS I) which is lower case of U+0049 (LATIN CAPITAL LETTER I), and also a special upper case character which is the upper case equivalent of the character 'i'. Multiple scripts introduce yet more complicated problems if the computation on the servant side is not done carefully. Basically, one can get matches inside any of the scripts (i.e. both the request and the data at the servant is in the same script) but also between scripts. This because some requests might be possible to express in more than one script (i.e. the same character might exist in more than one script). Another issue have to do with "unification" of characters (or not) Faltstrom Expires May 15, 2002 [Page 3] Internet-Draft Lookup, Search and Sort November 2001 between different languages and scripts. In Unicode a unification between Chinese, Japanese and Korean (CJK) characters were used, which means in practice that the "same" character used in any of the languages are unified to the same codepoint. This implies that given a character and no context, one doesn't know if one talk about the Chinese version of the character or the Japanese. In some environments knowing the difference is important for example when trying to define matching rules between the simplified and traditional form of the chinese character when used in the chinese language. Faltstrom Expires May 15, 2002 [Page 4] Internet-Draft Lookup, Search and Sort November 2001 3. Sort If more than one item is given back in the result set, there is a need to decide in what order the results are presented in the result set. The decision on ordering results normally consists on two steps: deciding what characters to use for creating the order and calculation of the order between two characters. An example of the former is a decision that sorting of white pages records is to be done according to the last name of a person. The latter is the process of calculating the order between different last names. Example of a system which allowes sorting is a white pages service using the LDAP protocol. 3.1 Problems What characters to use when doing sorting changes between applications and between cultures. A phone book on Iceland is sorted on the first name of a person while one in Sweden is sorted on the last name for example. A list of email in an email application might be sorted in different ways (date, sender, subject) in the same session. What way to sort records is in many cases a request from the client to the server, and the server tries to fulfil that request just like it tries to fulfil the request of finding the data itself. This in many cases leads to a need for negotiation between client and server 031-934660. Gun Segerling When comparing the individual characters with each other, the collation order is normally dependent on the language which is used, which means that cross language collation order is very hard to define (it is in most cases undefined, so there is no "standard" to reference). For example, the order of the local characters used in Danish, Swedish and Norwegian is different in the three countries, even though they in some way can be seen as being the same. Faltstrom Expires May 15, 2002 [Page 5] Internet-Draft Lookup, Search and Sort November 2001 4. Search When doing searches, compared with lookups, just matching characters is not enough as a computation algorithm. The data which is to be searched have to be structured in such a way that (more or less) the matching itself can be done in an arbitrary part of the data, and not only the key of the record. This is in many cases when the datastore is centralized (i.e. all data is stored at one servant) about the same problems as in the case of lookups, but if the data is distributed among several servants, finding the exact record is a non-trivial task. In the case of systems (such as DNS) which uses lookups, the structure of the key is normally chosen in a way so the key itself gives information about what servant have information about the record. Searches might also include getting information back from more than one servant if the data is distributed. If the search term allow expressions like the regular expression "[a- c]" (match every character between 'a' and 'c', inclusive) the same problems occur like is described below regarding sorting, i.e. that a rank of each character is needed. The example given, LDAP, allow such expressions, so even though in LDAP a specific sort directive is not given, the computation which is taking place on the servant might in some cases involve sorting data. 4.1 Problems The problems with searching is about the same problem as when doing a lookup, with the difference that a lookup is with an exact key, while a search is done via approximate matchings. Either approximate on the level of character by character, or one level above, as a semantic match ("color" and "colour"). Author's Address Patrik Faltstrom Cisco Systems Inc 170 W Tasman Drive SJ-13/2 San Jose CA 95134 USA EMail: paf@cisco.com URI: http://www.cisco.com Faltstrom Expires May 15, 2002 [Page 6] Internet-Draft Lookup, Search and Sort November 2001 Appendix A. Appendix Faltstrom Expires May 15, 2002 [Page 7] Internet-Draft Lookup, Search and Sort November 2001 Full Copyright Statement Copyright (C) The Internet Society (2001). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Acknowledgement Funding for the RFC Editor function is currently provided by the Internet Society. Faltstrom Expires May 15, 2002 [Page 8]