Caching Mechanisms in Layered DNS Search Services. October 2002 Shi Xiaohong Expires - April 2003 [Page 1] Internet Draft Xiaohong SHI Karen LIU Document: draft-xhshi-dns-search-caching-00.txt Inter China Network Software Co., Ltd Expires: April 2003 October 2002 Caching Mechanisms in Layered DNS Search Services Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Copyright Notice Copyright (C) The Internet Society (2002). All Rights Reserved. Table of Contents 1. Abstract 2. Conventions used in this document 3. Searching query distribution in Keyword Services 4. Caching mechanisms used in Keyword services 5. Future consideration of caching mechanisms 6. Conclusion 7. References 8. Acknowlegement 9. Authors' Addresses 1. Abstract John Klensin proposed a multi-layered search-based access model for the DNS to address more human-friendly naming and navigation need of the modern Internet [DNSSEARCH]. When deploying the DNS search system in the global Internet, stability and scalability are two most important factors no matter what protocol we finally use. This draft demonstrates that caching is a significant objective for a stable and scalable DNS search system. Particularly, this document analyzes statistical operational data collected from our Keyword service. The collected data clearly demonstrates the potential high impact of caching in the resolution process. Based on deployment experience, the document describes alternative caching methods. From there, basic caching requirements for DNS Search systems are discussed. The goal of this document is not to propose some standard caching mechanisms or methods but to emphasize the necessity of caching for DNS Search systems. Furthermore, the authors believe that the proven high efficiency of caching should be an important consideration when designing the future lookup protocol for such systems. 2. Conventions used in this document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119 ]. 3. Searching query distribution in Keyword Services Searching and navigation are the mostly common used behavior on the Internet. To facilitate the search on the Internet, there are kinds of similar services implemented, such as Keyword service, specialized search service, yellow page services, and free-text search engines, corresponding to the different layers of the DNS-search framework. Since [DNSSEARCH] is to establish a multi-layered identifying and searching system, the user query characteristics of some existing search services will obviously influence the design of the whole system. Based on the analysis of practical access log of 3721's Keyword service and some free-text search engines, it is found that the distribution of the search items complies with the well-known 80%-20% distribution rule, i.e., more than 80% of search traffic arises from less than 20% search items(or even more extreme than 80:20). Statistic data of 3721's keyword service can explain what is above discussed. The table below is the search frequency distribution collected from the log files of the searched items generated from the address bar of web browsers during July 20th, 2002 to July 26th, 2002. All the query keywords from the log files are filtered through a normalization process, thus figure out the equivalent keyword regardless of the format. The normalization process includes striping quotes, removing undesired characters such as control characters, turning multiple spaces into single space, lowering casing, and converting TC to SC[TC/SC], etc.(Note that the average total searched items per day does not count in the keyword which is only searched once in a day) +---------------------------------------------------------------+ | | Search traffic / day | item% | traffic% | +---------------------------------------------------------------+ | Top5000 items | 14184304 | 5.73% | 89.17% | +---------------------------------------------------------------+ | Top1000 items | 12711044 | 1.15% | 79.91% | +---------------------------------------------------------------+ | Top500 items | 12021089 | 0.57% | 75.57% | +---------------------------------------------------------------+ | Top100 items | 10302625 | 0.11% | 64.77% | +---------------------------------------------------------------+ | Average total searched items / day | 87205 | +---------------------------------------------------------------+ | Average total search traffic / day | 15906821 | +---------------------------------------------------------------+ >From the above table, we can know that nearly 80% of the search traffic is due to less than 2% of the query words. Besides, for free-text search engine, we can also get the search item distribution because 3721's keyword service has been integrated into almost all the portal and local portal search engines in China. We get the same distribution and we believe that this discipline will also be effective in the future. So the authors believe that there will be the similar situation when implementing a DNS-searching system. Thus, whichever protocol we finally use, caching will be highly efficient and necessary. 4. Caching mechanisms used in Keyword services This section takes 3721 as an example to show some caching mechanisms used in keyword services, which will be possible options for future DNS- searching systems. 4.1 Server side caching mechanisms In implementation of the name resolving or data retrieving software on server side, some mature cache methods already exists. (a) In-memory data storage or index A very important goal of the resolution procee is to eliminate network delay and name server load from most requests by answering them from its cache of results. For conventional relational database system, even for exact match query, once the traffic or load exceeds some threshold, the overall performance of the system will decline sharply. To guarantee the high performance as well as the fuzzy search function, the resolving system of 3721's keyword service takes advantage of in-memory database. Almost all keywords and indexes based on which the exact and fuzzy match can be conducted are stored in memory. This kind of in-memory data storage and index structure can also be regarded as a caching mechanism. In fact, a resolving server configured with 1G byte memory can store 1 million keywords and indexes, so most of the 28 million queries per day of 3721's keyword service are finished in memory, which guarantees the high performance of the system. (b) Middle cache or proxy server We can also setup cache or proxy servers between server and client. Many search engines take the similar way. Besides, some mature caching methods in other areas can also be used, e.g., using independent software or hardware for cache, such as the special caching devices of EMC, Network Appliance or CacheFlow. In the global Internet environment, when we have very huge data volume, we probably need more than one server to form a cluster for data partitioning, and we may also need several clusters for data mirroring. Under such circumstance, the cache server can also act as the inverse proxy inside or among the clusters, to fulfill the caching function together with other goals, such as client requests dispatching, load balance and fault tolerance. 4.2 Client side caching mechanisms The typical caching method used in client software is to store the recently query results in cache. When a user query is matched with one record in the cache, then it is directly returned as the result, which will reduce the network traffic for the server. The records stored in client cache will be refreshed based on TTL(Time-to-live) set by the server. Cache size and refresh frequency can both influence the hit rate of cache. In 3721's keyword client, these two parameters are carefully designed. Based on the analysis of access log of keyword resolve server and analysis of the client buffer, we find that the 80%-20% discipline is also effective for individual user. So we can use an appropriated size for client cache, e.g.,200 to 300 keyword records. Of course, the best size of client cache will vary with the popularity or query frequency of the service. Server should be able to know such change and can upgrade or notify the client to adjust the size. For keyword service, because registered names are rarely changed, the TTL value can be set to a longer time, e.g., several days. Particularly, the TTL value could be set according to the frequency of changes of a registration record across the registry. For example, we could measure the time between changes to a database record (keyword, context, URI change), and plot that time distribution across the database. The curve will be a Gaussian curve with a mean "MEAN" and standard deviation "SIGMA" (or we will regress the curve to that). Then, we could set the TTL to: MEAN ¿C N * SIGMA (N>3), which would say that the probability that the cache is hit when the record has changed is pretty low. However, sometime we should ignore the TTL value and enforce the cache to refresh. In 3721's keyword client, some special way which is triggered by server will make part or all cached records expired. 5. Future consideration of caching mechanisms The future DNS-searching system will be more complicated, various kinds of services will exist in the same or different layers and they may inter-operate with each other via a variety of protocols. Cache mechanism will also be more complex. However, in the author's opinion, there at least two basic requirements. (a) caching for perfect or exact match result To cache the individual result for lookup function. This requirement is discussed above. (b) caching for search result list John Klensin pointed out in [DNSSEARCH] that DNS-searching system should not only support the lookup function, i.e., one-to-one match, but also support fuzzy match function which is necessary under some condition even for layer 2 services. So the cache mechanism should include caching the fuzzy search result, i.e., the record sets and the corresponding queries should be cached. Sometimes, the user selection in the list should also be cached to eliminate the need for further user interference. The design for caching record sets is somewhat more complicated. The TTL value for a record set will possibly be smaller than individually cached record, and the TTL value for user selection in the record set is difficult to set. It is up to different applications. For fuzzy match result caching, it may be possible to consider introducing a string normalization function that is much broader than the one for perfect match. For example, in English, the fuzzy normalization function would normalize "movies" into "movie" so that the two search queries ("movie" and "movies") correspond to the same "fuzzy normalized" string ("movie"). This is definitely a different normalization function than the one we would use for perfect-match (the perfect match normalization function would identify "movies" and "movie" as two different queries). Note that we could introduce as many normalization function (fuzzy level 1, fuzzy level 2, etcí¡), depending on how fuzzy we want to be. For each normalization function, we would have a distinct cache. Each cache would map a normalized query to the set of records that are identical to that normalize query, according to that normalization function. That way, the cache structure does not change, only the normalization function does. 6. Conclusion Based on the analysis of keyword usage statistic data in China, this draft discussed the necessity of cache mechanism in DNS search system proposed by John Klensin. The draft also lists some caching methods currently used in Chinese keyword services. The validity of these methods also demonstrate that implementing caching methodology in DNS-searching will have a good trade-off to make the whole system stable and scalable while keeping the protocol simple. 7. Reference [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997 [DNSSEARCH] John C. Klensin, "A Search-based access model for the DNS", draft-klensin-dns-search-03.txt, May 2001, [TC/SC] Xiaodong Lee & al. Traditional and Simplified Chinese Conversion, draft-ietf-idn-tsconv-02.txt. 8. Acknowlegement Discussions with a number of people have led to this document. Especially some important suggestions have come from conversations with Nicolas Popp, John Klensin, Patrik Faltstrom, Hongyi Zhou, Shu Cao and Jinqiang Gao. 9. Author's Address Xiaohong SHI Inter China Network Software CO., Ltd. RM.406 He Qiao North Tower, 8A Guanghua Rd., Chaoyang Dist. Beijing, P.R.China 100026 Phone: +86 10 65812445 ext. 625 E-mail: sxh@3721.com Karen LIU Inter China Network Software CO., Ltd. RM.406 He Qiao North Tower, 8A Guanghua Rd., Chaoyang Dist. Beijing, P.R.China 100026 Phone: +86 10 65812445 ext. 619 E-mail: karen@3721.com