Internet DRAFT - draft-xhshi-dns-search-caching

draft-xhshi-dns-search-caching



    Caching Mechanisms in Layered DNS Search Services. October 2002


Shi Xiaohong             Expires - April 2003                 [Page 1]


Internet Draft                                             Xiaohong SHI
                                                              Karen LIU
Document: draft-xhshi-dns-search-caching-00.txt             Inter China
                                                       Network Software
                                                               Co., Ltd
Expires: April 2003                                        October 2002


             Caching Mechanisms in Layered DNS Search Services

Status of this Memo

   This document is an Internet-Draft and is in full conformance with 
all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering 
Task Force (IETF), its areas, and its working groups. Note that other 
groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months 
and may be updated, replaced, or obsoleted by other documents at any 
time.  It is inappropriate to use Internet-Drafts as reference material 
or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice
   Copyright (C) The Internet Society (2002).  All Rights Reserved.


Table of Contents
1.  Abstract
2.  Conventions used in this document
3.  Searching query distribution in Keyword Services
4.  Caching mechanisms used in Keyword services
5.  Future consideration of caching mechanisms
6.  Conclusion
7.  References
8.  Acknowlegement
9.  Authors' Addresses


1. Abstract

John Klensin proposed a multi-layered search-based access model for the 
DNS to address more human-friendly naming and navigation need of the 
modern Internet [DNSSEARCH]. When deploying the DNS search system in the 
global Internet, stability and scalability are two most important 
factors no matter what protocol we finally use. This draft demonstrates 
that caching is a significant objective for a stable and scalable DNS 
search system. Particularly, this document analyzes statistical 
operational data collected from our Keyword service. The collected data 
clearly demonstrates the potential high impact of caching in the 
resolution process. Based on deployment experience, the document 
describes alternative caching methods. From there, basic caching 
requirements for DNS Search systems are discussed.

The goal of this document is not to propose some standard caching 
mechanisms or methods but to emphasize the necessity of caching for DNS 
Search systems. Furthermore, the authors believe that the proven high 
efficiency of caching should be an important consideration when 
designing the future lookup protocol for such systems.


2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
"SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this 
document are to be interpreted as described in [RFC2119 ].


3. Searching query distribution in Keyword Services

Searching and navigation are the mostly common used behavior on the 
Internet. To facilitate the search on the Internet, there are kinds of 
similar services implemented, such as Keyword service, specialized 
search service, yellow page services, and free-text search engines, 
corresponding to the different layers of the DNS-search framework. Since 
[DNSSEARCH] is to establish a multi-layered identifying and searching 
system, the user query characteristics of some existing search services 
will obviously influence the design of the whole system.

Based on the analysis of practical access log of 3721's Keyword service 
and some free-text search engines, it is found that the distribution of 
the search items complies with the well-known 80%-20% distribution rule, 
i.e., more than 80% of search traffic arises from less than 20% search 
items(or even more extreme than 80:20).

Statistic data of 3721's keyword service can explain what is above 
discussed. The table below is the search frequency distribution collected
from the log files of the searched items generated from the address bar
of web browsers during July 20th, 2002 to July 26th, 2002.

All the query keywords from the log files are filtered through a 
normalization process, thus figure out the equivalent keyword regardless 
of the format. The normalization process includes striping quotes, 
removing undesired characters such as control characters, turning 
multiple spaces into single space, lowering casing, and converting TC to 
SC[TC/SC], etc.(Note that the average total searched items per day does
not count in the keyword which is only searched once in a day)

    +---------------------------------------------------------------+
    |                  | Search traffic / day |  item%  | traffic%  |
    +---------------------------------------------------------------+
    |   Top5000 items  |       14184304       |  5.73%  |   89.17%  |
    +---------------------------------------------------------------+
    |   Top1000 items  |       12711044       |  1.15%  |   79.91%  |
    +---------------------------------------------------------------+
    |   Top500 items   |       12021089       |  0.57%  |   75.57%  |
    +---------------------------------------------------------------+
    |   Top100 items   |       10302625       |  0.11%  |   64.77%  |
    +---------------------------------------------------------------+
    |   Average total searched items / day    |        87205        |
    +---------------------------------------------------------------+
    |   Average total search traffic / day    |       15906821      |
    +---------------------------------------------------------------+

>From the above table, we can know that nearly 80% of the search traffic 
is due to less than 2% of the query words. Besides, for free-text search 
engine, we can also get the search item distribution because 3721's 
keyword service has been integrated into almost all the portal and local 
portal search engines in China. We get the same distribution and we 
believe that this discipline will also be effective in the future.

So the authors believe that there will be the similar situation when
implementing a DNS-searching system. Thus, whichever protocol we finally
use, caching will be highly efficient and necessary.


4. Caching mechanisms used in Keyword services

This section takes 3721 as an example to show some caching mechanisms 
used in keyword services, which will be possible options for future DNS-
searching systems.

4.1 Server side caching mechanisms

In implementation of the name resolving or data retrieving software on 
server side, some mature cache methods already exists.

(a) In-memory data storage or index

A very important goal of the resolution procee is to eliminate network  
delay and name server load from most requests by answering them from its  
cache of results. For conventional relational database system, even for 
exact match query, once the traffic or load exceeds some threshold, 
the overall performance of the system will decline sharply. To guarantee 
the high performance as well as the fuzzy search function, the resolving 
system of 3721's keyword service takes advantage of in-memory database. 
Almost all keywords and indexes based on which the exact and fuzzy match 
can be conducted are stored in memory. This kind of in-memory data 
storage and index structure can also be regarded as a caching mechanism. 
In fact, a resolving server configured with 1G byte memory can store 1 
million keywords and indexes, so most of the 28 million queries per day 
of 3721's keyword service are finished in memory, which guarantees the 
high performance of the system.

(b) Middle cache or proxy server

We can also setup cache or proxy servers between server and client. Many 
search engines take the similar way. Besides, some mature caching methods 
in other areas can also be used, e.g., using independent software or 
hardware for cache, such as the special caching devices of EMC, Network 
Appliance or CacheFlow.

In the global Internet environment, when we have very huge data volume,
we probably need more than one server to form a cluster for data 
partitioning, and we may also need several clusters for data mirroring. 
Under such circumstance, the cache server can also act as the inverse 
proxy inside or among the clusters, to fulfill the caching function 
together with other goals, such as client requests dispatching, load 
balance and fault tolerance.

4.2 Client side caching mechanisms

The typical caching method used in client software is to store the 
recently query results in cache. When a user query is matched with one 
record in the cache, then it is directly returned as the result, which 
will reduce the network traffic for the server. The records stored in 
client cache will be refreshed based on TTL(Time-to-live) set by the 
server.


Cache size and refresh frequency can both influence the hit rate of 
cache. In 3721's keyword client, these two parameters are carefully 
designed. Based on the analysis of access log of keyword resolve server
and analysis of the client buffer, we find that the 80%-20% discipline
is also effective for individual user. So we can use an appropriated
size for client cache, e.g.,200 to 300 keyword records. Of course, the
best size of client cache will vary with the popularity or query
frequency of the service. Server should be able to know such change
and can upgrade or notify the client to adjust the size.

For keyword service, because registered names are rarely changed, the 
TTL value can be set to a longer time, e.g., several days. Particularly, 
the TTL value could be set according to the frequency of changes of a 
registration record across the registry. For example, we could measure 
the time between changes to a database record (keyword, context, URI 
change), and plot that time distribution across the database. The curve 
will be a Gaussian curve with a mean "MEAN" and standard deviation 
"SIGMA" (or we will regress the curve to that). Then, we could set the 
TTL to: MEAN ¿C N * SIGMA (N>3), which would say that the probability 
that the cache is hit when the record has changed is pretty low. However, 
sometime we should ignore the TTL value and enforce the cache to refresh. 
In 3721's keyword client, some special way which is triggered by server 
will make part or all cached records expired.


5. Future consideration of caching mechanisms

The future DNS-searching system will be more complicated, various kinds 
of services will exist in the same or different layers and they may 
inter-operate with each other via a variety of protocols. Cache mechanism
will also be more complex. However, in the author's opinion, there at
least two basic requirements.

(a) caching for perfect or exact match result

To cache the individual result for lookup function. This requirement is 
discussed above. 

(b) caching for search result list

John Klensin pointed out in [DNSSEARCH] that DNS-searching system should 
not only support the lookup function, i.e., one-to-one match, but also 
support fuzzy match function which is necessary under some condition even
for layer 2 services. So the cache mechanism should include caching the
fuzzy search result, i.e., the record sets and the corresponding queries
should be cached. Sometimes, the user selection in the list should also
be cached to eliminate the need for further user interference. The design
for caching record sets is somewhat more complicated. The TTL value for a
record set will possibly be smaller than individually cached record, and
the TTL value for user selection in the record set is difficult to set.
It is up to different applications.

For fuzzy match result caching, it may be possible to consider introducing
a string normalization function that is much broader than the one for 
perfect match. For example, in English, the fuzzy normalization function 
would normalize "movies" into "movie" so that the two search queries 
("movie" and "movies") correspond to the same "fuzzy normalized" string 
("movie"). This is definitely a different normalization function than 
the one we would use for perfect-match (the perfect match normalization 
function would identify "movies" and "movie" as two different queries). 
Note that we could introduce as many normalization function (fuzzy level 
1, fuzzy level 2, etcí¡), depending on how fuzzy we want to be. For each 
normalization function, we would have a distinct cache. Each cache would 
map a normalized query to the set of records that are identical to that 
normalize query, according to that normalization function. That way, the 
cache structure does not change, only the normalization function does.


6. Conclusion

Based on the analysis of keyword usage statistic data in China, this
draft discussed the necessity of cache mechanism in DNS search system
proposed by John Klensin. The draft also lists some caching methods
currently used in Chinese keyword services. The validity of these
methods also demonstrate that implementing caching methodology in 
DNS-searching will have a good trade-off to make the whole system stable
and scalable while keeping the protocol simple.


7. Reference

[RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate 
             Requirement Levels", BCP 14, RFC 2119, March 1997

[DNSSEARCH]  John C. Klensin, "A Search-based access model for the DNS",  
             draft-klensin-dns-search-03.txt, May 2001, 

[TC/SC]      Xiaodong Lee & al. Traditional and Simplified Chinese 
             Conversion, draft-ietf-idn-tsconv-02.txt.


8. Acknowlegement

Discussions with a number of people have led to this document. Especially
some important suggestions have come from conversations with Nicolas Popp,
John Klensin, Patrik Faltstrom, Hongyi Zhou, Shu Cao and Jinqiang Gao.

9. Author's Address

Xiaohong SHI
Inter China Network Software CO., Ltd.
RM.406 He Qiao North Tower, 8A Guanghua Rd., Chaoyang Dist. Beijing, 
P.R.China 
100026
Phone: +86 10 65812445 ext. 625
E-mail: sxh@3721.com

Karen LIU
Inter China Network Software CO., Ltd.
RM.406 He Qiao North Tower, 8A Guanghua Rd., Chaoyang Dist. Beijing, 
P.R.China 
100026
Phone: +86 10 65812445 ext. 619
E-mail: karen@3721.com