TF-CACHE                                                Martin Hamilton
INTERNET-DRAFT                                  Loughborough University
                                                          Andrew Daviel
                                                     Vancouver Webpages
                                                           January 1999


                  Cachebusting - cause and prevention
                  draft-hamilton-cachebusting-01.txt


Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
   Directories on ftp.ietf.org (US East Coast), nic.nordu.net (Europe),
   ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

   Distribution of this memo is unlimited.  Editorial comments should be
   sent directly to the authors.  Technical discussion will take place
   on the mailing list of the TERENA Web Caching Task Force - TF-CACHE.
   For more information see .

   This Internet Draft expires July 1999.

Abstract

   Cachebusting is the sometimes deliberate, sometimes inadvertent,
   practice of defeating caching.  This document explains the nature of
   the problem in relation to proxy cache servers using the World-Wide
   Web's HTTP protocol, and outlines some simple measures which may be
   taken to make an HTTP based service more "cache friendly".  Since Web
   caching is still a novel concept, we also explain the basic
   principles behind it.  This document should be read by developers of
   HTTP based products and services - we assume that the reader is
   already familiar with HTTP.

1.
The rationale for Web Caching

   Caching is a technique widely used in both computer systems hardware
   and software to improve performance and work around bottlenecks.
   General examples include physical memory devoted to caching transient
   data on disk drives and controllers, and operating system features
   such as the directory name lookup cache.

   Web caching operates at a higher level, often referred to as
   "middleware".  This typically implies caching of transient WWW
   objects by the end user's Web browser, or using a separate "proxy
   cache" server which sits between the end user's browser and the
   "origin server" which they are trying to contact.  Figure 1
   illustrates this relationship.

   +---------+              +---------+                +---------+
   | End     | ---------->  | Proxy   | ---------->    | Origin  |
   | user's  |    HTTP      | cache   |  HTTP/FTP/..   |         |
   | browser | <----------  | server  | <----------    | server  |
   +---------+              +---------+                +---------+

           Figure 1 - a simple proxy cache configuration

   Proxy cache servers typically speak HTTP [1,2] to the end user's WWW
   browser, and a variety of protocols to the origin servers.  In
   addition to caching WWW objects, they may also elect to cache other
   information, such as reachability metrics (when choosing between
   multiple origin servers) and the results of domain name lookups.
   Recent developments have focussed on linking proxy cache servers
   together so as to pool their storage capacity - typically using the
   Internet Cache Protocol [3].  This is discussed further in [4].

   Proxy caches offer additional functionality above and beyond the WWW
   browser's own built-in cache, since cached objects may be shared with
   the entire population of users and with cooperating proxy cache
   servers.  By contrast, browser caches are typically private to the
   individual, or can only be shared with those browsers which have
   access to the filesystem on which the cached objects are found.
   Figure 2 illustrates the operation of the proxy cache server in the
   case that the requested WWW object (usually identified by its URL, or
   the URL plus the HTTP request headers sent by the WWW browser) has
   already been cached.

   +---------+              +---------+                +---------+
   | End     | ---------->  | Proxy   |  < No need >   | Origin  |
   | user's  |    HTTP      | cache   |  <   to    >   |         |
   | browser | <----------  | server  |  < contact >   | server  |
   +---------+              +---------+                +---------+

               Figure 2 - fetching a cached object

   A cache's effectiveness is usually measured in terms of its "hit
   rate" - the ratio of requests which can be satisfied using cached
   objects.  The goal of the cache administrator is to make this figure
   as high as possible, without serving a significant volume of stale
   material to the cache's users.  Cache hit rates of 40% to 50% for WWW
   related traffic are common, for example [5].

   Caching also helps to make more effective use of the available
   bandwidth by allowing TCP congestion control algorithms to work
   properly - conventional HTTP traffic takes the form of a very large
   number of short-lived TCP connections, which often defeats TCP
   "slow start" [6] on busy lines.  It follows that proxy caching
   should be highly attractive, on a cost/benefit basis, to Internet
   Service Providers and to the organisations which buy connectivity
   from them.

   Cache hits are typically delivered an order of magnitude faster than
   cache misses, since the objects requested do not have to be fetched
   from the origin server.  This means that a site which encourages
   caching can provide the end user with a much higher perceived quality
   of service, whilst at the same time getting better value for money
   from its leased line(s).

   The World-Wide Web community is standardising a new version of HTTP -
   1.1 - which specifically addresses a number of caching issues.  At
   the time of writing, this had yet to be widely deployed, and the
   specification was still being developed.
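The hit rate figure above is simply hits divided by total requests.  A
minimal sketch (the boolean access-log format is invented purely for
illustration):

```python
def hit_rate(outcomes):
    """Fraction of requests satisfied from the cache (hits / total)."""
    if not outcomes:
        return 0.0
    return sum(1 for was_hit in outcomes if was_hit) / len(outcomes)

# Hypothetical access log: True = served from cache,
# False = fetched from the origin server.
log = [True, False, True, True, False, False, True, False]
print(hit_rate(log))  # 0.5 - in the 40-50% range quoted above
```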
   In this document we only discuss the best of current practice.

2. The cachebusting problem

   Support in the HTTP protocol and its implementations for proxies and
   caching is something which has essentially been retro-fitted.  As a
   result, there are many common practices which are incompatible with
   it, and which either defeat caching completely or reduce the benefits
   which derive from it.  This is primarily an educational issue
   involving developers of HTTP based services and systems.

   Caching at the HTTP level can cause problems for services which make
   heavy use of usage statistics - e.g. to provide "hit counts" for
   advertisers.  Users of cached copies of an object are effectively
   invisible to the provider of the original service.  This may provide
   a strong motivation to defeat caching.

   There is also the case where a product comes with an out-of-the-box
   configuration which defeats caching, perhaps unintentionally on the
   part of the vendor or its developers.  If the product works for most
   users with few if any modifications to the default settings, there
   will be no incentive to dig deeper into its configuration
   possibilities.

3. How to be friendly to proxy cache servers

   We will go on to outline some simple measures which the developers of
   HTTP based systems and services can take to make their products more
   cache-friendly.

3.1 Tips for HTTP server administrators

   Use a server which supports HTTP 1.1 - this has a number of
   additional features to support caching.

   Send the Expires header on documents and images where feasible - this
   will help caches to decide when your objects are stale.

   Use an HTTP server which supports the GET method with the
   If-Modified-Since header - this will help browsers and proxy caches
   to figure out whether their cached copy of a file is out of date.

   Ensure that the time is set correctly on the server machine, e.g. via
   NTP [7], so that the timestamp information carried in the HTTP
   headers makes sense.
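The Expires, Last-Modified and If-Modified-Since advice above can be
sketched together in one place.  This is a minimal illustration, not a
production server: the page content and its timestamp are invented, and
a real HTTP/1.1 server handles far more than this.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime
from http.server import BaseHTTPRequestHandler

# Hypothetical static document and its modification time.
PAGE = b"<html><body>A cache-friendly page.</body></html>"
LAST_MODIFIED = datetime(1999, 1, 1, tzinfo=timezone.utc)

class CacheFriendlyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Conditional GET: if the client's copy is still current,
        # answer 304 Not Modified and send no body at all.
        ims = self.headers.get("If-Modified-Since")
        if ims:
            try:
                if parsedate_to_datetime(ims) >= LAST_MODIFIED:
                    self.send_response(304)
                    self.end_headers()
                    return
            except (TypeError, ValueError):
                pass  # unparsable date - fall through to a full response
        self.send_response(200)
        # Last-Modified lets caches validate their copy later; Expires
        # (one day ahead here) says how long the object stays fresh.
        self.send_header("Last-Modified",
                         format_datetime(LAST_MODIFIED, usegmt=True))
        self.send_header("Expires", format_datetime(
            datetime.now(timezone.utc) + timedelta(days=1), usegmt=True))
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)
```

A cache that revalidates with If-Modified-Since then receives a cheap
304 instead of the whole page.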
3.2 Tips for content providers (e.g. HTML authors)

   Encourage the sharing of links to common graphics and applets, so
   that only one URL is used for a given object.

   Use client-side imagemaps (USEMAP - [8]) where feasible, since
   server-side imagemaps generate HTTP redirects, which are typically
   uncacheable.

   Use trailing slashes (/) for directory names to avoid extra
   redirects.  Where you are using a file which is returned when the
   directory name is requested (typically index.html or index.htm),
   "./" can usually be written instead of referring to the file by name.

   Try to use a single name for a server in the hostname part of the
   URLs in the HTML which you create.

   Don't rename files to age them - give them unique names in the first
   place and update the links which point to them.

   Use the Internet domain name in the host component of the URLs you
   create, rather than the host's IP address.

   If you really want to count every access to a given page, embed a
   tiny non-cacheable image in it.  This will give you an access count
   for the page without requiring the whole thing to be downloaded again
   by each user of a given proxy cache.

3.3 Tips for dynamic content (e.g. CGI) developers

   Make results cacheable where practical:

   Use GET instead of POST for simple queries, since POST results
   aren't cached.

   Use the path component of the URL to pass information instead of
   QUERY_STRING - caches may treat objects with a ? in their URL as
   uncacheable.

   Use a directory name other than "cgi-bin", since caches can be
   expected to treat URLs containing this as uncacheable.

   Generate valid Last-Modified and Expires headers.

   Handle If-Modified-Since requests.

   Use applet and scripting technologies such as JavaScript or Java
   instead of CGI for form validation, where feasible.

   If you use cookies, try to restrict them to the portions of your
   server where they're essential, since objects returned with a
   Set-Cookie header are commonly treated as uncacheable.
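The "tiny non-cacheable image" counter suggested in section 3.2 can be
sketched as a CGI script.  The GIF bytes are a stock 1x1 transparent
image; the actual counting (logging the hit on the origin server) is
left out of the sketch.

```python
import sys

# A stock 1x1 transparent GIF (43 bytes).
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
         b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01"
         b"\x00\x00\x02\x02D\x01\x00;")

def respond(out):
    """Emit CGI headers marking the pixel uncacheable, then the pixel.

    Every page view then reaches the origin server, where it can be
    logged and counted, while the page itself remains cacheable.
    """
    out.write(b"Content-Type: image/gif\r\n")
    # The HTTP/1.0-era way to keep caches from storing the object;
    # note the warning elsewhere in section 3.3 about applying this
    # to anything larger than a counter pixel.
    out.write(b"Pragma: no-cache\r\n")
    out.write(b"Content-Length: %d\r\n\r\n" % len(PIXEL))
    out.write(PIXEL)

if __name__ == "__main__":
    respond(sys.stdout.buffer)
```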
   Be aware that cookies may not interact well with proxy cache
   servers.

   Try not to parse the HTTP User-Agent header to select browser
   specific capabilities, since the cached HTML will be browser
   specific, and may be returned to a browser which doesn't know what
   to do with it.  Use features like instead.

   Don't use server-side includes unless your server can send the
   Last-Modified HTTP header with them.

   Don't use redirects, since their results may be uncacheable.

   Try to keep the size and complexity of pages on secure servers to a
   minimum, since secure HTTP requests are not cached by proxy caches
   and may not be cached by many browsers.  Try to avoid using secure
   servers for general pages where feasible.

   Don't set the objects your server returns to expire immediately, or
   at some time in the recent past, unless you want to be held up to
   public ridicule!

   Don't use content negotiation until HTTP 1.1 is more widely
   deployed, since in HTTP/1.0 it interacts badly with proxy caches.

   Don't specify port 80 in the URL, e.g. when generating URLs
   programmatically.

   Don't use server modules or scripts to convert a document's
   character set on the server side.  Leave it to the client.

3.4 Tips for developers of stand-alone applications

   Implement proxy support.  Give users of your application the ability
   to configure proxying, preferably allowing for a different proxy
   server and port number on a protocol by protocol basis, and allowing
   for some Internet domains and/or IP addresses to be exempted from
   the proxy configuration.

   Make use of user/admin configured preferences for HTTP proxying
   which may already have been set up before your application is
   installed, where these are available.

   Ideally, any new URL protocol schemes, such as "urn:", should be
   passed to an HTTP proxy server, making it possible to support new
   protocols without having to upgrade individual software
   installations.

4.
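In section 3.4's terms, "preferences which may already have been set
up" usually means the conventional per-protocol environment variables
(http_proxy, ftp_proxy, no_proxy, ...).  A sketch of honouring them in
Python; the host and proxy names are invented for illustration:

```python
import urllib.request

def make_opener(host):
    """Build an opener that honours pre-existing proxy configuration.

    getproxies() reads the conventional per-protocol environment
    variables (http_proxy, ftp_proxy, ...); proxy_bypass() consults
    the exemption list (no_proxy) - together giving the per-protocol
    configuration and domain exemptions section 3.4 asks for.
    """
    if urllib.request.proxy_bypass(host):
        handler = urllib.request.ProxyHandler({})  # exempted: go direct
    else:
        handler = urllib.request.ProxyHandler(urllib.request.getproxies())
    return urllib.request.build_opener(handler)

# Usage (hypothetical hosts): an administrator who has set
#   http_proxy=http://cache.example:3128  no_proxy=intranet.example
# gets proxied requests everywhere except the exempted domain.
opener = make_opener("www.example.com")
# opener.open("http://www.example.com/") would now go via the proxy.
```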
Security considerations

   Cachebusting is clearly justified in those cases where the use of
   caching has, in itself, security and privacy implications.  The end
   user has no way of knowing what information is being logged, or
   where it will end up - e.g. bank account or credit card numbers.

   Proxy servers tend to subvert firewalls and access controls based on
   IP addresses and/or domain names.

   Proxy servers can be useful as a central mechanism for laundering
   incoming WWW traffic to (for example) remove or block offensive
   material, or to check applications and applets being downloaded for
   problems such as viruses and denial of service attacks.

5. Acknowledgements

   Thanks to Duane Wessels, Vinod Valloppilli, George Michaelson,
   Donald Neal, Ernst Heiri, Wojtek Sylwestrzak, Alan J. Flavell and
   Jens-S. Voeckler for their contributions to this document.

6. References

   [1] A. Luotonen and K. Altis, "World-Wide Web proxies", in WWW94
       Conference Proceedings (Elsevier), 1994.

   [2] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee,
       "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2068 (Proposed
       Standard), January 1997.

   [3] D. Wessels, K. Claffy, "Internet Cache Protocol (ICP), version
       2", RFC 2186 (Informational), September 1997.

   [4] D. Wessels, K. Claffy, "Application of Internet Cache Protocol
       (ICP), version 2", RFC 2187 (Informational), September 1997.

   [5] K. Claffy, "NLANR Caching Workshop Report", June 1997.
       <URL:http://ircache.nlanr.net/Cache/Workshop97/minutes.html>

   [6] W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast
       Retransmit, and Fast Recovery Algorithms", RFC 2001 (Proposed
       Standard), January 1997.

   [7] D. Mills, "Network Time Protocol (v3)", RFC 1305 (Proposed
       Standard), March 1992.

   [8] J. Seidman, "A Proposed Extension to HTML: Client-Side Image
       Maps", RFC 1980 (Informational), August 1996.

7.
Authors' addresses

   Martin Hamilton
   Department of Computer Science
   Loughborough University
   Leics. LE11 3TU, UK

   Email: martinh@gnu.org

   Andrew Daviel
   Vancouver Webpages
   Box 357, 185-9040 Blundell Road
   Richmond, BC V6Y 1K3, CA

   Email: andrew@vancouver-webpages.com

   This Internet Draft expires July 1999.