TF-CACHE                                                Martin Hamilton
INTERNET-DRAFT                                  Loughborough University
                                                          Andrew Daviel
                                                     Vancouver Webpages
                                                           January 1999


                  Cachebusting - cause and prevention
                  draft-hamilton-cachebusting-01.txt


Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
   Directories on ftp.ietf.org (US East Coast), nic.nordu.net (Europe),
   ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

   Distribution of this memo is unlimited.  Editorial comments should be
   sent directly to the authors.  Technical discussion will take place
   on the mailing list of the TERENA Web Caching Task Force - TF-CACHE.
   For more information see .

   This Internet Draft expires July 1999.

Abstract

   Cachebusting is the sometimes deliberate, sometimes inadvertent,
   practice of defeating caching.  This document explains the nature of
   the problem in relation to proxy cache servers using the World-Wide
   Web's HTTP protocol, and outlines some simple measures which may be
   taken to make an HTTP based service more "cache friendly".  Since Web
   caching is still a novel concept, we also explain the basic
   principles behind it.  This document should be read by developers of
   HTTP based products and services - we assume that the reader is
   already familiar with HTTP.

1.
The rationale for Web Caching

   Caching is a technique widely used in both computer systems hardware
   and software to improve performance and work around bottlenecks.
   General examples include physical memory devoted to caching transient
   data on disk drives and controllers, and operating system features
   such as the directory name lookup cache.

   Web caching operates at a higher level, often referred to as
   "middleware".  This typically implies caching of transient WWW
   objects by the end user's Web browser, or using a separate "proxy
   cache" server which sits between the end user's browser and the
   "origin server" which they are trying to contact.  Figure 1
   illustrates this relationship.

   +---------+              +---------+                +---------+
   | End     | ---------->  | Proxy   | ---------->    | Origin  |
   | user's  |    HTTP      | cache   |  HTTP/FTP/..   |         |
   | browser | <----------  | server  | <----------    | server  |
   +---------+              +---------+                +---------+

           Figure 1 - a simple proxy cache configuration

   Proxy cache servers typically speak HTTP [1,2] to the end user's WWW
   browser, and a variety of protocols to the origin servers.  In
   addition to caching WWW objects, they may also elect to cache other
   information, such as reachability metrics (when choosing between
   multiple origin servers) and the results of domain name lookups.
   Recent developments have focussed on linking proxy cache servers
   together so as to pool their storage capacity - typically using the
   Internet Cache Protocol [3].  This is discussed further in [4].

   Proxy caches offer additional functionality above and beyond the WWW
   browser's own built-in cache, since cached objects may be shared with
   the entire population of users and with cooperating proxy cache
   servers.  By contrast, browser caches are typically private to the
   individual, or can only be shared with those browsers which have
   access to the filesystem on which the cached objects are found.
   Figure 2 illustrates the operation of the proxy cache server in the
   case that the requested WWW object (usually identified by its URL, or
   the URL plus the HTTP request headers sent by the WWW browser) has
   already been cached.

   +---------+              +---------+                +---------+
   | End     | ---------->  | Proxy   |  < No need >   | Origin  |
   | user's  |    HTTP      | cache   |  <   to    >   |         |
   | browser | <----------  | server  |  < contact >   | server  |
   +---------+              +---------+                +---------+

               Figure 2 - fetching a cached object

   A cache's effectiveness is usually measured in terms of its "hit
   rate" - the ratio of requests which can be satisfied using cached
   objects.  The goal of the cache administrator is to make this figure
   as high as possible, without serving a significant volume of stale
   material to the cache's users.  Cache hit rates of 40% to 50% for WWW
   related traffic are common, for example [5].

   Caching also helps to make more effective use of the available
   bandwidth by allowing TCP congestion control algorithms to work
   properly - conventional HTTP traffic takes the form of a very large
   number of short-lived TCP connections, which often defeats TCP
   "slow start" [6] on busy lines.  It follows that proxy caching
   should be highly attractive, on a cost/benefit basis, to Internet
   Service Providers and to the organisations which buy connectivity
   from them.

   Cache hits are typically delivered an order of magnitude faster than
   cache misses, since the objects requested do not have to be fetched
   from the origin server.  This means that a site which encourages
   caching can provide the end user with a much higher perceived quality
   of service, whilst at the same time getting better value for money
   from its leased line(s).

   The World-Wide Web community is standardising a new version of HTTP -
   1.1 - which specifically addresses a number of caching issues.  At
   the time of writing, this had yet to be widely deployed, and the
   specification was still being developed.
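The hit rate figure above is simply hits divided by total requests.  A
minimal sketch (the boolean access-log format is invented purely for
illustration):

```python
def hit_rate(outcomes):
    """Fraction of requests satisfied from the cache (hits / total)."""
    if not outcomes:
        return 0.0
    return sum(1 for was_hit in outcomes if was_hit) / len(outcomes)

# Hypothetical access log: True = served from cache,
# False = fetched from the origin server.
log = [True, False, True, True, False, False, True, False]
print(hit_rate(log))  # 0.5 - in the 40-50% range quoted above
```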
   In this document we only discuss the best of current practice.

2. The cachebusting problem

   Support in the HTTP protocol and its implementations for proxies and
   caching is something which has essentially been retro-fitted.  As a
   result, there are many common practices which are incompatible with
   it, and which either defeat caching completely or reduce the benefits
   which derive from it.  This is primarily an educational issue
   involving developers of HTTP based services and systems.

   Caching at the HTTP level can cause problems for services which make
   heavy use of usage statistics - e.g. to provide "hit counts" for
   advertisers.  Users of cached copies of an object are effectively
   invisible to the provider of the original service.  This may provide
   a strong motivation to defeat caching.

   There is also the case where a product comes with an out-of-the-box
   configuration which defeats caching, perhaps unintentionally on the
   part of the vendor or its developers.  If the product works for most
   users with few if any modifications to the default settings, there
   will be no incentive to dig deeper into its configuration
   possibilities.

3. How to be friendly to proxy cache servers

   We will go on to outline some simple measures which the developers of
   HTTP based systems and services can take to make their products more
   cache-friendly.

3.1 Tips for HTTP server administrators

   Use a server which supports HTTP 1.1 - this has a number of
   additional features to support caching.

   Send the Expires header on documents and images where feasible - this
   will help caches to decide when your objects are stale.

   Use an HTTP server which supports the GET method with the
   If-Modified-Since header - this will help browsers and proxy caches
   to figure out whether their cached copy of a file is out of date.

   Ensure that the time is set correctly on the server machine, e.g. via
   NTP [7], so that the timestamp information carried in the HTTP
   headers makes sense.
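The Expires, Last-Modified and If-Modified-Since advice above can be
sketched together in one place.  This is a minimal illustration, not a
production server: the page content and its timestamp are invented, and
a real HTTP/1.1 server handles far more than this.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime
from http.server import BaseHTTPRequestHandler

# Hypothetical static document and its modification time.
PAGE = b"<html><body>A cache-friendly page.</body></html>"
LAST_MODIFIED = datetime(1999, 1, 1, tzinfo=timezone.utc)

class CacheFriendlyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Conditional GET: if the client's copy is still current,
        # answer 304 Not Modified and send no body at all.
        ims = self.headers.get("If-Modified-Since")
        if ims:
            try:
                if parsedate_to_datetime(ims) >= LAST_MODIFIED:
                    self.send_response(304)
                    self.end_headers()
                    return
            except (TypeError, ValueError):
                pass  # unparsable date - fall through to a full response
        self.send_response(200)
        # Last-Modified lets caches validate their copy later; Expires
        # (one day ahead here) says how long the object stays fresh.
        self.send_header("Last-Modified",
                         format_datetime(LAST_MODIFIED, usegmt=True))
        self.send_header("Expires", format_datetime(
            datetime.now(timezone.utc) + timedelta(days=1), usegmt=True))
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)
```

A cache that revalidates with If-Modified-Since then receives a cheap
304 instead of the whole page.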
3.2 Tips for content providers (e.g. HTML authors)

   Encourage the sharing of links to common graphics and applets, so
   that only one URL is used for a given object.

   Use client-side imagemaps (USEMAP - [8]) where feasible, since
   server-side imagemaps generate HTTP redirects, which are typically
   uncacheable.

   Use trailing slashes (/) for directory names to avoid extra
   redirects.  Where you are using a file which is returned when the
   directory name is requested (typically index.html or index.htm),
   "./" can usually be written instead of referring to the file by name.

   Try to use a single name for a server in the hostname part of the
   URLs in the HTML which you create.

   Don't rename files to age them - give them unique names in the first
   place and update the links which point to them.

   Use the Internet domain name in the host component of the URLs you
   create, rather than the host's IP address.

   If you really want to count every access to a given page, embed a
   tiny non-cacheable image in it.  This will give you an access count
   for the page without requiring the whole thing to be downloaded again
   by each user of a given proxy cache.

3.3 Tips for dynamic content (e.g. CGI) developers

   Make results cacheable where practical:

   Use GET instead of POST for simple queries, since POST results
   aren't cached.

   Use the path component of the URL to pass information instead of
   QUERY_STRING - caches may treat objects with a ? in their URL as
   uncacheable.

   Use a directory name other than "cgi-bin", since caches can be
   expected to treat URLs containing this as uncacheable.

   Generate valid Last-Modified and Expires headers.

   Handle If-Modified-Since requests.

   Use applet and scripting technologies such as JavaScript or Java
   instead of CGI for form validation, where feasible.

   If you use cookies, try to restrict them to the portions of your
   server where they're essential, since objects returned with a
   Set-Cookie header are commonly treated as uncacheable.
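The "tiny non-cacheable image" counter suggested in section 3.2 can be
sketched as a CGI script.  The GIF bytes are a stock 1x1 transparent
image; the actual counting (logging the hit on the origin server) is
left out of the sketch.

```python
import sys

# A stock 1x1 transparent GIF (43 bytes).
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
         b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01"
         b"\x00\x00\x02\x02D\x01\x00;")

def respond(out):
    """Emit CGI headers marking the pixel uncacheable, then the pixel.

    Every page view then reaches the origin server, where it can be
    logged and counted, while the page itself remains cacheable.
    """
    out.write(b"Content-Type: image/gif\r\n")
    # The HTTP/1.0-era way to keep caches from storing the object;
    # note the warning elsewhere in section 3.3 about applying this
    # to anything larger than a counter pixel.
    out.write(b"Pragma: no-cache\r\n")
    out.write(b"Content-Length: %d\r\n\r\n" % len(PIXEL))
    out.write(PIXEL)

if __name__ == "__main__":
    respond(sys.stdout.buffer)
```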
   Be aware that cookies may not interact well with proxy cache
   servers.

   Try not to parse the HTTP User-Agent header to select browser
   specific capabilities, since the cached HTML will be browser
   specific, and may be returned to a browser which doesn't know what
   to do with it.  Use features like instead.

   Don't use server-side includes unless your server can send the
   Last-Modified HTTP header with them.

   Don't use redirects, since their results may be uncacheable.

   Try to keep the size and complexity of pages on secure servers to a
   minimum, since secure HTTP requests are not cached by proxy caches
   and may not be cached by many browsers.  Try to avoid using secure
   servers for general pages where feasible.

   Don't set the objects your server returns to expire immediately, or
   at some time in the recent past, unless you want to be held up to
   public ridicule!

   Don't use content negotiation until HTTP 1.1 is more widely
   deployed, since in HTTP/1.0 it interacts badly with proxy caches.

   Don't specify port 80 in the URL, e.g. when generating URLs
   programmatically.

   Don't use server modules or scripts to convert a document's
   character set on the server side.  Leave it to the client.

3.4 Tips for developers of stand-alone applications

   Implement proxy support.  Give users of your application the ability
   to configure proxying, preferably allowing for a different proxy
   server and port number on a protocol by protocol basis, and allowing
   for some Internet domains and/or IP addresses to be exempted from
   the proxy configuration.

   Make use of user/admin configured preferences for HTTP proxying
   which may already have been set up before your application is
   installed, where these are available.

   Ideally, any new URL protocol schemes, such as "urn:", should be
   passed to an HTTP proxy server, making it possible to support new
   protocols without having to upgrade individual software
   installations.

4.
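In section 3.4's terms, "preferences which may already have been set
up" usually means the conventional per-protocol environment variables
(http_proxy, ftp_proxy, no_proxy, ...).  A sketch of honouring them in
Python; the host and proxy names are invented for illustration:

```python
import urllib.request

def make_opener(host):
    """Build an opener that honours pre-existing proxy configuration.

    getproxies() reads the conventional per-protocol environment
    variables (http_proxy, ftp_proxy, ...); proxy_bypass() consults
    the exemption list (no_proxy) - together giving the per-protocol
    configuration and domain exemptions section 3.4 asks for.
    """
    if urllib.request.proxy_bypass(host):
        handler = urllib.request.ProxyHandler({})  # exempted: go direct
    else:
        handler = urllib.request.ProxyHandler(urllib.request.getproxies())
    return urllib.request.build_opener(handler)

# Usage (hypothetical hosts): an administrator who has set
#   http_proxy=http://cache.example:3128  no_proxy=intranet.example
# gets proxied requests everywhere except the exempted domain.
opener = make_opener("www.example.com")
# opener.open("http://www.example.com/") would now go via the proxy.
```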
Security considerations

   Cachebusting is clearly justified in those cases where the use of
   caching has, in itself, security and privacy implications.  The end
   user has no way of knowing what information is being logged, or
   where it will end up - e.g. bank account or credit card numbers.

   Proxy servers tend to subvert firewalls and access controls based on
   IP addresses and/or domain names.

   Proxy servers can be useful as a central mechanism for laundering
   incoming WWW traffic to (for example) remove or block offensive
   material, or to check applications and applets being downloaded for
   problems such as viruses and denial of service attacks.

5. Acknowledgements

   Thanks to Duane Wessels, Vinod Valloppilli, George Michaelson,
   Donald Neal, Ernst Heiri, Wojtek Sylwestrzak, Alan J. Flavell and
   Jens-S. Voeckler for their contributions to this document.

6. References

   [1] A. Luotonen and K. Altis, "World-Wide Web proxies", in WWW94
       Conference Proceedings (Elsevier), 1994.

   [2] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee,
       "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2068 (Proposed
       Standard), January 1997.

   [3] D. Wessels, K. Claffy, "Internet Cache Protocol (ICP), version
       2", RFC 2186 (Informational), September 1997.

   [4] D. Wessels, K. Claffy, "Application of Internet Cache Protocol
       (ICP), version 2", RFC 2187 (Informational), September 1997.

   [5] K. Claffy, "NLANR Caching Workshop Report", June 1997.
       <URL:http://ircache.nlanr.net/Cache/Workshop97/minutes.html>

   [6] W. Stevens, "TCP Slow Start, Congestion Avoidance, Fast
       Retransmit, and Fast Recovery Algorithms", RFC 2001 (Proposed
       Standard), January 1997.

   [7] D. Mills, "Network Time Protocol (v3)", RFC 1305 (Proposed
       Standard), March 1992.

   [8] J. Seidman, "A Proposed Extension to HTML: Client-Side Image
       Maps", RFC 1980 (Informational), August 1996.

7.
Authors' addresses

   Martin Hamilton
   Department of Computer Science
   Loughborough University
   Leics. LE11 3TU, UK

   Email: martinh@gnu.org

   Andrew Daviel
   Vancouver Webpages
   Box 357, 185-9040 Blundell Road
   Richmond, BC V6Y 1K3, CA

   Email: andrew@vancouver-webpages.com

   This Internet Draft expires July 1999.