Network Working Group                                          V. Pappas
Internet-Draft                                                      UCLA
Expires: August 5, 2006                                         B. Zhang
                                                    Colorado State Univ.
                                                            E. Osterweil
                                                                    UCLA
                                                               D. Massey
                                                    Colorado State Univ.
                                                                L. Zhang
                                                                    UCLA
                                                           February 2006


         Improving DNS Service Availability by Using Long TTLs
                     draft-pappas-dnsop-long-ttl-01

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 5, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   Due to the hierarchical tree structure of the Domain Name System
   [RFC1034][RFC1035], losing all of the authoritative servers that
   serve a zone can disrupt services to not only that zone but all of
   its descendants. This problem is particularly severe if all the
   authoritative servers of the root zone, or of a top level domain's
   zone, fail.
   Although proper placement of secondary servers, as discussed in
   [RFC2182], can be an effective means against isolated failures, it
   is insufficient to protect the DNS service against a distributed
   denial of service (DDoS) attack. This document proposes to mitigate
   the impact of DDoS attacks against top level DNS servers by setting
   long TTL values for NS records and their associated A records. Our
   proposed changes are purely operational and can be deployed
   incrementally. Our analysis shows that this simple operational
   tuning has a small impact on DNS performance but can significantly
   reduce the impact felt by client resolvers as a result of successful
   DDoS attacks on the DNS service.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Infrastructure RRsets Definitions and Conflicts
     2.1.  Infrastructure Records and DNS Caching
     2.2.  Infrastructure RRset Conflicts
   3.  Setting Long Infrastructure TTLs
     3.1.  Cases of Secondary Servers outside the Zone
     3.2.  Intuition
     3.3.  Handling Name Server Changes
     3.4.  Impact on Cache Memory Size
   4.  Measurement Results on Infrastructure RRSets Changes
   5.  Effectiveness of Long TTL on Zone's Availability
     5.1.  Further Enhancement Through Prefetching
   6.  Backwards Compatibility
   7.  Security Considerations
   8.  Recommendations
   9.  Acknowledgments
   10. References
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   [RFC2182] provides operational guidelines for selecting and
   operating authoritative servers to maximize a zone's availability.
   Proper placement of authoritative servers can be an effective means
   to guard DNS service against unintentional failures or errors, but
   it cannot effectively protect DNS services against intentional
   attacks. A distributed denial of service attack could target all of
   the authoritative servers for a zone, regardless of where they are
   placed. By disabling all of a zone's authoritative servers, an
   attacker can disrupt service for that zone and all the zones below
   it. In particular, attacks against domains such as the root, generic
   top level domains (gTLDs), country code top level domains (ccTLDs),
   and other zones serving popular DNS domains (such as co.uk. or
   co.jp.) could have a severe global impact. For example, knocking out
   all of the root zone servers may effectively render the entire
   Internet unreachable. Successful attacks against all authoritative
   servers for a large gTLD such as "com." can also impact availability
   for tens of millions of DNS zones.

   DNS caching can effectively help mitigate the impact of denial of
   service attacks. A caching resolver only consults an authoritative
   server if the requested data is not already present in the cache.
   The cache contains both specific records such as www.example.com and
   infrastructure records such as the name servers for example.com. In
   this document, we focus primarily on the caching of infrastructure
   records (defined formally in the next section) and show how setting
   long TTLs on these records can help mitigate the impact of DDoS
   attacks.
   For example, consider the case of a successful attack against all of
   the DNS root servers and suppose all root servers are unavailable
   for some time period P. Despite the attack, resolvers can still
   access commonly used gTLDs and ccTLDs as long as their NS records
   and the corresponding A/AAAA resource record sets (RRsets) remain in
   a locally available cache during the period P. Generally speaking,
   access to the root servers is only used for looking up top level
   domain entries that are not presently available in the cache.
   Similar arguments apply to attacks against servers of other top
   level domains, or any DNS domain for that matter. If the NS and
   associated A/AAAA RRsets for a domain are cached, an attack against
   higher level domains will have little or no impact on descendant
   domains.

   Based on the above observation, this document suggests an
   operational change regarding the setting of the TTL value for NS
   resource record sets and the A and AAAA resource records associated
   with these NS records. Throughout the remainder of the draft, we
   refer to these types of records as "infrastructure resource record
   sets" or simply "infrastructure RRsets"; infrastructure records are
   discussed more fully in later sections. As with all DNS RRsets, the
   cache lifetime for these infrastructure RRsets is determined by the
   time to live (TTL) field, which is typically set to a value between
   a few hours and two days. This draft recommends the use of a
   significantly longer TTL value (such as one week) for infrastructure
   RRsets in order to improve the DNS service's availability in the
   event of a successful attack or an unexpected correlated failure.
   This change is feasible because of the relatively stable nature of
   infrastructure RRsets, and the DNS's tolerance for occasional
   partial discrepancies in these RRsets.
   The recommendation for a longer TTL value in this draft applies only
   to DNS infrastructure RRsets; other RRsets, such as those for end
   hosts, should continue to use whatever TTL values local
   administrators deem appropriate to meet the needs of their dynamic
   changes.

   Currently, some of the root and TLD servers use shared unicast
   addresses [RFC3258] to improve availability during denial of service
   attacks. This approach can be effective when the number of
   replicated servers is large; however, the interactions between
   shared unicast addresses and BGP routing dynamics are still not
   fully understood. Furthermore, the use of shared unicast addresses
   requires one entry in the global BGP routing table for each
   protected zone. Therefore, it may not be a generic solution for
   protecting a large number of zones. In contrast, our proposal for
   using a long TTL for infrastructure RRsets to mitigate the impact of
   DDoS attacks is much simpler in operation, does not require any
   additional hardware support, and can be applied to any DNS domain
   that desires high availability in the face of top level DNS service
   failures. One can also combine the use of long TTL values for
   infrastructure RRsets with the shared unicast address approach to
   further enhance DNS availability.

   Section 2 defines infrastructure RRsets, and Section 3 describes the
   exact mechanisms of our proposal together with some related
   technical and operational issues that we have identified. Section 7
   discusses potential impacts on DNS security, and Section 8 presents
   a specific recommendation for setting TTL values for infrastructure
   RRsets.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Infrastructure RRsets Definitions and Conflicts

   The DNS contains two very distinct types of data: general DNS data
   records and infrastructure resource records. However, the DNS design
   did not explicitly distinguish between general data records and
   infrastructure records. As a result, the rules for identifying
   infrastructure resource records are somewhat complex. As a general
   rule of thumb, an infrastructure record is used by a resolver to
   navigate across a delegation. Applying this general rule allows us
   to distinguish between general records and infrastructure records as
   follows:

   All NS resource records are infrastructure records since a resolver
   uses NS RRsets to navigate between delegations. Similarly, the A
   RRsets associated with name servers (more precisely, associated with
   names in the data portion of an NS RR) are also considered
   infrastructure records. An iterative resolver could not navigate
   from example.com to subzone.example.com without these records. All
   other resource record types defined in [RFC1035] are general data
   records.

   As the DNS has evolved, a large number of new general data records
   have been added, such as SRV resource records, LOC resource records,
   and so forth. A small number of new infrastructure resource records
   have also been added. In particular, AAAA (IPv6) records that are
   associated with name servers are considered infrastructure records;
   an IPv6 resolver needs to know the AAAA RRs in order to navigate
   from example.com to subzone.example.com.

   DNS Security Extensions [RFC4034] also introduced several new
   infrastructure records. The DS RR is an infrastructure record since
   it is needed by security aware resolvers to navigate between zones.
   DNSKEY RRs are infrastructure records if they are used to match DS
   RRs, or are configured as trust anchors in resolvers.
   Similarly, NSEC RRs that are associated with a delegation name are
   also infrastructure records since they may be used to indicate the
   lack of a DS (or DNSKEY) RR and thus play a role in securely
   navigating between zones. Finally, RRSIGs are infrastructure records
   if they sign an infrastructure RRset.

   In summary, NS and DS resource records are always infrastructure
   records. The A and AAAA resource records are infrastructure records
   if and only if the name associated with the A or AAAA RR exactly
   matches a name in the data portion of some NS RR. An NSEC RR is an
   infrastructure RR if and only if its owner name is a delegation. A
   DNSKEY RR is an infrastructure record if and only if it matches a DS
   RR or is configured as a trust anchor in some resolver. An RRSIG is
   an infrastructure record if and only if it signs an infrastructure
   RRset. All other resource records defined at the time of this draft
   are general data records.

2.1.  Infrastructure Records and DNS Caching

   Infrastructure records play a large role in DNS caching. An end host
   typically sends DNS queries to a local caching resolver. If an exact
   match for a query is not found in the cache, the caching resolver
   uses cached infrastructure records to determine where to start an
   iterative search. Initially the cache will check whether the
   infrastructure records (e.g. the NS RRset and corresponding A RRsets
   in the case of IPv4 deployment without DNSSEC) for the requested
   zone are present in the cache. If the infrastructure records for the
   zone are not found, the cache checks for infrastructure records of
   the zone's ancestors. The cache then begins its sequence of
   iterative queries by first contacting the nearest ancestor that is
   in the cache (e.g. the zone itself in the best case and the DNS root
   in the worst case).

   For example, suppose a caching resolver wants to obtain the TXT
   RRset for www.subzone.example.com.
   The caching resolver first searches its local cache for the
   requested www.subzone.example.com TXT RRset and simply returns this
   answer if it is found. If it is not found, the caching resolver
   searches its local cache for the subzone.example.com NS RRset. If
   both this NS RRset and its corresponding A/AAAA RRsets are found,
   the caching resolver directly contacts these servers. If no NS RRset
   for subzone.example.com is found in the cache, the caching resolver
   searches its local cache for the example.com NS RRset and begins the
   iterative query at these servers if they are found in the cache. If
   no NS RRset for example.com is found, the caching resolver searches
   its local cache for the com NS RRset and begins the iterative query
   at that point. Finally, if no com NS RRset is found in the cache,
   the caching resolver begins its iterative query at the root servers.

   From the above description we can see that, once the infrastructure
   RRsets for a zone are cached in a local resolver, the resolver can
   go directly to the zone's servers to resolve DNS queries, even when
   the higher level DNS servers are unavailable. As stated in [Mock88],
   a high TTL value not only minimizes DNS traffic but also "allows
   caching to mask periods of server unavailability due to network or
   host problems." Following this hint, we propose a simple operational
   tuning of the TTL values of the infrastructure RRsets toward longer
   values. We show how this minimizes the dependency on top level zones
   and increases the overall availability of DNS zones, while still
   maintaining acceptable operational policies.

2.2.  Infrastructure RRset Conflicts

   One unfortunate complication is that infrastructure records often
   appear on both sides of a DNS delegation. In other words, the
   records appear in both the parent and child zone.
   For example, a non-authoritative copy of the subzone.example NS
   RRset is stored in the example zone and an authoritative copy of the
   subzone.example NS RRset is stored in the subzone.example zone. When
   an infrastructure RRset appears in multiple zones, ideally all
   copies contain the same data. [RFC1034] states "The administrators
   of both (parent and child) zones should insure that the NS and glue
   RRs which mark both sides of the cut are consistent and remain so."
   However, this mandate is very optimistic, and is not always
   satisfied in real operations.

   This reality poses a problem that is particularly acute in the
   context of this draft. We propose to increase the TTL associated
   with infrastructure records such as the NS RRsets, but this change
   has different scopes if the parent and child zones do not present
   the same TTL for the same RRset. For example, should one increase
   the TTL of the example.com NS RRset stored in the example.com zone,
   increase the TTL of the example.com NS RRset stored in the .com
   zone, or increase both? To address this question, we first consider
   how resolvers deal with the multiple copies of a particular RRset.

   A caching resolver may cache the non-authoritative NS RRset stored
   at the parent or cache the authoritative copy from the zone itself.
   Each set may have its own TTL that is determined by the zone storing
   the data. The parent zone operator sets the TTL for the
   non-authoritative copy stored at the parent zone. The zone operator
   sets the TTL for the authoritative copy stored at the zone itself. A
   caching resolver must not combine the two NS RRsets. When presented
   with a choice of which copy to use, a cache should always prefer the
   authoritative copy of the NS RRset over any non-authoritative copy
   [RFC2181], but a cache may not always encounter the authoritative
   copy.

   A caching resolver first relies on the NS RRset stored at the
   parent. For example, a caching resolver first fetches the
   subzone.example NS RRset from the example zone.
   This non-authoritative NS RRset stored at the parent is needed to
   reach the zone and is stored in the local cache. The lifetime of
   this non-authoritative copy of the NS RRset depends on the TTL set
   in the parent zone. The non-authoritative NS RRset from the parent
   is replaced by the actual authoritative version that is stored in
   the zone itself only if one of the following happens:

   First, the caching resolver can explicitly query the subzone.example
   servers for the subzone.example NS RRset. This ensures the cache
   obtains the authoritative copy, but is rarely done in practice.

   Second, the caching resolver may query the zone's name servers for a
   record that is currently stored in the zone, and the server returns
   the requested record together with the authoritative NS RRset in the
   authority section of the response. For example, suppose the resolver
   requests the www.subzone.example TXT RRset. The caching resolver
   initially queries the example servers to learn the
   (non-authoritative) subzone.example NS RRset, and the resulting
   subzone.example NS RRset from the example zone is stored in the
   cache using a TTL selected by the example zone administrator. The
   caching resolver next queries a subzone.example server to learn the
   www.subzone.example TXT RRset. If this RRset exists, the response
   includes the www.subzone.example TXT RRset in the answer section and
   includes the (authoritative) copy of the subzone.example NS RRset in
   the authority section. In this way, the caching resolver learns the
   authoritative copy of the NS RRset, and this authoritative copy has
   a TTL set by the subzone.example administrator.

   Note that if a zone is used purely for referrals, then the caching
   resolver never learns the authoritative NS RRset for the zone. For
   example, suppose the resolver requests the www.sub2.subzone.example
   TXT records.
   The resolver first queries the example servers and learns the
   (non-authoritative) subzone.example NS RRset stored at the example
   zone. The caching resolver next queries the subzone.example servers
   and learns the (non-authoritative) sub2.subzone.example NS RRset
   stored at the subzone.example zone. Finally the caching resolver
   queries the sub2.subzone.example servers and learns the
   www.sub2.subzone.example RRset together with the authoritative
   sub2.subzone.example NS RRset (from the authority section, assuming
   the www.sub2.subzone.example TXT RRset exists). During this entire
   process, the resolver never learns the authoritative copy of the
   subzone.example NS RRset. The TTL set by the subzone.example
   administrator makes no difference; only the TTL set by the example
   administrator has an impact on this caching resolver.

   Finally we note that a server may be authoritative for both a zone
   and one of its ancestors, which further complicates the question of
   which copy of the NS RRset is stored at a cache. For example,
   suppose a server is authoritative for both the .example and
   sub2.subzone.example zones. The example above now works as follows.
   The caching resolver first queries one of the example servers, but
   this server does not provide any referral since it is authoritative
   for both .example and sub2.subzone.example. The server replies with
   the www.sub2.subzone.example RRset and includes the authoritative
   sub2.subzone.example NS RRset (in the authority section, assuming
   the www.sub2.subzone.example TXT RRset exists).

   Similar reasoning applies to other infrastructure records that are
   stored in multiple places. For example, non-authoritative copies of
   infrastructure A and AAAA records are often encountered by
   resolvers. A cache should always prefer authoritative answers when
   available. But whether a cache obtains the authoritative or
   non-authoritative version depends on the sequence of queries as
   illustrated above.

3.  Setting Long Infrastructure TTLs

   To reduce the dependency on top level DNS servers, and hence
   increase the availability of a zone, we recommend that DNS zone
   operators substantially increase the TTL values of their zones'
   infrastructure RRsets. In other words, the long TTL value should be
   set on the authoritative copy of the NS RRset and any related A or
   AAAA RRsets present in the zone (authoritative or not) that
   correspond to names listed in the NS RRset.

   Given that the TTL value is part of the RRs, we recommend that the
   non-authoritative copies of the infrastructure RRsets stored at the
   parent zone also be assigned the long TTL value. This recommendation
   is especially important for those zones that mainly provide referral
   answers for their child zones, rather than answers for records
   stored by them. For example, TLD zones mainly provide referrals for
   their delegated zones. As a general guide, we suggest that the TTL
   value of a non-authoritative record be no longer than the TTL of the
   authoritative copy. This presumes the authoritative copy has
   implemented the long TTL recommendation and has selected the longest
   possible TTL value given the expected dynamics of this RRset.

   Note also that the authoritative answers of the NS and associated A
   RRsets from the zone itself are preferred over any copy stored at
   the parent. Thus, a shorter TTL value set by the parent zone will
   not reduce the effectiveness of the long TTL values set by the child
   zone, provided a resolver learns the authoritative version.

3.1.  Cases of Secondary Servers outside the Zone

   The following common case in DNS configuration deserves a special
   explanation.
   When a name server for the foo.example zone is located inside the
   zone, the operator of foo.example can configure the TTL for both the
   NS RRset and the A/AAAA records to a long time period. However, some
   of foo.example's authoritative servers may be located in other
   domains, as illustrated in the following NS RRset:

      foo.example.   NS   ns1.foo.example.
      foo.example.   NS   ns2.foo.example.
      foo.example.   NS   ns.bar.example2.

   The foo.example zone is authoritative for the A and AAAA RRsets at
   both ns1.foo.example and ns2.foo.example, and can set a longer TTL
   value for their NS records and associated A records. However, the
   TTL value of the address record of the third server is configured by
   the bar.example2 zone, and may or may not be set to the longer
   value. Nevertheless, a short TTL for the A record of the third
   server should not have a big impact. When the parent zone of
   foo.example is unavailable, the A record of the third server may
   still be resolved even when it is not in the local cache, because
   the outage of the example zone does not necessarily imply the
   failure of the bar.example2 zone. This example also illustrates the
   benefit of locating secondary servers under different branches of
   the DNS tree.

3.2.  Intuition

   The motivation for extending the TTLs on infrastructure RRsets is
   partially derived from the general caching model used by the DNS.
   With the DNS' long-standing use of caching it is very easy to
   imagine longer TTL values as just an emphasis on the DNS' data being
   more stable (i.e. the infrastructure RRsets don't change very often,
   so they can be cached for longer).

   There are practical limitations to increasing the TTL value of
   infrastructure RRsets. For example, current implementations of BIND,
   and other DNS server distributions, limit the maximum TTL used for
   RRsets. Therefore, an extended TTL on RRsets may still encounter
   limitations after being served (i.e. in the client's cache).

   In addition, the interactions with DNSSEC must be taken into
   account. For example, DNSSEC's key roll-over process is partially a
   function of an RRset's TTL. Therefore, a long TTL may extend the
   roll-over period. See draft for more details.

   As a result of the above considerations, 1 week seems to be a long
   enough period to greatly improve DNS availability in the face of an
   outage, and still a short enough period to avoid undesirable
   interactions with server implementations or DNSSEC signature
   lifetimes and policies.

3.3.  Handling Name Server Changes

   A primary concern with an increased TTL value is data consistency.
   DNS servers do change from time to time: new servers are added, and
   existing servers' IP addresses are changed or removed due to network
   reconfigurations. Such changes can lead to inconsistencies between
   the cached infrastructure RRsets for DNS servers and the actual name
   servers.

   When changes in DNS name servers or their IP addresses do occur, the
   following operational practices should be followed. First, as stated
   in [RFC1034], "If a change can be anticipated, the TTL can be
   reduced prior to the change to minimize inconsistency during the
   change, and then increased back to its former value following the
   change." Second, a planned change should involve a grace period.
   When the information in authoritative DNS servers has been modified,
   the obsolete name server and/or obsolete IP address should continue
   answering queries for at least the TTL period, during which the
   cached information can still be used to resolve DNS queries.

   The prescribed TTL adjustment and graceful transition represent the
   ideal handling of DNS server changes, but they may not always be
   possible. In cases where unexpected changes happen, some caches will
   inevitably contain invalid name server information for a zone.
   However, DNS can operate effectively even when some authoritative
   servers are not reachable. As long as not all the servers for a zone
   have changed during the TTL period, the zone will continue to be
   accessible even by those resolvers that have cached the now
   partially obsolete zone data. If at least a single server from the
   original set continues to operate during the TTL period, queries
   that use cached data will still be answered, even when the data for
   the changed server is obsolete.

   We should also note that, after some zone server changes, when the
   query is answered by a working authoritative server, this server can
   include the updated NS RRset in the authority section of the reply.
   Such an inclusion will override the obsolete RRset that is cached at
   the caching resolver. Thus the only penalty paid by a caching
   resolver is possibly a longer resolution time for the first query
   issued after the DNS server changes, if that query goes to one or
   more no-longer-existing servers before hitting a working one.

   One effective way to assure DNS availability in the face of
   unexpected changes is for each zone to set up an adequate number of
   secondary servers in diverse locations. In the earlier example, if
   ns1.foo.example suddenly fails and has to be reinstalled on a
   different host, the cached data for ns1.foo.example can stay in some
   resolvers for a long time before it is flushed out of the cache, but
   queries for the foo.example zone can still be served by the
   remaining servers. This remains true as long as not all the RRs in
   the NS or A/AAAA RRset change at the same time. The only negative
   impact is a longer query time in the event that a caching resolver
   first sends a query to ns1.foo.example (the recently failed server).
   In this case, the query will time out after a few seconds and the
   resolver will try the next server and succeed.
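
   The failover behavior described above can be sketched in a few lines
   of Python. This is only an illustration under stated assumptions,
   not a description of any particular resolver implementation: the
   function names, the send_query hook, and the ALIVE set are all
   hypothetical, and no real network traffic is generated. The server
   names are the ones used in the Section 3.1 example.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def resolve_with_failover(
    cached_servers: Iterable[str],
    send_query: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], List[str]]:
    """Try each cached name server in turn until one answers.

    send_query is a hypothetical transport hook: it returns an answer
    string, or None when the server does not respond before a timeout.
    Returns the answer (or None) and the list of servers that were tried.
    """
    tried: List[str] = []
    for server in cached_servers:
        tried.append(server)
        answer = send_query(server)
        if answer is not None:
            return answer, tried
    return None, tried


# Cached (possibly stale) NS RRset for foo.example.
CACHED_NS = ["ns1.foo.example.", "ns2.foo.example.", "ns.bar.example2."]

# Assumed state of the world: ns1 has failed, the other servers still answer.
ALIVE = {"ns2.foo.example.", "ns.bar.example2."}


def fake_send_query(server: str) -> Optional[str]:
    # Stand-in for a real DNS query with a timeout; no network is used.
    return "answer from " + server if server in ALIVE else None


answer, tried = resolve_with_failover(CACHED_NS, fake_send_query)
```

   In this sketch the first attempt (to the failed ns1.foo.example)
   costs one timeout, and the second attempt succeeds, which mirrors
   the "longer query time" penalty discussed above. The lookup fails
   only when every server in the cached set is unreachable.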
   Overall, if DNS operators place secondary servers in appropriate
   locations and follow the above rules (pertaining to the updating of
   infrastructure RRsets and the managing of server changes), a long
   TTL value should have little negative impact on DNS performance.

   We also conducted measurements over a large set of randomly chosen
   DNS zones to gauge the frequency of zone server changes in the
   current DNS system. As described in the next section, our results
   show that the majority of DNS zones do not change their NS RRset and
   the associated A/AAAA records frequently. This observation provides
   further support for the feasibility of a long TTL value for the
   infrastructure RRsets.

3.4.  Impact on Cache Memory Size

   Introducing longer TTLs has the potential to increase the caching
   server's memory requirements. We believe that this is not an issue
   with current typical hardware. For example, if the working set of a
   very popular caching server is 10 million zones (around 10% of the
   world's DNS zones), and assuming that each zone's infrastructure
   records take less than 100 bytes of memory, then the memory
   requirements will be under one gigabyte.

4.  Measurement Results on Infrastructure RRSets Changes

   The previous section proposed to set the TTL values of the
   infrastructure RRsets to a long period. A long TTL value for
   infrastructure RRsets implies that each zone has a stable set of DNS
   servers. To assess the stability of currently deployed DNS servers,
   we conducted a measurement study. From a crawl of over 15 million
   DNS zones (the crawl was initiated at DMOZ.ORG), we randomly
   selected 100,000 zones and measured their infrastructure RRsets over
   a 4-month period.
   During this 4-month period we queried each of the 100,000 zones
   twice a day to obtain its infrastructure RRsets. Our data shows that
   75% of the measured zones did not change either the NS or the
   corresponding A RRsets during the entire study period. 11% of the
   zones showed changes to their NS RRset during this 4-month period,
   and 5% of the zones made the changes in less than 2 months. The A
   records of the measured zone servers changed more often than the NS
   RRsets: 22% of the zones had their servers' A records changed within
   4 months, and 10% of the zones made changes to their servers' A
   records in less than 2 months.

   All in all, our measurement results show that the current DNS
   servers, in the majority of the zones, are very stable. Even in the
   zones that made changes during our measurement period, DNS server
   changes were rather infrequent. We believe that, with special care,
   the changes to DNS servers can be further reduced, and that a TTL
   value of 1 week is indeed feasible for infrastructure RRsets.

5.  Effectiveness of Long TTL on Zone's Availability

   The following is a quick, back-of-envelope calculation of the
   increased zone availability that would result from increasing the
   TTL value of an infrastructure RRset. Assume foo.example is a
   popular zone and its infrastructure RRset (with a TTL of 4 hours)
   tends to be cached in many caching resolvers. If a DDoS attack takes
   the example zone out of service for 2 hours, then on average 50% of
   the caching resolvers will evict the foo.example zone's
   infrastructure RRset (due to expiration) by the end of the 2 hours.
   This would leave them unable to resolve foo.example or any name
   under it.
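
   The 50% figure follows from a simple model, which we state here as
   an assumption rather than a measured property: at the moment the
   outage starts, the remaining lifetime of a cached RRset is uniformly
   distributed between zero and the full TTL, and no resolver can
   refresh the RRset during the outage. The expected fraction of
   caches that lose the RRset is then the outage duration divided by
   the TTL, capped at 1. A minimal Python sketch of this estimate (the
   function name is hypothetical):

```python
def expired_fraction(outage_hours: float, ttl_hours: float) -> float:
    """Estimate the fraction of caches whose copy of an RRset expires
    during an outage, assuming remaining lifetimes are uniformly
    distributed over [0, TTL] when the outage begins."""
    if ttl_hours <= 0:
        raise ValueError("ttl_hours must be positive")
    if outage_hours < 0:
        raise ValueError("outage_hours must be non-negative")
    # A cache loses the RRset if its remaining lifetime is shorter than
    # the outage; under the uniform assumption that is outage / TTL.
    return min(1.0, outage_hours / ttl_hours)


# The case from the text: a 2-hour outage and a 4-hour TTL.
four_hour_ttl = expired_fraction(2, 4)
```

   For the 4-hour TTL above, expired_fraction(2, 4) evaluates to 0.5.
   The same function can be applied to any other TTL by passing the TTL
   in hours.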
   If we increase the TTL value of foo.example's infrastructure RRset to 1 day, then during a 2-hour outage of the example zone only 1/12, or about 8%, of the cache resolvers would flush foo.example's infrastructure RRset from the cache. If we increase the TTL value to one week, then after the same 2-hour outage of the example zone, foo.example's infrastructure RRset would stay valid in the caches of 98.8% (1 - 2/168) of those cache resolvers that had fetched the RRset earlier. The longer the TTL is, the greater the number of cache resolvers that will have valid DNS server information in their cache. Hence, we see an increased DNS availability in the face of temporary outages of top level servers.

   In order to gauge the effectiveness of a longer TTL value for the DNS infrastructure records, we used a real DNS trace that was captured by a UCLA caching server over 2 weeks. Based on this trace, we simulated a DoS attack on all root and TLD servers and measured the percentage of queries that could not be resolved (excluding negative answers from the root and TLD zones), both with the current TTL values and with a hypothetical TTL value of 3, 5, 7, and 9 days for all zones. The attack duration was 3, 6, 12 and 24 hours, and the attack started on the eighth day (in simulation time). The following table shows the absolute number as well as the percentage of queries that could not be resolved for each combination of attack duration and TTL value:

   +-------+---------------------------------------------------------------+
   |       |                    Attack Duration (Hours)                    |
   |  TTL  +---------------+---------------+---------------+---------------+
   |(days) |       3       |       6       |      12       |      24       |
   |       | 7776 Queries  | 13799 Queries | 23586 Queries | 53636 Queries |
   +-------+---------------+---------------+---------------+---------------+
   |   -   | 2227 - 28.6%  | 3829 - 27.7%  | 6807 - 28.8%  | 17099 - 31.8% |
   |   3   | 1132 - 14.5%  | 1884 - 13.6%  | 3154 - 13.3%  | 7218 - 13.4%  |
   |   5   |  917 - 11.7%  | 1530 - 11.0%  | 2562 - 10.8%  | 5947 - 11.0%  |
   |   7   |  767 - 9.8%   | 1256 - 9.1%   | 2092 - 8.8%   | 4766 - 8.8%   |
   |   9   |  711 - 9.1%   | 1165 - 8.4%   | 1898 - 8.0%   | 4157 - 7.7%   |
   +-------+---------------+---------------+---------------+---------------+

                                 Figure 1

   In Figure 1, the row marked "-" corresponds to the current TTL values, and each cell gives the number and percentage of unresolved queries out of the total queries issued during the attack.

   Clearly, a longer TTL value increases the overall system availability under denial of service attacks. The table shows that a TTL value of seven days decreases the number of unresolved queries during such an attack on the root and TLD servers by roughly 70% (between 65% and 72% across the four attack durations), largely independent of the attack duration. The table also shows that each further increase in the TTL value makes the system more resilient to attacks. Based on these results, we believe that a TTL value of seven days is adequate to considerably improve the resilience of the DNS system against denial of service attacks.

5.1.  Further Enhancement Through Prefetching

   Although our above analysis shows that a long TTL value alone can be effective in increasing DNS service availability, we note that at any given time the infrastructure RRsets cached by some cache resolvers will be expiring. Thus, if some top level zones are out of service when a resolver's cache entries expire, that resolver loses the ability to directly contact the destination zones whose infrastructure RRsets got flushed out.
   To further improve DNS service availability, we suggest that cache resolvers pre-fetch all the infrastructure RRsets that have an initial TTL value > 2 days (which is currently the default TTL value). We can interpret a long TTL value for an infrastructure RRset to mean that the zone is "long TTL aware" and desires high availability. We suggest that the pre-fetch is performed when an infrastructure RRset's remaining cache time drops below TTL/2. Such pre-fetching assures that a cache resolver will have valid infrastructure RRsets in the cache, and hence be able to reach zone servers directly, even when some zones along the DNS lookup path may have failed. This remains true as long as the outage is shorter than TTL/2.

6.  Backwards Compatibility

   The advantages of this approach stem largely from its simplicity. The operational practice of using long TTLs for infrastructure records does not require any modifications to currently deployed caches. The proposal is, therefore, backwards compatible with existing infrastructure, and has no dependency on any specific implementation of a DNS cache (such as BIND, djbdns, etc.). Additional features associated with the use of the long TTL, such as pre-fetching, may be incrementally deployed without adversely affecting any existing or neighboring caches. All additional logic pertains to an instance's local cache and does not have the ability to affect or exploit other caches.

   Some DNS resolvers set a maximum value for the TTL that they are willing to cache. Any TTL value larger than the maximum is trimmed down to the maximum value. For example, BIND sets one week as the maximum value for caching resource records. Thus, zones with a TTL value larger than one week will not achieve any additional improvement over zones with a TTL value of just one week. Therefore, this document recommends a TTL value of one week.
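   Both resolver-side behaviours discussed so far, trimming a TTL to the cache's maximum and the pre-fetch at TTL/2 suggested in Section 5.1, are local to a single cache. The following is a minimal sketch of how they might fit together, not a description of any existing implementation. The class, its field names, and the choice to measure TTL/2 against the trimmed (rather than the advertised) TTL are assumptions made for illustration; the one-week maximum and the 2-day threshold are the values quoted in the text.

```python
from dataclasses import dataclass

DAY = 24 * 3600
MAX_CACHE_TTL = 7 * DAY      # one-week maximum, as cited for BIND in the text
PREFETCH_MIN_TTL = 2 * DAY   # pre-fetch only RRsets with initial TTL > 2 days


@dataclass
class CachedRRset:
    """Hypothetical cache entry for one infrastructure RRset."""
    name: str
    initial_ttl: int    # seconds, as advertised by the authoritative server
    fetched_at: float   # seconds on the cache's own clock

    @property
    def effective_ttl(self) -> int:
        # TTLs above the cache's maximum are trimmed down to the maximum.
        return min(self.initial_ttl, MAX_CACHE_TTL)

    def remaining(self, now: float) -> float:
        # Seconds until the entry expires, never negative.
        return max(0.0, self.fetched_at + self.effective_ttl - now)

    def should_prefetch(self, now: float) -> bool:
        # Only "long TTL aware" zones qualify for pre-fetching.
        if self.initial_ttl <= PREFETCH_MIN_TTL:
            return False
        # Assumption: TTL/2 is measured against the trimmed TTL.
        return self.remaining(now) < self.effective_ttl / 2


entry = CachedRRset("foo.example.", initial_ttl=30 * DAY, fetched_at=0.0)
print(entry.effective_ttl // DAY)       # 7: a 30-day TTL is trimmed to one week
print(entry.should_prefetch(3 * DAY))   # False: 4 days left, threshold is 3.5
print(entry.should_prefetch(4 * DAY))   # True: 3 days left
```

   Because the sketch touches only the entry's own timestamps, it illustrates why such logic can be deployed in one cache without coordination with any other cache or with the authoritative servers.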
   If future caching server implementations accept a larger maximum TTL value, we recommend increasing the TTL value of the infrastructure records even more (up to one month).

7.  Security Considerations

   The long TTL solution prescribes an operational practice that facilitates DNS queries during prolonged outages. Such outages may result from extended DDoS attacks against key servers in the DNS. The use of long TTLs does not reduce the vulnerability of targeted servers to DDoS attacks. However, the use of long TTLs limits the effect of a DDoS attack on the global DNS. While a DDoS attack may disrupt the availability of some critical nameservers, the NS records for the zones that are delegated by them will be available in remote caches for much longer. Therefore, while a DDoS attack is no less likely, its scope is dramatically reduced.

   Though the long TTL extends the roll-over period that should be followed when updating NS records for a zone, there exist no additional operational requirements beyond what is recommended now. The current guidelines recommend that operators continue to operate existing nameservers during the period between the date of a change to the NS records and that date plus the value of the old TTL. The only difference that results from this proposal is that the roll-over period is increased in proportion to the TTL.

   Failure to adhere to these guidelines has one of two effects (both of which also exist in the current mode of operation of the DNS). If there exist some nameservers that appear in both the old NS RRset and the new one, then any cache that is making use of a cached set may have to issue multiple A requests and time out before reaching an active nameserver.
   However, if there is no intersection between the nameservers in the old and the new RRset, then there exists a period, between the time that the last cache fetched the old values and that time plus one TTL, during which a cache will direct resolvers to inoperable nameservers. Neither of these scenarios is a concern if operators follow the standard procedure of maintaining both sets of servers (or at least an overlapping set) during roll-overs.

8.  Recommendations

   Our analysis shows that using long TTL values for infrastructure RRsets can be a simple and effective way to increase DNS service availability in the face of top level DNS server outages, and that this simple operational tuning should have negligible impact on the DNS system and its performance. Our measurements over a large set of randomly selected DNS zones also suggest that, in today's practice, the infrastructure RRsets for the majority of DNS zones are indeed stable and change very infrequently.

   Based on our analysis and measurements, we make the following recommendations. First, we recommend that the TTL value for infrastructure RRsets be increased to one week. Second, we recommend conducting a trial deployment of this long TTL value with a controlled set of zones and measuring the zones' availability, performance (in terms of name resolution delays), and changes in the zones' server load (we expect a decrease in the server load). If the trial deployment succeeds without exposing any unexpected issues, we would then recommend wide deployment of long TTL settings for infrastructure RRsets, both for top level zones and for any zones that desire high availability.

   It is noteworthy that extending the TTL of infrastructure RRsets to one week constitutes a very palpable step toward ensuring the robustness of the DNS.
   Current caching in the DNS is invaluable for many reasons, but with this enhancement, caching in the DNS is being drafted into the realm of DDoS protection. Our analysis has shown that this long-standing, bulletproof staple of DNS is capable of offering a very tangible level of protection with almost no overhead and with no new code.

   As future work, we plan to conduct further analysis on much longer TTL values (such as one month) for infrastructure RRsets and consider the impact on DNSSEC deployment.

9.  Acknowledgments

   We would like to express our thanks to Greg Minshall for an early discussion on the feasibility of using long TTLs to improve DNS availability, to Pete Resnick for his support and the suggestion of using one week or even longer TTL values, and to Rob Austin and Patrik Faltstrom who also provided constructive comments to our proposal.

10.  References

   [Mock88]   Mockapetris, P. and K. Dunlap, "Development of the Domain Name System", SIGCOMM, 1988.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities", STD 13, RFC 1034, November 1987.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and specification", STD 13, RFC 1035, November 1987.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2181]  Elz, R. and R. Bush, "Clarifications to the DNS Specification", RFC 2181, July 1997.

   [RFC2182]  Elz, R., Bush, R., Bradner, S., and M. Patton, "Selection and Operation of Secondary DNS Servers", BCP 16, RFC 2182, July 1997.

   [RFC3258]  Hardie, T., "Distributing Authoritative Name Servers via Shared Unicast Addresses", RFC 3258, April 2002.

   [RFC4034]  Arends, R., Austein, R., Larson, M., Massey, D., and S. Rose, "Resource Records for the DNS Security Extensions", RFC 4034, March 2005.

Authors' Addresses

   Vasileios Pappas
   University of California, Los Angeles, Department of Computer Science
   4805 Boelter Hall
   Los Angeles, CA  90095-1596
   US

   Email: vpappas@cs.ucla.edu


   Bin Zhang
   Colorado State University, Department of Computer Science
   Fort Collins, CO  80523-1873
   US

   Email: zhangb@cs.colostate.edu


   Eric Osterweil
   University of California, Los Angeles, Department of Computer Science
   4805 Boelter Hall
   Los Angeles, CA  90095-1596
   US

   Email: eoster@cs.ucla.edu


   Dan Massey
   Colorado State University, Department of Computer Science
   Fort Collins, CO  80523-1873
   US

   Email: massey@cs.colostate.edu


   Lixia Zhang
   University of California, Los Angeles, Department of Computer Science
   3713 Boelter Hall
   Los Angeles, CA  90095-1596
   US

   Email: lixia@cs.ucla.edu

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the Internet Society.