Network Working Group                                         M. Andrews
Internet-Draft                                                       ISC
Expires: May 16, 2014                                  November 12, 2013


   A Common Operational Problem in DNS Servers - Failure To Respond.
               draft-andrews-dns-no-response-issue-02.txt

Abstract

   The DNS is a query / response protocol. Failure to respond to
   queries causes both immediate operational problems and long-term
   problems with protocol development.

   This document will identify a number of common classes of queries
   that some servers fail to respond to. This document will also
   suggest procedures for TLD and other similar zone operators to apply
   to reduce / eliminate the problem.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 16, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Andrews                   Expires May 16, 2014                  [Page 1]

Internet-Draft             Failure to respond              November 2013

Table of Contents

   1. Introduction
   2. Common query classes that result in non-responses
     2.1. EDNS Queries
     2.2. Unknown / Unsupported Type Queries
     2.3. TCP Queries
   3. Remediating
   4. Firewalls and Load Balancers
   5. Normative References
   Author's Address

1. Introduction

   The DNS [RFC1034], [RFC1035] is a query / response protocol. Failure
   to respond to queries causes both immediate operational problems and
   long-term problems with protocol development.

   Failure to respond to a query is indistinguishable from packet loss
   without doing an analysis of query response patterns, and results in
   unnecessary additional queries being made by DNS clients and
   unnecessary delays being introduced to the resolution process.

   Due to the inability to distinguish between packet loss and
   nameservers dropping EDNS [RFC2671] queries, packet loss is
   sometimes misclassified as lack of EDNS support, which can lead to
   DNSSEC validation failures.

   Allowing servers which fail to respond to queries to remain in the
   DNS hierarchy for extended periods results in developers being
   afraid to deploy new type codes. Such servers need to be identified
   and corrected / replaced.

   The DNS has response codes that cover almost any conceivable query
   response.
   A nameserver should be able to respond to any conceivable query
   using them.

   Unless a nameserver is under attack, it should respond to all
   queries directed to it as a result of following delegations.
   Additionally, code should not assume that there isn't a delegation
   to the server even if it is not configured to serve the zone. Broken
   delegations are a common occurrence in the DNS, and receiving
   queries for zones that you are not configured for is not necessarily
   an indication that you are under attack.

   When a nameserver is under attack it may wish to drop packets. A
   common attack is to use a nameserver as an amplifier by sending
   spoofed packets. This is done because response packets are bigger
   than the queries, and big amplification factors are available,
   especially if EDNS is supported. Limiting the rate of responses is
   reasonable when this is occurring, and the client should retry. This
   however only works if legitimate clients are not being forced to
   guess whether EDNS queries are accepted or not. While there is still
   a pool of servers that don't respond to EDNS requests, clients have
   no way to know if the lack of response is due to packet loss, EDNS
   packets not being supported, or rate limiting due to the server
   being under attack. Mis-classifications of server characteristics
   are unavoidable when rate limiting is done.

2. Common query classes that result in non-responses

   There are three common query classes that result in non-responses
   today. These are EDNS queries, queries for unknown (unallocated) or
   unsupported types, and TCP queries that are filtered.

2.1. EDNS Queries

   Identifying servers that fail to respond to EDNS queries can be done
   by first identifying that the server responds to regular DNS
   queries, then making a series of otherwise identical queries using
   EDNS, then making the original query again.
   A series of EDNS queries is needed as at least one DNS
   implementation responds to the first EDNS query with FORMERR but
   fails to respond to subsequent queries from the same address for a
   period until a regular DNS query is made. The EDNS query should
   specify a UDP buffer size of 512 bytes to avoid false classification
   of not supporting EDNS due to response packet size.

   If the server responds to the first and last queries but fails to
   respond to most or all of the EDNS queries, it is probably faulty.
   The test should be repeated a number of times to eliminate the
   likelihood of a false positive due to packet loss.

   Firewalls may also block larger EDNS responses, but there is no easy
   way to check authoritative servers to see if the firewall is
   misconfigured.

2.2. Unknown / Unsupported Type Queries

   Identifying servers that fail to respond to unknown or unsupported
   types can be done by making an initial DNS query for an A record,
   making a number of queries for an unallocated type, then making a
   query for an A record again. IANA maintains a registry of allocated
   types.

   If the server responds to the first and last queries but fails to
   respond to the queries for the unallocated type, it is probably
   faulty. The test should be repeated a number of times to eliminate
   the likelihood of a false positive due to packet loss.

2.3. TCP Queries

   All DNS servers are supposed to respond to queries over TCP
   [RFC5966]. Firewalls that drop TCP connection attempts, rather than
   resetting the connection attempt or sending an ICMP/ICMPv6
   administratively prohibited message, introduce excessive delays to
   the resolution process.

   Whether a server accepts TCP connections can be tested by first
   checking that it responds to UDP queries, to confirm that it is up
   and operating, then attempting the same query over TCP.
   An additional query should be made over UDP if the TCP connection
   attempt fails, to confirm that the server under test is still
   operating.

3. Remediating

   While the first step in remediating this problem is to get the
   offending nameserver code corrected, there is a very long tail
   problem with DNS servers in that it can often take over a decade
   between the code being corrected and a nameserver being upgraded
   with corrected code. With that in mind it is requested that TLD, and
   other similar zone operators, take steps to identify and inform
   their customers, directly or indirectly through registrars, that
   they are running such servers and that the customers need to correct
   the problem.

   TLD operators should construct a list of servers child zones are
   delegated to along with a delegated zone name. This name shall be
   the query name used to test the server as it is supposed to exist.

   For each server the TLD operator shall make an SOA query for the
   delegated zone name. This should result in the SOA record being
   returned in the answer section. If the SOA record is not returned
   but some other response is returned, this is an indication of a bad
   delegation and the TLD operator should take whatever steps it
   normally takes to rectify a bad delegation. If more than one zone is
   delegated to the server, the TLD operator should choose another zone
   until it finds a zone which responds correctly or it exhausts the
   list of zones delegated to the server.

   If the TLD operator fails to get a response to an SOA query, it
   should make an A query, as some nameservers fail to respond to SOA
   queries but respond to A queries. If it gets no response to the A
   query, another delegated zone should be queried for, as some
   nameservers fail to respond to zones they are not configured for. If
   subsequent queries find a responding zone, all delegations to this
   server need to be checked and rectified using the TLD's normal
   procedures.
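
   The search for a working tuple described above can be sketched in
   Python as follows. This is a minimal illustration under stated
   assumptions, not a definitive implementation: the names
   find_working_tuple and SearchResult, the three outcome labels, and
   the injected query callable are all hypothetical and introduced
   here only for illustration. How a query is actually sent and parsed
   is left to the caller. The sketch also assumes that any response to
   the fallback A query counts as the zone responding, which the text
   does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical outcome labels for a single query (not from the draft).
NO_RESPONSE = "no-response"        # the query timed out
SOA_ANSWER = "soa-answer"          # SOA record in the answer section
OTHER_RESPONSE = "other-response"  # any other reply from the server

# query(server, zone, qtype) returns one of the labels above.
# The transport (UDP socket, DNS library, ...) is the caller's choice.
QueryFn = Callable[[str, str, str], str]


@dataclass
class SearchResult:
    # (zone, qtype) that produced a usable response, if any.
    working: Optional[Tuple[str, str]] = None
    # Zones that answered the SOA query with something other than an SOA.
    bad_delegations: List[str] = field(default_factory=list)
    # Zones for which neither the SOA nor the A query was answered.
    silent_zones: List[str] = field(default_factory=list)

    @property
    def recheck_all_delegations(self) -> bool:
        # A later zone responded after earlier zones met silence, so every
        # delegation to this server should be checked.
        return self.working is not None and bool(self.silent_zones)


def find_working_tuple(server: str, zones: List[str],
                       query: QueryFn) -> SearchResult:
    result = SearchResult()
    for zone in zones:
        outcome = query(server, zone, "SOA")
        if outcome == SOA_ANSWER:
            result.working = (zone, "SOA")
            return result
        if outcome == OTHER_RESPONSE:
            # Some other response: treat as a bad delegation, try next zone.
            result.bad_delegations.append(zone)
            continue
        # No response to SOA: some servers ignore SOA but answer A queries.
        # Assumption: any reply to the A query means the zone responds.
        if query(server, zone, "A") != NO_RESPONSE:
            result.working = (zone, "A")
            return result
        # Still silent: the server may ignore zones it is not configured for.
        result.silent_zones.append(zone)
    return result
```

   An operator would supply a query callable backed by a DNS library or
   raw sockets; a table-driven stub that maps (zone, qtype) to an
   outcome label is enough to exercise the control flow.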

   Having identified a working (server, zone name) tuple, the TLD
   operator should now check that the server responds to the EDNS,
   Unknown Query Type and TCP tests as described above.

   If the TLD operator finds that the server fails any of the tests,
   the TLD operator shall take steps to inform the operator of the
   server that they are running a faulty nameserver and that they need
   to take steps to correct the matter. The TLD operator shall also
   record the tuple for follow-up testing.

   If repeated attempts to inform and get the customer to correct /
   replace the faulty server are unsuccessful, the TLD operator shall
   remove all delegations to said server from the zone.

   It will also be necessary for TLD operators to repeat the scans
   periodically. It is recommended that this be performed monthly,
   backing off to bi-annually once the number of faulty servers found
   drops off to less than 1 in 100000 servers tested. Follow-up tests
   for faulty servers still need to be performed monthly.

   Some operators claim that they can't perform checks at registration
   time. If a check is not performed at registration time, it needs to
   be performed within a week of registration in order to detect faulty
   servers swiftly.

   Checking of delegations by TLD operators should be nothing new, as
   they have been required from the very beginnings of DNS to do this
   [RFC1034]. Checking for compliance of nameserver operations should
   just be an extension of such testing.

   It is recommended that TLD operators set up a test web page which
   performs the tests the TLD operator performs as part of their
   regular audits, to allow nameserver operators to test that they have
   correctly fixed their servers. Such tests should be rate limited to
   avoid these pages being a denial of service vector.

4. Firewalls and Load Balancers

   Firewalls and load balancers can affect the externally visible
   behaviour of a nameserver.
   Tests for conformance need to be done from outside of any firewall
   so that the system as a whole is tested.

   Firewalls and load balancers should not drop DNS packets that they
   don't understand. They should either pass through the packets or
   generate an appropriate error response.

   Requests for unknown query types are not attacks and should not be
   treated as such.

5. Normative References

   [RFC1034]  Mockapetris, P., "DOMAIN NAMES - CONCEPTS AND
              FACILITIES", STD 13, RFC 1034, November 1987.

   [RFC1035]  Mockapetris, P., "DOMAIN NAMES - IMPLEMENTATION AND
              SPECIFICATION", STD 13, RFC 1035, November 1987.

   [RFC2671]  Vixie, P., "Extension Mechanisms for DNS (EDNS0)",
              RFC 2671, August 1999.

   [RFC5966]  Bellis, R., "DNS Transport over TCP - Implementation
              Requirements", RFC 5966, August 2010.

Author's Address

   M. Andrews
   Internet Systems Consortium
   950 Charter Street
   Redwood City, CA 94063
   US

   Email: marka@isc.org