The ISP Column
A column on things Internet
The first part of this report on the handling of large DNS responses looked at the behaviour of the DNS, and the interaction between recursive resolvers and authoritative name servers in particular and examined what happens when the DNS response is around the Internet’s de facto MTU size of 1,500 octets.
For responses larger than 1,500 octets we saw failure in some 2.5% of all cases. What we observed was two forms of DNS failure. The first was that resolvers were signalling to the server via query attributes that it was acceptable for the server to send large responses over fragmented UDP, but then were unable to reassemble the fragmented response, either due to local host or local network constraints. This scenario occurred in around three quarters of all such failure cases in our measurement tests. The second failure form was where the resolver had received a truncated DNS response and there was a subsequent TCP failure. This included the failure to open a TCP session or a TCP path MTU mismatch where the TCP session hung when attempting to pass back the DNS response. This occurred in slightly less than one quarter of all failure cases in our measurement tests. The measurement setup and the results from this work are to be found in part 1 of this report, “DNS XL”.
However, we’re not finished with these measurements. The results that are presented in the first part of this report are based on respecting the packet size constraints expressed in DNS queries. These constraints are that no UDP DNS response should exceed 512 octets unless there is an EDNS(0) extension with a UDP buffer size provided in the query, and the value of this buffer size field is greater than 512. When there is a UDP buffer size in the query, then the DNS response should be no larger than this size. In such cases where this is not possible, then the server will respond with a truncated DNS response over UDP. In this measurement the truncated response packet has an empty answer section, so the resolver making the query cannot use this truncated response to assemble an answer, and it should trigger the resolver to repeat the query over a TCP session with the server.
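The size rule described above can be sketched in a few lines of Python. This is only an illustration of the convention, not the code used in the measurement; the function names `udp_size_limit` and `udp_action` are hypothetical.

```python
from typing import Optional

DEFAULT_UDP_LIMIT = 512  # limit when the query carries no usable EDNS(0) buffer size


def udp_size_limit(edns_bufsize: Optional[int]) -> int:
    """Largest DNS payload the server may send over UDP for this query."""
    # No EDNS(0) extension, or a buffer size that is not greater than 512,
    # leaves the classic 512-octet limit in place.
    if edns_bufsize is None or edns_bufsize <= DEFAULT_UDP_LIMIT:
        return DEFAULT_UDP_LIMIT
    return edns_bufsize


def udp_action(response_len: int, edns_bufsize: Optional[int]) -> str:
    """Return 'send' if the response fits, otherwise 'truncate' (TC bit set)."""
    if response_len <= udp_size_limit(edns_bufsize):
        return "send"
    return "truncate"
```

A 1,300-octet response to a query that offered 1,232 octets yields `truncate`, which is the signal for the resolver to retry over TCP.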
In this, the second part of the report, we ask the question: What if we break with these conventions?
In particular, we are interested in understanding the likely changes to DNS resolution behaviour of fragmented UDP responses, the behaviour of TCP responses and the behaviour of the DNS as a whole if all recursive resolvers were to use the DNS Flag Day 2020 (https://dnsflagday.net/2020/) setting of 1,232 octets as a buffer size in their queries. Here, we will look at the behaviour of the DNS when we process incoming queries as if they all had an EDNS(0) extension and there was a buffer size in this extension that was set to a particular value. Yes, this server-based rewriting of queries is cheating, and it’s not what resolvers may be expecting, but it allows us to gain some further insights into the capabilities of the resolver-to-authoritative part of the DNS.
We are going to perform five variants of changing DNS queries. We will first test UDP buffer sizes where all incoming queries are altered to have buffer sizes of 512, 1,232, 1,440 and 4,096 octets. We will then modify the MSS of incoming TCP SYN packets and set this value to 1,200 octets.
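For readers who have not looked at the wire format, the buffer size in question is carried in the CLASS field of the EDNS(0) OPT pseudo-record that a resolver appends to its query. The following Python sketch builds a minimal OPT record with the Flag Day value and reads the size back. The helper names are hypothetical, and the record carries no options and leaves all flags at zero.

```python
import struct

OPT_RR_TYPE = 41  # EDNS(0) OPT pseudo-record type


def build_opt_rr(udp_bufsize: int = 1232) -> bytes:
    """Build an OPT record: root name, TYPE, CLASS (buffer size), TTL, RDLEN."""
    # The TTL field holds extended RCODE, version and flags; all zero here.
    # RDLEN is zero because this sketch attaches no EDNS options.
    return b"\x00" + struct.pack("!HHIH", OPT_RR_TYPE, udp_bufsize, 0, 0)


def advertised_bufsize(opt_rr: bytes) -> int:
    """Extract the UDP buffer size from an OPT record built as above."""
    # Skip the 1-octet root name and the 2-octet TYPE field.
    (size,) = struct.unpack_from("!H", opt_rr, 3)
    return size
```

The server-side rewriting used in this report amounts to ignoring whatever `advertised_bufsize` would return for the incoming query and substituting a fixed value.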
When we force the buffer size to 4,096 octets for all incoming queries then at no stage will a recursive resolver receive a response with the truncation bit set. This means that the server will respond to all queries over UDP with a UDP response, and it will fragment all larger UDP responses. The fragmentation onset will reflect the server’s local MTU setting of 1,500 octets. The results are shown in Table 1.
Actually, that’s not quite “all”. In 1% of cases we observe a query over TCP, even though a truncated response has not been previously sent. It appears that some of the time a resolver that is not receiving fragmented UDP responses will probe the server with TCP in some kind of liveness test.
|DNS Response Size||Tests||UDP Pass Rate||UDP Fail Rate||IPv4 Failure Rate||IPv6 Failure Rate||Control Failure Rate|
The columns in Table 1 reflect a dual stack failure rate, an IPv4-only experiment, and an IPv6-only experiment, and the control, which is the experiment that does not alter the received buffer size in any way.
There are some unexpected outcomes in this data. The first is that we observed a 2% failure rate for unfragmented UDP responses with DNS payload sizes of 1,270 octets and greater. Oddly enough the failure rate for DNS payloads between 1,270 octets and 1,430 octets in IPv4-only (2.4%) is double that of IPv6-only (1.2%). These DNS responses are packaged by the server as unfragmented UDP packets.
As the smaller control unfragmented DNS response was successfully processed by the resolver, this presumably implies that there is some network infrastructure close to some resolvers that is discarding UDP packets where the payload size is between 1,270 and 1,430 octets, or the resolvers themselves are not accepting incoming DNS packets of size greater than 1232 octets in some circumstances.
This particular result is likely to be due to the nature of the experimental setup and resolver behaviour, rather than being due to network behaviour.
In this experiment we are deliberately abusing the DNS specification and the experiment’s server is ignoring the resolver clients’ offered UDP buffer size values.
Most resolver implementations appear not to raise an exception if the DNS response in the UDP packet is larger than the UDP buffer size specified in the query, but some resolver implementations appear to perform a correlation between query and response. These implementations appear to be discarding a UDP response if the DNS payload is larger than the UDP buffer size in the original query. Similarly, there are instances where the response is being discarded if no buffer size was originally given by the resolver client and the response is larger than 512 octets.
When we look at the comparison between the resolver client’s buffer size and the size of the UDP response, then for each individual test there are three possible types of response: all responses for the test are smaller than the query-specified buffer sizes, all responses for the test are larger than the query-specified buffer sizes, or it's a mixed scenario. We then divide up each case into success and fail. The results are as follows:
It’s clear that in these unfragmented UDP cases the majority of failures occur when the DNS response is larger than the query-specified buffer size.
The conclusion drawn from this data is that the observed loss rates for unfragmented UDP responses when we use a test that deliberately disregards the offered UDP buffer size are generally attributable to these resolver clients rejecting the server’s responses in those cases where the response is larger than the size specified in the original query. There is no evidence of systematic network failure when using these packet sizes, either in IPv4 or in IPv6.
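The three-way classification used in this comparison can be sketched as follows. This is a hypothetical reconstruction for illustration only: the record layout (a list of query buffer size and response size pairs per test) and the function names are assumptions, not the analysis code behind the figures.

```python
from typing import List, Optional, Tuple

Exchange = Tuple[Optional[int], int]  # (query buffer size or None, response size)


def effective_bufsize(bufsize: Optional[int]) -> int:
    """Treat a missing or too-small buffer size as the 512-octet default."""
    if bufsize is None or bufsize < 512:
        return 512
    return bufsize


def classify_test(exchanges: List[Exchange]) -> str:
    """Label one test as 'smaller', 'larger' or 'mixed'."""
    if not exchanges:
        raise ValueError("a test needs at least one query/response pair")
    oversized = [size > effective_bufsize(buf) for buf, size in exchanges]
    if all(oversized):
        return "larger"   # every response exceeded the query-specified size
    if not any(oversized):
        return "smaller"  # every response fitted within the query-specified size
    return "mixed"
```

Each label is then split into pass and fail according to whether the resolver went on to complete the resolution.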
When we quote figures about IPv6 we are talking about the pass and failure rates as they relate to the subset of users who are located behind IPv6-capable DNS resolvers. This is currently measured to be around some 55% of users.
For UDP packets that are fragmented by the server before they are sent, namely with payloads greater than 1,472 octets (and 1,452 in IPv6) the failure rate rises considerably for both protocols. IPv6 fragmentation is evidently not handled as well as IPv4, but both protocols show an extremely high loss rate. There are likely to be two factors going on in this scenario. Firstly, there is the ‘oversized’ response being discarded by the resolver, which would account for a 2.4% failure rate based on the data from the smaller unfragmented packets. The additional failure component appears to be related to a fragmentation drop behaviour, which appears to account for the remaining 12% failure rate. In IPv6 the fragmentation-related drop rate appears to account for 15.2% of failure cases while in IPv4 the ‘oversize’ drop rate is higher and the residual fragmentation drop rate is 12%.
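To make the fragmentation onset concrete, here is a small Python sketch of how a sender with a 1,500-octet MTU would split a UDP datagram. It assumes IPv4 headers without options and IPv6 packets with no extension headers other than the 8-octet Fragment header; the function name is hypothetical.

```python
from typing import List

UDP_HEADER = 8
IP_HEADER = {4: 20, 6: 40}
IPV6_FRAGMENT_HEADER = 8


def fragment_payloads(dns_len: int, ip_version: int, mtu: int = 1500) -> List[int]:
    """Return the number of UDP-datagram octets carried in each IP packet."""
    datagram = UDP_HEADER + dns_len
    room = mtu - IP_HEADER[ip_version]
    if datagram <= room:
        return [datagram]  # fits in one packet, no fragmentation
    if ip_version == 6:
        room -= IPV6_FRAGMENT_HEADER  # each IPv6 fragment carries a Fragment header
    per_fragment = room - room % 8    # fragment offsets count in units of 8 octets
    sizes = []
    remaining = datagram
    while remaining > per_fragment:
        sizes.append(per_fragment)
        remaining -= per_fragment
    sizes.append(remaining)
    return sizes
```

Under these assumptions a 1,600-octet DNS response leaves the server as fragments carrying 1,480 and 128 octets in IPv4, and 1,448 and 160 octets in IPv6.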
Why isn't the IPv6 fragmentation drop rate of 15.2% even higher? Other studies have reported IPv6 fragmented packet drop rates from 20% to upward of 45%.
The reason probably lies in the particular circumstances of this experiment. Here we are looking at the path between recursive resolvers and a small set of authoritative servers. The servers are located in a data centre hosted environment that admits fragmented IPv6 packets and the recursive resolvers would presumably be located in operationally managed facilities that would likely also be managed to achieve operational robustness. In other words, here we are looking more at the ‘core’ of the network rather than the connections to the edges.
The higher IPv6 fragmented packet drop rates have generally been observed in studies using end-to-end measurements which would presumably include edge networks. This implies that this observed 15.2% IPv6 UDP fragmentation drop rate reflects aspects of the recursive-to-authoritative network path but is not a good starting point to make more universal claims about IPv6 fragmentation performance in the end-to-end Internet. It’s also the case that the IPv4 fragmentation drop rate is 12% in this scenario. This is a critical observation, in that other studies of end-to-end fragmentation drop rates in IPv4 do not report such high levels of packet drop.
This implies that the observed IPv6 fragmentation drop rate is more likely to be due to specific security-based filter rules relating to UDP packet fragmentation rather than network behaviours dropping IPv6 packets with extension headers in this particular measurement scenario.
What is also somewhat unexpected is that the average query count is so high for failure cases when the response is fragmented (Table 2). The lack of a truncated response leads some resolution systems to re-query at a high rate over the 60 second measurement window.
A similar pattern is visible when looking at the average time taken to perform this resolution task (Table 3). While the average number of queries to successfully resolve a name rises by 2 queries for fragmented UDP packets, the average time taken to successfully complete the resolution process rises by a further 80ms on average when the UDP response is fragmented.
These results do not place fragmented UDP in a good light for the DNS, irrespective of the IP protocol version. There is a base rate of some 14% of experiments that fail when the only resolution mechanism is fragmented UDP, and this rises by a further 2.5% when IPv6-only is used. The elapsed time to resolve also stretches out, and 8 seconds on average for resolution of a name when fragmented UDP is the only resolution mechanism is simply too long a time to be useful.
These results suggest that the original recommendation in RFC 6891 to use a default buffer size parameter value of 4,096 octets was overly optimistic about the performance characteristics of fragmented UDP when negotiating firewalls and filters in front of DNS resolvers. Avoiding UDP fragmentation in the DNS appears to be a prudent measure, not because of network drop per se, but because of the common operational conventions in filtering fragmented DNS over UDP packets.
Let’s test this theory some more.
What if we alter our measurement environment to truncate every response larger than 512 octets and only serve larger DNS responses over TCP?
When we force the buffer size to 512 for all received queries then the experiment server will use a truncated response for all queries received over UDP. The truncated response contains no answer section, so the resolver will need to perform the query over TCP to resolve the name. The results are shown in Table 4.
|DNS Response Size||Tests||TCP Pass Rate||TCP Fail Rate||IPv4 Failure Rate||IPv6 Failure Rate||Control Failure Rate|
It appears that some 1.6% of users sit behind a resolver that cannot perform DNS over TCP. If we look at the users behind IPv4-capable resolvers, then the proportion rises slightly to 1.9%. When we look at the subset of users behind IPv6-capable resolvers the number drops slightly to 1.6%. It is likely that more recent resolver deployments support both IPv6 and TCP, while there is a set of legacy resolver systems that do not support IPv6 and a higher proportion of these resolvers do not support TCP.
The failure rate rises slightly, by 0.2%, when the TCP response requires two TCP segments. This also means that the first TCP segment is sent using a segment size equal to the receiver’s offered MSS value. If there are any path MTU issues on the TCP path, then the first full-size packet may encounter a TCP black hole situation where the ICMP message is not passed back to the TCP sender (the DNS server), and the TCP connection hangs.
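The segment arithmetic behind this can be sketched as below. The sketch assumes TCP headers without options (options such as timestamps would reduce the data carried per segment) and counts the two-octet length prefix that precedes every DNS message on TCP; the function names are hypothetical.

```python
import math

DNS_TCP_LENGTH_PREFIX = 2  # DNS messages on TCP are preceded by a 2-octet length
TCP_HEADER = 20            # assumes no TCP options
IP_HEADER = {4: 20, 6: 40}


def tcp_segments(dns_len: int, mss: int) -> int:
    """Number of TCP segments needed to carry one DNS response."""
    return math.ceil((dns_len + DNS_TCP_LENGTH_PREFIX) / mss)


def first_packet_size(dns_len: int, mss: int, ip_version: int) -> int:
    """IP packet size of the first data segment sent by the server."""
    data = min(dns_len + DNS_TCP_LENGTH_PREFIX, mss)
    return IP_HEADER[ip_version] + TCP_HEADER + data
```

With an offered MSS of 1,460 in IPv4, a 1,600-octet response needs two segments and the first one is a full 1,500-octet packet, which is exactly the packet that is exposed to a path MTU black hole.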
|DNS Response Size||Failure Count||NO TCP Failure||TCP ACK Failure||TCP OK Failure|
This appears to be the reason behind the increased failure rate in TCP ACK failure when the DNS payload exceeds the MSS and the response is delivered using a full-sized packet (where the offered MSS equals the outbound MTU minus the packet header overheads). As we noted in the first part of this report (Figure 11 of DNS XL, Part 1), some 80% of TCP sessions over IPv4 and 57% of TCP sessions over IPv6 use an MSS setting in the TCP session that assumes a 1,500-octet path MTU.
However, the more dominant factors when failure occurs are cases where there is no TCP at all and cases where there is what appears to be a successfully completed TCP transaction.
More than half the time failure occurs when the resolver cannot open the TCP and pass the query to the server. Most likely this is an enthusiastic filter setting close to the resolver that does not allow the DNS to use TCP port 53.
The other failure mode is not so readily explained. In a little over one third of cases the TCP session passes the response to the remote client and the client end of the TCP session acknowledges the data. This would normally lead us to conclude that the resolver now has the data. But the resolver does not then complete the overall DNS resolution process. It is unclear why this occurs. A possible explanation is that the DNS application is discarding TCP responses that exceed its UDP payload size, although why a resolver would apply a UDP maximum payload setting to responses received over TCP is not readily explained.
The average query count for pass experiments is 1 – 2 queries greater than the control, and 1 query greater than the UDP-only count for smaller packets and much the same as UDP-only for larger DNS responses. The query count for failed experiments is 10 times higher than UDP-only for smaller packets, and similar for the larger DNS responses (Table 6).
TCP takes some additional time to start in the DNS. There is 1 round trip time to deliver the UDP truncated response and a further round trip time to complete the TCP handshake, so we can expect the delay with TCP to be longer than simple UDP. Compared to the results in Table 3 (where only UDP was used), the results for this TCP-only experiment show an elapsed time that is a little under double (Table 7). However, larger responses are delivered reliably. Unlike fragmented UDP, the TCP failure rate is consistently low.
It appears that unfragmented UDP is both fast and reliable, while for larger responses where UDP fragmentation is unavoidable TCP is more reliable, albeit somewhat slower. What happens when we force this behaviour by setting the buffer size in all queries to a value where UDP fragmentation is avoided?
The next scenario to be explored here is that being used in DNS Flag Day 2020. Here we set our server to behave as if all incoming queries use a buffer size of 1,232 octets. The intent here is to use UDP when we can be reasonably confident that the UDP packet will not encounter UDP fragmentation scenarios, and then shift to TCP for larger responses. The shift to TCP is of course controlled by the server providing a truncated response in UDP. In our case we are once again pushing this beyond conventional behaviour, in that we are not loading an answer section into the truncated response. The only way that the resolver will receive the response is by using TCP once the DNS response size exceeds 1,232 octets. The results of this measurement experiment are shown in Table 8.
|DNS Response Size||Tests||1232 Pass Rate||1232 Fail Rate||1232 IPv4 Failure Rate||1232 IPv6 Failure Rate||Control Failure Rate|
Predictably, we see the UDP-only failure rate (0.5%) for DNS responses of less than 1,232 octets and the TCP-only failure rate (1.6%) for larger packets. This is comparable to the control experiment for smaller responses, slightly worse than the control for responses up to 1,430 octets and slightly better for larger responses.
The average query count in this case is 2 queries more than the control case for smaller DNS responses and 1 query more for larger responses.
The elapsed time to complete resolution rises once the DNS payload exceeds 1,232 octets, and there is on average a further 100ms to complete the resolution process for these larger packets. This is due to the overheads of the truncated DNS response and the TCP handshake time for these response sizes.
With an overall loss rate of 1.8% for DNS payloads larger than 1,232 octets the obvious question is whether we can improve on this scenario. What if we lift the buffer size to just below the onset of UDP packet fragmentation, namely at 1,440 octets?
Let’s now look at the scenario of lifting of the threshold point to switch to TCP to just below a packet size of 1,500 octets. We will force all queries to use a buffer size setting of 1,440 octets.
We know from the all UDP experiment (Table 1) that there is an elevated response loss rate when the DNS payload size in UDP exceeds the resolver-client specified buffer size in the query, and this is visible in Table 11. This appears to account for a minimum of some 2% of the 2.6% observed failure rate for these smaller-sized packets.
The UDP loss rate for this size range exceeds the TCP loss rate that we observed in Table 8 where the lower buffer size setting of 1,232 octets was used.
|DNS Response Size||Tests||1440 Pass Rate||1440 Fail Rate||1440 IPv4 Failure Rate||1440 IPv6 Failure Rate||Control Failure Rate|
The UDP average query count is uniformly low up until the TCP point, and the truncation and switch to TCP lifts the average query count for successful resolution efforts by slightly over 2 queries. The unsuccessful query count is more than quadrupled when there is a shift to TCP (Table 12).
The UDP-based retrieval is also considerably faster than TCP, completing the resolution in an average of 130ms, compared to 260ms, which is consistent with the overheads of the TCP connection (Table 13).
This data suggests that the lower buffer size of 1,232 is more robust for resolvers, but it will add delays in resolution time and impose a greater query load on the server, both in terms of the TCP control overhead and the additional query volume for responses whose size falls into the range of 1,232 to 1,440 octets. It is possible, even likely, that the loss rate would fall were resolvers to use a default buffer size of 1,440 octets rather than 1,232 octets. The issue here appears to be application-level settings disregarding received packets and not an intrinsic behavioural property of the network path between the servers and recursive resolvers.
There is another variant to examine here, and that is to try and reduce the incidence of TCP path MTU issues. One way to achieve this is to drop the MTU setting on the server, so that it will not push out 1,500 octet IP packets. Another way is to modify the incoming MSS of TCP connection packets and rewrite the MSS to a lower value. In this experiment we’ve used the approach of rewriting the MSS on incoming TCP SYN packets, changing the MSS value to a value of 1,200 octets. This should reduce the TCP failure rate where the server sends the DNS data and does not receive an ACK for the data.
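The MSS rewrite described here can be illustrated with a short Python sketch that walks the TCP options of a SYN and lowers the MSS option to a ceiling. This is not the mechanism used in the experiment, which is not described in detail; the function name is hypothetical, and a real packet rewrite would also have to recompute the TCP checksum.

```python
import struct

TCP_OPT_EOL = 0  # end of option list
TCP_OPT_NOP = 1  # single-octet padding
TCP_OPT_MSS = 2  # maximum segment size, length 4


def clamp_mss_option(options: bytes, ceiling: int = 1200) -> bytes:
    """Return the TCP options with any MSS value lowered to the ceiling."""
    out = bytearray(options)
    i = 0
    while i < len(out):
        kind = out[i]
        if kind == TCP_OPT_EOL:
            break
        if kind == TCP_OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(out):
            break  # malformed: option without a length octet
        length = out[i + 1]
        if length < 2 or i + length > len(out):
            break  # malformed length, leave the rest untouched
        if kind == TCP_OPT_MSS and length == 4:
            (mss,) = struct.unpack_from("!H", out, i + 2)
            struct.pack_into("!H", out, i + 2, min(mss, ceiling))
        i += length
    return bytes(out)
```

An offered MSS of 1,460 becomes 1,200, while a client that already offered a smaller value is left alone.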
|DNS Response Size||Tests||1440 Pass Rate||1440 Fail Rate||1440 IPv4 Failure Rate||1440 IPv6 Failure Rate||Control Failure Rate|
There is a very small change in the failure rate for DNS responses larger than 1,500 octets, and the change is around 0.1% (Table 14). The change improves the IPv6 performance, dropping the failure rate for larger packets from 1.7% to 1.6%.
The query count profile is largely unaltered, as one would expect, although the level of query thrashing for large responses that fail in TCP is higher. Once the issue of TCP “black hole” failure is removed, the other failure cases relating to TCP become the dominant factor, and the number of TCP queries that are made in 60 seconds increases once the stalled TCP sessions are eliminated (Figure 15).
The profile of time to resolve is also similar, although the elapsed time for larger responses is somewhat larger (Figure 16).
So far, we have assumed a model where the resolver client is in control of the onset of UDP fragmentation by using the buffer size parameter in the EDNS(0) extension attached to a DNS query. Some DNS implementations also allow the server to influence the onset of UDP fragmentation in DNS responses over UDP. In the BIND resolver the configuration option is the max-udp-size value:
Sets the maximum EDNS UDP message size named will send in bytes. Valid values are 512 to 4096 (values outside this range will be silently adjusted). The default value is 4096. The usual reason for setting max-udp-size to a non-default value is to get UDP answers to pass through broken firewalls that block fragmented packets and/or block UDP packets that are greater than 512 bytes. This is independent of the advertised receive buffer (edns-udp-size).
The intent of this setting is to allow the server to set its own maximum UDP response size. If the query provides a lower value for the buffer size then the server will use it, but if the query has a higher buffer size value, then this local setting will be used. What happens when we combine this approach with the server-side imposed TCP MSS value of 1,200? The results of this experiment are shown in Table 17.
|DNS Response Size||Tests||max 1232 Pass Rate||max 1232 Fail Rate||max 1232 IPv4 Failure Rate||max 1232 IPv6 Failure Rate||Control Failure Rate|
The change here is that we are avoiding the case where the client drops the response because it is larger than the clients’ originally specified maximum UDP response sizes. Because no UDP response is larger than 1,232 octets of payload then all intermediate sized responses (1,270 octets) and large responses (larger than 1430 octets) switch to TCP, and the larger TCP failure rate (of some 1.7%) kicks in. As observed already, the TCP failure rate for IPv4 resolvers is almost double the IPv6 failure rate.
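The difference between a server-side ceiling and the forced buffer size used in the earlier experiments can be sketched as follows. The function name is hypothetical and the code illustrates the rule only; it is not BIND's implementation.

```python
from typing import Optional


def capped_udp_limit(query_bufsize: Optional[int], server_max: int = 1232) -> int:
    """UDP payload limit when the server applies its own ceiling."""
    # A query without EDNS(0), or with a size below 512, still means 512 octets.
    if query_bufsize is None or query_bufsize < 512:
        client_limit = 512
    else:
        client_limit = query_bufsize
    # The ceiling can only lower the limit, never raise it above the client's value.
    return min(client_limit, server_max)
```

Because the result never exceeds what the client asked for, a resolver that checks the response size against its own query has nothing to reject, which is the behaviour change noted above.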
The profile of number of queries (Table 18) and time to resolve (Table 19) the name is largely similar to the previous case.
|<= 1232||Control||<= 1232||Control|
|<= 1232||Control||<= 1232||Control|
This case is similar to case 6, but with the UDP-to-TCP threshold lifted to 1,440 octets.
|DNS Response Size||Tests||max 1440 Pass Rate||max 1440 Fail Rate||max 1440 IPv4 Failure Rate||max 1440 IPv6 Failure Rate||Control Failure Rate|
The outcomes for IPv4 and IPv6 non-fragmented packets in Table 20 are slightly better than the results in Table 14, particularly as it relates to DNS response sizes in the range 1,270 to 1,470 octets. It appears that some 2% of users sit behind recursive resolvers that will check the UDP DNS response size against the buffer size in the original query and reject the response if it is larger than the query-specified size.
|<= 1440||Control||<= 1440||Control|
|<= 1440||Control||<= 1440||Control|
The number of queries (Table 21) and query time (Table 22) show a marked performance improvement for intermediate-sized responses as would be expected.
Let’s collect the results of these individual experiments into a single table that looks at the failure rates for the various packet size management scenarios (Table 23).
There is a set of design trade-offs in the choices for transport for the DNS protocol.
For short responses UDP is an efficient and reliable transport vehicle. However, when the size of the UDP response is larger than the network path MTU and UDP fragmentation is required, then fragmentation packet losses create serious problems for the protocol, and it becomes unreliable.
For that reason, TCP will be far more reliable than fragmented UDP for larger responses on average. However, TCP is slower and far less efficient than UDP and its basic reliability rate is worse than unfragmented UDP. If carriage efficiency and reliability are considerations for the DNS, then unfragmented UDP is clearly superior to TCP, while TCP is clearly superior to fragmented UDP.
|DNS Response Size||Failure Rates|
|Control||512 (TCP)||4096 (UDP)||1232||<= 1232||1440||<= 1440|
What this means is that UDP should be used for as long as it will not encounter fragmentation, and then the DNS should shift to TCP.
How can this be achieved? It is unreasonable to expect that a lightweight UDP-based packet exchange should perform a path MTU discovery operation for each and every transaction. This implies that both the client and the server should use conservative settings for transport parameters that avoid path MTU issues.
What should a DNS client do?
The DNS Flag Day 2020 settings are a good start, but I think that they don’t quite catch the entirety of the space. A client should use an EDNS(0) payload size setting equal to or less than 1,452 in IPv6 (accounting for a 40 octet IPv6 header and an 8 octet UDP header), and 1,472 in IPv4 (accounting for a 20 octet IPv4 header and an 8 octet UDP header). For TCP, a client should also use a TCP MSS setting of no more than 1,440 octets in IPv6 (accounting for a 40 octet IPv6 header and a 20 octet TCP header) and 1,460 octets in IPv4 (accounting for a 20 octet IPv4 header and a 20 octet TCP header).
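The header accounting behind these numbers is simple enough to write down as a sketch. It assumes IPv4 headers without options, IPv6 without extension headers and TCP without options; the function names are hypothetical.

```python
IP_HEADER = {4: 20, 6: 40}
UDP_HEADER = 8
TCP_HEADER = 20  # assumes no TCP options


def max_udp_payload(ip_version: int, mtu: int = 1500) -> int:
    """Largest DNS payload that fits in one unfragmented UDP packet."""
    return mtu - IP_HEADER[ip_version] - UDP_HEADER


def max_tcp_mss(ip_version: int, mtu: int = 1500) -> int:
    """Largest TCP segment payload that fits in one packet of this MTU."""
    return mtu - IP_HEADER[ip_version] - TCP_HEADER
```

The same arithmetic applied to the IPv6 minimum MTU of 1,280 octets gives 1,232, which is where the DNS Flag Day 2020 value comes from.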
What should a DNS server do?
The server should also avoid fragmentation, and it can do this by setting a maximum payload size value no larger than 1,452 in IPv6 and 1,472 in IPv4. It should also impose a ceiling on the size of outgoing TCP packets of 1,440 octets in IPv6 and 1,460 in IPv4.
Specific circumstances vary, and there is a difference between measurements at the edge of the Internet and within the infrastructure of the network. Our extensive measurements of the behaviour of the inner infrastructure of the Internet between recursive resolvers and authoritative servers indicate that the network behaviour is relatively uniform with IP packet sizes up to 1,500 octets. If we restrict ourselves to settings that relate only to the transactions between recursive resolvers and authoritative servers then the DNS Flag Day 2020 setting of 1,232 octets is too low. The result is that the transaction will invoke TCP too early. A more efficient outcome can be achieved by pushing the UDP packet size to 1,500 octets including the IP header.
At the same time, it is prudent to pull the TCP segment size down. The incremental performance cost of using a 1,200 octet MSS value is extremely small when looking at DNS transactions.
This leads to some recommendations for transport parameter values for DNS clients and servers, shown in Table 24. The intent of these settings is to use UDP all the way to 1,500 octets of IP packet size, then use TCP with a more conservative MSS setting that increases the reliability of TCP sessions.
|Parameter||IPv4||IPv6|
|Client EDNS(0) Buffer Size||1,472||1,452|
|Client TCP MSS||1,200||1,200|
|Server Max Buffer Size||1,472||1,452|
|Server max TCP MSS||1,200||1,200|
It must be noted that these settings apply only to the “inside” of the Internet in the path between recursive resolvers and authoritative servers. The edge of the Internet shows greater levels of variability and it is probably prudent to use a lower UDP upper bound, although this is an aspect of the DNS where our measurement technique cannot gain a direct insight, so we’ve refrained from making any particular recommendations for the edge stub-to-recursive resolver scenario. It’s likely that the TCP MSS setting of 1,200 octets would still make sense, but it is less clear whether the higher buffer size parameter is equally applicable at the edge.
The above views do not necessarily represent the views of the Asia Pacific Network Information Centre.
GEOFF HUSTON is the Chief Scientist at APNIC, the Regional Internet Registry serving the Asia Pacific region.