ISP Column
Geoff Huston
April 2016
DNS OARC held a two-day workshop in Buenos Aires prior to IETF 95 at the
end of March 2016. Here are my impressions of this meeting.
For a supposedly simple query-response protocol that maps names to IP
addresses, there is a huge amount going on under the hood with the DNS.
Some of it good, some not so good.
In the not so good category Anand Buddhdev from the RIPE NCC reported on
some attacks against the authoritative nameserver infrastructure
operated by the RIPE NCC. They operate an anycast cluster with 9 servers
in 3 locations at AMSIX (Amsterdam), LINX (London) and NetNOD
(Stockholm). They serve some 5,000 zones, with a mix of reverse
address-to-name servers and secondary services for a number of other
domains, including a number of country domains. They experienced a high
query load targeting the Turkey (.tr) secondary nameserver. The initial
attempts to filter this traffic led to changes in the attack profile,
and the response then led to a kernel filter to perform incoming packet
drops based on query patterns. This led to a response of more intense
incoming traffic volumes that threatened to overwhelm the ingress paths
to the servers, leading to a distributed routing blackhole response.
This is another illustration of the acute level of toxicity in today’s
Internet, where attacks now have the capacity to overwhelm all but the
very largest of the service providers. The RIPE NCC is looking to take a
path of outsourcing its DNS services into facilities that can absorb
multi-gigabit traffic attacks. The attacks themselves are transient, but
the ensuing defensive measures are expensive, and the limited capacity
to deploy such services tends to concentrate service delivery into a
small number of providers, while the rest of the Internet experiences
ever-rising levels of toxic background radiation from these attacks.
Also in the not so good category, Matt Weinberg and Duane Wessels from
Verisign reported on A root and J root traffic, and in particular
attacks against the root in November 2015 and December 2015. These
attacks generated a lot of discussion including the tech press at the
time. There were two attack events, on November 30 and December 1, and
10 out of the 13 members of the root server constellation received the
attack traffic. D, L and M received no attack traffic (it appears that
the attack traffic was being sent to old IP addresses for these
servers). The traffic was UDP only and IPv4 only. They saw the command
and control traffic as well as the attack traffic. All the queries were
for www.336901.com (evidently a Chinese game site) at 5.2M queries per
second (qps) at A root and J root. The second attack was 24 hours later
for www.916yy.com at the same 5M qps query rate. This used 895M unique
source IPs (obviously spoofed), but 68% of the queries were from 200 IP
addresses - i.e. there was a strong core of attack query traffic from a
small set of attackers. This attack caused service impact on the 10 root
server clusters, as seen by RIPEmon monitors. The attack traffic was
evenly spread across the major A-root server sites. J root uses 90
smaller sites, and packet loss was seen at some of these smaller sites.
Verisign uses Response Rate Limiting (RRL), and this dropped up to 60% of
the attack traffic automatically (corresponding to the clustered attack
traffic). However, this does not stop all the traffic and regional sites
with low query volumes did not trip the RRL threshold. The presentation
included an effective graphic using a Hilbert curve display of IPv4
address space to show the systematic way in which source addresses were
spoofed. The clear presence of a number of distinct source address
“walkers” was taken as evidence of several distinct attack generation
engines at work, using known malware for generating DNS attacks.
The underlying motivation of the attack is not clearly understood. The
damage was to the root server systems, so the query target sites were not
harmed. It is possible that this was flag planting, or possibly a
diversionary exercise. In this case RRL responded well given the reuse
of certain IP addresses. One interesting area of speculation is the
response of CGNs when a source address of a packet is spoofed. The data
tends to suggest that a widespread CGN behaviour is to perform the NAT
mapping irrespective of whether the source address is spoofed or not!
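As an aside on the Response Rate Limiting mentioned above, the core of
the technique is simple enough to sketch: responses are accounted
against the querier’s address prefix and the query name, and identical
responses beyond a per-second budget are dropped (or occasionally
“slipped” back as a truncated reply to push genuine clients to TCP). The
Python fragment below is a minimal illustration of that idea only; the
limit, window and /24 aggregation are illustrative values, not the
settings Verisign actually uses.

    # Minimal sketch of the Response Rate Limiting (RRL) idea: throttle
    # identical answers to the same client prefix once they exceed a
    # per-second budget. All values here are illustrative.
    import time
    from collections import defaultdict

    LIMIT = 5        # responses per second per (prefix, qname) bucket
    WINDOW = 1.0     # accounting window, in seconds

    # (prefix, qname) -> [window_start_time, responses_sent_in_window]
    buckets = defaultdict(lambda: [0.0, 0])

    def allow_response(src_ip, qname, now=None):
        now = time.monotonic() if now is None else now
        prefix = ".".join(src_ip.split(".")[:3])   # crude IPv4 /24 aggregation
        start, count = buckets[(prefix, qname)]
        if now - start >= WINDOW:
            buckets[(prefix, qname)] = [now, 1]    # start a new window
            return True
        if count < LIMIT:
            buckets[(prefix, qname)][1] = count + 1
            return True
        # Drop the response; a real implementation would periodically send
        # a truncated "slip" reply so that genuine clients retry over TCP.
        return False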
There is a second form of unwanted DNS query traffic, but this query
traffic is not malicious. This is the detritus of the DNS itself. When a
query is passed into the DNS system for resolution it does not go
completely unnoticed. I reported to the workshop on an analysis of DNS
queries seen as part of an unrelated measurement exercise. Out of 44
billion DNS queries seen by the measurement’s authoritative name servers
in a 3 month period, some 25%, or 11 billion queries, were ‘junk’
queries that reused old experiment names. Further analysis of these
zombie queries revealed one group of behaviours linked to DNS
recursive resolvers performing some form of local cache refresh
operating over long time intervals. A distinct query behaviour looks to
be DNS stalkers that replay previous DNS queries. And there is a small
group of totally deranged resolvers that appear to have some form of
very tight loop of repeat queries!
A second theme of the workshop was on key and key generation algorithm
changes for DNSSEC.
This is a topic with some level of background. The first time this
gained wider attention was some years ago when each time a key rolled
the query levels to the authoritative name servers increased
(http://www.potaroo.net/ispcol/2010-02/rollover.html). Since then resolver
behaviour has been moderated and we don’t observe that particular broken
behaviour any more, but that does not mean that the process is now
completely trouble free. There were two topics of interest at this OARC
workshop: change of algorithm and change of the Root Zone keys.
Anand Buddhdev of the RIPE NCC also spoke on the subject of crypto
algorithm rolls, looking at the experience of the RIPE NCC in rolling
the algorithm of their signed zones. The RIPE NCC was an early adopter
of DNSSEC, and some 10 years ago the use of SHA-1 was a commendable
decision: it was probably the strongest algorithm available at the time.
These days it’s a point of vulnerability, so the decision to change to
use SHA-2 was entirely reasonable. However, they use Secure-64 units to
manage their signed zones, and these units were incapable of handling an
algorithm change. Their initial workaround plan, to go from signed with
SHA-1 to insecure and then to signed with SHA-2, was not well received
by their community, so they had to wait for algorithm roll support from
Secure-64 to implement this capability. The plan was to introduce the
SHA-2 RRSIGs, wait for twice the TTL, then introduce the new ZSK and KSK
values, again wait for twice the TTL, then withdraw the old ZSK and,
following a DS update at the parent, withdraw the old KSK. They
encountered problems with
the ZSK withdrawal. Section 5.10 of RFC 6840 requires that all the
algorithms present in the DS and DNSKEY records be used to sign the
entirety of the signed records in the zone. This was being strictly
enforced by both the Unbound and Verisign resolvers, which caused their
problems. They were forced to reinstate the ZSK and reverse the planned
order, withdrawing the old KSK first and, only after the DS was updated,
withdrawing the old ZSK.
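To make the Section 5.10 constraint concrete, here is a small Python
sketch of the check that a strict validator effectively performs: every
algorithm that appears in the zone’s DNSKEY RRset must have produced
signatures over every signed RRset. The data structures and names are
invented for illustration; algorithm 5 is RSASHA1 and algorithm 8 is
RSA/SHA-256.

    # Illustrative sketch of the RFC 6840 Section 5.10 consistency rule.
    def algorithms_consistent(dnskey_algorithms, rrsig_algorithms_by_rrset):
        """dnskey_algorithms: set of algorithm numbers in the apex DNSKEY RRset.
        rrsig_algorithms_by_rrset: map from each signed RRset (owner, type)
        to the set of algorithms that have produced an RRSIG over it."""
        for rrset, signing_algorithms in rrsig_algorithms_by_rrset.items():
            missing = dnskey_algorithms - signing_algorithms
            if missing:
                return False, rrset, missing
        return True, None, None

    # The problem case: the old algorithm 5 ZSK has been withdrawn, so the
    # zone's RRsets carry algorithm 8 signatures only, yet algorithm 5 is
    # still listed in the DNSKEY (and DS) records.
    print(algorithms_consistent({5, 8}, {("www.example.net", "A"): {8}}))
    # -> (False, ('www.example.net', 'A'), {5}): a strict validator rejects it.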
I reported on an update of the level of support for the ECDSA P-256 in
DNSSEC validating resolvers. When this measurement was first undertaken
over a small sample set of some 3 million user queries in 2014, some 1
in 3 users were behind validating resolvers that recognized the RSA
algorithm but did not recognize the ECDSA algorithm. This experiment was
repeated in early 2016, this time with a sample size of some 765 million
sample queries. The results show that some 83% of all the queries that
clearly showed the use of RSA-validating resolvers also clearly showed
the recognition of the ECDSA algorithm, while the other 17% showed that
the resolver was not recognizing ECDSA as an algorithm it could cope
with. This is a failure rate of 1 in 6, which represents a significant
improvement over the 2014 figures. It was possible to identify common
resolvers that fell into this failing category, and it was noted that
there was a large representation of mobile phone data service providers
in this set of failing resolvers. The other rather attractive aspect of
this protocol change is that validating resolvers that do not recognize
the signing protocol fail by regarding the data as unsigned. This means
that resolvers that fail to recognize the algorithm at least still
function as resolvers!
There was a very interesting panel discussion about DNSSEC algorithm
agility. It’s probably a discussion that used to be held within circles
of cryptographers, but these days the DNS is a large scale user of
crypto technology, and the algorithms that are being used are important.
The widespread reliance on prime number cryptography with the RSA
algorithms is not a good enough long term strategy these days, if it
ever was. RSA has a relatively poor cryptographic efficiency, requiring
relatively large key sizes to provide adequate protection against
cracking efforts using current technologies.
Alternative approaches exist, including the currently promising areas of
elliptic curve cryptography. But how can these more recent
cryptographic algorithms be introduced into mass use? We already see the
slow and drawn out process to introduce support for ECC-P256. What
chance do we have of being able to completely withdraw from an
algorithm, or do we have to carry support for all these algorithms
indefinitely in all validating resolvers? At the same time the barriers
to adoption of new algorithms are certainly quite substantial.
Certification of a crypto software library can be a painfully slow
process, and the process of upgrading deployed software is incredibly
painful. There is a long tail of old versions of software in active use.
Managed devices such as personal computers fare much better in this
respect than completely unattended or embedded devices, where if there
is an in situ update process it’s likely to be exercised highly
erratically, if at all. We are leaving in our wake a trail of what we
would like to think are other people’s legacy compatibility problems.
The DNS, as it stands today, is caught up in this. The protocol expects
signed material to head out into caches and other forms of
intermediaries, so the concept of a capability negotiation between the
zone publisher and a resolver is somewhat of a far-fetched notion in
today’s DNSSEC environment. So it seems that we are caught up in a
process that exhibits a high level of inertia. There are many barriers
to both adopting new crypto approaches and similarly many barriers that
prevent us walking away from old and potentially outdated approaches.
Duane Wessels of Verisign reported on current plans to change the size
of the Zone Signing Key (ZSK) of the Root Zone. The plan is to increase
the size of the ZSK to 2048 bits. Their plan is to pre-publish the
longer key in the DNS Root Zone commencing 21 September 2016. On 1
October the zone will be signed by this longer ZSK, while the old 1024
bit ZSK will be still published in the root zone for a further 30 days.
This is 20 days longer than the usual 10 day post-publish period. The
Key Signing Key (KSK) of the root zone is already a 2048 bit RSA key, so
there are no substantive concerns over resolvers’ acceptance of a
2048-bit key. However, the larger key will increase the size of DNS
responses for the root zone’s DNSKEY record by 128 octets. The standard
size response will increase from 736 octets to 864 octets. During the
regular ZSK key roll there will be 2 of these 2048-bit ZSKs, which will
entail a 256 octet increase from today’s 883 octet response to 1,139
octets for the ZSK roll. This may cause some small level of trouble in
cases where validating resolvers indicate that they are unable to handle
a 1,139 octet UDP response (by offering a smaller EDNS0 buffer size in
their queries) and are at the same time unable to establish a TCP
connection.
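As a quick back-of-the-envelope check of these numbers: an RSA public
key in a DNSKEY record grows by 128 octets when the key length goes from
1,024 to 2,048 bits (the modulus goes from 128 to 256 octets), so one
replaced ZSK adds 128 octets to the response and two add 256. The few
lines of Python below simply replay that arithmetic against the response
sizes quoted above.

    # Rough arithmetic behind the quoted root DNSKEY response sizes.
    STEP = 128   # extra octets per 1024->2048-bit RSA key in the DNSKEY RRset

    steady_state_today = 736   # today's standard DNSKEY response, as quoted above
    print(steady_state_today + STEP)      # 864 octets with a single 2048-bit ZSK

    zsk_roll_today = 883       # today's response during a ZSK roll (two ZSKs published)
    print(zsk_roll_today + 2 * STEP)      # 1,139 octets with two 2048-bit ZSKs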
The meeting also heard a report on the plans to roll the Root Zone’s Key
Signing Key (KSK). A report from an ICANN-commissioned Design Team was
released at the start of March. This report calls for a 3-month process
of introducing the new KSK into the Root Zone, followed by a cutover to
the new KSK at the start of the second 3-month period, and the
revocation of the old KSK in the last 3-month period. The report called
for the preparation of the new key material in 2016 and the roll of the
KSK in the first 9 months of 2017. It seems likely that the roll of the
ZSK, currently planned for 2016, may push back the KSK roll activities,
and the timetable proposed in the Design Team’s report may be pushed
back by 2 or 3 quarters, so that the KSK roll may well occur at the
end of 2017, moving into early 2018. There are two major
considerations with respect to this change in the key. The first is that
there are transitory periods when the response to a DNSKEY query to the
root zone will be larger. The final phase of pre-announcement will see a
response size that is 1,414 octets, as it will contain two of the
2048-bit ZSKs and two of the 2048-bit KSKs, signed by a single KSK. The
second critical phase is the period of revocation of the old KSK. This
will contain one ZSK, two KSKs and two KSK signatures. The response size
of this packet will be 1,425 octets. It is anticipated that in both
cases a certain number of DNSSEC-validating resolvers will have problems
in receiving this response due to the response size. The second problem
is the issue of manual configuration of the trust key set. We cannot
tell in advance of the key roll how many resolvers have set themselves
with a manually configured trust anchor key. Or, perhaps a little more
concerning, we don’t know how many secure DNS validation packages
shipped with manually set trust anchor key values have been
distributed. In this case it’s not clear that the system administrator
is even aware of the package setting of manual keys. These resolvers will
be caught up by a key roll of the KSK, and at the point of the roll will
no longer be able to resolve any name, whether signed or unsigned. While
it is possible to estimate in advance the number of validating resolvers
that will encounter problems with DNS responses of 1,425 octets, and even
estimate the number of users behind these resolvers, it is simply
impossible to perform any advance measurement of resolvers with manually
set trust anchors. Unfortunately, we will only know the extent of the
issue when the key rolls, and not before.
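While the manually configured trust anchor problem cannot be measured in
advance, the response size issue can at least be probed. For example, a
resolver operator can check whether their path towards a root server can
carry a large DNSKEY response intact over UDP. The sketch below uses the
dnspython library; the choice of a.root-servers.net’s address and the
4,096 octet EDNS0 buffer size are simply illustrative.

    # Probe the current root DNSKEY response: how large is it, and does it
    # arrive un-truncated over UDP? Requires the dnspython package.
    import dns.flags
    import dns.message
    import dns.query

    query = dns.message.make_query(".", "DNSKEY", want_dnssec=True, payload=4096)
    response = dns.query.udp(query, "198.41.0.4", timeout=3)   # a.root-servers.net

    print("response size:", len(response.to_wire()), "octets")
    print("truncated over UDP:", bool(response.flags & dns.flags.TC))

A resolver that cannot receive the 1,414 or 1,425 octet responses over
UDP during the roll will fall back to TCP, which is precisely where
resolvers behind broken middleware will run into trouble.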
There are many other aspects of evolution of the DNS, and some of these
were evident at this workshop.
The world of resolvers has expanded from the BIND software to include a
number of products, including Unbound, PowerDNS and others. The most
recent entrant in this space is the Knot DNS Resolver, from the CZ.NIC
folk. Ondřej Surý of CZ.NIC talked of the developments with their Knot
DNS resolver project (https://www.knot-resolver.cz). It is an open
source platform for a recursive DNS resolver, written in C and LuaJIT.
The code is constructed on top of the Knot DNS libraries. Knot DNS
allows flexible programming constructs in the configuration files. It
implements IPv6 “happy eyeballs” with a 20ms head start bias towards
making DNS queries over IPv6 transport. It is a general purpose resolver
with intended application at both large and small scale. It has flexible
shared cache back ends that allow multiple daemons to run from a common
cache. It supports RFC7646 negative trust anchors, and can perform
prefetching. It performs QNAME minimization. It supports persistent
caching and a Tinyweb module. It’s really good to see the ecosystem of
the DNS branch out and have some real diversity and choice these days in
the resolvers we use.
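Of the features in that list, QNAME minimization (RFC 7816) is perhaps
the easiest to illustrate: rather than sending the full query name to
the root servers and to every other ancestor zone’s servers, the
resolver exposes only one additional label at each delegation point. The
small Python sketch below shows just the sequence of query names
involved; the actual iterative resolution logic is omitted.

    # The query names a QNAME-minimizing resolver exposes at each delegation
    # level, instead of sending the full name to every ancestor zone's servers.
    def minimized_qnames(qname):
        labels = qname.rstrip(".").split(".")
        return [".".join(labels[i:]) + "." for i in range(len(labels) - 1, -1, -1)]

    print(minimized_qnames("www.example.com"))
    # ['com.', 'example.com.', 'www.example.com.']
    # The root servers see only 'com.', the com servers only 'example.com.',
    # and the full query name is revealed only to the example.com servers.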
The conventional model of DNS resolution is that this is a task that is
outsourced to dedicated servers. But what if an application wants to use
DNSSEC directly, and is not willing to trust intermediaries, even the
platform on which the application is running? Sara Dickinson and Willem
Toorop reported on the getdns api work (https://getdnsapi.net/). This is
an api designed by application developers as a natural evolution from
the getaddrinfo() api. getdns is an asynchronous interface into the DNS
so that the local library can operate either as a stub resolver or a
full recursive resolver (using the unbound resolver code). The intention
is to allow applications to have direct visibility into DNS behaviour
and DNSSEC validation, so that there is a better potential to integrate
services such as DANE into application behaviours. It supports DNSSEC
validation in both stub and resolver modes, and the recent work on DNS
over TLS as part of the DNS Privacy efforts. The api is asynchronous by
default, organised around events (libevent, libev and libuv). It uses a
JSON dictionary model of responses and response types. They are seeking
tighter integration into OS distributions and applications.
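To give a flavour of the programming model, the sketch below uses the
Python bindings to getdns to perform a stub-mode address lookup that
also asks for the DNSSEC validation status. The extension and attribute
names follow my reading of the getdns documentation, and should be
treated as illustrative rather than definitive.

    # A getdns address lookup that requests DNSSEC validation status.
    # Requires the getdns Python bindings; names here are illustrative.
    import getdns

    ctx = getdns.Context()
    extensions = {"dnssec_return_status": getdns.EXTENSION_TRUE}
    results = ctx.address(name="www.example.com", extensions=extensions)

    if results.status == getdns.RESPSTATUS_GOOD:
        for answer in results.just_address_answers:
            print(answer["address_type"], answer["address_data"])
        for reply in results.replies_tree:
            # dnssec_status appears in each reply when the extension is set
            print("dnssec_status:", reply.get("dnssec_status"))

The dictionary-style results here are the JSON data model of responses
mentioned above.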
There were many other interesting presentations in a very packed two-day
workshop. The best I can do is to point the reader to the workshop web
site: https://indico.dns-oarc.net/event/22/, where all of the workshop
presentations can be retrieved.