The ISP Column
On Things Internet
June 2020
Geoff Huston

DNS OARC 32a Meeting Report

Once the realisation sank in that the lockdown response to the COVID-19 pandemic was not a short-term hiatus in our lives but a new normal, at least for a while, meetings and workshops headed into the online space. For many years I have been a keenly interested participant in the meetings organised by the DNS Operations and Research Community, or DNS OARC, and this time around its most recent meeting was held online. Here are my impressions of the material presented at the online DNS OARC 32a meeting.

Chrome's Impact on Root Traffic

Verisign's Duane Wessels reported on studies of root server query traffic that can be attributed to Chrome behaviours. Chrome is not only a popular browser, but its underlying engine is used by many other browsers, including Microsoft Edge, Opera, Amazon Silk and Brave, as well as a set of mobile platform browsers including Kiwi, Samsung Internet and Ecosia. It is very much the dominant browser in the browser ecosystem, with some 70% share of usage (https://www.w3counter.com/trends). The result of this position of dominance is that if Chrome performs some behaviour, no matter how innocuous, the multiplying factor of the billions of users of Chrome and Chrome-based browsers means that such behaviours get amplified billions of times over, and the result is significant.

This story starts with the effort to simplify the user interface in the browser. Early browsers had two distinct input fields. One was the Go field, where you entered a URL and the browser attempted to retrieve it. The other was the Search field, where you entered a search term and the browser fed the term to a search engine. Browsers simplified this to a single omnibox input field that fed the input text to a search engine or to the browser's URL fetch logic, depending on whether the browser thought that you had entered a search term or a URL.

"No problem!" you might think. If it starts with http:// or https:// then it's clearly a URL, otherwise it's a search term. Well, no, as browsers allow the http:// prefix to be dispensed with. It is, after all, URL verbiage and a waste of time to tap out on the tiny keyboard of a smartphone. OK, let's try the rule that a string of labels separated by periods is a URL, otherwise it's a search term. Well, not quite. URLs using a single label are valid, as single-label DNS names are valid. So maybe the browser should direct all single-label URLs to the DNS?

That opens up a new issue. If all single labels are sent to the DNS to resolve, then normally most of these labels would produce a "no such domain" (or NXDOMAIN) response, and the browser could then select its search engine and feed the label into that. But the ISP may have chosen to alter its DNS behaviour, and instead of returning NXDOMAIN it could redirect the user to its own search page by returning the IP address of this search page rather than NXDOMAIN. Potentially it could monetise this redirection. (See https://en.wikipedia.org/wiki/DNS_hijacking for a description of this practice and a discussion of some of its implications.) From Google's perspective this NXDOMAIN substitution is not well regarded. Search is a major path to ad placement, and ad placement is Google's core revenue. So how can Google's Chrome browser platform detect environments where NXDOMAIN substitution is happening?
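In outline, the detection logic is simple: resolve a few names that should not exist, and see whether they nonetheless resolve, and resolve to the same place. Here is a minimal sketch of that logic in Python, purely for illustration. It is not Chrome's implementation (Chrome is written in C++, and compares where the HTTP fetches of these names redirect to rather than the raw DNS answers, as the source annotation quoted below describes); a standard library resolver call stands in for the browser's internal resolution path, but the DNS-visible effect, three queries for random single-label names, is the same.

    # Hypothetical sketch of NXDOMAIN-substitution detection, loosely
    # following the behaviour described in Chrome's source annotation.
    import random
    import socket
    import string

    def random_label():
        # Chrome uses single labels of 7 to 15 alphabetic characters
        return ''.join(random.choices(string.ascii_lowercase,
                                      k=random.randint(7, 15)))

    def nxdomain_substitution_suspected():
        # Resolve three random single-label names. An honest resolver
        # should fail to resolve all of them. If at least two of them
        # resolve to the same address, the resolver is probably
        # rewriting NXDOMAIN responses.
        answers = []
        for _ in range(3):
            try:
                answers.append(socket.gethostbyname(random_label()))
            except socket.gaierror:
                pass  # resolution failed: the expected outcome
        return len(answers) >= 2 and len(set(answers)) < len(answers)

    print(nxdomain_substitution_suspected())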
The source code of Chrome contains the following:

    net::NetworkTrafficAnnotationTag traffic_annotation =
        net::DefineNetworkTrafficAnnotation("intranet_redirect_detector", R"(
          semantics {
            sender: "Intranet Redirect Detector"
            description:
              "This component sends requests to three randomly generated, and "
              "thus likely nonexistent, hostnames. If at least two redirect to "
              "the same hostname, this suggests the ISP is hijacking NXDOMAIN, "
              "and the omnibox should treat similar redirected navigations as "
              "'failed' when deciding whether to prompt the user with a 'did you "
              "mean to navigate' infobar for certain search inputs."
            trigger: "On startup and when IP address of the computer changes."
            data: "None, this is just an empty request."
            destination: OTHER
          }

Chrome generates three single-label DNS queries, where the label is between 7 and 15 characters in length and composed of alphabetic characters. (As Duane Wessels points out, prior to February 2015 the code used only 10-character labels.) It does so at startup, when the local IP address changes, and when the local DNS server changes.

There is a huge number of DNS queries out there, and you'd think that adding three further queries at sporadic intervals would be an undetectable increment to the DNS load. But that's not the case. The first factor is that the Chrome platform is not just any browser in a crowded field of platforms. It is the dominant platform, with a 70% market share. That would imply that the Chrome platform is used by some three billion Internet users, and any behaviour in Chrome will be significant. The second factor is that the DNS resolution function is optimised for names that exist, and fares badly with non-existent unique names. The problem is that random one-off names will not generate cache hits. The recursive resolver will need to pass these queries to the authoritative server, and in the case of a single-label DNS name the authoritative server is a root server.

Duane reports that some 45% of all DNS queries seen at the A and J root servers are likely to be Chrome queries. There are some additional factors here that add some "fuzz" to this figure. Some recursive resolvers perform aggressive NSEC caching (RFC 8198), and because the root zone is signed, the recursive resolver can answer with an authoritative signed NXDOMAIN response without passing the query to a root server. That means that the original volume of these Chrome queries may be higher than what is seen at the root servers. On the other hand, the DNS is quite notorious for replaying queries, and we have observed NXDOMAIN query amplification in the DNS in a number of studies (https://www.potaroo.net/ispcol/2019-02/nxd.html). That would imply that the original query volume is lower than the volume seen at the root servers.

Chrome introduced this NXDOMAIN probing in 2010. Over the past decade we've seen the continued growth of the Internet's user base and the increasing market share of the Chrome browser platform, and the proportion of query traffic seen at the root servers that matches these Chrome probes has increased in step. It is now some 50% of the total query volume (Figure 1).

Figure 1 - Chrome queries seen at the Root. From "Intranet Redirect Detector or Pseudo Random Subdomain Attack?", Duane Wessels, Verisign, June 2020

There are some questions about this probe behaviour that do not have clear answers so far.

- Why three queries? Why not two, or even one?

- Why use a single label query?
  If the point of the probe queries is to detect NXDOMAIN substitution, then why wouldn't a scheme that uses a top-level domain name, like .chrome, work just as well? The advantage of such an approach is that it would deflect the probe queries away from the root servers and push them onto servers operated by the same folk who are responsible for the probe behaviour in the first place. The single label also triggers local suffix search-list processing, which can add to the DNS query load (https://www.potaroo.net/ispcol/2013-10/dotless.pdf).

- Why isn't aggressive NSEC caching (RFC 8198) more effective here?

  If all recursive resolvers performed aggressive NSEC caching, this query load would doubtless drop dramatically. Instead of passing all such queries to a root server, the recursive resolver could use a locally cached NSEC response and provide that without further reference to any root server. It appears that NSEC caching is not very effective at present, removing at best the queries from some 7% of users that could be captured and answered by NSEC caching (https://www.potaroo.net/presentations/2019-10-31-oarc-nsec-caching.pdf).

There is one very curious aspect of this behaviour, in that the percentage of query traffic seen by each of the root service letters appears to vary significantly. In many respects the root servers are intentionally identical, and it is unusual to see resolvers prefer certain root services for these Chrome queries.

Part of the strength of the Internet lies in the decoupled nature of the network's infrastructure, where many component service providers operate within their chosen niche of activity, and the overall orchestration of the collective efforts is left to market forces. No one is in charge. But while this is a strength it can also be a weakness, particularly in cases of cost displacement. The design decision by Chrome to probe for NXDOMAIN substitution through one-off label queries is a decision that imposes a negligible marginal cost on Chrome and Chrome users. However, it imposes significant costs on root service operators, given that one half of their overall query load comes from this behaviour. But in the same way that cost and benefit are displaced, the tools to remedy this situation lie in the hands of a third class of actors. If all recursive resolvers, and their front-end load balancers, performed effective NSEC caching (and presumably DNSSEC validation as well), then these Chrome queries would be absorbed by the recursive resolvers. In a centrally orchestrated environment the costs and benefits could be directly compared, and such solutions could be deployed where it was cost-beneficial to do so. However, without such orchestration there is little in the way of incentive for recursive resolver operators to spend their time and effort to mitigate this class of queries, so the root servers are left with the problem without the means of providing incentives for any other party to provide a remedy.

DNSSEC signing of .org

The .org top-level domain was one of the early adopters of DNSSEC, signing the zone in June 2009. There have been a number of issues with this setup. The staging of DNSKEY RRs in the zone meant that for many years the signed response to a DNSKEY query for .org was 1,625 octets in size.
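Response sizes like this are easy to measure for yourself. The following sketch uses the third-party dnspython library and a public resolver (8.8.8.8, an arbitrary choice for illustration) to fetch the signed DNSKEY RRset for .org and report the size of the response in wire format.

    # Sketch: measure the size of the signed DNSKEY response for .org.
    # Requires the third-party dnspython package (pip install dnspython).
    import dns.message
    import dns.query

    # Ask for DNSKEY with the DNSSEC OK bit set, and advertise a large
    # EDNS UDP buffer so the response is less likely to be truncated.
    query = dns.message.make_query('org.', 'DNSKEY',
                                   want_dnssec=True, payload=4096)
    response = dns.query.udp(query, '8.8.8.8', timeout=5.0)
    print(len(response.to_wire()), 'octets')

(If the response comes back truncated, a resolver would retry over TCP; dns.query.tcp() does the same job here.)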
Anything over 1,440 octets gives some resolvers (and some users behind these resolvers) problems: an inability to receive fragmented responses, combined with an inability to perform DNS over TCP, means that some users cannot receive large DNS responses at all. A surprisingly large 4% of users have problems with large DNS responses (https://www.potaroo.net/presentations/2018-05-16-atr.pdf). These days .org has altered the key staging process, and the .org response is now 1,058 octets in size, which is a significant improvement. (There are some 500 other TLDs in the root zone that still have DNSKEY responses greater than 1,460 octets, but that is part of a different story: https://www.potaroo.net/presentations/2020-05-27-ripe-dnswg-v6-dns.pdf.)

The zone is signed with RSA/SHA-1. Since 2005 SHA-1 has been considered inadequate as a defence against well-resourced attackers (see https://en.wikipedia.org/wiki/SHA-1). Use of RSA/SHA-256 would improve the cryptographic security of the signed entries, but at the expense of increased response sizes. ECDSA P-256 looks like a more attractive option, providing stronger crypto and smaller DNS responses, but .org uses a hardware signer, and a change of algorithm requires changes to the signing hardware which may not be available for some time.

The zone is signed using NSEC3 with opt-out. NSEC3 provides negligible protection against zone walking these days (the sketch at the end of this section illustrates why), and the opt-out provisions make aggressive negative caching (RFC 8198) ineffectual. There is also the issue that the overall zone size is unpredictable, as the uptake of zone signing in .org subdomains will affect the size of the .org zone. Shifting to NSEC would simplify zone signing, but would add some 20M resource records to the zone file and require 10M generated signatures, which has implications both for the process of zone signing and for the requirements of the platforms that are the authoritative servers for the zone. One option is to use synthetic NSEC records that each span just a single name in the namespace (https://blog.cloudflare.com/black-lies/), but perhaps such subtle manipulation of DNSSEC responses runs counter to the operator's preferred ethical position of clear conformance to technical standards. (The Public Interest Registry, which administers .org, was established by ISOC as a registry that espouses and practices "best practices" for a registry operator.) It's also likely that no large zone has rolled from NSEC3 to NSEC, given that there is no documented set of operational experiences in performing such a roll for very large domains.

And in all this there is the issue of pandemic responses and the restrictions and limitations on the movement of people and goods. Many network service operators appear to have placed a pause on engineering changes to their services during this period, and the program for upgrading DNSSEC in .org is in a similar position.
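On the zone-walking point above: NSEC3 conceals names behind SHA-1 hashes, but the hash parameters are published in the zone's NSEC3PARAM record, so anyone who harvests the NSEC3 chain can run a dictionary of candidate names through the same hash offline and match them at negligible cost. The following sketch computes NSEC3 hashes with dnspython's nsec3_hash() helper; the salt and iteration count here are made-up illustrative values, not .org's actual NSEC3 parameters.

    # Sketch: an offline dictionary attack against NSEC3 hashes.
    # Requires the third-party dnspython package (pip install dnspython).
    import dns.dnssec

    # Hypothetical NSEC3 parameters; a real attacker would read them
    # from the target zone's NSEC3PARAM record.
    salt = 'ABCD'     # hex-encoded salt (illustrative value)
    iterations = 1

    for candidate in ['example.org.', 'mail.org.', 'test.org.']:
        digest = dns.dnssec.nsec3_hash(candidate, salt, iterations, 'SHA1')
        # Compare the digest against the hashes harvested from the
        # zone's NSEC3 chain; a match reveals that the candidate exists.
        print(candidate, '->', digest)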
The DIINER program

The OARC workshops are intended to provide fodder for research activities as well as operational experience. ISI's Wes Hardaker presented on DIINER, a research tool that exposes part of the 'B' root service query profile for research use. The root servers have long been a fruitful area of DNS research, and the query traffic presented to root servers has been used in many research studies. However, root query traffic is not public, so many research efforts rely on a once-per-year data collection effort operated under the auspices of DNS OARC. This DITL ("Day in the Life") data collection is an annual one-day snapshot of the root query traffic presented to all root servers. These DITL arrangements are better than nothing, but they're still not all that good. DIINER proposes a different approach that steers a subset of the incoming queries to the B root service into an experimental rig that can process the queries in a different manner (Figure 2).

Figure 2 – DIINER architecture.

This is best illustrated by example: Wes presented the output of processing the same query set in parallel on the production system, running the Bind DNS server software, and on the experimental rig, running the Knot DNS server software (Figure 3). That is just a single example in a rich potential area of study. The kinds of questions that could be studied in this setup include: What is the difference between dual-stack DNS and running all DNS queries over IPv6? What if all the queries used TCP? Or DoT? Or DoH? How time-sensitive are responses?

Figure 3 – Comparison of server platforms. From "USC/ISI's DIINER DNS and naming testbed", Wes Hardaker, ISI, June 2020

This looks to be a useful leap forward. Rather than looking at a grainy, imperfect and slightly out-of-focus picture of 24 hours of query traffic presented to the root servers, this platform allows for a continuous real-time view of query traffic, with the opportunity to alter the ways that a server generates responses and compare the output with the current production service.

Lockdown in NZ

I'm sure we will see a number of these kinds of presentations in the coming months. InternetNZ's Sebastian Castro looked at DNS query traffic in New Zealand, examining queries to the .nz servers. The lockdown in New Zealand saw increased DNS activity, and various events, such as the daily broadcasts on the state of the pandemic, could be mapped directly to altered patterns of DNS traffic activity.

DNS Zombies

I presented work on repeated DNS queries. Aside from saying that there are a lot of DNS zombie queries out there, I'll leave the outcomes of this study to a separate report.

Meeting Material

DNS OARC 32a material can be found at: https://indico.dns-oarc.net/event/35/timetable/#20200609.detailed

Disclaimer

The above views do not necessarily represent the views of the Asia Pacific Network Information Centre.

About the Author

GEOFF HUSTON AM, B.Sc., M.Sc., is the Chief Scientist at APNIC, the Regional Internet Registry serving the Asia Pacific region.

www.potaroo.net