Network Working Group                                         R. Whittle
Internet-Draft                                          First Principles
Intended status: Experimental                           January 19, 2010
Expires: July 23, 2010


                    Ivip Mapping Database Fast Push
                 draft-whittle-ivip-db-fast-push-03.txt

Abstract

   From the base of draft-whittle-ivip-arch-03 and later, this ID
   describes Ivip's fast-push mapping distribution system.  This accepts
   mapping changes from end-user networks or organizations they
   authorise to make these changes.  The mapping changes are handled by
   RUAS (Root Update Authorization Server) companies who collectively
   run the initial levels of a global network of Replicator servers.
   This is a secure, packet-based flooding system which will propagate
   the mapping changes to potentially hundreds of thousands of full
   database query servers (QSDs) in ISPs and larger end-user networks
   all over the world.  This ID describes the overall system.  The
   distributed Fast Payload Replication system is described in detail in
   draft-whittle-ivip-fpr.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on July 23, 2010.

Copyright Notice


Whittle                   Expires July 23, 2010                 [Page 1]

Internet-Draft              Ivip DB Fast Push               January 2010


   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the BSD License.

Table of Contents

   1.  Introduction
     1.1.  Outline of the RUAS and Replicator systems
     1.2.  Assumptions
     1.3.  It may not be so daunting...
   2.  Goals, Non-Goals and Challenges
     2.1.  Goals
     2.2.  Non-goals
     2.3.  Challenges
   3.  Definition of Terms
     3.1.  SPI - Scalable PI space
       3.1.1.  Conventional global unicast address space
     3.2.  MAB - Mapped Address Block
     3.3.  UAB - User Address Block
     3.4.  Micronet
     3.5.  RUAS - Root Update Authorisation System
     3.6.  UAS - Update Authorisation System
     3.7.  UMUC - User Mapping Update Command
     3.8.  SUMUC - Signed User Mapping Update Command
     3.9.  MABUS - Update Stream specific to one MAB
     3.10. Level 0 Replicators
     3.11. Level 1 and greater Replicators
     3.12. QSD - Query Server with full Database
     3.13. QSC - Query Server with Cache
   4.  Update Authorities and User Interfaces
     4.1.  RUAS Outputs
       4.1.1.  Update packets to level 0 Replicators
       4.1.2.  MAB snapshots
       4.1.3.  Missing Payload Servers (MPSes)
     4.2.  Authentication of RUAS-generated data
       4.2.1.  Snapshot and missing packet files
       4.2.2.  Mapping updates
     4.3.  RUAS - UAS interconnection
   5.  Common information to be sent by the FMS
   6.  The Fast Payload Replication system
   7.  Scaling limits
   8.  Managing Replicators
   9.  Security Considerations
   10. IANA Considerations
   11. Informative References
   Author's Address


1.  Introduction

   The aim of this I-D is to establish that Ivip's fast-push mapping
   distribution system (FMS) is practical and desirable for very large
   numbers of micronets (EIDs in LISP terminology) and rates of change
   of the mapping database.  All parts of Ivip are intended to be
   operated by a variety of organisations, with appropriate cooperation
   - including between companies which are competing with each other.

   Please refer to [I-D.whittle-ivip-arch] for an explanation of Ivip in
   general.  A glossary of Ivip and some general scalable routing terms
   and acronyms is: [I-D.whittle-ivip-glossary].

   This is a revision of the 02 version, with a substantial
   simplification of what was previously the "Launch server" system
   which drove the level 1 Replicators.  These are replaced by level 0
   Replicators, which are functionally identical to other relatively
   simple Replicators, but use more input streams.  The level 0
   Replicators are fully meshed with each other and this mesh is driven
   by packets from the multiple RUASes (Root Update Authorization
   Servers).  A new network element - a Missing Payload Server - is also
   introduced.  Please see [I-D.whittle-ivip-fpr] for a detailed
   explanation.

1.1.  Outline of the RUAS and Replicator systems

   The most important part of the FMS consists of thousands (perhaps
   tens of thousands in the long term future) of essentially identical
   "Replicator" servers.  This can be viewed as taking a tree-like
   structure somewhat similar to a multicast set of routers, driving
   multiple such trees with the same data, and then cross-linking the
   branches at multiple levels so that the payload of a packet lost at
   one replication point will be replaced by an identical payload in
   another packet from another branch.

   Each Replicator receives at least two streams of identical mapping
   data, so it is much less likely to miss a payload than if it only
   received the payloads in a single stream of packets from a single
   source.

   A better way to view the system is that it floods each replication
   point from at least two directions from the previous level.  The
   items being flooded are the payloads of DTLS packets - UDP packets
   whose contents are encrypted to prevent attacks involving spoofing
   such packets in order to propagate them in the Replicator system.

   At level 0, the Replicators flood each other and the larger number
   of level 1 Replicators.  The level 1 Replicators flood the larger
   number of level 2 Replicators.  Since the flooding unit is a packet
   payload, as soon as a particular payload is received, it is
   replicated, so the delay time in each point of replication will be
   very short indeed - probably only milliseconds.
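
   As an illustration only, the flooding behaviour described above can
   be sketched in Python.  The duplicate test by SHA-256 digest, the
   class name and the callback interface are assumptions made for this
   sketch; the actual mechanism is the subject of
   [I-D.whittle-ivip-fpr].

```python
import hashlib
from typing import Callable, List, Set


class Replicator:
    """Toy model: forward each distinct payload once, to every downstream."""

    def __init__(self, downstreams: List[Callable[[bytes], None]]) -> None:
        self.downstreams = downstreams
        self.seen: Set[bytes] = set()  # digests of payloads already forwarded

    def receive(self, payload: bytes) -> bool:
        """Handle a payload from any upstream; return True if it was new."""
        digest = hashlib.sha256(payload).digest()  # assumed duplicate test
        if digest in self.seen:
            return False  # copy from another stream: drop it
        self.seen.add(digest)
        for send in self.downstreams:
            send(payload)  # replicate immediately, with no batching
        return True


# Two upstream streams carry the same payloads.  Stream A lost "u3" and
# stream B lost "u2", yet the downstream receives all three exactly once.
delivered: List[bytes] = []
rep = Replicator([delivered.append])
for p in (b"u1", b"u2"):
    rep.receive(p)  # stream A
for p in (b"u1", b"u3"):
    rep.receive(p)  # stream B
```

   The "seen" set in this sketch grows without bound; a real Replicator
   would need to forget old payloads after a short time.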

   In this way, it is reasonable to expect a single payload, injected
   securely to the level 0 Replicators, to be fanned out to hundreds of
   thousands of end-points all over the Net, in a time not much longer
   than is imposed by the intervening routers and data links.  If there
   are 5 levels of x10 "amplification" and each involves a 20msec delay
   (hopefully it would be less), then we can expect global delivery of
   the payloads to most fibre-linked locations within 300 to 350ms.
   (From Melbourne Australia, most of the distant sites in Russia or
   Africa have RTTs in the 300 to 450msec range, but I found a server
   apparently in Swaziland (cr1mba.swazi.net) with RTTs of 710 to
   900ms.)
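
   The arithmetic behind this estimate can be checked with a few lines
   of Python.  The sketch assumes that the 20ms figure is the delay
   added at each replication point, so that the remainder of the 300 to
   350ms total is the time spent in routers and data links.

```python
LEVELS = 5               # levels of replication, from the text
FAN_OUT = 10             # "x10 amplification" at each level
PER_LEVEL_DELAY_MS = 20  # delay at each replication point

# Copies delivered for each payload entering the first of the five levels.
end_points = FAN_OUT ** LEVELS
replication_delay_ms = LEVELS * PER_LEVEL_DELAY_MS

# What the quoted 300 to 350ms total leaves for routers and data links.
path_budget_ms = (300 - replication_delay_ms, 350 - replication_delay_ms)

print(end_points, replication_delay_ms, path_budget_ms)
```

   Under these assumptions, replication itself accounts for 100ms and
   one payload reaches 100,000 end-points.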

   In this way, each Replicator consumes two identical streams from
   geographically and topologically different sources, and fans the
   content of the streams out to some larger number of Replicators or
   QSDs at the next level.  This number of output streams per Replicator
   may be in the tens to one hundred range, depending on the volume of
   updates.  Initially, it would be quite high, when update rates are
   low - meaning that the initial global Replicator network could serve
   the growing number of QSDs with just three or so levels of
   Replicators, and with each one fanning out updates to a large number
   of Replicators at the next level.

   After some number of levels of replication, determined by local
   conditions, the streams deliver the update information at a QSD.
   Ideally, each QSD will receive two streams from two geographically
   dispersed Replicators.  These need not be at the same level, so the
   system is relatively flexible, and each Replicator will generally be
   sending a complete stream of packets.

   There will also be a distributed system of Missing Payload Servers
   (MPSes) which receive streams from Replicators and store the payloads
   for ten or so minutes.  MPSes will compare notes with each other via
   TCP (probably HTTP or HTTPS) and so form one or more worldwide
   distributed groups which will quickly replace any payload they
   missed.  QSDs will query one or more MPSes - probably a close one and
   a distant one - and so be able to receive missing payloads on the
   hopefully rare occasions when one or more payloads are missing from
   the two or more streams each QSD normally receives from Replicators.

   RUASes asynchronously feed packets into the fully meshed ring of
   level 0 Replicators.

   Snapshots of segments of the mapping database are taken regularly by
   each RUAS.  Each snapshot contains a complete copy of the mapping of
   one MAB (Mapped Address Block) at a particular instant.  At that
   point in time, a hash function of the mapping data for this MAB is
   generated and within a few seconds is sent to all QSDs.  This enables
   each QSD to verify that its copy of the mapping for this MAB is fully
   up-to-date.

   During initialisation, and if an error is found in the local copy of
   the mapping for a particular MAB, the QSD downloads snapshots from
   HTTP servers provided by the RUAS companies.  The QSD buffers all
   updates for the MAB which arrive after the snapshot and hash message.
   Once the snapshot is downloaded and unpacked into the QSD's copy of
   the mapping database, the buffered updates are applied and the
   database then contains an up-to-date copy of mapping for this MAB.
   Updates are then applied as they arrive from the two or more upstream
   Replicators.
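
   The verify, buffer and replay sequence described above might look
   like the following Python sketch.  The canonical form used for
   hashing, the choice of SHA-256, the representation of an update as a
   (start address, ETR address) pair and all names are assumptions of
   this sketch; the real data formats are for further work.

```python
import hashlib
from typing import Dict, List, Tuple

Update = Tuple[int, int]  # (micronet start address, ETR address), assumed


def mab_hash(mapping: Dict[int, int]) -> str:
    """Hash of one MAB's mapping; the canonical form is an assumption."""
    canonical = ",".join(f"{k}:{v}" for k, v in sorted(mapping.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()


class MabCopy:
    """A QSD's copy of the mapping for a single MAB."""

    def __init__(self) -> None:
        self.mapping: Dict[int, int] = {}
        self.syncing = False
        self.buffered: List[Update] = []

    def on_update(self, update: Update) -> None:
        if self.syncing:
            self.buffered.append(update)  # hold until the snapshot is loaded
        else:
            self.mapping[update[0]] = update[1]

    def on_hash_message(self, expected: str) -> None:
        """Compare with the hash from the RUAS; resync on a mismatch."""
        if mab_hash(self.mapping) != expected:
            self.syncing = True
            self.buffered = []

    def on_snapshot(self, snapshot: Dict[int, int], expected: str) -> None:
        """Install a downloaded snapshot, then replay buffered updates."""
        if mab_hash(snapshot) != expected:
            raise ValueError("snapshot does not match the announced hash")
        self.mapping = dict(snapshot)
        for start, etr in self.buffered:
            self.mapping[start] = etr
        self.buffered = []
        self.syncing = False


# The QSD holds a stale value for micronet 200, detects this from the
# hash message, buffers one later update and then loads the snapshot.
ruas_state = {100: 1, 200: 2}
announced = mab_hash(ruas_state)
qsd = MabCopy()
qsd.mapping = {100: 1, 200: 99}
qsd.on_hash_message(announced)
qsd.on_update((200, 3))
qsd.on_snapshot(ruas_state, announced)
```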

1.2.  Assumptions

   For the purposes of this discussion, it is assumed there will be a
   single global Ivip system, with multiple organisations being
   responsible for the management of the various blocks of address space
   which are managed with Ivip.  The system itself is intended to be
   decentralised and have no single point of failure.  Furthermore, it
   is intended to be highly suitable for being built, operated and
   expanded upon by a number of separate organisations, who cooperate
   much as do the organisations which run the DNS today.

   It would also be possible for an organisation to establish an Ivip-
   like system, without reference to any IETF RFCs, and to conduct a
   business renting out address space in small, flexible, chunks, with
   portability and multihoming via any ISP who provides the requisite,
   relatively simple, ETRs.  The most likely scenario is that this is
   done by one or more independent Ivip-like systems operated by
   different companies, primarily to support TTR mobility [TTR
   Mobility], but also usable for portability, multihoming and inbound
   Traffic Engineering for non-mobile end-user networks.

   For simplicity, this ID assumes that Ivip development will be
   coordinated into a single global system, as DNS is, following
   appropriate IETF engineering work and administrative decisions in
   RIRs and other relevant organisations.  A development timeframe of
   2010 to ca. 2014 is assumed, with widespread deployment being
   achieved later in the decade, for IPv4 at least.

   The IPv4 FMS is identical in principle to the IPv6 FMS.  The server
   software which implements the Replicators will probably remain as two
   separate items, but a single server could run them both,
   independently, and so be both an IPv4 and IPv6 Replicator.  Each RUAS
   would have both IPv4 and IPv6 sections, with separate outputs of
   mapping data.  The level 0 Replicator servers for IPv4 would be
   physically different and independent of those for IPv6.

   In addition to the global fast push database update distribution
   system discussed in this ID and in [I-D.whittle-ivip-fpr], Ivip also
   involves Query Servers sending "notifications" to ITRs which recently
   requested mapping for a micronet whose mapping has just changed.
   This is a second form of push - on a local scale - and is outlined in
   [I-D.whittle-ivip-arch].

   This ID concentrates on IPv4, since the future core-edge separation
   architecture is more urgently required for IPv4 than for IPv6.  In
   principle, the same arrangements will apply for IPv6, with a
   different and more verbose data format than the 12 or so bytes
   required for each IPv4 mapping update.  It may make sense to defer
   finalisation of any future IPv6 map-encap scheme until substantial
   operational experience has been gained with the IPv4 scheme.

1.3.  It may not be so daunting...

   Ivip documentation is written with a preference for detailed
   discussion over terseness.  So Ivip IDs may appear rather daunting at
   first.  Hopefully these IDs will be clearly understandable, and the
   reader will recognise that this scalable routing solution is a
   momentous development, requiring detailed consideration.  Ivip goes
   beyond the formal RRG requirements of providing portability (the only
   way of allowing free choice of alternative ISPs), multihoming and
   inbound traffic engineering, by also providing, with TTR mobility, a
   global mobility system for both IPv4 and IPv6.  While no mapping
   changes are required unless the Mobile Node moves a large distance,
   such as 1000km or more, it is important that the Ivip FMS be able to
   scale to very large numbers of updates and cope with mapping
   databases for up to 10^10 micronets.

   This ID focuses on handling billions of micronets and potentially
   thousands of updates a second.  These data-rates may sound high
   today, but domestic customers are already downloading full quality
   video in real-time.  By the time such large levels of adoption arise,
   the bandwidth needed for these will not be a significant obstacle.

   However, it is difficult to imagine a situation where more than 10
   billion mapping changes are needed each year, which is an average of
   320 a second.  There would be peaks, but with an IPv6 mapping change
   requiring about 32 bytes this is an average of 100kbps.
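
   These figures can be reproduced as follows.  The calculation counts
   only the mapping payload; packet headers and DTLS overhead are not
   included, which is one reason the rounded figure in the text is
   somewhat higher.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000
CHANGES_PER_YEAR = 10 * 10**9          # 10 billion, from the text
BYTES_PER_IPV6_UPDATE = 32             # "about 32 bytes", from the text

changes_per_second = CHANGES_PER_YEAR / SECONDS_PER_YEAR
payload_kbps = changes_per_second * BYTES_PER_IPV6_UPDATE * 8 / 1000

print(f"{changes_per_second:.0f} changes/s, {payload_kbps:.0f} kbps of payload")
```

   This gives roughly 317 changes a second and about 81 kbps of
   payload, consistent with the rounded 320 a second and 100kbps.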


   During initial deployment, the demands on the fast push system will
   be far lighter than those anticipated below, so the system might
   initially be somewhat simpler.  In the initial stages of
   introduction, there may be little need to deploy dedicated servers
   for the "Replicator" functions, since the volume of updates may be so
   light as to make it practical to run this software on existing
   servers, such as nameservers.

   Furthermore, in the early years of introduction, when there are
   hundreds of thousands or a few million micronets, the low level of
   update packets (compared to the highest imaginable levels
   contemplated below) should enable each Replicator to fan out to many
   more next-level Replicators than would be possible when hundreds of
   millions or billions of micronets are handled by the system.  This
   would mean fewer levels of Replicators and fewer Replicators than
   would be required with current technology if the system were handling
   billions of micronets.
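
   The relationship between fan-out and the number of levels can be
   estimated with a short Python function.  This is a simplification
   which treats the Replicator network as a plain tree; it ignores the
   fact that each Replicator and QSD takes at least two input streams,
   which roughly halves the effective fan-out.

```python
def levels_needed(end_points: int, fan_out: int) -> int:
    """Smallest number of levels so that fan_out ** levels >= end_points."""
    if end_points < 1 or fan_out < 2:
        raise ValueError("need end_points >= 1 and fan_out >= 2")
    levels, reach = 0, 1
    while reach < end_points:
        reach *= fan_out  # each extra level multiplies the reach
        levels += 1
    return levels


for fan_out in (10, 50, 100):
    print(fan_out, levels_needed(100_000, fan_out))
```

   For 100,000 end-points, a fan-out of 10 needs five levels, while a
   fan-out of 50 or 100 needs only three.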

   This ID explores how the FMS would be structured in the most
   demanding future scenarios which can be realistically expected.
   Building the initial FMS for trials and early services won't be as
   daunting as it may look from the diagrams and discussions below.

2.  Goals, Non-Goals and Challenges

2.1.  Goals

   The overall goal of the fast push system is to enable end-users, who
   manage the mapping of their one or more micronets of address space,
   to securely, reliably and easily communicate their mapping change
   command to some organisation with which they have a business
   relationship, so that that change will be propagated to every QSD as
   soon as possible.

   "As soon as possible" means typical delay times of a few seconds,
   ideally zero seconds, but in practice probably two or so seconds.
   Prior to 2010-01-18, the Ivip IDs mentioned longer times than this,
   but this was on the basis of "Launch" servers executing a complex
   pipelined process which would take three or so seconds.  This
   arrangement is now replaced by fully meshed level 0 Replicators,
   which have no complex protocols, pipelining or delays.

   "Reliably" means that in the great majority of cases, the QSDs
   receive every mapping change as expected and that, in the relatively
   rare event of this being impossible due to packet loss, the QSD can
   recover from this situation within one or at the most two seconds
   by requesting a copy of the packet from one or more Missing Payload
   Servers (MPSes - which were also introduced on 2010-01-18).

   Reliability also involves robustness against DoS attacks.  This can
   never be completely protected against for any device on the open
   Internet, since its link(s) can easily be flooded by packets sent
   from botnets etc.  As mentioned in [I-D.whittle-ivip-fpr],
   considerable protection from DoS attacks could be achieved by running
   the level 0 and level 1 Replicators via private network links.  These
   levels would be owned and operated by the RUAS companies working
   together.  This would enable reliable feeds to hundreds or perhaps a
   thousand or so level 2 Replicators all over the Net, which would mean
   that a DoS attack would not be able to cause so much trouble.

   "Securely" means that each QSD which receives the updates will be
   able to instantly verify that the updates are genuine, rather than
   the result of an attacker who might, for instance, send forged
   packets to that device or to some other part of the fast push system.
   The data format for the mapping update packets is for further work.
   There will be end-to-end encryption so that the QSD can authenticate
   the mapping data originated from the RUAS which sent it.  Whether
   this involves authenticating each individual payload, or combining
   typically multiple packet payloads into a single body of data to be
   authenticated, remains to be decided.  Sometimes, probably quite
   frequently, the RUAS will send only a single packet of updates, so
   then the entire payload would be authenticated, since there are no
   other payloads to consider.  The data format needs to provide for
   open-ended extensions in the future and to support authentication.

   In the present design, DTLS (RFC 4347) UDP packets are sent from
   RUASes to Replicators, from Replicators to other Replicators and from
   Replicators to QSDs and MPSes.  This protects against an attacker
   spoofing a packet and having it replicated or accepted by a QSD or
   MPS.  However, it cannot be completely assured that a Replicator is
   not under the control of an attacker - which would enable them to
   send packets which would be replicated and accepted.

   The most common mapping change command, as sent by the end-user, or
   by some other organisation or device which has the end-user's
   credentials, would involve the length of the micronet being checked
   to ensure it is the same as the currently configured length of the
   micronet which starts at that location.  The end-user's command might
   be part of an encrypted exchange involving a challenge-response
   protocol and the end-user's private key.  Alternatively, an encrypted
   link could be used, such as via HTTPS, and a conventional username
   and password given as part of the command.
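
   The length check described above can be sketched in Python as
   follows.  Authentication of the end-user is deliberately left out,
   and the table layout, the names and the use of integers for
   addresses are assumptions of this sketch.

```python
from typing import Dict, NamedTuple


class Micronet(NamedTuple):
    length: int  # in IPv4 addresses
    etr: int     # current ETR address as an integer


def apply_etr_change(table: Dict[int, Micronet],
                     start: int, length: int, new_etr: int) -> bool:
    """Accept the change only if (start, length) names an existing micronet."""
    current = table.get(start)
    if current is None or current.length != length:
        return False  # no micronet of that length starts here: reject
    table[start] = Micronet(length, new_etr)
    return True


table = {1000: Micronet(length=8, etr=5)}
accepted = apply_etr_change(table, 1000, 8, 6)  # lengths agree
rejected = apply_etr_change(table, 1000, 4, 7)  # wrong length
```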

   The end-user would previously have communicated directly or
   indirectly with their RUAS to configure their total assigned address
   space into one or more micronets.  This ID concentrates on the
   changes of ETR address for existing micronets, but the mapping change
   packets will also contain information about how existing micronets
   have been deleted and replaced by other micronets, smaller or larger
   and with different start and end-points.

   RUASes and the level 0 and level 1 Replicators are few in number and
   will be administered carefully, so this ID does not consider
   automated aids to their management and debugging.  However, the rest
   of the Replicators, level 2 and greater, will be numerous and
   operated by a wide range of organisations.  Future work will concern
   maximising the degree to which the Replicator system can be robustly
   and easily managed, rather than requiring a great deal of manual
   configuration etc.

   In order to debug the way the Ivip system is used, such as transient
   erroneous or malicious mapping updates which cause packets to be
   tunnelled to addresses where they are not welcome, there will need to
   be a system which monitors all mapping changes and keeps a lasting
   record of them.  Then, aggrieved parties can search such a system for
   the address on which they received the unwanted packets, and so
   determine the micronet involved.  This will enable the aggrieved
   party to complain to the RUAS which is responsible for that micronet.
   This "mapping history" function could be performed by one or multiple
   separate systems, each simply taking a feed from the Replicator
   system.

2.2.  Non-goals

   Apart from checking the ETR address against any specific exclusion
   lists (such as specific prefixes, private RFC 1918 and multicast
   space) and to ensure it is not part of a Mapped Address Block (MAB -
   a BGP advertised prefix containing SPI space, divided into many
   micronets), the entire Ivip system takes no interest in whether there
   is a device at that address, whether the address is advertised in
   BGP, whether there is or was an ETR at that address, whether the ETR
   is reachable or whether the ETR can deliver packets to the micronet's
   destination device.

   These are all matters which fall under the responsibility of the end-
   user network whose micronet is being mapped to this ETR address.
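
   The limited checks which the system does make on an ETR address
   could be expressed with Python's standard ipaddress module, as in
   the sketch below.  The function name is hypothetical, and the MAB
   used in the example (203.0.113.0/24, a documentation prefix) is
   chosen purely for illustration.

```python
import ipaddress
from typing import Iterable

RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
MULTICAST = ipaddress.ip_network("224.0.0.0/4")


def etr_address_acceptable(etr: str,
                           mabs: Iterable,
                           excluded: Iterable = ()) -> bool:
    """False if the address is private, multicast, excluded or in a MAB."""
    addr = ipaddress.ip_address(etr)
    blocked = [*RFC1918, MULTICAST, *excluded, *mabs]
    return not any(addr in net for net in blocked)


mabs = [ipaddress.ip_network("203.0.113.0/24")]  # hypothetical MAB
ok = etr_address_acceptable("192.0.2.1", mabs)
private = etr_address_acceptable("10.1.2.3", mabs)
inside_mab = etr_address_acceptable("203.0.113.9", mabs)
```

   Nothing here tests whether an ETR exists or is reachable at the
   address, in keeping with the non-goals above.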

   It is not a goal of the system to keep mapping changes secret from
   any party.  This would be impossible.  Therefore, it cannot be a goal
   of this or probably any core-edge separation scheme that in a mobile
   setting, the movement of an individual's device could not be inferred
   by anyone who monitors the mapping updates.  However, the mapping
   only concerns the currently active TTR.  MNs can still use a TTR no
   matter where they are physically connected, and using a TTR hundreds
   or even thousands of km distant will probably present no serious
   difficulties due to path-length or lost packets.  So mapping changes
   need not indicate much, or anything, about the physical location of
   the MN.

   Replicators perform a best-effort copying of mapping update packets.
   They do not store the payloads of these packets for any appreciable
   time or attempt to request a payload which is missing from their two
   or more input streams.

2.3.  Challenges

   Please refer to the Ivip Fast Payload Replication ID
   [I-D.whittle-ivip-fpr] for discussion of the most difficult
   challenges of the FMS.  The present ID concentrates on the overall
   system, including the RUASes and UASes which connect to them.  Here,
   the FPR system - Replicators and Missing Payload Servers - is
   regarded as a subsystem.


3.  Definition of Terms

3.1.  SPI - Scalable PI space

   Once Ivip is operational, a growing subset of the global unicast
   addresses will be handled by ITRs tunnelling the packets to an ETR,
   which delivers the packets to the destination.  This subset is used
   by end-user networks and provides portability, multihoming and
   inbound traffic engineering in a manner which is highly scalable: it
   does not overly burden DFZ routers.

   SPI space is "mapped" by Ivip and this mapping system can divide it
   into smaller sections than is possible with BGP in the DFZ - a 256 IP
   address granularity for IPv4, due to a widely enforced convention on
   the lengths of routes which are accepted.

   The granularity with which Ivip maps SPI space - dividing it into
   micronets (described below) is single IP addresses for IPv4, and /64
   prefixes for IPv6.

3.1.1.  Conventional global unicast address space

   This is global unicast address space as it is used today.  With Ivip,
   this will be a subset of the full unicast space - the part which is
   not used for SPI space.  The LISP term for this is "RLOC" space.

3.2.  MAB - Mapped Address Block

   A MAB is a BGP advertised prefix which is used as SPI space.  DITRs
   (Default ITRs in the DFZ) all over the Net advertise this prefix,
   tunnelling the packets to ETRs according to the current mapping for
   the destination address of each packet.

   A MAB could, in principle, be as large as a /8.  Larger MABs are
   preferred in general, because each one burdens the BGP system with
   only a single advertisement, but includes the SPI space of
   potentially hundreds of thousands of end-user networks.  However, for
   reasons discussed below - including load sharing between ITRs and
   ease of initially loading snapshots of the mapping database - it may
   be best if MABs are more typically in the /12 to /17 range for IPv4.

   MABs do contribute to the load on the DFZ's BGP control plane, and
   each one involves one more route in the RIB and FIB of all DFZ routers.
   However, a MAB typically supports the address needs of thousands or
   tens of thousands of end-user networks.  This ratio is how Ivip or
   any other successful core-edge separation architecture solves the
   routing scaling problem.  Without such an architecture, each of these
   end-user networks would either require their own route (AKA "prefix")
   in the DFZ, or not be able to obtain address space which was portable
   and suitable for multihoming and inbound TE.

3.3.  UAB - User Address Block

   Each MAB typically contains address space which has been assigned by
   some means to many (perhaps tens of thousands) separate end-users.  A
   UAB is a contiguous range of addresses within a MAB which is assigned
   to one end-user.  UABs are important divisions for the RUAS company,
   but UABs are not specifically mentioned or needed in the mapping
   update packets handled by Replicators.  Nor are UABs relevant to the
   operation of QSDs, QSCs (caching query servers), ITRs or ETRs.

   A MAB could be assigned entirely to one end-user - as might be the
   case if the end-user converted a prefix of theirs which was
   previously conventional PI space to be managed as SPI space by the
   Ivip system.  Generally speaking, MABs are ideally large (short
   prefixes) and each contains space for multiple end-users.  Generally,
   MABs are owned or at least administered by MAB companies, who rent
   SPI space to end-user networks.  Each MAB must have its mapping
   handled by a single RUAS.  The company which operates the MAB may
   have its own RUAS.  If not, it will contract the services of an RUAS
   to handle mapping distribution for this MAB.  Ivip is intended to
   support dozens of RUASes, perhaps a hundred or so - though if there
   was a need, more than this could be accommodated.

   An end-user might have multiple UABs in a MAB, UABs in multiple MABs
   from the same company or UABs in MABs from multiple MAB companies.
   For simplicity, this ID assumes each end-user has a single UAB.
   UABs are specified by starting address and length, in units as
   mentioned above: IPv4 addresses or IPv6 /64s.  A MAB's boundaries are
   always on power-of-two boundaries of these units, since it is a
   prefix advertised in the DFZ.  UABs and micronets have arbitrary
   starting points and lengths - they are not at all constrained by
   binary "prefix" boundaries.

3.4.  Micronet

   Following Bill Herrin's suggestion, the term "micronet" refers to a
   range of SPI space for which all addresses have the same mapping.  In
   LISP, these are known as EID prefixes.  In Ivip, a micronet need not
   be on binary boundaries - it is specified by a starting address and a
   length, in units of single IPv4 addresses or IPv6 /64 prefixes.

   An end-user could use their entire UAB as a single micronet, or they
   could split it into as many micronets as they wish, and change these
   divisions dynamically.


Whittle                   Expires July 23, 2010                [Page 13]

Internet-Draft              Ivip DB Fast Push               January 2010


   Any micronet which is mapped to zero (its ETR address is 0.0.0.0 in
   IPv4) will cause ITRs to drop any packets addressed to this micronet.
   A micronet can be defined within the whole or part of a contiguous
   range of address space which is currently mapped to zero, by the FMS
   carrying an update message specifying the new micronet's starting
   address, its length, and a non-zero address for its mapping.  (Future
   work: decide exactly what instructions are needed and which sequences
   of operations are allowable for making new micronets in place of
   existing ones.)
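
   A minimal sketch of how a lookup over such micronets might behave,
   including the drop-on-zero rule, follows.  The table layout, the
   function name and the treatment of addresses not covered by any
   micronet (dropped here) are assumptions made for illustration.

```python
from bisect import bisect_right
from ipaddress import IPv4Address

ZERO = IPv4Address("0.0.0.0")

# Each micronet is (start, length, ETR address); the list is sorted by
# start and the ranges do not overlap.  All values are made up.
MICRONETS = [
    (int(IPv4Address("150.101.72.0")), 8, ZERO),
    (int(IPv4Address("150.101.72.8")), 3, IPv4Address("192.0.2.1")),
    (int(IPv4Address("150.101.72.11")), 5, IPv4Address("198.51.100.7")),
]
STARTS = [m[0] for m in MICRONETS]


def etr_for(dest: str):
    """Return the ETR address to tunnel to, or None to drop the packet."""
    addr = int(IPv4Address(dest))
    i = bisect_right(STARTS, addr) - 1   # last micronet starting <= addr
    if i < 0:
        return None                      # below the first micronet
    start, length, etr = MICRONETS[i]
    if addr >= start + length:
        return None                      # not covered (assumed: drop)
    return None if etr == ZERO else etr  # mapped to zero means drop
```

   Because micronets need not fall on prefix boundaries, the sketch
   uses a sorted list and binary search rather than a prefix trie.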

3.5.  RUAS - Root Update Authorisation System

   Multiple RUASes collectively generate the total stream of mapping
   update messages.  Each RUAS is responsible for one or more MABs.
   There may be a dozen to a hundred or so RUASes.  A greater number of
   RUAS companies is good for competition and innovation.  Prior to
   2010-01-18 it looked technically difficult to have more than a dozen
   or so RUASes.  With the simplified layer 0 Replicator arrangement,
   there can be as many RUASes as each (or most) layer 0 Replicators
   have DTLS sessions with.  So in principle, if there was a need for
   several thousand RUASes, I am sure the Replicator software could be
   made to handle this number of DTLS sessions.

   Each RUAS receives mapping updates either directly from end-user
   networks (or their appointed Multihoming Mapping companies) - or
   indirectly via intermediate organisations, each of which runs a UAS.

3.6.  UAS - Update Authorisation System

   A UAS is the system of an organisation which accepts mapping change
   commands from end-users, and conveys them directly - or perhaps
   indirectly via another UAS - to the RUAS which handles the relevant
   MAB.  An RUAS which accepts mapping update commands from end-users
   does so via its own UAS system.

   A UAS accepts upstream input from end-users and/or other UASes.  It
   generates output to downstream RUASes and/or other UASes.  One UAS
   may have relationships with multiple RUASes.  A MAB may be assigned
   to an RUAS and control of parts of this may be delegated to multiple
   UASes.  A single UAS may work only with a single RUAS, or with
   multiple and perhaps all RUASes.

   Whether the MAB itself is administratively assigned (by an RIR, or
   some national Internet Registry) to the UAS or to the RUAS is not
   important in a technical sense.  End-users will take care to choose
   address space according to the RUAS (and any UASes) it depends upon,
   because the reliability of this MAB's address space will forever be
   dependent on these organisations.


   If the MAB is not operated by an RUAS company, then the company or
   organisation which operates it can choose any RUAS to handle its
   mapping.  In this case, while an end-user network may choose to rent
   its SPI space from this particular MAB operating company, in part
   based on the reputation of the RUAS company currently chosen by the
   MAB operating company, the operating company could at any time select
   another RUAS company.  If it did so, it would presumably arrange for
   whatever UAS system its SPI-renting customers used to work with the
   new RUAS.  Assuming this is the case, then the end-user networks
   would not perceive any change, or alter however they control their
   mapping.

   The number of RUASes will probably be limited to some degree, such as
   dozens or a hundred or so, to enable them to efficiently and reliably
   work together with their jointly operated system of level 0 and 1
   Replicators to create a single stream of updates for the entire Ivip
   system.  The ability of companies with UASes to act as agents for
   RUAS companies and/or to have their own MABs, for which they contract
   an RUAS to handle the mapping, will enable a large number of
   organisations to compete in the rental of SPI space.

3.7.  UMUC - User Mapping Update Command

   (I apologise for the muddy sounding acronym.  Finding short, unused,
   meaningful, pronounceable acronyms which have not already acquired
   meanings in the IETF is quite a challenge!)

   A UMUC is whatever action the end-user performs on one or more
   different user-interfaces of whatever UAS they use to change the
   mapping of their one or more micronets.  The system would also be
   able to tell the user the current mapping and also confirm that a
   requested change to the mapping was acceptable.  In other words, the
   system lets end-user networks (and/or whichever Multihoming
   Monitoring company they contract to control the mapping of their
   micronets) "see" (server-to-human and server-to-server) how their
   UAB is broken into micronets and what ETR addresses those micronets
   are mapped to.

   The UAS system could also provide diagnostics such as testing the
   reachability of their network via one or more ETR addresses.  The
   system would also enable trialling mapping changes and altered
   micronet boundaries without actually executing the changes - so the
   end-user network operators can test that their proposed changes are
   valid before actually making them.

   QSDs will only accept certain kinds of updates, and it is vital that
   the mapping updates are applied in the order they are sent - and that
   these updates are in themselves valid.  For instance, it will
   probably be mandatory for micronets to be mapped to an ETR address of
   0.0.0.0 before being split or joined.  This rule will probably apply
   firstly to mapping updates arriving in QSDs and being applied to
   update the local copy of a MAB's mapping database, but also to
   mapping updates sent by QSDs to any querier which previously received
   mapping for a micronet whose mapping has just been changed.  The
   querier could be a QSD or an ITR.  It will be important for the UAS
   to ensure the update commands it sends to the RUAS are valid
   according to these constraints.

   In addition to testing proposed changes for validity, the UAS system
   should be able to combine multiple updates into a single set, to be
   executed in order, but at the same time.  The complete set would be
   sent on the FMS as part of a single message.  Ideally the message
   would be in a single payload of a packet, but if not, then the data
   format will recognise that a complete set of updates is spread over
   two or more payloads, and ensure the complete message is ready before
   executing it.  An example is mapping an 8-long micronet's ETR address
   to zero, splitting it into three smaller micronets and then setting
   the ETR address of each.  This would involve 17 commands.

   When testing proposed changes, or deciding whether to accept changes
   which have been ordered with the end-user network's credentials, the
   UAS system would generate an error if the mapping was to a disallowed
   address - multicast, SPI space, private address space or to some
   other prefixes to which the Ivip system does not support the
   tunnelling of packets.  Similarly, an error would be generated if
   the end-user attempted to change the mapping for some address space
   outside their UAB, or if they defined a new micronet within that
   space with non-zero mapping, or which overlapped some addresses for
   which the mapping was currently non-zero.
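
   Some of these checks could be sketched as follows.  This is an
   illustration under stated assumptions, not a specification: the
   Umuc class, the error strings, the SPI_SPACE list and the choice of
   RFC 1918 as the private ranges are all hypothetical, and the test
   for overlap with currently non-zero mapping is omitted.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address, IPv4Network


@dataclass(frozen=True)
class Umuc:
    start: IPv4Address   # first address of the micronet
    length: int          # number of IPv4 addresses
    etr: IPv4Address     # ETR address to map the micronet to


# Assumed, illustrative lists of space an ETR address must not be in.
SPI_SPACE = [IPv4Network("150.101.72.0/24")]
PRIVATE = [IPv4Network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def check_umuc(cmd: Umuc, uab_start: IPv4Address, uab_length: int):
    """Return an error string, or None if the command passes these checks."""
    lo, hi = int(cmd.start), int(cmd.start) + cmd.length
    uab_lo = int(uab_start)
    if cmd.length < 1 or lo < uab_lo or hi > uab_lo + uab_length:
        return "outside UAB"
    if int(cmd.etr) == 0:
        return None  # assumed: mapping to zero is always acceptable
    if cmd.etr.is_multicast or any(cmd.etr in n for n in PRIVATE):
        return "disallowed ETR address"
    if any(cmd.etr in n for n in SPI_SPACE):
        return "ETR address is in SPI space"
    return None
```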

   For the sake of discussion, it will be assumed that all UMUCs have
   passed these validity tests at the UAS and are for valid mapping
   addresses - so a UMUC is a successfully accepted update command from
   the end-user, or from some person or system with the end-user's
   credentials.

   There could be many methods by which this command is communicated to
   the UAS, including HTTPS web forms with username and password
   authentication.  SSL sessions might be more suitable for automated
   mapping change systems, such as those of a Multihoming Monitoring
   company which the end-user authorises to control the mapping of some
   or all of their UAB.

   In addition to authentication, the command takes the form of the
   starting address of the micronet, the length of the micronet, and the
   single ETR IP address to which this micronet is to be mapped.

3.8.  SUMUC - Signed User Mapping Update Command

   This is the information contained in a UMUC, signed by the UAS which
   accepted it from the user (or by some other UAS), being handed down
   the tree to another UAS or to the RUAS of the tree, so that the
   recipient UAS/RUAS can verify the signature and regard the UMUC as
   authoritative.
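
   The sign-then-verify hand-off can be illustrated with the sketch
   below.  This draft does not specify a signature scheme; the sketch
   substitutes an HMAC with a key shared between the two parties purely
   because it is available in the Python standard library.  A real
   deployment would more likely use public-key signatures.  The JSON
   encoding and all names are assumptions.

```python
import hashlib
import hmac
import json


def sign_umuc(umuc: dict, key: bytes) -> dict:
    """Wrap a UMUC with an authentication tag (HMAC stand-in for a signature)."""
    body = json.dumps(umuc, sort_keys=True).encode()
    tag = hmac.new(key, body, hashlib.sha256).hexdigest()
    return {"umuc": umuc, "sig": tag}


def verify_sumuc(sumuc: dict, key: bytes) -> bool:
    """Recompute the tag over the UMUC and compare in constant time."""
    body = json.dumps(sumuc["umuc"], sort_keys=True).encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sumuc["sig"])


key = b"shared-secret-between-UAS-and-RUAS"  # hypothetical
sumuc = sign_umuc(
    {"start": "150.101.72.8", "length": 3, "etr": "192.0.2.1"}, key)
```

   The recipient UAS or RUAS would accept the UMUC only if
   verify_sumuc() returns true.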

3.9.  MABUS - Update Stream specific to one MAB

   This is a stream of data by which the real-time updates to the
   mapping data for any one MAB are conveyed.  For the purposes of
   discussion, the RUASes and the Launch system are assumed to work in a
   synchronized fashion, generating a body of updates for each MAB which
   are gathered together in some way over a short period of time.  Prior
   to 2010-01-18, I assumed the whole FMS would operate on one-second
   cycles.  Now, the core of the FMS - the Replicator system - is
   asynchronous and the best thing would be for RUASes to send packets
   along it in a reasonably even manner, but coordinated so as not to
   exceed some agreed total maximum data rate in any period such as 0.1
   seconds.

   Mapping changes are typically not urgent to the point of not being
   able to wait a second or so.  So it would make sense for an RUAS to
   bundle multiple updates for one MAB together, before sending them to
   the FMS, either alone in a packet payload, or together with updates
   for other MABs.

   For the purposes of discussion, we can imagine each RUAS buffering
   changes for any one MAB for up to a second in order to collect them
   together.  Of course, for some MABs, hours or even days may pass
   without a mapping change.  This discussion is intended to explore the
   more demanding scenarios.

   Each RUAS will generate one MABUS for each of its MABs.  So each
   second or so, the RUASes collectively generate a variable length body
   of update information for every MAB in the Ivip system.  Some or many
   of these may contain no updates.  The MABUS includes mapping changes
   (altering ETR addresses of existing micronets), changes to micronet
   boundaries and snapshot messages (described above).  The data format
   would be extensible for purposes not yet anticipated.
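
   The buffering of updates per MAB for up to a second or so could be
   sketched as below.  The class name, the use of caller-supplied
   timestamps and the one-second default are illustrative assumptions.

```python
from collections import defaultdict


class MabBatcher:
    """Collects updates per MAB and releases them once they have waited."""

    def __init__(self, max_age: float = 1.0):
        self.max_age = max_age
        self.pending = defaultdict(list)  # MAB id -> buffered updates
        self.oldest = {}                  # MAB id -> time of first update

    def add(self, mab, update, now: float) -> None:
        self.oldest.setdefault(mab, now)
        self.pending[mab].append(update)

    def flush_due(self, now: float) -> dict:
        """Return {mab: [updates]} for MABs whose first update is max_age old."""
        due = [m for m, t in self.oldest.items() if now - t >= self.max_age]
        out = {}
        for m in due:
            out[m] = self.pending.pop(m)  # updates stay in arrival order
            del self.oldest[m]
        return out
```

   MABs with no pending changes simply do not appear in the result, so
   a quiet MAB costs nothing in this sketch.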

3.10.  Level 0 Replicators

   Level 0 Replicators are a small number (such as 8) of widely
   dispersed Replicators which receive packets from all the RUASes on a
   continual basis, and each of which also sends a stream of whatever it
   receives to each of the others.  This is a "fully meshed" set of
   Replicators.  These are the only ones to receive packets from RUASes
   and the only ones to drive Replicators in the other levels.

3.11.  Level 1 and greater Replicators

   A cross-linked, tree-like system of Replicators forms a redundant,
   reliable, high-speed distribution system for delivering mapping
   updates to full database ITRs and Query Servers all over the Net.

   Each Replicator receives one or more (typically two) streams of
   update packets from an upstream Replicator or Launch server.  These
   two source streams should come from widely topologically separated
   sources, ideally over two separate physical links.  For instance a
   Replicator in Berlin might receive its update streams from London and
   Berlin, from two sources in Berlin which are in different ISP
   networks, or from any combination which minimises the likelihood that
   both sources will be disrupted by any one fault.

   The Replicator identifies the DTLS payloads of each packet by the
   "Fresh / Repeat" algorithm, which is described in:
   [I-D.whittle-ivip-fpr].  The first time a packet with a particular
   payload arrives at a Replicator, it is detected as being "Fresh" and
   then the payload is replicated as DTLS packets to all the downstream
   devices, which can be Replicators, QSDs or MPSes.  When another
   packet with the same payload arrives later, as it probably will from
   the other input stream, the second one is recognised as a "Repeat"
   and no further action is taken with it.
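
   The actual "Fresh / Repeat" algorithm is defined in
   [I-D.whittle-ivip-fpr] and is not reproduced here.  The following
   sketch only illustrates the general idea of forwarding a payload the
   first time it is seen and ignoring later copies.  Identifying
   payloads by a SHA-256 digest, and the bounded memory of recent
   digests, are assumptions made for the illustration.

```python
import hashlib
from collections import OrderedDict


class Replicator:
    def __init__(self, downstream, memory: int = 10000):
        self.downstream = downstream  # callables, one per output stream
        self.seen = OrderedDict()     # digests of recently seen payloads
        self.memory = memory          # how many digests to remember

    def receive(self, payload: bytes) -> bool:
        """Forward a Fresh payload and return True; ignore a Repeat."""
        digest = hashlib.sha256(payload).digest()
        if digest in self.seen:
            return False              # Repeat: no further action
        self.seen[digest] = None
        if len(self.seen) > self.memory:
            self.seen.popitem(last=False)  # forget the oldest digest
        for send in self.downstream:
            send(payload)             # replicate to every downstream device
        return True
```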

   At present I am assuming each Replicator will receive typically two
   streams and send typically 20 streams.  However, it may be possible
   to have many more output streams, such as 50 or 100.

   Replicators could be implemented in routers, but are probably best
   implemented in ordinary software on a COTS (Commercial Off The Shelf)
   server running GNU/Linux, BSD or similar.  Replicators do not cache
   information and need no hard drive storage.  A server performing as a
   QSD could also operate as a Replicator.

3.12.  QSD - Query Server with full Database

   QSDs get a full feed of updates from one or more Replicators.  When
   they boot, they download individual snapshot files for each MAB in
   the Ivip system.

   QSDs respond immediately to queries from nearby ITRs and from caching
   Query Servers (QSCs) - and send notifications to these if mapping
   data changes for a micronet which was the subject of a recent query.

   QSDs have no routing or traffic handling functions.  In a full-scale
   billion-plus micronet deployment they need a lot of memory, so the
   best way to implement a QSD is probably on an ordinary server with
   one or more gigabit Ethernet interfaces.  No hard drive is required,
   except perhaps for logging purposes.

3.13.  QSC - Query Server with Cache

   A QSC could be implemented in a router or more likely a COTS server.
   It does not route packets, and its memory and computational
   requirements will be modest compared to those of a QSD.  There is no
   need for a full feed of updates from the Replicator system.  However,
   each QSC must be able to get mapping information from one or more
   upstream QSDs - or via upstream QSCs which themselves access upstream
   QSDs.

   The easiest way to implement a QSC would be software on a modest
   server, which would only need a hard drive for logging purposes.


4.  Update Authorities and User Interfaces

   This section is a detailed discussion of the fast push mapping
   distribution system itself, starting with the systems which accept
   commands from end-users (or their authorised representatives or
   systems) and prepare the information to be fanned out worldwide via
   the level 0 Replicators.

   This is the early stage of an ambitious design, so a number of
   options are contemplated.  This section of the system may not need
   IETF standardised protocols, since only a small number of
   organisations need to interact to make it work.  The Replicators and
   the data format of mapping updates do need to be standardized.  The
   purpose of exploring the RUAS and Launch server systems is to
   estimate the difficulty of constructing them - and hopefully to show
   that an approach like this is feasible and desirable.  There may well
   be easier approaches than the ones explored here.

   Probably the closest thing to them would be the large scale systems
   for managing DNS, such as for .com and other major TLDs.  I don't
   know anything about these and people with experience in such systems
   could probably design the UAS, RUAS and perhaps Launch server systems
   better than I could.

   The real-time nature of these systems of controlling ITR behavior has
   no precedent.  Generally, the system should work on a continual
   basis.  However, if there is a technical problem or the system is
   stopped for a few minutes to do an upgrade or whatever, the Internet
   is not going to grind to a halt.  In that downtime, end-user networks
   which experience a multihoming failure will have to wait for their
   connectivity to be restored.  Likewise, end-user networks which send
   mapping changes for inbound TE will have to wait.  The effect on TTR
   mobility would be minor, since mapping changes are not required when
   the MN changes its physical connections, including when moving to an
   entirely different access network.  The delay in mapping changes
   means that those few MNs which have chosen a new, closer, TTR will
   need to wait for traffic to be tunneled to that new TTR - meaning
   they will need to keep up the tunnel to the old, and now more
   distant, TTR for these minutes.  Normally, with mapping changes
   getting to ITRs in a few seconds, the MN could terminate the tunnel
   to the old TTR within a few seconds of the ITRs beginning their
   tunneling to the new TTR.

   The final authority to control mapping information is fully devolved
   to end-users, who by means of a username and password or some other
   authentication method, are able to issue commands to define micronets
   within their UAB, and to map each micronet to any ETR address.


   However the physical authority to control the mapping of all Mapped
   space within a single MAB rests with a single RUAS.  That RUAS may be
   acting for a UAS which administers a MAB.  The RUAS may administer
   it - perhaps on behalf of another company - and may delegate control
   of parts of it to one or more UASes.  The RUAS may have relationships
   directly to the end-users of this MAB, through its own UAS.  Here we
   discuss the flow of information and trust between these various
   entities, in real-time, so that every second or so each RUAS
   assembles a body of update information for each of its MABs.

   In the diagrams below, each RUAS or UAS is depicted as a single
   entity.  Each such entity acts as a single functional block, but
   would typically be implemented as a redundant system over several
   servers.

4.1.  RUAS Outputs

4.1.1.  Update packets to level 0 Replicators

   Each RUAS is largely autonomous in when it generates packets to be
   sent to level 0 Replicators.  Ideally it would spread its packets out
   smoothly in time.  Ideally it would send fewer, larger packets rather
   than more numerous small ones.

   In future work I intend to describe a means by which the RUASes
   collectively manage the data capacity of the FMS.  One aspect of this
   is usage fees of some kind.  Since the FMS is a shared resource,
   which burdens Replicators, QSDs and MPSes all over the world
   according to the packets it carries, there needs to be an arrangement
   whereby RUASes don't send packets for no good reason.  Since RUASes
   will be charging end-user networks, directly or indirectly, for each
   mapping change, there will probably be some kind of traffic-based
   usage fees or settlement system amongst the RUASes which collectively
   run the first two or more levels of the Replicator system.

   Exactly how this will be done commercially does not need to be
   defined.  What matters is that the technical elements can feasibly be
   used in a way which supports a shared, cooperative, effort to run the
   system reliably and in a way that no RUAS places unreasonable burdens
   on other parties.  There would probably need to be some kind of
   agreement, consortium or the like for governing the FMS.  The design
   presented here is to show that such a system could work well, not
   depend on any one RUAS or device, and that it could support a large
   enough number of RUAS companies, with RUAS systems and the level 0
   Replicators, physically dispersed in many countries.

   Another aspect is the moment-to-moment management of the total volume
   of packets sent.  This would be partly a question of the number of
   packets and mainly a question of their total length - in bits per
   second over some short time period such as 0.1 seconds or so.

   While data rates would grow over the years, at any one point in time,
   the whole FMS system would have some kind of specification for the
   peak data rate of the packets it carries.  If this was 100kbps, then
   each Replicator which accepts two input streams would need to ensure
   its data links from the two upstream replicators could, in general,
   handle this data rate with minimal chance of packet loss.

   The operators of Replicators, QSDs and MPSes need some guidance on
   peak bandwidth, and the only way to ensure the level 0 Replicators do
   not send out greater than this bandwidth is some kind of real-time
   demand balancing arrangement between the RUASes.

   RUASes will probably have widely varying needs to send updates, and
   these may change with time of day, due to a flurry of multihoming
   mapping changes resulting from a network outage or for any other
   reason.  At each point in time, each RUAS needs a "quota" - a
   quantity of data, in bytes, which is the limit of the total packets
   it is allowed to send in the next time period, which may be 0.1, 0.2
   or some other fraction of a second.  If the RUAS needs to send more
   packets than this, it should buffer the data, request a higher quota,
   and only send the packets if and when it has received a higher quota.
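
   One possible shape for this per-period quota is sketched below.  The
   class and method names are hypothetical, how a higher quota is
   requested is left out, and unused quota is assumed not to carry over
   to the next period.  For scale, a system-wide peak of 100kbps over a
   0.1 second period amounts to 1250 bytes in total.

```python
from collections import deque


class QuotaSender:
    """Sends packets in order, never exceeding the current period's quota."""

    def __init__(self, send):
        self.send = send       # callable that transmits one packet
        self.queue = deque()   # packets waiting for quota
        self.remaining = 0     # bytes still allowed in this period

    def new_period(self, quota_bytes: int) -> None:
        self.remaining = quota_bytes  # assumed: no carry-over
        self._drain()

    def submit(self, packet: bytes) -> None:
        self.queue.append(packet)
        self._drain()

    def backlog_bytes(self) -> int:
        """Bytes buffered; a basis for requesting a higher quota."""
        return sum(len(p) for p in self.queue)

    def _drain(self) -> None:
        # Stop at the first packet that does not fit, to preserve order.
        while self.queue and len(self.queue[0]) <= self.remaining:
            pkt = self.queue.popleft()
            self.remaining -= len(pkt)
            self.send(pkt)
```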

   Since the quota represents the right to use this shared resource, and
   the sending of packets involves the actual use of this right, it is
   likely that some kind of market forces will govern how the capacity
   of the system is divided, moment-to-moment.  There could be many ways
   of arranging this, and it doesn't need to be standardised by the
   IETF.  The RUAS companies will need to work together, choose who to
   accept as new RUAS companies, decide how to share the burdens of any
   common infrastructure etc.

4.1.2.  MAB snapshots

   Every few minutes (or some other time period, as chosen by the RUAS,
   but with some reasonable maximum defined by a BCP) the RUAS makes a
   copy of the complete mapping information for a MAB.  Snapshots for
   each MAB are independent of each other, and so can be done with
   different frequencies.

   The snapshot is in a format which needs to be standardized, so it can
   be downloaded and understood by any QSD, now and in the future.  This
   data format needs to be extensible to cover new kinds of mapping
   information and other functions not yet anticipated - which will be
   ignored by devices which are not capable of these functions.


   The exact format for this is for future work, but for instance would
   begin with some identifying information about the MAB, a block
   defining that the following data concerns IPv4 micronet mapping
   information (and snapshot announcements), with the possibility of
   other blocks containing different kinds of data.  Binary format would
   probably be best, and the file could then be compressed with gzip
   etc.

   Each such file will be given a distinctive name, according to a
   standardised format, which indicates at least the MAB starting
   address and length, and the time of the snapshot.

   The snapshot process will take a second or two to complete from the
   time it is initiated, and the resulting file will be copied to a
   number of servers, ideally located in a variety of locations around
   the Net.

   Each such server would be run by the RUAS directly, or as part of all
   RUASes working together.  The servers can probably be conventional
   HTTP servers, so that QSDs can download the snapshots when needed.
   There is scope for some careful design with DNS so that there is an
   automatic structure in the domain names of these servers, enabling an
   expandable system to be automatically used by QSDs without manual
   configuration.

   These files will be publicly available, and need to be made available
   for somewhat longer than the cycle time of snapshots.  So with a ten
   minute snapshot cycle, the previous snapshot should be available for
   a while - probably 10 minutes or so - after the new one is available.

   Snapshots are downloaded by QSDs when they boot, and if they suffer a
   disruption in mapping updates which necessitates a reload of this
   part of the complete mapping database.  To facilitate this, MABs
   should not be too large in terms of IPv4 addresses or IPv6 /64s - or
   at least should not contain too many micronets - which would make
   individual snapshot files excessively large.

   At boot time, or when re-synching, the QSD will monitor the update
   streams for each MAB until a snapshot announcement is found.  It will
   then buffer all subsequent updates and download the snapshot as soon
   as it is available.  Once the snapshot has arrived, and been unpacked
   to RAM, the buffered updates are applied to it.  Then, this MAB's
   part of the mapping database is up-to-date and the QSD can begin
   using it to answer queries.  (During the re-synching operation, the
   QSD will need to tell a querier it can't answer the query, or may
   buffer the query and send the same query to another QSD, passing on
   the response when it arrives.)
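
   The per-MAB sequence described above can be sketched as a small
   state machine.  The message shapes ("snapshot" carrying an
   identifier, "update" carrying a key and an ETR address) and the
   dictionary used for the mapping are simplifying assumptions; the
   real formats are for future work.

```python
class MabSync:
    """State for one MAB while a QSD boots or re-synchs."""

    def __init__(self):
        self.awaiting = None  # id of the announced snapshot, once seen
        self.buffered = []    # updates received after the announcement
        self.mapping = None   # None until this MAB can answer queries

    def on_stream(self, kind: str, value) -> None:
        if self.mapping is not None:              # already in sync
            if kind == "update":
                self.mapping[value[0]] = value[1]
        elif kind == "snapshot" and self.awaiting is None:
            self.awaiting = value                 # start buffering now
        elif kind == "update" and self.awaiting is not None:
            self.buffered.append(value)
        # Updates seen before the first announcement are ignored.

    def on_snapshot_loaded(self, snapshot_id, snapshot: dict) -> None:
        if snapshot_id != self.awaiting:
            return                                # not the one we want
        self.mapping = dict(snapshot)
        for key, etr in self.buffered:            # replay in arrival order
            self.mapping[key] = etr
        self.buffered.clear()

    def can_answer(self) -> bool:
        return self.mapping is not None
```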


   In order to reduce total path lengths for these file downloads, and
   likewise for retrieving missing packets from the same servers, it
   would be desirable if each QSD in a given location could access a
   nearby snapshot server.  It may be desirable to have every snapshot
   of every MAB in a single server, or a single set of servers which are
   accessed by geographically close QSDs.  Anycast is not a good
   technology for this, since file retrieval is best done via TCP
   sessions.  The servers need to be on conventional addresses, rather
   than SPI addresses, so the QSDs can access them without needing to
   use ITRs which themselves depend on mapping.  Likewise, any DNS
   servers involved in this server system need to be strictly on
   conventional addresses.

   Each QSD needs to be configured with, or to automatically discover,
   two or more such servers - at least one of which is relatively close
   - so the data can be found despite one server being down.

   From the point of view of the QSD, seeking an update for a given MAB
   of a particular RUAS, the address to request the file from could be
   made up from the RUAS identifier yyyy which is contained in the
   snapshot announcement (in the stream of mapping updates),
   concatenated with a locally configured "xxxxx" and
   "ipv4.ivipservers.net".  In the event that this server was
   unavailable, one or more locally configured alternatives to this
   initial "xxxxx" value could be tried - including one or more for
   nearby countries.

   The most significant 24 bits of the MAB's starting address (probably
   48 bits for IPv6, assuming this is the granularity of BGP
   advertisements) would be transformed into a text string such as
   150.101.072.  A similar transformation of the precise time of the
   snapshot would result in a second text string, and these would be
   used to reliably identify the appropriate directory and file in the
   server.
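
   As a sketch of how such names might be built: the text above fixes
   only the "ipv4.ivipservers.net" suffix and the 150.101.072 style of
   MAB string.  The dot separators and ordering in the host name, the
   time format, the directory layout and the ".snap" extension below
   are all assumptions.

```python
from datetime import datetime, timezone
from ipaddress import IPv4Address


def snapshot_host(ruas_id: str, locale: str) -> str:
    """Join the RUAS identifier and local label onto the fixed suffix."""
    return f"{ruas_id}.{locale}.ipv4.ivipservers.net"


def mab_label(mab_start: str) -> str:
    """Most significant 24 bits as zero-padded decimal octets."""
    top24 = IPv4Address(mab_start).packed[:3]
    return ".".join(f"{octet:03d}" for octet in top24)


def snapshot_path(mab_start: str, taken: datetime) -> str:
    """Directory from the MAB label, file name from the snapshot time."""
    stamp = taken.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"/{mab_label(mab_start)}/{stamp}.snap"


host = snapshot_host("yyyy", "xxxxx")
path = snapshot_path(
    "150.101.72.0", datetime(2010, 1, 19, 12, 0, 0, tzinfo=timezone.utc))
```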

4.1.3.  Missing Payload Servers (MPSes)

   Until 2010-01-18 I planned for QSDs to download the payloads of any
   packets they were missing from one of several HTTP servers, as
   described above for snapshot files - where those servers would be run
   by each RUAS.  This may be possible and desirable, but please see
   [I-D.whittle-ivip-fpr] for a description of a distributed arrangement
   of Missing Payload Servers which QSDs could access to obtain any
   payloads which did not arrive via their typically two input streams
   from level ~4 Replicators.

   ISPs and larger end-user networks would run these MPSes and they
   would be linked by HTTP or HTTPS so each could query the other,
   obtaining payloads each one was missing.  These TCP-based links are
   not subject to any PMTU constraints, since payloads of any length
   can be sent via HTTP or some other query-response protocol.

   QSDs would query one or more MPSes as needed, with persistent or
   temporary HTTP or HTTPS sessions.

   To the extent that missing packets result from local outages, it is
   more likely that a topologically distant MPS will have the payloads a
   local MPS or QSD is most likely to want.  So HTTP or HTTPS links
   across oceans and continents would naturally be used by ISPs which
   wanted to run MPSes - for mutual benefit.

4.2.  Authentication of RUAS-generated data

   Careful consideration must be given to how QSDs can quickly and
   reliably ensure that the information they receive ostensibly from
   each RUAS is genuine.

   The DTLS links between Replicators and to QSDs will prevent an
   attacker injecting bogus payloads into the FMS.  But there's no way a
   QSD could be entirely sure that all its upstream Replicators, which
   could be quite numerous (2 above, 2 above each of them, 2 above each
   of them etc.) are not under the control of an attacker.  Being able
   to direct traffic to an attacker's site, by means of altering the
   mapping information in an ITR, is such a threat to security, and such
   an attractive proposition for attackers, that some kind of digital
   signing of the mapping update information will be required.

4.2.1.  Snapshot and missing packet files

   Each RUAS has a key pair and signs the MAB snapshot and missing
   packet files with its private key.  QSDs can verify the signature
   with the RUAS's public key, subject to a PKI arrangement of
   certificates, or some other simpler arrangements.

   Both these types of files are only handled occasionally, so the
   overhead in performing crypto operations is insignificant.

4.2.2.  Mapping updates

   This principle does not apply to the update information contained in
   packets received from the Replicator system.  The system needs to be
   highly secure against attack, because even a second or two of an ITR
   mapping packets to the attacker's site constitutes an unacceptable
   breach.

   Sometimes, possibly frequently, the RUAS will send a single packet,


Whittle                   Expires July 23, 2010                [Page 25]

Internet-Draft              Ivip DB Fast Push               January 2010


   and the QSD needs to be able to authenticate this information
   independent of any which follows a second or two later, because it
   needs to use the information immediately to update its local copy of
   the mapping database.  So there will frequently be a need to
   authenticate individual packets.

   There are multiple ways of solving this problem.  I doubt anyone
   would argue that it is so difficult as to warrant the abandonment of
   the entire fast-push, local query server concept.  With more work
   later, I believe a satisfactory method can be found for the QSD to
   ensure the updates are authentic before applying them.

4.3.  RUAS - UAS interconnection

   This section depicts a single tree of delegated responsibility for
   the user control of mapping of one MAB.  The Root UAS at the base of
   the tree is run by Company X - RUAS-X.  RUAS-X could be authoritative
   for other MABs, and each such tree of delegation may have the same
   set of other UAS systems, or it could be different.  Each delegation
   tree is separate from the delegation trees of other MABs, even if
   they look similar, because the tree includes specific subsets of the
   whole MAB address range as one of the defining characteristics of its
   branches and leaves.

   The initial action which leads to the database being changed is a
   user generated (manually or by the user's equipment or by a system
   authorised by the user) UMUC (User Mapping Update Command).

   For authorising and feeding UMUCs to the RUAS-X, there is a tree as
   depicted in Figure 1.  Delegation of authority flows up the tree as
   the total address range of the MAB is split at each branching
   junction.  This tree structure involves data, in the form of SUMUCs
   (Signed User Mapping Update Commands) flowing down towards the root
   of the tree.  (Data would also flow up the tree so each user-
   interface leaf could tell end-users what their current mapping was,
   could test their requests against constraints etc.)  The idea is that
   RUAS-X could delegate control of one or more subsets of the MAB's
   total range of addresses to some other system, which in turn could
   delegate control to other systems.  There would be no absolute limit
   on the height (usually called depth) of these hierarchies.

   The RUAS maintains the master database, for each of its MABs, of what
   the mapping, division into micronets etc. actually *is*.  This
   information is used to inform UASes of the current state, which they
   can convey to end-users and use to check the validity of requests
   from these end-users.  This information is also used to generate
   snapshot files.  As the mapping in the master database is changed,
   this gives rise to actual changes which must be assembled into
   MABUSes to be sent to the level 0 Replicators in the near future.

   The servers which handle end-user interaction need to be leaves of
   this tree structure, so as not to burden the RUAS-X database servers
   themselves with details of user interaction.  This
   enables various companies to give different kinds of control for the
   mapping of the SPI space their branch of the tree controls.  Figure 1
   does not show RUAS-X having any user interface servers, but it could.
   The simplest arrangement would be the RUAS having simply a user-
   interface server and no tree of other UASes.

   There would need to be IETF standardised methods by which some server
   could execute a UMUC with the user-interface servers of any of these
   UASes.  This standardisation would be especially important for
   multihoming, because some reasonably trusted company could run an
   automated monitoring system, and have the credentials (username,
   password, key etc.) stored in their system so their system can change
   the mapping of one or more micronets the moment one link was detected
   to be faulty.  It is vital that there be a standardised method by
   which all multihoming monitoring companies could send these mapping
   change commands (and queries about the current state of mapping) to
   UASes.  Also, the company (such as X, Y or Z in Figure 1) which
   controls a particular range of the Mapped space may offer such a
   multihoming monitoring system itself.

   The tree in this example controls an MAB with the address range
   20.0.0.0 to 20.3.255.255.  In this example, company X has been
   assigned by an RIR the entire range 20.0.0.0 to 20.3.255.255.
   Company X leases to Y a quarter of this: 20.1.0.0 to 20.1.255.255.
   These divisions are on binary boundaries, but they need not be.  It
   would be just as possible for X to delegate to Y an arbitrary subset
   of the whole range, or the entire range - or just one IPv4 address or
   IPv6 /64.
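
   The containment rule behind this delegation can be illustrated with
   a minimal sketch.  It assumes ranges are held as inclusive (first,
   last) address pairs; the helper names within() and size() are
   hypothetical and not part of any Ivip specification.

```python
import ipaddress

ip = ipaddress.IPv4Address

def within(parent, child):
    """True if the inclusive range `child` lies wholly inside `parent`."""
    return parent[0] <= child[0] and child[1] <= parent[1]

def size(rng):
    """Number of addresses in an inclusive (first, last) range."""
    return int(rng[1]) - int(rng[0]) + 1

# Ranges taken from the example in the text.
mab_x = (ip("20.0.0.0"), ip("20.3.255.255"))    # whole MAB, assigned to X
lease_y = (ip("20.1.0.0"), ip("20.1.255.255"))  # part leased to Y

print(within(mab_x, lease_y))         # True
print(size(lease_y) / size(mab_x))    # 0.25

# A delegation need not fall on binary boundaries (illustrative range).
odd = (ip("20.2.0.7"), ip("20.2.0.19"))
print(within(mab_x, odd), size(odd))  # True 13
```

   Because only first and last addresses are compared, the same check
   serves for binary-aligned and arbitrary ranges alike.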

   X's Root Update Authorisation Server (RUAS) has a private key for
   signing all the MAB snapshot files it periodically creates and makes
   available.  The same key would be used for signing the mapping change
   information for each MAB which is sent to the level 0 Replicators
   and so to all QSDs.

   In this example, company Y delegates control of some of its space to
   company Z, and Z has an end-user U, who needs to control the mapping
   of a UAB containing one or more micronets in Z's range.


              User-R   User-S  User-T  User-U       Multihoming
                    \        \      |       |       Monitoring
                     \        \     |       |       Inc.
                      \      .................     /
                       \----. Web interface   .---/
                            . other protocols .
                            . etc.            .
                             ....UAS-Z........
                                   |
   Other companies                 |
   like Y and Z                    |
                        /-----<----/
   |   |           \ | /
   |   |            \|/
   |   |           UAS-Y
   \   |             |
    \  |  /----<-----/
     \ | /
      \|/
    RUAS-X    Root Update Authorisation Server company X
       | \
       |  \
       V   \->-[ Multiple web servers for MAB snapshot ]
       |
       |      Other RUASes like RUAS-X, each authoritative
       |      for mapping one or more MABs and producing
       |      regular MAB snapshots and update streams
       |      which are sent to all level 0 Replicators.
        \
         \        |    |    |        /
          \       |    |    |       /
           \      |    |    |      /
            \     |    |    |     /
             \    |    |    |    /
              \   |    |    |   |
              |   |    |    |   |
              V   V    V    V   V
              |   |    |    |   |

            Each line depicts 8 streams of packets with
            identical payloads - one stream for each of
            the 8 level 0 Replicators.

   Figure 1: Delegation tree of UASes above one RUAS.  Multiple RUASes
   all driving their mapping updates to every level 0 Replicator.  These
   fan the packets out to hundreds of thousands of QSDs all over the
   world, in a second or so.


   Z has various interfaces by which U can do this, with its own
   arrangements for authentication, for monitoring a multihoming system
   and making changes automatically etc.  Ideally there might be one or
   more automated, host-to-server, IETF-standardised protocols so all
   end users and their appointed multihoming monitoring companies could
   have standardised software for talking to whichever company's servers
   they use to control the mapping of their IP address(es).

   When user-U (or a device or system with user-U's credentials) changes
   the mapping of their micronet via a web interface, they do so through
   Z's website, after authenticating themselves by whatever means Z
   requires.  This causes UAS-Z to generate a signed copy of this update
   command (a SUMUC) and to send it to UAS-Y.  A SUMUC may include
   multiple commands to be executed in order.

   The simplest SUMUC would be a change to the ETR address of an
   existing micronet.  This would consist of three items (assuming IPv4
   for simplicity): the starting address of the micronet the update
   covers, the number of IP addresses covered by the micronet to be
   changed (>=1) (or alternatively the last address of the micronet),
   and a new mapping value - a 32 bit ETR address.  The SUMUC could also
   include a time in the future at which the update should be executed.
   In that case, it would be stored by RUAS-X and sent to the FMS at the
   appointed time.
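
   As an illustration only, the three items and the optional execution
   time might be held in a record such as the following.  The class
   name MappingChange, its field names and the use of a Unix timestamp
   are assumptions made for this sketch; no wire format is defined
   here.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional

@dataclass(frozen=True)
class MappingChange:
    start: IPv4Address                # first address of the micronet
    count: int                        # number of addresses covered (>= 1)
    etr: IPv4Address                  # new mapping value: 32 bit ETR address
    execute_at: Optional[int] = None  # assumed: Unix time for a delayed update

    def __post_init__(self):
        # Reject commands which could not describe a real micronet.
        if self.count < 1:
            raise ValueError("a micronet covers at least one address")
        if int(self.start) + self.count - 1 > 0xFFFFFFFF:
            raise ValueError("micronet runs past the end of IPv4 space")

    @property
    def last(self) -> IPv4Address:
        """Last address of the micronet (the alternative to `count`)."""
        return IPv4Address(int(self.start) + self.count - 1)

# Illustrative values: a 4-address micronet in Y's range, mapped to a
# documentation-range ETR address.
change = MappingChange(IPv4Address("20.1.7.16"), 4, IPv4Address("192.0.2.33"))
print(change.last)   # 20.1.7.19
```

   Zeroing a micronet would be the same record with an ETR address of
   0.0.0.0.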

   Mapping change commands would also include commands to join and split
   micronets.  Sequences of these commands would be sent, in order - and
   the UAS should check their validity before putting them into a SUMUC.
   So a SUMUC consists of one or multiple mapping change commands
   concerning a particular micronet, or perhaps a set of micronets.  The
   commands will be executed in order, but as if at once.

   If the SUMUC consists simply of changing a micronet's ETR address,
   including zeroing it, then this will be applied by every QSD and
   updates sent to any ITRs which need it.  Multiple such changes all
   together in the one SUMUC would cause the same effects, for multiple
   micronets.  However, if the changes involved a sequence of changes
   affecting the same SPI addresses, the QSD will update its queriers,
   which could be ITRs or QSCs, to the final state of the mapping after
   the changes.

   For instance a sequence of changes could zero two micronets (set
   their ETR address to 0.0.0.0) and then join them into one micronet.
   The resulting micronet could then be split into five micronets and
   each one mapped to a different ETR address.  The QSD may have a
   querier which is caching the mapping for the first original micronet,
   but not the other.  It will send that querier updates which define
   the new mapping arrangements for exactly that range of SPI addresses
   which the original response covered.  This avoids the ITR (or the
   QSC, if that is the querier) having to be told about a larger amount
   of SPI space than it was told about in the initial reply.  As noted
   previously, these newly defined micronets, each of which will now be
   in the cache of the ITR or QSC, will be flushed from the cache at the
   same time as the originally cached micronet would have been.
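
   The clipping described above can be sketched as follows.  This is a
   simplified model, not a specified algorithm: addresses are plain
   integers, micronets are inclusive (first, last, etr) tuples, and the
   function name clip_to_cached() is hypothetical.

```python
def clip_to_cached(cached, new_micronets):
    """Trim the post-update micronets to the range a querier has cached.

    cached:        (first, last) inclusive range given in the original reply
    new_micronets: list of (first, last, etr) after the SUMUC is applied
    """
    lo, hi = cached
    out = []
    for first, last, etr in new_micronets:
        a, b = max(first, lo), min(last, hi)
        if a <= b:                 # this micronet overlaps the cached range
            out.append((a, b, etr))
    return out

# Two original micronets, 100-109 and 110-119, are zeroed, joined and
# split into five micronets, each with its own ETR.
after = [(100, 103, "E1"), (104, 107, "E2"), (108, 111, "E3"),
         (112, 115, "E4"), (116, 119, "E5")]

# The querier had cached only the first original micronet.
print(clip_to_cached((100, 109), after))
# [(100, 103, 'E1'), (104, 107, 'E2'), (108, 109, 'E3')]
```

   Only the final state is sent; the intermediate zeroing and joining
   steps never reach the querier.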

   UAS-Y trusts this SUMUC because it can authenticate UAS-Z's
   signature.  It strips off the signature and adds its own, before
   passing the SUMUC down to the next level: RUAS-X.

   RUAS-X likewise has a copy of UAS-Y's public key, so it can
   authenticate the SUMUC.  Within a fraction of a second of U
   initiating the UMUC, the master copy of this MAB's database in RUAS-X
   is altered accordingly.  (This would be a distributed, redundant,
   database system.)

   Authority is delegated up the tree, because UAS-Y will only accept
   update commands if they are signed by one of its branch UASes, and
   for the particular address range that UAS has been authorised to
   control.
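
   A rough sketch of the check-then-re-sign step at an intermediate UAS
   follows.  It is not a proposed mechanism: HMAC with a shared secret
   stands in for the public-key signatures described above so that the
   example runs with the standard library alone, address ranges are
   plain integers, and the names sign() and relay() are hypothetical.

```python
import hashlib
import hmac

def sign(key: bytes, body: bytes) -> bytes:
    """Stand-in signature (HMAC-SHA256); a real UAS would use a key pair."""
    return hmac.new(key, body, hashlib.sha256).digest()

def relay(body, sig, child, child_keys, child_ranges, cmd_range, own_key):
    """Accept a SUMUC from a branch UAS, then re-sign it for the next level."""
    key = child_keys.get(child)
    if key is None or not hmac.compare_digest(sig, sign(key, body)):
        raise PermissionError("bad or unknown signature")
    lo, hi = child_ranges[child]
    if not (lo <= cmd_range[0] <= cmd_range[1] <= hi):
        raise PermissionError("command is outside the delegated range")
    # Strip the branch's signature and attach this UAS's own.
    return body, sign(own_key, body)

# Illustrative state held by UAS-Y about its branch UAS-Z.
keys = {"UAS-Z": b"z-secret"}
ranges = {"UAS-Z": (100, 199)}
body = b"set-etr 120 4 192.0.2.33"

out_body, out_sig = relay(body, sign(b"z-secret", body), "UAS-Z",
                          keys, ranges, (120, 123), b"y-secret")
print(out_sig == sign(b"y-secret", body))   # True
```

   The range test is what makes authority flow up the tree: a valid
   signature alone is not enough if the command touches addresses the
   branch was never delegated.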

   User-U may have given their username and password etc. to Multihoming
   Monitoring Inc. so this company can monitor their multihoming links
   and change the mapping as soon as one link goes down.  UAS-Z doesn't
   know or care who actually makes the change - as long as they can
   authenticate themselves for whatever micronet they want to change the
   mapping of.  UAS-Z would keep an audit trail of all interactions such
   as with User-U or Multihoming Monitoring Inc.


5.  Common information to be sent by the FMS

   In future work I will consider what common information all QSDs need
   in order to reliably learn the current state of Ivip-mapped SPI
   space.  The most important things are the
   identities of the RUASes, how each RUAS is represented in the 7 bit
   (for instance) "ruas" field in the FPR header of each packet in the
   FMS, and the exact details of each current MAB.  This will include
   which RUAS is responsible for the mapping of which MAB.

   One way of doing this is for QSDs to download it periodically via
   HTTPS from one or several servers which are somehow trusted and
   operated by either a consortium of the RUAS companies, or by
   individual RUASes.  Another way would be for information such as
   this to be periodically sent on the FMS itself.  Probably the best
   way is the downloaded file approach, with a regular schedule by which
   each day, QSDs would download the latest information.  MABs could be
   added to the Ivip system on a day-by-day basis.  There's no need to
   expect QSDs to set up another MAB mapping database on the basis of a
   command to this effect which arrives on the FMS itself.

   Some kind of distributed and secure rsync arrangement is probably a
   good method of doing this.


6.  The Fast Payload Replication system

   Please refer to [I-D.whittle-ivip-fpr] for all details of this, the
   most critical, global, part of Ivip's FMS.


7.  Scaling limits

   The Replicator system is scalable to any size simply by adding
   Replicators.  Assuming two input streams for each Replicator, N
   output streams gives an N/2 amplification of stream numbers per
   level.  N could be quite high in the early years of introduction,
   when the number of micronets and updates is small by comparison with
   the design target of one to ten billion micronets, with accompanying
   update rates driven by their use for inbound TE for multihomed non-
   mobile end-user networks and by mobile devices selecting new TTRs.
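
   The N/2 amplification can be made concrete with a small calculation.
   The 8 level 0 Replicators come from Figure 1; the value N = 20 is
   purely an assumption for illustration, as is the premise that each
   QSD, like each Replicator, takes two feeds.

```python
def replicators_per_level(level0, n_out, levels):
    """Replicator counts for levels 0..levels, each with 2 inputs, n_out outputs."""
    counts = [level0]
    for _ in range(levels):
        # n_out streams per Replicator, shared two per downstream Replicator.
        counts.append(counts[-1] * n_out // 2)
    return counts

def qsds_served(level0, n_out, levels):
    """QSDs fed by the last level, assuming two feeds per QSD."""
    return replicators_per_level(level0, n_out, levels)[-1] * n_out // 2

print(replicators_per_level(8, 20, 3))   # [8, 80, 800, 8000]
print(qsds_served(8, 20, 3))             # 80000
print(qsds_served(8, 20, 4))             # 800000
```

   Under these assumptions four levels below level 0 would reach
   hundreds of thousands of QSDs; a smaller N would need more levels.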

   First, a maximal IPv4 example will be considered.  Assume a billion
   micronets, most of them for single IP addresses.  Presumably most of
   these will be for individual end-users, at home or with mobile
   devices.  The update rate will be relatively low for multihoming the
   home and office-based micronets.

   The update rate due to inbound TE is impossible to predict.  Being
   able to steer traffic dynamically to maximise utilization of multiple
   links is economically highly attractive.  Market mechanisms will tend
   to set prices for updates which balance competing concerns.  If the
   price is too low, there will be more of them and the FPR system will
   need to be improved to cope with them - so the price would rise to
   either reduce the number, or pay for the upgrades.

   It is possible that the RUASes could collectively set prices low
   enough to make a profit running their operation and many of the
   Replicators - levels 0 and 1 at least, and perhaps level 2 or 3 too -
   with a very high volume of TE updates.  TE updates are the class of
   updates with the most elastic demand.  Multihoming updates are needed
   urgently when they are needed, but most of the time, for any one end-
   user network, none are needed.  TTR mobility updates are probably
   somewhat elastic.  If it is expensive to choose a nearby TTR, then
   people will make do with a distant one for longer, or indefinitely.

   There is a potentially large market for TE changes, because if an
   end-user network made lots of them, they may be able to make much
   better use of less expensive links.

   If RUASes collectively set mapping update prices so low that the
   volume rose to quite a high level, it is possible that ISPs and end-
   user networks which run QSDs may feel less and less inclined to
   accept all these updates - without some financial encouragement from
   the RUASes who are making money from the updates.

   If this grew to the point where those operating QSDs found they had
   to spend money upgrading their QSDs just to cope with the volume,
   then there would be the possibility that they could instead program
   their QSDs to ignore the most frequent updates which had patterns
   resembling TE updates.

   Then, in order for the RUASes to be able to continue charging for
   these TE updates, the RUASes might need to pay QSD operators to
   accept such a high level of updates.  This would probably be
   excessively expensive considering the number of ISPs and larger end-
   user networks which would be running QSDs.  So RUASes would be under
   strong pressure to limit the total rate of updates to a level the
   great majority of QSD operators are happy with.  The price of updates
   will not deter their use for multihoming service restoration - and
   this would represent a small proportion of total updates.  Higher
   prices per update would reduce the number for TE, in a highly elastic
   manner.  Likewise, higher prices per update would cause mobile users
   (or more directly the TTR companies, who are paying for each update)
   not to change TTRs as often.

   So overall, it is impossible to state with confidence what update
   rates might be expected.

   Even with the entire Earth's population owning a mobile device with
   its own micronets, if we pick some figure, such as 1000 km, within
   which there is no significant benefit in choosing a closer TTR, then
   a WAG (Wild-Ass Guess) could be based on airline passenger numbers.
   If we assume that each such trip would be long enough to require a
   new TTR, then we would get some very approximate worst-case figure.

   Statistics from the International Air Transport Association
   [IATA-2009] indicate that commercial airlines carried 2.271 billion
   passengers in 2008.  I have not been able to find estimates for the
   number of people travelling large distances by road or train, but it
   is reasonable to assume these are relatively small compared to the
   numbers of airline passengers.  Most travel by car and train involves
   trips short enough, with a return trip home, that there will be no
   need to use a closer TTR during the whole trip.  Truck drivers
   crossing continents might be an exception, but the number of such
   trips would be small compared to the 2 billion airline passenger
   figure.

   There could be growth in passenger numbers and it is possible that on
   long trips, the aircraft's satellite link would connect to several
   ground stations, with the MNs in the aircraft therefore (ideally)
   changing their mapping to a new TTR near the ground station.  (This
   is explored in [TTR Mobility].)  There are various ways of
   extrapolating these figures, such as with population growth.  For
   simplicity, I will double the 2 billion figure and use this to
   roughly include all mapping changes due to multihoming service
   restoration and TE.  So I have a WAG of 4 billion mapping changes a
   year.

   This is about 128 updates a second.

   The raw data for a change to an IPv6 micronet's ETR address is 32
   bytes: 64 bits for the micronet's starting /64, another 64 bits for
   its length or end, and 128 bits for the ETR address. 128 of these a
   second is 4k bytes a second - 32kbps.  There would be peaks and
   troughs, and there could be peaks due to a major outage driving many
   end-user networks to switch ETRs for multihoming service restoration.
   This is a low data rate in the scheme of things.  VoIP calls
   typically run at 16, 32 or 64kbps for the actual voice data, plus
   considerable overhead due to IP and other headers.
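
   The arithmetic behind these figures can be checked with a few lines.
   The only assumption added is a 365-day year; the variable names are
   illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600        # 31,536,000

changes_per_year = 4_000_000_000          # the WAG from the text
per_second = changes_per_year / SECONDS_PER_YEAR
print(round(per_second, 1))               # 126.8, rounded to 128 in the text

# IPv6 update: 64 bit start + 64 bit length or end + 128 bit ETR address.
bytes_per_update = (64 + 64 + 128) // 8
print(bytes_per_update)                   # 32

rate = 128                                # updates per second, as in the text
bytes_per_second = rate * bytes_per_update
bits_per_second = bytes_per_second * 8
print(bytes_per_second, bits_per_second)  # 4096 32768
```

   So "4k bytes" and "32kbps" here are binary multiples: 4096 bytes and
   32768 bits per second of raw mapping data, before any packet
   overhead.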

   If there were 5 or 10 billion mobile devices, each with a micronet,
   many of these would keep using the same TTR from one year to the
   next.  There would be a mapping change when the micronet was assigned
   to a given handset, and then another when the handset was no longer
   used, or replaced by another.  So there would also be a significant
   background level of administrative mapping changes with billions of
   micronets for mobile devices.

   It is hard to imagine a scenario in which the update rate would
   require prohibitive volumes of data, even by today's standards, for
   any substantial ISP.  The flow of update packets would be somewhat
   greater than this raw data rate due to the need for packing them into
   some kind of robust format, having hashes of them with digital
   signatures etc.  The total amount of mapping data coming into an ISP
   would be 2 to 4 times this due to the need for feeds from two or more
   Replicators.  Still, by the time such high levels of adoption could
   occur, the bandwidth they require will surely not present a
   significant difficulty for any ISP, or for larger end-user networks
   which want to run their own ITRs and wish to have their own QSDs,
   rather than relying on the QSDs of their ISPs.


8.  Managing Replicators

   Replicators should be easy to create and deploy.  Any substantial
   server with the requisite software, in a suitable location, will do
   the job - but it should be well secured against attackers gaining
   root access.  A successful system will require some mechanisms which
   ensure reliable operation with a minimal amount of configuration and
   ongoing management.

   In the current model, each Replicator normally receives feeds from
   two upstream Replicators, and generates some figure N feeds for
   downstream devices.  Each Replicator should be able to request and
   quickly gain a replacement feed from another upstream Replicator if
   one of those it is using becomes unavailable, or unreliable.

   This requires that Replicators in general be operating below
   capacity, so that when others in their level fail, they can take up
   the slack.  This needs to be locally configured beforehand, with
   upstream Replicators of organisations which have agreed to provide
   the feeds, and with downstream Replicators of organisations who have
   requested them.
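
   One simple way a Replicator might pick a replacement feed from its
   locally configured list is sketched below.  The function
   choose_feeds(), the Replicator names and the notion of a "healthy"
   set are all hypothetical; how health would be detected is not
   specified in this ID.

```python
def choose_feeds(current, healthy, backups, wanted=2):
    """Keep healthy current feeds; top up from the pre-agreed backup list."""
    feeds = [r for r in current if r in healthy]
    for r in backups:              # backups are tried in configured order
        if len(feeds) >= wanted:
            break
        if r in healthy and r not in feeds:
            feeds.append(r)
    return feeds

# rep-b has become unavailable; rep-c and rep-d were agreed beforehand.
print(choose_feeds(["rep-a", "rep-b"],
                   {"rep-a", "rep-c", "rep-d"},
                   ["rep-c", "rep-d"]))
# ['rep-a', 'rep-c']
```

   If fewer than the wanted number of feeds come back, the Replicator
   is running with reduced redundancy and an operator would need to be
   alerted.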

   It is possible to imagine a sophisticated, distributed, management
   system for the Replicator network.  This could be developed over
   time, since for initial deployment, considerable manual configuration
   and less automation would be acceptable.


9.  Security Considerations

   This ID mentions some authentication and security problems and
   possible solutions to them, but full consideration of security can
   only occur when the architecture is fleshed out in greater detail.


10.  IANA Considerations

   For future work.


11.  Informative References

   [I-D.whittle-ivip-arch]
              Whittle, R., "Ivip (Internet Vastly Improved Plumbing)
              Architecture", draft-whittle-ivip-arch-04 (work in
              progress), January 2010.

   [I-D.whittle-ivip-fpr]
              Whittle, R., "Fast Payload Replication mapping
              distribution for Ivip", draft-whittle-ivip-fpr-00 (work in
              progress), January 2010.

   [I-D.whittle-ivip-glossary]
              Whittle, R., "Glossary of some Ivip and scalable routing
              terms", draft-whittle-ivip-glossary-00 (work in progress),
              January 2010.

   [IATA-2009]
              "Fact sheet: industry statistics", September 2009,
              <http://www.iata.org/NR/rdonlyres/
              8BDAFB17-EED8-45D3-92E2-590CD87A3144/0/
              FactSheetIndustryFactsSept09.pdf>.

   [TTR Mobility]
              Whittle, R. and S. Russert, "TTR Mobility Extensions for
              Core-Edge Separation Solutions to the Internet's Routing
              Scaling Problem", August 2008,
              <http://www.firstpr.com.au/ip/ivip/TTR-Mobility.pdf>.


Author's Address

   Robin Whittle
   First Principles

   Email: rw@firstpr.com.au
   URI:   http://www.firstpr.com.au/ip/ivip/

