Network Working Group                                         R. Whittle
Internet-Draft                                          First Principles
Intended status: Experimental                           January 19, 2010
Expires: July 23, 2010

Ivip Mapping Database Fast Push
draft-whittle-ivip-db-fast-push-03.txt

Abstract

From the base of draft-whittle-ivip-arch-03 and later, this ID describes Ivip's fast-push mapping distribution system. This accepts mapping changes from end-user networks or organizations they authorise to make these changes. The mapping changes are handled by RUAS (Root Update Authorization Server) companies who collectively run the initial levels of a global network of Replicator servers. This is a secure, packet-based flooding system which will propagate the mapping changes to potentially hundreds of thousands of full database query servers (QSDs) in ISPs and larger end-user networks all over the world. This ID describes the overall system. The distributed Fast Payload Replication system is described in detail in draft-whittle-ivip-fpr.

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on July 23, 2010.

Copyright Notice

Whittle Expires July 23, 2010 [Page 1]
Internet-Draft Ivip DB Fast Push January 2010

Copyright (c) 2010 IETF Trust and the persons identified as the document authors.
All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the BSD License.

Table of Contents

1. Introduction
  1.1. Outline of the RUAS and Replicator systems
  1.2. Assumptions
  1.3. It may not be so daunting...
2. Goals, Non-Goals and Challenges
  2.1. Goals
  2.2. Non-goals
  2.3. Challenges
3. Definition of Terms
  3.1. SPI - Scalable PI space
    3.1.1. Conventional global unicast address space
  3.2. MAB - Mapped Address Block
  3.3. UAB - User Address Block
  3.4. Micronet
  3.5. RUAS - Root Update Authorisation System
  3.6. UAS - Update Authorisation System
  3.7. UMUC - User Mapping Update Command
  3.8. SUMUC - Signed User Mapping Update Command
  3.9. MABUS - Update Stream specific to one MAB
  3.10. Level 0 Replicators
  3.11. Level 1 and greater Replicators
  3.12. QSD - Query Server with full Database
  3.13. QSC - Query Server with Cache
4. Update Authorities and User Interfaces
  4.1. RUAS Outputs
    4.1.1. Update packets to level 0 Replicators
    4.1.2. MAB snapshots
    4.1.3. Missing Payload Servers (MPSes)
  4.2. Authentication of RUAS-generated data
    4.2.1. Snapshot and missing packet files
    4.2.2. Mapping updates
  4.3. RUAS - UAS interconnection
5. Common information to be sent by the FMS
6. The Fast Payload Replication system
7. Scaling limits
8. Managing Replicators
9. Security Considerations
10. IANA Considerations
11. Informative References
Author's Address

1. Introduction

The aim of this I-D is to establish that Ivip's fast-push mapping distribution system (FMS) is practical and desirable for very large numbers of micronets (EIDs in LISP terminology) and rates of change of the mapping database.

All parts of Ivip are intended to be operated by a variety of organisations, with appropriate cooperation - including between companies which are competing with each other.

Please refer to [I-D.whittle-ivip-arch] for an explanation of Ivip in general.
A glossary of Ivip and some general scalable routing terms and acronyms is available in [I-D.whittle-ivip-glossary].

This is a revision of the 02 version, with a substantial simplification of what was previously the "Launch server" system which drove the level 1 Replicators. These are replaced by level 0 Replicators, which are functionally identical to other relatively simple Replicators, but use more input streams. The level 0 Replicators are fully meshed with each other and this mesh is driven by packets from the multiple RUASes (Root Update Authorization Servers). A new network element - a Missing Payload Server - is also introduced. Please see [I-D.whittle-ivip-fpr] for a detailed explanation.

1.1. Outline of the RUAS and Replicator systems

The most important part of the FMS consists of thousands (perhaps tens of thousands in the long term future) of essentially identical "Replicator" servers. This can be viewed as taking a tree-like structure somewhat similar to a multicast set of routers, driving multiple such trees with the same data, and then cross-linking the branches at multiple levels so that the payload of a packet lost at one replication point will be replaced by an identical payload in another packet from another branch. Each Replicator receives at least two streams of identical mapping data, so it is much less likely to miss a payload than if it only received the payloads in a single stream of packets from a single source.

A better way to view the system is that it floods each replication point from at least two directions from the previous level. The items being flooded are the payloads of DTLS packets - UDP packets whose contents are encrypted to prevent attacks involving spoofing such packets in order to propagate them in the Replicator system.

At level 0, the Replicators flood each other and the larger number of level 1 Replicators.
The level 1 Replicators flood the larger number of level 2 Replicators. Since the flooding unit is a packet payload, as soon as a particular payload is received, it is replicated, so the delay time in each point of replication will be very short indeed - probably only milliseconds.

In this way, it is reasonable to expect a single payload, injected securely to the level 0 Replicators, to be fanned out to hundreds of thousands of end-points all over the Net, in a time not much longer than is imposed by the intervening routers and data links. If there are 5 levels of x10 "amplification" and each involves a 20msec delay (hopefully it would be less), then we can expect global delivery of the payloads to most fibre-linked locations within 300 to 350ms. (From Melbourne Australia, most of the distant sites in Russia or Africa have RTTs in the 300 to 450msec range, but I found a server apparently in Swaziland (cr1mba.swazi.net) with RTTs of 710 to 900ms.)

In this way, each Replicator consumes two identical streams from geographically and topologically different sources, and fans the content of the streams out to some larger number of Replicators or QSDs at the next level. This number of output streams per Replicator may be in the tens to one hundred range, depending on the volume of updates. Initially, it would be quite high, when update rates are low - meaning that the initial global Replicator network could serve the growing number of QSDs with just three or so levels of Replicators, and with each one fanning out updates to a large number of Replicators at the next level.

After some number of levels of replication, determined by local conditions, the streams deliver the update information at a QSD. Ideally, each QSD will receive two streams from two geographically dispersed Replicators.
These need not be at the same level, so the system is relatively flexible, and each Replicator will generally be sending a complete stream of packets.

There will also be a distributed system of Missing Payload Servers (MPSes) which receive streams from Replicators and store the payloads for ten or so minutes. MPSes will compare notes with each other via TCP (probably HTTP or HTTPS) and so form one or more worldwide distributed groups which will quickly replace any payload they missed. QSDs will query one or more MPSes - probably a close one and a distant one - and so be able to receive missing payloads on the hopefully rare occasions when one or more payloads are missing from the two or more streams each QSD normally receives from Replicators.

RUASes asynchronously feed packets into the fully meshed ring of level 0 Replicators.

Snapshots of segments of the mapping database are taken regularly by each RUAS. Each snapshot contains a complete copy of the mapping of one MAB (Mapped Address Block) at a particular instant. At that point in time, a hash function of the mapping data for this MAB is generated and within a few seconds is sent to all QSDs. This enables each QSD to verify that its copy of the mapping for this MAB is fully up-to-date.

During initialisation, and if an error is found in the local copy of the mapping for a particular MAB, the QSD downloads snapshots from HTTP servers provided by the RUAS companies. The QSD buffers all updates for the MAB which arrive after the snapshot and hash message. Once the snapshot is downloaded and unpacked into the QSD's copy of the mapping database, the buffered updates are applied and the database then contains an up-to-date copy of mapping for this MAB. Updates are then applied as they arrive from the two or more upstream Replicators.

1.2. Assumptions

For the purposes of this discussion, it is assumed there will be a single global Ivip system, with multiple organisations being responsible for the management of the various blocks of address space which are managed with Ivip. The system itself is intended to be decentralised and have no single point of failure. Furthermore, it is intended to be highly suitable for being built, operated and expanded upon by a number of separate organisations, who cooperate much as do the organisations which run the DNS today.

It would also be possible for an organisation to establish an Ivip-like system, without reference to any IETF RFCs, and to conduct a business renting out address space in small, flexible, chunks, with portability and multihoming via any ISP who provides the requisite, relatively simple, ETRs. The most likely scenario is this being done, with one or more independent Ivip-like systems operated by different companies, primarily for supporting TTR mobility [TTR Mobility], but also usable for portability, multihoming and inbound Traffic Engineering for non-mobile end-user networks.

For simplicity, this ID assumes that Ivip development will be coordinated into a single global system, as DNS is, following appropriate IETF engineering work and administrative decisions in RIRs and other relevant organisations. A development timeframe of 2010 to ca. 2014 is assumed, with widespread deployment being achieved later in the decade, for IPv4 at least.

The IPv4 FMS is identical in principle to the IPv6 FMS. The server software which implements the Replicators will probably remain as two separate items, but a single server could run them both, independently, and so be both an IPv4 and IPv6 Replicator. Each RUAS would have both IPv4 and IPv6 sections, with separate outputs of mapping data.
The level 0 Replicator servers for IPv4 would be physically different and independent of those for IPv6.

In addition to the global fast push database update distribution system discussed in this ID and in [I-D.whittle-ivip-fpr], Ivip also involves Query Servers sending "notifications" to ITRs which recently requested mapping for a micronet whose mapping has just changed. This is a second form of push - on a local scale - and is outlined in [I-D.whittle-ivip-arch].

This ID concentrates on IPv4, since the future core-edge separation architecture is more urgently required for IPv4 than for IPv6. In principle, the same arrangements will apply for IPv6, with a different and more verbose data format than the 12 or so bytes required for each IPv4 mapping update. It may make sense to defer finalisation of any future IPv6 map-encap scheme until substantial operational experience has been gained with the IPv4 scheme.

1.3. It may not be so daunting...

Ivip documentation is written with a preference for detailed discussion over terseness. So Ivip IDs may appear rather daunting at first. Hopefully these IDs will be clearly understandable, and the reader will recognise that this scalable routing solution is a momentous development, requiring detailed consideration.

Ivip goes beyond the formal RRG requirements of providing portability (the only way of allowing free choice of alternative ISPs), multihoming and inbound traffic engineering, by also providing, with TTR mobility, a global mobility system for both IPv4 and IPv6. While no mapping changes are required unless the Mobile Node moves a large distance, such as 1000km or more, it is important that the Ivip FMS be able to scale to very large numbers of updates and cope with mapping databases for up to 10^10 micronets.

This ID focuses on handling billions of micronets and potentially thousands of updates a second. These data-rates may sound high today, but domestic customers are already downloading full quality video in real-time.
By the time such large levels of adoption arise, the bandwidth needed for these will not be a significant obstacle. However, it is difficult to imagine a situation where more than 10 billion mapping changes are needed each year, which is an average of 320 a second. There would be peaks, but with an IPv6 mapping change requiring about 32 bytes this is an average of roughly 80 to 100kbps.

During initial deployment, the demands on the fast push system will be far lighter than those anticipated below, so the system might initially be somewhat simpler. In the initial stages of introduction, there may be little need to deploy dedicated servers for the "Replicator" functions, since the volume of updates may be so light as to make it practical to run this software on existing servers, such as nameservers.

Furthermore, in the early years of introduction, when there are hundreds of thousands or a few million micronets, the low level of update packets (compared to the highest imaginable levels contemplated below) should enable each Replicator to fan out to many more next-level Replicators than would be possible when hundreds of millions or billions of micronets are handled by the system. This would mean fewer levels of Replicators and fewer Replicators than would be needed with current technology if the system was handling billions of micronets.

This ID explores how the FMS would be structured in the most demanding future scenarios which can be realistically expected. Building the initial FMS for trials and early services won't be as daunting as it may look from the diagrams and discussions below.

2. Goals, Non-Goals and Challenges

2.1. Goals

The overall goal of the fast push system is to enable end-users, who manage the mapping of their one or more micronets of address space, to securely, reliably and easily communicate their mapping change command to some organisation with which they have a business relationship, so that that change will be propagated to every QSD as soon as possible.

"As soon as possible" means typical delay times of a few seconds, ideally zero seconds, but in practice probably two or so seconds. Prior to 2010-01-18, the Ivip IDs mentioned longer times than this, but this was on the basis of "Launch" servers executing a complex pipelined process which would take three or so seconds. This arrangement is now replaced by fully meshed level 0 Replicators, which have no complex protocols, pipelining or delays.

"Reliably" means that in the great majority of cases, the QSDs receive every mapping change as expected and that, in the relatively rare event of this being impossible due to packet loss, the QSD can recover from this situation within one or at the most two seconds by requesting a copy of the packet from one or more Missing Payload Servers (MPSes - which were also introduced on 2010-01-18).

Reliability also involves robustness against DoS attacks. This can never be completely protected against for any device on the open Internet, since its link(s) can easily be flooded by packets sent from botnets etc. As mentioned in [I-D.whittle-ivip-fpr], considerable protection from DoS attacks could be achieved by running the level 0 and level 1 Replicators via private network links. These levels would be owned and operated by the RUAS companies working together. This would enable reliable feeds to hundreds or perhaps a thousand or so level 2 Replicators all over the Net, which would mean that a DoS attack would not be able to cause so much trouble.

"Securely" means that each QSD which receives the updates will be able to instantly verify that the updates are genuine, rather than the result of an attacker who might, for instance, send forged packets to that device or to some other part of the fast push system.

The data format for the mapping update packets is for further work. There will be end-to-end encryption so that the QSD can authenticate the mapping data originated from the RUAS which sent it. Whether this involves authenticating each individual payload, or combining typically multiple packet payloads into a single body of data to be authenticated, remains to be decided. Sometimes, probably quite frequently, the RUAS will send only a single packet of updates, so then the entire payload would be authenticated, since there are no other payloads to consider. The data format needs to provide for open-ended extensions in the future and to support authentication.

In the present design, DTLS (RFC 4347) UDP packets are sent from RUASes to Replicators, from Replicators to other Replicators and from Replicators to QSDs and MPSes. This protects against an attacker spoofing a packet and having it replicated or accepted by a QSD or MPS. However, it cannot be completely assured that a Replicator is not under the control of an attacker - which would enable them to send packets which would be replicated and accepted.

The most common mapping change command, as sent by the end-user, or by some other organisation or device which has the end-user's credentials, would involve the length of the micronet being checked to ensure it is the same as the currently configured length of the micronet which starts at that location.

The end-user's command might be part of an encrypted exchange involving a challenge-response protocol and the end-user's private key.
Alternatively, an encrypted link could be used, such as via HTTPS, and a conventional username and password given as part of the command.

The end-user would previously have communicated directly or indirectly with their RUAS to configure their total assigned address space into one or more micronets. This ID concentrates on the changes of ETR address for existing micronets, but the mapping change packets will also contain information about how existing micronets have been deleted and replaced by other micronets, smaller or larger and with different start and end-points.

RUASes and the level 0 and level 1 Replicators are few in number and will be administered carefully, so this ID does not consider automated aids to their management and debugging. However, the rest of the Replicators, level 2 and greater, will be numerous and operated by a wide range of organisations. Future work will concern maximising the degree to which the Replicator system can be robustly and easily managed, rather than requiring a great deal of manual configuration etc.

In order to debug the way the Ivip system is used, such as transient erroneous or malicious mapping updates which cause packets to be tunnelled to addresses where they are not welcome, there will need to be a system which monitors all mapping changes and keeps a lasting record of them. Then, aggrieved parties can search such a system for the address on which they received the unwanted packets, and so determine the micronet involved. This will enable the aggrieved party to complain to the RUAS which is responsible for that micronet. This "mapping history" function could be performed by one or multiple separate systems, each simply taking a feed from the Replicator system.

2.2. Non-goals

Apart from checking the ETR address against any specific exclusion lists (such as specific prefixes, private RFC 1918 and multicast space) and to ensure it is not part of a Mapped Address Block (MAB - a BGP advertised prefix containing SPI space, divided into many micronets), the entire Ivip system takes no interest in whether there is a device at that address, whether the address is advertised in BGP, whether there is or was an ETR at that address, whether the ETR is reachable or whether the ETR can deliver packets to the micronet's destination device.

These are all matters which fall under the responsibility of the end-user network whose micronet is being mapped to this ETR address.

It is not a goal of the system to keep mapping changes secret from any party. This would be impossible. Therefore, it cannot be a goal of this or probably any core-edge elimination scheme that in a mobile setting, the movement of an individual's device could not be inferred by anyone who monitors the mapping updates. However, the mapping only concerns the currently active TTR. MNs can still use a TTR no matter where they are physically connected, and using a TTR hundreds or even thousands of km distant will probably present no serious difficulties due to path-length or lost packets. So mapping changes need not indicate much, or anything, about the physical location of the MN.

Replicators perform a best-effort copying of mapping update packets. They do not store the payloads of these packets for any appreciable time or attempt to request a payload which is missing from their two or more input streams.

2.3. Challenges

Please refer to the Ivip Fast Payload Replication ID [I-D.whittle-ivip-fpr] for discussion of the most difficult challenges of the FMS. The present ID concentrates on the overall system, including the RUASes and UASes which connect to them. Here, the FPR system - Replicators and Missing Payload Servers - is regarded as a subsystem.
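As a rough illustration of the static ETR address checks listed in Section 2.2, the following Python sketch tests an address against an exclusion list and against the set of MABs. The function name, the contents of the exclusion list and the example MAB are assumptions made for this sketch only; the draft does not define them. In keeping with the non-goals, nothing here tests whether an ETR exists or is reachable.

```python
import ipaddress

# Assumed exclusion list: RFC 1918 private space plus IPv4 multicast.
# A real deployment would presumably carry further prefixes.
EXCLUDED = [ipaddress.ip_network(p) for p in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "224.0.0.0/4")]

# Hypothetical MAB, taken from a documentation prefix.
MABS = [ipaddress.ip_network("203.0.113.0/24")]


def etr_address_acceptable(etr, mabs):
    """Static checks only: exclusion lists and MAB membership."""
    addr = ipaddress.ip_address(etr)
    if any(addr in net for net in EXCLUDED):
        return False
    # An ETR address must not itself lie inside SPI space.
    return not any(addr in mab for mab in mabs)


print(etr_address_acceptable("192.0.2.1", MABS))
print(etr_address_acceptable("10.1.2.3", MABS))
```

The check is deliberately limited to list membership, since reachability and delivery are the responsibility of the end-user network.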
3. Definition of Terms

3.1. SPI - Scalable PI space

Once Ivip is operational, a growing subset of the global unicast addresses will be handled by ITRs tunnelling the packets to an ETR, which delivers the packets to the destination. This subset is used by end-user networks and provides portability, multihoming and inbound traffic engineering in a manner which is highly scalable, in that it does not overly burden DFZ routers.

SPI space is "mapped" by Ivip and this mapping system can divide it into smaller sections than is possible with BGP in the DFZ, where the granularity is 256 IP addresses for IPv4, due to a widely enforced convention on the lengths of routes which are accepted. The granularity with which Ivip maps SPI space - dividing it into micronets (described below) - is single IP addresses for IPv4, and /64 prefixes for IPv6.

3.1.1. Conventional global unicast address space

This is global unicast address space as it is used today. With Ivip, this will be a subset of the full unicast space - the part which is not used for SPI space. The LISP term for this is "RLOC" space.

3.2. MAB - Mapped Address Block

A MAB is a BGP advertised prefix which is used as SPI space. DITRs (Default ITRs in the DFZ) all over the Net advertise this prefix, tunnelling the packets to ETRs according to the current mapping for the destination address of each packet.

A MAB could, in principle, be as large as a /8. Larger MABs are preferred in general, because each one burdens the BGP system with only a single advertisement, but includes the SPI space of potentially hundreds of thousands of end-user networks. However, for reasons discussed below - including load sharing between ITRs and ease of initially loading snapshots of the mapping database - it may be best if MABs are more typically in the /12 to /17 range for IPv4.
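To put the MAB sizes above into numbers, this small Python sketch computes how many IPv4 addresses a MAB of a given prefix length contains, and how many 256-address blocks that space would represent at the DFZ granularity described in Section 3.1. The function name is an assumption for this sketch, and the figures are simple arithmetic rather than measurements.

```python
def mab_capacity(prefix_len):
    """Return (IPv4 addresses in the MAB, equivalent 256-address blocks)."""
    addresses = 2 ** (32 - prefix_len)
    # 256 addresses is the finest division generally accepted in the DFZ,
    # whereas Ivip can map each IPv4 address as its own micronet.
    return addresses, addresses // 256


for length in (8, 12, 17):
    addresses, blocks = mab_capacity(length)
    print(f"/{length}: {addresses} addresses, {blocks} blocks of 256")
```

A /12 MAB holds 1,048,576 addresses and a /17 holds 32,768, each behind a single BGP advertisement.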
MABs do contribute to the load on the DFZ's BGP control plane, and involve one more route in the RIB and FIB of all DFZ routers. However, a MAB typically supports the address needs of thousands or tens of thousands of end-user networks. This ratio is how Ivip or any other successful core-edge separation architecture solves the routing scaling problem. Without such an architecture, each of these end-user networks would either require their own route (AKA "prefix") in the DFZ, or not be able to obtain address space which was portable and suitable for multihoming and inbound TE.

3.3. UAB - User Address Block

Each MAB typically contains address space which has been assigned by some means to many (perhaps tens of thousands) separate end-users. A UAB is a contiguous range of addresses within a MAB which is assigned to one end-user. UABs are important divisions for the RUAS company, but UABs are not specifically mentioned or needed in the mapping update packets handled by Replicators. Nor are UABs relevant to the operation of QSDs, QSCs (caching query servers), ITRs or ETRs.

A MAB could be assigned entirely to one end-user - as might be the case if the end-user converted a prefix of theirs which was previously conventional PI space to be managed as SPI space by the Ivip system. Generally speaking, MABs are ideally large (short prefixes) and each contains space for multiple end-users. Generally, MABs are owned or at least administered by MAB companies, who rent SPI space to end-user networks.

Each MAB must have its mapping handled by a single RUAS. The company which operates the MAB may have its own RUAS. If not, it will contract the services of an RUAS to handle mapping distribution for this MAB. Ivip is intended to support dozens of RUASes, perhaps a hundred or so - though if there was a need, more than this could be accommodated.
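The relationship between a UAB and its MAB can be sketched as a simple containment check. In this Python sketch a UAB is held as a starting address plus a length in IPv4 addresses; the class and function names, and the example prefix, are assumptions for illustration and not part of any Ivip data format.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class UAB:
    start: ipaddress.IPv4Address  # first address of the block
    length: int                   # number of IPv4 addresses


def uab_fits_in_mab(uab, mab):
    """True if the UAB is a non-empty range lying wholly inside the MAB."""
    first = int(uab.start)
    last = first + uab.length - 1
    return (uab.length > 0
            and int(mab.network_address) <= first
            and last <= int(mab.broadcast_address))


# Hypothetical MAB taken from a documentation prefix.
mab = ipaddress.ip_network("203.0.113.0/24")
print(uab_fits_in_mab(UAB(ipaddress.ip_address("203.0.113.10"), 5), mab))
print(uab_fits_in_mab(UAB(ipaddress.ip_address("203.0.113.250"), 10), mab))
```

A check of this kind would matter to the RUAS company that assigns UABs, not to Replicators or QSDs, which never see UABs.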
An end-user might have multiple UABs in a MAB, UABs in multiple MABs from the same company or UABs in MABs from multiple MAB companies. For simplicity, this ID assumes each end-user has a single UAB.

UABs are specified by starting address and length, in units as mentioned above: IPv4 addresses or IPv6 /64s. A MAB's boundaries are always on power-of-two boundaries of these units, since it is a prefix advertised in the DFZ. UABs and micronets have arbitrary starting points and lengths - they are not at all constrained by binary "prefix" boundaries.

3.4. Micronet

Following Bill Herrin's suggestion, the term "micronet" refers to a range of SPI space for which all addresses have the same mapping. In LISP, these are known as EID prefixes. In Ivip, a micronet need not be on binary boundaries - it is specified by a starting address and a length, in units of single IPv4 addresses or IPv6 /64 prefixes.

An end-user could use their entire UAB as a single micronet, or they could split it into as many micronets as they wish, and change these divisions dynamically.

Any micronet which is mapped to zero (its ETR address is 0.0.0.0 in IPv4) will cause ITRs to drop any packets addressed to this micronet.

A micronet can be defined within the whole or part of a contiguous range of address space which is currently mapped to zero, by the FMS carrying an update message specifying the new micronet's starting address, its length, and a non-zero address for its mapping. (Future work: decide exactly what instructions are needed and which sequences of operations are allowable for making new micronets in place of existing ones.)

3.5. RUAS - Root Update Authorisation System

Multiple RUASes collectively generate the total stream of mapping update messages. Each RUAS is responsible for one or more MABs. There may be a dozen to a hundred or so RUASes.
A greater number of RUAS companies is good for competition and innovation. Prior to 2010-01-18 it looked technically difficult to have more than a dozen or so RUASes. With the simplified level 0 Replicator arrangement, there can be as many RUASes as each (or most) of the level 0 Replicators have DTLS sessions with. So in principle, if there was a need for several thousand RUASes, I am sure the Replicator software could be made to handle this number of DTLS sessions.

Each RUAS receives mapping updates either directly from end-user networks (or their appointed Multihoming Monitoring companies) - or indirectly via intermediate organisations, each of which runs a UAS.

3.6. UAS - Update Authorisation System

A UAS is the system of an organisation which accepts mapping change commands from end-users, and conveys them directly - or perhaps indirectly via another UAS - to the RUAS which handles the relevant MAB. An RUAS which accepts mapping update commands from end-users does so via its own UAS system.

A UAS accepts upstream input from end-users and/or other UASes. It generates output to downstream RUASes and/or other UASes. One UAS may have relationships with multiple RUASes.

A MAB may be assigned to an RUAS and control of parts of this may be delegated to multiple UASes. A single UAS may work only with a single RUAS, or with multiple and perhaps all RUASes.

Whether the MAB itself is administratively assigned (by an RIR, or some national Internet Registry) to the UAS or to the RUAS is not important in a technical sense. End-users will need to choose address space with care, according to the RUAS (and any UASes) it depends upon, because the reliability of this MAB's address space will forever be dependent on these organisations.

If the MAB is not operated by an RUAS company, then the company or organisation which operates it can choose any RUAS to handle its mapping.
In this case, while an end-user network may choose to rent its SPI space from this particular MAB operating company, in part based on the reputation of the RUAS company currently chosen by the MAB operating company, the operating company could at any time select another RUAS company. If it did so, it would presumably arrange for whatever UAS system its SPI-renting customers used to work with the new RUAS. Assuming this is the case, then the end-user networks would not perceive any change, nor need to alter how they control their mapping.

The number of RUASes will probably be limited to some degree, such as dozens or a hundred or so, to enable them to efficiently and reliably work together with their jointly operated system of level 0 and 1 Replicators to create a single stream of updates for the entire Ivip system.

The ability of companies with UASes to act as agents for RUAS companies and/or to have their own MABs which they contract a RUAS to handle the mapping for, will enable a large number of organisations to compete in the rental of SPI space.

3.7. UMUC - User Mapping Update Command

(I apologise for the muddy sounding acronym. Finding short, unused, meaningful, pronounceable acronyms which have not already acquired meanings in the IETF is quite a challenge!)

A UMUC is whatever action the end-user performs on one or more different user-interfaces of whatever UAS they use to change the mapping of their one or more micronets. The system would also be able to tell the user the current mapping and also confirm that a requested change to the mapping was acceptable.

In other words, the system lets end-user networks (and/or whichever Multihoming Monitoring company they contract to control the mapping of their micronets) "see" (server-to-human and server-to-server) how their UAB is broken into micronets and what ETR addresses those micronets are mapped to.
The UAS system could also provide diagnostics such as testing the reachability of their network via one or more ETR addresses. The system would also enable trialling mapping changes and altered micronet boundaries without actually executing the changes - so the end-user network operators can test that their proposed changes are valid, before actually making them.

QSDs will only accept certain kinds of updates, and it is vital that the mapping updates are applied in the order they are sent - and that these updates are in themselves valid. For instance, it will probably be mandatory for micronets to be mapped to an ETR address of 0.0.0.0 before being split or joined. This rule will probably apply firstly to mapping updates arriving in QSDs and being applied to update the local copy of a MAB's mapping database, but also to mapping updates sent by QSDs to any querier which previously received mapping for a micronet whose mapping has just been changed. The querier could be a QSD or an ITR.

It will be important for the UAS to ensure the update commands it sends to the RUAS are valid according to these constraints.

In addition to testing proposed changes for validity, the UAS system should be able to combine multiple updates into a single set, to be executed in order, but at the same time. The complete set would be sent on the FMS as part of a single message. Ideally the message would be in a single payload of a packet, but if not, then the data format will recognise that a complete set of updates is spread over two or more payloads, and ensure the complete message is ready before executing it. For instance, a set might map an 8-long micronet's ETR address to zero, split it into three smaller micronets and then set the ETR address of each. This would involve 17 commands.
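The "validate, then execute in order but as if at once" behaviour described above can be sketched as follows. This is only an illustration: the command names ("set_etr", "split"), the tuple layout and the use of integers for SPI addresses are assumptions, since the actual command vocabulary is left to future work, and so the number of commands in this sketch does not correspond to any count given in the text.

```python
from dataclasses import dataclass

ZERO_ETR = "0.0.0.0"  # mapping to zero: ITRs drop packets for this micronet


@dataclass
class Micronet:
    start: int   # first address, as an integer
    length: int  # number of IPv4 addresses covered
    etr: str     # ETR address the micronet is mapped to


def apply_update_set(micronets, commands):
    """Apply a set of commands in order to a copy of the mapping.

    Raises ValueError on the first invalid command, leaving the caller's
    data untouched, so the whole set is accepted or rejected as a unit.
    """
    table = {m.start: Micronet(m.start, m.length, m.etr) for m in micronets}
    for command in commands:
        op, start = command[0], command[1]
        micronet = table.get(start)
        if micronet is None:
            raise ValueError(f"no micronet starts at {start}")
        if op == "set_etr":
            micronet.etr = command[2]
        elif op == "split":
            lengths = command[2]
            # Hypothetical rule from the text: zero the mapping before splitting.
            if micronet.etr != ZERO_ETR:
                raise ValueError("micronet must be mapped to zero before a split")
            if sum(lengths) != micronet.length or any(n < 1 for n in lengths):
                raise ValueError("split lengths must exactly cover the micronet")
            del table[start]
            for n in lengths:
                table[start] = Micronet(start, n, ZERO_ETR)
                start += n
        else:
            raise ValueError(f"unknown command {op!r}")
    return sorted(table.values(), key=lambda m: m.start)


original = [Micronet(start=1000, length=8, etr="192.0.2.1")]
update_set = [
    ("set_etr", 1000, ZERO_ETR),
    ("split", 1000, [3, 3, 2]),
    ("set_etr", 1000, "192.0.2.10"),
    ("set_etr", 1003, "192.0.2.20"),
    ("set_etr", 1006, "192.0.2.30"),
]
result = apply_update_set(original, update_set)
```

Working on a copy is one simple way for a UAS to trial a proposed set of changes without executing it: if no exception is raised, the same set can then be submitted for real.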
When testing proposed changes, or deciding whether to accept changes which have been ordered with the end-user network's credentials, the UAS system would generate an error if the mapping was to a disallowed address - multicast, SPI space, private address space or some other prefix to which the Ivip system does not support the tunnelling of packets.

Similarly, an error would be generated if the end-user attempted to change the mapping for some address space outside their UAB, or if they defined a new micronet within that space with non-zero mapping, or which overlapped some addresses for which the mapping was currently non-zero.

For the sake of discussion, it will be assumed that all UMUCs have passed these validity tests at the UAS and are for valid mapping addresses - so a UMUC is a successfully accepted update command from the end-user, or from some person or system with the end-user's credentials.

There could be many methods by which this command is communicated to the UAS, including HTTPS web forms with username and password authentication. SSL sessions might be more suitable for automated mapping change systems, such as those of a Multihoming Monitoring company which the end-user authorises to control the mapping of some or all of their UAB.

In addition to authentication, the command takes the form of the starting address of the micronet, the length of the micronet, and a single ETR IP address to which this micronet's mapping will be changed.

3.8. SUMUC - Signed User Mapping Update Command

This is the information contained in a UMUC, signed by the UAS which accepted it from the user (or by some other UAS), being handed down the tree to another UAS or to the RUAS of the tree, so that the recipient UAS/RUAS can verify the signature and regard the UMUC as authoritative.

3.9.
MABUS - Update Stream specific to one MAB

This is a stream of data by which the real-time updates to the mapping data for any one MAB are conveyed. For the purposes of discussion, the RUASes and the Launch system are assumed to work in a synchronized fashion, generating a body of updates for each MAB which are gathered together in some way over a short period of time.

Prior to 2010-01-18, I assumed the whole FMS would operate on one-second cycles. Now, the core of the FMS - the Replicator system - is asynchronous and the best thing would be for RUASes to send packets along it in a reasonably even manner, but coordinated so as not to exceed some agreed total maximum data rate in any period such as 0.1 seconds.

Mapping changes are typically not urgent to the point of not being able to wait a second or so. So it would make sense for an RUAS to bundle multiple updates for one MAB together, before sending them to the FMS, either alone in a packet payload, or together with updates for other MABs.

For the purposes of discussion, we can imagine each RUAS buffering changes for any one MAB for up to a second in order to collect them together. Of course, for some MABs, hours or even days may pass without a mapping change. This discussion is intended to explore the more demanding scenarios.

Each RUAS will generate one MABUS for each of its MABs. So each second or so, the RUASes collectively generate a variable length body of update information for every MAB in the Ivip system. Some or many of these may contain no updates.

The MABUS includes mapping changes (altering ETR addresses of existing micronets), changes to micronet boundaries and snapshot messages (described above). The data format would be extensible for purposes not yet anticipated.

3.10.
Level 0 Replicators

A small number (such as 8) of widely dispersed Replicators receive packets from all the RUASes on a continual basis, and each one also sends a stream of whatever it received to each other one. This is a "fully meshed" set of Replicators. These are the only ones to receive packets from RUASes and the only ones to drive Replicators in the other levels.

3.11. Level 1 and greater Replicators

A cross-linked, tree-like system of Replicators forms a redundant, reliable, high-speed distribution system for delivering mapping updates to full database ITRs and Query Servers all over the Net.

Each Replicator receives one or more (typically two) streams of update packets from an upstream Replicator or Launch server. These two source streams should come from widely topologically separated sources, ideally over two separate physical links. For instance, a Replicator in Berlin might receive its update streams from London and Berlin, from two sources in Berlin which are in different ISP networks, or from any combination which minimises the likelihood that both sources will be disrupted by any one fault.

The Replicator classifies the DTLS payload of each packet with the "Fresh / Repeat" algorithm, which is described in [I-D.whittle-ivip-fpr]. The first time a packet with a particular payload arrives at a Replicator, it is detected as being "Fresh" and then the payload is replicated as DTLS packets to all the downstream devices, which can be Replicators, QSDs or MPSes. When another packet with the same payload arrives later, as it probably will from the other input stream, the second one is recognised as a "Repeat" and no further action is taken with it.

At present I am assuming each Replicator will receive typically two streams and send typically 20 streams. However, it may be possible to have many more output streams, such as 50 or 100.
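The Fresh / Repeat behaviour described above can be illustrated with a minimal sketch. The actual algorithm is specified in [I-D.whittle-ivip-fpr] and may identify payloads quite differently; here, purely as an assumption, a payload is recognised by its SHA-256 digest, and the bounded "window" of remembered digests is a hypothetical parameter. DTLS handling is omitted and downstream devices are modelled as plain callables.

```python
import hashlib
from collections import OrderedDict


class Replicator:
    """Forwards each distinct payload once, however many inputs deliver it."""

    def __init__(self, downstream, window=10000):
        self.downstream = downstream  # callables, one per downstream device
        self.window = window          # how many recent digests to remember
        self.seen = OrderedDict()     # digests of recently seen payloads

    def receive(self, payload: bytes) -> bool:
        """Return True if the payload was Fresh, False if it was a Repeat."""
        digest = hashlib.sha256(payload).digest()
        if digest in self.seen:
            return False              # Repeat: no further action
        self.seen[digest] = None
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)  # forget the oldest digest
        for send in self.downstream:
            send(payload)             # Fresh: replicate to every output
        return True


qsd_a, qsd_b = [], []
replicator = Replicator(downstream=[qsd_a.append, qsd_b.append])
first = replicator.receive(b"mapping update 1")   # from input stream 1
second = replicator.receive(b"mapping update 1")  # same payload, stream 2
```

Because the Replicator keeps only a bounded set of recent digests, it holds no long-term state, which is in keeping with the statement that Replicators do not cache mapping information.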
Replicators could be implemented in routers, but are probably best implemented in ordinary software on a GNU-Linux/BSD etc. COTS (Commercial Off The Shelf) server. Replicators do not cache information and need no hard drive storage. A server performing as a QSD could also operate as a Replicator.

3.12. QSD - Query Server with full Database

QSDs get a full feed of updates from one or more Replicators. When they boot, they download individual snapshot files for each MAB in the Ivip system.

QSDs respond immediately to queries from nearby ITRs and from caching Query Servers (QSCs) - and send notifications to these if mapping data changes for a micronet which was the subject of a recent query.

QSDs have no routing or traffic handling functions. In a full-scale billion-plus micronet deployment they need a lot of memory, so the best way to implement a QSD is probably on an ordinary server with one or more gigabit Ethernet interfaces. No hard drive is required, except perhaps for logging purposes.

3.13. QSC - Query Server with Cache

A QSC could be implemented in a router or more likely a COTS server. It does not route packets, and its memory and computational requirements will be modest compared to those of a QSD. There is no need for a full feed of updates from the Replicator system. However, each QSC must be able to get mapping information from one or more upstream QSDs - or via upstream QSCs which themselves access upstream QSDs.

The easiest way to implement a QSC would be software on a modest server, which would only need a hard drive for logging purposes.

4.
Update Authorities and User Interfaces This section is a detailed discussion of the fast push mapping distribution system itself, starting with the systems which accept commands from end-users (or their authorised representatives or systems) and prepare the information to be fanned out worldwide via the level 0 Replicators. This is the early stage of an ambitious design, so a number of options are contemplated. This section of the system may not need IETF standardised protocols, since only a small number of organisations need to interact to make it work. The Replicators and the data format of mapping updates do need to be standardized. The purpose of exploring the RUAS and Launch server systems is to estimate the difficulty of constructing them - and hopefully to show that an approach like this is feasible and desirable. There may well be easier approaches than the ones explored here. Probably the closest thing to them would be the large scale systems for managing DNS, such as for .com and other major TLDs. I don't know anything about these and people with experience in such systems could probably design the UAS, RUAS and perhaps Launch server systems better than I could. The real-time nature of these systems of controlling ITR behavior has no precedent. Generally, the system should work on a continual basis. However, if there is a technical problem or the system is stopped for a few minutes to do an upgrade or whatever, the Internet is not going to grind to a halt. In that downtime, end-user networks which experience a multihoming failure will have to wait for their connectivity to be restored. Likewise, end-user networks which send mapping changes for inbound TE will have to wait. The effect on TTR mobility would be minor, since mapping changes are not required when the MN changes its physical connections, including when moving to an entirely different access network. 
The delay in mapping changes means that those few MNs which have chosen a new, closer, TTR will need to wait for traffic to be tunneled to that new TTR - meaning they will need to keep up the tunnel to the old, and now more distant, TTR for these minutes. Normally, with mapping changes getting to ITRs in a few seconds, the MN could terminate the tunnel to the old TTR within a few seconds of the ITRs beginning their tunneling to the new TTR.

The final authority to control mapping information is fully devolved to end-users, who by means of a username and password or some other authentication method, are able to issue commands to define micronets within their UAB, and to map each micronet to any ETR address.

However the physical authority to control the mapping of all Mapped space within a single MAB rests with a single RUAS. That RUAS may be acting for a UAS which administers a MAB. The RUAS may administer it - perhaps on behalf of another company - and may delegate control of parts of it to one or more UASes. The RUAS may have relationships directly to the end-users of this MAB, through its own UAS.

Here we discuss the flow of information and trust between these various entities, in real-time, so that every second or so each RUAS assembles a body of update information for each of its MABs.

In the diagrams below, each RUAS or UAS is depicted as a single entity. Each such entity acts as a single functional block, but would typically be implemented as a redundant system over several servers.

4.1. RUAS Outputs

4.1.1. Update packets to level 0 Replicators

Each RUAS is largely autonomous in when it generates packets to be sent to level 0 Replicators. Ideally it would spread its packets out smoothly in time. Ideally it would send fewer, larger, packets rather than more numerous small ones.
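The preference for fewer, larger packets can be illustrated with a small sketch of how an RUAS might pack already-encoded updates into payloads. The 1200-byte default and the function name are assumptions for illustration only; the real limit would follow from the PMTU considerations and packet format in [I-D.whittle-ivip-fpr].

```python
def bundle_updates(encoded_updates, max_payload=1200):
    """Greedily pack encoded updates, in order, into as few payloads as fit.

    Order is preserved because QSDs must apply updates in the order sent.
    """
    payloads = []
    current = b""
    for update in encoded_updates:
        if len(update) > max_payload:
            raise ValueError("a single update exceeds the payload limit")
        if current and len(current) + len(update) > max_payload:
            payloads.append(current)  # current payload is full: start another
            current = b""
        current += update
    if current:
        payloads.append(current)
    return payloads


# Ten hypothetical 12-byte updates with a deliberately small 50-byte limit.
updates = [bytes([i]) * 12 for i in range(10)]
payloads = bundle_updates(updates, max_payload=50)
```

In this example four 12-byte updates fit in each 50-byte payload, so ten updates travel in three packets rather than ten.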
In future work I intend to describe a means by which the RUASes collectively manage the data capacity of the FMS. One aspect of this is usage fees of some kind. Since the FMS is a shared resource, which burdens Replicators, QSDs and MPSes all over the world according to the packets it carries, there needs to be an arrangement whereby RUASes don't send packets for no good reason. Since RUASes will be charging end-user networks, directly or indirectly, for each mapping change, there will probably be some kind of traffic-based usage fees or settlement system amongst the RUASes which collectively run the first two or more levels of the Replicator system.

Exactly how this will be done commercially does not need to be defined. What matters is that the technical elements can feasibly be used in a way which supports a shared, cooperative, effort to run the system reliably and in a way that no RUAS places unreasonable burdens on other parties.

There would probably need to be some kind of agreement, consortium or the like for governing the FMS. The design presented here is to show that such a system could work well, not depend on any one RUAS or device, and that it could support a large enough number of RUAS companies, with RUAS systems and the level 0 Replicators, physically dispersed in many countries.

Another aspect is the moment-to-moment management of the total volume of packets sent. This would be partly a question of the number of packets and mainly a question of their total length - in bits per second over some short time period such as 0.1 seconds or so.

While data rates would grow over the years, at any one point in time, the whole FMS system would have some kind of specification for the peak data rate of the packets it carries.
If this was 100kbps, then each Replicator which accepts two input streams would need to ensure its data links from the two upstream replicators could, in general, handle this data rate with minimal chance of packet loss. The operators of Replicators, QSDs and MPSes need some guidance on peak bandwidth, and the only way to ensure the level 0 Replicators do not send out greater than this bandwidth is some kind of real-time demand balancing arrangement between the RUASes. RUASes will probably have widely varying needs to send updates, and these may change with time of day, due to a flurry of multihoming mapping changes resulting from a network outage or for any other reason. At each point in time, each RUAS needs a "quota" - a quantity of data, in bytes, which is the limit of the total packets it is allowed to send in the next time period, which may be 0.1, 0.2 or some other fraction of a second. If the RUAS needs to send more packets than this, it should buffer the data, request a higher quota, and only send the packets if and when it has received a higher quota. Since the quota represents the right to use this shared resource, and the sending of packets involves the actual use of this right, it is likely that some kind of market forces will govern how the capacity of the system is divided, moment-to-moment. There could be many ways of arranging this, and it doesn't need to be standardised by the IETF. The RUAS companies will need to work together, choose who to accept as new RUAS companies, decide how to share the burdens of any common infrastructure etc. 4.1.2. MAB snapshots Every few minutes (or some other time period, as chosen by the RUAS, but with some reasonable maximum defined by a BCP) the RUAS makes a copy of the complete mapping information for a MAB. Snapshots for each MAB are independent of each other, and so can be done with different frequencies. 
The snapshot is in a format which needs to be standardized, so it can be downloaded and understood by any QSD, now and in the future. This data format needs to be extensible to cover new kinds of mapping information and other functions not yet anticipated - which will be ignored by devices which are not capable of these functions.

The exact format for this is for future work, but for instance would begin with some identifying information about the MAB, a block defining that the following data concerns IPv4 micronet mapping information (and snapshot announcements), with the possibility of other blocks containing different kinds of data. Binary format would probably be best, and the file could then be compressed with gzip etc.

Each such file will be given a distinctive name, according to a standardised format, which indicates at least the MAB starting address and length, and the time of the snapshot.

The snapshot process will take a second or two to complete from the time it is initiated, and the resulting file will be copied to a number of servers, ideally located in a variety of locations around the Net. Each such server would be run by the RUAS directly, or as part of all RUASes working together. The servers can probably be conventional HTTP servers, so that QSDs can download the snapshots when needed. There is scope for some careful design with DNS so that there is an automatic structure in the domain names of these servers, enabling an expandable system to be automatically used by QSDs without manual configuration.

These files will be publicly available, and need to be made available for somewhat longer than the cycle time of snapshots. So with a ten minute snapshot cycle, the previous snapshot should be available for a while - probably 10 minutes or so - after the new one is available.
Snapshots are downloaded by QSDs when they boot, and if they suffer a disruption in mapping updates which necessitates a reload of this part of the complete mapping database. To facilitate this, MABs should not be too large in terms of IPv4 addresses or IPv6 /64s - or at least should not contain too many micronets - which would make individual snapshot files excessively large.

At boot time, or when re-synching, the QSD will monitor the update streams for each MAB until a snapshot announcement is found. It will then buffer all subsequent updates and download the snapshot as soon as it is available. Once the snapshot has arrived, and been unpacked to RAM, the buffered updates are applied to it. Then, this MAB's part of the mapping database is up-to-date and the QSD can begin using it to answer queries. (During the re-synching operation, the QSD will need to tell a querier it can't answer the query, or may buffer the query and send the same query to another QSD, passing on the response when it arrives.)

In order to reduce total path lengths for these file downloads, and likewise for retrieving missing packets from the same servers, it would be desirable if each QSD in a given location could access a nearby snapshot server. It may be desirable to have every snapshot of every MAB in a single server, or a single set of servers which are accessed by geographically close QSDs. Anycast is not a good technology for this, since file retrieval is best done via TCP sessions.

The servers need to be on conventional addresses, rather than SPI addresses, so the QSDs can access them without needing to use ITRs which themselves depend on mapping. Likewise, any DNS servers involved in this server system need to be strictly on conventional addresses.
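The boot-time and re-synching procedure for one MAB, as described in Section 4.1.2, can be sketched as a small state machine. The item layout ("announce" and "update" tuples), the class name and the in-memory mapping table are assumptions for illustration; the stream and snapshot formats themselves are left to future work. The sketch also assumes that the snapshot reflects exactly the updates which preceded its announcement in the stream.

```python
class MabSync:
    """Tracks one MAB's part of a QSD's mapping database during (re)sync."""

    def __init__(self):
        self.synced = False
        self.awaited_snapshot = None  # id from the snapshot announcement
        self.buffered = []            # updates seen after the announcement
        self.mapping = {}             # micronet start -> (length, etr)

    def on_stream_item(self, item):
        kind = item[0]
        if self.synced:
            if kind == "update":
                self._apply(item)
        elif kind == "announce" and self.awaited_snapshot is None:
            self.awaited_snapshot = item[1]
        elif kind == "update" and self.awaited_snapshot is not None:
            self.buffered.append(item)
        # Updates seen before any announcement are dropped: the awaited
        # snapshot is assumed to include their effect already.

    def on_snapshot(self, snapshot_id, snapshot_mapping):
        if snapshot_id != self.awaited_snapshot:
            return                    # not the snapshot we are waiting for
        self.mapping = dict(snapshot_mapping)
        for item in self.buffered:
            self._apply(item)         # replay in arrival order
        self.buffered.clear()
        self.synced = True

    def can_answer(self):
        return self.synced

    def _apply(self, item):
        _, start, length, etr = item
        self.mapping[start] = (length, etr)


sync = MabSync()
sync.on_stream_item(("update", 10, 4, "192.0.2.1"))   # before announcement
sync.on_stream_item(("announce", "snap-1"))
sync.on_stream_item(("update", 20, 2, "192.0.2.9"))   # buffered
ready_before = sync.can_answer()
sync.on_snapshot("snap-1", {10: (4, "192.0.2.1"), 20: (2, "0.0.0.0")})
```

While `can_answer()` is false, the QSD would decline queries for this MAB or relay them to another QSD, as the text describes.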
Each QSD needs to be configured with, or to automatically discover, two or more such servers - at least one of which is relatively close - so the data can be found despite one server being down.

From the point of view of the QSD, seeking a snapshot for a given MAB of a particular RUAS, the address to request the file from could be made up from the RUAS identifier yyyy which is contained in the snapshot announcement (in the stream of mapping updates), concatenated with a locally configured "xxxxx" and "ipv4.ivipservers.net". In the event that this server was unavailable, one or more locally configured alternatives to this initial "xxxxx" value could be tried - including one or more for nearby countries.

The most significant 24 bits of the MAB's starting address (probably 48 bits for IPv6, assuming this is the granularity of BGP advertisements) would be transformed into a text string such as 150.101.072. A similar transformation of the precise time of the snapshot would result in a second text string, and these would be used to reliably identify the appropriate directory and file in the server.

4.1.3. Missing Payload Servers (MPSes)

Until 2010-01-18 I planned for QSDs to download the payloads of any packets they missed from one of several HTTP servers, as described above for snapshot files - where those servers would be run by each RUAS.

This may be possible and desirable, but please see [I-D.whittle-ivip-fpr] for a description of a distributed arrangement of Missing Payload Servers which QSDs could access to obtain any payloads which did not arrive via their typically two input streams from level ~4 Replicators.

ISPs and larger end-user networks would run these MPSes and they would be linked by HTTP or HTTPS so each could query the other, obtaining payloads each one was missing.
These TCP-based links are not subject to any PMTU constraints, since payloads of any length can be sent via HTTP or some other query-response protocol.

QSDs would query one or more MPSes as needed, with persistent or temporary HTTP or HTTPS sessions.

To the extent that missing packets result from local outages, it is more likely that a topologically distant MPS will have the payloads a local MPS or QSD is most likely to want. So HTTP or HTTPS links across oceans and continents would naturally be used by ISPs which wanted to run MPSes - for mutual benefit.

4.2. Authentication of RUAS-generated data

Careful consideration must be given to how QSDs can quickly and reliably ensure that the information they receive ostensibly from each RUAS is genuine.

The DTLS links between Replicators and to QSDs will prevent an attacker injecting bogus payloads into the FMS. But there's no way a QSD could be entirely sure that all its upstream Replicators, which could be quite numerous (2 above, 2 above each of them, 2 above each of them etc.) are not under the control of an attacker.

Being able to direct traffic to an attacker's site, by means of altering the mapping information in an ITR, is such a threat to security, and such an attractive proposition for attackers, that some kind of digital signing of the mapping update information will be required.

4.2.1. Snapshot and missing packet files

Each RUAS has a key pair and signs the MAB snapshot and missing packet files with its private key. QSDs can verify the signature with the RUAS's public key, subject to a PKI arrangement of certificates, or some other simpler arrangements. Both these types of files are only handled occasionally, so the overhead in performing crypto operations is insignificant.

4.2.2. Mapping updates

This principle does not apply to the update information contained in packets received from the Replicator system.
The system needs to be highly secure against attack, because even a second or two of an ITR sending packets to the attacker's site constitutes an unacceptable breach.

Sometimes, possibly frequently, the RUAS will send a single packet, and the QSD needs to be able to authenticate this information independent of any which follows a second or two later, because it needs to use the information immediately to update its local copy of the mapping database. So there will frequently be a need to authenticate individual packets.

There are multiple ways of solving this problem. I doubt anyone would argue that it is so difficult as to warrant the abandonment of the entire fast-push, local query server concept. With more work later, I believe a satisfactory method can be found of the QSD ensuring the updates are authentic before applying them.

4.3. RUAS - UAS interconnection

This section depicts a single tree of delegated responsibility for the user control of mapping of one MAB. The Root UAS at the base of the tree is run by Company X - RUAS-X. RUAS-X could be authoritative for other MABs, and each such tree of delegation may have the same set of other UAS systems, or it could be different. Each delegation tree is separate from the delegation trees of other MABs, even if they look similar, because the tree includes specific subsets of the whole MAB address range as one of the defining characteristics of its branches and leaves.

The initial action which leads to the database being changed is a user generated (manually or by the user's equipment or by a system authorised by the user) UMUC (User Mapping Update Command).

For authorising and feeding UMUCs to the RUAS-X, there is a tree as depicted in Figure 1. Delegation of authority flows up the tree as the total address range of the MAB is split at each branching junction.
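The way delegated authority narrows at each branch of the tree can be sketched as a range-containment check. The class and method names are hypothetical. The ranges for X and Y are those of the example used with Figure 1, while the range given to Z is an assumption, since the text does not state it. Signature checking is omitted here.

```python
from dataclasses import dataclass, field
from ipaddress import IPv4Address


@dataclass
class Authority:
    """An RUAS or UAS and the inclusive address range it controls."""
    name: str
    first: IPv4Address
    last: IPv4Address
    children: list = field(default_factory=list)

    def delegate(self, name, first, last):
        """Delegate a sub-range, which need not be on binary boundaries."""
        first, last = IPv4Address(first), IPv4Address(last)
        if not (self.first <= first <= last <= self.last):
            raise ValueError(f"{name}: range is outside {self.name}'s range")
        for child in self.children:
            if first <= child.last and child.first <= last:
                raise ValueError(f"{name}: range overlaps {child.name}'s")
        child = Authority(name, first, last)
        self.children.append(child)
        return child

    def accepts(self, start, length):
        """True if a micronet (start, length) lies wholly inside this range."""
        if length < 1:
            return False
        start = IPv4Address(start)
        end = start + (length - 1)
        return self.first <= start and end <= self.last


ruas_x = Authority("RUAS-X", IPv4Address("20.0.0.0"), IPv4Address("20.3.255.255"))
uas_y = ruas_x.delegate("UAS-Y", "20.1.0.0", "20.1.255.255")
uas_z = uas_y.delegate("UAS-Z", "20.1.4.0", "20.1.7.255")  # assumed range
```

Each UAS would apply `accepts()` to a command before signing it and passing it towards the root, so a command for space outside a branch's delegated range never reaches RUAS-X.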
The tree structure involves data, in the form of SUMUCs (Signed User Mapping Update Commands) flowing down towards the root of the tree. (Data would also flow up the tree so each user-interface leaf could tell end-users what their current mapping was, could test their requests against constraints etc.)

The idea is that RUAS-X could delegate control of one or more subsets of the MAB's total range of addresses to some other system, which in turn could delegate control to other systems. There would be no absolute limit on the height (usually called depth) of these hierarchies.

The RUAS maintains the master database, for each of its MABs, of what the mapping, division into micronets etc. actually *is*. This information is used to inform UASes of the current state, which they can convey to end-users and use to check the validity of requests from these end-users. This information is also used to generate snapshot files. As the mapping in the master database is changed, this gives rise to actual changes which must be assembled into MABUSes to be sent to the level 0 Replicators in the near future.

The servers which handle the end-user interaction need to be leaves of this tree structure, so as not to burden the RUAS-X database servers themselves with details of user interaction. This enables various companies to give different kinds of control for the mapping of the SPI space their branch of the tree controls. Figure 1 does not show RUAS-X having any user interface servers, but it could.

The simplest arrangement would be the RUAS having simply a user-interface server and no tree of other UASes.

There would need to be IETF standardised methods by which some server could execute a UMUC with the user-interface servers of any of these UASes.
This standardisation would be especially important for multihoming, because some reasonably trusted company could run an automated monitoring system, and have the credentials (username, password, key etc.) stored in their system so their system can change the mapping of one or more micronets the moment one link was detected to be faulty.

It is vital that there be a standardised method by which all multihoming monitoring companies could send these mapping change commands (and queries about the current state of mapping) to UASes. Also, the company (such as X, Y or Z in Figure 1) which controls a particular range of the Mapped space may offer such a multihoming monitoring system itself.

The tree in this example controls an MAB with the address range 20.0.0.0 to 20.3.255.255. In this example, company X has been assigned by an RIR the entire range 20.0.0.0 to 20.3.255.255. Company X leases to Y a quarter of this: 20.1.0.0 to 20.1.255.255. These divisions are on binary boundaries, but they need not be. It would be just as possible for X to delegate to Y an arbitrary subset of the whole range, or the entire range - or just one IPv4 address or IPv6 /64.

X's Root Update Authorisation Server (RUAS) has a private key for signing all the MAB snapshot files it periodically creates and makes available. The same key would be used for signing the mapping change information for each MAB which is sent to the level 0 Replicators and so to all QSDs.

In this example, company Y delegates control of some of its space to company Z, and Z has an end-user U, who needs to control the mapping of a UAB containing one or more micronets in Z's range.

   User-R   User-S   User-T   User-U    Multihoming
       \        \       |        |      Monitoring
        \        \      |        |      Inc.
         \    .................        /
          \---. Web interface   .-----/
              . other protocols .
              . etc.            .
              ....UAS-Z........
                      |
                      |       Other companies
                      |       like Y and Z
                      |          |     |
                      V          |     |
                    UAS-Y <------/     |
                      |                |
                      V                |
                    RUAS-X <-----------/
         Root Update Authorisation Server
                   company X
                    |     \
                    |      \->-[ Multiple web servers for MAB snapshot ]
                    |
                    |     Other RUASes like RUAS-X, each authoritative
                    |     for mapping one or more MABs and producing
                    |     regular MAB snapshots and update streams
                    |     which are sent to all level 0 Replicators.
                    |        |     |     |     |
                    V        V     V     V     V

   Each line depicts 8 streams of packets with identical payloads - one stream for each of the 8 level 0 Replicators.

   Figure 1: Delegation tree of UASes above one RUAS. Multiple RUASes all driving their mapping updates to every level 0 Replicator. These fan the packets out to hundreds of thousands of QSDs all over the world, in a second or so.

Z has various interfaces by which U can do this, with its own arrangements for authentication, for monitoring a multihoming system and making changes automatically etc. Ideally there might be one or more automated, host-to-server, IETF-standardised protocols so all end users and their appointed multihoming monitoring companies could have standardised software for talking to whichever company's servers they use to control the mapping of their IP address(es).

When user-U (or a device or system with user-U's credentials) changes the mapping of their micronet via a web interface, this is achieved via Z's website, after authenticating by whatever means Z requires. This causes UAS-Z to generate a signed copy of this update command (a SUMUC) and to send it to UAS-Y. This may include multiple commands to be executed in order.

The simplest SUMUC would be a change to the ETR address of an existing micronet.
This would consist of three items (assuming IPv4 for simplicity): a starting address identifying which micronet this update covers, the number of IP addresses covered by the micronet to be changed (>=1) (or alternatively the last address of the micronet), and a new mapping value - a 32 bit ETR address. The SUMUC could also include a time in the future at which the update should be executed. In that case, it would be stored by RUAS-X and sent to the FMS at the appointed time.

Mapping change commands would also include commands to join and split micronets. Sequences of these commands would be sent, in order - and the UAS should check their validity before putting them into a SUMUC.

So a SUMUC consists of one or multiple mapping change commands concerning a particular micronet, or perhaps a set of micronets. The commands will be executed in order, but as if at once. If the SUMUC consists simply of changing a micronet's ETR address, including zeroing it, then this will be applied by every QSD and updates will be sent to any ITRs which need them. Multiple such changes all together in the one SUMUC would cause the same effects, for multiple micronets.

However, if the SUMUC contains a sequence of changes affecting the same SPI addresses, the QSD will update its queriers, which could be ITRs or QSCs, to the final state of the mapping after the changes. For instance a sequence of changes could zero two micronets (set their ETR address to 0.0.0.0) and then join them into one micronet. The resulting micronet could then be split into five micronets and each one mapped to a different ETR address. The QSD may have a querier which is caching the mapping for the first original micronet, but not the other. It will send that querier updates which define the new mapping arrangements for exactly that range of SPI addresses which the original response covered.
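
The clipping of updates to the originally cached range can be sketched as follows. This is only an illustration, not part of the Ivip specification: the names Micronet, updates_for_querier and addr, the representation of a micronet as a start address, a length and an ETR address, and all of the addresses and micronet sizes are assumptions made for this example.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Micronet:
    """Hypothetical record: start address, number of addresses, ETR address."""
    start: int   # first SPI address, as an integer
    length: int  # number of addresses covered (>= 1)
    etr: str     # ETR address; "0.0.0.0" means no mapping

    @property
    def end(self) -> int:
        return self.start + self.length - 1


def updates_for_querier(cached: Micronet, final_state: list[Micronet]) -> list[Micronet]:
    """Clip the final micronets to the range the querier originally cached."""
    clipped = []
    for m in final_state:
        lo = max(m.start, cached.start)
        hi = min(m.end, cached.end)
        if lo <= hi:  # this micronet overlaps the cached range
            clipped.append(Micronet(lo, hi - lo + 1, m.etr))
    return clipped


def addr(a: str) -> int:
    """Convert dotted-quad text to an integer address."""
    return int(IPv4Address(a))


# Two original micronets of 4 addresses each; the querier cached only the first.
cached = Micronet(addr("20.1.0.0"), 4, "192.0.2.1")

# Assumed final state after zeroing, joining and splitting into five micronets.
final_state = [
    Micronet(addr("20.1.0.0"), 1, "198.51.100.1"),
    Micronet(addr("20.1.0.1"), 2, "198.51.100.2"),
    Micronet(addr("20.1.0.3"), 2, "198.51.100.3"),
    Micronet(addr("20.1.0.5"), 2, "198.51.100.4"),
    Micronet(addr("20.1.0.7"), 1, "198.51.100.5"),
]

result = updates_for_querier(cached, final_state)
for m in result:
    print(IPv4Address(m.start), m.length, m.etr)
```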
This avoids the ITR (or the QSC, if that is the querier) having to be told about a larger amount of SPI space than it was told about in the initial reply. As noted previously, these newly defined micronets, each of which will now be in the cache of the ITR or QSC, will be flushed from the cache at the same time as the originally cached micronet would have been.

UAS-Y trusts this SUMUC because it can authenticate UAS-Z's signature. It strips off the signature and adds its own, before passing the SUMUC down to the next level: RUAS-X. RUAS-X likewise has a copy of UAS-Y's public key and, within a fraction of a second of U initiating the UMUC, the master copy of this MAB's database in RUAS-X is altered accordingly. (This would be a distributed, redundant, database system.)

Authority is delegated up the tree, because UAS-Y will only accept update commands if they are signed by one of its branch UASes, and only for the particular address range that UAS has been authorised to control.

User-U may have given their username and password etc. to Multihoming Monitoring Inc. so this company can monitor their multihoming links and change the mapping as soon as one link goes down. UAS-Z doesn't know or care who actually makes the change - as long as they can authenticate themselves for whatever micronet they want to change the mapping of. UAS-Z would keep an audit trail of all interactions, such as those with User-U or Multihoming Monitoring Inc.

5. Common information to be sent by the FMS

In future work I will consider what common information all QSDs need in order to reliably gain the basic information about the current state of Ivip-mapped SPI space. The most important things are the identities of the RUASes, how each RUAS is represented in the 7 bit (for instance) "ruas" field in the FPR header of each packet in the FMS, and the exact details of each current MAB.
This will include which RUAS is responsible for the mapping of which MAB.

One way of doing this is for QSDs to download it periodically via HTTPS from one or several servers which are somehow trusted and operated by either a consortium of the RUAS companies, or by individual RUASes. Another way would be for information such as this to be periodically sent on the FMS itself.

Probably the best way is the downloaded file approach, with a regular schedule by which QSDs would download the latest information each day. MABs could be added to the Ivip system on a day-by-day basis. There's no need to expect QSDs to set up another MAB mapping database on the basis of a command to this effect which arrives on the FMS itself.

Some kind of distributed and secure rsync arrangement is probably a good method of doing this.

6. The Fast Payload Replication system

Please refer to [I-D.whittle-ivip-fpr] for all details of this, the most critical, global, part of Ivip's FMS.

7. Scaling limits

The Replicator system is scalable to any size simply by adding Replicators. Assuming two input streams for each Replicator, N output streams gives an N/2 amplification of stream numbers per level. N could be quite high in the early years of introduction, when the number of micronets and updates is small by comparison with the design target of one to ten billion micronets, with accompanying update rates driven by their use for inbound TE for multihomed non-mobile end-user networks and by mobile devices selecting new TTRs.

First, a maximal IPv4 example will be considered. Assume a billion micronets, most of them for single IP addresses. Presumably most of these will be for individual end-users, at home or with mobile devices. The update rate due to multihoming will be relatively low for the home and office-based micronets.
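
Returning to the N/2 amplification stated at the start of this section, the growth in Replicator numbers per level can be sketched numerically. This is a rough model only: the function names are hypothetical, the choice of N = 20 is an arbitrary illustration, QSDs are assumed to take two feeds just as Replicators do, and the default of 8 level 0 Replicators is taken from the caption of Figure 1.

```python
def replicators_per_level(n_outputs, levels, level0=8, feeds_in=2):
    """Replicator counts for level 0 and each of the following `levels` levels.

    Every Replicator emits n_outputs streams and every downstream device
    consumes feeds_in of them, so each level is n_outputs / feeds_in times
    larger than the one above it.
    """
    counts = [level0]
    for _ in range(levels):
        counts.append(counts[-1] * n_outputs // feeds_in)
    return counts


def levels_to_reach(qsd_target, n_outputs, level0=8, feeds_in=2):
    """Smallest number of Replicator levels able to feed qsd_target QSDs."""
    if n_outputs <= feeds_in:
        raise ValueError("n_outputs must exceed feeds_in for any amplification")
    count, levels = level0, 1
    # count * n_outputs // feeds_in is how many devices the current
    # bottom level can supply with feeds_in streams each.
    while count * n_outputs // feeds_in < qsd_target:
        count = count * n_outputs // feeds_in
        levels += 1
    return levels


print(replicators_per_level(20, 3))
print(levels_to_reach(300_000, 20))
```

In this model, raising N reduces the number of levels needed, at the cost of more outgoing bandwidth per Replicator.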
The update rate due to inbound TE is impossible to predict. Being able to steer traffic dynamically to maximise utilization of multiple links is economically highly attractive. Market mechanisms will tend to set prices for updates which balance competing concerns. If the price is too low, there will be more updates and the FPR system will need to be improved to cope with them - so the price would rise, either to reduce the number of updates or to pay for the upgrades.

It is possible that the RUASes could collectively set prices low enough to make a profit running their operation and many of the Replicators - levels 0 and 1 at least, and perhaps level 2 or 3 too - with a very high volume of TE updates. TE updates are the class of updates with the most elastic demand. Multihoming updates are needed urgently when they are needed, but most of the time, for any one end-user network, none are needed. TTR mobility updates are probably somewhat elastic. If it is expensive to choose a nearby TTR, then people will make do with a distant one for longer, or indefinitely. There is a potentially large market for TE changes, because an end-user network which made lots of them may be able to make much better use of less expensive links.

If RUASes collectively set mapping update prices so low that the volume rose to quite a high level, it is possible that ISPs and end-user networks which run QSDs may feel less and less inclined to accept all these updates - without some financial encouragement from the RUASes who are making money from the updates.

If this grew to the point where those operating QSDs found they had to spend money upgrading their QSDs just to cope with the volume, then there would be the possibility that they could instead program their QSDs to ignore the most frequent updates which had patterns resembling TE updates.
Then, in order for the RUASes to be able to continue charging for these TE updates, the RUASes might need to pay QSD operators to accept such a high level of updates. This would probably be excessively expensive considering the number of ISPs and larger end-user networks which would be running QSDs. So RUASes would be under strong pressure to limit the total rate of updates to a level the great majority of QSD operators are happy with.

The price of updates will not deter their use for multihoming service restoration - and this would represent a small proportion of total updates. Higher prices per update would reduce the number for TE, in a highly elastic manner. Likewise, higher prices per update would cause mobile users (or more directly the TTR companies, who are paying for each update) not to change TTRs as often.

So overall, it is impossible to state with confidence what update rates might be expected.

Even with the entire Earth's population owning a mobile device with its own micronets, if we pick some figure, such as 1000 km, within which there is no significant benefit in choosing a closer TTR, then a WAG (Wild-Ass Guess) could be based on airline passenger numbers. If we assume that each such trip would be long enough to require a new TTR, then we would get some very approximate worst-case figure.

Statistics from the International Air Transport Association [IATA-2009] indicate that commercial airlines carried 2.271 billion passengers in 2008.

I have not been able to find estimates for the number of people travelling large distances by road or train, but it is reasonable to assume these are relatively small compared to the numbers of airline passengers. Most travel by car and train involves trips short enough, with a return trip home, that there will be no need to use a closer TTR during the whole trip. Truck drivers crossing continents might be an exception, but the number of such trips would be small compared to the 2 billion airline passenger figure.
There could be growth in passenger numbers and it is possible that on long trips, the aircraft's satellite link would connect to several ground stations, with the MNs in the aircraft therefore (ideally) changing their mapping to a new TTR near the ground station. (This is explored in [TTR Mobility].)

There are various ways of extrapolating these figures, such as with population growth. For simplicity, I will double the 2 billion figure and use this to roughly include all mapping changes due to multihoming service restoration and TE. So I have a WAG of 4 billion mapping changes a year. This is about 128 updates a second.

The raw data for a change to an IPv6 micronet's ETR address is 32 bytes: 64 bits for the micronet's starting /64, another 64 bits for its length or end, and 128 bits for the ETR address. 128 of these a second is 4k bytes a second - 32kbps. There would be peaks and troughs, and there could be peaks due to a major outage driving many end-user networks to switch ETRs for multihoming service restoration.

This is a low data rate in the scheme of things. VoIP calls typically run at 16, 32 or 64kbps for the actual voice data, plus considerable overhead due to IP and other headers.

If there were 5 or 10 billion mobile devices, each with a micronet, many of these would keep using the same TTR from one year to the next. There would be a mapping change when the micronet was assigned to a given handset, and then another when the handset was no longer used, or replaced by another. So there would also be a significant background level of administrative mapping changes with billions of micronets for mobile devices.

It is hard to imagine a scenario in which the update rate would require prohibitive volumes of data, even by today's standards, for any substantial ISP.
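
The arithmetic behind the estimate of 4 billion mapping changes a year can be checked with a short calculation. This is just a restatement of the figures given above under a 365 day year; the function and constant names are assumptions made for this sketch. The exact rate comes out slightly below the rounded figure of 128 updates a second.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# IPv6 update: starting /64 (64 bits), length or end (64 bits),
# ETR address (128 bits).
IPV6_UPDATE_BYTES = (64 + 64 + 128) // 8


def updates_per_second(changes_per_year):
    """Average update rate, ignoring peaks and troughs."""
    return changes_per_year / SECONDS_PER_YEAR


def raw_bits_per_second(rate, bytes_per_update=IPV6_UPDATE_BYTES):
    """Raw mapping data rate, before packet headers, hashes and signatures."""
    return rate * bytes_per_update * 8


rate = updates_per_second(4_000_000_000)
print(f"{rate:.1f} updates per second")
print(f"{raw_bits_per_second(rate) / 1000:.1f} kbps at the exact rate")
print(f"{raw_bits_per_second(128) / 1000:.1f} kbps at 128 updates per second")
```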
The flow of update packets would be somewhat greater than this raw data rate due to the need for packing them into some kind of robust format, having hashes of them with digital signatures etc. The total amount of mapping data coming into an ISP would be 2 to 4 times this due to the need for feeds from two or more Replicators. Still, by the time such high levels of adoption could occur, the bandwidth they require will surely not present a significant difficulty for any ISP, or for larger end-user networks which want to run their own ITRs and wish to have their own QSDs, rather than relying on the QSDs of their ISPs.

8. Managing Replicators

Replicators should be easy to create and deploy. Any substantial server with the requisite software, in a suitable location, will do the job - but it should be well secured against attackers gaining root access.

A successful system will require some mechanisms which ensure reliable operation with a minimal amount of configuration and ongoing management.

In the current model, each Replicator normally receives feeds from two upstream Replicators, and generates some number N of feeds for downstream devices. Each Replicator should be able to request and quickly gain a replacement feed from another upstream Replicator if one of those it is using becomes unavailable, or unreliable. This requires that Replicators in general be operating below capacity, so that when others in their level fail, they can take up the slack. This needs to be locally configured beforehand, with upstream Replicators of organisations which have agreed to provide the feeds, and with downstream Replicators of organisations which have requested them.

It is possible to imagine a sophisticated, distributed, management system for the Replicator network.
This could be developed over time, since for initial deployment, considerable manual configuration and less automation would be acceptable.

9. Security Considerations

This ID mentions some authentication and security problems and possible solutions to them, but full consideration of security can only occur when the architecture is fleshed out in greater detail.

10. IANA Considerations

For future work.

11. Informative References

[I-D.whittle-ivip-arch]
   Whittle, R., "Ivip (Internet Vastly Improved Plumbing) Architecture", draft-whittle-ivip-arch-04 (work in progress), January 2010.

[I-D.whittle-ivip-fpr]
   Whittle, R., "Fast Payload Replication mapping distribution for Ivip", draft-whittle-ivip-fpr-00 (work in progress), January 2010.

[I-D.whittle-ivip-glossary]
   Whittle, R., "Glossary of some Ivip and scalable routing terms", draft-whittle-ivip-glossary-00 (work in progress), January 2010.

[IATA-2009]
   "Fact sheet: industry statistics", September 2009, .

[TTR Mobility]
   Whittle, R. and S. Russert, "TTR Mobility Extensions for Core-Edge Separation Solutions to the Internet's Routing Scaling Problem", August 2008, .

Author's Address

Robin Whittle
First Principles

Email: rw@firstpr.com.au
URI: http://www.firstpr.com.au/ip/ivip/