Network Working Group R. Whittle Internet-Draft First Principles Intended status: Experimental January 18, 2010 Expires: July 22, 2010 Fast Payload Replication mapping distribution for Ivip draft-whittle-ivip-fpr-00.txt Abstract Fast Payload Replication (FPR) is a technique for fanning out the payloads of individual packets to large numbers of recipients. By trading off efficiency for robustness, the system can be made highly tolerant of random packet loss or loss of connection from some upstream Replicators. FPR is simpler and less efficient than Reliable Multicast or Secure Multicast, but can operate on a global scale over the DFZ. It is a host-to-host arrangement and is independent of routers and network topology. Packets are DTLS encrypted so spoofed packets cannot enter the Replicator system. Since it is not completely robust against packet or link loss, or secure against an attack which compromises a Replicator, the basic FPR should be supplemented with Missing Payload Servers and end-to- end authentication of received data in order to make an entirely robust and secure system. FPR is being developed as part of a global fast-push mapping distribution system for the Ivip core-edge separation scalable routing architecture. It should be able to fan out information to hundreds of thousands of recipients, worldwide, in less than a second. FPR may have other applications. Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." 
The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

Whittle Expires July 22, 2010 [Page 1] Internet-Draft Fast Payload Replication January 2010

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on July 22, 2010.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the BSD License.

Table of Contents

   1.  Introduction
   2.  Goals
      2.1.  Potentially high data volume
      2.2.  Independent of routers and network structure
      2.3.  Simple UDP (DTLS) only operation
      2.4.  Good but not perfect robustness
      2.5.  Flexible trade-off of efficiency for robustness
      2.6.  Robustness against DoS may be achieved with private network links
   3.  Non-goals
      3.1.  Not intended to provide end-to-end security
      3.2.  No autodiscovery or monitoring
      3.3.  No attempt to automatically adapt to varying PMTUs
      3.4.  Not intended for mass market consumer applications
   4.  Streams of packets for Replicators and QSDs
   5.  Packet payloads and identification
   6.  The Fresh vs. Repeat Algorithm
   7.  RUAS functionality
   8.  Replicator Functionality
   9.  QSD Functionality
   10. Further elaborations
      10.1.  Missing Payload Servers (MPSes)
      10.2.  Delaying the output of Replicators
      10.3.  Private network links to avoid DoS attacks
   11. Security Considerations
   12. IANA Considerations
   13. Informative References
   Author's Address

1.  Introduction

This is a fresh document written quickly in an effort to support the RRG debate - so please excuse the lack of sub-headings and any rough spots.

This ID explores a combination of techniques which, as far as I know, is novel. It may be useful for applications other than the one it was developed for: the central part of a fast-push mapping distribution system, on a global level, for the Ivip core-edge separation architecture. [I-D.whittle-ivip-arch] [I-D.whittle-ivip-db-fast-push]

Ivip is intended to solve the routing scaling problem for both IPv4 and IPv6 - whilst also being a good basis for the TTR Mobility architecture. A system such as Ivip could also support TTR mobility, irrespective of its support for routing scalability.
[TTR Mobility]

FPR stands for "Fast Payload Replication", although "Directed Flooding Packet Payload Replication" would also be appropriate. (It seems that the acronym "FPR" is only used in the IETF for "Frame Policing Ratio" in [RFC3133].)

FPR is intended to be suitable for a business environment where multiple companies combine to create a shared infrastructure, with no single point of failure.

The units of FPR data replication are payloads contained within individual UDP packets. The simplest way to implement FPR will probably be to use DTLS [RFC4347] encryption and authentication between the sources of packets, the Replicators, and the destination devices which use these payloads. It would also be possible to use the IPsec Authentication Header or a specifically written authentication arrangement to protect these devices from accepting spoofed packets. For simplicity, the use of DTLS is assumed in the following discussion.

While the following discussion of FPR does address other possible uses, the focus is on FPR as a component of the Ivip fast-push mapping distribution system. In this application, the final destinations of the payloads are full mapping database query servers, known as QSDs (Query Server with a full Database), and the sources of the packets are multiple devices known as RUASes (Root Update Authorization Servers).

For simplicity, IPv4 is assumed in the examples, but FPR is equally suitable for use with IPv6.

In a later section I discuss another network element - a Missing Payload Server (MPS). This would extend the basic FPR system of Replicators to provide the QSDs with a distributed set of servers from which to obtain payloads which did not arrive in packets from the QSD's multiple upstream Replicators.
   [Figure 1: diagram layout not reproduced. Its annotations are listed below.]

   RUASes:   [RUAS-1] is one of 20 or so RUASes, each of which drives its packets to all three level 0 Replicators. Each level 0 Replicator receives a complete set of data to be sent to hundreds of thousands of QSDs.

   Level 0:  Three fully meshed level 0 Replicators ([R0-0], [R0-1], [R0-2]) drive each other and each send streams to 20 level 1 Replicators. Each stream contains packets with payloads from all RUASes.

   Level 1:  30 level 1 Replicators ([R1-00] to [R1-29]) each receive two streams of packets from upstream level 0 Replicators. Each drives 20 streams to level 2 Replicators.

   Level 2:  300 level 2 Replicators ([R2-000], [R2-001], [R2-002], etc.). Each drives 20 streams to level 3 Replicators.

   Level 3:  3,000 level 3 Replicators.

   Level 4:  30,000 level 4 Replicators (for example [R4-10472] and [R4-27610]).

   QSDs:     300,000 QSDs each receive two streams from level 4 Replicators.

   Figure 1: Five levels of Replicators drive hundreds of thousands of QSDs.

Figure 1 depicts a system with 3 level 0 Replicators, 30 level 1 Replicators, 300 level 2 Replicators etc. MPSes (Missing Payload Servers) are not shown. The Ivip system would probably involve 5 to 8 level 0 Replicators, but 3 makes for a clearer diagram.

The amplification factor is how many output streams a Replicator sends divided by how many it receives.
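As a worked check of the numbers in Figure 1, the following minimal Python sketch computes how many devices each level can feed when every Replicator sends 20 streams and every downstream device receives two. The function name and its parameters are illustrative assumptions only; nothing here is part of FPR itself.

```python
def devices_per_level(level0, outputs, inputs, levels):
    """Return device counts for level 0 and each level fed below it.

    level0:  number of level 0 Replicators
    outputs: streams sent by each Replicator
    inputs:  streams received by each downstream device
    levels:  how many downstream levels to compute
    """
    counts = [level0]
    for _ in range(levels):
        # Total streams sent by the level above, shared out so that
        # every downstream device receives `inputs` of them.
        counts.append(counts[-1] * outputs // inputs)
    return counts


# Figure 1: 3 level 0 Replicators, 20 streams out, 2 streams in.
# Levels 1 to 4 are Replicators; the final entry is the QSDs.
counts = devices_per_level(level0=3, outputs=20, inputs=2, levels=5)
```

With these inputs the amplification factor is 20 / 2 = 10 per level, which gives the 30, 300, 3,000 and 30,000 Replicators and the 300,000 QSDs shown in the figure.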
The amplification factor will vary depending on local choices, but I have shown all Replicators producing 20 streams and all level 1 and greater Replicators consuming two. The amplification factor per level may be higher than this, so fewer levels may be required to drive the required numbers of QSDs.

In the long-term future, it is possible that hundreds of thousands of QSDs in ISP and larger end-user networks will receive mapping updates, in order to serve the ITRs in those networks. This ID explores the design of such a large-scale system. When Ivip is introduced, the whole system would be much simpler, such as with 3 level 0 Replicators, higher amplification factors due to the initially low rate of updates, and fewer levels due to this and the initially lower number of QSDs to be driven.

The principle is that the level 0 Replicators are fully meshed and (except for any lost packets, dead links or failure in one of these Replicators) receive the full set of packets to be sent out to all level 1 Replicators. If a level 1 Replicator has a packet missing from one of its upstream level 0 Replicators, then it will usually be able to obtain the same payload from the equivalent packet from its other upstream level 0 Replicator.

A level 1 Replicator which is missing a packet from both its sources will not be able to send this packet's payload to its 20 downstream devices. However, depending on how the cross-linking is structured, generally those Replicators will obtain the payload that packet carried from the equivalent packet from their other source.

This principle continues to the end - where the QSD is generally able to cope with a missing packet by using the payload of the equivalent packet from the other upstream level 4 Replicator.

The maximum length of the packets needs to be chosen so as not to violate any Path MTU from one Replicator to the next, and to the final recipient devices.
The data to be carried needs to be split into individual blocks, one for each payload, of around 1300 bytes. This suits Ivip reasonably well, since it will frequently be the case that an RUAS has no updates to send in a given period of time, such as 0.3 seconds, and sometimes it may have enough updates to fill several payloads. Consequently, the RUAS only sends a packet when it needs to, and the length and number of the packets reflect the amount of data to be sent.

Ivip mapping updates apply to a particular MAB (Mapped Address Block): a DFZ-advertised prefix which encompasses a block of SPI (Scalable Provider Independent) address space. Each MAB is split into many (potentially hundreds of thousands of) arbitrary-length ranges of address space called "micronets". Each micronet is mapped to a single ETR (Egress Tunnel Router) address. The most common mapping update is to change the ETR address of an existing micronet. Other updates join and split micronets, or announce that the RUAS which controls this MAB has made a snapshot of the full state of the MAB's mapping - which a QSD can download for the purpose of initialising its copy of the mapping database, or to overcome any errors which have somehow accumulated in the section of it which concerns this MAB.

Within a short period of time, such as 100ms, there may be a series of mapping updates concerning a particular MAB. All mapping updates for a given MAB always arrive in payloads from one RUAS. If a complete series arrives in a single payload, then this is relatively simple. If this payload is missing from the streams of packets which arrive from the QSD's two or more upstream Replicators, then the QSD will obtain the missing payload within a few seconds from an MPS. Then this series of changes will be applied to the MAB, and the slight delay will be of no consequence.
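The splitting of update data into payload-sized blocks described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the RUAS packing method: each record is an opaque byte string standing in for one encoded mapping update, no record is ever split across payloads, and the 1300 byte limit is the approximate figure given above. The name pack_updates is hypothetical.

```python
def pack_updates(records, max_payload=1300):
    """Greedily pack encoded update records into payload-sized blocks.

    records:     iterable of bytes objects (hypothetical encoded updates)
    max_payload: upper bound on each block, in bytes (assumed ~1300)
    Returns a list of bytes objects, one per packet payload.
    """
    payloads = []
    current = b""
    for rec in records:
        if len(rec) > max_payload:
            # This sketch does not fragment a single oversized record.
            raise ValueError("record larger than one payload")
        if current and len(current) + len(rec) > max_payload:
            # The next record will not fit, so close the current block.
            payloads.append(current)
            current = b""
        current += rec
    if current:
        payloads.append(current)
    return payloads


# Three 600 byte records: two fit in the first payload, one spills over.
blocks = pack_updates([b"a" * 600, b"b" * 600, b"c" * 600])
```

An empty input produces no payloads at all, matching the point that an RUAS only sends a packet when it has something to send.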
If the series of changes spans two or more payloads, and one or more of these is missing, then the situation is more complex. The QSD generally can't apply any changes to the MAB until it has the full series which was transmitted effectively at the same time. This is because the series may involve zeroing the mapping of micronets, splitting and joining them and then setting the mapping of the resulting micronet to a new ETR address. The series of changes is best applied all at once, so the QSD needs to buffer this MAB's changes and only apply them once it has the missing payloads.

In either case, where updates to the mapping of one MAB are delayed, the QSD must buffer subsequently received updates and apply them in order. This is for the same reason: that some updates alter the structure of the micronets and so cannot be applied out of order.

This illustrates that the Ivip system will work fine with all or most updates being applied within a second or so of them being sent by the RUAS, but that in the small number of cases where there is a few seconds delay due to a missing payload, this is of no serious negative consequence.

No end-user network absolutely relies on ITRs changing their tunneling behavior within a second or two of a mapping update being sent. It suffices that most will do this, and on rare occasions some ITRs will change a few seconds later. Even if all changes to ITR tunneling behavior were suspended for a minute or two, or some ITRs lagged behind the rest by a few minutes, no serious harm would occur. The worst outcome would be a consequent delay in multihoming service restoration (which for a minute or two is highly undesirable, but not disastrous) and likewise delays in traffic engineering changes or packets to mobile devices being sent to a new TTR, rather than the old one.
In the mobile case, this simply means the mobile node needs to maintain its tunnel to the old TTR for a few minutes longer than it otherwise would. (Such changes in mapping only occur if the mobile node moves 1000km or so - not every time it gains a new access network or IP address.)

The foregoing discussion illustrates that it is good enough for FPR to generally be very fast and robust, but for it to sometimes involve delays of a few seconds, or more rarely, a few minutes.

The total data-rate of the "complete stream" of packets handled by the FPR system needs to be carefully bounded in order to ensure that no device in the system will be overloaded.

In Ivip, there will be multiple, asynchronous, independent sources of packets which drive the input to the "level 0 Replicator" part of the system. In this ID, these are assumed to be a set of RUASes (Root Update Authorization Servers). The exact number of these is not particularly important, but in practice there might be a few dozen. These would need to coordinate their packet rate to ensure that at no instant is the FPR system expected to carry more than some maximum data rate of packets.

In the fully designed Fast-Push Mapping System, there may be an additional source of packets which is not an RUAS. Nonetheless, the discussion below, which anticipates 20 or so RUASes, is sufficient to explore the current FPR design.

FPR is secure against an attacker sending spoofed packets, since these will not pass the DTLS software which accepts them into each Replicator and QSD.

FPR is not secure against an attacker gaining control of one or more Replicators. In order to achieve end-to-end integrity, the final recipient device (in Ivip, a QSD) will need to be able to authenticate each payload worth of mapping data, or larger bodies of mapping data assembled from multiple payloads. This will probably be
This will probably be Whittle Expires July 22, 2010 [Page 9] Internet-Draft Fast Payload Replication January 2010 via a public key signature of that payload or larger body of data, which is included in the payload data stream - using the public key of the RUAS which sent the payloads. Likewise, in order to be able to ensure confidentiality against an attacker who can snoop packets being sent by Replicators, the application data must be protected by encryption. In the present design, DTLS both encrypts and authenticates the payload of each packet. Confidentiality is not required for Ivip. (I have not yet determined if there is a DTLS mode which supplies only authentication.) Since FPR is not absolutely robust against link loss or random packet loss, a complete system which delivers data entirely robustly must supplement FPR with some method by which the recipient can request missing packets. This could be external to the FPR system, but in a later section I explore the possibility of integrating Missing Payload Servers into the FPR system of Replicators. Forward Error Correction is another method of coping with a certain level of lost packets, but this involves considerable complexity and overhead. It also involves handling data in long block lengths which are not suitable for Ivip. FPR's use in Ivip is a critical part of the core-edge separation system. The FPR part of the mapping distribution system is a single global-scale system and it is intended to run reliably, continually, delivering data to potentially hundreds of thousands of QSDs all over the world. A halt in its operation for a few seconds or minutes would not be disastrous, but would delay multihoming service restoration, mapping changes for inbound TE, and the ability of mobile nodes to choose a closer TTR. The Ivip system won't fail if the FPR system stops for minutes or even tens of minutes. 
Nonetheless, the FPR system is intended to operate continually, indefinitely, despite its individual component Replicators being taken in and out of service and the connections between them being changed from time to time.

FPR is most suited to an application which requires very fast replication of information, including perhaps on a global scale, where it is important that the data generally arrive quickly, but where occasional lost packets and consequent delays obtaining replacements will not be a problem.

FPR is much less efficient than ordinary multicast, in which a single stream is replicated into multiple streams at one or more points in the distribution system. That efficiency is traded off directly for greater robustness against packet loss.

FPR does not rely on conventional multicast protocols or router capabilities. It should be possible to implement an FPR Replicator as a user space daemon on any server.

There is nothing particularly surprising about the outcomes of the FPR arrangement. However, this combination of capabilities is a crucial component of Ivip. FPR's capability to convey "mapping" information in essentially real-time from end-user networks (or entities they authorise to control their mapping) to hundreds of thousands of QSDs in ISP and large end-user networks all over the world enables Ivip to achieve at least two major benefits compared to other core-edge separation systems, most prominently LISP [I-D.ietf-lisp].

Firstly, there is no need for ITRs (Ingress Tunnel Routers) to choose between multiple ETRs (Egress Tunnel Routers) - since the end-user network is able to control the ITR tunneling behavior in real-time. (ITRs receive mapping from QSDs in response to queries, and QSDs send updates to ITRs if they receive changed mapping via the FPR system.)
Secondly, this modularly separates control of ITR behavior from the core-edge separation scheme itself, enabling the end-user network to control the ITRs for whatever purposes they desire, and with whatever techniques and information they employ.

Without a real-time global mapping distribution system, the other core-edge separation architectures to date cannot control ITRs directly, and so must build all the system's reachability testing and decision-making capabilities into each ITR and give end-users control via more complex mapping which includes multiple ETR addresses.

This discussion of core-edge separation architectures is of no direct relevance to FPR as a subsystem. However, it illustrates that FPR's particular capabilities are crucial to being able to make some attractive architectural choices in a core-edge separation scheme.

Reliable Multicast [RFC2887] would not be as suitable, since it involves a single stream of packets, whereas in Ivip, FPR will fan out multiple independent streams, one from each of 20 or so RUASes.

Reliable Multicast involves long blocks of data for its Forward Error Correction arrangement, which would introduce delays in the sending and receiving of application data. In Ivip, this would delay and very much complicate the reception of data, particularly when the data rate from each RUAS is low.

Neither Reliable Multicast nor Secure Multicast [RFC3740] is robust against lost packets and dead links, while FPR can be used in a way which gives a much higher degree of robustness against these.

As far as I know, most multicast protocols assume the use of routers at specific parts of the network. FPR is intended to operate without reference to routers or the address structures inherent in networks. FPR Replicators can be implemented in servers on arbitrary stable global unicast addresses.
The structure of the links between Replicators has no reliance on network topology and can be arbitrarily chosen. FPR is intended to work reliably with links across the DFZ and so be able to scale well to a global distribution of recipient devices.

In summary, FPR is relatively simple and may complement established multicast protocols rather than exceed their performance in the applications they are best suited to. At least with Ivip, FPR's apparently unique capabilities will enable a larger system to be designed in ways which would not be possible - or at least not as easy - with existing techniques.

I intend to write the requisite software for FPR - code for a Replicator - later in 2010.

2.  Goals

2.1.  Potentially high data volume

FPR should be able to handle relatively high data volumes. The limiting factor with DTLS is likely to be the ability of Replicator software to send the output streams each with its own DTLS protection. If a customised authentication arrangement were used instead, then each Replicator could send essentially identical packets to all its downstream devices, saving on the separate cryptographic processing of each stream which would be inherent in DTLS or IPsec.

At present, I am unsure of the efficiency of using DTLS to produce large numbers of output streams of packets, such as 20 or 50. With 4-core 64-bit CPUs clocked at close to 3GHz, I would not be surprised if a modern COTS (Commercial Off The Shelf) server could fill a gigabit Ethernet link. However, this remains to be determined. The average data rates per stream with Ivip would be fractions of a megabit per second, even with the largest imaginable deployment.

2.2.  Independent of routers and network structure

While FPR could be implemented in routers, it is intended to be implemented as software in a server.
The use of DTLS means that a user space daemon with inbuilt DTLS capabilities can be operated on any server, since there are no special demands on the operating system.

In a global system, Replicators can be on any stable global unicast address. In a private network the addresses need only be stable. In all cases, the passage of packets between Replicators is controlled directly and in no way depends on the topology or addressing structure of the network. This makes FPR suitable for a global packet replication system with links across the DFZ.

2.3.  Simple UDP (DTLS) only operation

Since DTLS sessions are set up via the same UDP ports which are used for data transfer, the entire Replicator could use a single UDP port.

This should facilitate recipients being behind NAT, since the recipient device makes the DTLS link to its upstream Replicators. Replicators themselves cannot be behind NAT, since the DTLS session could not be established to them. Replicators, at least for Ivip, are generally meant to be at well-connected data centers where the multiple links to other data centers can be used to ensure physical diversity of the streams being sent to any one Replicator.

2.4.  Good but not perfect robustness

If a recipient receives streams of packets from two upstream Replicators, and both of these feeds are disrupted in some way, then the recipient will not get some packet payloads. FPR has no NACK, but in a later section I discuss a system of "Missing Payload Servers".

The aim is to make FPR delivery of packets for any recipient with reasonably good network links (such as by the streams arriving via two physical links from Replicators with different topological locations) highly robust against individual packet losses or the failure or unreachability of an upstream Replicator.
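To give a rough feel for why several diverse feeds help, the sketch below assumes (as a simplification which the text does not make) that each feed loses any given packet independently and with the same probability. Under that assumption a payload is missed only when every feed loses it. The function name and the example loss rate are illustrative; real losses on shared links would be correlated and the benefit correspondingly smaller.

```python
def miss_probability(per_feed_loss, feeds):
    """Probability that a payload is lost on every one of `feeds` feeds.

    Assumes independent, identically distributed loss per feed, which
    is an idealisation of well-separated physical links.
    """
    if not 0.0 <= per_feed_loss <= 1.0:
        raise ValueError("per_feed_loss must be a probability")
    if feeds < 1:
        raise ValueError("at least one feed is required")
    return per_feed_loss ** feeds


# Hypothetical 1% loss per feed, with two feeds and with five feeds.
two_feeds = miss_probability(0.01, 2)
five_feeds = miss_probability(0.01, 5)
```

The residual misses are the cases which the Missing Payload Servers discussed later are meant to cover.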
The purpose is to make missing payload recovery a rare enough event that the occasional delays and extra traffic it involves are not significant problems. There can be no perfectly robust system, of course, in the event that all links from outside sources are disrupted at the same time.

2.5.  Flexible trade-off of efficiency for robustness

By choosing how many input streams each Replicator or recipient device has, and by choosing these to arrive from Replicators near and far (geographically and topologically), it should be possible to achieve a wide range of compromises between efficiency and robustness.

These choices can be made at a local level, for each particular Replicator or recipient. While the discussion below generally assumes each will receive two feeds, it will be possible to configure them to receive more than this number of feeds.

For instance, a Replicator or recipient could be given five feeds, each arriving over a different physical link, each from a Replicator whose location in the network is topologically different from the others. Each such upstream Replicator should, ideally, have feeds from other upstream Replicators which are at least partially diverse with respect to each other. Then, the ability of the recipient to receive all packet payloads could be extremely robust against random packet losses and against outages in routers, data-links and other Replicators - at the cost of requiring outputs from more upstream Replicators and paying for the bandwidth of their multiple incoming streams.

2.6.  Robustness against DoS may be achieved with private network links

No device on the open Internet can be reliably protected against a flood of packets generated by botnets.
In order to minimise the damage such an attack could do to an FPR system, the higher level Replicators (closer to level 0, and including level 0) of the inverted tree structure would need to be linked by private network links.

At some point in the Replication hierarchy, where Replicators are sufficiently numerous, the links to the next level (numerically higher, but lower in the inverted tree) could be carried by the public Internet. A DDoS attack with a given bandwidth capacity would only be able to affect a subset of the Replicators at that level, depending on how many there are at that level and whether their input capacity is 100Mbps or 1Gbps. Depending on all the factors, it would be possible to ensure that even the largest botnet DoS attacks have little impact on the delivery of data to recipients, if there are one or more levels of Replicators below this.

Private network links between Replicators are a significant expense, but will probably be justified for Ivip in order to ensure this critical piece of Internet infrastructure can only be partly affected by the largest DoS attacks.

3.  Non-goals

3.1.  Not intended to provide end-to-end security

While Replicators only receive feeds from upstream Replicators they are configured to use, and which accept their credentials, and while DTLS protects the payloads of packets between the Replicators and from the Replicators to the recipient devices, FPR does not provide end-to-end security against either alteration of the data or snooping of its contents.

This is because the recipient has no way of knowing that all the upstream Replicators it relies upon are not under the control of an attacker.
A single such compromised Replicator could drive packets to most or all of its downstream Replicators by sending out packets with the identification numbers expected from the genuine source a little earlier than the genuine packets.

The use of DTLS to protect packets sent from Replicators to other Replicators and to recipient devices is intended primarily to prevent any of these accepting a spoofed packet generated by an attacker who does not control any Replicators. This protects against attackers injecting their own packets with bogus payloads.

3.2.  No autodiscovery or monitoring

The current description is for the basic functions of Replicators and, later, Missing Payload Servers. In some applications it may be desirable for the Replicators to automatically choose their upstream and downstream Replicators. In almost any practical system, some kind of diagnostic functions would be needed in order to evaluate performance and debug problems. Such capabilities are for future work.

3.3.  No attempt to automatically adapt to varying PMTUs

To be deployed across today's DFZ, all packets would need to be less than 1500 bytes long. I will assume 1470 bytes, for convenience, as a PMTU which can reasonably be expected in any DFZ path, because I have observed Google servers sending unfragmentable packets of this length. [DFZ-unfrag-1470]

The FPR system of Replicators has no PMTUD capabilities - and any PMTU problem encountered by a packet will not result in an RFC 1191 ICMP "fragmentation needed" (Packet Too Big) message being sent beyond the upstream Replicator which sent the packet. Replicators would ignore such a message.

The Missing Payload Servers receive streams of packets just like Replicators and QSDs, so they need to be located where there are no local PMTU restrictions which would prevent the reception of packets of the chosen maximum length.
Missing Payload Servers communicate with each other, and handle requests from QSDs, via TCP - which does not involve any special MTU constraints.

In Ivip, it is likely that some Replicators, Missing Payload Servers and QSDs will be located in end-user networks which use SPI (Scalable Provider Independent) addresses. Packets addressed to SPI addresses will pass through an ITR and ETR. (Replicators may include an inbuilt ITR function so the packets they send don't have to go to any separate ITR.) If encapsulation is the method used for ITR to ETR tunneling then for IPv4, this involves a 20 byte IP-in-IP header. So the maximum length of a packet which could be handled by the FPR system in this scenario - a UDP packet with DTLS header and payload - is 1470 - 20 = 1450 bytes.

In an ISP or end-user network today where gigabit Ethernet interfaces are always used and where all MTUs support ~9kbyte jumbo-frames, it would be possible to run an FPR network with ~9kbyte packets.

If a 1450 byte FPR system was successfully operating over the DFZ, at some time in the future, when all DFZ paths and likewise paths between all Replicators and recipients could support ~9kbyte packets, there could be a transition to using these larger packets. Replicators will handle ~9kbyte packets and in principle the same Replicators could begin handling the larger packets without any need for reconfiguring the entire system. If the numbering systems by which the packet payloads are identified did not overlap, and if the Replicators had the capacity, the same system of Replicators could handle the 1450 byte packets and ~9kbyte packets simultaneously. These larger packets would involve a different way of splitting up the data to be transmitted.
Recipient devices (QSDs) may have software which copes automatically with different packet formats, but a more likely scenario is that the switch to jumboframes in the future would be accompanied by somewhat different ways of carrying the data - and so by the need for updated recipient software.

3.4. Not intended for mass market consumer applications

At each point - a Replicator, Missing Payload Server or QSD - redundancy is bought by increasing the incoming bandwidth, according to how many upstream Replicators are used. This is expensive for high data-rate applications and so FPR is not intended as a system for delivering audio or video material to mass-market end-users.

It is intended for recipients in ISP networks where the two or more feeds can be chosen to arrive via different physical links, different peering points and different border routers - so the physical diversity available in these settings can be directly employed to provide increased robustness.

Since FPR lacks PMTUD capability, it is best used in scenarios where the location of Replicators and recipients is stable and carefully planned, with regard to any PMTU limitations which may affect them.

4. Streams of packets for Replicators and QSDs

All packets discussed below are those which pass the DTLS authentication process, and are presented to the FPR code as DTLS payloads, each consisting of an FPR header and FPR payload.

(For simplicity, much of this discussion assumes that these packets are only received by Replicators and QSDs. However, Missing Payload Servers will also receive streams of packets, in exactly the same way.)

In all cases, the receiving device does not distinguish between packets which arrive from one incoming stream and those which arrive from another.
This information is available from the DTLS software, but is not important to how the device processes the payloads of each incoming packet.

In this discussion the term QSD (Ivip full-database query server) is used to denote the devices which receive the packets and put their payloads to use, rather than sending the payloads to others, as Replicators do. This helps explain the FPR system's role within Ivip. If FPR was used for another purpose, the packets would be received and used by some other device.

It would be possible for a single device to function as both a Replicator and a QSD. This may make sense during initial Ivip introduction. However, in a fully deployed Ivip system, with the QSD handling many requests from ITRs (directly and via caching QSCs) and with the QSD having a significant workload receiving the packets and processing them to update its database, separate servers for the QSD and Replicator functions would be the best approach.

The Replicator and QSD code would share some common elements - for the reception and processing of incoming packets. Replicators and QSDs are both required to receive multiple streams of packets. While they may operate with a single stream, two would be a typical number to receive and they may be required to receive many more. These statements apply also to the code for the Missing Payload Server.

A QSD or Missing Payload Server only receives streams from upstream Replicators. Two streams would be a typical number, but perhaps as many as 5 could be used to maximise robustness.

A level 1 or greater Replicator receives typically two or more streams from upstream Replicators in the numerically lower numbered level - which is "above" (upstream) in the inverted tree structure.

Level 0 Replicators receive streams from one or potentially many sources of packets. In Ivip, the sources are multiple RUASes.
They also receive a stream from every other level 0 Replicator.

The FPR system handles the sum of the unique payloads sent by all RUASes. For instance, in a given time period such as 100ms, RUAS-0 sends streams of packets to all five level 0 Replicators, with each stream containing 7 packets with a set of 7 unique DTLS payloads (FPR headers and FPR payloads). While the packets received by one level 0 Replicator are all different from those received by another, due to DTLS encryption, each level 0 Replicator receives 7 packets from RUAS-0, and the DTLS payloads of the packets received by one level 0 Replicator are identical to the DTLS payloads received by every other level 0 Replicator.

The purpose of RUAS-0 sending five streams containing the same DTLS payloads, one stream to each level 0 Replicator, is to maximise the fault tolerance of the system. If one or two level 0 Replicators are down, or if they can't be reached from RUAS-0, then there will be no loss of data being sent to the QSDs. Even if RUAS-0 was only able to send its 7 packets to a single level 0 Replicator, or if a single set of 7 was sent to various level 0 Replicators (such as packet 0 to R0-0, packet 1 and 2 to R0-3 and the rest to R0-5) then the system would still deliver all payloads to the QSDs.

This is due to the level 0 Replicators being "fully meshed". Every one has an output stream to every other one. So as long as at least one packet with a given payload arrives at any level 0 Replicator, within a fraction of a second, all other level 0 Replicators will receive it as well.

In all cases, the receiving device (Replicator or QSD) establishes the DTLS session with the source of the packets. To continue with the five level example of Figure 1, QSDs establish their DTLS sessions with level 4 Replicators. Typically two would be a good choice, but more could be used.
The QSD is configured to use particular level 4 Replicators and the DTLS session can only be established if each level 4 Replicator accepts the username and password provided by the QSD.

Similarly, Replicators at levels 4, 3, 2 and 1 establish DTLS sessions with Replicators at the level above (numerically one less). Each layer 0 Replicator establishes a DTLS session with each other layer 0 Replicator, and with each RUAS.

In this discussion, it is assumed that there is a strict layering of Replicators. While layer 0 is fully meshed, there is no meshing of other layers - no layer 3 Replicator receives a stream of packets from any other layer 3 Replicator. Also, no Replicator at levels 1 or greater is shown accepting a stream from Replicators at any level other than the one above. The diagram shows the Replicator system ending at level 4, and with the next level being composed entirely of QSDs.

There is nothing to prevent a QSD being driven partly or wholly by streams from Replicators in levels other than 4. Nor is there anything to prevent a Replicator getting some of its streams from levels other than the one above. For instance, it would be possible, in principle, to cross-connect all level 3 Replicators. However, due to their large number (3,000) this would be impractical and inefficient.

The strict layering of Replicators is not absolutely required, and it may make sense to have a Replicator driven by streams from a level 2 and a level 3 Replicator. It would also be possible to take a stream from a level 4 Replicator and feed it to a level 3 or 2 Replicator. This cannot result in the equivalent of "routing loops", since most or all of the packets which arrive from this link will contain payloads which the level 3 or 2 Replicator has already received - so those packets will not lead to any further action.
The advantage of taking such a stream from a level 4 Replicator which depends on different level 3 or 2 Replicators than those which feed the receiving Replicator directly is diversity. If the streams from the directly used level 3 or 2 Replicators are disrupted, it is unlikely that there will be the same disruption in the stream received from the topologically distant level 4 Replicator.

I have presented FPR in a strictly layered arrangement because this is easier to depict and is theoretically the most efficient way of fanning out information. However the details of connections between Replicators and QSDs are not technically constrained by the FPR system, and can be chosen freely to trade off bandwidth and computing resources for robustness according to local conditions.

For instance, if a level 3 Replicator in Sydney Australia has incoming streams from level 2 Replicators in Sydney and Singapore, analysis of the connections might reveal that these two operate from three or four level 1 Replicators which are not ideally diverse in a topological sense, with their origins being mainly in the USA. Assuming that the higher level Replicator outputs are more difficult to obtain access to than those of the lower levels, it would be possible to have a third stream feed this level 3 Replicator, from a level 3 or 4 Replicator in Russia, which has most of its incoming streams arriving from Europe.

Typically, the packets arriving from the Russian Replicator would arrive later than those from the higher level Sydney and Singapore Replicators, and so would be ignored. However, if there was a network outage which affected both the Sydney and Singapore Replicators, even for a fraction of a second, the payloads in the packets arriving from Russia would be used automatically.
Each stream sent to a QSD, Missing Payload Server or a level 1 or greater Replicator is, under ideal circumstances (no packet loss), a "complete stream" in that its packets contain a complete set of DTLS payloads, each of which every QSD will, ideally, receive at least once. This is also true of the streams each level 0 Replicator receives from other level 0 Replicators.

If there is a single external source of packets, then ideally, that source will send a separate "complete" stream of packets to every level 0 Replicator. Due to the fully-meshed flooding arrangement of the level 0 Replicators, then - assuming there were no packet losses - it would suffice for the single external source to send a single complete stream to just one level 0 Replicator. Alternatively, the single source could send a complete stream in various subsets, each to a different level 0 Replicator.

When the FPR system is used in Ivip, it is intended to receive packets from multiple external sources - each an RUAS system. Ideally, every RUAS will send its subset of the complete stream to every level 0 Replicator. In this case, the "complete" stream is the sum of all packets sent by all external sources. In fact, it would suffice (assuming again no packet losses) for each external source to send just a single set of packets to just one level 0 Replicator, or scattered to various level 0 Replicators, because the payload of each packet will flood to all the other level 0 Replicators.

At the highest level - level 0 - the FPR system involves brute-force flooding and fully-meshed redundancy to ensure that in ordinary circumstances every level 0 Replicator receives the "complete stream" - either directly from the one or more external sources, or from its level 0 peers.

For a global, real-time, system such as Ivip, I anticipate that 4 to 8 level 0 Replicators would suffice.
Each would be in a geographically and topologically different location, and they would all be meshed by private network links which would, ideally, be geographically and topologically diverse. Section 2.6 above discusses the use of private network links to protect against DoS attacks.

5. Packet payloads and identification

Replicators and QSDs decipher each received packet from upstream Replicators (or for the level 0 Replicators, from RUASes and other level 0 Replicators) and use the first 32 bits of the DTLS payload to decide what to do with the entire DTLS payload. The options are to use it because it is deemed to be a "Fresh" payload - a Replicator replicating it, or a QSD using the payload to update its database - or to "ignore" it, because the device has already received a packet with the same payload, meaning it is deemed to be a "Repeat" payload.

Some or potentially all bits of the FPR header are used by the Replicator or QSD to decide whether it has already received a packet with the same payload. This section describes this process in principle, and the next describes one way this process could be implemented in Replicators, Missing Payload Servers and QSDs.

FPR's units of replication and flooding are payloads of packets. If IPv4 packets are limited to 1450 bytes then there are 1422 bytes available after the 20 byte IP header and 8 byte UDP header. If the DTLS header involves an overhead of 50 bytes then 1372 bytes remain as the "DTLS payload". (I have not yet researched DTLS in sufficient detail to determine exactly what the overhead would be. This would depend on choice of encryption algorithm and Message Authentication Code.)

At the start of the DTLS payload, a fixed number of bits must be devoted to identifying the packet. I will refer to this as the "FPR header" and for now assume it is 32 bits.
The remainder of the DTLS payload is the "FPR payload" - and is available for application data. With these assumptions, the application data is contained in the 1368 byte FPR payload.

The FPR software in the QSD (or other recipient device, if FPR is used for a purpose other than Ivip) should make the FPR header bits available along with the FPR payload, since they may be helpful in processing the FPR payload.

In the current design, the only function of the "FPR header" is to enable each Replicator and QSD to decide whether each incoming packet, once deciphered from its DTLS form, is either a "Fresh" or a "Repeat" packet within the timeframe T. The exact algorithm for this decision is described in the following section. For now, the definitions are loosely:

   Fresh:  No packet with FPR header bits identical to this packet's FPR header bits has been received in the recent time period T. (Therefore, the packet is assumed to contain a fresh FPR payload and so must be replicated.)

   Repeat: One or more packets with FPR header bits identical to this packet's FPR header bits have been received in the recent time period T. (The first such packet was replicated, so subsequent packets, which presumably have the same FPR payload, are ignored.)

The idea is that in any given scenario, there may be some mechanism by which a packet could be so delayed by the routing system and Replicators that it arrives some time D later than it otherwise would. The algorithm needs to identify any delayed packet with the same payload as one already received as a "Repeat" which can be ignored.

There are various ways of achieving these goals. One approach is to use the 32 bit FPR header in the following manner. This is purely for Ivip. Other applications would choose to identify the payloads differently.
(This is a preliminary exploration, to demonstrate one way of performing the algorithm.)

   10 bits: epoch in 1 sec increments (epochsec)

      When the RUAS sends out packets with this payload, it sets these 10 bits according to the current time (epoch), quantized to units of one second. RUASes should agree on a common timebase, so all the packets sent at a particular time by all RUASes have the same, or +/-1, values for these bits. This wraps around every 1024 seconds (17 minutes 4 secs). Maybe it would be better to have 20 or 32 bits here. This value is not currently used for the Fresh / Repeat algorithm, but it will be used by QSDs and Missing Payload Servers for identifying recent payloads.

   7 bits: RUAS identifier (ruas)

      This identifies the RUAS which sent the payload, from 128 possible RUASes.

   1 bit: Normal / Jumbo (nj)

      0 means normal ~1500 byte packet size. 1 means ~9kbyte packet size. To support simultaneous reception of both types of packet, the RUAS will maintain separate sequence number counters for each set of packets.

   14 bits: Sequence number (seq)

      The RUAS sends out each payload with a sequence number which is one more than that used for the previously sent payload. The Fresh / Repeat algorithm doesn't rely on this sequential order, but it will help with the retrieval of missing payloads from Missing Payload Servers. These numbers wrap around every 16,384 payloads. The RUAS should not send more than 1000 packets a second, which is about 1.5 megabytes a second, assuming the packets are ~1500 bytes. So "seq" can't wrap around in less than 16.384 seconds.

The next section explains how these bits are used.

6. The Fresh vs. Repeat Algorithm

This section describes an algorithm for deciding whether a DTLS payload (FPR header and FPR payload) is "Fresh" or a "Repeat".
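Before turning to the algorithm itself, the 32 bit header layout proposed in Section 5 can be sketched in code. This is only an illustration under stated assumptions: the field widths come from the text, but the bit order (epochsec in the most significant bits, seq in the least), the big-endian byte order, and the names FprHeader, pack_header, unpack_header and timer_index are hypothetical choices which the draft does not specify.

```python
from typing import NamedTuple

# Field widths taken from the proposed 32 bit FPR header.
EPOCHSEC_BITS, RUAS_BITS, NJ_BITS, SEQ_BITS = 10, 7, 1, 14


class FprHeader(NamedTuple):
    epochsec: int  # send time in seconds, modulo 2**10
    ruas: int      # identifier of the sending RUAS, 0..127
    nj: int        # 0 = normal (~1500 byte) packet, 1 = jumbo (~9 kbyte)
    seq: int       # per-RUAS, per-nj sequence number, modulo 2**14


def pack_header(h: FprHeader) -> bytes:
    """Pack the fields into 4 bytes. ASSUMED order: epochsec, ruas, nj, seq."""
    for name, value, bits in (("epochsec", h.epochsec, EPOCHSEC_BITS),
                              ("ruas", h.ruas, RUAS_BITS),
                              ("nj", h.nj, NJ_BITS),
                              ("seq", h.seq, SEQ_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    word = h.epochsec
    word = (word << RUAS_BITS) | h.ruas
    word = (word << NJ_BITS) | h.nj
    word = (word << SEQ_BITS) | h.seq
    return word.to_bytes(4, "big")  # big-endian is also an assumption


def unpack_header(data: bytes) -> FprHeader:
    """Recover the four fields from the first 4 bytes of a DTLS payload."""
    word = int.from_bytes(data[:4], "big")
    seq = word & ((1 << SEQ_BITS) - 1)
    word >>= SEQ_BITS
    nj = word & ((1 << NJ_BITS) - 1)
    word >>= NJ_BITS
    ruas = word & ((1 << RUAS_BITS) - 1)
    word >>= RUAS_BITS
    return FprHeader(epochsec=word, ruas=ruas, nj=nj, seq=seq)


def timer_index(h: FprHeader) -> int:
    """22 bit index formed from ruas, nj and seq; epochsec is not used."""
    return (h.ruas << (NJ_BITS + SEQ_BITS)) | (h.nj << SEQ_BITS) | h.seq
```

The 22 bit value returned by timer_index covers exactly the three fields which the Fresh / Repeat decision depends on, so it can serve directly as an index into a table of 2^22 entries.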
I tried using sliding windows and ran into problems. This approach maintains a timer for each sequence number, for each RUAS. This would have been prohibitive in the past, but a quad core ~3GHz CPU would use only a small fraction of its power running these timers. There could be other ways of implementing this algorithm. The aim here is to show a practical approach, which may not be optimal.

For each of the 128 RUASes, the software maintains an array of 2^14 timer (down-counter) variables for ~1500 byte packets and another such array for ~9kbyte packets. (It will be many years before the DFZ supports ~9kbyte packets, but the code should be ready to support a separate stream of such packets.)

In the scheme described here, the timer variables are 4 bits each, although only 3 bits are used. The 4 bit timer variables are in a multidimensional array, indexed on "ruas" (2^7), "nj" (2^1) and "seq" (2^14). So there are 2^22 4 bit timer variables, occupying 2 megabytes of RAM. This is a few cents' worth of DRAM. The whole array fits well within the L2 cache of modern multi-core CPUs, which is typically 8 megabytes.

When a packet arrives and is successfully deciphered, the software looks at three fields in the FPR header: "ruas", "nj" and "seq". The software uses these to index into the array and read a particular timer variable.

If the timer value is zero, then the payload is deemed to be "Fresh". This is because this payload is the first to be received from this RUAS with this "seq" number in the last 10 or so seconds. The software then sets the timer variable to 6, so that with the 2 second decrement interval it reaches zero between 10 and 12 seconds later. Later, after this RUAS has sent another 16,384 payloads, it will send another payload with the same "seq" value. But by then, more than 10 seconds will have elapsed and this timer value will have reached zero - so that new payload will be recognised as Fresh too.

If the timer variable is non-zero, this payload is deemed to be a "Repeat" and no further action is taken on it, or the timer variable.
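The per-payload decision can be sketched as follows, together with the periodic decrement pass which the text goes on to describe. This is a minimal sketch, not a definitive implementation: the class and method names are hypothetical, each timer occupies a whole byte rather than 4 bits (4 megabytes instead of 2), and the starting value is a parameter. With a decrement every 2 seconds, a starting value of N reaches zero between 2(N-1) and 2N seconds later, so 6 corresponds to a window of 10 to 12 seconds.

```python
# Lookup table for one decrement pass: 0 stays 0, n becomes n - 1.
_DECREMENT = bytes([0] + list(range(255)))


class FreshRepeatFilter:
    """One down-counter per (ruas, nj, seq); zero means 'not seen recently'."""

    def __init__(self, start_value: int = 6) -> None:
        # ASSUMPTION: one byte per timer for simplicity, not packed 4 bit values.
        self.timers = bytearray(1 << 22)
        self.start_value = start_value

    def classify(self, ruas: int, nj: int, seq: int) -> str:
        """Return "Fresh" for the first sighting in the window, else "Repeat"."""
        index = (ruas << 15) | (nj << 14) | seq
        if self.timers[index] == 0:
            self.timers[index] = self.start_value  # start the countdown
            return "Fresh"
        return "Repeat"  # the timer is left untouched

    def tick(self) -> None:
        """One decrement pass over every timer, meant to run every 2 seconds."""
        self.timers = self.timers.translate(_DECREMENT)
```

In this single-threaded sketch tick() rebuilds the whole table in one call. A real implementation would more likely walk the table incrementally in a background thread.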
A payload is treated as a Repeat in this way if another packet with the same payload arrives less than 10 to 12 seconds after the first one.

Meanwhile, a background process steps through all 2^22 timer values every 2 seconds - about two million a second, which is a fraction of a CPU's worth of work, and modern chips have four CPUs. If the value is non-zero, it is decremented. If it is zero (which most of them will be) the variable is not changed.

There is no need for locking these timer variables, since these two types of access are thread-safe. The payload handling code only writes the variable if it was zero and the timer code only writes to it if it was non-zero. (This reasoning assumes each timer can be written independently. If two 4 bit timers share a byte, writing one involves a read-modify-write of the other, so each timer would need its own byte, or the updates would need to be atomic.)

Since the down-counting operation is asynchronous with respect to the payload handling code, it could be 0.0 to 2.0 seconds before the first decrement operation. The actual time required for the counter to reach zero after a Fresh packet is recognised will be between 10.0 and 12.0 seconds.

This arrangement will reject as a "Repeat" any second occurrence of a payload which arrives up to 10 seconds after the first ("Fresh") one. It may reject one which arrives as much as 12 seconds later.

I assume that the Replicators themselves do not delay packets and that the routing system would never delay a packet by as much as 10 seconds. If a longer time is required, this algorithm could be modified.

7. RUAS functionality

With the assumptions from the previous section, there can be up to 128 RUASes. Each can generate up to 1000 DTLS payloads per second. However, the total FPR system will have a specified maximum data rate, probably at a granularity of a short time such as a few milliseconds to a few tens of milliseconds.
Therefore, there needs to be some arrangement by which the RUASes cooperate so the rate at which packets (really DTLS payloads) are replicated by the level 0 Replicators does not exceed this maximum.

Each RUAS has its own section of the 22 bit numbering range (2^22 values) to use as sequence numbers for its DTLS payloads - the values it writes to the "ruas", "nj" and "seq" fields in the FPR header. For the ~1500 byte stream of payloads, the RUAS must cycle through the 2^14 range of "seq" sequentially. If it is also sending jumboframe packets, it will maintain an independent counter to set the "seq" bits in those payloads.

As part of this cycling, the RUAS should not, within 12.0 seconds, generate two DTLS payloads with the same particular value for "seq" in their FPR headers, but with different FPR payloads.

All RUASes should use a common timebase for setting the "epochsec" field in the payloads they generate.

At a bare minimum, for the RUAS to successfully launch a DTLS payload, it must deliver a packet containing that payload to at least one level 0 Replicator. This is assuming all the level 0 Replicators are operating and that they are fully meshed - with each receiving a stream from the others.

If the RUAS only delivered the payload to a single level 0 Replicator, which was not sending a stream to any other level 0 Replicator, but was sending streams to all its downstream level 1 Replicators, then depending on the interconnections at the various levels, this may not result in the payload being delivered to all QSDs.

Therefore, the RUAS should ideally have a stream to each level 0 Replicator, to maximise the chance that most or all of these Replicators receive the payload directly, or from another such Replicator.

The question of how RUASes format the data in the FPR payload, for the purposes of reassembly in the QSDs and so that QSDs can use end-to-end authentication to check it, is outside the scope of the FPR system.
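The RUAS numbering rules in this section can be illustrated with a short sketch. It assumes a hypothetical RuasSequencer class which keeps one counter per value of "nj" and takes the current time from the caller (standing in for the common timebase). It also checks the arithmetic behind the 12.0 second rule: at the stated limit of 1000 payloads per second, "seq" cannot wrap in less than 16.384 seconds.

```python
SEQ_MODULUS = 1 << 14            # "seq" is a 14 bit field
EPOCHSEC_MODULUS = 1 << 10       # "epochsec" is a 10 bit field
MAX_PAYLOADS_PER_SECOND = 1000   # per-RUAS limit stated in the text
REPEAT_WINDOW_SECONDS = 12.0     # longest time a timer can stay non-zero


class RuasSequencer:
    """Hypothetical helper producing FPR header fields for one RUAS."""

    def __init__(self, ruas: int) -> None:
        if not 0 <= ruas < 128:
            raise ValueError("ruas must fit in 7 bits")
        self.ruas = ruas
        # Independent counters: index 0 = normal stream, index 1 = jumbo.
        self.next_seq = [0, 0]

    def next_header_fields(self, now_seconds: float, nj: int = 0):
        """Return (epochsec, ruas, nj, seq) and advance that stream's counter."""
        epochsec = int(now_seconds) % EPOCHSEC_MODULUS
        seq = self.next_seq[nj]
        self.next_seq[nj] = (seq + 1) % SEQ_MODULUS
        return epochsec, self.ruas, nj, seq


def min_wrap_seconds(rate: int = MAX_PAYLOADS_PER_SECOND) -> float:
    """Shortest time in which "seq" can wrap at the given payload rate."""
    return SEQ_MODULUS / rate
```

Because min_wrap_seconds() exceeds REPEAT_WINDOW_SECONDS, an RUAS which respects the rate limit cannot reuse a "seq" value while a receiver's timer for that value is still running.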
8. Replicator Functionality

Most Replicators will need to receive two or perhaps a few more streams from upstream Replicators.

Level 0 Replicators will receive many more streams. Firstly, they will receive a stream from each other level 0 Replicator. Secondly they will receive a stream from each RUAS. The same Replicator code should be usable at all levels, so Replicators in general should be capable of receiving over 100 input streams.

This does not mean the total volume of packets would be 100 times the complete set of payloads the FPR system is replicating. If we assume an upper limit of 8 level 0 Replicators, then the worst case quantity of packets any Replicator must handle is 8 times the total actually being replicated. This would be when a level 0 Replicator receives the total set collectively from the 100 or so RUASes and then receives the same set from each of the 7 streams from the other level 0 Replicators.

So this provides a reasonable definition of how many DTLS sessions a Replicator may need to create to "upstream" devices - and the total volume of data it should be able to receive via these sessions.

Except for monitoring purposes, the Replicator makes no distinction between DTLS payloads which arrive from any of its upstream sources. Each such payload is handled, as described above, by the Fresh / Repeat algorithm. Only payloads deemed Fresh require any further action.

Each Fresh payload is replicated to all the downstream devices, each with its own DTLS protection, due to each such session having different session keys and states.

Just as the incoming streams are unidirectional, so are the output streams. Apart from DTLS handshakes, a Replicator neither sends packets to upstream devices nor receives them from downstream devices.

The replication process does not alter the DTLS payload. There is no hop-count or checksum to check or update.
The same DTLS payload is simply sent out via all downstream DTLS sessions. It would be best if this was scheduled to even out the flow of packets for each such session. So a DTLS payload would be sent out on session 0, then on session 1, etc. rather than sending two or more different DTLS payloads on any one session one after the other.

When a Replicator is handling a jumboframe stream as well as the ordinary ~1500 byte stream, it maintains separate input and output sessions for the jumboframe packets. So the structure of links between Replicators for jumboframe packets could be identical to that for ~1500 byte packets, could be similar or could be entirely different. Therefore, a Replicator which handles both will need approximately double the DTLS sessions, and of course the bandwidth and CPU power to handle both.

9. QSD Functionality

The QSD receives incoming streams as just described for Replicators. However, a QSD would receive either only ~1500 byte streams or only jumboframe streams. Therefore, its Fresh / Repeat algorithm needs only half as many timer variables as a Replicator.

QSDs don't receive streams from the numerous RUASes, and it is probably safe to assume that no-one would run a QSD with more than 8 input streams. So while a QSD is only required to handle up to 8 or so DTLS sessions, each of these streams would be a complete stream, so the incoming data rate requirement is the same as that of a Replicator - 8 times the total data rate of the complete stream.

When Fresh DTLS payloads are received, their contents - the 32 bit FPR header and FPR payload - are passed to the rest of the QSD software, and the mapping information in these payloads will be interpreted as will be described in [I-D.whittle-ivip-db-fast-push].
This processing will involve some kind of end-to-end integrity checking, involving the public key of the RUAS which sent the payload. With the above arrangement, the RUAS which sent the packet can easily be determined from the "ruas" field in the FPR header.

Perhaps it will be possible to individually authenticate every payload - but I am concerned about devoting too much space in every payload to the required MAC bits. This concern would not apply to jumboframe payloads which are much longer.

Checking each payload would be simpler, but more costly in terms of CPU resources and space used in each payload. Assembling information from multiple payloads into a larger block for authentication would be more efficient, but more complex. It also means that a missing payload will delay the use of information in other payloads.

Exactly how the end-to-end authentication will be done is for future work. It depends more on the Ivip mapping system than on the FPR system itself, so I intend to explore this in the future in [I-D.whittle-ivip-db-fast-push].

QSDs will also need to recognise any missing packets and to download a replacement. The algorithm for this is for further work, but the 10 bit "epochsec" field will also be useful for this. Missing packets could be detected, after a second or two, by a gap in the sequence numbers of payloads from a given RUAS.

Perhaps one form of request for missing packets might be to send two 32 bit values, containing the FPR headers of the successfully received payloads which bracket the assumed missing packets. The full 32 bits uniquely identifies each payload, and the 1 second resolution "epochsec" field will enable the Missing Payload Server to narrow down its search through its cache.

10. Further elaborations

The above is a reasonably exhaustive exposition on the early design phase of a simple, but flexible, data replication system. Here are some elaborations to be more fully developed in the future.

10.1. Missing Payload Servers (MPSes)

I originally planned for QSDs to request payloads they did not receive from a handful of HTTP servers run by each RUAS. This could have scaling problems, so I have developed an alternative which is closely integrated with the basic FPR system of Replicators.

In Ivip, the RUAS will be making snapshots of the mapping information for each MAB (Mapped Address Block) on a regular basis, such as every 5 minutes or so. It will make these snapshots (in a compressed form) available via several HTTP servers so QSDs all over the world can download them during initialization.

If a QSD was more than a few minutes behind with missing payloads, then it would be better for it to download the most recent snapshot instead and apply the updates it has received since that snapshot was made. So the Missing Payload Server probably only needs to handle packets in the last 5 or so minutes. This fits well with the 10 bit "epochsec" field in the FPR header.

I haven't yet decided how a QSD can specify which missing payload(s) it wants. One method may be to send the 32 bit FPR headers of the last payload received before the missing payloads and of the first payload received afterwards.

I considered using a UDP protocol for requesting and receiving missing payloads from MPSes, but chose TCP instead, probably HTTP or HTTPS over TCP. This avoids any PMTU problems and removes the need for acks, resending queries and responses. TCP also avoids difficulties inherent in lightweight UDP protocols where the MPS could be used to amplify small query packets with spoofed source addresses into larger responses to DoS a victim.

An MPS (Missing Payload Server) is a COTS server running software with an input stage identical to that of a Replicator or QSD.
That is, it uses DTLS to receive two or more streams from Replicators and it uses the Fresh / Repeat algorithm to ignore all but the first appearance of a new payload.

An MPS at a particular location would receive a stream from one or more physically and topologically close Replicators and ideally from some physically and topologically distant Replicators. "Topology" in this case means not just the underlying DFZ topology, but also that distant Replicator's location in the "topology" of upstream Replicators.

The aim is to receive at least one local stream, which is inexpensive - probably from a Replicator in the same data centre - and one or a few streams from distant Replicators. This is so that in the event of the local Replicator suffering an outage and so missing some packets, it is likely that the distant one will not be missing the same packets.

If the local outage, such as a complete loss of connectivity for a few seconds, or significant packet loss due to congestion, also affects the ability of the MPS to receive packets from distant Replicators, then the same packets may be lost. A simple workaround for this is to have the distant Replicator delay its stream by ten seconds or so. Such delayed outputs from a Replicator should only be used to drive QSDs and MPSes - never another Replicator.

MPSes do not need to interpret the packets in order to update a mapping database, as does a QSD. The MPS does not need to interpret the payloads at all, or perform end-to-end authentication on their contents. The MPS only needs to store complete DTLS payloads for ten minutes or so and be able to provide them to requesters. The requesters will be either QSDs or other MPSes.

So an MPS is a relatively light-weight network element.
It may be quite busy at times responding to queries and sending out payloads, but most of the time it is storing payloads in a simple fashion, and is not required to do any work on their contents.

The request protocol does not need to be secure, since Ivip mapping information is public information. However, each MPS may wish to restrict its queriers to those which match an ACL.

By some means TBD, each QSD and MPS could be configured to use several MPSes - including perhaps a distant one which is unlikely to be affected by any brief local outage which caused this QSD to be missing some packets. (Folks in North America, Siberia and Africa would have reason to give each other access to their MPSes!)

In ordinary operation, each MPS would have a complete list of recent packets. If it was missing some packets, it would determine this by looking at the FPR headers and finding a gap in the "seq" numbers recently received for each RUAS.

It would be scalable for each MPS to maintain a TCP connection with another MPS so the two could use the one link to request and deliver missing packets in both directions. Therefore, the MPSes could be arranged in multiple partially meshed groups - or these could be connected and so form a single global network of MPSes.

The request protocol would probably need an option to cancel a request. For instance, an MPS in Los Angeles might first request one or more missing packets from an MPS in New York. But if the NY MPS replies that it too is missing these packets, the LA MPS might request them from an MPS in Beijing - which responds that it has them, and starts sending them. The LA MPS will then want to cancel the request to the NY MPS.

HTTP or HTTPS is probably a good protocol for this purpose. QSDs would use the same protocol for querying MPSes.
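As a rough illustration of the gap detection described above, the following Python sketch tracks the last "seq" value seen for each RUAS and reports the values which were skipped. The width of the "seq" field (seq_bits), the function names and the (ruas, seq) pair representation are assumptions made for this example only. It also assumes payloads from one RUAS arrive in order, and it does not model waiting a second or two before treating a payload as missing.

```python
from collections import defaultdict


def missing_between(prev_seq, next_seq, seq_bits):
    """Return the sequence numbers strictly between prev_seq and next_seq,
    counting modulo 2**seq_bits so that wrap-around is handled."""
    modulus = 1 << seq_bits  # seq_bits is a hypothetical field width
    gap = (next_seq - prev_seq) % modulus
    return [(prev_seq + i) % modulus for i in range(1, gap)]


def find_gaps(received, seq_bits):
    """received: iterable of (ruas, seq) pairs in arrival order.
    Returns {ruas: [missing seq, ...]} for every RUAS with a gap."""
    last_seen = {}
    gaps = defaultdict(list)
    for ruas, seq in received:
        if ruas in last_seen:
            gaps[ruas].extend(missing_between(last_seen[ruas], seq, seq_bits))
        last_seen[ruas] = seq
    # Drop RUASes for which nothing was missing.
    return {ruas: seqs for ruas, seqs in gaps.items() if seqs}


received = [(1, 5), (2, 100), (1, 6), (1, 9), (2, 101)]
print(find_gaps(received, seq_bits=8))  # {1: [7, 8]}
```

The two values on either side of each gap (here 6 and 9 for RUAS 1) correspond to the bracketing payloads whose FPR headers might be sent in a request.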
Whether the QSD starts an HTTP(S)-TCP connection when it needs missing packets, or whether it maintains such a connection in readiness, would be a matter for local policy.

In this scenario, MPSes form an interdependent network, which will be highly robust. Most MPSes will have all the recent packets. Those which don't will automatically obtain them from other MPSes within a few seconds.

An ISP which runs one or a few QSDs could run an MPS for all of them, with long-lasting sessions to a few other MPSes in nearby and distant ISPs. Alternatively, the QSDs could use one or more MPSes operated by other organisations, perhaps on a commercial basis. Since an MPS is simply software running on a COTS server, MPSes are not expensive or difficult to deploy.

It would be possible to run an MPS on the same host as a QSD, but if they are using streams from the same Replicators, then there will be a high correlation between the sets of packets which each function misses. Therefore, it makes sense for each QSD to use a nearby MPS, and then a distant one, rather than to run an MPS at the same site which will need to make much the same queries of other MPSes as the QSD would.

10.2. Delaying the output of Replicators

If a QSD or MPS relied on a single upstream physical link, or a router or other device which might be subject to transient disruption, then having multiple streams from upstream Replicators will not necessarily ensure the QSD gets all the payloads which are sent. This is because the disruption will likely affect all such streams, which will be carrying much the same payloads at the same time.

A possible workaround is to have one or more of the streams delayed at its source - in the output function of its Replicator. If one such stream was delayed by 5 seconds, then it would typically be able to deliver every payload which was not delivered during a 4 second disruption.
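To make the 5 second / 4 second example concrete, here is a minimal Python sketch. It uses a simplified model which is an assumption of this example rather than part of FPR: time is counted in integer tenths of a second, network latency is ignored, and an outage drops every packet which would arrive while it lasts. The function name and parameters are hypothetical.

```python
def missed_payloads(send_times, stream_delays, outage_start, outage_end):
    """Return the send times of payloads which no stream delivered.

    A copy sent at time t on a stream with delay d arrives at t + d and is
    lost if outage_start <= t + d < outage_end. A payload is missed only
    if its copies on all streams are lost.
    """
    missed = []
    for t in send_times:
        arrivals = [t + d for d in stream_delays]
        if all(outage_start <= a < outage_end for a in arrivals):
            missed.append(t)
    return missed


send_times = range(0, 200)  # one payload every 0.1 s for 20 s

# Undelayed stream plus a stream delayed by 5 s, with a 4 s outage.
print(missed_payloads(send_times, [0, 50], 60, 100))       # []

# With only a 3 s delay, the first second of the outage is not covered.
print(len(missed_payloads(send_times, [0, 30], 60, 100)))  # 10
```

In this model the delayed stream covers the whole outage whenever its delay is at least as long as the outage.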
So it may be desirable for delays such as this to be an option when a QSD or MPS requests a stream from a Replicator. A Replicator does not see a QSD request any differently from the request from another Replicator. So the question arises as to whether the stream from one Replicator to another should be delayed - and if so, by how much.

It should be reasonably safe for a QSD or MPS to receive a stream with a delay of a few seconds, since the QSD does not propagate the payloads any further. The delay could be locally chosen so that, when added to a reasonable estimate of the longest delay affecting packets going into that Replicator, there is still a safety margin within the minimum timeout of the QSD's timers for the purposes of Fresh / Repeat detection.

To delay the packets received by a Replicator would be much more problematic. These delayed payloads could be propagated to other Replicators - and these delays could be added to by similar arrangements between other Replicators. Then, the total delay might exceed the limits of the Fresh / Repeat algorithm and QSDs and Replicators would mistake older payloads for ones which were actually sent 15 seconds or so later. (This could be prevented with a more elaborate algorithm, which also uses the 10 "epochsec" bits, but I think this raises further complications.) This would only occur with the highest allowed data-rates, which might not occur in practice until the system was being used intensively - many years in the future.

I think that delaying a stream to a Replicator could in principle improve its robustness if its two streams were likely to be subject to the same brief disruptions. However, it would be better to locate Replicators at data centres with multiple physical links and to try to ensure that the streams are most likely to arrive over diverse links.
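The delay budget discussed in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Fresh / Repeat window is taken as 15 seconds (from the "15 seconds or so" figure above), the upstream delay estimate and safety margin are made-up values, and both function names are hypothetical.

```python
def max_safe_stream_delay(window_s, upstream_delay_s, margin_s):
    """Largest delay which could be added to a stream feeding a QSD or MPS
    while still leaving margin_s spare within the Fresh / Repeat window."""
    return max(window_s - upstream_delay_s - margin_s, 0)


def path_exceeds_window(per_hop_delays_s, window_s):
    """True if delays added at successive Replicators sum past the window."""
    return sum(per_hop_delays_s) > window_s


# Assumed figures: 15 s window, 2 s worst-case upstream delay, 3 s margin.
print(max_safe_stream_delay(15, 2, 3))         # 10

# Delaying only the final hop to a QSD stays inside the window ...
print(path_exceeds_window([0, 0, 0, 5], 15))   # False

# ... but a 5 s delay at each of four Replicator hops does not.
print(path_exceeds_window([5, 5, 5, 5], 15))   # True
```

The second case is the reason for restricting delayed outputs to QSDs and MPSes: a delay added only at the last hop cannot accumulate.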
QSDs will cope with missing packets, and the aim of the Replicator system is to minimise the number of packets they miss. However, Replicators should not contribute to delays which might disrupt the ability of QSDs and other Replicators to correctly distinguish a Fresh packet from a Repeat.

10.3. Private network links to avoid DoS attacks

The Replicator system as described above is a promising method of fanning out information to a very large number of recipient devices all over the world, in fractions of a second. While the system is distributed and has no single point of failure, if it was used for a purpose as important as distributing mapping for a core-edge separation system such as Ivip, it would no doubt be threatened by DoS attacks in the form of gigabits per second of packets directed from large numbers of hacked botnet PCs.

Internet protocols are intended to operate on the open Internet. However, the use of FPR may be a partial exception.

Some root nameservers are toughened against DoS by being distributed to multiple high-bandwidth sites using anycast. In principle, by having enough fully meshed level 0 Replicators, the same goal could be achieved - for an attack to succeed, it would need to overwhelm all, or almost all, of the devices at the same time.

To some extent this can be achieved with Replicators, but it would probably be best to toughen the system against DoS attacks by linking the RUASes, the level 0 Replicators and at least the level 1 Replicators over private network links with assured bandwidth and no possibility of being affected by packets arriving from the Internet.

In this case, the RUASes and level 0 Replicators may have private addresses. The level 1 Replicators may also have private addresses on their input side - the part which makes DTLS links to level 0 Replicators.
In this model, the output side of the level 1 Replicators would use public addresses so level 2 Replicators could establish sessions with them. Probably these level 1 Replicators would have two separate gigabit Ethernet ports - one for the private address and upstream links and the other for the public addresses and downstream links.

The downstream public address of the level 1 Replicators might be the target of a DoS attack, but once the sessions have been established, Replicators do not need to receive any packets on those DTLS sessions. So a DoS attempt there would have little or no effect.

Instead, a DoS attack would need to focus on the level 2 Replicators. Ideally, depending on the capacity of the attackers, these would be so numerous that an attack could only disrupt a subset of them. Even then, due to the cross-linked nature of the Replicator system, the impact of that attack on QSDs may be greatly diluted due to level 3 and 4 Replicators working fine from streams arriving from level 2 Replicators which were not targeted.

Using private network links to fully mesh the level 0 Replicators, and for their streams to the level 1 Replicators, is a non-trivial matter. However, the benefits to the Internet in general of a fast push mapping distribution system for a core-edge separation scheme are immense - so this expense is worth considering.

If the first two levels are carefully optimised so that there are, for instance, 5 to 8 level 0 Replicators (only two or three are needed for highly reliable operation) and 50 to 100 level 1 Replicators (of which quite a few could be dead without significantly disrupting streams to QSDs) then this system could drive 1000 to 2000 or perhaps more level 2 Replicators. This would probably make the system largely immune to DoS attacks - but of course the exact details would need to be considered at the time of deployment.
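The Replicator counts quoted above imply a certain number of output streams per Replicator at each level. The following Python sketch works this out under two assumptions which are not stated in this section: each downstream Replicator takes exactly two input streams, and those streams are spread evenly over the upstream level. The function name is hypothetical.

```python
def streams_per_upstream(downstream_count, inputs_per_downstream, upstream_count):
    """Average number of output streams each upstream Replicator must send,
    assuming the downstream inputs are spread evenly."""
    return downstream_count * inputs_per_downstream / upstream_count


# Level 0 -> level 1: 100 level 1 Replicators fed by 8 level 0 Replicators.
print(streams_per_upstream(100, 2, 8))     # 25.0

# Level 1 -> level 2: 2000 level 2 Replicators fed by 100 level 1 Replicators.
print(streams_per_upstream(2000, 2, 100))  # 40.0
```

With more than two inputs per downstream Replicator, or with some upstream Replicators out of service, the load on each remaining upstream Replicator rises in proportion.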
11. Security Considerations

For future work, but see notes above about the need for end-to-end authentication, and hardening against DoS attacks.

12. IANA Considerations

[To do.]

13. Informative References

[DFZ-unfrag-1470] Whittle, R., "Google sends 1470 byte unfragmentable packets", August 2008, .

[I-D.ietf-lisp] Farinacci, D., Fuller, V., Meyer, D., and D. Lewis, "Locator/ID Separation Protocol (LISP)", draft-ietf-lisp-05 (work in progress), September 2009.

[I-D.whittle-ivip-arch] Whittle, R., "Ivip (Internet Vastly Improved Plumbing) Architecture", draft-whittle-ivip-arch-04 (work in progress), January 2010.

[I-D.whittle-ivip-db-fast-push] Whittle, R., "Ivip Mapping Database Fast Push", draft-whittle-ivip-db-fast-push-03 (work in progress), January 2010.

[RFC2887] Handley, M., Floyd, S., Whetten, B., Kermode, R., Vicisano, L., and M. Luby, "The Reliable Multicast Design Space for Bulk Data Transfer", RFC 2887, August 2000.

[RFC3133] Dunn, J. and C. Martin, "Terminology for Frame Relay Benchmarking", RFC 3133, June 2001.

[RFC3740] Hardjono, T. and B. Weis, "The Multicast Group Security Architecture", RFC 3740, March 2004.

[RFC4347] Rescorla, E. and N. Modadugu, "Datagram Transport Layer Security", RFC 4347, April 2006.

[TTR Mobility] Whittle, R. and S. Russert, "TTR Mobility Extensions for Core-Edge Separation Solutions to the Internets Routing Scaling Problem", August 2008, .

Author's Address

Robin Whittle
First Principles

Email: rw@firstpr.com.au
URI: http://www.firstpr.com.au/ip/ivip/