Internet Engineering Task Force                             A. Ford, Ed.
Internet-Draft                                       Roke Manor Research
Intended status: Informational                                 C. Raiciu
Expires: December 24, 2010                     University College London
                                                                S. Barre
                                        Universite catholique de Louvain
                                                              J. Iyengar
                                           Franklin and Marshall College
                                                           June 22, 2010

         Architectural Guidelines for Multipath TCP Development
                    draft-ietf-mptcp-architecture-01

Abstract

Endpoints are often connected by multiple paths, but TCP restricts communications to a single path per transport connection. Resource usage within the network would be more efficient were these multiple paths able to be used concurrently. This should enhance user experience through improved resilience to network failure and higher throughput.

This document outlines architectural guidelines for the development of a Multipath Transport Protocol, with references to how these architectural components come together in the Multipath TCP (MPTCP) protocol. This document also lists certain high level design decisions that provide foundations for the MPTCP design, based upon these architectural requirements.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 24, 2010.

Ford, et al.            Expires December 24, 2010               [Page 1]
Internet-Draft             MPTCP Architecture                  June 2010

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1.  Introduction
  1.1.  Requirements Language
  1.2.  Terminology
  1.3.  Reference Scenario
2.  Goals
  2.1.  Functional Goals
  2.2.  Compatibility Goals
    2.2.1.  Application Compatibility
    2.2.2.  Network Compatibility
    2.2.3.  Compatibility with other network users
3.  An Architectural Basis For MPTCP
4.  A Functional Decomposition of MPTCP
5.  High-Level Design Decisions
  5.1.  Sequence Numbering
  5.2.  Reliability
  5.3.  Buffers
  5.4.  Signalling
  5.5.  Path Management
  5.6.  Connection Identification
  5.7.  Network Layer Compatibility
  5.8.  Congestion Control
6.  Summary
7.  Security Considerations
8.  Interactions with Applications
9.  Interactions with Middleboxes
10. Acknowledgements
11. Contributors
12. IANA Considerations
13. References
  13.1.  Normative References
  13.2.  Informative References
Appendix A.  Implementation Architecture
  A.1.  Functional Separation
    A.1.1.  Application to default MPTCP protocol
    A.1.2.  Generic architecture for MPTCP
  A.2.  PM/MPS interface
Appendix B.  Changelog
  B.1.  Changes since draft-ietf-mptcp-architecture-00
Authors' Addresses

1.  Introduction

As the Internet evolves, demands on Internet resources are ever-increasing, but often these resources (in particular, bandwidth) cannot be fully utilised due to protocol constraints both on the end-systems and within the network. If these resources could instead be used concurrently, end user experience could be greatly improved. Such enhancements would also reduce the necessary expenditure on network infrastructure which would otherwise be needed to create an equivalent improvement in user experience.
By the application of resource pooling[2], these available resources can be 'pooled' such that they appear as a single logical resource to the user. The purpose of a multipath transport, therefore, is to make use of multiple available paths, through resource pooling, to bring two key benefits:

o  To increase the resilience of the connectivity by providing multiple paths, protecting end hosts from the failure of one.

o  To increase the efficiency of the resource usage, and thus increase the network capacity available to end hosts.

Multipath TCP (MPTCP)[3] is a set of extensions for TCP[4] that implements a multipath transport and achieves these goals by pooling multiple paths within a transport connection, transparent to the application. While multihoming and multipath functions have been implemented in transport protocols previously, notably SCTP[5], MPTCP is distinct in recognizing application and network compatibility goals that we believe are important for deployability of a multipath transport; we discuss these goals in more detail later in Section 2.

This document makes three contributions: (i) it describes goals for a multipath transport - goals that MPTCP is designed to meet; (ii) it lays out an architectural basis for MPTCP's design - a discussion that applies to other multipath transports as well; and (iii) it discusses and documents high-level design decisions made in MPTCP's development, and considers their implications.

Companion documents to this architectural overview are those which provide details of the protocol extensions[3], congestion control algorithms[6], and application-level considerations[7]. Put together, these components specify a complete Multipath TCP design. We note that specific components are replaceable with other protocols in accordance with the layer and functional decompositions discussed in this document.

Please note this document is a work-in-progress and covers several topics, some of which may be more appropriately moved to separate documents as this work evolves.

1.1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [1].

1.2.  Terminology

Path:  A sequence of links between a sender and a receiver, defined in this context by a source and destination address pair.

Endpoint:  A host either initiating or terminating an MPTCP connection.

Multipath TCP (MPTCP):  A modified version of the TCP [4] protocol that supports the simultaneous use of multiple paths between endpoints.

Subflow:  A flow of TCP packets operating over an individual path, which forms part of a larger MPTCP connection.

MPTCP Connection:  A set of one or more subflows combined to provide a single Multipath TCP service to an application at an endpoint.

1.3.  Reference Scenario

The diagram shown in Figure 1 illustrates a typical usage scenario for MPTCP. Two hosts, A and B, are communicating with each other. These endpoints are multi-homed and multi-addressed, providing two disjoint connections to the Internet. The addresses on each endpoint are referred to as A1, A2, B1 and B2. There are therefore up to four different paths between the two endpoints: A1-B1, A1-B2, A2-B1, A2-B2.

   +------+           __________           +------+
   |      |A1 ______ (          ) ______ B1|      |
   | Host |--/      (            )      \--| Host |
   |      |        (   Internet   )        |      |
   |  A   |--\______(            )______/--|  B   |
   |      |A2        (__________)        B2|      |
   +------+                                +------+

                Figure 1: Simple MPTCP Usage Scenario

The scenario could have any number of addresses (1 or more) on each endpoint, so long as the number of paths available between the two endpoints is 2 or more (i.e. num_addr(A) * num_addr(B) > 1).
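The path count in the reference scenario can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the function names `candidate_paths` and `multipath_possible` are hypothetical and do not come from any MPTCP specification or implementation.

```python
from itertools import product


def candidate_paths(addrs_a, addrs_b):
    """Return every (source, destination) address pair between two hosts.

    Each pair is one candidate path in the sense of Section 1.2.
    """
    return list(product(addrs_a, addrs_b))


def multipath_possible(addrs_a, addrs_b):
    """True when num_addr(A) * num_addr(B) > 1, i.e. two or more paths."""
    return len(addrs_a) * len(addrs_b) > 1


# The scenario of Figure 1: two addresses on each host.
paths = candidate_paths(["A1", "A2"], ["B1", "B2"])
for src, dst in paths:
    print(f"{src}-{dst}")
```

With two addresses per host this lists the four paths named in the text (A1-B1, A1-B2, A2-B1, A2-B2). A single-homed host talking to a dual-homed host still satisfies the condition, whereas two single-homed hosts do not.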
The paths created by these address combinations through the Internet need not be entirely disjoint - shared bottlenecks will be addressed by the MPTCP congestion controller. Furthermore, the paths through the Internet may be interrupted by any number of middleboxes including NATs and Firewalls. Finally, although the diagram refers to the Internet, MPTCP may be used over any network where there are multiple paths that could be used concurrently.

TBD - what further detail here would be useful?

2.  Goals

This section outlines primary goals that Multipath TCP aims to meet. These are broadly broken down into functional goals, which steer services and features that MPTCP must provide, and compatibility goals, which determine how MPTCP should appear to entities that interact with it.

2.1.  Functional Goals

In providing the use of multiple paths, MPTCP has the following two functional goals.

o  Improve Throughput: MPTCP MUST support the concurrent use of multiple paths. To meet the minimum performance incentives for deployment, an MPTCP connection over multiple paths SHOULD achieve no lesser throughput than a single TCP connection over the best constituent path.

o  Improve Resilience: MPTCP MUST support the use of multiple paths interchangeably for resilience purposes, by permitting packets to be sent and re-sent on any available path. It follows that, in the worst case, the protocol MUST be no less resilient than legacy TCP.

As distribution of traffic among available paths and responses to congestion are done in accordance with resource pooling principles[2], a secondary effect of meeting these goals is that widespread use of MPTCP over the Internet should optimize overall network utility by shifting load away from congested bottlenecks and by taking advantage of spare capacity wherever possible.

Furthermore, MPTCP SHOULD feature automatic negotiation of its use.
A host supporting Multipath TCP that requires the other endpoint to do so too must be able to detect reliably whether this endpoint does in fact support the next-generation protocol, using it if so, and otherwise automatically falling back to the legacy protocol.

2.2.  Compatibility Goals

In addition to the functional goals listed above, a Multipath TCP must meet a number of compatibility goals in order to support deployment in today's Internet. These goals fall into the following categories:

2.2.1.  Application Compatibility

Application compatibility refers to the appearance of MPTCP to the application both in terms of the API that can be used and the expected service model that is provided. MPTCP MUST follow the same service model as TCP [4]: in-order, reliable, and byte-oriented delivery. Furthermore, an MPTCP connection SHOULD provide the application with no worse throughput than it would expect from running a single TCP connection over any one of its available paths.

A multipath-capable equivalent of TCP SHOULD retain backward compatibility with existing TCP APIs, so that existing applications can use the newer transport merely by upgrading the operating systems of the end-hosts. This does not preclude the use of an advanced API to permit multipath-aware applications to specify preferences, nor for users to configure their systems in a different way from the default, for example switching on or off the automatic use of MPTCP.

2.2.2.  Network Compatibility

Traditional Internet architecture slots network devices in the network layer and lower layers of the OSI 7-layer stack, where the layers above the network layer - the transport layer and upper layers - are instantiated only at the end-hosts.
While this architecture, shown in Figure 2, was largely adhered to earlier, this layering no longer reflects the "ground truth" in the Internet with the proliferation of middleboxes[8]. Middleboxes routinely interpose on the transport layer; sometimes even completely terminating transport connections, thus leaving the application layer as the first real end-to-end layer, as shown in Figure 3.

   +-------------+                                       +-------------+
   | Application |<------------ end-to-end ------------->| Application |
   +-------------+                                       +-------------+
   |  Transport  |<------------ end-to-end ------------->|  Transport  |
   +-------------+   +-------------+   +-------------+   +-------------+
   |   Network   |<->|   Network   |<->|   Network   |<->|   Network   |
   +-------------+   +-------------+   +-------------+   +-------------+
      End Host           Router            Router           End Host

               Figure 2: Traditional Internet Architecture

   +-------------+                                       +-------------+
   | Application |<------------ end-to-end ------------->| Application |
   +-------------+                     +-------------+   +-------------+
   |  Transport  |<------------------->|  Transport  |<->|  Transport  |
   +-------------+   +-------------+   +-------------+   +-------------+
   |   Network   |<->|   Network   |<->|   Network   |<->|   Network   |
   +-------------+   +-------------+   +-------------+   +-------------+
                                          Firewall,
      End Host           Router         NAT, or Proxy       End Host

                       Figure 3: Internet Reality

Middleboxes that interpose on the transport layer result in loss of "fate-sharing"[9], that is, they often hold "hard" state that, when lost or corrupted, results in loss or corruption of the end-to-end transport connection.

MPTCP MUST remain backward compatible with the Internet as it exists today, including being able to traverse predominant middleboxes such as firewalls, NATs, and performance enhancing proxies[8].
This requirement comes from recognizing middleboxes as a significant deployment bottleneck for any transport that is not TCP, and constrains MPTCP to appear as TCP does on the wire and to use established TCP extensions where necessary. To ensure end-to-endness of the transport, we further require MPTCP to preserve fate-sharing without making any assumptions about middlebox behavior.

2.2.3.  Compatibility with other network users

As a corollary to both network and application compatibility, the architecture must enable new Multipath TCP flows to coexist gracefully with existing legacy TCP flows, competing for bandwidth neither unduly aggressively nor unduly timidly (unless low-precedence operation is specifically requested by the application, such as with LEDBAT). The use of multiple paths MUST NOT unduly harm users using single path TCP at shared bottlenecks, beyond the impact that would occur from another single legacy TCP flow.

3.  An Architectural Basis For MPTCP

We now present one possible transport architecture that we believe can effectively support MPTCP's goals. The new Internet model described here is based on ideas proposed earlier in Tng ("Transport next-generation") [10]. While by no means the only possible architecture supporting multipath transport, Tng incorporates many lessons learned from previous transport research and development practice, and offers a strong starting point from which to consider the extant Internet architecture and its bearing on the design of any new Internet transports or transport extensions.
   +------------------+
   |    Application   |
   +------------------+  ^ Application-oriented transport
   |                  |  |    functions (Semantic Layer)
   + - - Transport - -+ ----------------------------------
   |                  |  |   Network-oriented transport
   +------------------+  v functions (Flow+Endpoint Layer)
   |      Network     |
   +------------------+
     Existing Layers             Tng Decomposition

             Figure 4: Decomposition of Transport Functions

Tng loosely splits the transport layer into "application-oriented" and "network-oriented" layers, as shown in Figure 4. The application-oriented "Semantic" layer implements functions driven primarily by concerns of supporting and protecting the application's end-to-end communication, while the network-oriented "Flow+Endpoint" layer implements functions such as endpoint identification (using port numbers) and congestion control. These network-oriented functions, while traditionally located in the ostensibly "end-to-end" Transport layer, have proven in practice to be of great concern to network operators and the middleboxes they deploy in the network to enforce network usage policies[11] [12] or optimize communication performance[13]. Figure 5 shows how middleboxes interact with different layers in this decomposed model of the transport layer: the application-oriented layer operates end-to-end, while the network-oriented layer operates "segment-by-segment" and can be interposed upon by middleboxes.
   +-------------+                                       +-------------+
   | Application |<------------ end-to-end ------------->| Application |
   +-------------+                                       +-------------+
   |  Semantic   |<------------ end-to-end ------------->|  Semantic   |
   +-------------+   +-------------+   +-------------+   +-------------+
   |Flow+Endpoint|<->|Flow+Endpoint|<->|Flow+Endpoint|<->|Flow+Endpoint|
   +-------------+   +-------------+   +-------------+   +-------------+
   |   Network   |<->|   Network   |<->|   Network   |<->|   Network   |
   +-------------+   +-------------+   +-------------+   +-------------+
                        Firewall         Performance
      End Host           or NAT        Enhancing Proxy      End Host

             Figure 5: Middleboxes in the new Internet model

MPTCP's architectural design follows Tng's decomposition as shown in Figure 6. The MPTCP component, which provides application compatibility through the preservation of TCP-like semantics of global ordering of application data and reliability, is an instantiation of the "application-oriented" Semantic layer; whereas the legacy-TCP component, which provides network compatibility by appearing and behaving as a TCP flow in the network, is an instantiation of the "network-oriented" Flow+Endpoint layer.

   +--------------------------+    +-------------------------+
   |       Application        |    |       Application       |
   +--------------------------+    +-------------------------+
   |         Semantic         |    |          MPTCP          |
   |--------------------------|    + - - - - - + - - - - - - +
   | Flow+Endpt | Flow+Endpt  |    |    TCP    |     TCP     |
   +--------------------------+    +-------------------------+
   |   Network  |   Network   |    |    IP     |     IP      |
   +--------------------------+    +-------------------------+

                     Figure 6: MPTCP mapping to Tng

As a protocol extension to TCP, MPTCP thus explicitly acknowledges middleboxes in its design, and specifies a protocol that operates at two scales: the MPTCP component operates end-to-end, while it allows the TCP component to operate segment-by-segment.

4.  A Functional Decomposition of MPTCP

Having laid out the goals to be met and the architectural basis for MPTCP, we now provide a functional decomposition of MPTCP's design.

The MPTCP component relies upon (what appear to the network to be) standard TCP sessions, termed "subflows", to provide the underlying transport per path, and as such these retain the network compatibility desired. MPTCP as described in [3] carries MPTCP-specific information in a TCP-compatible manner, although this mechanism is separate from the actual information being transferred so could evolve in future revisions. Figure 7 illustrates the layered architecture.

                                +-------------------------------+
                                |           Application         |
   +---------------+            +-------------------------------+
   |  Application  |            |             MPTCP             |
   +---------------+            + - - - - - - - + - - - - - - - +
   |      TCP      |            | Subflow (TCP) | Subflow (TCP) |
   +---------------+            +-------------------------------+
   |      IP       |            |       IP      |      IP       |
   +---------------+            +-------------------------------+

     Figure 7: Comparison of Standard TCP and MPTCP Protocol Stacks

Situated below the application, the MPTCP extension manages multiple TCP subflows below it and must implement the following functions:

o  Path Management: This is the function to detect and use multiple paths between two endpoints. In the case of the MPTCP design [3], this feature is implemented using multiple IP addresses at at least one of the endpoints. Although this does not guarantee path diversity, and there may be shared bottlenecks, this is a simple mechanism that can be used with no additional features in the network. The path management features of the MPTCP protocol are the mechanisms to signal alternative addresses to endpoints, and mechanisms to set up new subflows attached to an existing MPTCP connection.
o  Packet Scheduling: This function breaks the bytestream received from the application into segments which are transmitted on one of the available lower subflows. The MPTCP design makes use of a data sequence mapping, associating packets sent on different subflows to a connection-level sequence numbering, thus allowing packets sent on different subflows to be correctly re-ordered at the receiver. The packet scheduler is dependent upon information about the availability of paths exposed by the path management component, and then makes use of the subflows to transmit these packets.

o  Subflow (single-path TCP) Interface: A subflow component takes segments from the packet-scheduling component and transmits them over the specified path, ensuring detectable delivery to the endpoint. Detection of delivery is necessary to allow the congestion control protocol to attribute packet delivery or loss to the right path. Note that the packet scheduling component does not embed enough information in packets to allow this to happen: segments with the same connection-level sequence number can be transmitted over multiple paths, i.e. as retransmissions or just to increase redundancy. MPTCP uses TCP underneath for network compatibility; TCP ensures in-order, reliable delivery. TCP adds its own sequence numbers to the segments; these are used to detect and retransmit lost packets.

o  Congestion Control: This function manages congestion control across the subflows. As specified, this congestion control algorithm must ensure that an MPTCP connection does not unfairly take more bandwidth than a single path TCP flow would take at a shared bottleneck. An algorithm to support this is specified in [6].

These functions fit together as follows. The Path Management looks after the discovery (and if necessary, initialisation) of multiple paths between two endpoints.
The Packet Scheduler then receives packets from the application for the network and does the necessary operations on them (such as adding a data-level sequence number) before sending to a subflow. The subflow then adds its own sequence number, acks, and passes them to the network. The receiving subflow re-orders data and passes it to the MPTCP component, which performs connection level re-ordering, removes the segment boundaries and sends it to the application. Finally, the congestion control component exists as part of the packet scheduling, in order to schedule which packets should be sent at what rate on which subflow.

5.  High-Level Design Decisions

There is seemingly a wide range of choices when designing a multipath extension to TCP. However, the goals as discussed earlier in this document constrain the possible solutions, leaving relatively little choice in many areas. Here, we outline high-level design choices that draw from the architectural basis discussed earlier in Section 3, and their implications for the MPTCP design.

5.1.  Sequence Numbering

MPTCP uses two levels of sequence spaces: a connection level sequence number, and another sequence number for each subflow. This permits connection-level segmentation and reassembly, and retransmission of the same part of connection-level sequence space on different subflow-level sequence space.

The alternative approach would be to use a single connection level sequence number, which gets sent on multiple subflows. This has two problems: first, the individual subflows will appear to the network as TCP sessions with gaps in the sequence space; this in turn may upset certain middleboxes such as intrusion detection systems, or certain transparent proxies, and would go against the network compatibility goal.
Second, the sender cannot attribute packet losses or receptions to the correct path when the same packet is sent on multiple paths, in the case of retransmissions.

The sender must be able to tell the receiver how to reorder the data, for delivery to the application. The sender does so by telling the receiver how subflow-level data (carrying subflow sequence numbers) maps at connection level, which we refer to as Data Sequence Mapping. This mapping takes the form (data seq, subflow seq, length), i.e. for a given number of bytes (the length), the subflow sequence space beginning at the given sequence number maps to the connection-level sequence space (beginning at the given data seq number). This architecture does not mandate a mechanism for signalling such information, and it could conceivably have various sources.

One option would be to use existing fields in the TCP segment (such as subflow seqno, length) and only add the data sequence number to each segment, for instance as a TCP option. This is, however, vulnerable to middleboxes that resegment or assemble data, since there is no specified behaviour for coalescing TCP options. If one signalled (data seqno, length), this would still be vulnerable to middleboxes that coalesce segments and do not correctly coalesce the options. Because of these potential issues, the current specification of MPTCP mandates that the full mapping should be sent to the other end. To reduce the overhead, it would be permissible for the mapping to be sent periodically and cover more than a single segment. It could also be excluded entirely in the case of a connection before more than one subflow is used, where the data-level and subflow-level sequence space is the same.

5.2.  Reliability

Under normal behaviour, MPTCP can use the data sequence mapping and subflow ACKs to decide when a connection-level segment was received. This has certain implications on end-to-end semantics.
First, it means that once a packet is acked at subflow level it cannot be discarded in the re-order buffer at the connection level. Secondly, unlike in standard TCP, a receiver cannot simply drop out-of-order segments if needed (for instance, due to memory pressure).

Furthermore, it is possible to conceive of some cases where connection-level acknowledgements could improve robustness. Consider a subflow traversing a transparent proxy: if the proxy acks a segment and then crashes, the sender will not retransmit the lost segment on another subflow, as it thinks the segment has been received. The connection grinds to a halt despite having other working subflows, and the sender would be unable to determine the cause of the problem.

Finally, as an optimisation, it may be feasible for a connection-level acknowledgement to be transmitted over the shortest RTT path, potentially reducing send buffer requirements (see Section 5.3).

Therefore, to provide a fully robust multipath TCP solution, MPTCP SHOULD feature explicit connection-level acknowledgements.

Regarding retransmissions, it MUST be possible for a packet to be retransmitted on a different subflow to that on which it was originally sent. This is one of MPTCP's core goals, in order to maintain integrity during temporary or permanent subflow failure, and this is enabled by the dual sequence number space.

The scheduling of retransmissions will have significant impact on MPTCP user experience. The current MPTCP specification suggests that data outstanding on subflows that have timed out should be rescheduled for transmission on different subflows. This behaviour aims to minimize disruption when a path breaks, and uses the first timeout as the indicator. More conservative versions would be to use second or third timeouts for the same packet.
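The way the dual sequence space enables retransmission on another subflow can be sketched as follows. This is a toy model under stated assumptions, not the MPTCP wire format or any real implementation: the `Mapping` and `Subflow` classes and the `reschedule_on_timeout` helper are hypothetical names, sequence numbers are plain integers, and acknowledgements are not modelled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Mapping:
    """One data sequence mapping: (data seq, subflow seq, length)."""
    data_seq: int      # connection-level sequence number of the first byte
    subflow_seq: int   # subflow-level sequence number of the first byte
    length: int        # number of bytes covered by the mapping


@dataclass
class Subflow:
    name: str
    next_seq: int = 0                                 # next free subflow sequence number
    outstanding: list = field(default_factory=list)   # mappings sent but not yet acked

    def send(self, data_seq, length):
        """Place `length` bytes of connection-level data on this subflow."""
        mapping = Mapping(data_seq, self.next_seq, length)
        self.next_seq += length
        self.outstanding.append(mapping)
        return mapping


def reschedule_on_timeout(failed, backup):
    """Re-send data outstanding on a timed-out subflow over another subflow.

    The connection-level range (data_seq, length) is unchanged; only the
    subflow-level sequence numbers differ. The failed subflow keeps its
    outstanding list, since it may still retransmit in its own sequence space.
    """
    return [backup.send(m.data_seq, m.length) for m in failed.outstanding]


a = Subflow("A1-B1")
b = Subflow("A2-B2", next_seq=5000)
a.send(0, 1000)
a.send(1000, 1000)
b.send(2000, 1000)

# Subflow A1-B1 hits its first timeout: move its data to A2-B2.
resent = reschedule_on_timeout(a, b)
```

In this sketch the two ranges starting at data sequence numbers 0 and 1000 reappear on the second subflow at subflow sequence numbers 6000 and 7000, so the receiver can still place them correctly at connection level. Waiting for a second or third timeout, as the more conservative variants suggest, would only change when `reschedule_on_timeout` is called.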
When packet loss is detected and corrected with fast retransmit, retransmission on different subflows may still be desirable in certain cases, for instance to reduce the receive buffer requirements.

However, in all cases with retransmissions on different subflows, the lost packets SHOULD still be sent on the path that lost them. This is currently believed to be necessary to maintain subflow integrity, as per the network compatibility goal. By doing this, throughput will be wasted, and it is unclear at this point what the optimal retransmit strategy is.

5.3.  Buffers

Receive Buffer: ideally, a subflow failing should not affect the throughput of other working subflows. However, the receive buffer has limited size: if a flow times out, the other subflows will quickly fill the receive buffer with out-of-order data, and will stall. Hence, receive buffer sizing is important for both robustness and throughput.

The smallest receive buffer we need to avoid stalling under any circumstances is max(RTO)*sum(BW). This is, for most multipath connections, too expensive. A more reasonable size is proportional to max(RTT)*sum(BW) which ensures subflows don't stall when fast retransmit works. Also, depending on how the implementation behaves, an additional sum(RTT*BW) might be needed for the individual re-order buffers of the TCP subflows.

Send Buffer: the smallest send buffer we need is sum(BDP) across all paths; this is to hold data until it's acked at subflow level. If we didn't use a subflow level ack, and relied on a data-level ack, the send buffer would need to be as big as the receive buffer of the connection, max(RTT)*sum(BW).

In practice, the senders will be web servers and receivers will be desktops or mobile servers. The send buffer size matters particularly for servers, which must be able to maintain a large number of ongoing connections.

5.4.  Signalling

Since MPTCP will use regular TCP streams as its transport mechanism, an MPTCP connection will also begin as a single TCP stream. Nevertheless, it must signal to the peer that it supports MPTCP and wishes to use it on this connection. As such, a TCP Option will be used to transmit this information, since this is the established mechanism for indicating additional functionality on a TCP session.

On top of this, however, signalling is required during the operation of an MPTCP session, such as that for reassembly for multiple subflows, and for informing the other endpoint about potential other available addresses. The architecture does not mandate the format in which this signalling should be transmitted. The current MPTCP protocol proposal suggests the use of TCP options for this signalling; another approach would be to embed such information in the payload, and use type-length-value (TLV) encoding to separate signalling and payload data.

5.5.  Path Management

Currently, the network does not expose multiple paths between endpoints. Multipath TCP will use multiple addresses at one or both endpoints to get different paths to the destination. The hope is that these paths, whilst not necessarily entirely non-overlapping, will be sufficiently disjoint to allow multipath to achieve improved throughput and robustness.

Multiple different (source, destination) address pairs will thus be used as path selectors. Each path will be identified by a TCP 4-tuple (i.e. source address, destination address, source port, destination port), thus allowing the extension of MPTCP to use such 4-tuples as path selectors if the network will route different ports over different paths (which may be the case with technologies such as ECMP).
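The difference between address pairs and 4-tuples as path selectors can be illustrated with a short sketch. The `PathSelector` type is a hypothetical name used only for this illustration, and the addresses are taken from the IPv4 documentation ranges; whether two selectors that share an address pair actually follow different routes depends entirely on the network.

```python
from typing import NamedTuple


class PathSelector(NamedTuple):
    """A TCP 4-tuple identifying one subflow's path."""
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int

    def address_pair(self):
        """The (source, destination) address pair, ignoring ports."""
        return (self.src_addr, self.dst_addr)


s1 = PathSelector("192.0.2.1", "198.51.100.1", 50000, 80)
s2 = PathSelector("192.0.2.1", "198.51.100.1", 50001, 80)    # same addresses, new source port
s3 = PathSelector("203.0.113.7", "198.51.100.1", 50000, 80)  # second source address

distinct_selectors = {s1, s2, s3}
distinct_address_pairs = {s.address_pair() for s in distinct_selectors}

print(len(distinct_selectors), "4-tuples over", len(distinct_address_pairs), "address pairs")
```

Here three 4-tuples map onto only two address pairs. Treating s1 and s2 as separate paths is only useful if the network routes different ports differently, as may happen with ECMP.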
To increase the chance of successfully setting up additional subflows (such as when one end is behind a firewall, NAT, or other restrictive middlebox), either endpoint should be able to add new subflows to an MPTCP connection.

The modularity of path management will permit alternative mechanisms to be employed if appropriate in the future.

5.6. Connection Identification

Because the subflows of an MPTCP connection may be added and removed over time, no single subflow can be relied upon to identify the connection for its whole lifetime. Therefore, each MPTCP connection should have a connection identifier at each endpoint, which is locally unique within that endpoint. In many ways, this is analogous to a port number in regular TCP. The manifestation and purpose of such an identifier are outside the scope of this architecture document.

Legacy applications will not, however, have access to this identifier, and in such cases an MPTCP connection will be identified by the 5-tuple of the first TCP subflow. It is outside the scope of this document, however, to define the behaviour of the MPTCP implementation if the first TCP subflow later fails. If there are legacy applications that make assumptions about the continued existence of the initial address pair, their behaviour could be disrupted by carrying on regardless. It is expected, however, that this is a very small, possibly negligible, set of applications. In the case of applications that have specifically asked to be bound to a particular address or interface, MPTCP will not be used.

Since the requirements of applications are not clear at this stage, it is as yet unconfirmed what the best behaviour is. It will be an implementation-specific solution, and as such the behaviour is expected to be chosen by implementors once more research has been undertaken to determine its impact.

5.7. Network Layer Compatibility

MPTCP's modifications remain at the transport layer, although some knowledge of the underlying network layer is required. MPTCP MUST work with IPv4 and IPv6 interchangeably, i.e. one MPTCP connection may operate over both IPv4 and IPv6 networks.
5.8. Congestion Control

As already documented in network-layer compatibility requirements, the congestion control algorithms used by an MPTCP implementation must not harm other legacy users on shared bottlenecks. To achieve this, the congestion control algorithms in use on each subflow must be coupled in some way - a proposal for this is given in [6].

6. Summary

This document has provided a summary of the components that have been identified to provide a Multipath TCP solution, and described the high-level design decisions that have been used as a basis of the MPTCP specification.

The suite of drafts that specify a complete MPTCP implementation, on top of this architectural overview, is as follows:

o  A specification of the MPTCP protocol [3], describing the on- and off-the-wire differences to regular TCP.

o  A specification of a coupled congestion control algorithm [6], that can be applied to the above protocol while meeting the goals for such an algorithm as specified in this document.

o  A document [7] that builds upon the application compatibility issues discussed in this document, explaining in more detail what changes, if any, an application may experience through the use of MPTCP. That document also provides a proposed API through which an application can influence the behaviour of the MPTCP protocol, as specified in the above drafts.

7. Security Considerations

Please see [14] for a threat analysis of Multipath TCP. The threats analysed in that companion document are addressed as appropriate in the protocol design [3].

8. Interactions with Applications

Interactions with applications - including, but not limited to, performance changes that may be expected, semantic changes, and new features that may be requested of an API - are presented in [7].

9.
Interactions with Middleboxes

As discussed in Section 2.2, it is a goal of MPTCP to be deployable today and thus compatible with the majority of middleboxes. This section summarises the issues that may arise with NATs, firewalls, proxies, intrusion detection systems, and other middleboxes that, if not considered in the protocol design, may hinder its deployment.

This section is intended primarily as a description of options and considerations only. Protocol-specific solutions to these issues will be given in the companion documents.

Multipath TCP will be deployed in a network that no longer provides just basic datagram delivery. A myriad of middleboxes are deployed to address various perceived problems with the Internet protocols: NATs primarily mitigate address space shortage [11], Performance Enhancing Proxies (PEPs) optimize TCP for different link characteristics [13], firewalls [12] and intrusion detection systems try to block malicious content from reaching a host, and traffic normalizers [15] ensure a consistent view of the traffic stream to IDSes and hosts.

All these middleboxes optimize current applications at the expense of future applications. In effect, future applications must mimic existing ones if they want to be deployed. Further, the precise behaviour of all these middleboxes is not clearly specified, and implementation errors make matters worse, raising the bar for the deployment of new technologies.

The following list of middlebox classes documents behaviour that could impact the use of MPTCP. This list is used in [3] to describe the features of the MPTCP protocol that are used to mitigate the impact of these middlebox behaviours.

o  NATs: Network Address Translators decouple the endpoint's local IP address from that which is seen in the wider Internet when the packets are transmitted through a NAT. This adds complexity, and reduces the chances of success, when signalling IP addresses.
o  PEPs: Performance Enhancing Proxies aim to improve the performance of protocols over low-performance (e.g. high latency or high error rate) links. As such, they may "split" a TCP connection, and behaviour such as proactive ACKing may occur. As with NATs, it is no longer guaranteed that one endpoint is communicating directly with another.

o  Traffic Normalizers: These aim to eliminate ambiguities and potential attacks at the network level, and amongst other things are unlikely to permit holes in sequence space.

o  TCP Options: many middleboxes are in a position to drop packets with unknown TCP options, or strip those options from the packets.

o  Segmentation/Coalescing: middleboxes (or even something as close to the end host as TCP Segmentation Offloading) may change the packet boundaries from those which the sender intended, either by splitting packets or by coalescing them together. This has two major impacts: we cannot guarantee where a packet boundary will be, and we cannot say for sure what a middlebox will do with TCP options in these cases (they may be repeated, dropped, or sent only once).

o  Firewalls: on top of preventing incoming connections, firewalls may also attempt additional protection such as sequence number randomization.

o  Intrusion Detection Systems: IDSs may look for traffic patterns to protect a network, and may produce false positives with MPTCP and drop the connections during normal operation. Future MPTCP-aware middleboxes will require the ability to correlate the various paths in use.

10. Acknowledgements

Alan Ford, Costin Raiciu and Sebastien Barre are supported by Trilogy (http://www.trilogy-project.org), a research project (ICT-216372) partially funded by the European Community under its Seventh Framework Program. The views expressed here are those of the author(s) only.
The European Commission is not liable for any use that may be made of the information in this document.

11. Contributors

The authors would like to acknowledge the contributions of Mark Handley and Bryan Ford to this document.

12. IANA Considerations

None.

13. References

13.1. Normative References

[1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

13.2. Informative References

[2]  Wischik, D., Handley, M., and M. Bagnulo Braun, "The Resource Pooling Principle", ACM SIGCOMM CCR vol. 38 num. 5, pp. 47-52, October 2008.

[3]  Ford, A., Raiciu, C., and M. Handley, "TCP Extensions for Multipath Operation with Multiple Addresses", draft-ietf-mptcp-multiaddressed-00 (work in progress), June 2010.

[4]  Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981.

[5]  Stewart, R., "Stream Control Transmission Protocol", RFC 4960, September 2007.

[6]  Raiciu, C., Handley, M., and D. Wischik, "Coupled Multipath-Aware Congestion Control", draft-raiciu-mptcp-congestion-01 (work in progress), March 2010.

[7]  Scharf, M. and A. Ford, "MPTCP Application Interface Considerations", draft-scharf-mptcp-api-01 (work in progress), March 2010.

[8]  Carpenter, B. and S. Brim, "Middleboxes: Taxonomy and Issues", RFC 3234, February 2002.

[9]  Carpenter, B., "Internet Transparency", RFC 2775, February 2000.

[10] Ford, B. and J. Iyengar, "Breaking Up the Transport Logjam", ACM HotNets, October 2008.

[11] Srisuresh, P. and K. Egevang, "Traditional IP Network Address Translator (Traditional NAT)", RFC 3022, January 2001.

[12] Freed, N., "Behavior of and Requirements for Internet Firewalls", RFC 2979, October 2000.

[13] Border, J., Kojo, M., Griner, J., Montenegro, G., and Z. Shelby, "Performance Enhancing Proxies Intended to Mitigate
Link-Related Degradations", RFC 3135, June 2001.

[14] Bagnulo, M., "Threat Analysis for Multi-addressed/Multi-path TCP", draft-ietf-mptcp-threat-02 (work in progress), March 2010.

[15] Handley, M., Paxson, V., and C. Kreibich, "Network Intrusion Detection: Evasion, Traffic Normalization, and End-to-End Protocol Semantics", Usenix Security 2001, 2001.

Appendix A. Implementation Architecture

This section provides suggestions for an architecture to implement an extensible, modular multipath transport protocol.

A.1. Functional Separation

This section describes a generic view of the internal implementation of a Multipath TCP, through which the technical components specified in the companion documents can fit together. It shows how an implementation could be built that permits extensibility between components without changing the external representation.

We first show the functional decomposition of an MPTCP solution that is completely contained in the transport layer. That solution is described in more detail in [3]. Then we generalize the approach to allow good extensibility of that solution.

A.1.1. Application to default MPTCP protocol

Although, in the default approach, MPTCP is fully contained in the transport layer, it can still be divided into two main modules. One manages the scheduling of packets as well as congestion control. The other manages the control of paths. The interface between the two is a Path Index. As shown in Figure 8, the Path Manager announces to the MultiPath Scheduler which paths can be used, by means of path indices, and maintains the mapping between each path index and the particular action that it must apply to use the path (an example of such a mapping is in Table 1).
In the case of the built-in Path Manager, the action is to replace an address/port pair with another one, in such a way that another path is used across the Internet to forward that packet.

       Control plane <--  | --> Data plane
+---------------------------------------------------------------+
|                   Multipath Scheduler (MPS)                   |
+---------------------------------------------------------------+
     ^                    |                  |
     |                    |           [A1,B1,|pA1,pB1]
     |For conn_id         |                  |
     |                    |           +-------------+
     |Paths 1->4 can be   |           | Data packet |<--Path idx:3
     |used.               |           +-------------+   attached
     |                    |                  |          by MPS
     |                    |                  V
+--------------------------------------------\------------------+
|        Path Manager (PM)                    \[A1,B1]->[A1,B2] |
+--------------------------------------------------\------------+
          /                  \   |                   \
/------------------------------\|  /"\   /"\   /"\   /"\
|      rewriting table:        ||  | |   | |   | |   | |
| Subflow id  <-->  network_id ||  | |   | |   | |   | |
|                              ||  | |   | |   | |   | |
|      [see table below]       ||  | |   | |   | |   | |
|                              ||  \./   \./   \./   \./
+------------------------------+| path1 path2 path3 path4

Figure 8: Functional separation of MPTCP in the transport layer

The MultiPath Scheduler only deals with abstract paths, represented by numbers. It only sees one address pair throughout the communication, which we call the connection identifier. However, the MultiPath Scheduler must be able to perform per-subflow congestion control, and thus to distinguish between the subflows. This leads to the definition of a subflow identifier, which consists of the usual transport identifier extended with the path index.

The following options, described in [3], are managed by the MultiPath Scheduler.

o  MULTIPATH CAPABLE (MPC): Tell the peer that we support MPTCP. Note that the MPC option also holds a token, which is necessary only if the built-in Path Manager is used. In the next section we describe the generalized case, where the token can be ignored by the receiver if another path manager is used.
o  DATA SEQUENCE NUMBER (DSN): Identify the position of a set of bytes in the meta-flow.

o  DATA FIN (DFIN): Terminate a meta-flow.

An implementation MUST use those options even if a Path Manager other than the default one is implemented.

The Path Manager applies a particular technology to give the MPS the possibility to use several paths. The built-in MPTCP Path Manager uses multiple IP addresses as its means to influence the forwarding of packets through the Internet.

When the MPS starts a new connection, the PM chooses a token that will be used to identify the connection. This is necessary to allow the PM to apply the correct path index to incoming packets. An example mapping table is given hereafter:

+-----------------+---------------+---------+-----------------+
|  connection id  |  subflow id   |  token  |   Network id    |
+-----------------+---------------+---------+-----------------+
|                 |               | token_1 |                 |
|                 |               | token_1 |                 |
|                 |               | token_1 |                 |
|                 |               | token_1 |                 |
|                 |               | token_2 |                 |
|                 |               | token_2 |                 |
+-----------------+---------------+---------+-----------------+

Table 1: Example mapping table for built-in PM

Table 1 shows an example where two connections are ongoing. One is identified by token_1, the other by token_2. Since addresses are rewritten by the path manager, packets are attached to the right connection by means of the token, which is exchanged at connection establishment and subflow establishment and is remembered from then on. The first column holds the information that is exposed to the applications, while the last column shows the information that is actually written in packets that travel through the network. We note that, in addition to the addresses, ports can be rewritten, which contributes to supporting NATs. The table also shows the role of the token, which is to attach various combinations of ports and addresses to a single connection.
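
One way to picture the mapping described above is as a pair of lookups keyed by token and path index. The sketch below is an illustration only, not the data structure used in [3]: the PathManagerTable class, its method names, the convention that path index 1 denotes the initial subflow, and the placeholder identifiers such as "A1" and "B2" are all assumptions. For brevity, every network id is written from the local host's point of view.

```python
from dataclasses import dataclass, field


@dataclass
class PathManagerTable:
    """Hypothetical in-memory form of the built-in PM's mapping table."""
    # (token, path index) -> network id actually written into packets
    outgoing: dict = field(default_factory=dict)
    # network id seen on the wire -> (token, path index)
    incoming: dict = field(default_factory=dict)
    # token -> connection id exposed to the application
    connections: dict = field(default_factory=dict)

    def open_connection(self, token, connection_id):
        self.connections[token] = connection_id
        # Assumption: path index 1 is the initial subflow, which carries
        # the connection id's own addresses and ports unchanged.
        self.add_subflow(token, 1, connection_id)

    def add_subflow(self, token, path_index, network_id):
        if token not in self.connections:
            raise KeyError(f"unknown token {token!r}")
        self.outgoing[(token, path_index)] = network_id
        self.incoming[network_id] = (token, path_index)

    def rewrite(self, token, path_index):
        """Outgoing direction: network id to write for a tagged packet."""
        return self.outgoing[(token, path_index)]

    def classify(self, network_id):
        """Incoming direction: recover (connection id, path index)."""
        token, path_index = self.incoming[network_id]
        return self.connections[token], path_index


table = PathManagerTable()
conn = ("A1", "B1", 5000, 80)              # what the application sees
table.open_connection("token_1", conn)
table.add_subflow("token_1", 2, ("A1", "B2", 5000, 80))
table.add_subflow("token_1", 3, ("A2", "B1", 5001, 80))

print(table.rewrite("token_1", 2))
print(table.classify(("A2", "B1", 5001, 80)))
```

The application-facing connection id never changes, while the network id varies per subflow. The token is what lets a subflow with entirely different addresses and ports be attached to the existing connection.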
The token is specific to the built-in path manager, and can be ignored if another path manager is used.

An implementation of the built-in path manager MUST implement the following options (defined in more detail in [3]):

o  Add Address (ADDR): Announce a new address we own

o  Remove Address (REMADDR): Withdraw a previously announced address

o  Join Connection (JOIN): Attach a new subflow to the current connection

Those options form the default MPTCP Path Manager, which is based on declaring IP addresses and carries control information in TCP options. An implementation of Multipath TCP can use any Path Manager, but it MUST be able to fall back to the default PM in case the other end does not support the custom PM. Alternative Path Managers may be specified in separate documents in the future.

A.1.2. Generic architecture for MPTCP

Now that the functional decomposition has been shown for MPTCP with the built-in Path Manager, we show how that architecture can be generalized to allow the implementation of other Path Managers for MPTCP. A general overview of the architecture is provided in Figure 9. The Multipath Scheduler (MPS) learns about the number of available paths through notifications received from the Path Manager (PM). From the point of view of the Multipath Scheduler, a path is just a number, called a Path Index. Notifications from the PM to the MPS MAY contain supporting information about the paths, if relevant, so that the MPS can make more intelligent decisions about where to route traffic.

When the Multipath Scheduler initiates a communication to a new host, it can only send the packets to the default path. But since the Path Manager is layered below the MPS, it can detect that a new communication is happening, and tell the MPS about the other paths it knows about.
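
The start-up sequence just described can be sketched as follows. This is a toy model under stated assumptions, not an interface defined by this document or by [3]: the class and method names are hypothetical, the round-robin choice of path is an arbitrary stand-in for a real scheduling policy, and the action labels "xxxxx", "yyyyy" and "zzzzz" are placeholders.

```python
class MultipathScheduler:
    """Hypothetical MPS: it only ever sees path indices."""

    def __init__(self):
        self.paths = {}   # path index -> optional supporting information
        self._next = 0

    def paths_added(self, indices, info=None):
        """Notification from the PM that these indices may be used."""
        for idx in indices:
            self.paths[idx] = (info or {}).get(idx, {})

    def path_removed(self, idx):
        """Notification from the PM that a path is no longer served."""
        self.paths.pop(idx, None)

    def tag_for_next_packet(self):
        """Return a path index, or None (untagged) before any announcement.

        Round-robin is an assumption made only to keep the example short.
        """
        if not self.paths:
            return None
        order = sorted(self.paths)
        idx = order[self._next % len(order)]
        self._next += 1
        return idx


class PathManager:
    """Hypothetical PM layered below the MPS."""

    def __init__(self, scheduler, actions):
        self.scheduler = scheduler
        self.actions = actions        # path index -> action label
        self.announced = False

    def send(self, payload, path_index=None):
        if path_index is None:
            # Untagged packet: forward on the default path, and treat it
            # as the trigger for announcing the paths the PM knows about.
            if not self.announced:
                self.scheduler.paths_added(self.actions.keys())
                self.announced = True
            return ("default", payload)
        # Tagged packet: apply the action mapped to its path index.
        return (self.actions[path_index], payload)


mps = MultipathScheduler()
pm = PathManager(mps, {1: "xxxxx", 2: "yyyyy", 3: "zzzzz"})
sent = [pm.send(b"data", mps.tag_for_next_packet()) for _ in range(4)]
print([action for action, _ in sent])
```

The first packet leaves untagged on the default path; only after the PM has announced its paths can the MPS begin tagging packets with path indices.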
       Control plane <--  | --> Data plane
+---------------------------------------------------------------+
|                   Multipath Scheduler (MPS)                   |
+---------------------------------------------------------------+
     ^                    |                  |
     |                    |           [A1,B1,|pA1,pB1]
     |                    |                  |
     |Announcing new      |           +-------------+
     |paths. (referred    |           | Data packet |<--Path idx:3
     |to as path indices) |           +-------------+   attached
     |                    |                  |          by MPS
     |                    |                  V
+--------------------------------------------\------------------+
|        Path Manager (PM)                    \__________zzzzz  |
+--------------------------------------------------------\------+
        /                  \  |                           \
/---------------------------\ |  /"\   /"\   /"\
| subflow_id       Action   | |  | |   | |   | |
|                  xxxxx    | |  | |   | |   | |
|                  yyyyy    | |  \./   \./   \./
|                  zzzzz    | | path1 path2 path3
+---------------------------+

Figure 9: Overview of MPTCP architecture

From then on, it is possible for the MPS to associate a Path Index with its packets, so that the Path Manager can map this Path Index to a particular action (see table in the lower left part of Figure 9). The particular action depends on the network mechanism used to select a path. Examples are address rewriting, tunnelling or setting a path selector value inside the packet. Note that the Path Index is not supposed to be written inside the packet, but instead associated with it, internally to the implementation.

The applicability of the architecture is not limited to the MPTCP protocol. While we define in this document an MPTCP MPS (MPTCP Multipath Scheduler), other Multipath Schedulers can be defined. For example, if an appropriate socket interface is designed, applications could behave as a Multipath Scheduler and decide where to send any particular data. In this document we concentrate on the MPTCP case, however.

A.2.
PM/MPS interface

The minimal set of requirements for a Path Manager is as follows:

o  Outgoing untagged packets: Any outgoing packet flowing through the Path Manager has either been tagged with a path index by the MPS or left untagged. If it is untagged, the packet is sent normally to the Internet, as if no multipath support were present. Untagged packets can be used to trigger a path discovery procedure; that is, a Path Manager can listen to untagged packets and decide at some time to find out whether any path other than the default one is usable for the corresponding host pair. Note that any other criteria could be used to decide when to start discovering available paths. Note also that MPS scheduling will not be possible until the Path Manager has notified the MPS of the available paths. The PM is thus the first entity to come into action.

o  Outgoing tagged packets: The Path Manager maintains a table mapping path indices to actions. The action is the operation that allows a particular path to be used. Examples of possible actions are route selection, interface selection or packet transformation. When the PM sees a packet tagged with a path index, it looks up its table to find the appropriate action for that packet. The tag is purely local. It is removed before the packet is transmitted.

o  Incoming packets: A Path Manager MUST ensure that each incoming path is mapped unambiguously to exactly one outgoing path. Note that this requirement implies that the same number of incoming and outgoing paths must be established. Moreover, a PM MUST tag packets arriving on an incoming path with the same Path Index as the one used for the corresponding outgoing path. This is necessary for MPTCP to know which outgoing path is acknowledged by an incoming packet.

o  Module interface: A PM MUST be able to notify the MPS about the number of available paths. Such notifications MUST contain the path indices that are legal for use by the MPS.
   If the PM decides to stop providing service for one path, it MUST notify the MPS about the path removal. Additionally, a PM MAY provide complementary path information when available, such as link quality or preference level.

Appendix B. Changelog

B.1. Changes since draft-ietf-mptcp-architecture-00

o  Added middlebox compatibility discussion (Section 9).

o  Clarified path identification (TCP 4-tuple) in Section 5.5.

o  Added brief scenario and diagram to Section 1.3.

Authors' Addresses

Alan Ford (editor)
Roke Manor Research
Old Salisbury Lane
Romsey, Hampshire SO51 0ZN
UK

Phone: +44 1794 833 465
Email: alan.ford@roke.co.uk

Costin Raiciu
University College London
Gower Street
London WC1E 6BT
UK

Email: c.raiciu@cs.ucl.ac.uk

Sebastien Barre
Universite catholique de Louvain
Pl. Ste Barbe, 2
Louvain-la-Neuve 1348
Belgium

Phone: +32 10 47 91 03
Email: sebastien.barre@uclouvain.be

Janardhan Iyengar
Franklin and Marshall College
Mathematics and Computer Science
PO Box 3003
Lancaster, PA 17604-3003
USA

Phone: 717-358-4774
Email: jiyengar@fandm.edu