Network Working Group                                          T. Pauly
Internet-Draft                                               E. Kinnear
Intended status: Informational                                Apple Inc.
Expires: December 27, 2018                                 June 25, 2018


                    TCP Encapsulation Considerations
                    draft-pauly-tcp-encapsulation-00

Abstract

   Network protocols other than TCP, such as UDP, are often blocked or
   suboptimally handled by network middleboxes.  One strategy that
   applications can use to continue to send non-TCP traffic on such
   networks is to encapsulate datagrams or messages within a TCP
   stream.  However, encapsulating datagrams within TCP streams can
   lead to performance degradation.  This document provides guidelines
   for how to use TCP for encapsulation, a summary of performance
   concerns, and some suggested mitigations for these concerns.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 27, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Motivations for Encapsulation
     2.1.  UDP Blocking
     2.2.  UDP NAT Timeouts
   3.  Encapsulation Formats
     3.1.  Multiplexing Flows
   4.  Deployment Considerations
   5.  Performance Considerations
     5.1.  Loss Recovery
       5.1.1.  Concern
       5.1.2.  Mitigation
     5.2.  Bufferbloat
       5.2.1.  Concern
       5.2.2.  Mitigation
     5.3.  Head of Line Blocking
       5.3.1.  Concern
       5.3.2.  Mitigation
   6.  Security Considerations
   7.  IANA Considerations
   8.  Informative References
   Authors' Addresses
1.  Introduction

   TCP streams are sometimes used as a mechanism for encapsulating
   datagrams or messages, which is referred to in this document as
   "TCP encapsulation".  Encapsulation may be used to transmit data
   over networks that block or suboptimally handle non-TCP traffic.
   The current motivations for using encapsulation generally revolve
   around the treatment of UDP packets (Section 2).

   Implementing a TCP encapsulation strategy consists of mapping
   datagram messages into a stream protocol, often with a length-value
   record format (Section 3).  While these formats are described here
   as applying to encapsulating datagrams in a TCP stream, the formats
   are equally suited to encapsulating datagrams within any stream
   abstraction.  For example, the same format may be used for both raw
   TCP streams and TLS streams running over TCP.

2.  Motivations for Encapsulation

   The primary motivations for enabling TCP encapsulation that will be
   explored in this document relate mainly to the treatment of UDP
   packets on a given network.  UDP can be used for real-time network
   traffic, as a mechanism for deploying non-TCP transport protocols,
   and as a tunneling protocol that is compatible with Network Address
   Translators (NATs).

2.1.  UDP Blocking

   Some network middleboxes block any IP packets that do not appear to
   be used for HTTP traffic, either as a security mechanism to block
   unknown traffic or as a way to restrict access to whitelisted
   services.  Network applications that rely on UDP to transmit data
   will be blocked by these middleboxes.  In this case, the application
   can attempt to use TCP encapsulation to transmit the same data over
   a TCP stream.

2.2.  UDP NAT Timeouts

   Other networks may not block non-TCP traffic altogether, but may
   instead make other protocols unsuitable for use.  For example, many
   Network Address Translation (NAT) devices will maintain TCP port
   mappings for long periods of time, since the end of a TCP stream can
   be detected by the NAT.  Since UDP packet flows do not signal when
   no more packets will be sent, NATs often use short timeouts for UDP
   port mappings.  Thus, applications can attempt to use TCP
   encapsulation when long-lived flows are required on networks with
   NATs.

3.  Encapsulation Formats

   The simplest approach for encapsulating datagram messages within a
   TCP stream is to use a length-value record format: a header
   consisting of a length field, followed by the datagram message
   itself.  For example, if an encapsulation protocol uses a 16-bit
   length field (allowing up to 65535 bytes of datagram payload), it
   will use a format like the following:

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             Length            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                       Datagram Payload                       ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The length field may be longer or shorter depending on the needs of
   the protocol.  16 bits is most appropriate when encapsulating
   datagrams that would otherwise be sent directly in IP packets, since
   the payload length field for an IP header is also 16 bits.  The
   protocol must specify whether the length field includes its own size
   in the length of the entire record, or describes only the length of
   the payload field.
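   As a purely illustrative sketch (not part of any specified
   protocol), the following Python fragment frames a datagram with a
   16-bit length header that counts only the payload; the function name
   is hypothetical, and a protocol that counts the header as well (as
   RFC 8229 does) would add 2 to the encoded value:

      import struct

      def encapsulate(datagram: bytes) -> bytes:
          # Prefix the datagram with a 16-bit network-order length.
          # This variant excludes the header itself from the length.
          if len(datagram) > 0xFFFF:
              raise ValueError("datagram too large for 16-bit length")
          return struct.pack("!H", len(datagram)) + datagram

   The framed records are then written back-to-back onto the TCP (or
   TLS-over-TCP) stream in the order the datagrams were generated.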
   The protocol used for encapsulating IKE and ESP packets in TCP
   [RFC8229] does include the length field itself in the length of the
   record.  This may make it slightly easier for implementations to
   parse records, since they will not need to add the length of the
   length field when finding record offsets within a stream.

3.1.  Multiplexing Flows

   Since TCP encapsulation is used to avoid failures caused by NATs or
   firewalls, some implementations re-use one TCP port or one
   established TCP stream for multiple kinds of encapsulated traffic.
   Using a single port or stream allows re-use of NAT bindings and
   reduces the chance that a firewall will block some flows, but not
   others.

   If multiple kinds of traffic are multiplexed on the same listening
   TCP port, individual streams opened to that port need to be
   differentiated.  This may require adding a one-time header that is
   sent on the stream to indicate the type of encapsulated traffic that
   will follow.  For example, TCP encapsulated IKE [RFC8229] uses a
   stream prefix to differentiate its encapsulation strategy from
   proprietary Virtual Private Network (VPN) protocols.

   Multiplexing multiple kinds of datagrams, or independent flows of
   datagrams, over a single TCP stream requires adding a per-record
   type field or marker to the encapsulation record format.  For ease
   of parsing records, this value should be placed after the length
   field of the record format.  For example, various ESP packet flows
   are identified by the four-byte Security Parameter Index (SPI) that
   comprises the first bytes of the datagram payload, while IKE packets
   in the same TCP encapsulated stream are differentiated by using all
   zeros for the first four bytes.
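   As an illustration of this style of demultiplexing, the following
   Python sketch reads one record framed in the RFC 8229 style (16-bit
   length that includes the header itself) from a file-like object
   wrapping the TCP connection (for example, one returned by
   socket.makefile("rb")) and classifies it by its first four bytes.
   The helper names are hypothetical, and the sketch assumes each
   record carries at least four payload bytes:

      import struct

      def read_exact(stream, count):
          # Read exactly 'count' bytes, even if the data arrives split
          # across multiple TCP segments.
          data = b""
          while len(data) < count:
              chunk = stream.read(count - len(data))
              if not chunk:
                  raise EOFError("stream closed mid-record")
              data += chunk
          return data

      def read_record(stream):
          # 16-bit length field that includes its own two bytes.
          (length,) = struct.unpack("!H", read_exact(stream, 2))
          payload = read_exact(stream, length - 2)
          (marker,) = struct.unpack("!I", payload[:4])
          # An all-zero marker indicates an IKE message; otherwise the
          # field is an ESP SPI identifying a particular flow.
          kind = "IKE" if marker == 0 else "ESP"
          return kind, payload

   A complete implementation would also validate the one-time stream
   prefix and handle malformed or zero-length records.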
4.  Deployment Considerations

   In general, any new TCP encapsulation protocol should allocate a new
   TCP port.  If TCP is being used to encapsulate traffic that is
   normally sent over UDP, then the most obvious port choice for the
   TCP encapsulated version is the equivalent port value in the TCP
   port namespace.

   Simply using TCP instead of UDP may be enough in some cases to
   mitigate the connectivity problems of using UDP with NATs and other
   middleboxes.  However, it may also be useful to add a layer of
   encryption using TLS to obfuscate the contents of the stream.  This
   may be done for security and privacy reasons, or to prevent
   middleboxes from mishandling encapsulated traffic or ossifying
   around a particular format for encapsulation.

5.  Performance Considerations

   Many encapsulation or tunnelling protocols utilize an underlying
   transport like UDP, which does not provide stateful features such as
   loss recovery or congestion control.  Because encapsulation using
   TCP involves an additional layer of state that is shared among all
   traffic inside the tunnel, there are additional performance
   considerations to address.

   Even though this document describes encapsulating datagrams or
   messages inside a TCP stream, some encapsulated protocols, such as
   ESP, often themselves carry additional TCP streams, as when
   transmitting data for a VPN protocol [RFC8229].  This introduces
   several potential sources of suboptimal behavior, as multiple TCP
   contexts act upon the same traffic.  For the purposes of this
   discussion, we will refer to the TCP encapsulation context as the
   "outer" TCP context, while the TCP context applicable to any
   encapsulated protocol will be referred to as the "inner" TCP
   context.

   The use of an outer TCP context may cause signals from the network
   to be hidden from the inner TCP contexts.  Depending on the signals
   that the inner TCP contexts use for indicating congestion, events
   that would otherwise result in a modification of behavior may go
   unnoticed, or may build up until a large modification of behavior is
   necessary.  Generally, the main areas of concern are signals that
   inform loss recovery, bufferbloat and delay avoidance, and head of
   line blocking between streams.

5.1.  Loss Recovery

5.1.1.  Concern

   The outer TCP context experiences packet loss on the network
   directly, while any inner TCP contexts present observe the effects
   of that loss on the delivery of their packets by the encapsulation
   layer.  Furthermore, inner TCP contexts still observe direct network
   effects for any network segments that are traversed outside of the
   encapsulation, as is common with a VPN.

   In this way, the outer TCP context masks packet loss from the inner
   contexts by retransmitting encapsulated segments to recover from
   those losses.  An inner context observes this as a delay while the
   packets are retransmitted rather than as a loss.  This can lead to
   spurious retransmissions if the recovery of the lost packets takes
   longer than the inner context's retransmission timeout (RTO).  Since
   the outer context is retransmitting the packets to make up for the
   losses, the spurious retransmissions waste bandwidth that could be
   used for packets that advance the progress of the flows being
   encapsulated.  An RTO event on an inner TCP context also hinders
   performance beyond generating spurious retransmissions, as many TCP
   congestion control algorithms dramatically reduce the sending rate
   after an RTO is observed.

   When recovery from a loss event on the outer TCP context completes,
   the network or endpoint on the other end of the encapsulation will
   receive a potentially large burst of packets as the retransmitted
   packets fill in any gaps and the entire set of pending data can be
   delivered.

   If content from multiple inner flows is shared within a single TCP
   packet in the outer context, the effects of lost packets from the
   outer context will be experienced by more than one inner flow at a
   time.  However, this loss is actually shared by all inner flows,
   since forward progress for the entire encapsulation tunnel is
   generally blocked until the lost segments can be filled in.  This is
   discussed further in Section 5.3.

5.1.2.  Mitigation

   Generally, TCP congestion controls and loss recovery algorithms are
   capable of recovering from loss events very efficiently, and the
   inner TCP contexts observe brief periods of added delay without much
   penalty.

   A TCP congestion control should be selected and tuned to gracefully
   handle extremely variable RTT values, which may already be the case
   for some congestion controls, as RTT variance is often greatly
   increased in mobile and cellular networks.  Additionally, use of a
   TCP congestion control that considers delay to be a sign of
   congestion may help the coordination between inner and outer TCP
   contexts.  LEDBAT [RFC6817] and BBR
   [I-D.cardwell-iccrg-bbr-congestion-control] are two examples of
   delay-based congestion control that an inner TCP context could use
   to properly interpret loss events experienced by the outer TCP
   context.
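   As one illustration only (not a recommendation of this document), a
   Linux endpoint that terminates inner TCP connections can opt an
   individual socket into an alternative congestion control such as BBR
   with the TCP_CONGESTION socket option, assuming the corresponding
   kernel module is available and permitted for the calling process:

      import socket

      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      # TCP_CONGESTION is Linux-specific; selecting "bbr" requires the
      # tcp_bbr module to be loaded and either privileges or an entry
      # in net.ipv4.tcp_allowed_congestion_control.
      sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")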
   Care must be taken to ensure that any TCP congestion control in use
   is also appropriate for an inner context to use on any network
   segments that are traversed outside of the encapsulation.  Since any
   losses within the tunnel will be handled by the outer TCP context,
   it might seem reasonable to modify the inner TCP contexts' loss
   recovery algorithms to prevent retransmissions entirely; however,
   there are often network segments outside of the encapsulation that
   still rely on the inner contexts' loss recovery algorithms.
   Instead, spurious retransmissions can be reduced by ensuring that
   RTO values are tuned such that the outer TCP context will fully time
   out before any inner TCP contexts.

5.2.  Bufferbloat

5.2.1.  Concern

   "Bufferbloat", or delay introduced by consistently full large
   buffers along a network path [TSV2011] [BB2011], can increase
   observed RTTs along a network path, which can harm the performance
   of latency-sensitive applications.

   Any spurious retransmissions sent on the network take up space in
   queues that would otherwise be filled by useful data.  In this case,
   any retransmission sent by an inner TCP context for a loss or
   timeout along the network segments also covered by the outer TCP
   context is considered to be spurious.  This can pose a performance
   problem for implementations that rely on interactive data transfer.

   Additionally, because there may be multiple inner TCP contexts being
   multiplexed over a single outer TCP context, even a minor reduction
   in sending rate by each of the inner contexts can result in a
   dramatic decrease in data sent through the outer context.
   Similarly, an increase in sending rate is also amplified.

5.2.2.  Mitigation

   Great care should be taken in tuning the inner TCP congestion
   control to avoid spurious retransmissions as much as possible.
   However, in order to provide effective loss recovery for the
   segments of the network outside the tunnel, the set of parameters
   used for tuning needs to be viable both inside and outside the
   tunnel.

   Adjusting the retransmission timeout (RTO) value for the TCP
   congestion control on the inner TCP context to be greater than that
   of the outer TCP context will often help to reduce the number of
   spurious retransmissions generated while the outer TCP context
   attempts to catch up with lost or reordered packets.  In most cases,
   fast retransmit will be sufficient to recover from losses on network
   segments after the inner flows leave the tunnel, although loss
   events that trigger a full RTO on those last-mile segments will
   carry a higher penalty with such tuning.  However, in many
   deployments, the last-mile segments will often observe lower loss
   rates than the first-mile segments, leading to a balance that often
   favors spurious retransmission avoidance on the first mile over loss
   recovery speed on the last mile.
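   As a purely arithmetic illustration of this relationship (the helper
   name and the safety margin are hypothetical, not drawn from any
   specification), an implementation that can estimate the outer
   context's round-trip characteristics might derive a floor for inner
   RTO values as follows:

      def inner_rto_floor(outer_srtt, outer_rttvar, margin=2.0):
          # Estimate the outer RTO in the style of RFC 6298:
          # RTO = SRTT + 4 * RTTVAR (all values in seconds).
          outer_rto = outer_srtt + 4 * outer_rttvar
          # Give the outer context time to retransmit and deliver lost
          # segments before any inner retransmission timer fires.
          return margin * outer_rto

      # Example: an outer SRTT of 80 ms and RTTVAR of 20 ms suggest an
      # inner RTO of no less than (0.080 + 4 * 0.020) * 2 = 0.32 s.
      print(inner_rto_floor(0.080, 0.020))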
5.3.  Head of Line Blocking

5.3.1.  Concern

   Because TCP provides in-order delivery and reliability, even if
   there are multiple flows being multiplexed over the encapsulation
   layer, loss events, spurious retransmissions, or other recovery
   efforts will cause data for all other flows to back up and not be
   delivered to the client.  In deployments where there are additional
   network segments to traverse beyond the encapsulation boundary, this
   may mean that flows are not delivered onto those segments until
   recovery for the outer TCP context is complete.

   With UDP encapsulation, packet reordering and loss do not
   necessarily prevent data from being delivered, even if it is
   delivered out of order.  Because TCP groups all data being
   encapsulated into one outer congestion control and loss recovery
   context, this may cause significant delays for flows not directly
   impacted by a recovery event.

   Reordering on the network will also cause problems in this case, as
   it will often trigger fast retransmissions on the outer TCP context,
   blocking all inner contexts from being able to deliver data until
   the retransmissions are complete.  However, a well-behaved TCP will
   reorder the data that arrived out of order and deliver it before the
   retransmissions arrive, reducing the detrimental impact of such
   reordering.

5.3.2.  Mitigation

   One option to help address the head of line blocking would be to run
   multiple tunnels: one for throughput-sensitive flows and one for
   latency-sensitive flows.  This can help to reduce the amount of time
   that a latency-sensitive flow can possibly be blocked on recovery
   for any other flow.  Latency-sensitive flows should take extra care
   to ensure that only the necessary amount of data is in flight at any
   given time.

   Explicit Congestion Notification (ECN) ([RFC3168], [RFC5562]) could
   also be used to communicate between outer and inner TCP contexts
   during any recovery scenario.  In a strategy similar to that taken
   by tunnelling of ECN fields in IP-in-IP tunnels [RFC6040], if an
   implementation supports such behavior, any ECN markings communicated
   to the outer TCP context by the network could be passed through to
   any inner TCP contexts transported by a given packet.  Alternately,
   an implementation could elect to pass through such markings to all
   inner TCP contexts if a greater reduction in sending rate was deemed
   to be necessary.

6.  Security Considerations

   Any attacker on the path that observes the encapsulation could
   potentially discard packets from the outer TCP context and cause
   significant delays due to head of line blocking.  However, an
   attacker in a position to arbitrarily discard packets could have a
   similar effect on the inner TCP context directly or on any other
   encapsulation scheme.

7.  IANA Considerations

   This document has no request to IANA.

8.  Informative References

   [BB2011]   "Bufferbloat: Dark Buffers in the Internet", n.d.

   [I-D.cardwell-iccrg-bbr-congestion-control]
              Cardwell, N., Cheng, Y., Yeganeh, S., and V. Jacobson,
              "BBR Congestion Control", draft-cardwell-iccrg-bbr-
              congestion-control-00 (work in progress), July 2017.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001.

   [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
              Ramakrishnan, "Adding Explicit Congestion Notification
              (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
              DOI 10.17487/RFC5562, June 2009.

   [RFC6040]  Briscoe, B., "Tunnelling of Explicit Congestion
              Notification", RFC 6040, DOI 10.17487/RFC6040, November
              2010.

   [RFC6817]  Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind,
              "Low Extra Delay Background Transport (LEDBAT)",
              RFC 6817, DOI 10.17487/RFC6817, December 2012.

   [RFC8229]  Pauly, T., Touati, S., and R. Mantha, "TCP Encapsulation
              of IKE and IPsec Packets", RFC 8229,
              DOI 10.17487/RFC8229, August 2017.
   [TSV2011]  "Bufferbloat: Dark Buffers in the Internet", March 2011.

Authors' Addresses

   Tommy Pauly
   Apple Inc.
   One Apple Park Way
   Cupertino, California 95014
   United States of America

   Email: tpauly@apple.com


   Eric Kinnear
   Apple Inc.
   One Apple Park Way
   Cupertino, California 95014
   United States of America

   Email: ekinnear@apple.com