Network Virtualization Overlays (nvo3) T. Herbert Internet-Draft Facebook Intended status: Standard track L. Yong Expires October 18, 2015 Huawei USA O. Zia Microsoft April 16, 2015 Generic UDP Encapsulation draft-ietf-nvo3-gue-00 Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html This Internet-Draft will expire on September 7, 2015. Copyright Notice Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents Herbert, Yong, Zia Expires October 2015 [Page 1] Internet Draft Generic UDP Encapsulation April 16, 2015 (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Herbert, Yong, Zia Expires October 2015 [Page 2] Internet Draft Generic UDP Encapsulation April 16, 2015 Abstract This specification describes Generic UDP Encapsulation (GUE), which is a scheme for using UDP to encapsulate packets of arbitrary IP protocols for transport across layer 3 networks. By encapsulating packets in UDP, specialized capabilities in networking hardware for efficient handling of UDP packets can be leveraged. GUE specifies basic encapsulation methods upon which higher level constructs, such tunnels and overlay networks for network virtualization, can be constructed. GUE is extensible by allowing optional data fields as part of the encapsulation, and is generic in that it can encapsulate packets of various IP protocols. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Packet formats . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1. GUE header preamble . . . . . . . . . . . . . . . . . . . . 5 2.2. GUE header . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3. Flags and optional fields . . . . . . . . . . . . . . . . . 7 2.4 Private data . . . . . . . . . . . . . . . . . . . . . . . . 8 3. Message types . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.1. Control messages . . . . . . . . . . . . . . . . . . . . . 8 3.2. Data messages . . . . . . . . . . . . . . . . . . . . . . . 9 4. Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 4.1. Network tunnel encapsulation . . . . . . . . . . . . . . . 10 4.2. Transport layer encapsulation . . . . . . . . . . . . . . . 10 4.3. Encapsulator operation . . . . . . . . . . . . . . . . . . 10 4.4. Decapsulator operation . . . . . . . . . . . . . . . . . . 10 4.5. Router and switch operation . . . . . . . . . . . . . . . . 11 4.6. Middlebox interactions . . . . . . . . . . . . . . . . . . 11 4.7. NAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 4.8. Checksum Handling . . . . . . . . . . . . . . . . . . . . . 12 4.8.1. Checksum requirements . . . . . . . . . . . . . . . . . 12 4.8.2. GUE header checksum . . . . . . . . . . . . . . . . . . 12 4.8.3. UDP Checksum with IPv4 . . . . . . . . . . . . . . . . 12 4.8.4. UDP Checksum with IPv6 . . . . . . . . . . . . . . . . 13 4.9. MTU and fragmentation issues . . . . . . . . . . . . . . . 14 4.10 Congestion control . . . . . . . . . . . . . . . . . . . . 14 5. Inner flow identifier properties . . . . . . . . . . . . . . . 14 5.1. Flow classification . . . . . . . . . . . . . . . . . . . . 14 5.2. Inner flow identifier properties . . . . . . . . . . . . . 15 6. Motivation for GUE . . . . . . . . . . . . . . . . . . . . . . 16 7. Security Considerations . . . . . . . . . . . . . . . . . . . . 17 7.1. GUE security fields . . . . . . . . . . . . . . . . . . . . 18 7.2. GUE and IPsec . . . . . . . . . . . . . . . . . . . . . . . 18 8. IANA Consideration . . . . . . . . . . . . . . . . . . . . . . 18 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 19 Herbert, Yong, Zia Expires October 2015 [Page 3] Internet Draft Generic UDP Encapsulation April 16, 2015 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 19 10.1. Normative References . . . . . . . . . . . . . . . . . . . 19 10.2. Informative References . . . . . . . . . . . . . . . . . . 19 Appendix A: NIC processing for GUE . . . . . . . . . . . . . . . . 20 A.1. Receive multi-queue . . . . . . . . . . . . . . . . . . . . 21 A.2. Checksum offload . . . . . . . . . . . . . . . . . . . . . 21 A.2.1. Transmit checksum offload . . . . . . . . . . . . . . . 21 A.2.2. Receive checksum offload . . . . . . . . . . . . . . . 22 A.3. Transmit Segmentation Offload . . . . . . . . . . . . . . . 23 A.4. Large Receive Offload . . . . . . . . . . . . . . . . . . . 23 Appendix B: Privileged ports . . . . . . . . . . . . . . . . . . . 24 Appendix C: Inner flow identifier as a route selector . . . . . . 24 Appendix D: Hardware protocol implementation considerations . . . 24 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 25 1. Introduction This specification describes Generic UDP Encapsulation (GUE) which is a general method for encapsulating packets of arbitrary IP protocols within User Datagram Protocol (UDP) [RFC0768] packets. Encapsulating packets in UDP facilitates efficient transport across networks. Networking devices widely provide protocol specific processing and optimizations for UDP (as well as TCP) packets. Packets for atypical IP protocols (those not usually parsed by networking hardware) can be encapsulated in UDP packets to maximize deliverability and to leverage flow specific mechanisms for routing and packet steering. GUE provides an extensible header format for including optional data in the encapsulation header. This data potentially covers items such as virtual networking identifier, security data for validating or authenticating the GUE header, congestion control data, etc. GUE also allows private optional data in the encapsulation header. This feature can be used by a site or implementation to define local custom optional data, and allows experimentation of options that may eventually become standard. Herbert, Yong, Zia Expires October 2015 [Page 4] Internet Draft Generic UDP Encapsulation April 16, 2015 2. Packet formats A GUE packet is comprised of a UDP packet whose payload is a GUE header followed by a payload which is either an encapsulated packet of some IP protocol or a control message (like an OAM message). A GUE packet has the general format: +-------------------------------+ | | | UDP/IP header | | | |-------------------------------| | | | GUE Header | | | |-------------------------------| | | | Encapsulated packet | | or control message | | | +-------------------------------+ The GUE header is variable length as determined by the presence of optional fields. 2.1. GUE header preamble The first byte of the GUE header provides the GUE protocol version number, indicator of a control or data message, and header length: 0 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+ |Ver|C| Hlen | +-+-+-+-+-+-+-+-+ Contents are: o Ver: GUE protocol version. The rest of the fields after the preamble are defined based on the version. This field is two bits allowing four possible values. o Control flag: When set indicates a control message, not set indicates a data message. o Hlen: Length in 32-bit words of the GUE header, including optional fields but not the first four bytes of the header. Computed as (header_len - 4) / 4. All GUE headers are a multiple Herbert, Yong, Zia Expires October 2015 [Page 5] Internet Draft Generic UDP Encapsulation April 16, 2015 of four bytes in length. Maximum header length is 132 bytes. 2.2. GUE header The header format for version 0x0 of GUE in UDP is: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Source port | Destination port | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Length | Checksum | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |0x0|C| Hlen | Proto/ctype | Flags |E| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | ~ Fields (optional) ~ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Extension flags (optional) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | ~ Extension fields (optional) ~ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | ~ Private data (optional) ~ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ The contents of the UDP header are: o Source port (inner flow identifier): This should be set to a value that represents the encapsulated flow. The properties of the inner flow identifier are described below. o Destination port: The GUE assigned port number, 6080. o Length: Canonical length of the UDP packet (length of UDP header and payload). o Checksum: Standard UDP checksum. The GUE header consists of: o Preamble byte: Version number (0x0), C bit, and header length. Herbert, Yong, Zia Expires October 2015 [Page 6] Internet Draft Generic UDP Encapsulation April 16, 2015 o Proto/ctype: When the C bit is set this field contains a control message type for the payload. When C bit is not set, the field holds the IP protocol number for the encapsulated packet in the payload. The control message or encapsulated packet begins at the offset provided by Hlen. o Flags. Header flags that may be allocated for various purposes and may indicate presence of optional fields. Undefined header flag bits must be set to zero on transmission. o 'E' Extension flag. Indicates presence of extension flags option in the optional fields. o Fields: Optional fields whose presence is indicated by corresponding flags. o Extension flags: An optional field indicated by the E bit. This field provides an additional set of 32 header bit flags for the header. o Extension fields: Optional fields whose presence is indicated by corresponding extension flags. o Private data: Optional private data. If private data is present it immediately follows that last field present in the header. The length of this data is determined by subtracting the starting offset from the header length. 2.3. Flags and optional fields Flags and associated optional fields are the primary mechanism of extensibility in GUE. There are sixteens flag bits in the primary GUE header with one being reserved to indicate that an optional extension flags field is present. The extension flags field contains an additional thirty-two flag bits. A flag may indicate presence of optional fields. The size of an optional field indicated by a flag must be fixed. Flags may be paired together to allow different lengths for an optional field. For example, if two flag bits are paired, a field may possibly be three different lengths. Regardless of how flag bits may be paired, the lengths and offsets of optional fields corresponding to a set of flags must be well defined. Optional fields are placed in order of the flags. New flags should be allocated from high to low order bit contiguously without holes. Flags allow random access, for instance to inspect the field Herbert, Yong, Zia Expires October 2015 [Page 7] Internet Draft Generic UDP Encapsulation April 16, 2015 corresponding to the Nth flag bit, an implementation only considers the previous N-1 flags to determine the offset. Flags after the Nth flag are not pertinent in calculating the offset of the Nth flag. Flags (or paired flags) are idempotent such that new flags should not cause reinterpretation of old flags. Also, new flags should not alter interpretation of other elements in the GUE header nor how the message is parsed (for instance, in a data message the proto/ctype field always holds an IP protocol number as an invariant). 2.4 Private data An implementation may use private data for its own use. The private data immediately follows the last field in the GUE header and is not a fixed length. This data is considered part of the GUE header and must be accounted for in header length (Hlen). The length of the private data must be a multiple of four and is determined by subtracting the offset of private data in the GUE header from the header length. Specifically: Private_length = (Hlen * 4) - Length(flags) Where "Length(flags)" returns the sum of lengths of all the optional fields present in the GUE header. When there is no private data present, length of the private data is zero. The semantics and interpretation of private data are implementation specific. A encapsulator and decapsulator MUST agree on the meaning of private data before using it. The private data may be structured as necessary, for instance it might contain its own set of flags and optional fields. If a decapsulator receives a GUE packet with private data, it MUST validate the private data appropriately. If a decapsulator does not expect private data from an encapsulator the packet MUST be dropped. If a decapsulator cannot validate the contents of private data per the provided semantics the packet MUST also be dropped. An implementation may place security data in GUE private data which must be verified for packet acceptance. 3. Message types 3.1. Control messages Control messages are indicated in the GUE header when the C bit is set. The payload is interpreted as a control message with type specified in the proto/ctype field. The format and contents of the control message are indicated by the type and can be variable length. Herbert, Yong, Zia Expires October 2015 [Page 8] Internet Draft Generic UDP Encapsulation April 16, 2015 Other than interpreting the proto/ctype field as a control message type, the meaning and semantics of the rest of the elements in the GUE header are the same as that of data messages. Forwarding and routing of control messages should be the same as that of a data message with the same outer IP and UDP header and GUE flags-- this ensures that a control message can be created which follows the same path as a data message. Control messages can be defined for OAM type messages. For instance, an echo request and corresponding echo reply message may be defined to test for liveness. 3.2. Data messages Data messages are indicated in GUE header with C bit not set. The payload of a data message is interpreted as an encapsulated packet of an IP protocol indicated in the proto/ctype field. The packet immediately follows the GUE header. Data messages are a primary means of encapsulation and can be used to create tunnels for overlay networks. 4. Operation The figure below illustrates the use of GUE encapsulation between two servers. Sever 1 is sending packets to server 2. An encapsulator performs encapsulation of packets from server 1. These encapsulated packets traverse the network as UDP packets. At the decapsulator, packets are decapsulated and sent on to server 2. Packet flow in the reverse direction need not be symmetric; GUE encapsulation is not required in the reverse path. +---------------+ +---------------+ | | | | | Server 1 | | Server 2 | | | | | +---------------+ +---------------+ | ^ V | +---------------+ +---------------+ +---------------+ | | | | | | | Encapsulator |-->| Layer 3 |-->| Decapsulator | | | | Network | | | +---------------+ +---------------+ +---------------+ The encapsulator and decapsulator may be co-resident with the corresponding servers, or may be on separate nodes in the network. Herbert, Yong, Zia Expires October 2015 [Page 9] Internet Draft Generic UDP Encapsulation April 16, 2015 4.1. Network tunnel encapsulation Network tunneling can be achieved by encapsulating layer 2 or layer 3 packets. In this case the encapsulator and decapsulator nodes are the tunnel endpoints. These could be routers that provide network tunnels on behalf of communicating servers. 4.2. Transport layer encapsulation When encapsulating layer 4 packets, the encapsulator and decapsulator should be co-resident with the servers. In this case, the encapsulation headers are inserted between the IP header and the transport packet. The addresses in the IP header refer to both the endpoints of the encapsulation and the endpoints for terminating the the transport protocol. 4.3. Encapsulator operation Encapsulators create GUE data messages, set the source port to the inner flow identifier, set flags and optional fields in the GUE header, and forward packets to a decapsulator. An encapsulator may be an end host originating the packets of a flow, or may be a network device performing encapsulation on behalf of servers (routers implementing tunnels for instance). In either case, the intended target (decapsulator) is indicated by the outer destination IP address. If an encapsulator is tunneling packets, that is encapsulating packets of layer 2 or layer 3 protocols (e.g. EtherIP, IPIP, ESP tunnel mode), it should follow standard conventions for tunneling of one IP protocol over another. Diffserv interaction with tunnels is described in [RFC2983], ECN propagation for tunnels is described in [RFC6040]. 4.4. Decapsulator operation A decapsulator performs decapsulation of GUE packets. A decapsulator is addressed by the outer destination IP address of a GUE packet. The decapsulator validates packets, including fields of the GUE header. If a packet is acceptable, the UDP and GUE headers are removed and the packet is resubmitted for IP protocol processing or control message processing if it is a control message. If a decapsulator receives a GUE packet with an unsupported version, unknown flag, bad header length (too small for included optional fields), unknown control message type, or an otherwise malformed header, it must drop the packet and may log the event. No error Herbert, Yong, Zia Expires October 2015 [Page 10] Internet Draft Generic UDP Encapsulation April 16, 2015 message is returned back to the encapsulator. Note that set flags in GUE that are unknown to a decapsulator MUST NOT be ignored. If a GUE packet is received by a decapsulator with unknown flags, the packet MUST be dropped. 4.5. Router and switch operation Routers and switches should forward GUE packets as standard UDP/IP packets. The outer five-tuple should contain sufficient information to perform flow classification corresponding to the flow of the inner packet. A switch should not normally need to parse a GUE header, and none of the flags or optional fields in the GUE header should affect routing. A router should not modify a GUE header when forwarding a packet. It may encapsulate a GUE packet in another GUE packet, for instance to implement a network tunnel. In this case the router takes the role of an encapsulator, and the corresponding decapsulator is the logical endpoint of the tunnel. 4.6. Middlebox interactions A middle box may interpret some flags and optional fields of the GUE header for classification purposes, but is not required to understand all flags and fields in GUE packets. A middle box should not drop a GUE packet because there are flags unknown to it. The header length in the GUE header allows a middlebox to inspect the payload packet without needing to parse the flags or optional fields. A middlebox may infer bidirectional connection semantics to a UDP flow. For instance a stateful firewall may create a five-tuple rule to match flows on egress, and a corresponding five-tuple rule for matching ingress packets where the roles of source and destination are reversed for the IP addresses and UDP port numbers. To operate in this environment, a GUE tunnel must assume connected semantics defined by the UDP five tuple and the use of GUE encapsulation must be symmetric between both endpoints. The source port set in the UDP header must be the destination port the peer would set for replies. 4.7. NAT IP address and port translation can be performed on the UDP/IP headers adhering to the requirements for NAT with UDP [RFC478]. In the case of stateful NAT, connection semantics must be applied to a GUE tunnel as described above. When using transport mode encapsulation and traversing a NAT, the IP addresses may be changed such that the pseudo header checksum used Herbert, Yong, Zia Expires October 2015 [Page 11] Internet Draft Generic UDP Encapsulation April 16, 2015 for checksum calculation is modified and the checksum will be found invalid at the receiver. To compensate for this, A GUE option can be added which contains the checksum over the source and destination addresses when the packet is transmitted. Upon receiving this option, the delta of the pseudo header checksum is computed by subtracting the checksum over the source and and destination addresses from the checksum value in the option. The resultant value is then added into checksum calculation when validating the inner transport checksum. 4.8. Checksum Handling This section describes the requirements around the UDP checksum and GUE header checksum. Checksums are an important consideration in that that they can provide end to end validation and protect against packet mis-delivery. The latter is allowed by the inclusion of a pseudo header that covers the IP addresses and UDP ports of the encapsulating headers. 4.8.1. Checksum requirements The potential for mis-delivery of packets due to corruption of IP, UDP, or GUE headers must be considered. One of the following requirements must be met: o UDP checksums are enabled (for IPv4 or IPv6). o The GUE header checksum is used. o Zero UDP checksums are used in accordance with applicable requirements in [GREUDP], [RFC6935], and [RFC6936]. 4.8.2. GUE header checksum The GUE header checksum provides a UDP-lite [RFC3828] type of checksum capability as an optional field of the GUE header. The GUE header checksum minimally covers the GUE header and a GUE pseudo header. The GUE pseudo header includes the corresponding IP addresses as well as the UDP ports of the encapsulating headers. This checksum should provide adequate protection against address corruption in IPv6 when the UDP checksum is zero. Additionally, the GUE checksum provides protection of the GUE header when the UDP checksum is set to zero with either IPv4 or IPv6. The GUE header checksum is defined in [GUECSUM]. 4.8.3. UDP Checksum with IPv4 For UDP in IPv4, the UDP checksum MUST be processed as specified in [RFC768] and [RFC1122] for both transmit and receive. An Herbert, Yong, Zia Expires October 2015 [Page 12] Internet Draft Generic UDP Encapsulation April 16, 2015 encapsulator MAY set the UDP checksum to zero for performance or implementation considerations. The IPv4 header includes a checksum which protects against mis-delivery of the packet due to corruption of IP addresses. The UDP checksum potentially provides protection against corruption of the UDP header, GUE header, and GUE payload. Enabling or disabling the use of checksums is a deployment consideration that should take into account the risk and effects of packet corruption, and whether the packets in the network are already adequately protected by other, possibly stronger mechanisms such as the Ethernet CRC. If an encapsulator sets a zero UDP checksum for IPv4 it SHOULD use the GUE header checksum as described in section 4.8.2. When a decapsulator receives a packet, the UDP checksum field MUST be processed. If the UDP checksum is non-zero, the decapsulator MUST verify the checksum before accepting the packet. By default a decapsulator SHOULD accept UDP packets with a zero checksum. A node MAY be configured to disallow zero checksums per [RFC1122]; this may be done selectively, for instance disallowing zero checksums from certain hosts that are known to be sending over paths subject to packet corruption. If verification of a non-zero checksum fails, a decapsulator lacks the capability to verify a non-zero checksum, or a packet with a zero-checksum was received and the decapsulator is configured to disallow, the packet MUST be dropped and an event MAY be logged. 4.8.4. UDP Checksum with IPv6 For UDP in IPv6, the UDP checksum MUST be processed as specified in [RFC768] and [RFC2460] for both transmit and receive. Unlike IPv4, there is no header checksum in IPv6 that protects against mis- delivery due to address corruption. Therefore, when GUE is used over IPv6, either the UDP checksum must be enabled or the GUE header checksum must be used. An encapsulator MAY set a zero UDP checksum for performance or implementation reasons, in which case the GUE header checksum MUST be used or applicable requirements for using zero UDP checksums in [GREUDP] MUST be met. If the UDP checksum is enabled, then the GUE header checksum should not be used since it is mostly redundant. When a decapsulator receives a packet, the UDP checksum field MUST be processed. If the UDP checksum is non-zero, the decapsulator MUST verify the checksum before accepting the packet. By default a decapsulator MUST only accept UDP packets with a zero checksum if the GUE header checksum is used and is verified. If verification of a non-zero checksum fails, a decapsulator lacks the capability to verify a non-zero checksum, or a packet with a zero-checksum and no GUE header checksum was received, the packet MUST be dropped and an Herbert, Yong, Zia Expires October 2015 [Page 13] Internet Draft Generic UDP Encapsulation April 16, 2015 event MAY be logged. 4.9. MTU and fragmentation issues Standard conventions for handling of MTU (Maximum Transmission Unit) and fragmentation in conjunction with networking tunnels (encapsulation of layer 2 or layer 3 packets) should be followed. Details are described in MTU and Fragmentation Issues with In-the- Network Tunneling [RFC4459] If a packet is fragmented before encapsulation in GUE, all the related fragments must be encapsulated using the same source port (inner flow identifier). An operator may set MTU to account for encapsulation overhead and reduce the likelihood of fragmentation. 4.10 Congestion control Per requirements of [RFC5405], if the IP traffic encapsulated with GUE implements proper congestion control no additional mechanisms should be required. In the case that the encapsulated traffic does not implement any or sufficient control, or it is not known rather a transmitter will consistently implement proper congestion control, then congestion control at the encapsulation layer must be provided. Note this case applies to a significant use case in network virtualization in which guests run third party networking stacks that cannot be implicitly trusted to implement conformant congestion control. Out of band mechanisms such as rate limiting, Managed Circuit Breaker, or traffic isolation may used to provide rudimentary congestion control. For finer grained congestion control that allow alternate congestion control algorithms, reaction time within an RTT, and interaction with ECN, in band mechanisms may warranted. DCCP may be used to provide congestion control for encapsulated flows. In this case, the protocol stack for an IP tunnel may be IP- GUE-DCCP-IP. Alternatively, GUE can be extended to include congestion control (related data carried in GUE optional fields). Congestion control mechanisms for GUE will be elaborated in other specifications. 5. Inner flow identifier properties 5.1. Flow classification A major objective of using GUE is that a network device can perform flow classification corresponding to the flow of the inner Herbert, Yong, Zia Expires October 2015 [Page 14] Internet Draft Generic UDP Encapsulation April 16, 2015 encapsulated packet based on the contents in the outer headers. Hardware devices commonly perform hash computations on packet headers to classify packets into flows or flow buckets. Flow classification is done to support load balancing (statistical multiplexing) of flows across a set of networking resources. Examples of such load balancing techniques are Equal Cost Multipath routing (ECMP), port selection in Link Aggregation, and NIC device Receive Side Scaling (RSS). Hashes are usually either a three-tuple hash of IP protocol, source address, and destination address; or a five-tuple hash consisting of IP protocol, source address, destination address, source port, and destination port. Typically, networking hardware will compute five-tuple hashes for TCP and UDP, but only three-tuple hashes for other IP protocols. Since the five- tuple hash provides more granularity, load balancing can be finer grained with better distribution. When a packet is encapsulated with GUE, the source port in the outer UDP packet is set to reflect the flow of the inner packet. When a device computes a five-tuple hash on the outer UDP/IP header of a GUE packet, the resultant value classifies the packet per its inner flow. To support flow classification, the source port of the UDP header in GUE is set to a value that maps to the inner flow. This is referred to as the inner flow identifier. The inner flow identifier is set by the encapsulator; it can be computed on the fly based on packet contents or retrieved from a state maintained for the inner flow. Examples of deriving an inner flow identifier are: o If the encapsulated packet is a layer 4 packet, TCP/IPv4 for instance, the inner flow identifier could be based on the canonical five-tuple hash of the inner packet. o If the encapsulated packet is an AH transport mode packet with TCP as next header, the inner flow identifier could be a hash over a three-tuple: TCP protocol and TCP ports of the encapsulated packet. o If a node is encrypting a packet using ESP tunnel mode and GUE encapsulation, the inner flow identifier could be based on the contents of clear-text packet. For instance, a canonical five- tuple hash for a TCP/IP packet could be used. 5.2. Inner flow identifier properties The inner flow identifier is the value set in the UDP source port of a GUE packet. The inner flow identifier should adhere to the following properties: Herbert, Yong, Zia Expires October 2015 [Page 15] Internet Draft Generic UDP Encapsulation April 16, 2015 o The value set in the source port should be within the ephemeral port range. IANA suggests this range to be 49152 to 65535, where the high order two bits of the port are set to one. This provides fourteen bits of entropy for the inner flow identifier. o The inner flow identifier should have a uniform distribution across encapsulated flows. o An encapsulator may occasionally change the inner flow identifier used for an inner flow per its discretion (for security, route selection, etc). Changing the value should happen no more than once every thirty seconds. o Decapsulators, or any networking devices, should not attempt any interpretation of the inner flow identifier, nor should they attempt to reproduce any hash calculation. They may use the value to match further receive packets for steering decisions, but cannot assume that the hash uniquely or permanently identifies a flow. o Input to the inner flow identifier is not restricted to ports and addresses; input could include flow label from an IPv6 packet, SPI from an ESP packet, or other flow related state in the encapsulator that is not necessarily conveyed in the packet. o The assignment function for inner flow identifiers should be randomly seeded to mitigate denial of service attacks. The seed may be changed periodically. 6. Motivation for GUE This section presents the motivation for GUE with respect to other encapsulation methods. A number of different encapsulation techniques have been proposed for the encapsulation of one protocol over another. EtherIP [RFC3378] provides layer 2 tunneling of Ethernet frames over IP. GRE [RFC2784], MPLS [RFC4023], and L2TP [RFC2661] provide methods for tunneling layer 2 and layer 3 packets over IP. NVGRE [NVGRE] and VXLAN [RFC7348] are proposals for encapsulation of layer 2 packets for network virtualization. IPIP [RFC2003] and Generic packet tunneling in IPv6 [RFC2473] provide methods for tunneling IP packets over IP. Several proposals exist for encapsulating packets over UDP including ESP over UDP [RFC3948], TCP directly over UDP [TCPUDP], VXLAN, LISP [RFC6830] which encapsulates layer 3 packets, and Generic UDP Encapsulation for IP Tunneling (GRE over UDP)[GREUDP]. Generic UDP tunneling [GUT] is a proposal similar to GUE in that it aims to Herbert, Yong, Zia Expires October 2015 [Page 16] Internet Draft Generic UDP Encapsulation April 16, 2015 tunnel packets of IP protocols over UDP. GUE has the following discriminating features: o UDP encapsulation leverages specialized network device processing for efficient transport. The semantics for using the UDP source port as an identifier for an inner flow are defined. o GUE permits encapsulation of arbitrary IP protocols, which includes layer 2 3, and 4 protocols. This potentially allows nearly all traffic within a data center to be normalized to be either TCP or UDP on the wire. o Multiple protocols can be multiplexed over a single UDP port number. This is in contrast to techniques to encapsulate protocols over UDP using a protocol specific port number (such as ESP/UDP, GRE/UDP, SCTP/UDP). GUE provides a uniform and extensible mechanism for encapsulating all IP protocols in UDP with minimal overhead (four bytes of additional header). o GUE is extensible. New flags and optional fields can be defined. o The GUE header includes a header length field. This allows a network node to inspect an encapsulated packet without needing to parse the full encapsulation header. o Private data in the encapsulation header allows local customization and experimentation while being compatible with processing in network nodes (routers and middleboxes). o GUE includes both data messages (encapsulation of packets) and control messages (such as OAM). 7. Security Considerations Encapsulation of IP protocols within GUE should not increase security risk, nor provide additional security in itself. As suggested in section 5 the source port for of UDP packets in GUE should be randomly seeded to mitigate some possible denial service attacks. GUE is most useful when it is in the outermost header of a packet which allows for flow hash calculation as well as making GUE header data (such as virtual network identifier) visible to switches and middleboxes. GUE must be amenable to encapsulating (and being encapsulated within) IPsec. Also, we allow provisions to secure the GUE header itself without external protocol. Herbert, Yong, Zia Expires October 2015 [Page 17] Internet Draft Generic UDP Encapsulation April 16, 2015 Security for Generic UDP Encapsulation is described in more detail in [GUESEC]. 7.1. GUE security fields Security fields should be used to provide integrity and authentication of the GUE header. Security negotiation (algorithms, interpretation of security field, key management, etc.) is expected to be done out of band between hosts. 7.2. GUE and IPsec GUE may be used to encapsulate IPsec packets. This allows the benefits of deriving a flow hash for the inner, potentially encrypted, packet. In this case the protocol stack may be: +-------------------------------+ | | | UDP/IP header | | | |-------------------------------| | | | GUE Header | | | |-------------------------------| | | | ESP/AH/private security | | | |-------------------------------| | | | Encapsulated packet | | | +-------------------------------+ Note that IPsec would not cover the GUE header in this case (does not authenticate it for instance). GUE security optional fields may be used to provide authentication or integrity of the GUE header. 8. IANA Consideration A user UDP port number assignment for GUE has been assigned: Service Name: gue Transport Protocol(s): UDP Assignee: Tom Herbert Contact: Tom Herbert Description: Generic UDP Encapsulation Herbert, Yong, Zia Expires October 2015 [Page 18] Internet Draft Generic UDP Encapsulation April 16, 2015 Reference: draft-herbert-gue Port Number: 6080 Service Code: N/A Known Unauthorized Uses: N/A Assignment Notes: N/A 9. Acknowledgements The authors would like to thank David Liu for valuable input on this draft. 10. References 10.1. Normative References [RFC0768] Postel, J., "User Datagram Protocol", STD 6, RFC 768, August 1980. [RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", RFC 2434, October 1998. [RFC2983] Black, D., "Differentiated Services and Tunnels", RFC 2983, October 2000. [RFC6040] Briscoe, B., "Tunnelling of Explicit Congestion Notification", RFC 6040, November 2010. [RFC6936] Fairhurst, G. and M. Westerlund, "Applicability Statement for the Use of IPv6 UDP Datagrams with Zero Checksums", RFC 6936, April 2013. [RFC4459] Savola, P., "MTU and Fragmentation Issues with In-the- Network Tunneling", RFC 4459, April 2006. 10.2. Informative References [RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, October 1996. [RFC3948] Huttunen, A., Swander, B., Volpe, V., DiBurro, L., and M. Stenberg, "UDP Encapsulation of IPsec ESP Packets", RFC 3948, January 2005. [RFC6830] Farinacci, D., Fuller, V., Meyer, D., and D. Lewis, "The Locator/ID Separation Protocol (LISP)", RFC 6830, January 2013. Herbert, Yong, Zia Expires October 2015 [Page 19] Internet Draft Generic UDP Encapsulation April 16, 2015 [RFC3378] Housley, R. and S. Hollenbeck, "EtherIP: Tunneling Ethernet Frames in IP Datagrams", RFC 3378, September 2002. [RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. Traina, Generic Routing Encapsulation (GRE)", RFC 2784, March 2000. [RFC4023] Worster, T., Rekhter, Y., and E. Rosen, Ed., "Encapsulating MPLS in IP or Generic Routing Encapsulation (GRE)", RFC 4023, March 2005. [RFC2661] Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn, G., and B. Palter, "Layer Two Tunneling Protocol "L2TP"", RFC 2661, August 1999. [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP Authentication Option", RFC 5925, June 2010. [RFC3828] Larzon, L-A., Degermark, M., Pink, S., Jonsson, L-E., Ed., and G. Fairhurst, Ed., "The Lightweight User Datagram Protocol (UDP- Lite)", RFC 3828, July 2004, . [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, August 2014, . [NVGRE] NVGRE: Network Virtualization using Generic Routing Encapsulation draft-sridharan-virtualization-nvgre-03 [TCPUDP] Encapsulation of TCP and other Transport Protocols over UDP draft-cheshire-tcp-over-udp-00 [GREUDP] Generic UDP Encapsulation for IP Tunneling draft-yong-tsvwg- gre-in-udp-encap-02 [GUESEC] Yong, L., Herbert, T., "Generic UDP Encapsulation (GUE) for Secure Transport", draft-hy-gue-4-secure-transport-00, work in progress. [GUT] Generic UDP Tunnelling (GUT) draft-manner-tsvwg-gut-02.txt [REMCSUM] Remote Checksum Offload draft-herbert-remotecsumoffload-00 Appendix A: NIC processing for GUE This appendix provides some guidelines for Network Interface Cards Herbert, Yong, Zia Expires October 2015 [Page 20] Internet Draft Generic UDP Encapsulation April 16, 2015 (NICs) to implement common offloads and accelerations to support GUE. Note that most of this discussion is generally applicable to other methods of UDP based encapsulation. A.1. Receive multi-queue Contemporary NICs support multiple receive descriptor queues (multi- queue). Multi-queue enables load balancing of network processing for a NIC across multiple CPUs. On packet reception, a NIC must select the appropriate queue for host processing. Receive Side Scaling is a common method which uses the flow hash for a packet to index an indirection table where each entry stores a queue number. Flow Director and Accelerated Receive Flow Steering (aRFS) allow a host to program the queue that is used for a given flow which is identified either by an explicit five-tuple or by the flow's hash. GUE encapsulation should be compatible with multi-queue NICs that support five-tuple hash calculation for UDP/IP packets as input to RSS. The inner flow identifier (source port) ensures classification of the encapsulated flow even in the case that the outer source and destination addresses are the same for all flows (e.g. all flows are going over a single tunnel). By default, UDP RSS support is often disabled in NICs to avoid out of order reception that can occur when UDP packets are fragmented. As discussed above, fragmentation of GUE packets should be mitigated by fragmenting packets before entering a tunnel, path MTU discovery in higher layer protocols, or operator adjusting MTUs. Other UDP traffic may not implement such procedures to avoid fragmentation, so enabling UDP RSS support in the NIC should be a considered tradeoff during configuration. A.2. Checksum offload Many NICs provide capabilities to calculate standard ones complement payload checksum for packets in transmit or receive. When using GUE encapsulation there are at least two checksums that may be of interest: the encapsulated packet's transport checksum, and the UDP checksum in the outer header. A.2.1. Transmit checksum offload NICs may provide a protocol agnostic method to offload transmit checksum (NETIF_F_HW_CSUM in Linux parlance) that can be used with GUE. In this method the host provides checksum related parameters in a transmit descriptor for a packet. These parameters include the starting offset of data to checksum, the length of data to checksum, and the offset in the packet where the computed checksum is to be Herbert, Yong, Zia Expires October 2015 [Page 21] Internet Draft Generic UDP Encapsulation April 16, 2015 written. The host initializes the checksum field to pseudo header checksum. In the case of GUE, the checksum for an encapsulated transport layer packet, a TCP packet for instance, can be offloaded by setting the appropriate checksum parameters. NICs typically can offload only one transmit checksum per packet, so simultaneously offloading both an inner transport packet's checksum and the outer UDP checksum is likely not possible. In this case setting UDP checksum to zero (per above discussion) and offloading the inner transport packet checksum might be acceptable. If an encapsulator is co-resident with a host, then checksum offload may be performed using remote checksum offload [REMCSUM]. Remote checksum offload relies on NIC offload of the simple UDP/IP checksum which is commonly supported even in legacy devices. In remote checksum offload the outer UDP checksum is set and the GUE header includes an option indicating the start and offset of the inner "offloaded" checksum. The inner checksum is initialized to the pseudo header checksum. When a decapsulator receives a GUE packet with the remote checksum offload option, it completes the offload operation by determining the packet checksum from the indicated start point to the end of the packet, and then adds this into the checksum field at the offset given in the option. Computing the checksum from the start to end of packet is efficient if checksum-complete is provided on the receiver. A.2.2. Receive checksum offload GUE is compatible with NICs that perform a protocol agnostic receive checksum (CHECKSUM_COMPLETE in Linux parlance). In this technique, a NIC computes a ones complement checksum over all (or some predefined portion) of a packet. The computed value is provided to the host stack in the packet's receive descriptor. The host driver can use this checksum to "patch up" and validate any inner packet transport checksum, as well as the outer UDP checksum if it is non-zero. Many legacy NICs don't provide checksum-complete but instead provide an indication that a checksum has been verified (CHECKSUM_UNNECESSARY in Linux). Usually, such validation is only done for simple TCP/IP or UDP/IP packets. If a NIC indicates that a UDP checksum is valid, the checksum-complete value for the UDP packet is the "not" of the pseudo header checksum. In this way, checksum-unnecessary can be converted to checksum-complete. So if the NIC provides checksum-unnecessary for the outer UDP header in an encapsulation, checksum conversion can be done so that the checksum-complete value is derived and can be used by the stack to validate an checksums in the encapsulated packet. Herbert, Yong, Zia Expires October 2015 [Page 22] Internet Draft Generic UDP Encapsulation April 16, 2015 A.3. Transmit Segmentation Offload Transmit Segmentation Offload (TSO) is a NIC feature where a host provides a large (>MTU size) TCP packet to the NIC, which in turn splits the packet into separate segments and transmits each one. This is useful to reduce CPU load on the host. The process of TSO can be generalized as: - Split the TCP payload into segments which allow packets with size less than or equal to MTU. - For each created segment: 1. Replicate the TCP header and all preceding headers of the original packet. 2. Set payload length fields in any headers to reflect the length of the segment. 3. Set TCP sequence number to correctly reflect the offset of the TCP data in the stream. 4. Recompute and set any checksums that either cover the payload of the packet or cover header which was changed by setting a payload length. Following this general process, TSO can be extended to support TCP encapsulation in GUE. For each segment the Ethernet, outer IP, UDP header, GUE header, inner IP header if tunneling, and TCP headers are replicated. Any packet length header fields need to be set properly (including the length in the outer UDP header), and checksums need to be set correctly (including the outer UDP checksum if being used). To facilitate TSO with GUE it is recommended that optional fields should not contain values that must be updated on a per segment basis-- for example the GUE fields should not include checksums, lengths, or sequence numbers that refer to the payload. If the GUE header does not contain such fields then the TSO engine only needs to copy the bits in the GUE header when creating each segment and does not need to parse the GUE header. A.4. Large Receive Offload Large Receive Offload (LRO) is a NIC feature where packets of a TCP connection are reassembled, or coalesced, in the NIC and delivered to the host as one large packet. This feature can reduce CPU utilization in the host. Herbert, Yong, Zia Expires October 2015 [Page 23] Internet Draft Generic UDP Encapsulation April 16, 2015 LRO requires significant protocol awareness to be implemented correctly and is difficult to generalize. Packets in the same flow need to be unambiguously identified. In the presence of tunnels or network virtualization, this may require more than a five-tuple match (for instance packets for flows in two different virtual networks may have identical five-tuples). Additionally, a NIC needs to perform validation over packets that are being coalesced, and needs to fabricate a single meaningful header from all the coalesced packets. The conservative approach to supporting LRO for GUE would be to assign packets to the same flow only if they have identical five- tuple and were encapsulated the same way. That is the outer IP addresses, the outer UDP ports, GUE protocol, GUE flags and fields, and inner five tuple are all identical. Appendix B: Privileged ports Using the source port to contain an inner flow identifier value disallows the security method of a receiver enforcing that the source port be a privileged port. Privileged ports are defined by some operating systems to restrict source port binding. Unix, for instance, considered port number less than 1024 to be privileged. Enforcing that packets are sent from a privileged port is widely considered an inadequate security mechanism and has been mostly deprecated. To approximate this behavior, an implementation could restrict a user from sending a packet destined to the GUE port without proper credentials. Appendix C: Inner flow identifier as a route selector An encapsulator generating an inner flow identifier may modulate the value to perform a type of multipath source routing. Assuming that networking switches perform ECMP based on the flow hash, a sender can affect the path by altering the inner flow identifier. For instance, a host may store a flow hash in its PCB for an inner flow, and may alter the value upon detecting that packets are traversing a lossy path. Changing the inner flow identifier for a flow should be subject to hysteresis (at most once every thirty seconds) to limit the number of out of order packets. Appendix D: Hardware protocol implementation considerations A low level protocol, such is GUE, is likely interesting to being supported by high speed network devices. Variable length header (VLH) protocols like GUE are often considered difficult to efficiently implement in hardware. In order to retain the important characteristics of an extensible and robust protocol, hardware Herbert, Yong, Zia Expires October 2015 [Page 24] Internet Draft Generic UDP Encapsulation April 16, 2015 vendors may practice "constrained flexibility". In this model, only certain combinations or protocol header parameterizations are implemented in hardware fast path. Each such parameterization is fixed length so that the particular instance can be optimized as a fixed length protocol. In the case of GUE this constitutes specific combinations of GUE flags, fields, and next protocol. The selected combinations would naturally be the most common cases which form the "fast path", and other combinations are assumed to take the "slow path". In time, needs and requirements of the protocol may change which may manifest themselves as new parameterizations to be supported in the fast path. To allow allow this extensibility, a device practicing constrained flexibility should allow the fast path parameterizations to be programmable. Authors' Addresses Tom Herbert Facebook 1 Hacker Way Menlo Park, CA 94052 US Email: tom@herbertland.com Lucy Yong Huawei USA 5340 Legacy Dr. Plano, TX 75024 US Email: lucy.yong@huawei.com Osama Zia Microsoft 1 Microsoft Way Redmond, WA 98029 US Email: osamaz@microsoft.com Herbert, Yong, Zia Expires October 2015 [Page 25]