Network Working Group                                    F. Templin, Ed.
Internet-Draft                                      Boeing Phantom Works
Expires: December 19, 2005                                 June 17, 2005

Link Adaptation for IPv6-in-IPv4 Tunnels
draft-templin-linkadapt-00.txt

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 19, 2005.

Copyright Notice

Copyright (C) The Internet Society (2005).

Abstract

IPv6-in-IPv4 tunneling mechanisms support the minimum IPv6 MTU of 1280 bytes via static prearrangements at the tunnel encapsulator and/or dynamic MTU determination based on ICMPv4 messages, but these methods have known operational limitations. This document proposes a new MTU determination mechanism for IPv6-in-IPv4 tunnels that uses a link adaptation scheme with simplified IPv4 segmentation/reassembly and dynamic segment size probing.

Templin                 Expires December 19, 2005               [Page 1]
Internet-Draft         Link Adaptation for Tunnels             June 2005

1. Introduction

IPv6-in-IPv4 tunnels span multiple IPv4 network hops yet are seen by IPv6 as ordinary links that must support the minimum IPv6 link MTU of 1280 bytes ([RFC2460], section 5).
Common tunneling mechanisms (e.g., [RFC2529][RFC3056][ISATAP][MECH][TEREDO]) meet this requirement through conservative static prearrangements at the encapsulator at the expense of sub-optimal performance over some paths due to excessive IPv4 network-based fragmentation and/or missed opportunities to discover larger MTUs. Optional dynamic MTU determination methods based on ICMPv4 "fragmentation needed" messages are also available, but can result in MTU-related communication failures due to the unreliable and untrustworthy nature of ICMPv4 messages generated by network middleboxes.

This document proposes a link adaptation method for IPv6-in-IPv4 tunnels that presents an assured MTU to the IPv6 layer. It uses simplified segmentation/reassembly and dynamic segment size probing with authenticated probe feedback. Thus, it provides greater robustness and efficiency than existing schemes by avoiding IPv4 network-based fragmentation and reducing dependence on unreliable/untrustworthy ICMPv4 feedback from IPv4 network middleboxes.

2. Requirements

The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when they appear in this document, are to be interpreted as described in [RFC2119].

3. Link Adaptation for IPv6-in-IPv4 Tunnels

The following subsections specify a link adaptation scheme for IPv6-in-IPv4 tunnels with properties similar to those defined for AAL5 [RFC2684] and IEEE 802.11 [WLAN].

3.1. Layering

IPv6-in-IPv4 tunneling mechanisms that implement the link adaptation specified in this document (hereafter referred to as "implementations") operate at a logical midpoint between the IPv6 and IPv4 protocol modules. From the viewpoint of IPv6, the implementation appears as a network driver that delivers whole Upper Layer Payloads (ULPs) to an underlying transmission medium. From the viewpoint of IPv4, the implementation appears as a packetization layer protocol (e.g., similar to TCP, etc.)
that segments user data to be encapsulated in IPv4 packets.

3.2. Tunnel Interface MTU

Implementations MUST configure a minimum per-tunnel interface LinkMTU of 1280 bytes and SHOULD provide a configuration knob to set larger values. A maximum LinkMTU of 9180 bytes (i.e., the same as defined in [RFC1626]) is RECOMMENDED for normal use cases, since it is large enough to carry 8KB network filesystem blocks and take advantage of Gigabit Ethernet Jumbo Frames, yet not so large as to diminish the effectiveness of 32-bit link layer CRCs [GIGE]. Implementations MAY set even larger LinkMTU values, but are advised that this may lead to unacceptable levels of undetected errors unless all physical segments in the path can provide assured error-free delivery for large packets.

Since LinkMTU values larger than 1280 bytes may result in [ICMPv6] "packet too big" messages due to temporary segmentation restrictions (see: section 3.3), ULPs SHOULD employ a probing strategy that begins with a smaller payload size (on the order of 1KB) and probes upward [PMTUD]. (Note that this may not be possible for some ULPs.)

3.3. Encapsulation/Segmentation

Encapsulators cache per-flow segment sizes ("SEGSIZE") for the purpose of segmenting ULPs into chains of IPv4 datagrams. Conservative implementations can configure an initial SEGSIZE of 68 bytes minus the length of the IPv4 header and any additional encapsulation headers, since the minimum IPv4 LinkMTU is 68 bytes [RFC0791]. In practice, however, most Internet links configure much larger IPv4 LinkMTUs [RFC3150][RFC3819] such that larger initial SEGSIZE values are often possible.

The encapsulator splits each ULP into a chain of at most 32 segments for presentation to the IPv4 layer.
The segments MUST be contiguous and non-overlapping, i.e., the final byte of the (i)th segment MUST be the byte that immediately precedes the first byte of the (i+1)th segment. Non-final segments in the chain MUST be equal in size; the final segment MAY be of different size. For ULPs that span multiple segments, encapsulators use 2's complement Fletcher-32 [STONE][RFC3385] to calculate a checksum across all ULP payload bytes and record the A and B results in a trailing 32-bit checksum. For ULPs that fit within a single segment, the trailing 32-bit checksum is omitted.

Segments are encapsulated in-order in consecutive IPv4 packets with bit 1 of the "Flags" field (i.e., "Don't Fragment - DF") set to '1' and an increasing Segment ID ("SEGID") value between 0 and 31 encoded in the five low-order bits of the "Fragment Offset" field, i.e., the first packet encodes '0', the second packet encodes '1', etc. Each packet in the chain except the final one sets the "More Fragments - MF" bit, i.e., the MF bit is set as for ordinary IPv4 fragmentation.

Each packet in the chain is delivered to the link layer (i.e., the IPv4 stack) in increasing SEGID order, i.e., SEGID 0 first, followed by SEGID 1, etc., up to the final packet; the link layer SHOULD NOT reorder the packets or introduce artificial delays between packets.

Implementations MAY increase a flow's SEGSIZE to larger values through path probing to avoid black holes [RFC2923]. Implementations probe a candidate SEGSIZE value 'N' by segmenting a ULP into a chain of two or more packets such that the final packet encapsulates a segment of size N, where N is larger than the size of the segments encapsulated in non-final packets. The chain SHOULD also include Forward Error Correction (FEC) information (format and encoding TBD) that covers the probe segment in case of loss.
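The segmentation and header-marking rules given so far in this section can be illustrated with a short sketch. It is not part of the specification. The function and class names are hypothetical, and several details that the text leaves open are assumed here and marked as such in the comments: 16-bit words are read in network byte order, odd-length input is zero-padded, the A sum occupies the high-order half of the trailer, and the trailing checksum is appended to the ULP before it is split.

```python
import struct
from dataclasses import dataclass

MAX_SEGMENTS = 32      # SEGID is a 5-bit value, 0 through 31
DF_BIT = 0x4000        # bit 1 of the IPv4 Flags field (Don't Fragment)
MF_BIT = 0x2000        # bit 2 of the IPv4 Flags field (More Fragments)


def fletcher32_twos(data: bytes) -> int:
    """2's complement Fletcher-32: both running sums are kept modulo 2**16.

    Assumptions (not fixed by the text): big-endian 16-bit words, zero
    padding for odd lengths, A in the high-order half of the result.
    """
    if len(data) % 2:
        data += b"\x00"
    a = b = 0
    for (word,) in struct.iter_unpack("!H", data):
        a = (a + word) & 0xFFFF
        b = (b + a) & 0xFFFF
    return (a << 16) | b


@dataclass
class Segment:
    segid: int
    more_fragments: bool
    payload: bytes

    def flags_and_offset(self) -> int:
        """16-bit IPv4 Flags / Fragment Offset word: DF always set, MF set
        on non-final packets, SEGID in the five low-order bits."""
        return DF_BIT | (MF_BIT if self.more_fragments else 0) | self.segid


def segment_ulp(ulp: bytes, segsize: int) -> list[Segment]:
    """Split a ULP into a chain of at most 32 contiguous segments."""
    if segsize <= 0:
        raise ValueError("SEGSIZE must be positive")
    data = ulp
    if len(ulp) > segsize:
        # Multi-segment ULPs carry a trailing 32-bit checksum (assumed to
        # be appended before splitting); single-segment ULPs omit it.
        data = ulp + struct.pack("!I", fletcher32_twos(ulp))
    chunks = [data[i:i + segsize] for i in range(0, len(data), segsize)]
    if len(chunks) > MAX_SEGMENTS:
        raise ValueError("ULP needs more than 32 segments at this SEGSIZE")
    last = len(chunks) - 1
    return [Segment(i, i < last, chunk) for i, chunk in enumerate(chunks)]
```

Under these assumptions, a 1280-byte ULP segmented with a SEGSIZE of 48 bytes produces 27 packets: 26 equal-sized segments followed by a shorter final segment that ends with the 4-byte checksum.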
If the encapsulator receives a unicast IPv6 Router Advertisement message [RFC2461] from the decapsulator at the far end of the tunnel (see: section 3.4) with an MTU option that encodes the value N within a maximum probe delay ("MaxProbeDelay") timeout period, it deems the probe successful. Following a successful probe, but before advancing SEGSIZE to N, implementations SHOULD enter a brief verification phase during which additional probes are sent to detect asymmetric multipath MTU restrictions. Thereafter, implementations SHOULD re-probe periodically to confirm that packets with up to SEGSIZE byte segments are still reaching the decapsulator at the far end of the tunnel. Additional strategies for SEGSIZE management and black hole detection are found in [PMTUD].

3.4. Decapsulation/Reassembly

The Length, SEGID, MF and flow identification information in the encapsulation headers of packets in a chain provide sufficient information for the tunnel decapsulator to reassemble the original ULP with protection for packet reordering in the IPv4 network.

Decapsulators MUST configure per-flow reassembly buffers of at least 1280 bytes and SHOULD configure larger per-flow reassembly buffers up to 9180 bytes or larger (see: section 3.2).

Decapsulators use per-flow reassembly buffers to concatenate the ULP segments received in packet chains in increasing SEGID order (i.e., SEGID 0, followed by SEGID 1, etc.) even if the packets were re-ordered by the network. When all ULP segments have been concatenated into the reassembly buffer, the decapsulator uses 2's complement Fletcher-32 to detect errors if a trailing checksum was included (see: section 3.3).

If the decapsulator receives a packet chain that would overflow the reassembly buffer, it discards the chain and sends an [ICMPv6] "packet too big" message back to the source.
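The reassembly steps described so far in this section can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the names are hypothetical, the Fletcher-32 conventions (big-endian words, zero padding, A in the high-order half) are assumed, and the buffer limit is assumed to apply to the ULP payload without its 4-byte trailer. Exceptions stand in for the [ICMPv6] messages that a decapsulator would send.

```python
import struct


class PacketTooBig(Exception):
    """Stands in for an ICMPv6 "packet too big" message."""


class ReassemblyError(Exception):
    """Stands in for a "reassembly/checksum error" indication."""


def fletcher32_twos(data: bytes) -> int:
    # 2's complement Fletcher-32; word order and padding are assumptions.
    if len(data) % 2:
        data += b"\x00"
    a = b = 0
    for (word,) in struct.iter_unpack("!H", data):
        a = (a + word) & 0xFFFF
        b = (b + a) & 0xFFFF
    return (a << 16) | b


def reassemble(packets, bufsize: int = 1280) -> bytes:
    """Rebuild one ULP from (segid, more_fragments, payload) tuples.

    The tuples may arrive in any order; SEGID restores the original order.
    """
    by_id = {segid: (mf, payload) for segid, mf, payload in packets}
    finals = [segid for segid, (mf, _) in by_id.items() if not mf]
    if len(finals) != 1:
        raise ReassemblyError("final segment missing or ambiguous")
    last = finals[0]
    if any(i not in by_id for i in range(last + 1)):
        raise ReassemblyError("one or more segments lost")
    data = b"".join(by_id[i][1] for i in range(last + 1))
    # A single-segment ULP carries no trailing checksum.
    ulp_len = len(data) if last == 0 else len(data) - 4
    if ulp_len > bufsize:
        raise PacketTooBig(bufsize)
    if last == 0:
        return data
    ulp, trailer = data[:-4], data[-4:]
    if struct.unpack("!I", trailer)[0] != fletcher32_twos(ulp):
        raise ReassemblyError("checksum verification failed")
    return ulp
```

Keying the segments by SEGID means that arrival order does not matter, which is the reordering protection that the encapsulation headers are meant to provide.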
The message body includes upper layer packet headers (IPv6 and above) and contents of the reassembly buffer up to a total of 1280 bytes, while the MTU value encodes the reassembly buffer size.

If at least one segment was received, but one or more segments were lost and/or checksum verification failed, the decapsulator SHOULD send an [ICMPv6] "parameter problem" message with code "reassembly/checksum error" back to the encapsulator at the originating end of the tunnel. The message body includes upper layer packet headers (IPv6 and above) and contents of the reassembly buffer up to a total of 1280 bytes, and the pointer identifies either the beginning of the first missing segment or the beginning of the 4 byte checksum field (if no segments were missing).

Upon receipt of such [ICMPv6] errors, the encapsulator SHOULD take appropriate corrective actions such as reducing the tunnel's current SEGSIZE, imposing an artificial inter-ULP queuing delay for the tunnel, or relaying the [ICMPv6] messages back to the original source as a congestion indication.

When a decapsulator receives a packet chain used for probing (see: section 3.3), it reassembles the ULP as above and sends a unicast IPv6 Router Advertisement message back to the encapsulator at the originating end of the tunnel with an MTU option that encodes the size of the segment encapsulated in the final packet in the chain. The encapsulator will receive the Router Advertisement and deem the probe successful.

Following successful reassembly, the trailing checksum is discarded (if present) and the ULP payload is delivered to upper layers.

3.5. ICMPv4 Error Handling

Encapsulators may receive ICMPv4 "fragmentation needed" error messages from inside a tunnel due to probe failures and/or route changes across previously-probed paths. These messages may come from either legitimate IPv4 network middleboxes or adversarial/mis-configured middleboxes that return wrong information.
Implementers are advised to consult [PMTUD] for operational recommendations on processing ICMPv4 "fragmentation needed" messages.

4. IANA Considerations

The IANA is instructed to assign a code type for "reassembly/checksum error" under the [ICMPv6] Parameter Problem message type in the "ICMPv6 Type Numbers" registry.

5. Security Considerations

The securing mechanisms for IPv6 neighbor discovery [RFC3971] and Cryptographically-Generated Addresses [RFC3972] are used to authenticate Router Advertisement probe responses.

6. Acknowledgments

This document represents the mindshare of many contributors.

7. Appendix A: Additional Considerations

Encapsulators can segment chains of two or more packets in which the final packet is longer than the non-final packets as a general-purpose mechanism for eliciting acknowledgements from the reassembler if improved reliability at the expense of additional overhead is desired.

The equal size restriction for non-final segments and the non-overlapping restriction for all segments in packet chains provide a significant simplification for reassembly algorithms [RFC0815].

Use of the link adaptation scheme described in this document may lead to an overall increase in short chains of small packets in the Internet. Network administrators are advised to follow the recommendations in [RFC3150] to minimize packet loss and packet reordering.

Network middleboxes that do not honor the IPv4 DF bit will cause irreparable damage to the information encoded in the IPv4 headers of encapsulated packets if fragmentation is incurred.

Network conditions such as load balancing, multi-path routing, spanning tree reconfigurations, etc. can cause a certain degree of reordering of the packets in a flow. For instance, Segment 5 of a segmented PDU could arrive before Segment 1.
The 5-bit segment ID in each packet provides protection for reordering among the packets of the same PDU, but provides no protection for reordering of packets belonging to *different* PDUs. A small ID field is therefore needed in each packet to differentiate the packets of PDUs A and B. The question arises as to whether a very small (2-4 bit) ID field is enough to eliminate potential ambiguity due to packet reordering in the network. Several works conducted by CAIDA (www.caida.org) may provide insights.

Since link-layer CRC-32 checks normally occur on each segment in the path, most errors detected during PDU reassembly will be due to packet splices and/or errors in the data path between the NIC hardware and the reassembly buffer. The Fletcher-32 checksum algorithm has been shown to provide an effective edge-to-edge error detection capability for such errors [STONE]. The Fletcher-32 checksum is also dissimilar from both CRC-32 and the Internet checksum used by many upper layer protocols, thereby decreasing the likelihood of undetected errors.

Prior to any path MTU probing for a flow, link adaptation should begin with a conservative initial SEGSIZE to yield an IPv4 packet size of 68 bytes (the maximum IPv4 packet size guaranteed to fit over any link in the IPv4 Internet without incurring fragmentation) so that an un-probed ULP payload of at least 1280 bytes will be assured for ultra-conservative implementations. However, [RFC3150] suggests a minimum MTU of 296 bytes over the slowest serial links, so a slightly more optimistic implementation could send ULP payloads as large as ((296 - encapsulation_header_length) * 32) ~= 9000 bytes (and perhaps a bit larger due to VJ header compression) as long as they arrange for the first few such payloads to generate probe responses from the far end.
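The two payload bounds above follow from simple arithmetic, worked here as a small sketch. The helper name is hypothetical, and the figures assume a 20-byte IPv4 header with no options and no additional encapsulation headers; other encapsulations would yield smaller values.

```python
IPV4_HEADER_LEN = 20   # assumption: minimum IPv4 header, no options
MAX_SEGMENTS = 32      # a chain holds at most 32 segments
CHECKSUM_LEN = 4       # trailing Fletcher-32 checksum on multi-segment ULPs


def max_chain_bytes(ipv4_packet_size: int,
                    encapsulation_header_length: int = IPV4_HEADER_LEN) -> int:
    """Largest number of bytes one chain can carry at a given packet size."""
    segsize = ipv4_packet_size - encapsulation_header_length
    if segsize <= 0:
        raise ValueError("packet size leaves no room for a segment")
    return segsize * MAX_SEGMENTS


if __name__ == "__main__":
    for size in (68, 296):
        total = max_chain_bytes(size)
        print(f"{size}-byte packets: {total} bytes per chain, "
              f"{total - CHECKSUM_LEN} bytes of ULP payload")
```

With 68-byte packets each segment holds 48 bytes, so a chain carries 1536 bytes, which still covers a 1280-byte ULP after the 4-byte checksum is subtracted. With 296-byte packets each segment holds 276 bytes and a chain carries 8832 bytes, in line with the figure of roughly 9000 bytes.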
For those optimistic implementations, if probe responses consistently arrive after an initial probe and subsequent verification phase, the flow's SEGSIZE can be advanced to the size used for probing. Otherwise, the interface can generate IPv6 "packet too big" messages to inform upper packetization layers that smaller IPv6 packets should be sent over this flow for the time being. An optimistic implementation could therefore set the maximum interface LinkMTU of 9180 bytes and perform the optimistic initial probing described above.

Some upper layer packetization protocols (e.g., NFS) generate fixed payload sizes and rely on the network layer to deliver the payloads either as whole IP packets or as chains of IP fragments. Those protocols should consider "packet too big" messages coming from the interface as an indication to retransmit, since the IP fragmentation layer will have been informed of the smaller MTU for the flow. Subsequent payloads sent over the flow will therefore undergo IP fragmentation and each fragment will be presented to the interface for transmission. Since NFS performance (and the performance of other upper layer packetization protocols) is highly sensitive to packet handling overhead, implementations should periodically attempt to increase the SEGSIZE through probing even if initial probe attempts fail.

Since the RTTs along various paths may vary from the sub-microsecond level up to hundreds of milliseconds or more, Forward Error Correction (FEC) will clearly be required in some cases (i.e., instead of Automatic Repeat Request (ARQ)) even though efficiency may suffer [RFC3819]. Provisions for enabling adaptive and efficient FEC in the segmentation/reassembly procedures are for further study (FFS).

8. References

8.1. Normative References

[ICMPv6] Conta, A., Deering, S., and M.
Gupta, ed., "Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification", draft-ietf-ipngwg-icmp-v3 (work in progress), November 2004.

[RFC0791] Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2460] Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC 2460, December 1998.

[RFC2461] Narten, T., Nordmark, E., and W. Simpson, "Neighbor Discovery for IP Version 6 (IPv6)", RFC 2461, December 1998.

[RFC3971] Arkko, J., Kempf, J., Zill, B., and P. Nikander, "SEcure Neighbor Discovery (SEND)", RFC 3971, March 2005.

[RFC3972] Aura, T., "Cryptographically Generated Addresses (CGA)", RFC 3972, March 2005.

8.2. Informative References

[FRAG] Mogul, J. and C. Kent, "Fragmentation Considered Harmful", In Proc. SIGCOMM '87 Workshop on Frontiers in Computer Communications Technology, August 1987.

[GIGE] Dykstra, P., "Gigabit Ethernet Jumboframes (And Why You Should Care)", http://sd.wareonearth.com/~phil/jumbo.html, December 1999.

[ISATAP] Templin, F., Gleeson, T., Talwar, M., and D. Thaler, "Intra-Site Automatic Tunnel Addressing Protocol (ISATAP)", draft-ietf-ngtrans-isatap (work in progress), January 2005.

[MECH] Nordmark, E. and R. Gilligan, "Transition Mechanisms for IPv6 Hosts and Routers", draft-ietf-v6ops-mech-v2 (work in progress), March 2005.

[PMTUD] Mathis, M., Heffner, J., and K. Lahey, "Path MTU Discovery", draft-ietf-pmtud-method (work in progress), February 2005.

[RFC0815] Clark, D., "IP datagram reassembly algorithms", RFC 815, July 1982.

[RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.

[RFC1626] Atkinson, R., "Default IP MTU for use over ATM AAL5", RFC 1626, May 1994.

[RFC2529] Carpenter, B. and C.
Jung, "Transmission of IPv6 over IPv4 Domains without Explicit Tunnels", RFC 2529, March 1999.

[RFC2684] Grossman, D. and J. Heinanen, "Multiprotocol Encapsulation over ATM Adaptation Layer 5", RFC 2684, September 1999.

[RFC2923] Lahey, K., "TCP Problems with Path MTU Discovery", RFC 2923, September 2000.

[RFC3056] Carpenter, B. and K. Moore, "Connection of IPv6 Domains via IPv4 Clouds", RFC 3056, February 2001.

[RFC3150] Dawkins, S., Montenegro, G., Kojo, M., and V. Magret, "End-to-end Performance Implications of Slow Links", BCP 48, RFC 3150, July 2001.

[RFC3385] Sheinwald, D., Satran, J., Thaler, P., and V. Cavanna, "Internet Protocol Small Computer System Interface (iSCSI) Cyclic Redundancy Check (CRC)/Checksum Considerations", RFC 3385, September 2002.

[RFC3819] Karn, P., Bormann, C., Fairhurst, G., Grossman, D., Ludwig, R., Mahdavi, J., Montenegro, G., Touch, J., and L. Wood, "Advice for Internet Subnetwork Designers", BCP 89, RFC 3819, July 2004.

[STONE] Stone, J., "Checksums in the Internet (Stanford Doctoral Dissertation)", August 2001.

[TEREDO] Huitema, C., "Teredo: Tunneling IPv6 over UDP through NATs", draft-huitema-v6ops-teredo (work in progress), April 2005.

[WLAN] IEEE Computer Society, "Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications", ANSI/IEEE 802.11, 1999 Edition.

Author's Address

Fred Lambert Templin (editor)
Boeing Phantom Works
P.O.
Box 3707
Seattle, WA 98124
USA

Email: fred.l.templin@boeing.com

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2005).
This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the Internet Society.