Network Working Group                                    F. Templin, Ed.
Internet-Draft                                      Boeing Phantom Works
Expires: September 4, 2006                                 March 3, 2006

                Link Adaptation for IPv6-in-IPv4 Tunnels
                     draft-templin-linkadapt-02.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 4, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   IPv6-in-IPv4 tunnel endpoints support an MTU of 1280 bytes or larger via static prearrangements and/or dynamic MTU determination based on ICMPv4 messages, but these methods have known operational limitations. This document proposes a new MTU determination mechanism for IPv6-in-IPv4 tunnels that supports larger MTUs using a link adaptation scheme with tunnel endpoint-based segmentation/reassembly and dynamic segment size probing.

Templin                 Expires September 4, 2006               [Page 1]

Internet-Draft         Link Adaptation for Tunnels            March 2006

1.  Introduction

   IPv6-in-IPv4 tunnels span multiple IPv4 network hops yet are seen by IPv6 as ordinary links that must support the minimum IPv6 link MTU of 1280 bytes ([RFC2460], section 5).
   Common tunneling mechanisms (e.g., [RFC2529][RFC3056][RFC4213][RFC4214][RFC4380]) meet this requirement through conservative static prearrangements at the expense of degraded performance over some paths due to excessive IPv4 network-based fragmentation and/or missed opportunities to discover larger MTUs. Optional dynamic MTU determination methods based on ICMPv4 "fragmentation needed" messages are also available, but can result in communication failures due to the unreliable and untrustworthy nature of ICMPv4 messages generated by network middleboxes.

   This document proposes a link adaptation method for IPv6-in-IPv4 tunnels that presents an assured MTU to the IPv6 layer. It uses tunnel endpoint-based segmentation/reassembly and dynamic segment size probing with authenticated probe feedback. Thus, it provides greater robustness and efficiency than existing schemes by avoiding IPv4 network-based fragmentation and dependence on unreliable/untrustworthy ICMPv4 feedback from IPv4 network middleboxes.

2.  Requirements

   The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when they appear in this document, are to be interpreted as described in [RFC2119].

3.  Link Adaptation for IPv6-in-IPv4 Tunnels

   The following subsections specify a link adaptation scheme for IPv6-in-IPv4 tunnels with properties similar to those defined for AAL5 [RFC2684] and IEEE 802.11 [WLAN]:

3.1.  Layering

   IPv6-in-IPv4 tunnel endpoints that implement the link adaptation specified in this document (hereafter referred to as "implementations") operate at a logical midpoint between the IPv6 and IPv4 protocol modules. From the viewpoint of IPv6, the implementation appears as a network driver that delivers whole Upper Layer Payloads (ULPs) to an underlying transmission medium. From the viewpoint of IPv4, the implementation appears as a packetization layer protocol that segments ULPs to be encapsulated in IPv4 packets.
   IPv6-in-IPv4 tunnel endpoints therefore operate at a logical "layer 2.5" between IPv6 as layer 3 and IPv4 as layer 2.

3.2.  Tunnel MTU

   Implementations MUST configure a minimum per-tunnel LinkMTU of 1280 bytes and SHOULD provide a configuration knob to set larger values. A maximum per-tunnel LinkMTU of 9180 bytes (i.e., the same as defined in [RFC1626]) is RECOMMENDED for normal use cases, since it is large enough to accommodate Gigabit Ethernet Jumbo Frames yet not so large as to diminish the effectiveness of 32-bit link layer CRCs [GIGE]. Implementations MAY set even larger LinkMTU values, but are advised that this may lead to unacceptable levels of undetected errors unless all physical segments in the path can provide assured error-free delivery for large packets.

3.3.  Encapsulation/Segmentation

   Encapsulating tunnel endpoints cache per-flow segment sizes ("SEGSIZE") for the purpose of segmenting ULPs that are too large to traverse the tunnel into chains of SEGSIZE-byte (or smaller) segments. Conservative implementations can configure an initial SEGSIZE of 68 bytes minus the length of the IPv4 header plus any additional layer 2.5 encapsulation headers, since the minimum IPv4 LinkMTU is 68 bytes [RFC0791]. Under normal conditions, however, implementations can configure initial SEGSIZE values up to 576 bytes minus the IPv4 and layer 2.5 encapsulation header lengths since all IPv4 nodes are required to configure a Maximum Receive Unit (MRU) of at least 576 bytes [RFC0791][RFC1122][RFC1812]. (Also, most links in the Internet configure still larger IPv4 LinkMTUs [RFC3150][RFC3819] such that larger initial SEGSIZE values are often possible.)

   Encapsulating tunnel endpoints split each ULP they send into the tunnel into a chain of segments for presentation to the IPv4 layer.
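   As an illustration only, the initial SEGSIZE arithmetic and the splitting of a ULP into a chain can be sketched in Python as follows. The function names, the 20-byte default for an IPv4 header without options, and the extra_header_len parameter (standing in for any layer 2.5 encapsulation headers) are assumptions made for this sketch and are not defined by this document.

```python
from typing import List

IPV4_MIN_LINK_MTU = 68   # minimum IPv4 LinkMTU [RFC0791]
IPV4_MIN_MRU = 576       # minimum IPv4 Maximum Receive Unit


def initial_segsize(ipv4_header_len: int = 20,
                    extra_header_len: int = 0,
                    conservative: bool = False) -> int:
    """Return a starting SEGSIZE (hypothetical helper).

    ipv4_header_len defaults to 20, the size of an IPv4 header
    without options; extra_header_len stands in for any layer 2.5
    encapsulation headers.
    """
    base = IPV4_MIN_LINK_MTU if conservative else IPV4_MIN_MRU
    size = base - ipv4_header_len - extra_header_len
    if size <= 0:
        raise ValueError("headers leave no room for payload")
    return size


def split_ulp(ulp: bytes, segsize: int) -> List[bytes]:
    """Split a ULP into consecutive segments of at most segsize bytes.

    Every segment except possibly the last is exactly segsize bytes,
    and concatenating the segments in order yields the original ULP.
    """
    if segsize <= 0:
        raise ValueError("segsize must be positive")
    return [ulp[i:i + segsize] for i in range(0, len(ulp), segsize)]


if __name__ == "__main__":
    segsize = initial_segsize()
    chain = split_ulp(bytes(1280), segsize)
    print(segsize, [len(s) for s in chain])
```

   With the defaults, a 1280-byte ULP is carried in two full segments and one shorter final segment.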
   The segments MUST be contiguous and non-overlapping, i.e., the final byte of the (i)th segment MUST be the byte that immediately precedes the first byte of the (i+1)th segment. Non-final segments in the chain MUST be equal in length; the final segment MAY be of different length.

   For ULPs that span multiple segments, encapsulators use 2's complement Fletcher-32 [STONE][RFC3385] to calculate a checksum across all payload bytes and encode the A and B results in a trailing 32-bit field as the final 4 bytes of the final packet(s) in the chain. For ULPs that fit within a single segment, the trailing 32-bit checksum is omitted.

   Each segment in the chain is encapsulated in an IPv4 header plus any additional layer 2.5 encapsulating headers, with the reserved bit in the IPv4 "Flags" field set to '1' to inform the decapsulating tunnel endpoint that the segmentation/reassembly scheme specified by this document is used. In addition, each segment encodes the following information in the 16-bit IPv4 "Identification" field:

       0                   1
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |   ULPID   |     SEGID     |P|A|
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

          IPv4 Identification Field

   ULPID: 6 bits
      An identifying value assigned by the sender to aid in reassembling the segments of a ULP.

   SEGID: 8 bits
      A value that identifies a specific segment within a ULP. Acceptable values are in the range 0 - 254.

   P: 1 bit
      Probe flag. 0 = Ordinary Segment, 1 = Probe Segment.

   A: 1 bit
      Additional Segments flag; 0 = Last Segment, 1 = Additional Segments.

   Each IPv4 packet in a chain encodes an identical value in the "ULPID" field (bits 0 through 5 of the IPv4 Identification field) to identify the segments of a specific ULP; IPv4 packets that encapsulate segments of different ULPs encode different ULPID values.
   Consecutive IPv4 packets in a chain encode an increasing Segment ID value in the range 0 - 254 in the "SEGID" field (bits 6 through 13 of the IPv4 Identification field), i.e., the first packet encodes the value '0', the second packet encodes the value '1', etc. Each packet in the chain except the final one sets the "Additional Segments - A" bit (bit 15 of the IPv4 Identification field) to indicate that additional segments follow. Finally, each packet in the chain is delivered to the link layer (i.e., the IPv4 stack) in increasing SEGID order, i.e., SEGID 0 first, followed by SEGID 1, etc., up to the final packet. The link layer SHOULD NOT reorder the packets.

   To increase efficiency and avoid excessive packet chain lengths, implementations SHOULD use path probing to increase a flow's SEGSIZE to larger values while avoiding black holes [RFC2923]. Implementations probe a candidate SEGSIZE value 'N' by setting the "Probe Segment - P" bit (bit 14 of the IPv4 Identification field) in a probe segment of size N within a packet chain. After sending the probe segment, if the encapsulator receives, within a maximum probe delay ("MaxProbeDelay") timeout period, a unicast IPv6 Router Advertisement message [RFC2461] from the decapsulator at the far end of the tunnel (see: section 3.4) with an MTU option that encodes the value N, it deems the probe successful.

   For probe segments that contain valid data for reassembly as part of a packet chain, the encapsulator sets the appropriate SEGID value in the IPv4 packet header as for ordinary segmentation. For probe segments that are to be discarded by the decapsulator, the encapsulator sets the value 255 in the SEGID field.

   Following a successful probe, but before advancing SEGSIZE to N, implementations SHOULD enter a brief verification phase during which additional probe segments are sent to detect asymmetric multipath MTU restrictions.
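   The bit positions given above for the ULPID, SEGID, P and A fields can be illustrated with a short Python sketch. It assumes the usual convention that bit 0 of the diagram is the most significant bit of the 16-bit Identification value; the function names are hypothetical and serve only as an example.

```python
from typing import Tuple

PROBE_DISCARD_SEGID = 255  # SEGID value for probe segments to be discarded


def pack_ident(ulpid: int, segid: int, probe: bool, more: bool) -> int:
    """Build the 16-bit IPv4 Identification value (hypothetical helper).

    Layout, most significant bit first:
    ULPID (6 bits) | SEGID (8 bits) | P (1 bit) | A (1 bit)
    """
    if not 0 <= ulpid <= 0x3F:
        raise ValueError("ULPID must fit in 6 bits")
    if not 0 <= segid <= 0xFF:
        raise ValueError("SEGID must fit in 8 bits")
    if segid == PROBE_DISCARD_SEGID and not probe:
        raise ValueError("SEGID 255 is only used on probe segments")
    return (ulpid << 10) | (segid << 2) | (int(probe) << 1) | int(more)


def unpack_ident(ident: int) -> Tuple[int, int, bool, bool]:
    """Return (ULPID, SEGID, P, A) from a 16-bit Identification value."""
    return ((ident >> 10) & 0x3F,
            (ident >> 2) & 0xFF,
            bool(ident & 0x2),
            bool(ident & 0x1))


if __name__ == "__main__":
    # First segment of ULP 1, ordinary segment, more segments follow.
    print(hex(pack_ident(1, 0, probe=False, more=True)))
```

   A decapsulator written along these lines would treat a packet whose unpacked SEGID is 255 and whose P flag is set as a probe to be acknowledged and then discarded.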
   Thereafter, implementations SHOULD re-probe periodically to confirm that packets with up to SEGSIZE byte segments are still reaching the decapsulator at the far end of the tunnel. Additional strategies for SEGSIZE management and black hole detection are found in [PMTUD][ICMPATK].

3.4.  Decapsulation/Reassembly

   For tunneled packets with the reserved bit in the IPv4 "Flags" field set to '1' (see section 3.3), the IPv4 length, ULPID, SEGID and A fields along with flow identification information in layer 2.5 encapsulation headers provide sufficient information for the decapsulating tunnel endpoint to reassemble an original ULP with protection against packet reordering in the IPv4 network.

   Implementations of this scheme configure per-flow reassembly buffers of at least 1280 bytes and SHOULD configure larger reassembly buffers up to 9180 bytes or larger (see: section 3.2). Note that these reassembly buffers occur at the logical layer 2.5 midpoint between the IPv4 and IPv6 stacks and are thus distinct from the IPv4 and IPv6 reassembly caches.

   Decapsulating tunnel endpoints use per-flow reassembly buffers to concatenate the segments received in packet chains for a particular ULPID in increasing SEGID order (i.e., SEGID 0, followed by SEGID 1, etc.) even if the packets were re-ordered by the network. When all segments for a particular ULPID have been concatenated into the reassembly buffer, the implementation uses 2's complement Fletcher-32 to detect errors if a trailing checksum was included (see: section 3.3).

   If the decapsulating tunnel endpoint receives a packet chain that would overflow the reassembly buffer, it discards the chain and sends an [ICMPv6] "packet too big" message back to the source.
   The message body includes upper layer packet headers (IPv6 and above) and contents of the reassembly buffer up to a total of 1280 bytes, and the MTU value encodes the reassembly buffer size.

   If the decapsulating tunnel endpoint receives at least one segment, but one or more segments are lost and/or checksum verification fails, it SHOULD send an [ICMPv6] "parameter problem" message with code "reassembly/checksum error" back to the encapsulating tunnel endpoint. The message body includes upper layer packet headers (IPv6 and above) and contents of the reassembly buffer up to a total of 1280 bytes, and the pointer identifies either the beginning of the first missing segment or the beginning of the 4 byte checksum field (if no segments were missing).

   Upon receipt of such [ICMPv6] errors, the encapsulator SHOULD take appropriate corrective actions such as reducing the tunnel's current SEGSIZE, imposing an artificial inter-ULP queuing delay for the tunnel, relaying the [ICMPv6] messages back to the original source as a congestion indication, etc.

   If the decapsulating tunnel endpoint receives a segment used for probing (i.e., an IPv4 packet in the chain with the 'P' flag set), it sends a unicast IPv6 Router Advertisement message back to the encapsulator at the originating end of the tunnel with an MTU option that encodes the probe segment length (subject to rate-limiting as for [ICMPv6] error messages). If the IPv4 packet containing the probe segment encodes the value 255 in the SEGID field, the segment is discarded; otherwise, the segment is included as part of the normal reassembly procedure described above.

   Following successful reassembly, the decapsulating tunnel endpoint discards the trailing checksum (if present) and delivers the ULP to upper layers.

3.5.  Setting the DF Bit

   When encapsulating tunnel endpoints segment ULPs (see: section 3.3), they can optionally set or not set the "Don't Fragment - DF" bit in the IPv4 headers of packets in a chain.

   If the DF bit is not set, network-based IPv4 fragmentation may occur for packets in a chain resulting in well-known performance issues [FRAG]. Additionally, some middleboxes (such as IPv4 NATs and firewalls) are only capable of passing the first fragment of a multi-fragment IPv4 datagram, which could result in silent communication failures at decapsulating tunnel endpoints. Finally, sending large IPv4 packets with the DF bit not set could result in IPv4 reassembly buffer overruns at some decapsulating tunnel endpoints and thereby also result in silent communication failures.

   While not setting the DF bit can lead to communication failures observed as path MTU-related black holes, in some instances it might result in successful communications when setting the DF bit would otherwise have resulted in packet loss due to link MTU restrictions. In view of these considerations, encapsulating tunnel endpoints are advised to adopt a consistent strategy regarding setting of the DF bit. In any case, encapsulating tunnel endpoints SHOULD set the DF bit in the IPv4 headers of packets used for probing.

3.6.  ICMPv4 Error Handling

   Encapsulators may receive ICMPv4 "fragmentation needed" error messages from inside a tunnel due to probe failures and/or route changes across previously-probed paths. These messages may come from either legitimate IPv4 network middleboxes or adversarial/mis-configured middleboxes that return wrong information. Implementers are advised to consult [PMTUD][ICMPATK] for operational recommendations on processing ICMPv4 "fragmentation needed" messages.

4.  IANA Considerations

   The IANA is instructed to assign a code value for "reassembly/checksum error" under the [ICMPv6] Parameter Problem message type in the "ICMPv6 Type Numbers" registry.

5.  Security Considerations

   The securing mechanisms for IPv6 neighbor discovery [RFC3971] and Cryptographically-Generated Addresses [RFC3972] are used to authenticate Router Advertisement probe responses.

6.  Acknowledgments

   This document represents the mindshare of many contributors.

7.  Appendix A: Additional Considerations

   Encapsulators can use the probing mechanism described in section 3 as a general-purpose method for eliciting acknowledgements from the reassembler if improved reliability at the expense of additional overhead is desired.

   The equal size restriction for non-final segments and non-overlapping restriction for all segments in packet chains provide a significant simplification for reassembly algorithms [RFC0815].

   Use of the link adaptation scheme described in this document may lead to an overall increase in short chains of small packets in the Internet. Network administrators are advised to follow the recommendations in [RFC3150] to minimize packet loss and packet reordering. Also, overly-long packet chains should be avoided if possible due to interactions with Active Queue Management (AQM) in the network.

   Since link-layer CRC-32 checks normally occur on each segment in the path, most errors detected during ULP reassembly will be due to packet splices and/or errors in the data path between the NIC hardware and the reassembly buffer. The Fletcher-32 checksum algorithm has been shown to provide an effective edge-to-edge error detection capability for such errors [STONE]. The Fletcher-32 checksum is also dissimilar from both CRC-32 and the Internet checksum used by many upper layer protocols, thereby decreasing the likelihood of undetected errors.
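   As a rough illustration of the trailing checksum discussed above, the following Python sketch computes the Fletcher-32 A and B sums with 2's complement (modulo 65536) arithmetic, appends them as a 4-byte trailer, and verifies them on reassembly. This document does not fix the word order, the padding of odd-length payloads, or the order of A and B in the trailer; the sketch assumes big-endian 16-bit words, a zero pad byte for the computation only, and A followed by B. All function names are hypothetical.

```python
from typing import Optional, Tuple


def fletcher32_sums(data: bytes) -> Tuple[int, int]:
    """Return the Fletcher-32 (A, B) sums, each reduced modulo 2**16."""
    if len(data) % 2:
        data += b"\x00"  # assumption: zero-pad odd-length input
    a = b = 0
    for i in range(0, len(data), 2):
        word = (data[i] << 8) | data[i + 1]  # assumption: big-endian words
        a = (a + word) & 0xFFFF  # 2's complement: wrap at 2**16
        b = (b + a) & 0xFFFF
    return a, b


def append_checksum(payload: bytes) -> bytes:
    """Return payload followed by the 4-byte trailer (A then B, assumed)."""
    a, b = fletcher32_sums(payload)
    return payload + a.to_bytes(2, "big") + b.to_bytes(2, "big")


def verify_and_strip(buffer: bytes) -> Optional[bytes]:
    """Check the trailer of a reassembled buffer.

    Returns the payload without the trailer on success, or None if
    the buffer is too short or the checksum does not match.
    """
    if len(buffer) < 4:
        return None
    payload, trailer = buffer[:-4], buffer[-4:]
    a, b = fletcher32_sums(payload)
    if trailer != a.to_bytes(2, "big") + b.to_bytes(2, "big"):
        return None
    return payload


if __name__ == "__main__":
    framed = append_checksum(b"example upper layer payload")
    print(verify_and_strip(framed) is not None)
```

   Because B accumulates the running value of A, the result depends on the order of the 16-bit words, which is what lets the check catch misordered or spliced segments that a plain sum would miss.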
   Some upper layer packetization protocols (e.g., NFS) generate fixed payload sizes and rely on the network layer to deliver the payloads either as whole IP packets or as chains of IP fragments. Since NFS performance (and the performance of other upper layer packetization protocols) is highly sensitive to packet handling overhead, implementations should periodically attempt to increase the SEGSIZE through probing even if initial probe attempts fail.

8.  Appendix B: Changes

   Changes since -01:

   o  Updated references

   Changes since -00:

   o  Defined new coding of segmentation/reassembly info in the IPv4 Identification field

   o  Changed "tunneling mechanism" to "tunnel endpoint"

   o  Clarified text on trailing checksums

   o  General document cleanup; removed "additional considerations" that no longer apply

9.  References

9.1.  Normative References

   [ICMPv6]   Conta, A., Deering, S., and M. Gupta, ed., "Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification", draft-ietf-ipngwg-icmp-v3 (work in progress), July 2005.

   [RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

   [RFC1122]  Braden, R., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC1812]  Baker, F., "Requirements for IP Version 4 Routers", RFC 1812, June 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2460]  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC 2460, December 1998.

   [RFC2461]  Narten, T., Nordmark, E., and W. Simpson, "Neighbor Discovery for IP Version 6 (IPv6)", RFC 2461, December 1998.

   [RFC3971]  Arkko, J., Kempf, J., Zill, B., and P. Nikander, "SEcure Neighbor Discovery (SEND)", RFC 3971, March 2005.

   [RFC3972]  Aura, T., "Cryptographically Generated Addresses (CGA)", RFC 3972, March 2005.

9.2.  Informative References

   [FRAG]     Mogul, J. and C. Kent, "Fragmentation Considered Harmful, In Proc. SIGCOMM '87 Workshop on Frontiers in Computer Communications Technology.", August 1987.

   [GIGE]     Dykstra, P., "Gigabit Ethernet Jumboframes (And Why You Should Care), http://sd.wareonearth.com/~phil/jumbo.html", December 1999.

   [ICMPATK]  Gont, F., "ICMP Attacks Against TCP", draft-gont-tcpm-icmp-attacks (work in progress), October 2005.

   [PMTUD]    Mathis, M. and J. Heffner, "Path MTU Discovery", draft-ietf-pmtud-method (work in progress), October 2005.

   [RFC0815]  Clark, D., "IP datagram reassembly algorithms", RFC 815, July 1982.

   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.

   [RFC1626]  Atkinson, R., "Default IP MTU for use over ATM AAL5", RFC 1626, May 1994.

   [RFC2529]  Carpenter, B. and C. Jung, "Transmission of IPv6 over IPv4 Domains without Explicit Tunnels", RFC 2529, March 1999.

   [RFC2684]  Grossman, D. and J. Heinanen, "Multiprotocol Encapsulation over ATM Adaptation Layer 5", RFC 2684, September 1999.

   [RFC2923]  Lahey, K., "TCP Problems with Path MTU Discovery", RFC 2923, September 2000.

   [RFC3056]  Carpenter, B. and K. Moore, "Connection of IPv6 Domains via IPv4 Clouds", RFC 3056, February 2001.

   [RFC3150]  Dawkins, S., Montenegro, G., Kojo, M., and V. Magret, "End-to-end Performance Implications of Slow Links", BCP 48, RFC 3150, July 2001.

   [RFC3385]  Sheinwald, D., Satran, J., Thaler, P., and V. Cavanna, "Internet Protocol Small Computer System Interface (iSCSI) Cyclic Redundancy Check (CRC)/Checksum Considerations", RFC 3385, September 2002.

   [RFC3819]  Karn, P., Bormann, C., Fairhurst, G., Grossman, D., Ludwig, R., Mahdavi, J., Montenegro, G., Touch, J., and L. Wood, "Advice for Internet Subnetwork Designers", BCP 89, RFC 3819, July 2004.

   [RFC4213]  Nordmark, E. and R. Gilligan, "Basic Transition Mechanisms for IPv6 Hosts and Routers", RFC 4213, October 2005.

   [RFC4214]  Templin, F., Gleeson, T., Talwar, M., and D. Thaler, "Intra-Site Automatic Tunnel Addressing Protocol (ISATAP)", RFC 4214, October 2005.

   [RFC4380]  Huitema, C., "Teredo: Tunneling IPv6 over UDP through Network Address Translations (NATs)", RFC 4380, February 2006.

   [STONE]    Stone, J., "Checksums in the Internet (Stanford Doctoral Dissertation)", August 2001.

   [WLAN]     Society, I., "Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, IEEE Computer Society, ANSI/IEEE 802.11, 1999 Edition.".

Author's Address

   Fred L. Templin (editor)
   Boeing Phantom Works
   P.O. Box 3707
   Seattle, WA  98124
   USA

   Email: fred.l.templin@boeing.com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.
   The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the Internet Society.