Internet-Draft
Matt Mathis
John Heffner
PSC
Kevin Lahey
Freelance
Feb 23, 2003

Packetization Layer Path MTU Discovery
draft-mathis-plpmtud-00.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

This document describes a new Packetization Layer MTU probing algorithm which is not subject to the problems associated with the current path MTU discovery algorithms [RFC1191, RFC1981, RFC2923]. The general strategy of the new algorithm is to start with a small MTU and probe upward, testing successively larger MTUs by probing with single packets. If the probe is successfully delivered, then the MTU is raised. If the probe is lost, it is treated as an MTU limitation and not as a congestion signal.

Table of Contents

TBD

Mathis, et al [Page 1]
Internet-Draft Expires Aug 2003 Feb 23, 2003

1. Introduction

Path MTU discovery (PMTUD) as described in RFC1191 and RFC1981 depends on ICMP messages from the network. For a variety of reasons, these messages may not be correctly generated or propagated back to the end host, causing connection failure [RFC2923]. This draft proposes a robust method of performing path MTU discovery from TCP which does not depend on messages from the network.
These procedures should be applicable to other transport- or application-level packetization protocols which implement similar features. The lower layers need only be consistent about which packet sizes are acceptable. Media that have parametric limitations (e.g. MTU bounds due to limited clock stability) must include explicit mechanisms to consistently reject packets that might otherwise be nondeterministically delivered.

Classic ICMP-layer PMTUD, when working properly, can speed the discovery of the correct PMTU. In addition, packetization-layer PMTUD (PLPMTUD) can be extended with heuristics to use other criteria to select the PMTU. For example, on a path that is so congested that the fair share window is only 5 KB, TCP may be better behaved with 512-byte packets than with 1500-byte packets, since with the larger packets the window would be too small to trigger Fast Retransmit.

PLPMTUD is defined by two independent algorithms. The "probing method", which is specified in this document, defines the manner in which a candidate MTU may be validated or invalidated. The "probing strategy" describes a suggested approach to choosing MTUs to probe. It is only loosely described here and is subject to future research and improvement.

The general strategy is to start with a small MTU and probe upward, testing successively larger sizes by probing with single packets. If the probe is successfully delivered, then the MTU is raised. If the probe is lost, it is treated as an MTU limitation and not as a congestion signal.

2. TODO list

This Internet-Draft is a partial update of an earlier informally published document. It still needs to be revised to:

o Restructure the document and use the "packetization layer" and "network layer" terminology from RFC1191, to make it less TCP specific.

o Collect all of the TCP specific details into one section.
Describe the general principles that apply to all transport protocols.

o Fold all of RFC1191, RFC1981 and related lore into this document. (longer term)

o This document should thoroughly address persistent timeouts. When a TCP or other transport connection suddenly experiences persistent timeouts, several competing recovery strategies might be invoked at each level in the protocol stack, including restarting network interfaces, trying alternate first hop routers, using smaller MTUs, etc. All of these individual strategies need to be tied into a single unified multi-level strategy.

o We need to consider robustness under a number of pathological conditions, such as when there is multi-path routing over paths with different MTUs.

Please send comments and suggestions to mtu@psc.edu.

3. Context and terminology

This algorithm is built on top of TCP. Its basic design is portable to other protocols, including application protocols over RTP or UDP, and SCTP. It is lightweight enough that it is not mandatory for MSS information to be passed between successive TCP connections to the same remote host. It does not incur excessive overhead for each connection to discover the maximum MTU on its own.

In TCP it can be inconvenient to compute the largest possible segment size given a particular MTU due to the presence of variable length options, such as TCP SACK. MSS probing minimizes this problem by choosing the segment sizes and testing if the link can support transmission of the resulting IP packet. It is recommended that the test packet be padded with the maximal length variable options.

Note that we use the term Maximum Transmission Unit (MTU) to mean the largest possible IP packet, i.e. the largest possible layer 2 payload. Most link layer standards organizations use MTU to mean the largest possible total layer two frame, including the layer two header.

MSS probing can be adapted to other, non-TCP protocols.
In particular, MSS probing can be adapted to tunneling protocols if the tunnel endpoints have a mechanism to detect and report missing packets.

4. Probe method

A new "candidate MSS" is tested by sending one "probe segment", which is larger than the current MSS.

Before a probe can be sent the following criteria MUST be met:

o The connection MUST have at least the candidate MSS worth of pending data.

o The connection MUST be using the current MSS, as defined by having received at least one acknowledgment for a recent non-probe segment at the current MSS. This implicitly limits successful probes to once per two round trips. [Making the algorithm robust in the presence of multi-path routing is likely to require an additional RTT.]

o Failed and inconclusive probes must be more widely spaced than the normal AIMD congestion interval for the current average window size. This is enforced by keeping a "probe countdown" which is decremented on each non-probe segment sent. Probes MUST NOT be sent before the probe countdown reaches zero.

After a probe segment has been sent (of size candidate MSS), the subsequent segment(s) MUST be sent as though the probe segment was not oversized. Thus if the probe segment is lost, it will leave a hole that is exactly one current MSS. We refer to this potential hole as the probe gap. Note that the length of the probe segment is determined by the candidate MSS under consideration, but the length of the probe gap is the current MSS.

[This has been shown to be more restrictive than necessary.]

The candidate MSS MUST be strictly smaller than three times the current MSS. Thus the probe segment fully covers at most one subsequent segment. The second subsequent segment is at most partially covered by the probe segment. This guarantees that the segments following the probe segment will cause at most one superfluous duplicate acknowledgment.
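
The sending criteria above can be sketched as a single eligibility check. This is a minimal illustration in Python and not part of the specification: the ProbeState class, its field names, and the functions can_send_probe and on_non_probe_segment_sent are hypothetical, and a real TCP would derive these values from its own connection state.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Hypothetical per-connection state needed to decide whether to probe."""
    current_mss: int          # MSS currently in use, in bytes
    pending_bytes: int        # data queued and not yet sent
    current_mss_acked: bool   # a recent non-probe segment at current_mss was ACKed
    probe_countdown: int      # non-probe segments still to send before probing


def can_send_probe(state: ProbeState, candidate_mss: int) -> bool:
    """Return True only if every precondition for sending a probe holds."""
    # A probe segment is by definition larger than the current MSS.
    if candidate_mss <= state.current_mss:
        return False
    # The candidate must be strictly smaller than three times the current MSS.
    if candidate_mss >= 3 * state.current_mss:
        return False
    # There must be at least one candidate MSS worth of pending data.
    if state.pending_bytes < candidate_mss:
        return False
    # The connection must be demonstrably using the current MSS.
    if not state.current_mss_acked:
        return False
    # Probes must not be sent before the countdown reaches zero.
    return state.probe_countdown <= 0


def on_non_probe_segment_sent(state: ProbeState) -> None:
    """Decrement the probe countdown for each non-probe segment sent."""
    if state.probe_countdown > 0:
        state.probe_countdown -= 1
```

For example, with a current MSS of 1024 bytes, a candidate of 2048 bytes is eligible once enough data is pending and the countdown has expired, whereas a candidate of 3072 bytes is rejected because it is not strictly smaller than three times the current MSS.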
The TCP MUST be using Fast Retransmit and SACK or NewReno, such that isolated lost segments will normally be retransmitted without the spurious retransmission of any additional segments.

During the probe, all of the normal retransmission, recovery and congestion control machinery is in effect, except that if just the probe gap is retransmitted (and no other segments), the normal multiplicative cwnd reduction is suppressed. If any other segments are retransmitted, all normal cwnd reductions MUST take place.

The probe is completed when the acknowledgment sequence advances past the probe gap. If the probe gap was not retransmitted, the probe was successful. If the probe gap was retransmitted and there were no other retransmissions, the candidate MSS failed. If there were any other retransmissions, the probe was inconclusive.

If the probe was successful, the current MSS is updated to the candidate MSS. If cwnd and other congestion state variables are kept in packets, they MUST be rescaled by the change in MSS, to preserve the current window size in bytes.

If the probe failed or was inconclusive, the probe countdown is set to COUNTDOWN_SCALE times the square of the current window size in packets.

If an RFC1191 style ICMP "Can't Fragment" message is received, it is used to compute an MSS limit by deducting the TCP/IP header sizes (including options) from the MTU reported in the ICMP message. If the MSS limit is between the current MSS and candidate MSS, the current MSS is updated from the MSS limit; otherwise the message is ignored. If the current MSS is updated, then the probe strategy is forced into the monitor state described below.

5. Probe strategy

The probe strategy described here is a recommended baseline algorithm. It is not presented in formal standards language because the probe strategy can include heuristics to help select an optimal MSS for a given path.
As a consequence there is opportunity for future improvements to this algorithm.

The probing strategy has three major states: search, monitor and suspend. During the search state, it sequentially searches for the largest MSS that the path can support. Once the path MSS has been discovered, the probing algorithm enters the monitor state, where it probes infrequently to detect if the path MSS has become larger. If the MSS probing persistently fails it may be desirable to suspend path MSS probing and heuristically select one of the common default MSSs: 576, 1280, or 1500 Bytes.

The recommended search strategy is a multi-phase scan: first, a coarse scan for the approximate path MSS using factor of 2 steps starting at 1024 Bytes until a probe fails, followed by successively finer scans between the largest previously successful and unsuccessful probes.

Table 1: Recommended MSS scanning sequence
(Coarse scan down column 1, fine scan across each row)

512, [Use only after repeated timeouts]
1024, 1492, 2002
2048
4096, 4352
8192, 9000
16384, 17914
32768
64512
((Additional values needed))

During the scan it is recommended that the MSS not be raised if cwnd is too small as determined by a heuristic. For the time being the recommended heuristic is that the MSS is only raised when the cwnd is larger than 20 segments.

Once the scan has found an appropriate MSS, the probe strategy enters the monitor state, where it re-probes the most recent failed MTU once every MONITOR_INTERVAL seconds. If the probe fails, it remains in the monitor state. If it succeeds, it re-enters the search state.

If the network becomes too congested during either the search or monitor states it is recommended that the MSS be reduced to a smaller size as determined by a heuristic. The recommended heuristic is to reduce the MSS if ssthresh is reduced to 5 segments or smaller.
The recommended reduction is to the next smaller major MSS step in Table 1.

When there are repeated timeouts (MAX_TIMO or more retransmissions, without any received ACKs), it is presumed that the connection was re-routed onto a link with a smaller MSS, and that ICMP messages are not being delivered. The MSS probing algorithm is reset by pulling back the MSS to 1024 Bytes, rescaling the congestion control variables and reentering the search state.

If there is a timeout and cwnd prior to the timeout was smaller than 6 packets, then the probe strategy can enter the suspend state and set the MSS to 512 (1280) Bytes. This has the effect of reducing the minimum data rate that TCP can stably manage.

6. Shared state

The common implementations of RFC1191 keep the discovered MTU in a route structure in the IP layer, because that is really the proper place to process ICMP messages. Path MSS discovery can most easily be added to a current PMTUD implementation by keeping most of the state variables for MSS probing in the same route structure.

The following state should be kept in the IP layer per peer address: most recent successful IP message size (MSS + full TCP/IP header size), most recent failed IP message size, probe strategy state, an indication of whether there is currently a probe in progress, and, if so, the probing TCP connection.

TCP should keep the following state: an indication of whether it is currently probing, the sequence number of the most recent probe gap, and the TCP/IP header size.

[[Note, we really need to take all of the relevant parts of RFC1191 as well as various lessons learned and fold all of them into one new document]]

7. Probing intervals

COUNTDOWN_SCALE 2 - The scale factor applied to the square of the window in packets to compute the smallest number of non-probe packets required before the next probe.

MONITOR_INTERVAL 600 - The interval in seconds between attempts to probe for a larger MSS when in the monitor state.
MAX_TIMO 2 - The number of repeated timeouts needed to trigger

8. Normative references

[RFC1191] Path MTU discovery. J.C. Mogul, S.E. Deering. November 1990. (Format: TXT=47936 bytes) (Obsoletes RFC1063) (Status: DRAFT STANDARD)

[RFC1435] IESG Advice from Experience with Path MTU Discovery. S. Knowles. March 1993. (Format: TXT=2708 bytes) (Status: INFORMATIONAL)

[RFC1981] Path MTU Discovery for IP version 6. J. McCann, S. Deering, J. Mogul. August 1996. (Format: TXT=34088 bytes) (Status: PROPOSED STANDARD)

[RFC2923] TCP Problems with Path MTU Discovery. K. Lahey. September 2000. (Format: TXT=30976 bytes) (Status: INFORMATIONAL)

9. Informative references

[RFC1063] IP MTU discovery options. J.C. Mogul, C.A. Kent, C. Partridge, K. McCloghrie. July 1988. (Format: TXT=27121 bytes) (Obsoleted by RFC1191)

[RFC1626] Default IP MTU for use over ATM AAL5. R. Atkinson. May 1994. (Format: TXT=11841 bytes) (Obsoleted by RFC2225) (Status: PROPOSED STANDARD)

[RFC1791] TCP And UDP Over IPX Networks With Fixed Path MTU. T. Sung. April 1995. (Format: TXT=22347 bytes) (Status: EXPERIMENTAL)

10. Security considerations

Since the MTU reported in the ICMP messages is constrained to be between the old MTU and the candidate MTU, this algorithm is more difficult to attack through fraudulent ICMP messages. Furthermore, since this algorithm can function properly without ICMP messages, that part of the algorithm can be disabled for additional robustness in hostile environments.

11. IANA considerations

12. Contributors

13. Acknowledgements

Matt Mathis and John Heffner are supported by a grant from Cisco Systems, Inc.

14. Authors' addresses

Please send comments and suggestions to mtu@psc.edu.

Matt Mathis and John Heffner
Pittsburgh Supercomputing Center
4400 Fifth Ave.
Pittsburgh, PA 15213
mathis@psc.edu
jheffner@psc.edu

Kevin Lahey
Freelance
kml@patheticgeek.net

15. Intellectual Property

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF Secretariat.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

16. Full copyright statement

Copyright (C) The Internet Society Feb 23, 2003. All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works.
However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.