Internet-Draft                                               Matt Mathis
                                                            John Heffner
                                                                     PSC
                                                             Kevin Lahey
                                                               Freelance
                                                            14 Feb, 2004

                           Path MTU Discovery
                     draft-ietf-pmtud-method-01.txt


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


Abstract

   This document describes a robust new method for Path MTU Discovery
   that relies on TCP or other Packetization Layer to probe an Internet
   path with progressively larger packets.  This method is described as
   an extension to RFC 1191 and RFC 1981, which specify ICMP based Path
   MTU Discovery for IP versions 4 and 6.  This document does not define
   a protocol, but rather a method to use features of existing protocols
   to discover the path MTU.

   The general strategy of the new algorithm is to start with a small
   MTU and probe upward, testing successively larger MTUs by probing


Mathis, et al                                                   [Page 1]


Internet-Draft Expires Sept 2004                            14 Feb, 2004


   with single packets.  If the probe is successfully delivered, then
   the MTU is raised.  If the probe is lost, it is treated as an MTU
   limitation and not as a congestion signal.


Table of Contents

   1. Introduction  . . . . . . . . . . . . . . . . . . . . . . . 3
   2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . 3
   3. Overview  . . . . . . . . . . . . . . . . . . . . . . . . . 6
   3.1. General Method  . . . . . . . . . . . . . . . . . . . . . 8
   3.2. Generating Probes . . . . . . . . . . . . . . . . . . . . 9
   3.3. Normal sequence of events to raise the MTU  . . . . . . . 10
   3.4. Processing MTU Indications  . . . . . . . . . . . . . . . 11
   3.4.1. Processing Packet Too Big Messages  . . . . . . . . . . 11
   3.4.2. Packetization Layer retransmits lost packets  . . . . . 11
   3.4.3. Packetization Layer Retransmission Timeout  . . . . . . 13
   3.5. Probing Intervals . . . . . . . . . . . . . . . . . . . . 14
   3.6. Interoperation with prior algorithms  . . . . . . . . . . 15
   4. Requirements  . . . . . . . . . . . . . . . . . . . . . . . 15
   5. Implementation Issues . . . . . . . . . . . . . . . . . . . 16
   5.1. Layering and Accounting for Header Sizes. . . . . . . . . 17
   5.2. Storing PMTU information  . . . . . . . . . . . . . . . . 18
   5.3. Host fragmentation  . . . . . . . . . . . . . . . . . . . 19
   5.4. Multicast . . . . . . . . . . . . . . . . . . . . . . . . 19
   5.5. Path MTU Search Strategy  . . . . . . . . . . . . . . . . 20
   5.5.1. Search  . . . . . . . . . . . . . . . . . . . . . . . . 20
   5.5.2. Monitor . . . . . . . . . . . . . . . . . . . . . . . . 21
   5.5.3. Suspend . . . . . . . . . . . . . . . . . . . . . . . . 21
   5.6. Implementation issues for specific Packetization Layers . 21
   5.6.1. Probing method using TCP  . . . . . . . . . . . . . . . 21
   5.6.2. Probing method using SCTP . . . . . . . . . . . . . . . 22
   5.6.3.  Issues for tunnels . . . . . . . . . . . . . . . . . . 23
   5.6.4.  Issues for other transport protocols . . . . . . . . . 23
   5.7.  Diagnostic tools . . . . . . . . . . . . . . . . . . . . 23
   5.8.  Management interface . . . . . . . . . . . . . . . . . . 23
   6. Normative references  . . . . . . . . . . . . . . . . . . . 24
   7. Informative references  . . . . . . . . . . . . . . . . . . 24
   8. Security considerations . . . . . . . . . . . . . . . . . . 24
   9. IANA considerations . . . . . . . . . . . . . . . . . . . . 25
   10. Contributors . . . . . . . . . . . . . . . . . . . . . . . 25
   11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . 25
   12. Authors' addresses . . . . . . . . . . . . . . . . . . . . 25
   13. Intellectual Property  . . . . . . . . . . . . . . . . . . 25
   14. Full copyright statement . . . . . . . . . . . . . . . . . 26


1. Introduction


   This document describes a method for Packetization Layer Path MTU
   Discovery (PLPMTUD) which is an extension to existing Path MTU
   discovery methods.  The proper MTU is determined by starting with
   small packets and probing with successively larger packets.   The
   bulk of the algorithm is implemented above IP, in the transport layer
   (e.g. TCP) or other "Packetization Protocol" that is responsible for
   determining packet boundaries.

   This document draws heavily on RFC 1191 and RFC 1981 for
   terminology, ideas, and some of the text.

   The methods described in this document apply to both IPv4 and IPv6,
   and to many transport protocols such as TCP.   This document does not
   define a protocol, but rather a method to use features of existing
   protocols to discover the path MTU.  It does not require cooperation
   from the lower layers (except that they are consistent about what
   packet sizes are acceptable) or the far node.  Variants in
   implementations will not cause interoperability problems.

   The methods described in this document are carefully designed to
   maximize robustness in the presence of less than ideal
   implementations of other protocols or Internet components.

   For the sake of clarity we uniformly prefer TCP and IPv6
   terminology.  In the terminology section we also present the
   analogous IPv4 terms and concepts for the IPv6 terminology.  In a few
   situations we describe specific details that are different between
   IPv4 and IPv6.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC 2119].


2. Terminology

   IP          - Either IPv4 [IPv4-SPEC] or IPv6 [IPv6-SPEC].

   node        - A device that implements IP.

   router      - A node that forwards IP packets not explicitly
                 addressed to itself.

   host        - Any node that is not a router.

   upper layer - A protocol layer immediately above IP.  Examples are
                 transport protocols such as TCP and UDP, control
                 protocols such as ICMP, routing protocols such as OSPF,
                 and Internet or lower-layer protocols being "tunneled"
                 over (i.e., encapsulated in) IP such as IPX,
                 AppleTalk, or IP itself.

   link        - A communication facility or medium over which nodes can
                 communicate at the link layer, i.e., the layer
                 immediately below IP.  Examples are Ethernets (simple
                 or bridged); PPP links; X.25, Frame Relay, or ATM
                 networks; and Internet (or higher) layer "tunnels",
                 such as tunnels over IPv4 or IPv6.   In some earlier
                 documents the term "lower layer" was used for this
                 concept.

   interface   - A node's attachment to a link.

   address     - An IP-layer identifier for an interface or a set of
                 interfaces.

   packet      - An IP header plus payload.

   MTU         - Maximum Transmission Unit, the size in bytes of the
                 largest IP packet, including the IP header and payload,
                 that can be transmitted on a link or path.   Note that
                 this could more properly be called the IP MTU, to be
                 consistent with how other standards organizations use
                 the acronym MTU.

   link MTU    - The Maximum Transmission Unit, i.e., maximum IP packet
                 size in octets, that can be conveyed in one piece over
                 a link.    Beware that this definition differs from
                 the definition used by other standards organizations.

                 For IETF documents, link MTU is uniformly defined as
                 the IP MTU over the link.  This includes the IP header,
                 but excludes link layer headers and other framing which
                 is not part of IP or the IP payload.

                 Other standards organizations generally define link MTU
                 to include the link layer headers.

   path        - The set of links traversed by a packet between a source
                 node and a destination node.

   path MTU    - The minimum link MTU of all the links in a path between
                 a source node and a destination node.

   PMTU        - Path MTU

   classical PMTU discovery
               - Process described in RFC 1191 and RFC 1981, in which
                 nodes rely on ICMP messages to learn the MTU of a path.

   PL, packetization layer
               - The layer of the network stack which segments data into
                 packets.

   PLPMTUD     - Packetization Layer Path MTU Discovery, the method
                 described in this document, which is an extension to
                 classical PMTU discovery.

   Packet Too Big message
               - An ICMP message reporting that an IP packet is too
                 large to forward.  This is the IPv6 term that
                 corresponds to the IPv4 "ICMP Can't fragment" message.

   flow        - A context in which MTU discovery is applied.  This is
                 naturally an instance of the packetization protocol, e.g.
                 one side of a TCP connection.

   MPS         - The maximum IP payload size available over a specific
                 path.  This is typically the path MTU minus the size of
                 the IP header.  As an example, this is the maximum TCP
                 packet size, including TCP payload and headers but not
                 including IP headers.   This has also been called the
                 "L3 MTU".

   MSS         - The TCP Maximum Segment Size, the maximum payload
                 size available to the TCP layer.   This is typically the
                 path MPS minus the size of the TCP headers.

   probe packet- A packet which is being used to test for a larger MTU.

   probe size  - The size of a packet being used to probe for a larger MTU.

   successful probe
               - The probe packet was delivered through the network.

   inconclusive probe
               - The probe packet was not delivered, but other packets
                 were lost close enough to the probe that we cannot
                 presume that the probe was lost due to MTU.  By
                 implication the probe might have been lost due to
                 something other than MTU (such as congestion), so the
                 results are inconclusive.   Inconclusive probes are
                 generally repeated at the same probe size, after a
                 suitable delay.

   failed probe
               - The probe packet was not delivered and there were no other
                 lost packets close to the probe.   This is taken as an
                 indication that the probe was larger than the path MTU,
                 and future probes should generally be at smaller sizes.

   errored probe
               - There were losses or timeouts during the verification
                 phase which suggest a potentially disruptive failure
                 or network condition.   These are generally retried
                 only after substantially longer intervals.
                 @@@ not used

   probe gap   - The expected missing payload data that will need to be
                 retransmitted if the probe is not delivered.

   probe phase - The interval (time or protocol events) between when a
                 probe is sent and when it is determined that the probe
                 succeeded, failed or was inconclusive.

   verification phase - An additional interval during which the new path MTU
                 is considered provisional.   Packet losses or timeouts are
                 treated as an indication that there may be a problem with
                 the provisional MTU.

   transition phase - The interval between the probe phase and the
                 verification phase, during which packets using the new MTU
                 propagate to the far node and the acknowledgment propagates
                 back.

   full stop timeout - A timeout where none of the packets transmitted
                 after some specific event at the sender (e.g. entering the
                 probe or verification phase) is acknowledged by the receiver.
                 This is taken as an indication that the MTU change caused
                 some failure in the network.

   search strategy - The heuristics used to choose successive probe sizes
                 to converge to the proper path MTU, as described in
                 section 5.5.
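
   As an illustration of how the path MTU, MPS and MSS terms above
   relate to one another, the following Python sketch computes each of
   them from a list of link MTUs.  It assumes IPv4 and TCP headers
   without options and the fixed IPv6 header without extension headers.
   The function and constant names are ours, chosen for the example
   only, and are not part of any protocol.

```python
# Header sizes in bytes.  These are assumptions for the example: real
# packets may carry IP options, IPv6 extension headers or TCP options.
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def path_mtu(link_mtus):
    """Path MTU: the minimum link MTU of all the links in the path."""
    return min(link_mtus)


def mps(pmtu, ip_header):
    """Maximum IP payload size: the path MTU minus the IP header."""
    return pmtu - ip_header


def mss(path_mps, tcp_header=TCP_HEADER):
    """TCP MSS: the path MPS minus the size of the TCP headers."""
    return path_mps - tcp_header
```

   For example, for a path whose links have MTUs of 1500, 1492 and 9000
   bytes, path_mtu() gives 1492; over IPv4 the MPS is then 1472 and the
   MSS is 1452.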


3. Overview

   This document describes a method for TCP or other packetization
   protocols to dynamically discover the MTU of a path without relying
   on explicit signals from the network.  These procedures are
   applicable to TCP and other transport- or application-level
   packetization protocols in which the receiver always reports to the
   sender complete information about which packets were lost in the
   network.

   The general strategy of the new procedure is for the packetization
   layer to find the proper MTU by probing with progressively larger
   packets, without disrupting its normal protocol operation.  If a
   probe packet is successfully delivered, then the path MTU is
   provisionally raised.  If there are no additional losses during the
   subsequent verification phase, then the path MTU is confirmed
   (verified) to be at least as large as the provisional MTU.  PLPMTUD
   can then probe again with an even larger MTU, according to the MTU
   search strategy described in section 5.5.

   The verification phase is used to detect situations where raising the
   MTU greatly raises the packet loss rate.  For example, this might
   happen if some link is striped across multiple physical channels and
   the stripes have inconsistent MTUs.

   A conservative implementation of PLPMTUD would use a full round trip
   time for the verification phase.  In this case each time PLPMTUD
   raises the MTU it takes three full round trip times to do so.  It
   takes one round trip for the probe phase, during which the probe
   propagates to the far node and an acknowledgment is returned.   The
   second round trip is the transitional phase, during which data
   packets using the provisional MTU propagate to the far node and are
   acknowledged.  During the third and final round trip time, it is
   verified that raising the MTU does not cause excessive loss.

   The isolated loss of a probe packet (with or without a Packet Too Big
   message) is treated as an indication of an MTU limit, and not as a
   congestion indicator.  In this case alone, the packetization protocol
   is permitted to retransmit the probe gap without adjusting the
   congestion window.

   If there is a timeout or additional lost packets during any of the
   three phases, the loss is treated as a congestion indication as well
   as an indication of some sort of failure of the PLPMTUD process.  The
   congestion indication is treated like any other congestion
   indication: window or rate adjustments are mandatory per the relevant
   congestion control standards [Congestion].   Probing can resume with
   some new probe size after a delay which is determined by the nature
   of the indicated failure.

   The most likely (and least serious) PLPMTUD failure is the link
   experiencing legitimate congestion related losses at about the same
   time as the probe.   In this case, it is appropriate to retry the
   probe (with the same probe size) as soon as the packetization layer
   has fully adapted to the congestion and recovered from the losses.

   In other cases, additional losses or timeouts indicate problems with
   the link or packetization layer, and that probes may be disruptive.
   In these situations it is desirable to use progressively longer
   delays depending on the severity of the failure and if it is
   repeated.
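
   The idea of progressively longer delays can be sketched as follows.
   The event names are those used later in this document.  The base
   delays, the doubling on each repetition and the upper bound are
   purely illustrative assumptions made for this example; they are not
   recommended values.

```python
# Hypothetical base delays in seconds, ordered roughly by severity.
# The document names these events but the numbers here are invented.
BASE_DELAY = {
    "inconclusive_probe_event": 1.0,
    "probe_failure_event": 60.0,
    "verification_fail_event": 60.0,
    "probe_stop_event": 600.0,
}

# Hypothetical upper bound on any single delay, in seconds.
MAX_DELAY = 3600.0


def reprobe_delay(event, repeat_count):
    """Delay before re-probing after the repeat_count-th consecutive
    occurrence (counting from 1) of the given failure event.

    The delay doubles on every repetition and is capped at MAX_DELAY.
    """
    if repeat_count < 1:
        raise ValueError("repeat_count must be at least 1")
    delay = BASE_DELAY[event] * 2 ** (repeat_count - 1)
    return min(delay, MAX_DELAY)
```

   With these assumed values, a third consecutive
   verification_fail_event waits 240 seconds, while repeated
   probe_stop_events quickly reach the one hour cap.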

   PLPMTUD can optionally process Packet Too Big messages to select the
   provisional MTU for faster convergence in exchange for a slight
   decrease in robustness.  Processing malicious or erroneous Packet Too
   Big messages can cause PLPMTUD to arrive at the incorrect MTU for a
   path, which is likely to reduce protocol performance.  The document
   describes three options for processing Packet Too Big messages:
   completely ignore them, only accept them in response to probes or
   accept all Packet Too Big messages (fully implementing classic PMTUD
   within PLPMTUD).  These are further described in section 3.8.

   Relatively few details of this procedure affect interoperability with
   other standards or Internet protocols.  These details are specified
   in RFC2119 standards language in the requirements section.  The vast
   majority of the implementation details are recommendations based on
   experiences with earlier versions of path MTU discovery.  These are
   motivated by a desire to maximize robustness of PLPMTUD in the
   presence of less than ideal implementations as they exist in the
   field.


3.1. General Method


   Most of the difficulty in implementing PLPMTUD arises because it
   needs to be implemented in several different places within a single
   node.  In general each packetization protocol needs to have its own
   implementation of PLPMTUD.  Furthermore, the natural mechanism to
   share path MTU information between concurrent or subsequent
   connections over the same path is a path information cache in the IP
   layer.  The various packetization protocols need to have the means to
   access and update the shared cache in the IP layer.

   Rather than prescribing implementation details this memo describes
   PLPMTUD in terms of its primary subsystems, without fully describing
   how they are integrated into a complete implementation.  The
   subsystems are: generating probes, processing probe responses, the
   search strategy, and the supporting infrastructure (including the
   path cache).   The first two are introduced in this section and are
   subject to the requirements specified in the following section.   The
   search strategy and issues related to the support infrastructure and
   cache are covered in the implementation section.

3.2. Generating Probes

   A new candidate MTU is tested by sending one "probe packet", which is
   larger than the current MTU.  In this section we present a couple of
   possible ways to alter packetization layers to generate probe
   packets.   The different techniques incur different overheads in
   three areas: difficulty in generating the probe packet (in terms of
   packetization layer implementation complexity and computational
   overhead), possible additional network capacity consumed by the
   probes, and the overhead of recovering from failed probes (both
   network and protocol overhead).

   For example, a protocol such as SCTP might be extended to allow
   padding with dummy data inside the SCTP packets.  This greatly
   simplifies the implementation because the probing can be performed
   without participation from the application and if the probe fails,
   the missing data (the "probe gap") is assured to fit within the
   current MTU when it is retransmitted.  However, the padding does
   consume network capacity without carrying any useful payload.

   This technique does not work for TCP, because there is not a separate
   length field or other mechanism to differentiate between padding and
   real payload data.  With TCP the natural approach is to send
   additional payload data in an over-sized segment.   There are several
   variants which have different tradeoffs.

   In one method, after a TCP probe segment has been sent the subsequent
   segment(s) may be sent as though the probe segment was not over-
   sized.  Thus if the probe segment is lost, it will leave a gap in the
   sequence space that is exactly the correct size to be filled by one
   segment at the current MTU.   Since this method generates overlapping
   data, it will cause duplicate acknowledgments if the probe is
   successfully delivered.  The sender must be capable of ignoring these
   expected duplicate acknowledgments in a manner which will not cause
   unnecessary retransmission or congestion window reduction.
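
   The sequence space arithmetic of this first method can be
   illustrated with a short Python sketch.  It is a simplification
   under stated assumptions: sequence numbers are plain byte offsets
   that do not wrap, ranges are half-open (start, end) pairs, and the
   function name and its parameters are hypothetical.

```python
def overlapping_probe_plan(seq, cur_mss, probe_payload, followers=3):
    """Plan an over-sized TCP probe followed by overlapping segments.

    Returns (probe, following, gap_if_lost), each expressed as
    half-open (start, end) byte ranges in sequence space.
    """
    if probe_payload <= cur_mss:
        raise ValueError("a probe must be larger than the current MSS")
    # The probe carries probe_payload bytes starting at seq.
    probe = (seq, seq + probe_payload)
    # Later segments are sent as though the probe had carried only
    # cur_mss bytes, so the first of them overlaps the probe's tail.
    following = [
        (seq + i * cur_mss, seq + (i + 1) * cur_mss)
        for i in range(1, followers + 1)
    ]
    # If the probe is lost, the hole is exactly one current-size segment.
    gap_if_lost = (seq, seq + cur_mss)
    return probe, following, gap_if_lost
```

   For a probe of 1400 bytes sent at sequence number 1000 with a
   current MSS of 1000, the probe covers bytes 1000 to 2399 and the
   next segment starts at 2000, so bytes 2000 to 2399 are sent twice.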

   In the second method, after a TCP probe segment has been sent,
   subsequent TCP segments are sent in a non-overlapping manner.  If the
   probe segment is lost, it will leave a gap which will require
   retransmission of multiple segments to fill.  This method has lower
   overhead for successful probes, but it requires more complexity in
   the retransmit logic to correctly retransmit the missing data with
   multiple segments that fit into the old MTU, while properly
   suppressing the congestion adjustments for this one situation and no
   others.
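
   For the second method, the extra retransmit logic amounts to
   splitting the probe gap into segments that fit the old MTU.  The
   following Python sketch shows only that splitting step, under the
   same simplifying assumptions as before (byte offsets that do not
   wrap, half-open ranges, a hypothetical function name).  It does not
   model the suppression of the congestion adjustment.

```python
def split_probe_gap(seq, probe_payload, cur_mss):
    """Split a lost probe's payload into retransmittable segments.

    Returns a list of half-open (start, end) byte ranges, each at most
    cur_mss bytes long, that together cover the whole probe gap.
    """
    if cur_mss <= 0 or probe_payload <= 0:
        raise ValueError("sizes must be positive")
    segments = []
    start = seq
    end = seq + probe_payload
    while start < end:
        stop = min(start + cur_mss, end)
        segments.append((start, stop))
        start = stop
    return segments
```

   A lost 1400 byte probe sent with a current MSS of 1000 therefore
   needs two retransmitted segments, one of 1000 bytes and one of 400
   bytes.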

   Under all conditions it is important that the packetization layer
   sends additional data (packets) after the probe, such that Fast
   Retransmit or equivalent algorithms in the packetization layer will
   trigger the retransmission of the probe if it is lost in the network.

3.3. Normal sequence of events to raise the MTU


   If the probe size is smaller than the path MTU and there are no other
   losses, the normal sequence of events will be:

 Step 1) The probe is sent, followed by more packets at the current MTU.
   By definition PLPMTUD enters the probe phase.   The probe propagates
   through the network and the far node acknowledges it (or possibly
   later data, if ACKs are cumulative and delayed ACK is in effect).

 Step 2) The ACK for the probe reaches the data sender.   By definition,
   this ends the probe phase.

 Step 3) The packetization layer provisionally raises the MTU to the
   probe size.  PLPMTUD enters the transitional phase when it starts
   sending data using the provisional MTU.

   Note that implementations that use packet counts for congestion
   accounting (e.g. keep cwnd in units of packets) must rescale their
   congestion accounting such that raising the MTU does not raise the
   total congestion window or data rate.

   If the implementation packetizes the data at the application API, it
   may transmit already queued data at the current MTU before raising
   the MTU.  In this case this data is not part of either the probing or
   transition phases, because all of the packets in flight fit within
   the current MTU.

 Step 4) Once the first packet of the transitional phase is
   acknowledged, PLPMTUD enters the verification phase to determine if
   raising the MTU causes packet loss.   In principle the verification
   phase can be of arbitrary duration, however at this time we are
   recommending one full window of data (i.e., one full round trip
   time).

 Step 5) Once sufficient data has been delivered and acknowledged, the
   provisional MTU is considered verified and the path MTU is
   updated.   PLPMTUD can then probe for an even larger MTU, as
   described in the searching strategy in section 5.5.
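
   The steps above can be summarized as a small state machine.  The
   Python sketch below is one possible reading of them, not a complete
   implementation: it covers only the loss-free sequence, it merges
   steps 2 and 3 into a single transition, and it rescales a
   packet-counted congestion window using the ratio of the MTUs rather
   than of the payload sizes.  All class, method and attribute names
   are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    IDLE = "idle"
    PROBE = "probe"
    TRANSITION = "transition"
    VERIFICATION = "verification"


class Plpmtud:
    """Loss-free sequence of events that raises the MTU (steps 1 to 5)."""

    def __init__(self, mtu, cwnd_packets):
        self.mtu = mtu                    # verified path MTU
        self.provisional_mtu = None       # set once a probe is ACKed
        self.probe_size = None
        self.cwnd_packets = cwnd_packets  # cwnd kept in units of packets
        self.phase = Phase.IDLE

    def _expect(self, phase):
        if self.phase is not phase:
            raise RuntimeError("unexpected event in phase " + self.phase.value)

    def send_probe(self, probe_size):
        """Step 1: a probe larger than the current MTU is sent."""
        self._expect(Phase.IDLE)
        if probe_size <= self.mtu:
            raise ValueError("probe must be larger than the current MTU")
        self.probe_size = probe_size
        self.phase = Phase.PROBE

    def probe_acked(self):
        """Steps 2 and 3: the probe is ACKed, the MTU is raised provisionally."""
        self._expect(Phase.PROBE)
        self.provisional_mtu = self.probe_size
        # Rescale so that raising the MTU does not raise the window in bytes.
        scaled = self.cwnd_packets * self.mtu // self.provisional_mtu
        self.cwnd_packets = max(1, scaled)
        self.phase = Phase.TRANSITION

    def first_provisional_packet_acked(self):
        """Step 4: the first packet at the provisional MTU is ACKed."""
        self._expect(Phase.TRANSITION)
        self.phase = Phase.VERIFICATION

    def verification_complete(self):
        """Step 5: enough data was delivered; the path MTU is updated."""
        self._expect(Phase.VERIFICATION)
        self.mtu = self.provisional_mtu
        self.provisional_mtu = None
        self.probe_size = None
        self.phase = Phase.IDLE
```

   Starting from an MTU of 1500 and a window of 10 packets, a
   successful probe of 3000 bytes leaves the window at 5 packets, so
   the amount of data in flight is unchanged.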


3.4. Processing MTU Indications


   Other events described below are treated as exceptions and alter or
   cancel some of the steps above.


3.4.1. Processing Packet Too Big Messages


   Classical PMTU discovery specifies the generation of Packet Too Big
   Messages if an over-sized packet (e.g. a probe) encounters a link
   that has a smaller MTU.  Since these messages can not be
   authenticated they introduce a number of well documented denial of
   service attacks against classical PMTUD [DOS].

   In PLPMTUD these messages are not required for correct operation, so
   in principle they can be summarily ignored at the expense of slower
   convergence to the proper MTU.   However we believe that a slightly
   better compromise is to process Packet Too Big messages in two
   specific contexts: in conjunction with a PLPMTUD probe or a
   retransmission timeout in the packetization layer (indicating a
   re-route to a link with a smaller MTU).

   Every Packet Too Big Message should be subjected to the following
   checks:

 o If globally forbidden then discard the message.

 o If forbidden by the application then discard the message.

 o If this path has been tagged "bogus ICMP messages" then discard the
   message.

 o If the reported MTU fails consistency checks then set the "bogus ICMP
   messages" flag for this path and discard the message.  These
   consistency checks include: unrecognized or unparseable enclosed
   header, reported MTU is larger than the size indicated by the
   enclosed header or larger than the current MTU, provisional MTU or
   probe size as appropriate.

 o If the Packet Too Big Message is acceptable under all of these
   checks, save the "ICMP MTU", pending another packetization layer
   protocol event.
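
   The checks above can be expressed as a single function.  The Python
   sketch below is a minimal illustration, assuming that the caller has
   already parsed the message and supplies the reported MTU and the
   size indicated by the enclosed header (or None if that header was
   unrecognized or unparseable).  The PathState record and all of its
   field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathState:
    # Current MTU, provisional MTU or probe size, as appropriate.
    limit_mtu: int
    bogus_icmp: bool = False        # the "bogus ICMP messages" flag
    icmp_mtu: Optional[int] = None  # saved "ICMP MTU", if any


def process_packet_too_big(path, reported_mtu, enclosed_size,
                           globally_forbidden=False,
                           app_forbidden=False):
    """Apply the checks in order; return True if the message is accepted."""
    if globally_forbidden or app_forbidden:
        return False
    if path.bogus_icmp:
        return False
    inconsistent = (
        enclosed_size is None              # unrecognized or unparseable
        or reported_mtu > enclosed_size    # larger than the packet itself
        or reported_mtu > path.limit_mtu   # larger than what was sent
    )
    if inconsistent:
        path.bogus_icmp = True
        return False
    # Acceptable: save it pending another packetization layer event.
    path.icmp_mtu = reported_mtu
    return True
```

   Note that a single inconsistent message causes every later message
   for the same path to be discarded, even one that would otherwise
   have passed the checks.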

3.4.2. Packetization Layer retransmits lost packets


   Each packetization protocol has its own mechanism to detect lost
   packets and request the retransmission of missing data.  The primary
   signals used by the packetization layer are these protocol specific
   loss indications.   In all cases the packetization layer is
   responsible for retransmitting the lost data and notifying PLPMTUD
   that there was a loss.

 o If the probe itself was lost, and there were no other losses during
   the probe phase (the RTT between when the probe was sent and when
   the loss was detected), then it is taken as an indication that the
   path MTU is smaller than the probe size.  In this situation alone the
   Packetization Layer is permitted to retransmit the missing data (the
   "probe gap") without adjusting its congestion window or data
   transmission rate.

   If an accepted Packet Too Big Message was received after the probe
   was sent, and it passes the additional checks that the ICMP MTU is
   greater than the current MTU, then set the provisional MTU to the
   ICMP MTU and proceed from step 3 in section 3.3 above.

   If there was not an accepted Packet Too Big Message, then the
   indicated event is a "probe failure", which can be retried with a
   smaller probe size after a suitable delay for a probe_failure_event.
   See section 3.7 for more complete descriptions of failure events.

 o If there are losses during the probe phase and the probe was not
   lost, then the probe was successful.  However, since additional
   losses have the potential to spoil the verification phase, it is
   important that PLPMTUD not progress into the transition phase (step 3
   above) until after the Packetization Layer has fully recovered from
   the losses and completed the congestion window (or rate) adjustment.

 o If there are losses during the probe phase and the probe was also
   lost, the outcome depends on the presence of an ICMP MTU set by an
   acceptable Packet Too Big Message.

   If there was an accepted Packet Too Big Message received since the
   probe was sent, and it passes the additional checks that the ICMP MTU
   is greater than the current MTU, then set the provisional MTU to the
   ICMP MTU, and once the Packetization Layer has fully recovered from
   the losses and completed the congestion window (or rate) adjustment
   then proceed to step 3 in section 3.3 above.

   If there was not an accepted Packet Too Big Message, then the probe
   is inconclusive because the lost probe might have been caused by
   congestion.   The probe can be retried after a suitable delay for an
   inconclusive_probe_event.

 o Losses during the transition phase do not receive special treatment.

 o Losses during the verification phase are taken as an indication that
   the path may have a non-uniform MTU or some other problems such that
   raising the MTU substantially raises the loss rate.  If so, this is
   potentially a very serious problem, so the provisional MTU is
   considered to have failed the verification phase and the path MTU is
   set back to the previously verified MTU (previously the current
   MTU).

   Packet loss during the verification phase might also be due to
   coincidental congestion on the path, unrelated to the probe, so it
   would seem to be desirable that PLPMTUD re-probes the path.  The risk
   is that this effectively raises the tolerated loss threshold because
   even though raising the MTU causes additional loss, there is a
   statistical chance that repeated attempts to verify the new MTU may
   yield a false pass.    The compromise is to re-probe once with the
   same probe size (after the delay for an inconclusive_probe_event),
   and if this also fails, then the probe may not be retried until after
   a suitable delay for a verification_fail_event, which exponentially
   increases on each successive failure.

   Losses during the verification phase may indicate that a Packet Too
   Big Message reported the incorrect ICMP MTU, so if the provisional
   MTU was updated from the ICMP MTU (which was from an earlier Packet
   Too Big Message), set the "bogus ICMP message" flag for this path.
   This will prevent PLPMTUD from processing further "Packet Too Big
   Messages" for this path.   If the provisional MTU was correct, the
   re-probe above will correctly use it.   If it was not correct, then
   by definition the path reported at least one incorrect  "Packet Too
   Big Message", and should not process any additional messages.
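
   The handling of losses during the probe phase reduces to a decision
   over three inputs: whether the probe was lost, whether anything else
   was lost, and whether a usable ICMP MTU was saved.  The Python
   sketch below encodes that decision.  The names are hypothetical, and
   treating an ICMP MTU that is not greater than the current MTU as if
   no message had been accepted is our assumption, since the text does
   not cover that case explicitly.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    SUCCESS_AFTER_RECOVERY = "success, wait for loss recovery"
    USE_ICMP_MTU = "provisional MTU = ICMP MTU"
    USE_ICMP_MTU_AFTER_RECOVERY = "provisional MTU = ICMP MTU, after recovery"
    FAILED = "probe failure"
    INCONCLUSIVE = "inconclusive probe"


def classify_probe(probe_lost, other_losses, icmp_mtu, current_mtu):
    """Classify the end of a probe phase.

    icmp_mtu is the MTU saved from an accepted Packet Too Big Message
    received since the probe was sent, or None if there was none.
    """
    # Additional check: the ICMP MTU must exceed the current MTU.
    icmp_usable = icmp_mtu is not None and icmp_mtu > current_mtu
    if not probe_lost:
        if other_losses:
            return Outcome.SUCCESS_AFTER_RECOVERY
        return Outcome.SUCCESS
    if icmp_usable:
        if other_losses:
            return Outcome.USE_ICMP_MTU_AFTER_RECOVERY
        return Outcome.USE_ICMP_MTU
    if other_losses:
        # The probe may have been lost to congestion rather than MTU.
        return Outcome.INCONCLUSIVE
    # Isolated loss of the probe: retransmit the probe gap without a
    # congestion adjustment and retry later with a smaller probe.
    return Outcome.FAILED
```

   Only the FAILED and USE_ICMP_MTU outcomes correspond to the isolated
   loss of the probe, which is the one case in which the probe gap may
   be retransmitted without a congestion window adjustment.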

3.4.3. Packetization Layer Retransmission Timeout


   If there is a retransmission timeout during the probe or verification
   phase it may be an indication of a serious problem with the path or
   the Packetization Layer.  We first define the notion of a "full stop
   timeout" to be a timeout where none of the packets transmitted after
   some specific event at the sender (e.g. entering the probe or
   verification phase) is acknowledged by the receiver.  If a
   retransmission timeout is not a full stop timeout, it is processed
   as a loss as described above, except that longer delays are used
   before re-probing (probe_timeout_event, verification_timeout_event).
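
   The distinction between an ordinary timeout and a full stop timeout
   can be sketched as below.  This is a minimal illustration, assuming
   the sender can list the packets it transmitted after the event of
   interest and knows which of them were acknowledged; the function
   name, the packet identifiers and the returned labels are
   hypothetical.

```python
def classify_timeout(sent_after_event, acked):
    """Classify a retransmission timeout.

    sent_after_event: identifiers of packets transmitted after the
        event of interest (e.g. entering the probe phase).
    acked: set of identifiers acknowledged by the receiver.
    Returns "full_stop" if none of those packets was acknowledged,
    otherwise "timeout".
    """
    sent = list(sent_after_event)
    if not sent:
        raise ValueError("no packets were sent after the event")
    if any(packet_id in acked for packet_id in sent):
        return "timeout"
    return "full_stop"
```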

   If there is a full stop timeout following a probe then it is taken as
   an indication that probing may be disruptive to either the network or
   the far node (e.g. it triggered a bug halt due to a buffer overrun,
   etc).  The probe should not be retried until after a long delay, for
   a probe_stop_event.  Note that this makes it particularly important
   that probes are only sent when the sender can send sufficient
   additional data to assure the correct operation of Fast Retransmit
   and similar algorithms in the Packetization Layer.

   If there is a full stop timeout when the path MTU is raised to the
   provisional MTU and the provisional MTU was updated from the ICMP
   MTU, then it is assumed that the MTU reported in the Packet Too Big
   Message was incorrect.  Set the "bogus ICMP message" flag for this
   path and re-probe with a smaller probe size after a suitable delay
   for an ICMP_fail_event.

   If there is a full stop timeout when the path MTU is raised to the
   provisional MTU and the provisional MTU is the same as the probe size
   (because the probe packet was not lost), then something truly
   unexpected happened.  It is possible that the timeout is unrelated
   to the probe, so in this situation re-probe with a smaller probe size
   after a suitable delay for a verification_stop_event.

3.5. Probing Intervals


   Section 3.4 above describes a number of probe failure events.  In
   all cases the basic response is the same: to wait some time interval
   (dependent on the specific event and possibly the history) and then
   to probe again.  For events that are "inconclusive", it is generally
   appropriate to re-probe with the same probe size.  For events that
   are identified as "failed probes", it is generally appropriate to
   re-probe with a smaller probe size.  The search strategy described
   in section 5.5 is used to select probe sizes.

   Many of the intervals below are specified in terms of elapsed round
   trips relative to the current congestion window.  This is because
   TCP and other Packetization Layer protocols tend to exhibit periodic
   losses which cause periodic variations of the congestion window and
   possibly the data rate.  It is preferable that the PLPMTUD probes are
   scheduled near the low point of these cycles.

   In order from least to most serious:

   inconclusive_probe_event - Packet loss near the lost probe marked the
   result as ambiguous.  Since the loss of a non-probe packet causes a
   window (or data rate) reduction, it is desirable to schedule the
   re-probe (of the same probe size) a few round trips after the end of
   the recovery.

   ICMP_fail_event - Since this is detected by a timeout, it is first
   desirable for the packetization protocol to come back into
   equilibrium with the network (for TCP, this generally means
   recovering its self clock by completing slowstart up to one half of
   the old congestion window) before probing again with a smaller probe
   segment.

   @@@ TODO finish

   probe_failure_event -

   verification_fail_event -

   verification_timeout_event -

   probe_timeout_event -

   verification_stop_event -

3.6. Interoperation with prior algorithms


   To cache or not.  To ICMP or not -

   Three choices for processing packet too big: ignore all, accept all
   or only on probes.

   Three choices for starting size: cached, small, or interface


4. Requirements

   All Internet nodes SHOULD implement PLPMTUD in order to discover and
   take advantage of the largest MTU supported along the Internet path.

   Links MUST NOT deliver packets that are larger than their MTU.  Links
   that have parametric limitations (e.g. MTU bounds due to limited
   clock stability) MUST include explicit mechanisms to consistently
   reject packets that might otherwise be nondeterministically
   delivered.

   The requirements below only apply to those implementations that
   include PLPMTUD.

   If the IP protocol is IPv4, the DF bit MUST be set.

   A packetization protocol MUST use a loss reporting mechanism (e.g.
   SACK) which avoids spurious retransmission of any other data when a
   probe packet is lost.

   A Packetization Layer SHOULD NOT send a probe packet unless the flow
   is expected to have at least the 3 round trips worth of data needed
   to successfully complete the probing and verification process.

   A Packetization Layer MUST NOT send a probe unless it has sufficient
   data available to send such that a lost packet will trigger Fast
   Retransmit or a similar algorithm.

   Failed and inconclusive probes MUST NOT be sent more frequently than
   the normal congestion interval for the current average window size.
   @@@@ too TCP specific

   During the probe, the normal congestion control machinery MUST remain
   in effect except when only the probe gap is detected as lost.  In
   this case the normal multiplicative congestion window reduction is
   suppressed.  If any other lost data is detected, all normal
   congestion control MUST take place.

   If the probe is successful, the current MPS is updated to the
   candidate MPS.  If window and other congestion state variables are
   kept in units of packets, they MUST be rescaled to preserve the
   current window size in bytes.   @@ move
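
   A minimal sketch of the rescaling step, assuming a congestion window
   kept in units of packets.  The function name is hypothetical, and
   rounding down with a floor of one packet is an assumption made here
   so that the rescaled window never exceeds the original window in
   bytes.

```python
def rescale_cwnd_packets(cwnd_packets, old_mps, new_mps):
    """Convert a window counted in old_mps-sized packets into a window
    counted in new_mps-sized packets, preserving its size in bytes as
    closely as possible without exceeding it."""
    if old_mps <= 0 or new_mps <= 0:
        raise ValueError("MPS values must be positive")
    window_bytes = cwnd_packets * old_mps
    # Round down so the window in bytes does not grow; keep at least
    # one packet so the sender can still make progress (assumption).
    return max(1, window_bytes // new_mps)
```

   For example, a window of 20 packets at an MPS of 1460 bytes becomes
   10 packets when the MPS is raised to 2920 bytes.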


5. Implementation Issues


   This section discusses a number of issues related to the
   implementation of Path MTU Discovery.  This is not a specification,
   but rather a set of notes provided as an aid for implementers.

   The issues include:

   - What layer or layers implement Path MTU Discovery?

   - Accounting for headers

   - How is the PMTU information cached?

   - How are ICMP messages processed?

   - How to implement PMTUD with TCP?

   - What should other transport and higher layers do?

   - What should tunnels above IP do?


5.1. Layering and Accounting for Header Sizes

   Packetization Layer Path MTU Discovery is most easily implemented by
   splitting its functions between layers.  The IP layer is in the best
   place to keep shared state, collect the ICMP messages, track IP
   header sizes and manage MTU information from the link layer
   interfaces.  However the procedures that PLPMTUD uses for probing,
   verification and scanning for the path MTU are very tightly coupled
   to the data recovery and congestion control state machines in the
   Packetization Layer.  The most difficult part of implementing
   PLPMTUD is properly splitting the implementation between the layers.

   Note that this layering is consistent with the advice in the current
   PMTUD specifications [RFC1191, RFC1981].  Today, many implementations
   of classical PMTU Discovery are already split along these same
   layers.

   Early implementation of PLPMTUD revealed that it is critically
   important to have a good clean mechanism for accounting for header
   sizes at all layers.  This is because the Packetization Layer does
   its calculations in its own natural data units, which are almost
   always a reflection of the service that the Packetization Layer
   provides to the application or other upper layers.  For example, TCP
   naturally performs all of its calculations in terms of sequence
   numbers and segment sizes, and the probe gap is the segment that was
   carried by the probe packet.  However, the probe size, ICMP MTU, etc.
   are measures of full packets, which not only include the TCP data
   and fixed IP and TCP headers, but may also include IP extension
   headers or options, TCP options and even IPsec AH or ESP headers.

   PLPMTUD requires frequent bidirectional translation between these two
   domains: the Packetization Layer's natural data unit and full IP
   packet sizes.  While there are a number of possible ways to
   accurately implement this duality of size measures, our experience
   has been that it is best if the IP layer and the Packetization Layer
   communicate across their boundary in terms of the IP Maximum Payload
   Size or MPS.  The MPS is the only size measure that is common to both
   the IP and Packetization Layers, because it exactly matches the
   boundary between the layers.  The IP Layer is responsible for adding
   or deducting its own headers when translating between MTU and MPS.
   Likewise the Packetization Layer is responsible for adding or
   deducting its own headers when performing calculations in its own
   natural data units.
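
   The division of responsibility described above can be sketched as a
   pair of translations on each side of the MPS boundary.  The function
   names are hypothetical.  The header sizes used in the example are
   the option-free IPv4 header (20 bytes), the fixed IPv6 header (40
   bytes) and the option-free TCP header (20 bytes); a real
   implementation would also account for options, extension headers
   and any IPsec headers.

```python
IPV4_HEADER_LEN = 20   # IPv4 header without options
IPV6_HEADER_LEN = 40   # fixed IPv6 header, no extension headers
TCP_HEADER_LEN = 20    # TCP header without options


def mtu_to_mps(mtu, ip_header_len):
    """IP layer: deduct its own headers to get the payload size."""
    return mtu - ip_header_len


def mps_to_mtu(mps, ip_header_len):
    """IP layer: add its own headers to get the full packet size."""
    return mps + ip_header_len


def mps_to_segment_size(mps, pl_header_len):
    """Packetization Layer: deduct its own headers to get the size
    in its natural data unit (for TCP, the segment size)."""
    return mps - pl_header_len


def segment_size_to_mps(segment_size, pl_header_len):
    """Packetization Layer: add its own headers back."""
    return segment_size + pl_header_len
```

   For a 1500 byte MTU over IPv4 this gives an MPS of 1480 bytes and,
   for TCP without options, a segment size of 1460 bytes.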

   This document does not take a stance on the placement of IPsec, which
   logically sits between IP and the Packetization Layer.  IPsec can be
   treated either as part of IP or as part of the Packetization Layer,
   as long as the accounting is consistent within any given
   implementation.  If IPsec is treated as part of the IP layer, then
   each security association to a remote node may need to be treated as
   a separate flow if they have different length security headers.  If
   IPsec is treated as part of the packetization layer, the IPsec header
   size has to be included in the Packetization Layer's header size
   calculations.

5.2. Storing PMTU information

   This memo uses the concept of a "flow" to define the scope in which
   path MTU information is used.  Each flow locally stores its maximum
   payload size (MPS), which is used for packetizing data.  Flows may
   communicate with the IP layer to store or access cached PMTU values,
   providing a means by which similar flows may share information.  To
   do so, the flow must convert between these two values by adding or
   subtracting the size of the IP header plus any additional
   intermediate headers.  The IP layer also stores PMTU information from
   the ICMP layer when it receives Packet Too Big messages.

   Ideally, a PMTU value should be associated with a specific path
   traversed by packets exchanged between the source and destination
   nodes.  However, in most cases a node will not have enough
   information to completely and accurately identify such a path.
   Rather, a node must associate a PMTU value with some local
   representation of a path.  It is left to the implementation to select
   the local representation of a path.

   An implementation could use the destination address as the local
   representation of a path.  The PMTU value associated with a
   destination would be the minimum PMTU learned across the set of all
   paths in use to that destination.  The set of paths in use to a
   particular destination is expected to be small, in many cases
   consisting of a single path.  This approach will result in the use of
   optimally sized packets on a per-destination basis.  This approach
   integrates nicely with the conceptual model of a host as described in
   [ND]: a PMTU value could be stored with the corresponding entry in
   the destination cache.  However, NAT and other forms of middle boxes
   may exhibit differing MTUs at a single IP address.
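
   A per-destination cache of the kind described above might look like
   the following sketch.  The class and method names are hypothetical,
   and aging of stale entries is deliberately left out to keep the
   example short.

```python
class DestinationPmtuCache:
    """Stores, for each destination address, the minimum PMTU learned
    across all paths in use to that destination."""

    def __init__(self):
        self._pmtu = {}

    def update(self, destination, pmtu):
        """Record a learned PMTU; keep the smaller of old and new."""
        current = self._pmtu.get(destination)
        if current is None or pmtu < current:
            self._pmtu[destination] = pmtu
        return self._pmtu[destination]

    def lookup(self, destination, default):
        """Return the cached PMTU, or default if none is cached."""
        return self._pmtu.get(destination, default)
```

   Because only the minimum is kept, a larger value learned later does
   not raise the cached PMTU in this sketch; raising it again would be
   the job of the probing process or of cache aging.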

   If IPv6 flows [IPv6-SPEC] are in use, an implementation could use the
   IPv6 flow id as the local representation of a path.  Packets sent to
   a particular destination but belonging to different flows may use
   different paths, with the choice of path depending on the flow id.
   This approach will result in the use of optimally sized packets on a
   per-flow basis, providing finer granularity than PMTU values
   maintained on a per-destination basis.

   For source routed packets (i.e. packets containing an IPv6 Routing
   header, or IPv4 LSRR or SSRR options), the source route may further
   qualify the local representation of a path.    An implementation
   could use source route information in the local representation of a
   path.

   If IPsec is in use, the security association can also be used to
   represent paths.

5.3. Host fragmentation

   Packetization layers are encouraged to avoid sending messages that
   will require fragmentation (for the case against fragmentation, see
   [FRAG]).  However this is not always possible.  Some packetization
   layers, such as a UDP application outside the kernel, may be unable
   to change the size of the messages they send.  This may result in
   packet sizes that exceed the Path MTU.

   IPv4 permitted such applications to send packets without DF set.
   These packets would be fragmented in the IP layer in the host or
   fragmented by the network.  This approach is no longer recommended.

   We recommend that IPv4 implementations use a new strategy to mimic
   IPv6 functionality.  When the application sends packets that are too
   large for the path, they should be fragmented in the host IP layer.
   However, the DF bit should be set on the fragments, so they will not
   be fragmented again in the network.  Note that in principle the IP
   fragmentation layer is an example of a Packetization Layer, so it
   could implement full PLPMTUD in the fragmentation process.
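
   The host fragmentation strategy above can be illustrated by a small
   planning function.  This is a sketch under stated assumptions: the
   function name and the returned dictionary keys are hypothetical, the
   IPv4 header is assumed to carry no options (20 bytes), and only the
   fragment layout is computed, not the packets themselves.  Fragment
   data other than the last must be a multiple of 8 bytes because the
   IPv4 fragment offset is expressed in 8-byte units.

```python
def fragment_plan(payload_len, path_mtu, ip_header_len=20):
    """Split an IP payload of payload_len bytes into fragments that
    each fit within path_mtu, with DF set on every fragment."""
    # Largest per-fragment data size that is a multiple of 8 bytes.
    max_data = (path_mtu - ip_header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("path MTU too small for any fragment data")
    plan = []
    offset = 0
    while offset < payload_len:
        length = min(max_data, payload_len - offset)
        plan.append({
            "offset": offset,                        # byte offset
            "length": length,                        # fragment data
            "mf": offset + length < payload_len,     # More Fragments
            "df": True,                              # Don't Fragment
        })
        offset += length
    return plan
```

   For a 3000 byte payload and a 1500 byte path MTU, the plan contains
   fragments of 1480, 1480 and 40 bytes.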

   At least one major operating system already uses this strategy.

5.4. Multicast

   In the case of a multicast destination address, copies of a packet
   may traverse many different paths to reach many different nodes.  The
   local representation of the "path" to a multicast destination must in
   fact represent a potentially large set of paths.

   Minimally, an implementation could maintain a single MPS value to be
   used for all packets originated from the node.  This MPS value would
   be the minimum MPS learned across the set of all paths in use by the
   node.  This approach is likely to result in the use of smaller
   packets than is necessary for many paths.

   Alternatively, if the application using multicast gets complete
   delivery reports (unlikely because this requirement has poor scaling
   properties), PLPMTUD could be implemented in multicast protocols.

5.5. Path MTU Search Strategy

   The probe strategy described here is a recommended baseline
   algorithm.  It is not presented in formal standards language because
   the probe strategy can include heuristics to help select an optimal
   MSS for a given path.  As a consequence there is opportunity for
   future improvements to this algorithm.

   The probing strategy has three major states: Search, Monitor and
   Suspend.  In the Search state, it sequentially searches for the
   largest MSS that the path can support.  Once the appropriate MPS has
   been discovered, the probing algorithm enters the Monitor state where
   it probes infrequently to detect if the path MPS has become larger.

   If the MPS probing persistently fails it may be desirable to suspend
   MPS probing and heuristically select one of the common default MSSs:
   576, 1240, or 1460 Bytes.


5.5.1. Search

   @@@ this entire section still needs to be rewritten @@@

   The recommended search strategy is a multi-phase scan: First, a
   coarse scan for the approximate MTU using factor of 2 steps starting
   at 1024 Bytes until a probe fails, followed by successively finer
   scans between the largest previously successful and unsuccessful
   probes.  The TCP should use its best knowledge of the lower@@ layer
   header sizes to appropriately determine the MPS from the MTUs listed
   in the table below.

          Table 1: Recommended MTU scanning sequence
          (Coarse scan down column 1, fine scan across each row)
          512, [Use only after repeated timeouts]
          1024,  1492, 1500, 2002
          2048
          4096, 4352
          8192, 9000
          16384, 17914
          32768
          64512
          ((Additional values needed))
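
   One possible reading of Table 1 is sketched below: a coarse scan
   down the first column until a probe fails, then a fine scan across
   the row of the last successful coarse probe.  This is an
   interpretation, not a definitive algorithm.  The probe_ok callback
   is hypothetical and stands in for the whole probe and verification
   process; the 512 byte row is omitted because the table reserves it
   for use after repeated timeouts.

```python
# Rows of Table 1 (without the 512 byte row).
SCAN_TABLE = [
    [1024, 1492, 1500, 2002],
    [2048],
    [4096, 4352],
    [8192, 9000],
    [16384, 17914],
    [32768],
    [64512],
]


def scan_for_mtu(probe_ok):
    """Return the largest MTU in the scan sequence for which
    probe_ok(mtu) is true, or None if even the first probe fails."""
    best = None
    last_good_row = None
    # Coarse scan: first column, stop at the first failed probe.
    for row in SCAN_TABLE:
        if not probe_ok(row[0]):
            break
        best = row[0]
        last_good_row = row
    if last_good_row is None:
        return None
    # Fine scan: remaining entries in the last successful row.
    for mtu in last_good_row[1:]:
        if not probe_ok(mtu):
            break
        best = mtu
    return best
```

   For example, with a callback that succeeds for sizes up to 1500
   bytes, the scan probes 1024, 2048 (fails), 1492, 1500 and 2002
   (fails), and settles on 1500.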

   During the scan it is recommended that the MPS not be raised if cwnd
   is too small as determined by a heuristic.  The recommended heuristic
   is that the MPS is only raised when the cwnd is larger than 20
   segments.  @@@ This may be too high.

5.5.2. Monitor

   Once the scan has found an appropriate MPS, the probe strategy enters
   the Monitor state, where it re-probes the most recent failed MTU
   once every MONITOR_INTERVAL seconds.  If the probe fails, it remains
   in the Monitor state.  If it succeeds, it re-enters the Search state.

   If the network becomes too congested during either the Search or the
   Monitor states, it is recommended that the MPS be reduced to a
   smaller size as determined by a heuristic.  The recommended heuristic
   is to reduce the MSS if ssthresh is reduced to 5 segments or smaller.
   The recommended reduction is to the next smaller coarse step in Table
   1.

   When there are repeated timeouts (MAX_TIMO or more retransmissions,
   without any received ACKs), it is presumed that the connection was
   re-routed onto a link with a smaller MSS, and that ICMP messages are
   not being delivered.  The MSS probing algorithm is reset by pulling
   back the MSS to 1024 Bytes, rescaling the congestion control
   variables and reentering the Search state.

5.5.3. Suspend

   If there is a timeout, and cwnd prior to the timeout was smaller than
   6 packets, then the probe strategy can enter the Suspend state and
   set the MSS to 512 or 1240 Bytes.  This has the effect of reducing
   the minimum data rate that TCP can stably manage.
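
   The transitions between the Search, Monitor and Suspend states
   described in sections 5.5.1 through 5.5.3 can be summarized in a
   small sketch.  The event names are hypothetical labels chosen for
   this example, and the sketch always enters Suspend when the
   condition is met, although the text only says that the strategy
   "can" do so.  The adjustments to the MSS that accompany each
   transition are not shown.

```python
from enum import Enum


class ProbeState(Enum):
    SEARCH = "search"
    MONITOR = "monitor"
    SUSPEND = "suspend"


SUSPEND_CWND_PACKETS = 6   # threshold taken from section 5.5.3


def next_state(state, event, cwnd_packets=None):
    """Return the probing state that follows `event`."""
    if (event == "timeout" and cwnd_packets is not None
            and cwnd_packets < SUSPEND_CWND_PACKETS):
        # Timeout with a small window: stop probing.
        return ProbeState.SUSPEND
    if event == "repeated_timeouts":
        # Presumed re-route onto a smaller link: restart the search.
        return ProbeState.SEARCH
    if state is ProbeState.SEARCH and event == "scan_complete":
        return ProbeState.MONITOR
    if state is ProbeState.MONITOR and event == "probe_succeeded":
        # The path MPS appears to have grown: search again.
        return ProbeState.SEARCH
    # All other events leave the state unchanged.
    return state
```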


5.6. Implementation issues for specific Packetization Layers

   Different protocols introduce specific problems.


5.6.1. Probing method using TCP

   TCP has no mechanism that could be used to distinguish between real
   application data and some other form of padding that might be used to
   fill out probe packets.  Therefore, TCP must generate probes by
   sending oversized segments that are carrying real application data.
   As previously mentioned there are two approaches that TCP might use
   to minimize the overheads associated with the probing process.

   A TCP implementation of PLPMTUD can elect to send subsequent segments
   as though the probe segment was not oversized.  This has the
   advantage that TCP only needs to retransmit a segment at the current
   MTU to recover from failed probes.  However the duplicate data in the
   probe does consume network resources and will cause duplicate
   acknowledgments.  It is important that these extra duplicate
   acknowledgments not trigger Fast Retransmit.  This can be guaranteed
   by limiting the largest probe segment size to twice the current
   segment size (causing at most 1 duplicate acknowledgment) or three
   times the current segment size (causing at most 2 duplicate
   acknowledgments).
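
   The size limit for overlapping probes can be expressed as a one-line
   calculation.  In this sketch the function name is hypothetical, and
   the Fast Retransmit threshold of three duplicate acknowledgments is
   the conventional TCP value rather than something defined in this
   document.

```python
DUPACK_THRESHOLD = 3   # conventional Fast Retransmit trigger


def max_overlapping_probe_size(segment_size, allowed_dup_acks=2):
    """Largest probe segment that causes at most allowed_dup_acks
    duplicate acknowledgments when the following segments are sent
    as though the probe was not oversized."""
    if not 0 <= allowed_dup_acks < DUPACK_THRESHOLD:
        raise ValueError("allowed_dup_acks must stay below the "
                         "Fast Retransmit threshold")
    # A probe spanning k current-size segments overlaps the k - 1
    # segments sent after it, each of which draws a duplicate ACK.
    return (allowed_dup_acks + 1) * segment_size
```

   With a current segment size of 1460 bytes this allows probes of up
   to 4380 bytes, or 2920 bytes if only one duplicate acknowledgment
   is tolerated.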

   The other approach is to send non-overlapping segments following the
   probe.  Although this is cleaner from a protocol architecture
   standpoint, it clashes with many of the optimizations used to improve
   the efficiency of data motion within many operating systems.  In
   particular many implementations divide the data into segments and
   precompute checksums as the data is copied out of user space.  In
   these implementations it can be very expensive to adjust segment
   boundaries after the data is already queued.

   If TCP is using SACK or any other variable length headers, the
   headers on the probe and verification packets should be padded to the
   maximum possible length.  Otherwise, large headers may cause delivery
   problems for future segments.


   Note that the header size and overhead calculations described in
   section @@@ apply here.  TCP's natural data accounting units are
   sequence space and Maximum Segment Size.  However the PLPMTUD
   process is described in terms of total packet size, which is larger
   than the MSS by all fixed and optional headers.

   At the point when TCP is ready to start verification, it is permitted
   to not re-packetize already queued data.  This postpones the
   verification process by the time required to send the queued data.

   If the verification phase experiences any segment losses, TCP is
   required to pull back to the prior MSS.  Since failing the
   verification phase should be an infrequent error condition, it is
   less important that this be as efficient as probing.

5.6.2. Probing method using SCTP

   In the SCTP protocol packetization is the responsibility of the
   application or protocol above SCTP.  By implication SCTP cannot
   easily generate probes by sending additional application data.

   For SCTP the recommended method for generating probes is to pad
   messages by @@@@@@ or by sending a message that consists entirely of
   padding and no application data.

   The verification phase ......


5.6.3.  Issues for tunnels

   @@@ to be written

5.6.4.  Issues for other transport protocols

   Some transport protocols (such as ISO TP4 [ISOTP]) are not allowed to
   repacketize when doing a retransmission.  That is, once an attempt is
   made to transmit a segment of a certain size, the transport cannot
   split the contents of the segment into smaller segments for
   retransmission.  In such a case, the original segment can be
   fragmented by the IP layer during retransmission.  Subsequent
   segments, when transmitted for the first time, should be no larger
   than allowed by the Path MTU.

5.7.  Diagnostic tools

   All implementations MUST include a mechanism to implement diagnostic
   tools that do not rely on the operating system's implementation of
   path MTU discovery.  This requires a mechanism by which an
   application can send oversized packets that are not subjected to the
   operating system's notion of the current path MTU, up to the physical
   MTU limit as supported by the network interface, as well as a
   mechanism to collect any Packet Too Big Messages.

5.8.  Management interface

   It is suggested that an implementation provide a way for a system
   utility program to:

   - Specify that Path MTU Discovery not be done on a given path.

   - Change the PMTU value associated with a given path.

   - Global controls on ICMP processing

   - Per connection or per application controls on ICMP processing

   The first of these can be accomplished by associating a flag with the
   path; when a packet is sent on a path with this flag set, the IP
   layer does not send packets larger than the IPv6 minimum link MTU.

   These features might be used to work around an anomalous situation,
   or by a routing protocol implementation that is able to obtain Path
   MTU values.

   The implementation should also provide a way to change the timeout
   period for aging stale PMTU information.


6. Normative references

 [RFC1191]  Path MTU discovery. J.C. Mogul, S.E. Deering. Nov-01-1990.
            (Format: TXT=47936 bytes) (Obsoletes RFC1063) (Status: DRAFT
            STANDARD)

 [RFC1981]  Path MTU Discovery for IP version 6. J. McCann, S. Deering,
            J. Mogul. August 1996. (Status: PROPOSED STANDARD)

 [RFC2119]  Key words for use in RFCs to Indicate Requirement Levels. S.
            Bradner.  March 1997. (Status: BEST CURRENT PRACTICE)

7. Informative references

 [RFC1063]  IP MTU discovery options. J.C. Mogul, C.A. Kent,
            C. Partridge, K. McCloghrie. Jul-01-1988. (Obsoleted by
            RFC1191)

 [RFC1435]  IESG Advice from Experience with Path MTU Discovery. S.
            Knowles. March 1993. (Format: TXT=2708 bytes) (Status:
            INFORMATIONAL)

 [RFC1626]  Default IP MTU for use over ATM AAL5. R. Atkinson. May 1994.
            (Status: PROPOSED STANDARD)

 [RFC1791]  TCP And UDP Over IPX Networks With Fixed Path MTU. T. Sung.
            April 1995. (Status: EXPERIMENTAL)

 [RFC2923]  TCP Problems with Path MTU Discovery. K. Lahey. September
            2000. (Status: INFORMATIONAL)


8. Security considerations

   Since the MTU reported in the ICMP messages is constrained to be
   between the old MTU and the candidate MTU, this algorithm is more
   difficult to attack through fraudulent ICMP messages.

   Furthermore, since this algorithm can function properly without ICMP
   messages, that part of the algorithm can be disabled for additional
   robustness in hostile environments.

9. IANA considerations


10. Contributors

11. Acknowledgements

   Matt Mathis and John Heffner are supported by a grant from Cisco
   Systems, Inc.

12. Authors' addresses

   Please send comments and suggestions to pmtud@ietf.org.

   Matt Mathis and John Heffner
   Pittsburgh Supercomputing Center
   4400 Fifth Ave.
   Pittsburgh, PA 15213
   mathis@psc.edu
   jheffner@psc.edu

   Kevin Lahey
   Freelance
   kml@patheticgeek.net


13. Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to pertain
   to the implementation or use of the technology described in this
   document or the extent to which any license under such rights might
   or might not be available; neither does it represent that it has made
   any effort to identify any such rights.  Information on the IETF's
   procedures with respect to rights in standards-track and standards-
   related documentation can be found in BCP-11.  Copies of claims of
   rights made available for publication and any assurances of licenses
   to be made available, or the result of an attempt made to obtain a
   general license or permission for the use of such proprietary rights
   by implementers or users of this specification can be obtained from
   the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard.  Please address the information to the IETF Executive
   Director.


14. Full copyright statement

   Copyright (C) The Internet Society 14 Feb, 2004. All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.