The ISP Column 
A monthly column on things Internet
 
Mutterings on MTUs

February 2009 
Geoff Huston

  In the previous column I explored an error I had encountered where
  an IPv4-only web browser on my dual stack system could connect to a
  web server and retrieve the web pages, while a dual stack configured
  browser managed to get itself stuck displaying a white page.

        The problem I encountered was not a fault in the local system,
        nor a fault in the configuration of the remote web server. The
        problem lies in a combination of factors: Firstly, IPv6 will
        not permit routers to perform packet fragmentation in transit,
        and relies on the end system to use the path MTU discovery
        algorithm [RFC1191, RFC1981] to correctly size its packets so
        that no in-flight fragmentation is necessary. Secondly, it
        appears that the ICMPv6 "packet too big" message is not being
        correctly generated in some cases, or, in other cases, the
        ICMPv6 message is subsequently filtered out, so that the
        packet's original source does not receive the
        notification. Thirdly, many host TCP stacks are configured to
        use the local interface MTU as the basis of the initial
        offered MSS, and rely on the correct operation of path MTU
        discovery or, only in the case of IPv4, rely on in-flight
        fragmentation of the TCP packets to correct any problems that
        may arise if any transit segment uses an MTU size lower than
        the pair-wise MTU selection performed as part of the initial
        TCP handshake. And, finally, most TCP stacks do not use any
        form of recovery mechanism in the event that a the ICMPv6
        Packet Too Big indication is lost.

  One possible response is to use active probing as a means of path
  MTU discovery, as described in RFC4821. Here the sending stack
  explicitly probes the path MTU using a path probe mechanism where
  the TCP stack periodically probes the path with a packet that is
  larger than the current path MTU estimate and a longer than usual
  associated ACK timer. Successful acknowledgement of the probe packet
  raises the path MTU estimate to the probe packet size, while the
  absence of an acknowledgement is treated as a probe failure, rather
  than a congestion indication, so that the TCP session does not
  throttle its sending rate in response to an MTU probe failure.

  But packet probing as a means of Path MTU Discovery is not a common
  feature of currently deployed IPv6 implementations in end host
  stacks, so other approaches need to be considered. The previous
  column looked at a number of approaches that could work around the
  problem, but the basic question of path MTU management remains.

  The logical question is: If fragmentation is causing such a problem
  then shouldn't we try to avoid the problem completely and just use
  the minimum packet size for all IP packets? After all, fragmentation
  is problematical for firewalls and filters, because fragments do not
  contain the TCP or UDP port addresses that are a conventional part
  of so many filtering roles, and fragmentation is a problem for the
  destination host in so far as each fragment that has a new IP
  identifier causes the destination to open up a new fragmentation
  reassembly context.

  I offered the opinion that for IPv6 servers and clients, the
  conservative message if you want a robust service your best option
  is to set the server's MTU for IPv6 to at least 40 octets less than
  the interface MTU size, or even to consider setting this MTU value
  down to 1280 octets, the minimum universally supported
  non-fragmented IPv6 packet size, to take the most conservative
  position.

  The conventional belief in this topic appears to be that the most
  compelling argument to raise the packet size above the minimum level
  is to maximise data performance.  So it was not unexpected to see
  some comments in response that questioned the implicit compromise in
  performance through using a smaller MTU. One view was the drop
  dropping the MTU size just "cripples their own network". There
  appears to be a body of opinion that it's a more productive use of
  our time and effort to try and hunt down each and every point in the
  network where these ICMPv6 packet-too-big messages are either not
  being correctly generated, or are being blocked than to spend the
  time performing a configuration change of IPv6 servers to operate
  with an MTU size positioned at between 1400 to 1460 octets, or at
  least 40 octets less than the server's interface MTU size.

The MTU Question

  "Conventional beliefs" always intrigue me, in that while sometimes
  these beliefs express basic constraints and truths, at other times
  they are misleading and just plain wrong. So the question I'd like
  to look at in this article is: How bad is an MTU setting of 1280, as
  compared to an MTU setting of 1500? What performance differential
  can one expect and why? Is this really "crippling" to a network and
  its clients? What is the relationship between internet performance
  and the maximum packet size as set of the MTU setting?  

Packet Overheads and MTU 

  The first, and perhaps the most obvious difference in MTU settings
  is the different relative amount of overhead from the packet and
  frame headers. For a TCP stream operating with the timestamp option
  set (which appears to be a typical setting for Apache web servers)
  the overhead is 20 octets of IPv4 header, or 40 octets of IPv6
  header, 20 octets of TCP header and 12 octets of timestamp TCP
  option header, or a total of 52 or 72 octets. If an 802.3 Ethernet
  frame is being used there is a further framing overhead of 8 octets
  of preamble and start of frame delimiter, 12 octets of MAC
  addresses, 2 octets of Ethertype, and a trailing frame overhead of 4
  octets of the CRC value and 12 octets of the interframe gap (Figure
  1).


  [Figure 1]

  Given that the per-packet overheads are of constant size, then the
  smaller the MTU the higher the relative overhead of the packet
  overheads. The following table shows the maximal TCP carriage
  efficiency when using a maximum-sized TCP packet stream using the
  TCP timestamp option and 802.3 MAC level framing.


    MTU Size IPv4 Payload 802.3 Data IPv6 Payload  802.3 Data 
                          Efficiency               Efficiency

        9000        8948      99.00%         8928      98.78%
        4000        3948      97.77%         3928      97.28%
        1500        1448      94.15%         1428      92.85%
        1480                                 1408      92.75%
        1460                                 1388      92.66%
        1400        1348      93.74%         1328      92.35%
        1280        1228      93.17%         1208      91.65%
        1000         948      91.33%                  
         576         524      85.34%                  


  One way to interpret this table is that, for example, the maximal
  data rate of a 100Mbps connection using IPv4 and a 1500 octet MTU is
  94.15Mbps, while the maximal rate on the same 100Mbps Ethernet using
  an IPv6 connection is 92.85Mbps. If the IPv6 connection is tunneled
  using an IPv6 in IPv4 tunneling mechanism the IPv6 payload is
  reduced by a further 20 octets, bringing the maximal data throughput
  to 92.75%, and the use of a UDP tunnel, such as used by Teredo
  tunneling would bring this down further to 92.66Mbps.

  In the previous column on MTU issues with IPv6 I suggested using a
  1400 MTU as a conservative approach to avoid the issues associated
  with Path MTU discovery black holes. As shown in the above table,
  the overall impact of this approach of dropping the interface MTU
  from 1500 to 1400 octets for IPv6 is a net drop in the maximum
  attainable performance level of 92.85Mbps to 92.35Mbps for data
  throughput.

TCP Performance and MTU size

  When talking about flow performance in the Internet we are normally
  talking about the performance of the TCP protocol stack and the flow
  performance of a reliable data transfer as managed by a TCP rate
  control protocol. In this case I'm trying to look at the assessment
  of what it means to vary the MTU size in a TCP environment.


        There are now many variants of TCP in terms of a rate control
        protocols, so it is somewhat of a misnomer to consider "the"
        TCP rate control algorithm and give the impression that there
        is only one such algorithm. (See the articles "Evolving TCP"
        <http://www.potaroo.net/ispcol/2004-08/2004-08-isp.htm> and
        "faster" <http://www.potaroo.net/ispcol/2005-06/faster.html>
        for an overview of this topic.). A Wikipedia entry on the
        topic of the TCP congestion avoidance algorithm may also be
        informative: <http://en.wikipedia.org/wiki/TCP_Reno -
        TCP_Tahoe_and_Reno>.

        In this case I'll stick with TCP New Reno's behaviour in
        general when I refer to "TCP"

  TCP is a rate adaptive control protocol where the sender is
  attempting to operate the connection at the maximum speed that is
  permitted by the available resources on the sending and receiving
  end host and to operate at a speed that represents a fair proportion
  of the available network path's resources.

  The general characteristics of TCP rate control is shown in the
  following figure. TCP initially starts in a mode termed "Slow
  Start", and then may either stabilize at a resource-limited sending
  rate, or use a dynamic rate control algorithm termed "congestion
  avoidance".

  [Figure 2 - General TCP Rate Control Behaviour]


TCP Slow Start

  I always thought this term was somewhat of a misnomer, as its
  anything but slow once it gets some momentum going! TCP Slow Start
  mode is used at session startup, and at times when the TCP
  connection state appears to be been significantly disrupted because
  of extensive packet loss or periods of connection inactivity. Slow
  Start is used to quickly establish an initial estimate of the
  maximal end to end flow rate.

  TCP's slow start flow control mechanism is such that the sender
  increases its sending window by 1 MSS for each received ACK, for as
  long as the receiver has available advertised buffer space and as
  long as there are no discarded packets. If the receiver sends an ACK
  for every packet this would imply a rate doubling every RTT
  interval. However, most TCP implementations turn on delayed ACK by
  default, and, generally, these default TCP settings ACK every second
  packet, implying that TCP will increase the sending rate by an
  additional 50% every RTT interval.

  This is a multiplicative increase that is very sensitive to the
  round trip time (RTT). For example, if the available path bandwidth
  is 100Mbps, and the MTU size is 1500 octets and delayed ACKs are in
  use, then it will take some 10.5 RTT intervals for TCP Slow Start to
  achieve this path bandwidth. For a LAN with an RTT of less than
  1millisecond slow start can reach this bandwidth within 1 or 2
  milliseconds. But when the RTT stretches to transcontinental levels
  of around 30msec then this can take one third of a second in total
  time, and inter-continental paths of 300msec RTT imply a slow start
  period of 3 seconds to reach this peak bandwidth level. If the slow
  start interval time is directly proportional to the RTT, what's the
  relationship between slow start efficiency and MTU size?

  What effect will varying the MTU size have on the performance of the
  slow start algorithm?

  Let's take an example here, of two systems separated by a 24ms
  network path, no delayed ACKs and TCP timestamps turned on, with a
  maximum path bandwidth of 100Mbps and look at the performance of
  slow start over various packet sizes.

      Time  MTU=1500  MTU=1480  MTU=1400  MTU=1280  MTU=576
                IPv4      IPv6      IPv6      IPv6     IPv4
    (secs)    (Mbps)    (Mbps)    (Mbps)    (Mbps)   (Mbps)

     0.000        0         0         0         0        0
     0.024     2.04      2.02      1.91      1.75     0.81
     0.048     4.09      4.03      3.82      3.50     1.62
     0.072     8.17      8.06      7.64      7.00     3.24
     0.096    16.34     16.13     15.27     13.99     6.49
     0.120    32.68     32.26     30.55     27.99    12.97
     0.144    65.37     64.51     61.10     55.98    25.94
     0.168   100.09    100.30    100.24    100.15    51.88
     0.192                                          103.77


  It will take 0.192 seconds for an IPv4 connection using a 576 byte
  MTU to reach the bottleneck capacity of 100Mbps. It will 0.168
  seconds, or 7 RTT intervals, for an IPv6 connection sized anywhere
  between 1280 octets and 1500 octets MTU. The range of time from the
  smallest MTU to the very common 1500 MTU in this example is just
  7/100's of a second.

  [Figure 3 - TCP Slow Start]


  Assuming that the data set to be transferred is large enough, then
  TCP slow start will continue increasing the sending rate until the
  sender runs out of local buffer space (as the sender has to keep a
  copy of all unacknowledged sent data), or the sender thinks the
  receiver has run out of receiver buffer space (calculated using the
  receivers advertised buffer size, less the amount of unacknowledged
  data in flight), the sender reaches a previously cached threshold
  value for the sending threshold value, of the sender sees an
  indicated of packet loss through the reception of duplicate ACKs, or
  the sender's ACK timer expires.

  In the first two cases TCP will lock into the buffer-constrained
  sending rate, and will continue to send at that rate indefinitely,
  or until at least until either packet loss or ACK timeout
  occurs. The data rate of this buffer-constrained mode of operation
  is independent of the MTU size.

  In the case of an ACK timeout the session will revert to a restart
  mode.

  In the other cases, TCP will move into Congestion Avoidance mode.

Buffer Constrained TCP 

  The receiver advertises its available window with every ACK, and
  this window limits the amount of additional data that the sender can
  send into the network before it must stop and wait for a further
  update with the next ACK.

  Similarly the sender operates a local sending buffer that holds all
  unacknowledged sent data. Once this buffer is filled the sender must
  stop and await an ACK that clears some data out from this local
  buffer.

  A TCP session operating in a buffer-constrained mode does not have a
  direct dependency on the MTU size, as the buffer size reflects the
  amount of unacknowledged data in flight, and not the number of
  packets used to move the data.

  For TCP to operate at maximal performance the sender's buffer and
  the receiver's buffer must both be greater than the bandwidth delay
  product of the network path. This is related to the MTU size to some
  small extent.

  In this example, with an available bandwidth of 100Mbps and a round
  trip delay time of 0.24 seconds, as long as both the receiver and
  sender have a local buffer of 300,000 octets or greater, then any
  TCP connection would be capable of operating at maximal speed over
  the path. To refine this a little, with a maximal TCP data rate of
  94.15Mbps with a 1500 octet MTU, the receiver and sender need to
  have buffers of no less than 282,450 octets. If the MTU is 1400 the
  maximal data rate is 93.74Mbps and the minimum buffer size of
  281,220 octets.

TCP Congestion Avoidance

  TCP's congestion avoidance mode uses a "sawtooth" low frequency
  oscillation of the sending rate around the estimated sustainable
  flow rate, constantly probing to see if a higher flow rate can be
  sustained, and ready to immediately back off once the current flow
  rate experiences congestion.

  This is done by the sender incrementing the value of a sending
  "congestion window" in response to each received non-duplicate
  ACK. Within each RTT interval this incremental opening of the
  congestion window results in an increase in the sending window by a
  total of 1 MSS.

  In response to a packet loss event, signalled by the reception of an
  ACK packet that indicates a lost packet (a duplicate ACK), then TCP
  will attempt to repair the loss. In the first instance TCP will wait
  to see if the duplicate ACK was caused by minor packet
  reordering. In response to three duplicate ACKs the sender will
  assume that the cause is network congestion rather than packet
  reordering, and the sender will repair the packet loss error and
  halve its current sending rate. Once the loss is repaired TCP will
  once more start probing higher sending rates by inflating its
  sending window by a total of 1 MSS unit each RTT time interval,
  assuming that delayed ACKs are turned off.

  This is one area of TCP's performance that is potentially related to
  MTU size. The congestion rate increase is equivalent to a rate
  increase per RTT of ((MSS * 8)/RTT)bps In an example configuration
  with a 24 ms RTT, there will be rate increase of 
  (MSS * 8) / (0.024)**2 bits per second**2.

  The following table shows TCP's Congestion Avoidance rate increase
  using an RTT of 24ms.


         MTU=1500  MTU=1480  MTU=1400  MTU=1280  MTU=576
             IPv4      IPv6      IPv6      IPv6     IPv4
           (Mbps)    (Mbps)    (Mbps)    (Mbps)   (Mbps)

            20.1      19.6      18.4      16.8      7.3

  In this case the rate acceleration of an IPv6 TCP session in
  congestion avoidance mode, using an MTU of 1280 and an RTT of 24ms
  is some 16% slower than an IPv4 session using an MTU of 1500 over
  the same path.

  But its not the difference in the rate increase that is critical
  here. The salient question is: Will a change in the MTU result in a
  performance change of a TCP session result for TCP in congestion
  avoidance mode?

  In theory, the answer is "No, not really." Assuming that the TCP
  connection will experience packet loss when the sending rate hits a
  rate of P bps and TCP then drops to P/2 bps, then rises pack to P
  bps at a rate of (MTU * 8)/RTT**2 repeatedly, then the average sending
  rate is (3/4**P bps. This average sending rate is independent of the
  MTU value, as long as the data file is large enough to sustain a
  number of cycles of the congestion avoidance algorithm.

 
  [Figure 4 - TCP Congestion Avoidance behaviour for a range of MTU
  settings]

  Figure 4 shows a simplified data rate for a TCP connection over a
  connection with a 24ms RTT, using no delayed ACKs where the onset of
  data loss occurs when the sender reaches a sending rate of
  110Mbps. This is simplified in that it is assumed that only a single
  packet loss occurs when TCP reaches the maximum sending rate, jitter
  and the effects of RTT lengthening when the network elements' buffer
  space starts to fill are not included, nor is the TCP recovery
  behaviour following the loss of the packet included. Note that this
  figure tracks the effective data rate, and does not include the
  per-packet overhead of the TCP, IP and Etherframe encapsulation of
  the data.

  The data throughout the results from this performance is shown in
  Figure 5.

  [Figure 5 - TCP Congestion Avoidance Data Throughput for a range of
  MTU settings]

  The variation here in cumulative data throughput is due to the
  differing per-packet overheads, and not due to any intrinsic
  difference in the efficiency of TCP in controlling the data flows
  when the packet size is varied. TCP performance itself is
  independent of the MTU setting.

TCP Rate Equation

  If TCP performance is really independent of the MTU setting, then
  why shouldn't we all just use 1280 octet MTU settings on IPv6 hosts,
  1500 octet settings on all else and just be done? Why bother with
  larger packet sizes at all?

  There is a common belief that MTU size will impact TCP performance,
  and that the larger the MTU the higher the TCP performance.

  Is there any model of TCP that would confirm this view?

  This an extract from "Advice for Internet Subnet Designers"
  [RFC3819]:

  ---------------------

    8.5.1.  The Formulae

       The performance of TCP's AIMD Congestion Avoidance algorithm has been
       extensively analyzed.  The current best formula for the performance
       of the specific algorithms used by Reno TCP (i.e., the TCP specified
       in [RFC2581]) is given by Padhye, et al. [PFTK98].  This formula is:

                                             MSS
               BW = --------------------------------------------------------
                    RTT*sqrt(1.33*p) + RTO*p*[1+32*p^2]*min[1,3*sqrt(.75*p)]

       where

               BW   is the maximum TCP throughout achievable by an
                    individual TCP flow
               MSS  is the TCP segment size being used by the connection
               RTT  is the end-to-end round trip time of the TCP connection
               RTO  is the packet timeout (based on RTT)
               p    is the packet loss rate for the path
                    (i.e., .01 if there is 1% packet loss)

       Note that the speed of the links making up the Internet path does not
       explicitly appear in this formula.  Attempting to send faster than
       the slowest link in the path causes the queue to grow at the
       transmitter driving the bottleneck.  This increases the RTT, which in
       turn reduces the achievable throughput.

       This is currently considered to be the best approximate formula for
       Reno TCP performance.  A further simplification of this formula is
       generally made by assuming that RTO is approximately 5*RTT.

       TCP is constantly being improved.  A simpler formula, which gives an
       upper bound on the performance of any AIMD algorithm which is likely
       to be implemented in TCP in the future, was derived by Ott, et al.
       [MSMO97].

                         MSS   1
               BW = C    --- -------
                         RTT sqrt(p)

       where C is 0.93.

    8.5.2.  Assumptions

       Both formulae assume that the TCP Receiver Window is not limiting the
       performance of the connection.  Because the receiver window is
       entirely determined by end-hosts, we assume that hosts will maximize
       the announced receiver window to maximize their network performance.

       Both of these formulae allow BW to become infinite if there is no
       loss.  However, an Internet path will drop packets at bottlenecked
       queues if the load is too high.  Thus, a completely lossless TCP/IP
       network can never occur (unless the network is being underutilized).

       The RTT used is the arithmetic average, including queuing delays.

       The formulae are for a single TCP connection.  If a path carries many
       TCP connections, each will follow the formulae above independently.

       The formulae assume long-running TCP connections.  For connections
       that are extremely short (<10 packets) and don't lose any packets,
       performance is driven by the TCP slow-start algorithm.  For
       connections of medium length, where on average only a few segments
       are lost, single connection performance will actually be slightly
       better than given by the formulae above.

       The difference between the simple and complex formulae above is that
       the complex formula includes the effects of TCP retransmission
       timeouts.  For very low levels of packet loss (significantly less
       than 1%), timeouts are unlikely to occur, and the formulae lead to
       very similar results.  At higher packet losses (1% and above), the
       complex formula gives a more accurate estimate of performance (which
       will always be significantly lower than the result from the simple
       formula).

       Note that these formulae break down as p approaches 100%.


    [MSMO97]      Mathis, M., Semke, J., Mahdavi, J. and T. Ott, "The
                     Macroscopic Behavior of the TCP Congestion Avoidance
                     Algorithm", Computer Communication Review, Vol. 27,
                     number 3, July 1997.


    [PFTK98]      Padhye, J., Firoiu, V., Towsley, D. and J. Kurose,
                     "Modeling TCP Throughput: a Simple Model and its
                     Empirical Validation", UMASS CMPSCI Tech Report TR98-
                     008, Feb. 1998.

        
  ---------------------

  It should be noted that this rate equation assumes the operation of
  delayed ACKs with every second ACK suppressed by the receiver. The
  action of delayed ACKs on TCP throughput can be made explicit in a
  slightly different rate equation, described in RFC3448:

  The throughput equation is:

                                   s
   X =  ----------------------------------------------------------
        R*sqrt(2*b*p/3) + (t_RTO * (3*sqrt(3*b*p/8) * p * (1+32*p^2)))


   Where:

      X is the transmit rate in bytes/second.

      s is the packet size in bytes.

      R is the round trip time in seconds.

      p is the loss event rate, between 0 and 1.0, of the number of loss
        events as a fraction of the number of packets transmitted.

      t_RTO is the TCP retransmission timeout value in seconds.

      b is the number of packets acknowledged by a single TCP
        acknowledgement.


  This analysis suggests that TCP performance is directly proportional
  to the MTU size, and that, for example, doubling the MTU size would
  double the effective TCP performance, all other things being equal.

  But are all other things equal in such a picture?

  The other part of this rate equation is that TCP performance is
  inversely proportional to the square root of the packet loss ratio.
  If the network loss characteristics are such that large packets have
  the same probability of loss as smaller packets, then larger packets
  will yield improved performance.

  But packet loss probability is not quite so simple. One source of
  packet loss is that of bit level corruption, most commonly
  associated with radio systems, but also visible on wireline
  transmission systems. The probability of bit level corruption in a
  transmission frame can be modelled as

        P = 1 - ((1 - BER)**(FRAME_SIZE * 8))

  Or, if the BER value is sufficiently small,

        P = BER * [FRAME_SIZE * 8]

  In other words, the BER-constrained model of TCP performance is one
  where the packet loss ratio is proportional to the MTU, and this
  would imply that if packet loss is predominately BER-related then
  TCP performance is proportional to the ratio of MTU / sqrt(MTU).

  However, this is not the complete picture of packet loss, as the
  second source of packet loss is congestion loss in the network's
  switching elements. Here the packet loss probability is not
  necessarily an independent function, and as an individual TCP flow
  rate reaches the path bottleneck rate, the packet loss probability
  approaches the value of 1.

  Lets see what this rate equation can offer in terms of a throughput
  calculation.

  In our example, to achieve a 98Mbps transfer rate between two points
  separated by a 24 ms RTT delay with a 1500 byte MTU using IPv6 would
  require a packet loss ratio of 0.000035, or 35 packets per
  million. An MTU of 1280 using IPv6 with the same packet loss rate
  would perform at a maximum of 84Mbps, while a 576 byte MTU using
  IPv4 would perform at 35Mbps.

  The following table shows the result of this TCP performance
  equation for various MTU sizes and packet loss ratios, using a
  constant RTT of 24msec and turning off Delayed ACKs.

    Loss Ratio  MTU=1500  MTU=1480  MTU=1400  MTU=1280  MTU=576
                    IPv4      IPv6      IPv6      IPv6     IPv4
                  (Mbps)    (Mbps)    (Mbps)    (Mbps)   (Mbps)

      0.000001      594       578       545       496      207
      0.000010      188       183       172       157       65
      0.000024      120       117       110       100       42
      0.000035      100        98        92        84       35
      0.000100       59        58        54        50       21
      0.001000       19        18        17        15        6
      0.010000        5         5         5         4        2
      0.100000        1         1         1         1        0
 

  Is there some way to check this equation against the theoretical TCP
  rate model?

  One way to do this is to make some assumptions about our example
  network path to see if the theoretical model of TCP congestion
  avoidance and this rate equation coincides. The major assumption is
  that the buffer level at the network path saturation point is
  minimally adequate, and during an RTT when the sender exceeds the
  100Mbps bottleneck transmission capacity using congestion avoidance,
  the network element will drop a single packet.

  This implies that for each congestion avoidance cycle of a rate
  halving and then a rate increase of 1 MSS per RTT there will only be
  one packet dropped. So what is the packet loss rate predicted by the
  congestion avoidance model?

  Firstly we start with the peak data rate, as calculated by the
  packet and framing overheads, assuming a network path with an RTT of
  24msec, a bottleneck bandwidth of 100Mbps, a TCP session with no
  delayed ACKS and timestamps enabled.

                MTU=1500  MTU=1480  MTU=1400  MTU=1280  MTU=576
                    IPv4      IPv6      IPv6      IPv6     IPv4
                  (Mbps)    (Mbps)    (Mbps)    (Mbps)   (Mbps)

    Peak Rate      94.15     92.75     92.66     92.35    91.65


  We can then multiply this by the RTT to give the Mbits per RTT
  interval at the peak sending rate.

                MTU=1500  MTU=1480  MTU=1400  MTU=1280  MTU=576
                    IPv4      IPv6      IPv6      IPv6     IPv4
                (Mb/RTT)  (Mb/RTT)  (Mb/RTT)  (Mb/RTT) (Mb/RTT)

    Peak Rate      2.260     2.226     2.224     2.216    2.200


  We can then multiply this by 125000 to give the peak rate in units
  of bytes per RTT, and then divide this by the data payload size of
  the packet to give the peak packet rate per RTT

               MTU=1500   MTU=1480   MTU=1400   MTU=1280    MTU=576
                   IPv4       IPv6       IPv6       IPv6       IPv4
             (Pkts/RTT) (Pkts/RTT) (Pkts/RTT) (Pkts/RTT) (Pkts/RTT)

    Peak Rate     195.1      197.6      209.3      229.3      545.5


  Once TCP achieves this peak rate it will then halve its sending rate
  and then build back to the peak rate at a rate of one additional
  packet per RTT. If the peak packet rate is P packets per RTT, then
  the complete TCP congestion avoidance cycle will take (P/2+1) RTT
  intervals to complete, and the total number of packets sent in a
  single cycle is (P/2+1) * (P * 3/4), and the loss rate is 1 packet
  per the number of packets sent in a single cycle.

                MTU=1500  MTU=1480  MTU=1400  MTU=1280  MTU=576
                    IPv4      IPv6      IPv6      IPv6     IPv4

    Packets per   14,415    14,793    16,588    19,867  112,013
          cycle
      Loss Rate 0.000069  0.000068  0.000060  0.000050 0.000009


  The Congestion Avoidance model predicts that the sustained data rate
  is 3/4 of the peak rate. So now we have the associated packet loss
  ratios, how does this compare to the rates predicted by the TCP rate
  equation?


      TCP Data  MTU=1500  MTU=1480  MTU=1400  MTU=1280  MTU=576
      Rate          IPv4      IPv6      IPv6      IPv6     IPv4
                  (Mbps)    (Mbps)    (Mbps)    (Mbps)   (Mbps)

      Model        70.61     69.56     69.50     69.26    68.74
      Equation     71.27     70.21     70.12     69.87    69.20


  It appears that a congestion-driven packet loss model can generate
  TCP rate equation outcomes where the variation in data throughput
  when the MTU is varied is related predominately to the differences
  in the relative overheads of the packet and frame headers rather
  than any intrinsic limitation of TCP.

  So the theory appears to be suggesting that changing the MTU size is
  not going to result in a dramatic difference in outcomes in terms of
  data throughput for TCP.

  If an IPv6 server uses an MTU of 1500 and relies on the correct
  operation of Path MTU discovery it can achieve a maximal data
  throughput of some 92.8% of the available path bandwidth. If the
  server adopts a more conservative position and uses an IPv6 MTU of
  1280 octets the throughput efficiency will drop to some 91.6% of the
  available path bandwidth. To me this appears to be a relatively
  minor difference in terms of achievable performance.

  But all of this so far is theoretical. Perhaps a relevant question
  now is: How does this theory work in practice?

  I'll report on some experimental findings in my next article on this
  topic.

 
        For a 'ready reckoner' of TCP performance check out some work
        done by the WAND Network Research Group in New Zealand:
        http://wand.net.nz/~perry/max_download.php

        And for more reading relating to MTU and performance, have a
        look at Matt Mathis' work at http://staff.psc.edu/mathis/MTU/


Disclaimer

  The above views do not necessarily represent the views of the Asia
  Pacific Network Information Centre.
 
About the Author
 
 
  GEOFF HUSTON holds a B.Sc. and a M.Sc. from the Australian National
  University. He has been closely involved with the development of the
  Internet for many years, particularly within Australia, where he was
  responsible for the initial build of the Internet within the
  Australian academic and research sector. He is author of a number of
  Internet-related books, and is currently the Chief Scientist at
  APNIC, the Regional Internet Registry serving the Asia Pacific
  region. He was a member of the Internet Architecture Board from 1999
  until 2005, and served on the Board of the Internet Society from
  1992 until 2001.

  www.potaroo.net