Network Working Group                                          Matt Mathis
INTERNET-DRAFT                              Pittsburgh Supercomputing Center
Expiration Date: May 1997                                           Nov 1996


                      Empirical Bulk Transfer Capacity

                  < draft-ietf-bmwg-ippm-treno-btc-00.txt >


Status of this Document

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as ``work in
   progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

Abstract:

   Bulk Transport Capacity (BTC) is a measure of a network's ability to
   transfer significant quantities of data with a single
   congestion-aware transport connection (e.g. state-of-the-art TCP).
   For many applications the BTC of the underlying network dominates
   the overall elapsed time for the application, and thus dominates the
   performance as perceived by a user.

   The BTC is a property of an IP cloud (links, routers, switches, etc)
   between a pair of hosts.  It does not include the hosts themselves
   (or their transport-layer software).  However, congestion control is
   crucial to the BTC metric because the Internet depends on the end
   systems to fairly divide the available bandwidth on the basis of
   common congestion behavior.  The BTC metric is based on the
   performance of a reference congestion control algorithm that has
   particularly uniform and stable behavior.

Introduction

   This Internet-Draft is likely to become one section of some future,
   larger document covering several different metrics.

Motivation:

   Bulk Transport Capacity (BTC) is a measure of a network's ability to
   transfer significant quantities of data with a single
   congestion-aware transport connection (e.g. state-of-the-art TCP).
   For many applications the BTC of the underlying network dominates
   the overall elapsed time for the application, and thus dominates the
   performance as perceived by a user.  Examples of such applications
   include ftp and other network copy utilities.

   The BTC is a property of an IP cloud (links, routers, switches, etc)
   between a pair of hosts.  It does not include the hosts themselves
   (or their transport-layer software).  However, congestion control is
   crucial to the BTC metric because the Internet depends on the end
   systems to fairly divide the available bandwidth on the basis of
   common congestion behavior.  The BTC metric is based on the
   performance of a reference congestion control algorithm that has
   particularly uniform and stable behavior.

   The reference algorithm is documented in Appendix A, and can be
   implemented in TCP using the SACK option [RFC2018].  It is similar
   in style and behavior to the congestion control algorithms which
   have been in standard use [Jacobson88, Stevens94, Stevens96] in the
   Internet.
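   As a rough illustration of why the reference algorithm's equilibrium
   behavior is uniform, the following sketch (not part of the metric
   definition; the packet size, RTT, and peak window below are
   arbitrary assumptions) computes the average data rate over one
   idealized congestion avoidance cycle, in which the congestion window
   grows linearly from W/2 to W packets and is then halved by a single
   loss.

      # Illustrative sketch only: average rate of one idealized AIMD
      # cycle.  Assumes one loss per cycle, a fixed RTT, and linear
      # window growth of one packet per round trip.

      def cycle_average_bps(max_window_pkts, packet_bytes, rtt_sec):
          """Average data rate while cwnd climbs from W/2 to W packets."""
          w = max_window_pkts
          rounds = w - w // 2 + 1            # round trips in one cycle
          pkts = sum(range(w // 2, w + 1))   # packets sent over the cycle
          return pkts * packet_bytes * 8 / (rounds * rtt_sec)

      if __name__ == "__main__":
          # Hypothetical numbers: 1500-byte packets, 70 ms RTT, peak window 32.
          print("%.0f bits/s" % cycle_average_bps(32, 1500, 0.070))

   Because every equilibrium cycle has this same shape, the long-run
   average converges quickly and repeatably, which is the property the
   metric relies on.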
   Since the behavior of the reference congestion control algorithm is
   well defined and implementation independent, it will be possible to
   confirm that different measurements only reflect properties of the
   network and not the end-systems.  As such, BTC will be a true
   network metric.

   [A strong definition of "network metric" belongs in the framework
   document:
      - truly indicative of what *could* be done with TCP or another
        good transport layer
      - sensitive to weaknesses in the routers, switches, links, etc.
        of the IP cloud that would also cause problems for production
        transport layers
      - *not* sensitive to weaknesses in common host hardware or
        software, such as current production TCP implementations, that
        can be removed by doing transport right on the hosts
      - complete as a methodology, in that little/no additional deep
        knowledge of state-of-the-art measurement technology is needed
   Others may come to mind.  - Guy Almes]

   Implementing standard congestion control algorithms within the
   diagnostic eliminates calibration problems associated with the
   non-uniformity of current TCP implementations.  However, like all
   empirical metrics it introduces new problems, most notably the need
   to certify the correctness of the implementation and to verify that
   there are no systematic errors due to limitations of the tester.

   This version of the metric is based on the tool TReno (pronounced
   tree-no), which implements the reference congestion control
   algorithm over either traceroute-style UDP and ICMP messages or
   ICMP ping packets.

   Many of the calibration checks can be included in the measurement
   process itself.  The TReno program includes error and warning
   messages for many conditions which indicate either problems with
   the infrastructure or, in some cases, problems with the measurement
   process.  Other checks need to be performed manually.

Metric Name:

   TReno-Type-P-Bulk-Transfer-Capacity (e.g. TReno-UDP-BTC)

Metric Parameters:

   A pair of IP addresses, Src (aka "tester") and Dst (aka "target"),
   a start time T, and an initial MTU.

   [The framework document needs a general way to address additional
   constraints that may be applied to metrics: e.g. for a NetNow-style
   test between hosts on two exchange points, some indication of /
   control over the first hop is needed.]

Definition:

   The average data rate attained by the reference congestion control
   algorithm, while using type-P packets to probe the forward (Src to
   Dst) path.  In the case of ICMP ping, these messages also probe the
   return path.

Metric Units:

   bits per second

Ancillary results and output used to verify the proper measurement
procedure and calibration:

   - Statistics over the entire test (data transferred, duration, and
     average rate)
   - Statistics from the equilibrium portion of the test (data
     transferred, duration, average rate, and number of equilibrium
     congestion control cycles)
   - Path property statistics (MTU, min RTT, max cwnd in equilibrium,
     and max cwnd during Slow-start)
   - Statistics from the non-equilibrium portion of the test (nature
     and number of non-equilibrium events)
   - Estimated load/BW/buffering used on the return path
   - Warnings about data transmission abnormalities (e.g. packets
     out-of-order)
   - Warnings about conditions which may affect metric accuracy (e.g.
     insufficient tester buffering)
   - Alarms about serious data transmission abnormalities (e.g. data
     duplicated in the network)
   - Alarms about tester internal inconsistencies and events which
     might invalidate the results
   - IP address/name of the responding target
   - TReno version
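   As a concrete but hypothetical illustration of how the reported
   value and the first two ancillary statistics relate to each other
   (the variable names and numbers below are assumptions, not TReno
   output fields), the headline figure is simply the data transferred
   divided by the elapsed time, expressed in bits per second:

      # Illustrative sketch: deriving the headline BTC figure and the
      # equilibrium-only average from raw totals.  All names and
      # numbers are hypothetical.

      def average_rate_bps(bytes_transferred, duration_sec):
          return bytes_transferred * 8 / duration_sec

      if __name__ == "__main__":
          total_bytes, total_sec = 12_500_000, 30.0          # entire test
          eq_bytes, eq_sec, eq_cycles = 11_000_000, 24.0, 40 # equilibrium only

          print("BTC (entire test):   %.0f bits/s"
                % average_rate_bps(total_bytes, total_sec))
          print("Equilibrium portion: %.0f bits/s over %d cycles"
                % (average_rate_bps(eq_bytes, eq_sec), eq_cycles))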
Method:

   Run the treno program on the tester with the chosen packet type
   addressed to the target.  Record both the BTC and the ancillary
   results.

   Manual calibration checks (see detailed explanations below):
   - Verify that the tester and target have sufficient raw bandwidth
     to sustain the test.
   - Verify that the tester and target have sufficient buffering to
     support the window needed by the test.
   - Verify that there is no other system activity on the tester or
     target.
   - Verify that the return path is not a bottleneck at the load
     needed to sustain the test.
   - Verify that the IP address reported in the replies is some
     interface of the selected target.

   Version control:
   - Record the precise TReno version (-V switch).
   - Record the precise tester OS version, CPU version and speed,
     interface type and version.

Discussion:

   We do not use existing TCP implementations due to a number of
   problems which make them difficult to calibrate as metrics.  The
   Reno congestion control algorithms are subject to a number of
   chaotic or turbulent behaviors which introduce non-uniform
   performance [Floyd95, Hoe95, mathis96].  Non-uniform performance
   introduces substantial non-calibratable uncertainty when used as a
   metric.  Furthermore, a number of people [Paxon:testing,
   Comer:testing, ??others??] have observed extreme diversity between
   different TCP implementations, raising doubts about repeatability
   and consistency between different TCP-based measures.

   There are many possible reasons why a TReno measurement might not
   agree with the performance obtained by a TCP-based application.
   Some key ones include: older TCPs missing key algorithms such as
   MTU discovery, support for large windows or SACK, or mistuning of
   either the data source or sink.  Some network conditions which need
   the newer TCP algorithms are detected by TReno and reported in the
   ancillary results.  Other documents will cover methods to diagnose
   the difference between TReno and TCP performance.

   Note that the BTC metric is defined specifically to be the average
   data rate between the source and destination hosts.  The ancillary
   results are designed to detect a number of possible measurement
   problems, and in a few cases pathological behaviors in the network.
   The ancillary results should not be used as metrics in their own
   right.

   The discussion below assumes that the TReno algorithm is implemented
   as a user-mode program running under a standard operating system.
   Other implementations, such as a dedicated measurement instrument,
   can have stronger built-in calibration checks.

   The raw performance (bandwidth) limitations of both the tester and
   target SHOULD be measured by running TReno in a controlled
   environment (e.g. a bench test).  Ideally the observed performance
   limits should be validated by diagnosing the nature of the
   bottleneck and verifying that it agrees with other benchmarks of the
   tester and target (e.g. that TReno performance agrees with direct
   measures of backplane or memory bandwidth or other bottleneck as
   appropriate).  These raw performance limitations MAY be obtained in
   advance and recorded for later reference.  Currently no routers are
   reliable targets, although under some conditions they can be used
   for meaningful measurements.

   For most people testing between a pair of modern computer systems at
   a few megabits per second or less, the tester and target are
   unlikely to be the bottleneck.  TReno may not be accurate, and
   SHOULD NOT be used as a formal metric, at rates above half of the
   known tester or target limits.  This is because during Slow-start
   TReno needs to be able to send bursts at twice the average data
   rate.  [Need an exception if the 1st hop LAN is the limit in all
   cases?]
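   The following sketch illustrates that rule of thumb; the bench-test
   limits and the expected path rate are operator-supplied assumptions,
   not values produced by TReno itself.

      # Illustrative sketch of the calibration rule above: the expected
      # path rate should stay below half of the bench-tested
      # tester/target limits, since Slow-start bursts run at roughly
      # twice the average data rate.

      def rate_limit_ok(expected_bps, tester_limit_bps, target_limit_bps):
          ceiling = min(tester_limit_bps, target_limit_bps) / 2.0
          return expected_bps <= ceiling

      if __name__ == "__main__":
          # Hypothetical bench-test results and expected path rate.
          print(rate_limit_ok(expected_bps=3e6,
                              tester_limit_bps=10e6,
                              target_limit_bps=8e6))  # True: 3 Mb/s <= 4 Mb/s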
   Verifying that the tester and target have sufficient buffering is
   difficult.  If they do not have sufficient buffer space, then losses
   at their own queues may contribute to the apparent losses along the
   path.

   There are several difficulties in verifying the tester and target
   buffer capacity.  First, there are no good tests of the target's
   buffer capacity at all.  Second, all validation of the tester's
   buffering depends in some way on the accuracy of reports by the
   tester's own operating system.  Third, there is the confusing result
   that in many circumstances (particularly when there is more than
   sufficient average performance) insufficient buffering does not
   adversely impact measured performance.

   TReno separately instruments the performance of the equilibrium and
   non-equilibrium portions of the test.  This is because TReno's
   behavior is intrinsically more accurate during equilibrium.  If
   TReno can not sustain equilibrium, it suggests either serious
   problems with the network or that the expected performance is lower
   than can be accurately measured by TReno.

   TReno reports (as calibration alarms) any events where transmit
   packets were refused due to insufficient buffer space.  It reports a
   warning if the maximum measured congestion window is larger than the
   reported buffer space.  Although these checks are likely to be
   sufficient in most cases, they are probably not sufficient in all
   cases, and will be the subject of future research.

   Note that on a timesharing or multi-tasking system, other activity
   on the tester introduces burstiness due to operating system
   scheduler latency.  Therefore, it is very important that there be no
   other system activity during a test.  This SHOULD be confirmed with
   other operating system specific tools.

   In traceroute mode, TReno computes and reports the load on the
   return path.  Unlike real TCP, TReno can not distinguish between
   losses on the forward and return paths, so ideally we want the
   return path to introduce as little loss as possible.  The best way
   to test the return path is with TReno ICMP mode using ACK-sized
   messages, and to verify that the measured packet rate is improved by
   a factor of two.  [More research needed]

   In ICMP mode, TReno measures the net effect of both the forward and
   return paths on a single data stream.  Bottlenecks and packet losses
   in the forward and return paths are treated equally.

   It would raise the accuracy of TReno traceroute mode if the ICMP TTL
   exceeded messages were generated at the target and transmitted along
   the return path with elevated priority (reduced losses and queuing
   delays).

   People using the TReno metric as part of procurement documents
   should be aware that in many circumstances MTU has an intrinsic and
   large impact on overall path performance.  Under some conditions the
   difficulty in meeting a given performance specification is inversely
   proportional to the square of the path MTU (e.g. halving the
   specified MTU makes meeting the bandwidth specification 4 times
   harder; see the sketch at the end of this discussion).

   In metric mode, TReno presents exactly the same load to the network
   as a properly tuned state-of-the-art TCP between the same pair of
   hosts.  Although the connection is not transferring useful data, it
   is no more wasteful than fetching an unwanted web page that takes
   the same time to transfer.
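   One way to make the inverse-square MTU effect mentioned above
   concrete is to assume the simple equilibrium model in which the
   attainable rate scales as MTU / (RTT * sqrt(loss)); under that
   assumption, the loss rate that still meets a given bandwidth target
   scales with the square of the MTU.  The sketch below (all numbers
   hypothetical, model constant taken as 1) shows the factor-of-four
   change when the MTU is halved.

      # Illustrative sketch of the inverse-square MTU effect.  Assumes
      # the simple equilibrium model  rate ~ c * MTU / (RTT * sqrt(p)),
      # so the loss rate p consistent with a target rate scales as MTU**2.

      def tolerable_loss(target_bps, mtu_bytes, rtt_sec, c=1.0):
          """Largest loss probability consistent with the target (model only)."""
          return (c * mtu_bytes * 8 / (target_bps * rtt_sec)) ** 2

      if __name__ == "__main__":
          # Hypothetical target: 5 Mb/s over a 70 ms RTT path.
          for mtu in (1500, 750):
              print("MTU %4d: loss must stay below %.2e"
                    % (mtu, tolerable_loss(5e6, mtu, 0.070)))
          # Halving the MTU cuts the tolerable loss rate by a factor of four.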
References

   [RFC2018]    Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow,
                "TCP Selective Acknowledgment Options", RFC 2018,
                October 1996, ftp://ds.internic.net/rfc/rfc2018.txt.

   [Jacobson88] Jacobson, V., "Congestion Avoidance and Control",
                Proceedings of SIGCOMM '88, Stanford, CA, August 1988.

   [Stevens94]  Stevens, W., "TCP/IP Illustrated, Volume 1: The
                Protocols", Addison-Wesley, 1994.

   [Stevens96]  Stevens, W., "TCP Slow Start, Congestion Avoidance,
                Fast Retransmit, and Fast Recovery Algorithms", work in
                progress,
                ftp://ietf.org/internet-drafts/draft-stevens-tcpca-spec-01.txt.

   [Floyd95]    Floyd, S., "TCP and Successive Fast Retransmits",
                February 1995,
                ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

   [Hoe95]      Hoe, J., "Startup Dynamics of TCP's Congestion Control
                and Avoidance Schemes", Master's thesis, Massachusetts
                Institute of Technology, June 1995.

   [mathis96]   Mathis, M. and Mahdavi, J., "Forward Acknowledgment:
                Refining TCP Congestion Control", Proceedings of ACM
                SIGCOMM '96, Stanford, CA, August 1996.

Author's Address

   Matt Mathis
   Pittsburgh Supercomputing Center
   4400 Fifth Ave.
   Pittsburgh, PA 15213
   email: mathis@psc.edu

----------------------------------------------------------------

Appendix A:

   Currently the best existing description of the reference algorithm
   is the "FACK technical note" at
   http://www.psc.edu/networking/tcp.html.  Within TReno, all
   invocations of "bounding parameters" will be reported as warnings.

   The FACK technical note will be revised for TReno, supplemented by a
   code fragment, and included here.
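   Pending that revision, the following fragment is a rough,
   non-normative sketch of the style of congestion window arithmetic
   involved (Slow-start, linear congestion avoidance, and halving on a
   congestion signal).  It is an illustration under simplifying
   assumptions, not the reference algorithm itself.

      # Non-normative sketch of Reno/FACK-style window arithmetic:
      # exponential growth during Slow-start, linear growth in
      # congestion avoidance, and halving on each congestion signal.
      # The reference algorithm is defined by the revised FACK
      # technical note, not by this fragment.

      def step_cwnd(cwnd, ssthresh, acked_pkts, loss):
          """Return (cwnd, ssthresh) after one round trip."""
          if loss:                        # congestion signal: halve the window
              ssthresh = max(cwnd / 2.0, 2.0)
              return ssthresh, ssthresh
          if cwnd < ssthresh:             # Slow-start: one packet per ACK
              return cwnd + acked_pkts, ssthresh
          return cwnd + 1.0, ssthresh     # congestion avoidance: one per RTT

      if __name__ == "__main__":
          cwnd, ssthresh = 1.0, 64.0
          for rtt in range(20):
              loss = (rtt == 10)          # hypothetical single loss
              cwnd, ssthresh = step_cwnd(cwnd, ssthresh, int(cwnd), loss)
              print("RTT %2d: cwnd %.1f" % (rtt, cwnd))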