Network Working Group                                          Matt Mathis
INTERNET-DRAFT                              Pittsburgh Supercomputing Center
Expiration Date: May 1997                                           Nov 1996


                      Empirical Bulk Transfer Capacity

                  < draft-ietf-bmwg-ippm-treno-btc-00.txt >


Status of this Document

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as ``work in
   progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

Abstract:

   Bulk Transport Capacity (BTC) is a measure of a network's ability to
   transfer significant quantities of data with a single
   congestion-aware transport connection (e.g. state-of-the-art TCP).
   For many applications the BTC of the underlying network dominates
   the overall elapsed time for the application, and thus dominates the
   performance as perceived by a user.

   The BTC is a property of an IP cloud (links, routers, switches, etc)
   between a pair of hosts.  It does not include the hosts themselves
   (or their transport-layer software).  However, congestion control is
   crucial to the BTC metric because the Internet depends on the end
   systems to fairly divide the available bandwidth on the basis of
   common congestion behavior.  The BTC metric is based on the
   performance of a reference congestion control algorithm that has
   particularly uniform and stable behavior.

Introduction

   This Internet-Draft is likely to become one section of some future,
   larger document covering several different metrics.

Motivation:

   Bulk Transport Capacity (BTC) is a measure of a network's ability to
   transfer significant quantities of data with a single
   congestion-aware transport connection (e.g. state-of-the-art TCP).
   For many applications the BTC of the underlying network dominates
   the overall elapsed time for the application, and thus dominates the
   performance as perceived by a user.  Examples of such applications
   include ftp and other network copy utilities.

   The BTC is a property of an IP cloud (links, routers, switches, etc)
   between a pair of hosts.  It does not include the hosts themselves
   (or their transport-layer software).  However, congestion control is
   crucial to the BTC metric because the Internet depends on the end
   systems to fairly divide the available bandwidth on the basis of
   common congestion behavior.  The BTC metric is based on the
   performance of a reference congestion control algorithm that has
   particularly uniform and stable behavior.

   The reference algorithm is documented in Appendix A, and can be
   implemented in TCP using the SACK option [RFC2018].  It is similar
   in style and behavior to the congestion control algorithms which
   have been in standard use [Jacobson88, Stevens94, Stevens96] in the
   Internet.
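   As a rough illustration of why the reference algorithm's equilibrium
   behavior is uniform, the following sketch (not part of the metric
   definition; the packet size, RTT, and peak window below are
   arbitrary assumptions) computes the average data rate over one
   idealized congestion avoidance cycle, in which the congestion window
   grows linearly from W/2 to W packets and is then halved by a single
   loss.

      # Illustrative sketch only: average rate of one idealized AIMD
      # cycle.  Assumes one loss per cycle, a fixed RTT, and linear
      # window growth of one packet per round trip.

      def cycle_average_bps(max_window_pkts, packet_bytes, rtt_sec):
          """Average data rate while cwnd climbs from W/2 to W packets."""
          w = max_window_pkts
          rounds = w - w // 2 + 1            # round trips in one cycle
          pkts = sum(range(w // 2, w + 1))   # packets sent over the cycle
          return pkts * packet_bytes * 8 / (rounds * rtt_sec)

      if __name__ == "__main__":
          # Hypothetical numbers: 1500-byte packets, 70 ms RTT, peak window 32.
          print("%.0f bits/s" % cycle_average_bps(32, 1500, 0.070))

   Because every equilibrium cycle has this same shape, the long-run
   average converges quickly and repeatably, which is the property the
   metric relies on.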
   Since the behavior of the reference congestion control algorithm is
   well defined and implementation independent, it will be possible to
   confirm that different measurements only reflect properties of the
   network and not the end-systems.  As such, BTC will be a true
   network metric.

   [A strong definition of "network metric" belongs in the framework
   document:
      - truly indicative of what *could* be done with TCP or another
        good transport layer
      - sensitive to weaknesses in the routers, switches, links, etc.
        of the IP cloud that would also cause problems for production
        transport layers
      - *not* sensitive to weaknesses in common host hardware or
        software, such as current production TCP implementations, that
        can be removed by doing transport right on the hosts
      - complete as a methodology, in that little/no additional deep
        knowledge of state-of-the-art measurement technology is needed
   Others may come to mind.  - Guy Almes]

   Implementing standard congestion control algorithms within the
   diagnostic eliminates calibration problems associated with the
   non-uniformity of current TCP implementations.  However, like all
   empirical metrics it introduces new problems, most notably the need
   to certify the correctness of the implementation and to verify that
   there are no systematic errors due to limitations of the tester.

   This version of the metric is based on the tool TReno (pronounced
   tree-no), which implements the reference congestion control
   algorithm over either traceroute-style UDP and ICMP messages or
   ICMP ping packets.

   Many of the calibration checks can be included in the measurement
   process itself.  The TReno program includes error and warning
   messages for many conditions which indicate either problems with
   the infrastructure or, in some cases, problems with the measurement
   process.  Other checks need to be performed manually.

Metric Name:

   TReno-Type-P-Bulk-Transfer-Capacity (e.g. TReno-UDP-BTC)

Metric Parameters:

   A pair of IP addresses, Src (aka "tester") and Dst (aka "target"),
   a start time T, and an initial MTU.

   [The framework document needs a general way to address additional
   constraints that may be applied to metrics: e.g. for a NetNow-style
   test between hosts on two exchange points, some indication of /
   control over the first hop is needed.]

Definition:

   The average data rate attained by the reference congestion control
   algorithm, while using type-P packets to probe the forward (Src to
   Dst) path.  In the case of ICMP ping, these messages also probe the
   return path.

Metric Units:

   bits per second

Ancillary results and output used to verify the proper measurement
procedure and calibration:

   - Statistics over the entire test (data transferred, duration, and
     average rate)
   - Statistics from the equilibrium portion of the test (data
     transferred, duration, average rate, and number of equilibrium
     congestion control cycles)
   - Path property statistics (MTU, min RTT, max cwnd in equilibrium,
     and max cwnd during Slow-start)
   - Statistics from the non-equilibrium portion of the test (nature
     and number of non-equilibrium events)
   - Estimated load/BW/buffering used on the return path
   - Warnings about data transmission abnormalities (e.g. packets
     out-of-order)
   - Warnings about conditions which may affect metric accuracy (e.g.
     insufficient tester buffering)
   - Alarms about serious data transmission abnormalities (e.g. data
     duplicated in the network)
   - Alarms about tester internal inconsistencies and events which
     might invalidate the results
   - IP address/name of the responding target
   - TReno version
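   As a concrete but hypothetical illustration of how the reported
   value and the first two ancillary statistics relate to each other
   (the variable names and numbers below are assumptions, not TReno
   output fields), the headline figure is simply the data transferred
   divided by the elapsed time, expressed in bits per second:

      # Illustrative sketch: deriving the headline BTC figure and the
      # equilibrium-only average from raw totals.  All names and
      # numbers are hypothetical.

      def average_rate_bps(bytes_transferred, duration_sec):
          return bytes_transferred * 8 / duration_sec

      if __name__ == "__main__":
          total_bytes, total_sec = 12_500_000, 30.0          # entire test
          eq_bytes, eq_sec, eq_cycles = 11_000_000, 24.0, 40 # equilibrium only

          print("BTC (entire test):   %.0f bits/s"
                % average_rate_bps(total_bytes, total_sec))
          print("Equilibrium portion: %.0f bits/s over %d cycles"
                % (average_rate_bps(eq_bytes, eq_sec), eq_cycles))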
Method:

   Run the treno program on the tester with the chosen packet type
   addressed to the target.  Record both the BTC and the ancillary
   results.

   Manual calibration checks (see detailed explanations below):
   - Verify that the tester and target have sufficient raw bandwidth
     to sustain the test.
   - Verify that the tester and target have sufficient buffering to
     support the window needed by the test.
   - Verify that there is no other system activity on the tester or
     target.
   - Verify that the return path is not a bottleneck at the load
     needed to sustain the test.
   - Verify that the IP address reported in the replies is some
     interface of the selected target.

   Version control:
   - Record the precise TReno version (-V switch).
   - Record the precise tester OS version, CPU version and speed,
     interface type and version.

Discussion:

   We do not use existing TCP implementations due to a number of
   problems which make them difficult to calibrate as metrics.  The
   Reno congestion control algorithms are subject to a number of
   chaotic or turbulent behaviors which introduce non-uniform
   performance [Floyd95, Hoe95, mathis96].  Non-uniform performance
   introduces substantial non-calibratable uncertainty when used as a
   metric.  Furthermore, a number of people [Paxon:testing,
   Comer:testing, ??others??] have observed extreme diversity between
   different TCP implementations, raising doubts about repeatability
   and consistency between different TCP-based measures.

   There are many possible reasons why a TReno measurement might not
   agree with the performance obtained by a TCP-based application.
   Some key ones include: older TCPs missing key algorithms such as
   MTU discovery, support for large windows or SACK, or mistuning of
   either the data source or sink.  Some network conditions which need
   the newer TCP algorithms are detected by TReno and reported in the
   ancillary results.  Other documents will cover methods to diagnose
   the difference between TReno and TCP performance.

   Note that the BTC metric is defined specifically to be the average
   data rate between the source and destination hosts.  The ancillary
   results are designed to detect a number of possible measurement
   problems, and in a few cases pathological behaviors in the network.
   The ancillary results should not be used as metrics in their own
   right.

   The discussion below assumes that the TReno algorithm is implemented
   as a user-mode program running under a standard operating system.
   Other implementations, such as a dedicated measurement instrument,
   can have stronger built-in calibration checks.

   The raw performance (bandwidth) limitations of both the tester and
   target SHOULD be measured by running TReno in a controlled
   environment (e.g. a bench test).  Ideally the observed performance
   limits should be validated by diagnosing the nature of the
   bottleneck and verifying that it agrees with other benchmarks of the
   tester and target (e.g. that TReno performance agrees with direct
   measures of backplane or memory bandwidth or other bottleneck as
   appropriate).  These raw performance limitations MAY be obtained in
   advance and recorded for later reference.  Currently no routers are
   reliable targets, although under some conditions they can be used
   for meaningful measurements.

   For most people testing between a pair of modern computer systems at
   a few megabits per second or less, the tester and target are
   unlikely to be the bottleneck.  TReno may not be accurate, and
   SHOULD NOT be used as a formal metric, at rates above half of the
   known tester or target limits.  This is because during Slow-start
   TReno needs to be able to send bursts at twice the average data
   rate.  [Need an exception if the 1st hop LAN is the limit in all
   cases?]
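   The following sketch illustrates that rule of thumb; the bench-test
   limits and the expected path rate are operator-supplied assumptions,
   not values produced by TReno itself.

      # Illustrative sketch of the calibration rule above: the expected
      # path rate should stay below half of the bench-tested
      # tester/target limits, since Slow-start bursts run at roughly
      # twice the average data rate.

      def rate_limit_ok(expected_bps, tester_limit_bps, target_limit_bps):
          ceiling = min(tester_limit_bps, target_limit_bps) / 2.0
          return expected_bps <= ceiling

      if __name__ == "__main__":
          # Hypothetical bench-test results and expected path rate.
          print(rate_limit_ok(expected_bps=3e6,
                              tester_limit_bps=10e6,
                              target_limit_bps=8e6))  # True: 3 Mb/s <= 4 Mb/s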
   Verifying that the tester and target have sufficient buffering is
   difficult.  If they do not have sufficient buffer space, then losses
   at their own queues may contribute to the apparent losses along the
   path.

   There are several difficulties in verifying the tester and target
   buffer capacity.  First, there are no good tests of the target's
   buffer capacity at all.  Second, all validation of the tester's
   buffering depends in some way on the accuracy of reports by the
   tester's own operating system.  Third, there is the confusing result
   that in many circumstances (particularly when there is more than
   sufficient average performance) insufficient buffering does not
   adversely impact measured performance.

   TReno separately instruments the performance of the equilibrium and
   non-equilibrium portions of the test.  This is because TReno's
   behavior is intrinsically more accurate during equilibrium.  If
   TReno can not sustain equilibrium, it suggests either serious
   problems with the network or that the expected performance is lower
   than can be accurately measured by TReno.

   TReno reports (as calibration alarms) any events where transmit
   packets were refused due to insufficient buffer space.  It reports a
   warning if the maximum measured congestion window is larger than the
   reported buffer space.  Although these checks are likely to be
   sufficient in most cases, they are probably not sufficient in all
   cases, and will be the subject of future research.

   Note that on a timesharing or multi-tasking system, other activity
   on the tester introduces burstiness due to operating system
   scheduler latency.  Therefore, it is very important that there be no
   other system activity during a test.  This SHOULD be confirmed with
   other operating system specific tools.

   In traceroute mode, TReno computes and reports the load on the
   return path.  Unlike real TCP, TReno can not distinguish between
   losses on the forward and return paths, so ideally we want the
   return path to introduce as little loss as possible.  The best way
   to test the return path is with TReno ICMP mode using ACK-sized
   messages, and to verify that the measured packet rate is improved by
   a factor of two.  [More research needed]

   In ICMP mode, TReno measures the net effect of both the forward and
   return paths on a single data stream.  Bottlenecks and packet losses
   in the forward and return paths are treated equally.

   It would raise the accuracy of TReno traceroute mode if the ICMP TTL
   exceeded messages were generated at the target and transmitted along
   the return path with elevated priority (reduced losses and queuing
   delays).

   People using the TReno metric as part of procurement documents
   should be aware that in many circumstances MTU has an intrinsic and
   large impact on overall path performance.  Under some conditions the
   difficulty in meeting a given performance specification is inversely
   proportional to the square of the path MTU (e.g. halving the
   specified MTU makes meeting the bandwidth specification 4 times
   harder; see the sketch at the end of this discussion).

   In metric mode, TReno presents exactly the same load to the network
   as a properly tuned state-of-the-art TCP between the same pair of
   hosts.  Although the connection is not transferring useful data, it
   is no more wasteful than fetching an unwanted web page that takes
   the same time to transfer.
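   One way to make the inverse-square MTU effect mentioned above
   concrete is to assume the simple equilibrium model in which the
   attainable rate scales as MTU / (RTT * sqrt(loss)); under that
   assumption, the loss rate that still meets a given bandwidth target
   scales with the square of the MTU.  The sketch below (all numbers
   hypothetical, model constant taken as 1) shows the factor-of-four
   change when the MTU is halved.

      # Illustrative sketch of the inverse-square MTU effect.  Assumes
      # the simple equilibrium model  rate ~ c * MTU / (RTT * sqrt(p)),
      # so the loss rate p consistent with a target rate scales as MTU**2.

      def tolerable_loss(target_bps, mtu_bytes, rtt_sec, c=1.0):
          """Largest loss probability consistent with the target (model only)."""
          return (c * mtu_bytes * 8 / (target_bps * rtt_sec)) ** 2

      if __name__ == "__main__":
          # Hypothetical target: 5 Mb/s over a 70 ms RTT path.
          for mtu in (1500, 750):
              print("MTU %4d: loss must stay below %.2e"
                    % (mtu, tolerable_loss(5e6, mtu, 0.070)))
          # Halving the MTU cuts the tolerable loss rate by a factor of four.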
References

   [RFC2018]    Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow,
                "TCP Selective Acknowledgment Options", RFC 2018,
                October 1996, ftp://ds.internic.net/rfc/rfc2018.txt.

   [Jacobson88] Jacobson, V., "Congestion Avoidance and Control",
                Proceedings of SIGCOMM '88, Stanford, CA, August 1988.

   [Stevens94]  Stevens, W., "TCP/IP Illustrated, Volume 1: The
                Protocols", Addison-Wesley, 1994.

   [Stevens96]  Stevens, W., "TCP Slow Start, Congestion Avoidance,
                Fast Retransmit, and Fast Recovery Algorithms", work in
                progress,
                ftp://ietf.org/internet-drafts/draft-stevens-tcpca-spec-01.txt.

   [Floyd95]    Floyd, S., "TCP and Successive Fast Retransmits",
                February 1995,
                ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

   [Hoe95]      Hoe, J., "Startup Dynamics of TCP's Congestion Control
                and Avoidance Schemes", Master's thesis, Massachusetts
                Institute of Technology, June 1995.

   [mathis96]   Mathis, M. and Mahdavi, J., "Forward Acknowledgment:
                Refining TCP Congestion Control", Proceedings of ACM
                SIGCOMM '96, Stanford, CA, August 1996.

Author's Address

   Matt Mathis
   Pittsburgh Supercomputing Center
   4400 Fifth Ave.
   Pittsburgh, PA 15213
   email: mathis@psc.edu

----------------------------------------------------------------

Appendix A:

   Currently the best existing description of the reference algorithm
   is the "FACK technical note" at
   http://www.psc.edu/networking/tcp.html.  Within TReno, all
   invocations of "bounding parameters" will be reported as warnings.

   The FACK technical note will be revised for TReno, supplemented by a
   code fragment, and included here.
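   Pending that revision, the following fragment is a rough,
   non-normative sketch of the style of congestion window arithmetic
   involved (Slow-start, linear congestion avoidance, and halving on a
   congestion signal).  It is an illustration under simplifying
   assumptions, not the reference algorithm itself.

      # Non-normative sketch of Reno/FACK-style window arithmetic:
      # exponential growth during Slow-start, linear growth in
      # congestion avoidance, and halving on each congestion signal.
      # The reference algorithm is defined by the revised FACK
      # technical note, not by this fragment.

      def step_cwnd(cwnd, ssthresh, acked_pkts, loss):
          """Return (cwnd, ssthresh) after one round trip."""
          if loss:                        # congestion signal: halve the window
              ssthresh = max(cwnd / 2.0, 2.0)
              return ssthresh, ssthresh
          if cwnd < ssthresh:             # Slow-start: one packet per ACK
              return cwnd + acked_pkts, ssthresh
          return cwnd + 1.0, ssthresh     # congestion avoidance: one per RTT

      if __name__ == "__main__":
          cwnd, ssthresh = 1.0, 64.0
          for rtt in range(20):
              loss = (rtt == 10)          # hypothetical single loss
              cwnd, ssthresh = step_cwnd(cwnd, ssthresh, int(cwnd), loss)
              print("RTT %2d: cwnd %.1f" % (rtt, cwnd))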