Network Working Group                                       Matt Mathis
INTERNET-DRAFT                        Pittsburgh Supercomputing Center
Expiration Date: Jan 1998                                     July 1997


                    Empirical Bulk Transfer Capacity
                 < draft-ietf-bmwg-ippm-treno-btc-01.txt >

Status of this Document

   This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Abstract:

   Bulk Transport Capacity (BTC) is a measure of a network's ability to transfer significant quantities of data with a single congestion-aware transport connection (e.g. state-of-the-art TCP). For many applications the BTC of the underlying network dominates the overall elapsed time for the application, and thus dominates the performance as perceived by a user.

   The BTC is a property of an IP cloud (links, routers, switches, etc.) between a pair of hosts. It does not include the hosts themselves (or their transport-layer software). However, congestion control is crucial to the BTC metric because the Internet depends on the end systems to fairly divide the available bandwidth on the basis of common congestion behavior. The BTC metric is based on the performance of a reference congestion control algorithm that has particularly uniform and stable behavior.

Introduction:

   Bulk Transport Capacity (BTC) is a measure of a network's ability to transfer significant quantities of data with a single congestion-aware transport connection (e.g. state-of-the-art TCP). For many applications the BTC of the underlying network dominates the overall elapsed time for the application, and thus dominates the performance as perceived by a user. Examples of such applications include FTP and other network copy utilities.

   The BTC is a property of an IP cloud (links, routers, switches, etc.) between a pair of hosts. It does not include the hosts themselves (or their transport-layer software). However, congestion control is crucial to the BTC metric because the Internet depends on the end systems to fairly divide the available bandwidth on the basis of common congestion behavior.

   Four standard congestion control algorithms are described in RFC2001: Slow-start, Congestion Avoidance, Fast Retransmit and Fast Recovery. Of these algorithms, Congestion Avoidance drives the steady-state bulk transfer behavior of TCP. It calls for opening the congestion window by one segment per round trip time, and halving it on congestion, as signaled by lost segments.

   Slow-start is part of TCP's transient behavior. It is used to quickly bring new or recently timed out connections up to an appropriate congestion window. In Reno TCP, Fast Retransmit and Fast Recovery are used to support the Congestion Avoidance algorithm during recovery from lost segments. During the recovery interval the data receiver sends duplicate acknowledgements, which the data sender must use to identify missing segments as well as to estimate the quantity of outstanding data in the network.
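   As an informal illustration only (this is a sketch of the window dynamics described above, not TReno source code and not part of the metric definition), the following C fragment mimics slow-start, congestion avoidance, and the halving of the window on a congestion signal. All quantities are in segments and all numbers are arbitrary:

      #include <stdio.h>

      static double cwnd = 1.0;      /* congestion window, in segments */
      static double ssthresh = 64.0; /* slow-start threshold, in segments */

      static void on_ack(void)
      {
          if (cwnd < ssthresh)
              cwnd += 1.0;           /* slow-start: window doubles each round trip */
          else
              cwnd += 1.0 / cwnd;    /* congestion avoidance: +1 segment per round trip */
      }

      static void on_congestion_signal(void)
      {
          ssthresh = cwnd / 2.0;     /* remember half the current window */
          cwnd = ssthresh;           /* multiplicative decrease */
      }

      int main(void)
      {
          for (int i = 1; i <= 200; i++) {   /* toy trace of 200 acknowledgements */
              on_ack();
              if (i % 50 == 0)
                  on_congestion_signal();    /* inject a congestion signal */
              if (i % 10 == 0)
                  printf("ack %3d  cwnd %6.2f\n", i, cwnd);
          }
          return 0;
      }

   The per-acknowledgement increment of 1/cwnd during congestion avoidance is what produces the one-segment-per-round-trip opening rate described above.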
   The research community has observed unpredictable or unstable TCP performance caused by errors and uncertainties in the estimation of outstanding data [Lakshman94, Floyd95, Hoe95]. Simulations of reference TCP implementations have uncovered situations where incidental changes in other parts of the network have a large effect on performance [Mathis96]. Other simulations have shown that under some conditions, slightly better networks (higher bandwidth or lower delay) yield lower throughput [This is easy to construct, but has it been published?]. As a consequence, even reference TCP implementations do not make good metrics. Furthermore, many TCP implementations in use in the Internet today have outright bugs which can have arbitrary and unpredictable effects on performance [Comer94, Brakmo95, Paxson97a, Paxson97b].

   The difficulties with using TCP for measurement can be overcome by using the Congestion Avoidance algorithm by itself, in isolation from other algorithms. In [Mathis97] it is shown that the performance of the Congestion Avoidance algorithm can be predicted by a simple analytical model. The model was derived in [Ott96a, Ott96b]. The model predicts the performance of the Congestion Avoidance algorithm as a function of the round trip time, the TCP segment size, and the probability of receiving a congestion signal (i.e. packet loss). The paper shows that the model accurately predicts the performance of TCP using the SACK option [RFC2018] under a wide range of conditions. If losses are isolated (no more than one per round trip) then Fast Recovery successfully estimates the actual congestion window during recovery, and Reno TCP also fits the model.

   This version of the BTC metric is based on the TReno ("tree-no") diagnostic, which implements a protocol-independent version of the Congestion Avoidance algorithm. TReno's internal protocol is designed to accurately implement the Congestion Avoidance algorithm under a very wide range of conditions, and to diagnose timeouts when they interrupt Congestion Avoidance. In [Mathis97] it is observed that TReno fits the same performance model as SACK and Reno TCPs. [Although the paper was written using an older version of TReno, which had less accurate internal measurements.]

   Implementing the Congestion Avoidance algorithm within a diagnostic tool eliminates calibration problems associated with the non-uniformity of current TCP implementations. However, like all empirical metrics it introduces new problems, most notably the need to certify the correctness of the implementation and to verify that there are no systematic errors due to limitations of the tester. Many of the calibration checks can be included in the measurement process itself. The TReno program includes error and warning messages for many conditions that indicate either problems with the infrastructure or in some cases problems with the measurement process. Other checks need to be performed manually.
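   For illustration only, the analytic model cited above ([Mathis97], derived in [Ott96a, Ott96b]) can be written roughly as BW = (MSS/RTT) * C/sqrt(p), where MSS is the segment size, RTT the round trip time, p the probability of a congestion signal, and C a constant of order one that depends on the acknowledgment strategy and loss pattern. A minimal sketch, with every numeric value assumed for the example:

      #include <math.h>
      #include <stdio.h>

      int main(void)
      {
          double mss = 1460.0;   /* segment size, bytes (assumed) */
          double rtt = 0.070;    /* round trip time, seconds (assumed) */
          double p   = 0.0001;   /* probability of a congestion signal (assumed) */
          double C   = 1.2;      /* model constant of order one (assumed) */

          double bw = (mss * 8.0 / rtt) * C / sqrt(p);   /* bits per second */
          printf("predicted rate: %.2f Mbit/s\n", bw / 1e6);
          return 0;
      }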
Metric Name:

   TReno-Type-P-Bulk-Transfer-Capacity (e.g. TReno-UDP-BTC)

Metric Parameters:

   A pair of IP addresses, Src (aka "tester") and Dst (aka "target"), a start time T, and an initial MTU.

Definition:

   The average data rate attained by the Congestion Avoidance algorithm, while using type-P packets to probe the forward (Src to Dst) path. In the case of ICMP ping, these messages also probe the return path.

Metric Units:

   bits per second

Ancillary results:

   * Statistics over the entire test (data transferred, duration and average rate)
   * Statistics over the Congestion Avoidance portion of the test (data transferred, duration and average rate)
   * Path property statistics (MTU, min RTT, max cwnd during Congestion Avoidance and max cwnd during Slow-start)
   * Direct measures of the analytic model parameters (number of congestion signals, average RTT)
   * Indications of which TCP algorithms must be present to attain the same performance.
   * The estimated load/BW/buffering used on the return path
   * Warnings about data transmission abnormalities (e.g. packets out-of-order, events that cause timeouts)
   * Warnings about conditions which may affect metric accuracy (e.g. insufficient tester buffering)
   * Alarms about serious data transmission abnormalities (e.g. data duplicated in the network)
   * Alarms about internal inconsistencies of the tester and events which might invalidate the results.
   * IP address/name of the responding target.
   * TReno version.

Method:

   Run the TReno program on the tester with the chosen packet type addressed to the target. Record both the BTC and the ancillary results.

Manual calibration checks: (See detailed explanations below.)

   * Verify that the tester and target have sufficient raw bandwidth to sustain the test.
   * Verify that the tester and target have sufficient buffering to support the window needed by the test.
   * Verify that there is no other system activity on the tester or target.
   * Verify that the return path is not a bottleneck at the load needed to sustain the test.
   * Verify that the IP address reported in the replies is an appropriate interface of the selected target.

Version control:

   * Record the precise TReno version (-V switch).
   * Record the precise tester OS version, CPU version and speed, and interface type and version.

Discussion:

   Note that the BTC metric is defined specifically to be the average data rate between the source and destination hosts. The ancillary results are designed to detect possible measurement problems, and to help diagnose the network. The ancillary results should not be used as metrics in their own right.

   The current version of TReno does not include an accurate model for TCP timeouts or their effect on average throughput. TReno takes the view that timeouts reflect an abnormality in the network, and should be diagnosed as such.

   There are many possible reasons why a TReno measurement might not agree with the performance obtained by a TCP-based application. Some key ones include: older TCPs missing key algorithms such as MTU discovery, support for large windows or SACK, or mis-tuning of either the data source or sink. Some network conditions which require the newer TCP algorithms are detected by TReno and reported in the ancillary results. Other documents will cover methods to diagnose the difference between TReno and TCP performance.

   It would raise the accuracy of TReno's traceroute mode if the ICMP "TTL exceeded" messages were generated at the target and transmitted along the return path with elevated priority (reduced losses and queuing delays).

   People using the TReno metric as part of procurement documents should be aware that in many circumstances MTU has an intrinsic and large impact on overall path performance. Under some conditions the difficulty in meeting a given performance specification is inversely proportional to the square of the path MTU (e.g. halving the specified MTU makes meeting the bandwidth specification four times harder).
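   A hedged sketch of this MTU-squared effect, using the same approximate model as above: solving BW = (MSS/RTT) * C/sqrt(p) for p shows that the congestion signal probability a path may exhibit while still meeting a rate specification scales with the square of the segment size. All numeric values below are assumed for the example:

      #include <stdio.h>

      int main(void)
      {
          double rtt = 0.070;        /* round trip time, seconds (assumed) */
          double bw  = 10e6 / 8.0;   /* target rate, bytes per second (assumed 10 Mbit/s) */
          double C   = 1.2;          /* model constant of order one (assumed) */
          double mss[] = { 4352.0, 1460.0, 512.0 };  /* candidate segment sizes, bytes */

          for (int i = 0; i < 3; i++) {
              double x = C * mss[i] / (rtt * bw);
              double p = x * x;      /* largest tolerable congestion signal probability */
              printf("MSS %5.0f bytes: loss rate must stay below %.2e\n", mss[i], p);
          }
          return 0;
      }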
   When used as an end-to-end metric, TReno presents exactly the same load to the network as a properly tuned state-of-the-art bulk TCP stream between the same pair of hosts. Although the connection is not transferring useful data, it is no more wasteful than fetching an unwanted web page with the same transfer time.

Calibration checks:

   The following discussion assumes that the TReno diagnostic is implemented as a user-mode program running under a standard operating system. Other implementations, such as those in dedicated measurement instruments, can have stronger built-in calibration checks.

   The raw performance (bandwidth) limitations of both the tester and target should be measured by running TReno in a controlled environment (e.g. a bench test). Ideally the observed performance limits should be validated by diagnosing the nature of the bottleneck and verifying that it agrees with other benchmarks of the tester and target (e.g. that TReno performance agrees with direct measures of backplane or memory bandwidth, or of whatever other bottleneck is appropriate). These raw performance limitations may be obtained in advance and recorded for later reference. Currently no routers are reliable targets, although under some conditions they can be used for meaningful measurements. When testing between a pair of modern computer systems at a few megabits per second or less, the tester and target are unlikely to be the bottleneck.

   TReno may not be accurate, and should not be used as a formal metric, at rates above half of the known tester or target limits. This is because during the initial Slow-start TReno needs to be able to send bursts at twice the average data rate. Likewise, if the link to the first hop is not more than twice as fast as the entire path, some of the path properties, such as max cwnd during Slow-start, may reflect the tester's link interface and not the path itself.

   Verifying that the tester and target have sufficient buffering is difficult. If they do not have sufficient buffer space, then losses at their own queues may contribute to the apparent losses along the path. There are several difficulties in verifying the tester and target buffer capacity. First, there are no good tests of the target's buffer capacity at all. Second, all validation of the tester's buffering depends in some way on the accuracy of reports by the tester's own operating system. Third, there is the confusing result that in many circumstances (particularly when there is much more than sufficient average tester performance) insufficient buffering in the tester does not adversely impact measured performance.

   TReno reports (as calibration alarms) any events in which transmit packets were refused due to insufficient buffer space. It reports a warning if the maximum measured congestion window is larger than the reported buffer space. Although these checks are likely to be sufficient in most cases, they are probably not sufficient in all cases, and will be the subject of future research.

   Note that on a timesharing or multi-tasking system, other activity on the tester introduces burstiness due to operating system scheduler latency. Since some queuing disciplines discriminate against bursty sources, it is important that there be no other system activity during a test. This should be confirmed with other operating-system-specific tools.
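   As a rough aid to the buffering check discussed above (an assumption-laden sketch, not a TReno feature): the window needed to sustain a given rate is approximately the bandwidth-delay product, and the tester's reported buffer space should be at least that large, or the tester's own queue can contribute to the apparent path losses. The rate, RTT and buffer figures below are assumed for the example:

      #include <stdio.h>

      int main(void)
      {
          double target_rate  = 34e6;    /* expected path rate, bits per second (assumed) */
          double rtt          = 0.070;   /* round trip time, seconds (assumed) */
          double buffer_bytes = 65536.0; /* buffer reported by the tester, bytes (assumed) */

          double window = target_rate * rtt / 8.0;   /* bandwidth-delay product, bytes */
          printf("window needed: %.0f bytes, tester buffer: %.0f bytes\n",
                 window, buffer_bytes);
          if (window > buffer_bytes)
              printf("warning: insufficient tester buffering for this test\n");
          return 0;
      }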
   In ICMP mode TReno measures the net effect of both the forward and return paths on a single data stream. Bottlenecks and packet losses in the forward and return paths are treated equally. In traceroute mode, TReno computes and reports the load it contributes to the return path. Unlike real TCP, TReno cannot distinguish between losses on the forward and return paths, so ideally we want the return path to introduce as little loss as possible. A good way to test whether the return path has a large effect on a measurement is to reduce the forward path messages down to ACK size (40 bytes) and verify that the measured packet rate improves by at least a factor of two. [More research is needed.]

References

   [Brakmo95] Brakmo, S., Peterson, L., "Performance problems in BSD4.4 TCP", Proceedings of ACM SIGCOMM '95, October 1995.

   [Comer94] Comer, D., Lin, J., "Probing TCP Implementations", USENIX Summer 1994, June 1994.

   [Floyd95] Floyd, S., "TCP and successive fast retransmits", February 1995. Obtain via ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

   [Hoe95] Hoe, J., "Startup dynamics of TCP's congestion control and avoidance schemes", Master's thesis, Massachusetts Institute of Technology, June 1995.

   [Jacobson88] Jacobson, V., "Congestion Avoidance and Control", Proceedings of ACM SIGCOMM '88, Stanford, CA, August 1988.

   [Mathis96] Mathis, M., Mahdavi, J., "Forward acknowledgment: Refining TCP congestion control", Proceedings of ACM SIGCOMM '96, Stanford, CA, August 1996.

   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP Selective Acknowledgment Options", 1996. Obtain via ftp://ds.internic.net/rfc/rfc2018.txt

   [Mathis97] Mathis, M., Semke, J., Mahdavi, J., Ott, T., "The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm", Computer Communications Review, 27(3), July 1997.

   [Ott96a] Ott, T., Kemperman, J., Mathis, M., "The Stationary Behavior of Ideal TCP Congestion Avoidance", in progress, August 1996. Obtain via pub/tjo/TCPwindow.ps using anonymous ftp to ftp.bellcore.com.

   [Ott96b] Ott, T., Kemperman, J., Mathis, M., "Window Size Behavior in TCP/IP with Constant Loss Probability", DIMACS Special Year on Networks, Workshop on Performance of Real-Time Applications on the Internet, November 1996.

   [Paxson97a] Paxson, V., "Automated Packet Trace Analysis of TCP Implementations", Proceedings of ACM SIGCOMM '97, August 1997.

   [Paxson97b] Paxson, V., editor, "Known TCP Implementation Problems", work in progress: http://reality.sgi.com/sca/tcp-impl/prob-01.txt

   [Stevens94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols", Addison-Wesley, 1994.

   [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms", ftp://ds.internic.net/rfc/rfc2001.txt

Author's Address

   Matt Mathis
   Pittsburgh Supercomputing Center
   4400 Fifth Ave.
   Pittsburgh PA 15213
   email: mathis@psc.edu