Network Working Group                                       Matt Mathis
INTERNET-DRAFT                        Pittsburgh Supercomputing Center
Expiration Date: Jan 1998                                     July 1997


                    Empirical Bulk Transfer Capacity
                 < draft-ietf-bmwg-ippm-treno-btc-01.txt >

Status of this Document

   This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Abstract:

   Bulk Transport Capacity (BTC) is a measure of a network's ability to transfer significant quantities of data with a single congestion-aware transport connection (e.g. state-of-the-art TCP). For many applications the BTC of the underlying network dominates the overall elapsed time for the application, and thus dominates the performance as perceived by a user.

   The BTC is a property of an IP cloud (links, routers, switches, etc.) between a pair of hosts. It does not include the hosts themselves (or their transport-layer software). However, congestion control is crucial to the BTC metric because the Internet depends on the end systems to fairly divide the available bandwidth on the basis of common congestion behavior. The BTC metric is based on the performance of a reference congestion control algorithm that has particularly uniform and stable behavior.

Introduction:

   Bulk Transport Capacity (BTC) is a measure of a network's ability to transfer significant quantities of data with a single congestion-aware transport connection (e.g. state-of-the-art TCP). For many applications the BTC of the underlying network dominates the overall elapsed time for the application, and thus dominates the performance as perceived by a user. Examples of such applications include FTP and other network copy utilities.

   The BTC is a property of an IP cloud (links, routers, switches, etc.) between a pair of hosts. It does not include the hosts themselves (or their transport-layer software). However, congestion control is crucial to the BTC metric because the Internet depends on the end systems to fairly divide the available bandwidth on the basis of common congestion behavior.

   Four standard congestion control algorithms are described in RFC2001: Slow-start, Congestion Avoidance, Fast Retransmit and Fast Recovery. Of these algorithms, Congestion Avoidance drives the steady-state bulk transfer behavior of TCP. It calls for opening the congestion window by one segment per round trip time, and halving it on congestion, as signaled by lost segments.

   Slow-start is part of TCP's transient behavior. It is used to quickly bring new or recently timed out connections up to an appropriate congestion window. In Reno TCP, Fast Retransmit and Fast Recovery are used to support the Congestion Avoidance algorithm during recovery from lost segments. During the recovery interval the data receiver sends duplicate acknowledgements, which the data sender must use to identify missing segments as well as to estimate the quantity of outstanding data in the network.
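   As an informal illustration only (this is a sketch of the window dynamics described above, not TReno source code and not part of the metric definition), the following C fragment mimics slow-start, congestion avoidance, and the halving of the window on a congestion signal. All quantities are in segments and all numbers are arbitrary:

      #include <stdio.h>

      static double cwnd = 1.0;      /* congestion window, in segments */
      static double ssthresh = 64.0; /* slow-start threshold, in segments */

      static void on_ack(void)
      {
          if (cwnd < ssthresh)
              cwnd += 1.0;           /* slow-start: window doubles each round trip */
          else
              cwnd += 1.0 / cwnd;    /* congestion avoidance: +1 segment per round trip */
      }

      static void on_congestion_signal(void)
      {
          ssthresh = cwnd / 2.0;     /* remember half the current window */
          cwnd = ssthresh;           /* multiplicative decrease */
      }

      int main(void)
      {
          for (int i = 1; i <= 200; i++) {   /* toy trace of 200 acknowledgements */
              on_ack();
              if (i % 50 == 0)
                  on_congestion_signal();    /* inject a congestion signal */
              if (i % 10 == 0)
                  printf("ack %3d  cwnd %6.2f\n", i, cwnd);
          }
          return 0;
      }

   The per-acknowledgement increment of 1/cwnd during congestion avoidance is what produces the one-segment-per-round-trip opening rate described above.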
   The research community has observed unpredictable or unstable TCP performance caused by errors and uncertainties in the estimation of outstanding data [Lakshman94, Floyd95, Hoe95]. Simulations of reference TCP implementations have uncovered situations where incidental changes in other parts of the network have a large effect on performance [Mathis96]. Other simulations have shown that under some conditions, slightly better networks (higher bandwidth or lower delay) yield lower throughput [This is easy to construct, but has it been published?]. As a consequence, even reference TCP implementations do not make good metrics. Furthermore, many TCP implementations in use in the Internet today have outright bugs which can have arbitrary and unpredictable effects on performance [Comer94, Brakmo95, Paxson97a, Paxson97b].

   The difficulties with using TCP for measurement can be overcome by using the Congestion Avoidance algorithm by itself, in isolation from other algorithms. In [Mathis97] it is shown that the performance of the Congestion Avoidance algorithm can be predicted by a simple analytical model. The model was derived in [Ott96a, Ott96b]. The model predicts the performance of the Congestion Avoidance algorithm as a function of the round trip time, the TCP segment size, and the probability of receiving a congestion signal (i.e. packet loss). The paper shows that the model accurately predicts the performance of TCP using the SACK option [RFC2018] under a wide range of conditions. If losses are isolated (no more than one per round trip) then Fast Recovery successfully estimates the actual congestion window during recovery, and Reno TCP also fits the model.

   This version of the BTC metric is based on the TReno ("tree-no") diagnostic, which implements a protocol-independent version of the Congestion Avoidance algorithm. TReno's internal protocol is designed to accurately implement the Congestion Avoidance algorithm under a very wide range of conditions, and to diagnose timeouts when they interrupt Congestion Avoidance. In [Mathis97] it is observed that TReno fits the same performance model as SACK and Reno TCPs. [Although the paper was written using an older version of TReno, which had less accurate internal measurements.]

   Implementing the Congestion Avoidance algorithm within a diagnostic tool eliminates calibration problems associated with the non-uniformity of current TCP implementations. However, like all empirical metrics it introduces new problems, most notably the need to certify the correctness of the implementation and to verify that there are no systematic errors due to limitations of the tester. Many of the calibration checks can be included in the measurement process itself. The TReno program includes error and warning messages for many conditions that indicate either problems with the infrastructure or in some cases problems with the measurement process. Other checks need to be performed manually.
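   For illustration only, the analytic model cited above ([Mathis97], derived in [Ott96a, Ott96b]) can be written roughly as BW = (MSS/RTT) * C/sqrt(p), where MSS is the segment size, RTT the round trip time, p the probability of a congestion signal, and C a constant of order one that depends on the acknowledgment strategy and loss pattern. A minimal sketch, with every numeric value assumed for the example:

      #include <math.h>
      #include <stdio.h>

      int main(void)
      {
          double mss = 1460.0;   /* segment size, bytes (assumed) */
          double rtt = 0.070;    /* round trip time, seconds (assumed) */
          double p   = 0.0001;   /* probability of a congestion signal (assumed) */
          double C   = 1.2;      /* model constant of order one (assumed) */

          double bw = (mss * 8.0 / rtt) * C / sqrt(p);   /* bits per second */
          printf("predicted rate: %.2f Mbit/s\n", bw / 1e6);
          return 0;
      }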
Metric Name:

   TReno-Type-P-Bulk-Transfer-Capacity (e.g. TReno-UDP-BTC)

Metric Parameters:

   A pair of IP addresses, Src (aka "tester") and Dst (aka "target"), a start time T, and an initial MTU.

Definition:

   The average data rate attained by the Congestion Avoidance algorithm, while using type-P packets to probe the forward (Src to Dst) path. In the case of ICMP ping, these messages also probe the return path.

Metric Units:

   bits per second

Ancillary results:

   * Statistics over the entire test (data transferred, duration and average rate)
   * Statistics over the Congestion Avoidance portion of the test (data transferred, duration and average rate)
   * Path property statistics (MTU, min RTT, max cwnd during Congestion Avoidance and max cwnd during Slow-start)
   * Direct measures of the analytic model parameters (number of congestion signals, average RTT)
   * Indications of which TCP algorithms must be present to attain the same performance.
   * The estimated load/BW/buffering used on the return path
   * Warnings about data transmission abnormalities (e.g. packets out-of-order, events that cause timeouts)
   * Warnings about conditions which may affect metric accuracy (e.g. insufficient tester buffering)
   * Alarms about serious data transmission abnormalities (e.g. data duplicated in the network)
   * Alarms about internal inconsistencies of the tester and events which might invalidate the results.
   * IP address/name of the responding target.
   * TReno version.

Method:

   Run the TReno program on the tester with the chosen packet type addressed to the target. Record both the BTC and the ancillary results.

Manual calibration checks: (See detailed explanations below.)

   * Verify that the tester and target have sufficient raw bandwidth to sustain the test.
   * Verify that the tester and target have sufficient buffering to support the window needed by the test.
   * Verify that there is no other system activity on the tester or target.
   * Verify that the return path is not a bottleneck at the load needed to sustain the test.
   * Verify that the IP address reported in the replies is an appropriate interface of the selected target.

Version control:

   * Record the precise TReno version (-V switch).
   * Record the precise tester OS version, CPU version and speed, and interface type and version.

Discussion:

   Note that the BTC metric is defined specifically to be the average data rate between the source and destination hosts. The ancillary results are designed to detect possible measurement problems, and to help diagnose the network. The ancillary results should not be used as metrics in their own right.

   The current version of TReno does not include an accurate model for TCP timeouts or their effect on average throughput. TReno takes the view that timeouts reflect an abnormality in the network, and should be diagnosed as such.

   There are many possible reasons why a TReno measurement might not agree with the performance obtained by a TCP-based application. Some key ones include: older TCPs missing key algorithms such as MTU discovery, support for large windows or SACK, or mis-tuning of either the data source or sink. Some network conditions which require the newer TCP algorithms are detected by TReno and reported in the ancillary results. Other documents will cover methods to diagnose the difference between TReno and TCP performance.

   It would raise the accuracy of TReno's traceroute mode if the ICMP "TTL exceeded" messages were generated at the target and transmitted along the return path with elevated priority (reduced losses and queuing delays).

   People using the TReno metric as part of procurement documents should be aware that in many circumstances MTU has an intrinsic and large impact on overall path performance. Under some conditions the difficulty in meeting a given performance specification is inversely proportional to the square of the path MTU (e.g. halving the specified MTU makes meeting the bandwidth specification four times harder).
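   A hedged sketch of this MTU-squared effect, using the same approximate model as above: solving BW = (MSS/RTT) * C/sqrt(p) for p shows that the congestion signal probability a path may exhibit while still meeting a rate specification scales with the square of the segment size. All numeric values below are assumed for the example:

      #include <stdio.h>

      int main(void)
      {
          double rtt = 0.070;        /* round trip time, seconds (assumed) */
          double bw  = 10e6 / 8.0;   /* target rate, bytes per second (assumed 10 Mbit/s) */
          double C   = 1.2;          /* model constant of order one (assumed) */
          double mss[] = { 4352.0, 1460.0, 512.0 };  /* candidate segment sizes, bytes */

          for (int i = 0; i < 3; i++) {
              double x = C * mss[i] / (rtt * bw);
              double p = x * x;      /* largest tolerable congestion signal probability */
              printf("MSS %5.0f bytes: loss rate must stay below %.2e\n", mss[i], p);
          }
          return 0;
      }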
   When used as an end-to-end metric, TReno presents exactly the same load to the network as a properly tuned state-of-the-art bulk TCP stream between the same pair of hosts. Although the connection is not transferring useful data, it is no more wasteful than fetching an unwanted web page with the same transfer time.

Calibration checks:

   The following discussion assumes that the TReno diagnostic is implemented as a user-mode program running under a standard operating system. Other implementations, such as those in dedicated measurement instruments, can have stronger built-in calibration checks.

   The raw performance (bandwidth) limitations of both the tester and target should be measured by running TReno in a controlled environment (e.g. a bench test). Ideally the observed performance limits should be validated by diagnosing the nature of the bottleneck and verifying that it agrees with other benchmarks of the tester and target (e.g. that TReno performance agrees with direct measures of backplane or memory bandwidth, or of whatever other bottleneck is appropriate). These raw performance limitations may be obtained in advance and recorded for later reference. Currently no routers are reliable targets, although under some conditions they can be used for meaningful measurements. When testing between a pair of modern computer systems at a few megabits per second or less, the tester and target are unlikely to be the bottleneck.

   TReno may not be accurate, and should not be used as a formal metric, at rates above half of the known tester or target limits. This is because during the initial Slow-start TReno needs to be able to send bursts at twice the average data rate. Likewise, if the link to the first hop is not more than twice as fast as the entire path, some of the path properties, such as max cwnd during Slow-start, may reflect the tester's link interface and not the path itself.

   Verifying that the tester and target have sufficient buffering is difficult. If they do not have sufficient buffer space, then losses at their own queues may contribute to the apparent losses along the path. There are several difficulties in verifying the tester and target buffer capacity. First, there are no good tests of the target's buffer capacity at all. Second, all validation of the tester's buffering depends in some way on the accuracy of reports by the tester's own operating system. Third, there is the confusing result that in many circumstances (particularly when there is much more than sufficient average tester performance) insufficient buffering in the tester does not adversely impact measured performance.

   TReno reports (as calibration alarms) any events in which transmit packets were refused due to insufficient buffer space. It reports a warning if the maximum measured congestion window is larger than the reported buffer space. Although these checks are likely to be sufficient in most cases, they are probably not sufficient in all cases, and will be the subject of future research.

   Note that on a timesharing or multi-tasking system, other activity on the tester introduces burstiness due to operating system scheduler latency. Since some queuing disciplines discriminate against bursty sources, it is important that there be no other system activity during a test. This should be confirmed with other operating-system-specific tools.
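   As a rough aid to the buffering check discussed above (an assumption-laden sketch, not a TReno feature): the window needed to sustain a given rate is approximately the bandwidth-delay product, and the tester's reported buffer space should be at least that large, or the tester's own queue can contribute to the apparent path losses. The rate, RTT and buffer figures below are assumed for the example:

      #include <stdio.h>

      int main(void)
      {
          double target_rate  = 34e6;    /* expected path rate, bits per second (assumed) */
          double rtt          = 0.070;   /* round trip time, seconds (assumed) */
          double buffer_bytes = 65536.0; /* buffer reported by the tester, bytes (assumed) */

          double window = target_rate * rtt / 8.0;   /* bandwidth-delay product, bytes */
          printf("window needed: %.0f bytes, tester buffer: %.0f bytes\n",
                 window, buffer_bytes);
          if (window > buffer_bytes)
              printf("warning: insufficient tester buffering for this test\n");
          return 0;
      }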
   In ICMP mode TReno measures the net effect of both the forward and return paths on a single data stream. Bottlenecks and packet losses in the forward and return paths are treated equally. In traceroute mode, TReno computes and reports the load it contributes to the return path. Unlike real TCP, TReno cannot distinguish between losses on the forward and return paths, so ideally we want the return path to introduce as little loss as possible. A good way to test whether the return path has a large effect on a measurement is to reduce the forward path messages down to ACK size (40 bytes) and verify that the measured packet rate improves by at least a factor of two. [More research is needed.]

References

   [Brakmo95] Brakmo, S., Peterson, L., "Performance problems in BSD4.4 TCP", Proceedings of ACM SIGCOMM '95, October 1995.

   [Comer94] Comer, D., Lin, J., "Probing TCP Implementations", USENIX Summer 1994, June 1994.

   [Floyd95] Floyd, S., "TCP and successive fast retransmits", February 1995. Obtain via ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

   [Hoe95] Hoe, J., "Startup dynamics of TCP's congestion control and avoidance schemes", Master's thesis, Massachusetts Institute of Technology, June 1995.

   [Jacobson88] Jacobson, V., "Congestion Avoidance and Control", Proceedings of ACM SIGCOMM '88, Stanford, CA, August 1988.

   [Mathis96] Mathis, M., Mahdavi, J., "Forward acknowledgment: Refining TCP congestion control", Proceedings of ACM SIGCOMM '96, Stanford, CA, August 1996.

   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP Selective Acknowledgment Options", 1996. Obtain via ftp://ds.internic.net/rfc/rfc2018.txt

   [Mathis97] Mathis, M., Semke, J., Mahdavi, J., Ott, T., "The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm", Computer Communications Review, 27(3), July 1997.

   [Ott96a] Ott, T., Kemperman, J., Mathis, M., "The Stationary Behavior of Ideal TCP Congestion Avoidance", in progress, August 1996. Obtain via pub/tjo/TCPwindow.ps using anonymous ftp to ftp.bellcore.com.

   [Ott96b] Ott, T., Kemperman, J., Mathis, M., "Window Size Behavior in TCP/IP with Constant Loss Probability", DIMACS Special Year on Networks, Workshop on Performance of Real-Time Applications on the Internet, November 1996.

   [Paxson97a] Paxson, V., "Automated Packet Trace Analysis of TCP Implementations", Proceedings of ACM SIGCOMM '97, August 1997.

   [Paxson97b] Paxson, V., editor, "Known TCP Implementation Problems", work in progress: http://reality.sgi.com/sca/tcp-impl/prob-01.txt

   [Stevens94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols", Addison-Wesley, 1994.

   [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms", ftp://ds.internic.net/rfc/rfc2001.txt

Author's Address

   Matt Mathis
   Pittsburgh Supercomputing Center
   4400 Fifth Ave.
   Pittsburgh PA 15213
   email: mathis@psc.edu