Internet Engineering Task Force                               Phil Karn
INTERNET DRAFT                                               Aaron Falk
                                                              Joe Touch
                                                  Marie-Jose Montpetit
File: draft-ietf-pilc-link-design-00.txt                     June, 1999
Expires: December, 1999

               Advice for Internet Subnetwork Designers

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document provides advice to the designers of digital communication
equipment, link layer protocols and packet switched subnetworks
(collectively referred to as subnetworks) who wish to support the
Internet protocols but who may be unfamiliar with the architecture of
the Internet and the implications of their design choices on the
performance and efficiency of the Internet. This document represents an
evolving consensus of the members of the IETF Performance Implications
of Link Characteristics (PILC) working group.

Introduction and Overview

The Internet Protocol [RFC791] is the core protocol of the world-wide
Internet that defines a simple "connectionless" packet-switched
network. The success of the Internet is largely attributed to the
simplicity of IP, the "end-to-end principle" on which the Internet is
based, and the resulting ease of carrying IP on a wide variety of
subnetworks not necessarily designed with IP in mind.

But while many subnetworks carry IP, they do not necessarily do so with
maximum efficiency, minimum complexity or minimum cost. Nor do they
implement certain features needed to efficiently support newer Internet
features of increasing importance, such as multicasting or quality of
service.

With the explosive growth of the Internet, IP is an increasingly large
fraction of the traffic carried by the world's telecommunications
networks. It therefore makes sense to optimize both existing and new
subnetwork technologies for IP as much as possible.

Optimizing a subnetwork for IP involves three complementary
considerations:

1. Providing functionality sufficient to carry IP.

2. Eliminating unnecessary functions that increase cost or complexity.

3. Choosing subnetwork parameters that maximize the performance of the
   Internet protocols.

Because IP is so simple, consideration 2 is more of an issue than
consideration 1. I.e., subnetwork designers make many more errors of
commission than errors of omission. But certain enhanced Internet
features, such as multicasting and quality-of-service, rely on support
from the underlying subnetworks beyond that necessary to carry
"traditional" unicast, best-effort IP.

A major consideration in the efficient design of any layered
communication network is the appropriate layer(s) in which to implement
a given feature. This issue was first addressed in the seminal paper
"End-to-End Arguments in System Design" [SRC81].
This paper argued that many -- if not most -- network functions are
best implemented on an end-to-end basis, i.e., at the higher protocol
layers. Duplicating these functions at the lower levels is at best
redundant, and can even be harmful. The architecture of the Internet
was heavily influenced by this philosophy, and in our view it was
crucial to the Internet's success.

The remainder of this document discusses the various subnetwork design
issues that the authors consider relevant to efficient IP support.

Maximum Transmission Units (MTUs) and IP Fragmentation

IP packets (datagrams) vary in size from 20 bytes (the size of the IP
header alone) to a maximum of 65535 bytes. Subnetworks need not support
maximum-sized (64KB) IP packets, as IP provides a scheme that breaks
packets that are too large for a given subnetwork into fragments that
travel as independent packets and are reassembled at the destination.
The maximum packet size supported by a subnetwork is known as its
Maximum Transmission Unit (MTU).

Subnetworks may, but are not required to, indicate the lengths of the
packets they carry. One example is Ethernet with the DIX (not IEEE
802.3) header, which lacks a length field to indicate the true data
length when the packet is padded to the 60-byte minimum. This is not a
problem for IP because IP carries its own length field.

In IP version 4 (current IP), fragmentation can occur at either the
sending host or in an intermediate router, and fragments can be further
fragmented at subsequent routers if necessary. In IP version 6,
fragmentation can occur only at the sending host; it cannot occur in a
router.

Both IPv4 and IPv6 provide a "Path MTU Discovery" procedure [RFC????]
that allows the sending host to avoid fragmentation by discovering the
minimum MTU along a given path and reducing its packet sizes
accordingly. This procedure is optional in IPv4 but mandatory in IPv6,
where there is no router fragmentation.

The Path MTU Discovery procedure (and the deletion of router
fragmentation in IPv6) reflects a consensus of the Internet technical
community that IP fragmentation is best avoided. This requires that
subnetworks support MTUs that are "reasonably" large. The smallest MTU
that IPv4 can use is 28 bytes, but this is clearly unreasonable. If a
subnetwork cannot directly support a "reasonable" MTU with native
framing mechanisms, it should internally fragment. That is, it should
transparently break IP packets into internal data elements and
reassemble them at the other end of the subnetwork.

This leaves the question of what is a "reasonable" MTU. Ethernet (10
and 100 Mb/s) has an MTU of 1500 bytes, and because of its ubiquity few
Internet paths have MTUs larger than this value. This severely limits
the utility of larger MTUs provided by other subnetworks. But larger
MTUs are increasingly desirable on high speed subnetworks to reduce the
per-packet processing overhead in host computers, and implementers are
encouraged to provide them even though they may not be usable when
Ethernet is also in the path.

[add specific advice for MTUs on slow and fast networks -- make MTU a
function of speed?]
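As a non-normative illustration of the fragmentation machinery
described above, the following Python sketch splits a datagram payload
into IPv4-style fragments. It assumes a fixed 20-byte header with no
options; because fragment offsets are expressed in 8-byte units, every
fragment except the last must carry a multiple of 8 payload bytes,
which is also why 28 bytes (a 20-byte header plus one 8-byte unit) is
the smallest MTU that can carry any fragment data at all.

   # Illustrative sketch of IPv4-style fragmentation; not a real IP
   # implementation.  Assumes a 20-byte header with no options.

   IP_HEADER = 20  # bytes

   def fragment(payload: bytes, mtu: int):
       """Split 'payload' into fragments that each fit within 'mtu'.

       Returns (offset_in_8_byte_units, more_fragments_flag, data)
       tuples, mirroring the IPv4 fragment offset and MF flag.
       """
       if mtu < IP_HEADER + 8:
           raise ValueError("MTU too small to carry any fragment data")

       # Per-fragment data must be a multiple of 8 bytes, except for
       # the final fragment.
       max_data = ((mtu - IP_HEADER) // 8) * 8

       fragments = []
       offset = 0
       while offset < len(payload):
           data = payload[offset:offset + max_data]
           more = (offset + len(data)) < len(payload)
           fragments.append((offset // 8, more, data))
           offset += len(data)
       return fragments

   # Example: a 1000-byte payload crossing a 576-byte MTU link becomes
   # two fragments carrying 552 and 448 bytes of data.
   if __name__ == "__main__":
       for off, mf, data in fragment(bytes(1000), 576):
           print("offset=%4d bytes  MF=%d  len=%d" % (off * 8, int(mf), len(data)))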
Framing on Connection-Oriented Subnetworks

IP needs a way to mark the beginning and end of each variable-length,
asynchronous IP packet. While connectionless subnetworks generally
provide this feature, many connection-oriented subnetworks do not. Some
examples include:

1. leased lines carrying a synchronous bit stream;

2. ISDN B-channels carrying a synchronous octet stream;

3. dialup telephone modems carrying an asynchronous octet stream; and

4. Asynchronous Transfer Mode (ATM) networks carrying an asynchronous
   stream of fixed-sized "cells".

The Internet community has defined packet framing methods for all these
subnetworks. The Point-To-Point Protocol (PPP) [] is applicable to bit
synchronous, octet synchronous and octet asynchronous links (i.e.,
examples 1-3 above). ATM has its own framing method, described in
[RFC1577].

Because these framing methods are usually implemented partly or wholly
in software, performance may suffer at higher speeds. At progressively
lower speeds, a cell-, octet- or bit-oriented interface to a
connection-oriented subnetwork may be acceptable. The definition of
"low speed" depends on the nature of the hardware interface and the
processing capacity available to implement the necessary framing method
in software.

At high speeds, a subnetwork should provide a framed interface capable
of carrying asynchronous, variable-length IP datagrams. The maximum
packet size supported by this interface is discussed above in the
MTU/Fragmentation section. The subnetwork may implement this facility
in any convenient manner. In particular, IP packet boundaries need not
coincide with any framing or synchronization mechanisms internal to the
subnetwork.

[comments about common packet sizes and internal ATM wastage]

Connection-Oriented Subnetworks

IP has no notion of a "connection"; it is a purely connectionless
protocol. When a connection is required by an application, it is
usually provided by TCP, the Transmission Control Protocol, running
atop IP on an end-to-end basis.

Connection-oriented subnetworks can be (and are) widely used to carry
IP, but often with considerable complexity. Subnetworks with a few
nodes can simply open a permanent connection between each pair of
nodes, as is frequently done with ATM. But the number of connections
grows with the square of the number of nodes, so this is clearly
impractical for large subnetworks. A "shim" layer between IP and the
subnetwork is therefore required to manage connections in the latter.
These shim layers typically open subnetwork connections as needed when
an IP packet is queued for transmission and close them after an idle
timeout, as sketched below. There is no relation between subnetwork
connections and any connections that may exist at higher layers (e.g.,
TCP).

Because Internet traffic is typically bursty and transaction-oriented,
it is often difficult to pick an optimal idle timeout. If the timeout
is too short, subnetwork connections are opened and closed rapidly,
possibly over-stressing the subnetwork's call management system
(especially if it was designed for voice traffic holding times). If the
timeout is too long, subnetwork connections sit idle much of the time,
wasting any resources dedicated to them by the subnetwork.

The ideal subnetwork for IP is connectionless. Connection-oriented
networks that dedicate minimal resources to each connection (e.g., ATM)
are a distant second, and connection-oriented networks that dedicate a
fixed amount of bandwidth to each connection (e.g., the PSTN, including
ISDN) are the least efficient. If such subnetworks must be used to
carry IP, their call-processing systems should be capable of rapid call
set-up and tear-down.
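The following non-normative Python sketch illustrates the kind of shim
layer described above. The class and callback names (OnDemandShim,
open_conn, close_conn) are hypothetical placeholders rather than any
real subnetwork API; the point is simply that subnetwork connections
are created lazily when the first packet for a destination is queued
and are reaped after an idle timeout whose value embodies the tradeoff
discussed above.

   # Hypothetical shim layer: open subnetwork connections on demand,
   # close them after an idle timeout.  Not a real subnetwork API.

   import time

   class OnDemandShim:
       def __init__(self, open_conn, close_conn, idle_timeout=60.0):
           self.open_conn = open_conn        # callable: next_hop -> connection
           self.close_conn = close_conn      # callable: connection -> None
           self.idle_timeout = idle_timeout  # seconds; hard to pick well
           self.conns = {}                   # next_hop -> (connection, last_used)

       def send(self, next_hop, packet):
           conn, _ = self.conns.get(next_hop, (None, None))
           if conn is None:
               conn = self.open_conn(next_hop)   # call set-up on first packet
           self.conns[next_hop] = (conn, time.monotonic())
           conn.send(packet)

       def reap_idle(self):
           """Close connections idle longer than the timeout (run periodically)."""
           now = time.monotonic()
           for next_hop, (conn, last_used) in list(self.conns.items()):
               if now - last_used > self.idle_timeout:
                   self.close_conn(conn)
                   del self.conns[next_hop]

A timeout that is too short causes call churn in send(); one that is
too long leaves connections (and any resources the subnetwork dedicates
to them) sitting idle, which is exactly the tradeoff described above.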
Bandwidth on Demand (BoD) Subnets (Aaron Falk)

Wireless networks, including both satellite and terrestrial, may use
Bandwidth on Demand (BoD). Bandwidth on demand, implemented at the link
layer by Demand Assignment Multiple Access (DAMA) in TDMA systems, is
currently one of the proposed mechanisms for efficiently sharing
limited spectrum resources amongst a large number of users. The design
parameters for BoD are similar to those in connection-oriented
subnetworks; however, the implementations may be very different.

In BoD, the user typically requests access to the shared channel for
some duration. Access may be allocated in terms of a period of time at
a specific rate, a certain number of packets, or until the user chooses
to release the channel. Access may be coordinated through a central
management entity or through a distributed algorithm amongst the users.
The resource shared may be a terrestrial wireless hop, a satellite
uplink, or an end-to-end satellite channel.

Long-delay BoD subnets pose problems similar to connection-oriented
networks in terms of anticipating traffic arrivals. While
connection-oriented subnets hold idle channels open expecting new data
to arrive, BoD subnets request channel access based on buffer occupancy
(or expected buffer occupancy) on the sending port. Poor performance
will likely result if the sender does not anticipate additional traffic
arriving at that port during the time it takes to grant a transmission
request. It is recommended that the algorithm have the capability to
extend a hold on the channel for data that has arrived after the
original request was generated (this may be done by piggybacking new
requests on user data).

There is a wide variety of BoD protocols available, and there has been
relatively little comprehensive research on the interactions between
BoD mechanisms and Internet protocol performance. A tradeoff exists
between the time a user is allowed to hold a channel to drain its port
buffers and the additional latency imposed on other users who are
forced to wait for access to the channel. It is desirable to design
mechanisms that constrain the BoD-imposed latency variation, as this
helps prevent spurious TCP timeouts.

Reliability and Error Control

In the Internet architecture, the ultimate responsibility for error
recovery is at the end points. The Internet may occasionally drop,
corrupt, duplicate or reorder packets, and the transport protocol
(e.g., TCP) or application (e.g., if UDP is used) must recover from
these errors on an end-to-end basis. Error recovery in the subnetwork
is therefore justified only to the extent that it can enhance overall
performance.

Internet transport protocols usually cannot distinguish between packet
loss due to congestion and packet loss due to a subnet or link error;
it is the responsibility of the end-to-end protocol (e.g., TCP) or the
application (if UDP is used) to detect and recover from either event.
Excessive subnetwork packet loss is therefore a performance issue, not
a reliability issue.

[true reliability can only be provided on an end-to-end basis; subnet
reliability can sometimes be justified as a performance enhancement.
Transport protocols must avoid congestion and therefore treat packet
loss as a congestion signal, which implies lousy performance on links
with high random error rates due to noise. Subnet reliability should be
"lightweight", i.e., it only has to be "good enough", *not* perfect.
"good enough" means less than one end-to-end error per round trip time;
transport protocol performance decreases dramatically when this rate is
exceeded. FEC is best implemented in the subnet; interleaving delays <
RTT acceptable]
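The "less than one end-to-end error per round trip time" guideline in
the note above can be turned into a back-of-the-envelope check. The
following non-normative Python helper assumes independent packet losses
and a sender that keeps the path full; the function and parameter names
are illustrative, not drawn from any standard.

   # Rough check of the "less than one error per round-trip time"
   # guideline.  Assumes independent losses and a continuously busy path.

   def losses_per_rtt(bandwidth_bps, rtt_s, packet_bytes, residual_loss_rate):
       """Expected number of lost packets per round-trip time."""
       packets_per_rtt = (bandwidth_bps * rtt_s) / (packet_bytes * 8.0)
       return packets_per_rtt * residual_loss_rate

   # Example: a 2 Mb/s link with a 100 ms RTT and 1500-byte packets
   # carries about 17 packets per RTT.  A residual loss rate of 1e-3
   # gives roughly 0.02 losses per RTT (within the guideline); a rate
   # of 1e-1 gives roughly 1.7 losses per RTT (well beyond it).
   if __name__ == "__main__":
       for p in (1e-3, 1e-1):
           print(p, losses_per_rtt(2000000, 0.1, 1500, p))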
Quality of Service, Fairness vs Performance, Congestion signalling

[subnet hooks for QOS bits]

Delay Characteristics

[self-clocking TCP, (re)transmission shaping]

Bandwidth Asymmetries

Some subnetworks may provide asymmetric bandwidth, and the Internet
protocol suite will generally still work fine. However, there is a case
when such a scenario reduces TCP performance. Since TCP data segments
are "clocked" out by returning acknowledgments, TCP senders are limited
by the rate at which ACKs can be returned [BPK98]. Therefore, when the
ratio of the bandwidth of the channel carrying the data to the
bandwidth of the channel carrying the acknowledgments (ACKs) is too
large, the slow return of the ACKs directly impacts performance. Since
ACKs are generally smaller than data segments, TCP can tolerate some
asymmetry.

One way to cope with asymmetric subnetworks is to increase the size of
the data segments as much as possible. This allows more data to be sent
per ACK, and therefore mitigates the slow flow of ACKs. Using the
delayed acknowledgment mechanism [Bra89], which reduces the number of
ACKs transmitted by the receiver by roughly half, can also improve
performance by reducing the congestion on the ACK channel. Several
other coping strategies exist (ACK filtering, ACK congestion control,
etc.).

Buffering, flow & congestion control

[atm dropping individual cells in a packet means the entire packet must
be dropped]

Compression

[Best done end-to-end. The required processing is more available there,
and the benefits are realized by more network elements. If compression
is provided in a subnetwork, it *must* detect incompressible data and
"get out of the way", i.e., not make the compressed data larger in an
attempt to compress it further, and it must not degrade throughput.
Another consideration: even when the user data is compressible,
subnetwork compression effectiveness is sometimes limited by the speed
of the interface to the subnetwork.]

Packet Reordering

The Internet architecture does not guarantee that packets will arrive
in the same order in which they were originally transmitted. However,
we recommend that subnetworks not gratuitously re-order segments. Since
TCP returns a cumulative acknowledgment (ACK) indicating the last
in-order segment that has arrived, out-of-order segments cause a TCP
receiver to transmit a duplicate acknowledgment. When the TCP sender
notices three duplicate acknowledgments, it assumes that a segment was
dropped by the network and uses the fast retransmit algorithm
[Jac90,APS99] to resend the segment. In addition, the congestion window
is reduced by half, effectively halving TCP's sending rate. If a
subnetwork re-orders segments badly enough that three duplicate ACKs
are generated, the TCP sender needlessly reduces the congestion window,
and therefore performance, as the sketch below illustrates.
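The following non-normative Python sketch models only the
cumulative-ACK behavior described above (it is not a TCP
implementation). It shows how a single segment delayed behind three
later segments produces the three duplicate ACKs that would trigger a
needless fast retransmit and congestion-window reduction at the sender.

   # Toy model of a cumulative-ACK receiver.  acks_for() returns the
   # ACK emitted after each arriving segment; dup_acks() counts the
   # longest run of duplicate ACKs.  Three duplicates would trigger
   # fast retransmit at the sender.

   def acks_for(arrival_order):
       expected, buffered, acks = 0, set(), []
       for seg in arrival_order:
           buffered.add(seg)
           while expected in buffered:
               expected += 1
           acks.append(expected)   # cumulative: next segment wanted
       return acks

   def dup_acks(acks):
       best = run = 0
       for prev, cur in zip(acks, acks[1:]):
           run = run + 1 if cur == prev else 0
           best = max(best, run)
       return best

   if __name__ == "__main__":
       in_order  = [0, 1, 2, 3, 4, 5]
       reordered = [0, 2, 3, 4, 1, 5]   # segment 1 delayed past three others
       print(acks_for(in_order), dup_acks(acks_for(in_order)))    # 0 dup ACKs
       print(acks_for(reordered), dup_acks(acks_for(reordered)))  # 3 dup ACKs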
Mobility

[best provided at a higher layer, for performance and flexibility
reasons, but some subnet mobility can be a convenience as long as it's
not too inefficient with routing]

Multicasting

Similar to the case of broadcast and discovery, multicast is more
efficient on shared links where it is supported natively. Native
multicast support requires a reasonable number (?? - over 10, under
1000?) of separate link-layer broadcast addresses. One such address
SHOULD be reserved for native link broadcast; other addresses SHOULD be
provided to support separate multicast groups (and there SHOULD be at
least 10?? such addresses).

The other criterion for native multicast is a link-layer filter, which
can select individual or sets of broadcast addresses. Such link filters
avoid having every host parse every multicast message in the driver; a
host receives, at the network layer, only those packets that pass its
configured link filters. A shared link SHOULD support multiple,
programmable link filters, to support efficient native multicast. A
sketch of such a filter appears below.

[Multicasting can be simulated over unicast subnets by sending multiple
copies of packets, but this is wasteful. If the subnet can support
native multicasting in an efficient way, it should do so]
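As a non-normative illustration of the programmable link filter
described above, the following Python sketch accepts a frame only if
its destination is the host's own address, the link broadcast address,
or a multicast address the host has joined; anything else would be
discarded in the driver without reaching the network layer. The class
and the addresses shown are examples only.

   # Illustrative link-layer receive filter; not a real driver API.

   BROADCAST = "ff:ff:ff:ff:ff:ff"

   class LinkFilter:
       def __init__(self, own_address):
           self.own_address = own_address
           self.multicast_groups = set()

       def join(self, group_address):
           """Program an additional multicast address into the filter."""
           self.multicast_groups.add(group_address)

       def leave(self, group_address):
           self.multicast_groups.discard(group_address)

       def accept(self, destination_address):
           """Would this frame be passed up to the network layer?"""
           return (destination_address == self.own_address
                   or destination_address == BROADCAST
                   or destination_address in self.multicast_groups)

   if __name__ == "__main__":
       nic = LinkFilter("00:00:0c:12:34:56")
       nic.join("01:00:5e:00:00:09")            # maps from IPv4 group 224.0.0.9
       print(nic.accept("01:00:5e:00:00:09"))   # True: joined group
       print(nic.accept("01:00:5e:00:00:0a"))   # False: filtered in the driver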
Broadcasting and Discovery

Link layers fall into two categories: point-to-point and shared link. A
point-to-point link has exactly two endpoint components (hosts or
gateways); a shared link has more than two, either on an inherently
broadcast medium (e.g., Ethernet, radio) or on a switching layer hidden
from the network layer (switched Ethernet, Myrinet, ATM).

There are a number of Internet protocols which make use of link layer
broadcast capabilities. These include link layer address lookup (ARP),
auto-configuration (RARP, BOOTP, DHCP), and routing (RIP). These
protocols require broadcast-capable links. Shared links SHOULD support
native, link layer subnet broadcast. The lack of broadcast can impede
the performance of these protocols, or in some cases render them
inoperable.

ARP-like link address lookup can be provided by a centralized database,
rather than by owner response to broadcast queries. This comes at the
expense of potentially higher response latency and the need for
explicit knowledge of the ARP server address (no automatic ARP
discovery). For other protocols, if a link does not support broadcast,
the protocol is inoperable. This is the case for DHCP, for example.

Routing

[what is the proper division between routing at the Internet layer and
routing in the subnet? Is it useful or helpful to Internet routing to
have subnetworks that provide their own internal routing?]

Security

[Security mechanisms should be placed as close as possible to the
entities that they protect. E.g., mechanisms that protect host
computers or users should be implemented at the higher layers and
operate on an end-to-end basis under control of the users. This makes
subnet security mechanisms largely redundant unless they are to protect
the subnet itself, e.g., against unauthorized use.]

References

[APS99]   Mark Allman, Vern Paxson, W. Richard Stevens. TCP Congestion
          Control, April 1999. RFC 2581.

[BPK98]   Hari Balakrishnan, Venkata Padmanabhan, Randy H. Katz. The
          Effects of Asymmetry on TCP Performance. ACM Mobile Networks
          and Applications (MONET), 1998.

[Jac90]   Van Jacobson. Modified TCP Congestion Avoidance Algorithm.
          Email to the end2end-interest mailing list, April 1990. URL:
          ftp://ftp.ee.lbl.gov/email/vanj.90apr30.txt.

[SRC81]   Jerome H. Saltzer, David P. Reed and David D. Clark.
          End-to-End Arguments in System Design. Second International
          Conference on Distributed Computing Systems (April, 1981),
          pages 509-512. Published with minor changes in ACM
          Transactions on Computer Systems 2, 4 (November, 1984),
          pages 277-288. Reprinted in Craig Partridge, editor,
          Innovations in Internetworking, Artech House, Norwood, MA,
          1988, pages 195-206. ISBN 0-89006-337-0. Also scheduled to be
          reprinted in Amit Bhargava, editor, Integrated Broadband
          Networks, Artech House, Boston, 1991. ISBN 0-89006-483-0.
          http://people.qualcomm.com/karn/library.html.

[RFC791]  Jon Postel. Internet Protocol, September 1981. RFC 791.

[RFC1577] M. Laubach. Classical IP and ARP over ATM, January 1994.
          RFC 1577.

Authors' Addresses:

Phil Karn (karn@qualcomm.com)
Aaron Falk (afalk@panamsat.com)
Joe Touch (touch@isi.edu)
Marie-Jose Montpetit (marie@teledesic.com)