Internet Engineering Task Force                               Phil Karn
INTERNET DRAFT                                               Aaron Falk
                                                              Joe Touch
                                                  Marie-Jose Montpetit
File: draft-ietf-pilc-link-design-00.txt                     June, 1999
Expires: December, 1999

               Advice for Internet Subnetwork Designers

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document provides advice to the designers of digital communication
equipment, link layer protocols and packet switched subnetworks
(collectively referred to as subnetworks) who wish to support the
Internet protocols but who may be unfamiliar with the architecture of
the Internet and the implications of their design choices on the
performance and efficiency of the Internet. This document represents an
evolving consensus of the members of the IETF Performance Implications
of Link Characteristics (PILC) working group.

Introduction and Overview

The Internet Protocol [RFC791] is the core protocol of the world-wide
Internet that defines a simple "connectionless" packet-switched
network. The success of the Internet is largely attributed to the
simplicity of IP, the "end-to-end principle" on which the Internet is
based, and the resulting ease of carrying IP on a wide variety of
subnetworks not necessarily designed with IP in mind.

But while many subnetworks carry IP, they do not necessarily do so with
maximum efficiency, minimum complexity or minimum cost. Nor do they
implement certain features needed to efficiently support newer Internet
features of increasing importance, such as multicasting or quality of
service.

With the explosive growth of the Internet, IP is an increasingly large
fraction of the traffic carried by the world's telecommunications
networks. It therefore makes sense to optimize both existing and new
subnetwork technologies for IP as much as possible.

Optimizing a subnetwork for IP involves three complementary
considerations:

1. Providing functionality sufficient to carry IP.

2. Eliminating unnecessary functions that increase cost or complexity.

3. Choosing subnetwork parameters that maximize the performance of the
   Internet protocols.

Because IP is so simple, consideration 2 is more of an issue than
consideration 1. I.e., subnetwork designers make many more errors of
commission than errors of omission. But certain enhanced Internet
features, such as multicasting and quality-of-service, rely on support
from the underlying subnetworks beyond that necessary to carry
"traditional" unicast, best-effort IP.

A major consideration in the efficient design of any layered
communication network is the appropriate layer(s) in which to implement
a given feature. This issue was first addressed in the seminal paper
"End-to-End Arguments in System Design" [SRC81].
This paper argued that many -- if not most -- network functions are
best implemented on an end-to-end basis, i.e., at the higher protocol
layers. Duplicating these functions at the lower levels is at best
redundant, and can even be harmful. The architecture of the Internet
was heavily influenced by this philosophy, and in our view it was
crucial to the Internet's success.

The remainder of this document discusses the various subnetwork design
issues that the authors consider relevant to efficient IP support.

Maximum Transmission Units (MTUs) and IP Fragmentation

IP packets (datagrams) vary in size from 20 bytes (the size of the IP
header alone) to a maximum of 65535 bytes. Subnetworks need not support
maximum-sized (64KB) IP packets, as IP provides a scheme that breaks
packets that are too large for a given subnetwork into fragments that
travel as independent packets and are reassembled at the destination.
The maximum packet size supported by a subnetwork is known as its
Maximum Transmission Unit (MTU).

Subnetworks may, but are not required to, indicate the lengths of the
packets they carry. One example is Ethernet with the DIX (not IEEE
802.3) header, which lacks a length field to indicate the true data
length when the packet is padded to the 60-byte minimum. This is not a
problem for IP because IP carries its own length field.

In IP version 4 (current IP), fragmentation can occur at either the
sending host or in an intermediate router, and fragments can be further
fragmented at subsequent routers if necessary. In IP version 6,
fragmentation can occur only at the sending host; it cannot occur in a
router.

Both IPv4 and IPv6 provide a "Path MTU Discovery" procedure [RFC????]
that allows the sending host to avoid fragmentation by discovering the
minimum MTU along a given path and reducing its packet sizes
accordingly. This procedure is optional in IPv4 but mandatory in IPv6,
where there is no router fragmentation.

The Path MTU Discovery procedure (and the deletion of router
fragmentation in IPv6) reflects a consensus of the Internet technical
community that IP fragmentation is best avoided. This requires that
subnetworks support MTUs that are "reasonably" large. The smallest MTU
that IPv4 can use is 28 bytes, but this is clearly unreasonable. If a
subnetwork cannot directly support a "reasonable" MTU with native
framing mechanisms, it should internally fragment. That is, it should
transparently break IP packets into internal data elements and
reassemble them at the other end of the subnetwork.

This leaves the question of what is a "reasonable" MTU. Ethernet (10
and 100 Mb/s) has an MTU of 1500 bytes, and because of its ubiquity few
Internet paths have MTUs larger than this value. This severely limits
the utility of larger MTUs provided by other subnetworks. But larger
MTUs are increasingly desirable on high speed subnetworks to reduce the
per-packet processing overhead in host computers, and implementers are
encouraged to provide them even though they may not be usable when
Ethernet is also in the path.

[add specific advice for MTUs on slow and fast networks -- make MTU a
function of speed?]
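As a non-normative illustration of the fragmentation machinery
described above, the following Python sketch splits a datagram payload
into IPv4-style fragments. It assumes a fixed 20-byte header with no
options; because fragment offsets are expressed in 8-byte units, every
fragment except the last must carry a multiple of 8 payload bytes,
which is also why 28 bytes (a 20-byte header plus one 8-byte unit) is
the smallest MTU that can carry any fragment data at all.

   # Illustrative sketch of IPv4-style fragmentation; not a real IP
   # implementation.  Assumes a 20-byte header with no options.

   IP_HEADER = 20  # bytes

   def fragment(payload: bytes, mtu: int):
       """Split 'payload' into fragments that each fit within 'mtu'.

       Returns (offset_in_8_byte_units, more_fragments_flag, data)
       tuples, mirroring the IPv4 fragment offset and MF flag.
       """
       if mtu < IP_HEADER + 8:
           raise ValueError("MTU too small to carry any fragment data")

       # Per-fragment data must be a multiple of 8 bytes, except for
       # the final fragment.
       max_data = ((mtu - IP_HEADER) // 8) * 8

       fragments = []
       offset = 0
       while offset < len(payload):
           data = payload[offset:offset + max_data]
           more = (offset + len(data)) < len(payload)
           fragments.append((offset // 8, more, data))
           offset += len(data)
       return fragments

   # Example: a 1000-byte payload crossing a 576-byte MTU link becomes
   # two fragments carrying 552 and 448 bytes of data.
   if __name__ == "__main__":
       for off, mf, data in fragment(bytes(1000), 576):
           print("offset=%4d bytes  MF=%d  len=%d" % (off * 8, int(mf), len(data)))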
Framing on Connection-Oriented Subnetworks

IP needs a way to mark the beginning and end of each variable-length,
asynchronous IP packet. While connectionless subnetworks generally
provide this feature, many connection-oriented subnetworks do not. Some
examples include:

1. leased lines carrying a synchronous bit stream;

2. ISDN B-channels carrying a synchronous octet stream;

3. dialup telephone modems carrying an asynchronous octet stream; and

4. Asynchronous Transfer Mode (ATM) networks carrying an asynchronous
   stream of fixed-sized "cells".

The Internet community has defined packet framing methods for all these
subnetworks. The Point-To-Point Protocol (PPP) [] is applicable to bit
synchronous, octet synchronous and octet asynchronous links (i.e.,
examples 1-3 above). ATM has its own framing method, described in
[RFC1577].

Because these framing methods are usually implemented partly or wholly
in software, performance may suffer at higher speeds. At progressively
lower speeds, a cell-, octet- or bit-oriented interface to a
connection-oriented subnetwork may be acceptable. The definition of
"low speed" depends on the nature of the hardware interface and the
processing capacity available to implement the necessary framing method
in software.

At high speeds, a subnetwork should provide a framed interface capable
of carrying asynchronous, variable-length IP datagrams. The maximum
packet size supported by this interface is discussed above in the
MTU/Fragmentation section. The subnetwork may implement this facility
in any convenient manner. In particular, IP packet boundaries need not
coincide with any framing or synchronization mechanisms internal to the
subnetwork.

[comments about common packet sizes and internal ATM wastage]

Connection-Oriented Subnetworks

IP has no notion of a "connection"; it is a purely connectionless
protocol. When a connection is required by an application, it is
usually provided by TCP, the Transmission Control Protocol, running
atop IP on an end-to-end basis.

Connection-oriented subnetworks can be (and are) widely used to carry
IP, but often with considerable complexity. Subnetworks with a few
nodes can simply open a permanent connection between each pair of
nodes, as is frequently done with ATM. But the number of connections
grows with the square of the number of nodes, so this is clearly
impractical for large subnetworks. A "shim" layer between IP and the
subnetwork is therefore required to manage connections in the latter.
These shim layers typically open subnetwork connections as needed when
an IP packet is queued for transmission and close them after an idle
timeout, as sketched below. There is no relation between subnetwork
connections and any connections that may exist at higher layers (e.g.,
TCP).

Because Internet traffic is typically bursty and transaction-oriented,
it is often difficult to pick an optimal idle timeout. If the timeout
is too short, subnetwork connections are opened and closed rapidly,
possibly over-stressing the subnetwork's call management system
(especially if it was designed for voice traffic holding times). If the
timeout is too long, subnetwork connections sit idle much of the time,
wasting any resources dedicated to them by the subnetwork.

The ideal subnetwork for IP is connectionless. Connection-oriented
networks that dedicate minimal resources to each connection (e.g., ATM)
are a distant second, and connection-oriented networks that dedicate a
fixed amount of bandwidth to each connection (e.g., the PSTN, including
ISDN) are the least efficient. If such subnetworks must be used to
carry IP, their call-processing systems should be capable of rapid call
set-up and tear-down.
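The following non-normative Python sketch illustrates the kind of shim
layer described above. The class and callback names (OnDemandShim,
open_conn, close_conn) are hypothetical placeholders rather than any
real subnetwork API; the point is simply that subnetwork connections
are created lazily when the first packet for a destination is queued
and are reaped after an idle timeout whose value embodies the tradeoff
discussed above.

   # Hypothetical shim layer: open subnetwork connections on demand,
   # close them after an idle timeout.  Not a real subnetwork API.

   import time

   class OnDemandShim:
       def __init__(self, open_conn, close_conn, idle_timeout=60.0):
           self.open_conn = open_conn        # callable: next_hop -> connection
           self.close_conn = close_conn      # callable: connection -> None
           self.idle_timeout = idle_timeout  # seconds; hard to pick well
           self.conns = {}                   # next_hop -> (connection, last_used)

       def send(self, next_hop, packet):
           conn, _ = self.conns.get(next_hop, (None, None))
           if conn is None:
               conn = self.open_conn(next_hop)   # call set-up on first packet
           self.conns[next_hop] = (conn, time.monotonic())
           conn.send(packet)

       def reap_idle(self):
           """Close connections idle longer than the timeout (run periodically)."""
           now = time.monotonic()
           for next_hop, (conn, last_used) in list(self.conns.items()):
               if now - last_used > self.idle_timeout:
                   self.close_conn(conn)
                   del self.conns[next_hop]

A timeout that is too short causes call churn in send(); one that is
too long leaves connections (and any resources the subnetwork dedicates
to them) sitting idle, which is exactly the tradeoff described above.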
Bandwidth on Demand (BoD) Subnets (Aaron Falk)

Wireless networks, including both satellite and terrestrial, may use
Bandwidth on Demand (BoD). Bandwidth on demand, implemented at the link
layer by Demand Assignment Multiple Access (DAMA) in TDMA systems, is
currently one of the proposed mechanisms for efficiently sharing
limited spectrum resources amongst a large number of users. The design
parameters for BoD are similar to those in connection-oriented
subnetworks; however, the implementations may be very different.

In BoD, the user typically requests access to the shared channel for
some duration. Access may be allocated in terms of a period of time at
a specific rate, a certain number of packets, or until the user chooses
to release the channel. Access may be coordinated through a central
management entity or through a distributed algorithm amongst the users.
The resource shared may be a terrestrial wireless hop, a satellite
uplink, or an end-to-end satellite channel.

Long-delay BoD subnets pose problems similar to connection-oriented
networks in terms of anticipating traffic arrivals. While
connection-oriented subnets hold idle channels open expecting new data
to arrive, BoD subnets request channel access based on buffer occupancy
(or expected buffer occupancy) on the sending port. Poor performance
will likely result if the sender does not anticipate additional traffic
arriving at that port during the time it takes to grant a transmission
request. It is recommended that the algorithm have the capability to
extend a hold on the channel for data that has arrived after the
original request was generated (this may be done by piggybacking new
requests on user data).

There is a wide variety of BoD protocols available, and there has been
relatively little comprehensive research on the interactions between
BoD mechanisms and Internet protocol performance. A tradeoff exists
between the time a user is allowed to hold a channel to drain its port
buffers and the additional latency imposed on other users who are
forced to wait for access to the channel. It is desirable to design
mechanisms that constrain the BoD-imposed latency variation, as this
helps prevent spurious TCP timeouts.

Reliability and Error Control

In the Internet architecture, the ultimate responsibility for error
recovery is at the end points. The Internet may occasionally drop,
corrupt, duplicate or reorder packets, and the transport protocol
(e.g., TCP) or application (e.g., if UDP is used) must recover from
these errors on an end-to-end basis. Error recovery in the subnetwork
is therefore justified only to the extent that it can enhance overall
performance.

Internet transport protocols usually cannot distinguish between packet
loss due to congestion and packet loss due to a subnet or link error;
it is the responsibility of the end-to-end protocol (e.g., TCP) or the
application (if UDP is used) to detect and recover from either event.
Excessive subnetwork packet loss is therefore a performance issue, not
a reliability issue.

[true reliability can only be provided on an end-to-end basis; subnet
reliability can sometimes be justified as a performance enhancement.
Transport protocols must avoid congestion and therefore treat packet
loss as a congestion signal, which implies lousy performance on links
with high random error rates due to noise. Subnet reliability should be
"lightweight", i.e., it only has to be "good enough", *not* perfect.
"good enough" means less than one end-to-end error per round trip time;
transport protocol performance decreases dramatically when this rate is
exceeded. FEC is best implemented in the subnet; interleaving delays <
RTT acceptable]
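The "less than one end-to-end error per round trip time" guideline in
the note above can be turned into a back-of-the-envelope check. The
following non-normative Python helper assumes independent packet losses
and a sender that keeps the path full; the function and parameter names
are illustrative, not drawn from any standard.

   # Rough check of the "less than one error per round-trip time"
   # guideline.  Assumes independent losses and a continuously busy path.

   def losses_per_rtt(bandwidth_bps, rtt_s, packet_bytes, residual_loss_rate):
       """Expected number of lost packets per round-trip time."""
       packets_per_rtt = (bandwidth_bps * rtt_s) / (packet_bytes * 8.0)
       return packets_per_rtt * residual_loss_rate

   # Example: a 2 Mb/s link with a 100 ms RTT and 1500-byte packets
   # carries about 17 packets per RTT.  A residual loss rate of 1e-3
   # gives roughly 0.02 losses per RTT (within the guideline); a rate
   # of 1e-1 gives roughly 1.7 losses per RTT (well beyond it).
   if __name__ == "__main__":
       for p in (1e-3, 1e-1):
           print(p, losses_per_rtt(2000000, 0.1, 1500, p))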
Quality of Service, Fairness vs Performance, Congestion signalling

[subnet hooks for QOS bits]

Delay Characteristics

[self-clocking TCP, (re)transmission shaping]

Bandwidth Asymmetries

Some subnetworks may provide asymmetric bandwidth, and the Internet
protocol suite will generally still work fine. However, there is a case
when such a scenario reduces TCP performance. Since TCP data segments
are "clocked" out by returning acknowledgments, TCP senders are limited
by the rate at which ACKs can be returned [BPK98]. Therefore, when the
ratio of the bandwidth of the channel carrying the data to the
bandwidth of the channel carrying the acknowledgments (ACKs) is too
large, the slow return of the ACKs directly impacts performance. Since
ACKs are generally smaller than data segments, TCP can tolerate some
asymmetry.

One way to cope with asymmetric subnetworks is to increase the size of
the data segments as much as possible. This allows more data to be sent
per ACK, and therefore mitigates the slow flow of ACKs. Using the
delayed acknowledgment mechanism [Bra89], which reduces the number of
ACKs transmitted by the receiver by roughly half, can also improve
performance by reducing the congestion on the ACK channel. Several
other coping strategies exist (ACK filtering, ACK congestion control,
etc.).

Buffering, flow & congestion control

[atm dropping individual cells in a packet means the entire packet must
be dropped]

Compression

[Best done end-to-end. The required processing is more available there,
and the benefits are realized by more network elements. If compression
is provided in a subnetwork, it *must* detect incompressible data and
"get out of the way", i.e., not make the compressed data larger in an
attempt to compress it further, and it must not degrade throughput.
Another consideration: even when the user data is compressible,
subnetwork compression effectiveness is sometimes limited by the speed
of the interface to the subnetwork.]

Packet Reordering

The Internet architecture does not guarantee that packets will arrive
in the same order in which they were originally transmitted. However,
we recommend that subnetworks not gratuitously re-order segments. Since
TCP returns a cumulative acknowledgment (ACK) indicating the last
in-order segment that has arrived, out-of-order segments cause a TCP
receiver to transmit a duplicate acknowledgment. When the TCP sender
notices three duplicate acknowledgments, it assumes that a segment was
dropped by the network and uses the fast retransmit algorithm
[Jac90,APS99] to resend the segment. In addition, the congestion window
is reduced by half, effectively halving TCP's sending rate. If a
subnetwork re-orders segments badly enough that three duplicate ACKs
are generated, the TCP sender needlessly reduces the congestion window,
and therefore performance, as the sketch below illustrates.
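The following non-normative Python sketch models only the
cumulative-ACK behavior described above (it is not a TCP
implementation). It shows how a single segment delayed behind three
later segments produces the three duplicate ACKs that would trigger a
needless fast retransmit and congestion-window reduction at the sender.

   # Toy model of a cumulative-ACK receiver.  acks_for() returns the
   # ACK emitted after each arriving segment; dup_acks() counts the
   # longest run of duplicate ACKs.  Three duplicates would trigger
   # fast retransmit at the sender.

   def acks_for(arrival_order):
       expected, buffered, acks = 0, set(), []
       for seg in arrival_order:
           buffered.add(seg)
           while expected in buffered:
               expected += 1
           acks.append(expected)   # cumulative: next segment wanted
       return acks

   def dup_acks(acks):
       best = run = 0
       for prev, cur in zip(acks, acks[1:]):
           run = run + 1 if cur == prev else 0
           best = max(best, run)
       return best

   if __name__ == "__main__":
       in_order  = [0, 1, 2, 3, 4, 5]
       reordered = [0, 2, 3, 4, 1, 5]   # segment 1 delayed past three others
       print(acks_for(in_order), dup_acks(acks_for(in_order)))    # 0 dup ACKs
       print(acks_for(reordered), dup_acks(acks_for(reordered)))  # 3 dup ACKs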
Mobility

[best provided at a higher layer, for performance and flexibility
reasons, but some subnet mobility can be a convenience as long as it's
not too inefficient with routing]

Multicasting

Similar to the case of broadcast and discovery, multicast is more
efficient on shared links where it is supported natively. Native
multicast support requires a reasonable number (?? - over 10, under
1000?) of separate link-layer broadcast addresses. One such address
SHOULD be reserved for native link broadcast; other addresses SHOULD be
provided to support separate multicast groups (and there SHOULD be at
least 10?? such addresses).

The other criterion for native multicast is a link-layer filter, which
can select individual or sets of broadcast addresses. Such link filters
avoid having every host parse every multicast message in the driver; a
host receives, at the network layer, only those packets that pass its
configured link filters. A shared link SHOULD support multiple,
programmable link filters, to support efficient native multicast. A
sketch of such a filter appears below.

[Multicasting can be simulated over unicast subnets by sending multiple
copies of packets, but this is wasteful. If the subnet can support
native multicasting in an efficient way, it should do so]
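As a non-normative illustration of the programmable link filter
described above, the following Python sketch accepts a frame only if
its destination is the host's own address, the link broadcast address,
or a multicast address the host has joined; anything else would be
discarded in the driver without reaching the network layer. The class
and the addresses shown are examples only.

   # Illustrative link-layer receive filter; not a real driver API.

   BROADCAST = "ff:ff:ff:ff:ff:ff"

   class LinkFilter:
       def __init__(self, own_address):
           self.own_address = own_address
           self.multicast_groups = set()

       def join(self, group_address):
           """Program an additional multicast address into the filter."""
           self.multicast_groups.add(group_address)

       def leave(self, group_address):
           self.multicast_groups.discard(group_address)

       def accept(self, destination_address):
           """Would this frame be passed up to the network layer?"""
           return (destination_address == self.own_address
                   or destination_address == BROADCAST
                   or destination_address in self.multicast_groups)

   if __name__ == "__main__":
       nic = LinkFilter("00:00:0c:12:34:56")
       nic.join("01:00:5e:00:00:09")            # maps from IPv4 group 224.0.0.9
       print(nic.accept("01:00:5e:00:00:09"))   # True: joined group
       print(nic.accept("01:00:5e:00:00:0a"))   # False: filtered in the driver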
Broadcasting and Discovery

Link layers fall into two categories: point-to-point and shared link. A
point-to-point link has exactly two endpoint components (hosts or
gateways); a shared link has more than two, either on an inherently
broadcast medium (e.g., Ethernet, radio) or on a switching layer hidden
from the network layer (switched Ethernet, Myrinet, ATM).

There are a number of Internet protocols which make use of link layer
broadcast capabilities. These include link layer address lookup (ARP),
auto-configuration (RARP, BOOTP, DHCP), and routing (RIP). These
protocols require broadcast-capable links. Shared links SHOULD support
native, link layer subnet broadcast. The lack of broadcast can impede
the performance of these protocols, or in some cases render them
inoperable.

ARP-like link address lookup can be provided by a centralized database,
rather than by owner response to broadcast queries. This comes at the
expense of potentially higher response latency and the need for
explicit knowledge of the ARP server address (no automatic ARP
discovery). For other protocols, if a link does not support broadcast,
the protocol is inoperable. This is the case for DHCP, for example.

Routing

[what is the proper division between routing at the Internet layer and
routing in the subnet? Is it useful or helpful to Internet routing to
have subnetworks that provide their own internal routing?]

Security

[Security mechanisms should be placed as close as possible to the
entities that they protect. E.g., mechanisms that protect host
computers or users should be implemented at the higher layers and
operate on an end-to-end basis under control of the users. This makes
subnet security mechanisms largely redundant unless they are to protect
the subnet itself, e.g., against unauthorized use.]

References

[APS99]   Mark Allman, Vern Paxson, W. Richard Stevens. TCP Congestion
          Control, April 1999. RFC 2581.

[BPK98]   Hari Balakrishnan, Venkata Padmanabhan, Randy H. Katz. The
          Effects of Asymmetry on TCP Performance. ACM Mobile Networks
          and Applications (MONET), 1998.

[Jac90]   Van Jacobson. Modified TCP Congestion Avoidance Algorithm.
          Email to the end2end-interest mailing list, April 1990. URL:
          ftp://ftp.ee.lbl.gov/email/vanj.90apr30.txt.

[SRC81]   Jerome H. Saltzer, David P. Reed and David D. Clark.
          End-to-End Arguments in System Design. Second International
          Conference on Distributed Computing Systems (April, 1981),
          pages 509-512. Published with minor changes in ACM
          Transactions on Computer Systems 2, 4 (November, 1984),
          pages 277-288. Reprinted in Craig Partridge, editor,
          Innovations in Internetworking, Artech House, Norwood, MA,
          1988, pages 195-206. ISBN 0-89006-337-0. Also scheduled to be
          reprinted in Amit Bhargava, editor, Integrated Broadband
          Networks, Artech House, Boston, 1991. ISBN 0-89006-483-0.
          http://people.qualcomm.com/karn/library.html.

[RFC791]  Jon Postel. Internet Protocol, September 1981. RFC 791.

[RFC1577] M. Laubach. Classical IP and ARP over ATM, January 1994.
          RFC 1577.

Authors' Addresses:

Phil Karn (karn@qualcomm.com)
Aaron Falk (afalk@panamsat.com)
Joe Touch (touch@isi.edu)
Marie-Jose Montpetit (marie@teledesic.com)