Network Working Group                                         R. Whittle
Internet-Draft                                          First Principles
Intended status: Experimental                             March 27, 2007
Expires: September 28, 2007


   SRAM-based IP Forwarding Eliminates the Need for Route Aggregation
                draft-whittle-sram-ip-forwarding-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 28, 2007.

Copyright Notice

   Copyright (C) The IETF Trust (2007).


Whittle                Expires September 28, 2007               [Page 1]

Internet-Draft          SRAM-based IP Forwarding              March 2007


Abstract

   I propose a simple, low-cost, low-power, Static RAM (SRAM) based
   architecture for the Forwarding Information Base (FIB) function of
   transit and border routers in the Default Free Zone (DFZ) of the
   Internet.  This will provide direct hardware forwarding irrespective
   of the size of the "global BGP routing table", within the current
   IPv4 convention of limiting advertised prefixes to no longer than
   /24.  Routers with this or a similar architecture provide the only
   elegant hardware solution to the problem of route disaggregation,
   which is unavoidable due to increasing numbers of ISPs and end-users
   who need to advertise their prefixes on topologically diverse parts
   of the network, for purposes including multihoming and traffic
   engineering.

   Router hardware limitations with respect to route disaggregation
   could also be eliminated for IPv6 by adding further SRAMs, provided
   the existing 2000::/3 global unicast allocations are reallocated to a
   smaller range, for instance 2000::/10.  This would provide for
   Provider Independent /32 allocations to 4 million ISPs and multihomed
   end-users.  Each /32 assignment could be advertised as up to eight
   /35 prefixes - each of which provides 8192 /48 user networks.


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
   2.  Summary  . . . . . . . . . . . . . . . . . . . . . . . . . . .  6
   3.  Route Aggregation and FIB Technologies . . . . . . . . . . . .  8
     3.1.  Route aggregation  . . . . . . . . . . . . . . . . . . . .  8
     3.2.  Route disaggregation . . . . . . . . . . . . . . . . . . .  9
     3.3.  The Default Free Zone  . . . . . . . . . . . . . . . . . .  9
     3.4.  Routing Information Base (RIB) . . . . . . . . . . . . . . 10
     3.5.  Forwarding Information Base (FIB)  . . . . . . . . . . . . 11
     3.6.  Forwarding Equivalent Class (FEC)  . . . . . . . . . . . . 11
     3.7.  Linear search list implementation of the FIB . . . . . . . 12
     3.8.  Tree-structured implementation of the FIB  . . . . . . . . 12
     3.9.  TCAM implementation of the FIB . . . . . . . . . . . . . . 13
       3.9.1.  TCAM devices . . . . . . . . . . . . . . . . . . . . . 13
       3.9.2.  TCAM and SRAM produce FEC  . . . . . . . . . . . . . . 15
       3.9.3.  TCAM power consumption and other problems  . . . . . . 15
       3.9.4.  TCAM capacity is driven by route disaggregation  . . . 16
     3.10. Proposed SRAM architecture for the FIB . . . . . . . . . . 17
   4.  The Crisis in Routing and Addressing . . . . . . . . . . . . . 19
     4.1.  Scalability  . . . . . . . . . . . . . . . . . . . . . . . 19
     4.2.  Addressing, Topology and Rekhter's Law . . . . . . . . . . 19
     4.3.  IPv6 - Future Routing Swamp? . . . . . . . . . . . . . . . 20
     4.4.  Costs, Benefits and IETF Policy  . . . . . . . . . . . . . 20
     4.5.  Multihoming is mandatory for ISPs and many users . . . . . 20
     4.6.  Who, Where and extending the TCP/IP protocols  . . . . . . 20
     4.7.  Moore's Law  . . . . . . . . . . . . . . . . . . . . . . . 21
     4.8.  Power consumption and heat dissipation . . . . . . . . . . 21
     4.9.  Incremental changes  . . . . . . . . . . . . . . . . . . . 21
   5.  SRAM-based FIB for IPv4  . . . . . . . . . . . . . . . . . . . 23
     5.1.  The SRAM chip  . . . . . . . . . . . . . . . . . . . . . . 23
     5.2.  Encoding FEC . . . . . . . . . . . . . . . . . . . . . . . 24
     5.3.  IPv4 usage and policy  . . . . . . . . . . . . . . . . . . 26
     5.4.  BGP performance and stability  . . . . . . . . . . . . . . 26
     5.5.  Alternative arrangements for IPv4  . . . . . . . . . . . . 27
   6.  SRAM-based FIB and addressing changes for IPv6 . . . . . . . . 28
     6.1.  Impractical with current global unicast allocations  . . . 28
     6.2.  Reallocating IPv6 global unicast addresses . . . . . . . . 28
   7.  Security Considerations  . . . . . . . . . . . . . . . . . . . 30
   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 31
   9.  Informative References . . . . . . . . . . . . . . . . . . . . 32
   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 34
   Intellectual Property and Copyright Statements . . . . . . . . . . 35


1.  Introduction

   The purpose of this Internet Draft is to argue that future designs of
   high-end routers used in the Internet's Default Free Zone (DFZ) be
   equipped with a Static RAM (SRAM) based hardware architecture to
   achieve single clock cycle Forwarding Information Base (FIB)
   classification of incoming packets.  This architecture would be
   tailored to current global administrative arrangements regarding IPv4
   address management and routing.  An extension to this architecture
   for IPv6 would match a proposed compact reallocation of IPv6 global
   unicast addresses.

   Based on a single 72 Mbit SRAM, which draws 1.4 watts and costs
   USD$70, this system can classify 250M IPv4 packets/sec, selecting
   one of 14 interfaces for each packet to be forwarded to.  By mapping
   destination address bits 31 to 8 to the SRAM
   address, the correct interface number can be read in 4ns, and can be
   different for each of the 14.6 million /24 prefixes in which the IPv4
   address space could be separately advertised.  This system can be
   extended with a second SRAM to routers with up to 510 interfaces.
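
   As an illustration only, the direct-indexed lookup described above
   can be sketched in Python.  The table layout (one byte per /24
   rather than a packed hardware word), the FEC code values and the
   function names are assumptions made for this sketch, not details
   from this document.  The table has 2^24 = 16,777,216 entries, one
   per possible /24, and one read every 4ns corresponds to 250 million
   lookups per second.

```python
import ipaddress

TABLE_SIZE = 1 << 24          # one entry for every possible /24 prefix
FEC_DROP = 0                  # assumed code value: drop the packet

# One byte per /24 for clarity; hardware would pack a few bits per entry.
fib = bytearray(TABLE_SIZE)   # every entry starts as FEC_DROP


def install(prefix: str, fec: int) -> None:
    """Write `fec` into every /24 slot covered by `prefix` (/24 or shorter)."""
    net = ipaddress.ip_network(prefix)
    if net.version != 4 or net.prefixlen > 24:
        raise ValueError("only IPv4 prefixes of /24 or shorter fit the table")
    first = int(net.network_address) >> 8        # address bits 31 to 8
    count = 1 << (24 - net.prefixlen)            # number of /24s covered
    fib[first:first + count] = bytes([fec]) * count


def lookup(dst: str) -> int:
    """Return the FEC for a destination: one table read, no searching."""
    return fib[int(ipaddress.IPv4Address(dst)) >> 8]


# Shorter prefixes are installed first so that longer ones overwrite them.
install("10.0.0.0/8", 3)
install("10.1.2.0/24", 7)
print(lookup("10.1.2.99"), lookup("10.200.0.1"), lookup("192.0.2.1"))
# prints: 7 3 0
```

   In this sketch the longest-prefix decision is made when the table
   is written, not when a packet arrives, which is why each lookup
   needs only a single read.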

   I propose that, by carefully coordinating global policies for address
   allocation and BGP prefix advertisement with a new standard of router
   hardware optimised for the Internet's DFZ, one of the major
   threats to Internet communications can be eliminated.  This is the
   growth in what is often referred to as the "global BGP routing
   table", which is projected to overwhelm the hardware capabilities of
   transit and border routers in the next five or so years.

   Part of the problem of routing table growth is the increased BGP
   traffic and the stability, memory and CPU utilisation problems this
   entails.  The other part of the problem is that the current
   generation of routers will soon be unable to support full line-speed
   packet rates with a sufficiently large number of FIB entries.
   Efforts to constrain routing table growth are likely to fail because
   a growing number of ISPs and end-users can only achieve their
   robustness and performance imperatives by separately advertising
   prefixes for purposes including multihoming and traffic engineering.

   This looming crisis was the subject of a two day IAB Routing and
   Addressing Workshop, held in Amsterdam, Netherlands, in October 2006
   [I-D.iab-raws-report] [IAB-RAWS-website].  The participants foresaw
   no solution to the problems.  They were unable to see how routing
   efficiency could be maintained with the growing advertisement of
   prefixes at locations where network topology largely prevents them
   from being aggregated.  Yet the needs of multihomed organisations for
   stable, provider independent (PI) IP addresses with multiple
   topologically diverse upstream connections necessarily result in
   many advertised prefixes not aggregating with neighbouring prefixes,
   except perhaps to distant routers.

   I first review the problems related to routing, including hardware
   forwarding, routing table management in routers, the BGP
   protocol and how address allocation has to date been constrained by
   the need for route aggregation.  I then describe the scope of the
   proposed upgrade to routers.  I present a single-chip design for IPv4
   for small routers - those with 14 or fewer interfaces - and discuss
   its expansion for larger routers.  I then propose an IPv6
   implementation with a particular set of constraints on IPv6 global
   unicast allocation.  Following this is a discussion of some other
   options which might be considered now and for the future.  Finally, I
   suggest a timeline by which the proposed changes to policy might be
   made to give router manufacturers and their customers confidence,
   firstly in the continued routability of the Internet and secondly in
   the capacity of a new generation of routers to handle the demands of
   growing Internet traffic for longer than the current typical 5 year
   lifespan.  In Appendix 1 I discuss some low-level hardware details of
   how the proposed architecture could be implemented with currently
   available SRAMs.


2.  Summary

   This section summarises the main points of this Internet Draft,
   stripped of most of the details and qualifications.

   SRAM table-based FIBs were never considered practical because they
   couldn't handle the full range of router functionality.  As a result,
   it is widely believed that the only way to route packets is with a
   small enough number of rules to fit into TCAM (Ternary Content
   Addressable Memory) or iterative tree-structured FIB systems.  This
   has led to two decades of apparently absolute belief in the
   requirement for route aggregation - a position perhaps strengthened
   by various unsuccessful attempts to find alternatives by introducing
   new elements into the TCP/IP protocols.

   There is a rapidly growing number of ISPs and end-users who
   absolutely require multihoming (of entire networks, not just
   individual nodes as SHIM6 is intended to provide) and who also
   strongly desire or need some traffic engineering capacity.  These can
   only be provided by this increasing number of users having an
   increasing number of prefixes which they advertise in topologically
   diverse ways - which is completely at odds with the requirement of
   route aggregation.

   The Internet routing system is a shared global resource - since all
   users depend upon it, burden it with traffic and pay for it
   indirectly.  So there has been an increasing concern about the future
   of the Internet, manifesting in calls for greater pressure or
   constraints to be placed upon ISPs and end-users to curtail their
   multihoming and prefix-splitting ways.  In the absence of any
   constraints to route disaggregation and assuming that router
   technology does not change in principle, then the capacity of routers
   in the DFZ to handle traffic in the future clearly depends on
   continued purchases of new routers.  An attempt has been made to
   secure IPv6 routability by banning end-users from having Provider
   Independent (PI) addresses.  This places the custodians of the
   Internet (RIRs and whoever acts to protect the interests of operators
   of routers in the DFZ) in opposition with the immediate interests of
   most large ISPs and other AS operators.  Being the Fun-Police in the
   global Internet is a thankless - and probably futile - task.

   Fortunately, modern SRAM chips - which are far beyond what could have
   been imagined when IPv4 or even IPv6 was designed - are a perfect fit
   for a fast, low-power, elegant, simple and easy-to-program hardware-
   based FIB system for existing IPv4 usage in the DFZ.  An SRAM-based
   FIB needs to be an adjunct to existing TCAM etc. architectures,
   rather than replacing them, because these systems are capable of many
   important tasks the SRAM system can't perform.  However, for most or
   all of the traffic of transit routers - and for the upstream traffic
   of border routers - the SRAM-based FIB is sufficient for almost all
   packets.  An SRAM-based system needs to be able to map particular
   prefixes to be handled by the traditional TCAM (or similar) FIB
   architecture, for instance to handle packets which match a small
   number of longer prefixes.

   It is difficult or impossible to imagine an easier hardware solution
   than this SRAM-based architecture for simple IP forwarding in the
   DFZ.  It is faster and simpler than MPLS, and requires no changes to
   current IPv4 usage.  The cost of implementing this or a similar
   approach for IPv4 will be far less than not implementing it and
   having to pay much higher prices for routers with power-hungry, over-
   complex TCAM or other types of FIB, scaled up for the larger routing
   tables of the near future.  Nonetheless, border and transit routers
   will typically require some TCAM or other more flexible systems to
   cope with MPLS and/or the few prefixes which are longer than /24.

   By prompting discussion of this architecture now, I hope to hasten
   its arrival for IPv4.  I also suggest that this SRAM FIB architecture
   is the only way IPv6 traffic can be globally routed if and when it
   becomes widely adopted.  So I argue that urgent consideration be
   given to a new, standardised, SRAM-based FIB architecture for routers
   in the DFZ and to the re-allocation of IPv6 global unicast addresses
   to suit this architecture.

   Assuming BGP and FIB software can be made stable and responsive with
   much larger numbers of prefixes and updates, I propose that the new
   FIB architecture will enable a much more extensive splitting up of
   the IPv4 address space into freely advertisable and generally small
   PI prefixes.  This would be anathema to route aggregation principles
   which prevail in the absence of an SRAM-based FIB architecture - but
   should enable a much more flexible and efficient use to be made of
   the IPv4 address space, while enabling ISPs and end-users to split
   and advertise their prefixes freely.


3.  Route Aggregation and FIB Technologies

   This section begins with basic principles which will be well
   understood by all readers who are familiar with the crisis.  However,
   this section introduces a new approach to thinking about forwarding
   and therefore routing.  This new approach is the basis of the SRAM
   FIB architecture and of my suggestion for aligning address allocation
   and BGP management policies with a router architecture optimised to
   solve the biggest problem the Internet faces today.

3.1.  Route aggregation

   In the absence of a hardware FIB architecture such as proposed here,
   it seems that all participants - ISPs, major end-users, protocol
   designers, address administrators and router manufacturers - will
   continue to act as if it is impractical to route packets on the
   global Internet unless the forces which lead to 'route
   disaggregation' are strictly controlled.  Full route aggregation
   occurs when each topological branch of the Internet carries only IP
   addresses which are part of a given prefix, and where each other
   branch carries addresses within another non-overlapping prefix.  This
   means that if there were four 'first level' branches from the
   notional 'centre' of the Internet, all the sub-branches of branch A
   would contain only IP addresses within a prefix 'a' and likewise all
   the sub-branches of branch B would contain only IP addresses within a
   second prefix 'b', which is not a subset of prefix 'a'.  Then, a
   router at the junction of these four first-level branches could have
   a very simple routing table.

   This centre router would have four interfaces, one connecting to each
   branch, which I will also name A, B, C and D. (An "Interface" in this
   context means an Ethernet, ADSL, SDH/SONET fibre connection etc.
   This is commonly referred to as a "port" of the router, but I will
   use "interface" to avoid confusion with "ports" in the context of
   TCP, UDP and SCTP.)  When a packet arrives at any of the four
   interfaces, the routing table contains rules by which it should be
   forwarded to one of these four interfaces.  Simply testing the
   packet's destination address against these four rules will lead to it
   either being forwarded to one of the four interfaces, or being
   dropped.  (For simplicity, I am ignoring the fact that the router
   itself must have some IP addresses outside the four prefixes a, b, c
   and d.)  This example of complete route aggregation leads to a tree-
   structured network, without redundant paths to cope with failure of
   links or routers.


3.2.  Route disaggregation

   An example of complete route disaggregation would be a situation in
   which 256 separate prefixes were allocated to users, and these users
   were connected to various branches and sub-branches, with no
   discernable pattern in the addresses of these prefixes with respect
   to their location in the network topology.  In this instance, the
   router at the centre would need as many rules about forwarding as
   there were prefixes, except for the rare occasion in which two
   prefixes with adjacent address ranges could be accessed by the one
   interface.  In that case, a single rule covering the total range of
   the two prefixes would work just as well as separate rules for
   each.
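
   The merging of adjacent prefixes reached through the same interface
   can be sketched with Python's standard ipaddress module.  The
   function name, the example prefixes and the interface labels are
   hypothetical.  The sketch assumes that the input prefixes do not
   overlap, as in the scenario above.

```python
import ipaddress
from collections import defaultdict


def aggregate(rules):
    """Merge prefixes that share an interface and form a larger prefix.

    `rules` maps prefix strings to interface names.  Returns a sorted
    list of (network, interface) pairs.
    """
    by_interface = defaultdict(list)
    for prefix, interface in rules.items():
        by_interface[interface].append(ipaddress.ip_network(prefix))
    merged = []
    for interface, nets in by_interface.items():
        # collapse_addresses joins sibling networks into their parent.
        for net in ipaddress.collapse_addresses(nets):
            merged.append((net, interface))
    return sorted(merged)


rules = {
    "192.0.2.0/26": "A",
    "192.0.2.64/26": "A",    # sibling of the first prefix, same interface
    "192.0.2.128/26": "B",
    "192.0.2.192/26": "C",
}
for net, interface in aggregate(rules):
    print(net, interface)
# prints:
# 192.0.2.0/25 A
# 192.0.2.128/26 B
# 192.0.2.192/26 C
```

   Note that adjacency alone is not sufficient: two prefixes can only
   be replaced by a single prefix if together they exactly fill a
   range aligned on the shorter prefix length.  Four rules become
   three here; with no pattern relating addresses to topology, such
   savings are rare.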

3.3.  The Default Free Zone

   A router within a network which uses a small proportion of the
   Internet's address space needs only to maintain rules for the
   prefixes in that address space, with one final rule as the default to
   be followed when a packet's destination address does not match any of
   the other rules.  The default rule forwards packets to whichever
   interface leads towards the "rest of the Internet".  For instance, a
   border router which has a single connection to an ISP on interface D,
   has a series of rules for each of the network's prefixes, followed by
   the final default rule: to direct all non-matching packets to the
   ISP, where they will be routed to their destinations on the rest of
   the Internet.  An internal router will have its default rule set to
   forward packets to whichever interface is the best route to the
   border router.
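
   A minimal sketch of such a rule list for the single-homed border
   router follows.  The prefixes and the interface labels A and B are
   hypothetical; only interface D, the ISP link, comes from the
   example above.

```python
import ipaddress

# Local prefixes first, then the default rule pointing at the ISP link.
RULES = [
    (ipaddress.ip_network("198.51.100.0/25"), "A"),
    (ipaddress.ip_network("198.51.100.128/25"), "B"),
    (ipaddress.ip_network("0.0.0.0/0"), "D"),    # default rule
]


def forward(dst: str) -> str:
    """Return the interface of the first rule whose prefix contains dst."""
    address = ipaddress.ip_address(dst)
    for network, interface in RULES:
        if address in network:
            return interface
    raise LookupError("no matching rule")   # not reached while a default exists


print(forward("198.51.100.7"), forward("198.51.100.200"),
      forward("203.0.113.9"))
# prints: A B D
```

   A router in the DFZ has no equivalent of the last entry, so every
   globally advertised prefix needs a rule of its own.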

   There are two primary types of router which are considered to be in
   the "Default Free Zone", by virtue of them not being able to rely on
   a single default rule to forward all the packets which do not match
   any one of a relatively short list of rules.  The first type is a
   border router with links to two or more separate ISPs (or peering
   points, or routers of other systems) where each link carries outgoing
   packets to the rest of the Internet.  Such a border router needs
   rules for every globally advertised BGP prefix, because the best path
   for some prefixes will be to one link and the best for the others
   will be to another.  The second type of router in the DFZ is a
   "transit router", which is not serving a local network, but connects
   to two or more other routers handling general Internet traffic.  As
   with the multihomed (two links to ISPs) border router, the transit
   router needs to participate in the global BGP routing system and
   maintain separate rules for which interface to forward packets to,
   for each of the tens of thousands of advertised prefixes.

   The proposed SRAM-based FIB architecture is only required for routers
   in the DFZ.  Internal routers and those with a single outgoing link
   do not need an RIB and FIB with separate entries for each advertised
   BGP prefix, so conventional router architectures are perfectly
   adequate.  Nonetheless, it is possible that economies of scale and
   the desire for flexibility may result in the SRAM-based approach
   becoming standard in all high-end routers, including those which are
   not initially deployed in the DFZ.

3.4.  Routing Information Base (RIB)

   The Routing Information Base (RIB) is the body of data maintained by
   each router which contains these rules.  The RIB's rules may be
   manually configured, or some mixture of manual and automatically
   generated rules.  For our discussion of the problems faced by routers
   in the Internet's DFZ, most of the rules are generated automatically
   by the router's software running a Border Gateway Protocol (BGP)
   agent which communicates with other similar routers, so that each
   router can decide the best interface to forward packets to, depending
   on their destination address.  In this Internet Draft, we are only
   concerned with the external BGP (eBGP) interaction between transit
   routers and the border routers of Autonomous Systems (ASes).  The
   border routers of some ASes also communicate via an internal BGP
   (iBGP) system.

   For the purpose of this discussion, the action of "forwarding" refers
   to directing a packet to one of the interfaces, to dropping it, or
   perhaps to subjecting it to some other processing.  For the main body
   of payload traffic (as distinct from administrative traffic) the
   action of forwarding must be performed very quickly.  Because the
   major problem faced by transit and border routers is the task of
   simply forwarding packets, rather than doing anything more complex
   with them such as queuing them in the output interface, I will
   consider "forwarding" to be simply making the first, and usually the
   only, decision regarding the packet: which interface to send it to,
   if any.

   Typically, the RIB is processed by software in the router to generate
   a simpler body of data more suitable for rapid classification of
   packets.  This body of data is known as the 'Forwarding Information
   Base' (FIB), but I will also use this term to refer to the hardware
   and software which processes the packets according to this body of
   data.

   Routers may use some combination of software, specialised hardware
   and software, or purely hardware (without any conventional CPU or
   software) to classify the packets regarding which interface they
   should be forwarded to.  Originally, routers had a central design
   with relatively "dumb" interfaces.  All high-end routers now place an
   FIB system on each interface.  This is for several reasons.  Firstly,
   the total data rate of the router exceeds that of any single FIB.
   Secondly, funnelling packets to a central point when some of them
   might be sent back to the ingress interface is inefficient.  Thirdly,
   local processing in the interface enables the interface to decide,
   without any per-packet central involvement, which interface to send
   it to via the router's "backplane" or "switching fabric" - a fast,
   any-to-any interconnect between all the interfaces.

   The RIB is traditionally structured as a list of prefixes, each of
   which has an associated body of data.  One entry in the RIB may refer
   to an entire /8 prefix - which in IPv4 covers 16,777,216 IP
   addresses.  Another may cover a /24 or longer prefix, covering 256 or
   fewer IPv4 addresses.  This organisation reflects the way routing
   information is structured in BGP and in most other contexts.

   The RIB of a DFZ router is used for more than simply generating an
   FIB.  Firstly, the router uses the RIB to store some of the
   information it receives from its peers - other transit and border
   routers which participate in the global BGP system.  Secondly, the
   RIB is used to generate BGP messages sent to these peers.  Often, the
   RIB is processed to generate a similar, simplified and separate body
   of data on which the BGP outgoing messages are based.

3.5.  Forwarding Information Base (FIB)

   Traditionally, the FIB is structured similarly to the RIB - as a set
   of rules, each applying to a particular prefix.  Where two rules A
   and B refer to prefixes a and b respectively, and where prefix b is a
   subnet of a, then the FIB must be structured so that packets whose
   destination address is within prefix b are subject to rule B rather
   than rule A. Rule A is applied to all packets within a but not within
   b.  This algorithm of giving precedence to the routing rule with the
   "most specific" address match is known as "longest prefix match".
   For instance, rule A is for a subnet with addresses such as "0110
   01xx" (where 'x' means the address bits can be 0 or 1).  This is a
   prefix fixing 6 bits.  Rule B is for addresses in the range "0110
   010x" - a 7 bit long prefix.
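
   The two rules in this example can be used to sketch longest prefix
   match in Python.  The 8-bit addresses follow the notation above;
   the function names are hypothetical and the code illustrates the
   selection rule only, not how any particular router implements it.

```python
def matches(pattern: str, address: str) -> bool:
    """True if every fixed bit of `pattern` equals that bit of `address`.

    Both arguments are 8-character bit strings; 'x' in the pattern means
    the bit may be 0 or 1.  Spaces are ignored.
    """
    pattern = pattern.replace(" ", "")
    address = address.replace(" ", "")
    return all(p in ("x", a) for p, a in zip(pattern, address))


def prefix_length(pattern: str) -> int:
    """Number of fixed (non-'x') bits in the pattern."""
    return sum(bit != "x" for bit in pattern.replace(" ", ""))


def longest_prefix_match(rules: dict, address: str):
    """Return the name of the matching rule with the most fixed bits."""
    candidates = [name for name, pattern in rules.items()
                  if matches(pattern, address)]
    if not candidates:
        return None
    return max(candidates, key=lambda name: prefix_length(rules[name]))


RULES = {"A": "0110 01xx", "B": "0110 010x"}
print(longest_prefix_match(RULES, "0110 0101"))   # prints: B
print(longest_prefix_match(RULES, "0110 0111"))   # prints: A
print(longest_prefix_match(RULES, "0111 0000"))   # prints: None
```

   The first address matches both rules, and rule B wins because it
   fixes 7 bits rather than 6.  The second address lies within prefix
   a but outside prefix b, so only rule A applies.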

3.6.  Forwarding Equivalent Class (FEC)

   For simplicity, in what follows, I will use the term "FEC" to refer
   to a numeric value which the router interface needs to compute
   rapidly for each incoming packet.  This value controls whether the
   packet will be dropped, forwarded out the same interface, forwarded
   to another interface or subjected to further analysis and processing.
   In practice there may be other aspects of FEC which can be derived
   from other attributes of the packet, such as the DiffServ Code Points
   which are used to select which output queue the packet is sent to on
   the output interface.  However, for this discussion, I will consider
   "FEC" to be simply a binary number created by the input interface's
   FIB, based solely on the packet's destination address.

3.7.  Linear search list implementation of the FIB

   The simplest software approach to forwarding involves an iterative,
   linear search through the FIB's list of rules, comparing each rule
   with the packet's destination address.  In this approach, the FIB's
   rules are either the same as those in the RIB or are a somewhat
   simplified version, such as by combining two rules which have the
   same forwarding outcome (drop, process further, or deliver to a
   particular interface) and adjacent, aggregatable address ranges.
   For instance, if the same rule applies to "0010 000x" and "0010 001x"
   then these can be replaced by a single rule for "0010 00xx".
   Likewise, "0010 001x" and the encompassing "0010 0xxx" can be
   combined into "0010 0xxx" if their rules have the same forwarding
   outcome.

   While the order of rules in the RIB may not be important, in the FIB
   the order must follow the "longest prefix match" principle.  Any
   longer prefix must appear before the shorter prefix which
   encompasses its address range.  For instance Rule B in the previous
   example, with its longer prefix "b", must be found by the search
   algorithm before it finds rule A.

   Each rule contains a number which directs the router to forward the
   packet to a particular interface, to drop it, or to subject it to
   further processing.  There is no other absolute requirement about the
   ordering of the rules, but shorter processing times would be achieved
   by placing those rules at the front of the FIB which match the
   largest proportion of packets in the current traffic environment.
   This linear search algorithm could also be implemented in hardware,
   or by a micro-programmed processor specifically designed for the
   task.
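
   The ordered linear search and the merging of sibling rules can be
   sketched as follows.  Patterns are written without spaces, with
   trailing 'x' wildcards.  The function names, the interface labels
   and the choice to return "drop" when nothing matches are
   assumptions made for this illustration.

```python
def merge_siblings(p: str, q: str):
    """Merge two prefixes that differ only in their last fixed bit.

    Patterns are bit strings with trailing 'x' wildcards, for example
    "0010000x".  Returns the merged pattern, or None if the two
    patterns are not siblings.
    """
    fixed_p, fixed_q = p.rstrip("x"), q.rstrip("x")
    if (len(p) == len(q) and len(fixed_p) == len(fixed_q) > 0
            and fixed_p[:-1] == fixed_q[:-1]
            and fixed_p[-1] != fixed_q[-1]):
        return fixed_p[:-1] + "x" * (len(p) - len(fixed_p) + 1)
    return None


def build_fib(rules):
    """Order (pattern, outcome) rules so that longer prefixes come first."""
    return sorted(rules, key=lambda rule: len(rule[0].rstrip("x")),
                  reverse=True)


def linear_lookup(fib, address: str) -> str:
    """Return the outcome of the first rule matching the address bits."""
    for pattern, outcome in fib:
        if address.startswith(pattern.rstrip("x")):
            return outcome
    return "drop"


print(merge_siblings("0010000x", "0010001x"))     # prints: 001000xx
fib = build_fib([("011001xx", "if1"), ("0110010x", "if2")])
print(linear_lookup(fib, "01100101"))             # prints: if2
print(linear_lookup(fib, "01100111"))             # prints: if1
print(linear_lookup(fib, "11110000"))             # prints: drop
```

   Because the list is sorted by prefix length, the first match found
   is also the longest match.  The cost is that the search time grows
   with the number of rules.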

3.8.  Tree-structured implementation of the FIB

   Where the number of rules exceeds a few dozen, it would typically be
   faster to find the correct rule for each packet by structuring the
   rules in a tree-like manner in memory, so software or dedicated
   hardware could locate the correct rule in a limited number of
   iterated cycles.  For instance, an algorithm might first select one
   of two first level branches in the FIB depending on the state of bit
   31 of an IPv4 address.  Then it selects between two second level
   branches from whichever first level branch it chose in the first
   cycle.  This process continues, potentially for 32 cycles, until it
   finds a leaf - a node in the tree which is the longest prefix match
   for this address, and so has no further branches leading from it.


   This is an onerous task with IPv4 in the DFZ, because a significant
   number of packets need to be matched to prefixes 15 to 24 bits long.
   The average length of longest prefixes matched would depend on the
   specific location of the router and the type of traffic.  Processing
   speed is boosted firstly by switching initially on the most
   significant 8 bits, since all routing rules have prefixes at least
   this long, and secondly by more sophisticated approaches to the tree
   structure.  For instance some chains of such a tree have many levels
   and no branches, which increases the time to reach the end node and
   the storage requirements.  A "Patricia trie" is an improvement on the
   standard binary radix tree which solves this problem.  Detailed
   explanations of routing and forwarding approaches can be found at
   Pankaj Gupta's site [Gupta] - in particular Chapter 2 of his thesis.
   Trees can also be made with more than two branches per node.  For
   instance a 16-way branch handles 4 address bits per node traversal
   operation, potentially reducing the search time, but this raises
   problems with memory storage efficiency.
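
   A plain binary tree of the kind described at the start of this
   section can be sketched as below.  It examines one address bit per
   step and remembers the most recent rule passed on the way down.
   This is the unoptimised form, not a Patricia trie or a multi-way
   tree, and the class and function names are hypothetical.

```python
class Node:
    __slots__ = ("children", "outcome")

    def __init__(self):
        self.children = [None, None]   # branches for bit 0 and bit 1
        self.outcome = None            # set if a prefix ends at this node


def trie_insert(root: Node, prefix_bits: str, outcome: str) -> None:
    """Add a rule; `prefix_bits` holds only the fixed bits of the prefix."""
    node = root
    for bit in prefix_bits:
        index = int(bit)
        if node.children[index] is None:
            node.children[index] = Node()
        node = node.children[index]
    node.outcome = outcome


def trie_lookup(root: Node, address_bits: str):
    """Walk one bit per step, keeping the last outcome seen."""
    node, best = root, root.outcome
    for bit in address_bits:
        node = node.children[int(bit)]
        if node is None:
            break                      # no longer prefix exists
        if node.outcome is not None:
            best = node.outcome
    return best


root = Node()
trie_insert(root, "011001", "A")       # 6-bit prefix
trie_insert(root, "0110010", "B")      # 7-bit prefix within it
print(trie_lookup(root, "01100101"))   # prints: B
print(trie_lookup(root, "01100111"))   # prints: A
print(trie_lookup(root, "10000000"))   # prints: None
```

   The first lookup takes seven steps before the walk ends, which
   illustrates why long prefixes make this approach slow.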

3.9.  TCAM implementation of the FIB

   In high-end routers, the most common technique for classifying
   incoming packets is dedicated hardware based on Ternary Content
   Addressable Memory (TCAM) chips.  "Ternary" means that each
   functional cell of the TCAM has three states: "match 0", "match 1"
   and "don't care".  TCAM is always used in routers, rather than the
   simple "CAM", in which each functional cell has only two states:
   "match 0" and "match 1".  Nonetheless, the term "TCAM" is sometimes
   loosely shortened to "CAM" in discussions about routers.  Ethernet
   switches need to match every bit of a 48 bit MAC address, so they use
   true CAM (Content Addressable Memory), but routers need to be more
   flexible, and be able to ignore the state of many bits.

   While TCAM and some highly optimised iterative techniques are the
   fastest approaches for the very broad general purpose nature of
   router functionality, they do not scale easily - or perhaps at all -
   to handling millions of prefixes, each with a potentially different
   "Forwarding Equivalence Class" (FEC).

3.9.1.  TCAM devices

   The large, fast, TCAM chips needed for high-end routers are exotic,
   complex, power-hungry devices.  Data is written into the memory cells
   (usually Static RAM flip-flops) by the CPU of the router or of the
   interface card on which the FIB resides.  There is more to a complete
   FIB than one or more TCAMs, but in this explanation we will consider
   the use of a single TCAM and a second, conventional, SRAM chip,
   solely for determining the FEC of each incoming packet.  I will
   describe a TCAM and its associated SRAM in some detail, because this
   technology is the most likely one to be scaled up in order to cope
   with the routing table explosion, unless a direct SRAM-based
   technique such as I am proposing is employed.

   I will describe a simple, imaginary, TCAM with 32 addresses and 8
   data bits.  Each "cell" consists of two flip-flops, and all the cells
   in our example have previously been written by the router's CPU to
   implement the currently required rules for classifying packets to
   create the proper FEC for each one.  The TCAM is the first and most
   demanding part of the process.  This example FIB can contain up to 32
   rules, and in this example will be working from an 8 bit destination
   address of a packet.  The 8 "data" input signals enter at the top of
   the device, and each one is split into two lines which run vertically
   to the bottom edge.  For instance, for bit 0, there is a true bit 0
   line and an inverted bit 0 line.  So there are 16 lines which may
   change state every time some new "data" is presented to the chip.
   The 32 "addresses" are implemented as 32 horizontal rows.  Each
   intersection of an "address" row and a pair of "data" lines contains
   one cell (two flip-flops, as just described) and two comparators.
   The outputs of all the 16 comparators in an "address" row can pull
   down a horizontal line I will call the "match" line for this address.

   Each pair of flip-flops and their associated comparators implements
   the ternary comparison function.  When both flip-flops are low, the
   cell does not care about the state of the true and inverted data
   lines which pass downwards across it.  When the left flip-flop is set
   to "1", its comparator will pull down the match line if the true data
   line is low.  Similarly, when the right flip-flop is set to "1", its
   comparator will pull down the match line if the inverted data line is
   low.  Further details can be found in [Taylor-Spitznagel], where
   power dissipation figures of 20 to 30 watts are quoted for 2002
   technology TCAMs of 18Mbit capacity.

   In our example, the first address row at the top - address 31 - has
   its cells set to match the following pattern of address bits, where
   "x" means "don't care": "0110 1xxx".  (In these examples, the most
   significant bit is on the left.)  Address 30 is set to match "0001
   111x" and address 29 to match "100x xxxx".  It can easily be seen
   how a TCAM can, in a single clock cycle, compare the address bits of
   an incoming packet, which are driven to the "data" pins of the TCAM,
   with the rules which have previously been stored inside the device.

   Along the right edge of the example device, the 32 match lines enter
   a priority encoder, which is a simple arrangement of logic gates so
   that a 5 bit binary number emerges from output pins, corresponding to
   the highest numbered match line which remains high.  (The term "pin"
   refers to physical electrical connections of an integrated circuit,
   despite most large chips now being in packages which use solder balls
   rather than pins.)

   At the start of the cycle, all match lines and all the true and
   inverted "data" lines are pulled high.  When the true and inverted
   data lines are set according to the packet's address, none, one or
   multiple match lines will remain high.  In this example, the packet's
   address bits are "1000 1101" so the match line of address row 29
   remains high.  Other lower numbered match lines may be high as well,
   but the priority encoder ignores them.  The TCAM chip produces an
   output (on its five "address" pins) of the binary number for "29".
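
   The matching and priority encoding just described can be modelled in
   a few lines of Python.  This is only a behavioural sketch: a real
   TCAM compares every row simultaneously, whereas the loop here visits
   them one by one.  The function names and row 0 (a catch-all for
   addresses beginning with "1") are assumptions added for the example;
   rows 31, 30 and 29 are the ones given above.

```python
def cell_matches(pattern_char, key_char):
    """One ternary cell: 'x' is "don't care", otherwise the bits must agree."""
    return pattern_char == "x" or pattern_char == key_char


def tcam_search(rows, key):
    """Return the highest numbered row whose pattern matches key, or None.

    Real hardware compares all rows at once; the loop stands in for that.
    """
    matching = [
        number
        for number, pattern in rows.items()
        if all(cell_matches(p, k) for p, k in zip(pattern, key))
    ]
    return max(matching) if matching else None   # the priority encoder


rows = {
    31: "01101xxx",
    30: "0001111x",
    29: "100xxxxx",
    0: "1xxxxxxx",    # hypothetical lower priority row, not from the text
}

print(tcam_search(rows, "10001101"))   # prints 29
```

   Rows 29 and 0 both match the key "1000 1101", and the priority
   encoder reports only the higher numbered of the two.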

   There is some specialised terminology for TCAMs.  They are sometimes
   marketed as "network search engines".  When performing their
   comparison function, the input, to the "data" pins in this example,
   is known as the "key" and the output is called the "address".  A
   recent paper describing TCAM usage in IP packet classification, with
   a particular emphasis on optimising the speed of rewriting the cells
   when a routing withdrawal or addition occurs, is: Gesan Wang and
   Nian-Feng Tzeng 2006 [Wang-Tzeng].

3.9.2.  TCAM and SRAM produce FEC

   In the example, the TCAM output 29 does not tell the router which
   interface to send the packet to.  This information is stored in a
   standard SRAM chip, which has its address inputs driven by the output
   of the TCAM.  The router's software has previously written, to each
   location in the SRAM, the correct FEC data for each rule in the TCAM.

   More complex processing can be achieved by extending this
   architecture to involve analysing a packet with one set of TCAM
   rules, with the data read out from the SRAM determining whether the
   packet will be matched against further rules, or whether the result
   of the previous operation contains sufficient information to
   determine the packet's fate.  In this way, complex multi-cycle
   programs of analysis can be performed on packets.

3.9.3.  TCAM power consumption and other problems

   A TCAM is a sophisticated, flexible, massively parallel comparison
   system.  TCAM chips are relatively exotic devices - since they are
   primarily used only in routers and networking equipment.  Some
   reasons for their high power consumption are that each "bit" really
   consists of two flip-flops and two comparators, and that in every
   comparison cycle, all the match lines are precharged high, after
   which most or all of them will be pulled low.

   TCAM's power consumption and limited capacity are significant
   problems.  The devices are often partitioned so only certain sections
   are active for a particular "search" cycle.  Modern devices, such as
   the Renesas R8A20211BG, operate at high rates, such as 133M "searches"
   per second for a 72 bit key with 262144 address rows.  No power
   consumption figures are available for this device [Renesas].  Its
   sample cost in 2005 was about USD$175.

   Another major problem is that updates to the routing table often
   require significant rewriting of the contents of the TCAM and the
   SRAM, since rules may be added and deleted.  The order of rules is
   crucial, since the order determines which of multiple true "match"
   lines will be recognised by the priority encoder.  When a change to
   the RIB occurs, implementing the required change in the FIB (in this
   case consisting of the data in the TCAM and the SRAM) may involve
   many rules being rewritten to other locations in order to fit the
   newly structured list of rules into the available space.  While the
   data is being rewritten, the FIB cannot be used to classify packets,
   so packets may be dropped.  Devavrat Shah and Pankaj Gupta, in 2001,
   considered optimisations for the way data is structured to improve
   upon occasional worst-case rewrites involving 64k locations, at a
   50MHz clock rate, which would hold up packet processing for 1.2ms
   [Shah-Gupta].  The SRAM would require the same demanding pattern of
   writes by the router's CPU when large numbers of rules are moved in
   the TCAM.

3.9.4.  TCAM capacity is driven by route disaggregation

   TCAMs are a necessary part of most router architectures.  However,
   when an ISP or end-user adds another prefix to the global routing
   table and when (as is often the case) the prefix is advertised in a
   location such that from the point of view of a subset of transit and
   border routers, packets addressed to this prefix must be forwarded to
   a different interface from those addressed to the neighbouring
   prefixes, then TCAM-based routers in this subset will need to use
   another address line of their TCAMs.

   This adds directly to the costs of thousands of routers, and directly
   contributes to their energy consumption.  TCAM memory is often
   difficult or impossible to expand.  In order to ensure routers can
   handle whatever expansion of the "global BGP routing table" that may
   occur in their 5 or more year projected service life, network
   operators typically need to pay up front for this expensive hardware
   in every interface, and the router software to manage it.

   If no single TCAM chip can store the required number of rules, they
   may be used sequentially or in parallel arrays.  Both these
   approaches slow down processing and add power consumption.

3.10.  Proposed SRAM architecture for the FIB

   I propose a simple implementation of the FIB, which involves the
   router's CPU processing the RIB to create a 4 to 9 bit word
   (depending on the number of interfaces the router has) for the FEC of
   every prefix of a certain size.  For IPv4, this is the /24 prefix.
   2^24 of these must be calculated and stored in one or more SRAM
   chips.  Then, destination address bits 31 to 8 are used to drive the
   address pins of the SRAM, in read mode, with the output being the FEC
   for this packet.  There are no complex algorithms for ordering rules,
   or requirements to move other rules as new rules are added.  The most
   common update, which is either the withdrawal of a /24 or a change in
   its FEC, involves a single write operation.  The hardware multiplexes
   access to the address lines of the SRAM so that CPU write cycles can
   be interspersed with read cycles for packet classification.  Appendix
   1 contains further low level details of how this might be done.
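
   In software terms, the lookup is a single array index on the most
   significant 24 bits of the destination address, and the common update
   is a single array write.  The following Python sketch illustrates
   this under some assumptions of mine: one byte is used per entry
   rather than a 4 to 9 bit word, value 0 is taken to mean "drop" (as in
   the example encoding of Table 1), and the names fib, lookup and
   set_slash24 are purely illustrative.

```python
FEC_DROP = 0                       # assumed encoding, see Table 1

# One entry per /24: 2^24 entries, one byte each in this sketch.
fib = bytearray(1 << 24)           # every entry starts as FEC_DROP


def ip(text):
    """Convert dotted-quad text to a 32 bit integer."""
    a, b, c, d = (int(x) for x in text.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def lookup(addr):
    """Single read: the top 24 address bits are the table index."""
    return fib[addr >> 8]


def set_slash24(prefix, fec):
    """Single write: announce, change or withdraw one /24."""
    fib[prefix >> 8] = fec


set_slash24(ip("192.0.2.0"), 5)            # 5 is an arbitrary example FEC
print(lookup(ip("192.0.2.77")), lookup(ip("192.0.3.1")))
```

   No ordering of rules is involved: writing one entry never disturbs
   any other entry.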

   Here I describe a practical approach to implementing this
   architecture with currently available memory chips.  There may be
   other approaches, including extending existing router architectures
   with additional SRAM to achieve the same functionality.  In practice,
   this architectural block would be part of the larger FIB function,
   with packets first being handled by the SRAM system.  Address ranges
   for packets which require further processing by TCAM or other
   techniques will have the SRAM data for the /24 prefixes in each range
   set to a value which selects this further processing.

   The SRAM would be driven by hard-wired, FPGA or micro-coded systems
   which firstly determine the nature of the packet, such as IPv4, IPv6
   or MPLS.  Although it would be possible to map the 1M MPLS label
   space into unused parts of an SRAM which is used for IPv4, MPLS
   requires other functions and data storage, including the storage of a
   20 bit new label value to write to the packet, and possibly
   information about how to prioritise it in one of the potentially
   multiple output queues in the interface which sends it to the next
   hop.  In this section, I will consider only IPv4 packets.

   Certain conditions must be met for the system to be effective.
   Firstly, the vast majority of packets - essentially all user traffic
   packets - must have their FEC defined in a single cycle of this SRAM-
   based FIB.  Secondly, the system must be able to cope reliably, but
   not necessarily as quickly, with the smaller number of packets which
   need to be matched to prefixes longer than /24.

   The third requirement is a fast, simple method of updating the FIB
   when the RIB changes.  The SRAM design provides this, unless a very
   short prefix is changed, which would require writing thousands or
   hundreds of thousands of locations.  Fortunately, while such an
   extensive rewrite of the locations covered by this prefix was taking
   place, it could be interspersed with accesses for packet
   classification.  During this time, some of the /24 prefixes within
   the larger prefix being updated would have the old FEC value and
   others would have the new.  This would result in some packets being
   sent to the wrong interface, but it is the interface they were
   previously sent to.  During the rewrite, there is no impact on
   packets outside the range being changed.  Changing an entire /8 would
   take 64k cycles, which might take a few milliseconds.  Worst-case
   TCAM updates may take this long, but the TCAM generally cannot be
   used while its contents are being rewritten.
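
   The number of writes needed for a prefix of a given length, and a
   lower bound on the time they take, can be worked out as in the
   following Python sketch.  The 4ns figure is the cycle time quoted in
   this document; the lower bound assumes every cycle is a write, so
   interleaving with packet lookups would lengthen it.  The function
   names are illustrative, and rewrite_range simply overwrites the whole
   range, whereas a real implementation would recompute each entry from
   the RIB so that more specific routes are preserved.

```python
CYCLE_NS = 4   # read or write cycle time quoted in this document


def slash24_writes(prefix_length):
    """Number of /24 entries covered by a prefix of /24 or shorter."""
    if not 0 <= prefix_length <= 24:
        raise ValueError("prefix must be /24 or shorter")
    return 1 << (24 - prefix_length)


def rewrite_range(fib, prefix, prefix_length, fec):
    """Write fec to every /24 entry under the prefix; return the write count."""
    count = slash24_writes(prefix_length)
    first = (prefix >> 8) & ~(count - 1)      # align to the start of the range
    for index in range(first, first + count):
        fib[index] = fec                       # lookups could be interleaved here
    return count


fib = bytearray(1 << 24)
rewrite_range(fib, 10 << 24, 8, 7)             # 10.0.0.0/8 -> example FEC 7

for length in (24, 16, 8):
    writes = slash24_writes(length)
    print(f"/{length}: {writes} writes, at least {writes * CYCLE_NS / 1000} us")
```

   Entries outside 10.0.0.0/8 are never touched, which is why packets
   to other destinations are unaffected during the rewrite.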

   In terms of simplicity, low power consumption and compact size, the
   SRAM approach could only be bettered, perhaps, by use of less
   expensive DRAM.  However, DRAM cycle times are much longer than
   SRAMs, which can typically produce their read results in a nanosecond
   or so, and complete a read or write cycle in 4 nanoseconds.  Whereas
   TCAMs and iterative search approaches are complex, in need of many
   optimisations and so are the subject of much academic research, there
   is little to write about using a simple SRAM chip except that it is a
   straightforward engineering solution which is easy to understand and
   program.

   The greatest single benefit of the SRAM approach is that its
   performance is optimal no matter how many rules are contained in the
   RIB.  This includes the worst-case situation of complete route
   disaggregation, in which packets addressed to every successive /24
   are forwarded to a different interface.  Implementing this
   architecture in all the Internet's DFZ routers would remove all
   hardware-based pressure to achieve route aggregation.

   In the sections below, I discuss specific hardware and BGP policy
   proposals for both IPv4 and IPv6.

4.  The Crisis in Routing and Addressing

   The report and presentations from the October 2006 IAB Routing and
   Addressing Workshop are the best reference for the problem I am
   addressing [I-D.iab-raws-report] [IAB-RAWS-website].  Below, I quote
   some of the key statements of the report and presentations.

4.1.  Scalability

   From the report: "While several scalability features of the routing
   and addressing systems were discussed, most related to the size of
   the DFZ routing table (frequently referred to as the Routing
   Information Base, or RIB) and its implications.  Those implications
   included (but were not limited to) the sizes of the DFZ RIB and FIB
   (the Forwarding Information Base), the cost of recomputing the FIB,
   concerns about the BGP convergence times in the presence of growing
   RIB and FIB sizes, and the costs and power (and hence heat
   dissipation) properties of the hardware needed to route traffic in
   the core of the Internet."

4.2.  Addressing, Topology and Rekhter's Law

   Yakov Rekhter's "Rekhter's Law" was cited as one of the fundamental
   assumptions underlying the scalability of routing systems:
   "Addressing can follow topology or topology can follow addressing.
   Choose one."  I can find no mention of new hardware FIB designs which
   are not impacted by the route disaggregation which "Rekhter's Law" is
   intended to prevent.  However, this assumption of the apparent
   futility of hoping for such an approach is noted in the following
   paragraph:

   "A refinement to Rekhter's Law, then, is that for a routing system to
   scale, the locator part of IP address must be assigned in such a way
   that it is congruent with the Internet's topology.  However, as
   identifiers are typically assigned based upon organizational (not
   topological) structure and have stability as a desirable property, a
   'natural incongruence' arises.  As a result, it is difficult (if not
   impossible) to make a single number space serve both purposes
   efficiently.  Of course this conclusion assumes, as mentioned above,
   that no effective 'non-topological routing system' exists."

   The purpose of this Internet Draft is to suggest that a simple
   hardware forwarding system does exist which is not impacted by
   address assignments which have no correlation with topology.

4.3.  IPv6 - Future Routing Swamp?

   Regarding IPv6: "The primary issue with IPv6 deployment was that, in
   the absence of a scalable routing strategy, IPv6 has the potential to
   exacerbate today's problems simply by the virtue of its much larger
   address space." and "Thus the opportunity exists to create a "swamp"
   (unaggregatable address space) that can be many orders of magnitude
   larger than what we faced with IPv4."

4.4.  Costs, Benefits and IETF Policy

   Regarding the impact of activities such as multihoming by ISPs and
   end-users on the costs of purchasing and running transit and border
   routers, "the workshop participants felt that the costs and benefits
   in today's routing system are misaligned.  While the IETF does not
   typically consider the "business model" impacts various technology
   choices directly, many participants felt that perhaps the time has
   come to review that philosophy."  The high cost of renumbering an
   end-user network was acknowledged together with the observation that
   "no strong disincentive exists to discourage the increasing use of
   Provider Independent address space".

4.5.  Multihoming is mandatory for ISPs and many users

   Multihoming for end-user organisations was recognised as being "in
   some circumstances, mandatory due to contract or law."  Uses of
   Traffic Engineering were recognised as being mandatory - for goals
   including load balancing, low-cost path selection, maintaining
   peering agreements and for ensuring that packets must follow, or not
   follow, certain paths.  There is also a statement to the effect that
   ARIN has been allocating Provider Independent IPv6 /48 prefixes for
   end users, but my understanding of the policy statements at the ARIN
   site is that this is only for "infrastructure providers", such as
   Internet exchanges.

4.6.  Who, Where and extending the TCP/IP protocols

   Section 2.2 of the report discusses how the two-layer domain name /
   IP address split has become overloaded with functions, since the IP
   address which rightfully specifies "where" (in contrast to the DNS
   text name's "who") is no longer a direct function of "where" the node
   is within the network's topology.  There is a review of various
   approaches to inserting a third layer into IP protocols, where a
   middle level, reasonably stable "locator" is mapped in real-time to
   one or more relatively unstable "identifiers", enabling higher level
   protocols to function as usual while the physical location of nodes
   changes.

   The first part of David Wheeler's aphorism is cited: "There is no
   problem in computer science that cannot be solved by an extra level
   of indirection,", but not the second: "but that usually will create
   another problem."

4.7.  Moore's Law

   There was considerable debate about the ability of Moore's Law to
   keep up with the growth in the size of the global BGP routing table.
   Concerns were raised about the chips used in high-end routers being
   low volume devices with very high design costs, which only benefited
   marginally from the spectacular leading edge of semiconductor
   development which is focused on mass market CPUs and memory.

4.8.  Power consumption and heat dissipation

   The report's paragraphs on heating and power appear in full below:

   "Transistors consume power both when idle ("leakage current") and
   when switching.  The smaller the transistors, the larger the leakage
   current.  The overall power consumption is not linear with the
   density increase.  Thus, as the need for more powerful routers
   increases, cooling technology grows more taxed.  At present, the
   existing air cooling system is starting to be a limiting factor for
   scaling high-performance routers.

   "A key metric for system evaluation is now the unit of forwarding
   bandwidth per Watt-- [(Mb/s)/W].  About 60% of the power goes to the
   forwarding engine circuits, with the rest divided between the
   memories, route processors, and interconnect.  Using parallelization
   to achieve higher bandwidths can aggravate the situation, due to
   increased power and cooling demands.

   "[Editor's note: Many in the community have commented that heat power
   utilization and the attendant heat dissipation, along with size
   limitations of fabrication processes are the current limiting
   factors.]"

   I note that a 2006 report [Gartner] states "Gartner roughly estimates
   that during operation, today's servers and PCs account for about
   0.75% of global carbon dioxide emissions (based on direct power
   consumption, not including cooling)."

4.9.  Incremental changes

   In Section 8, Criteria for Solution Development, the report states:
   "In the routing system itself, the solutions must allow incremental
   changes from the current operational Internet.  The solutions should
   be backward compatible with the routing protocols in use today,
   including BGP, OSPF, IS-IS, and others, possibly with incremental
   enhancements.  The data path should support IPv4 and IPv6."

   I believe this SRAM-based FIB proposal meets all these criteria, as
   long as the re-allocation of IPv6 global unicast address space is
   accepted - along with the necessity, over the natural replacement
   cycle, of all transit and border routers conforming to a new standard
   based on stable, agreed, administrative limits to the bits which vary
   in the inter-AS user traffic of the Internet.

   I expect the SRAM IPv4 FIB approach would be naturally implemented
   without any work prompted by this Internet Draft.  My aim is to bring
   this development forward for both IPv4 and IPv6 in a standardised
   manner, by prompting discussion and hopefully agreement on the BGP
   policies for both protocols, and by enabling the SRAM IPv6 solution
   by a compact reallocation of the global unicast address space before
   it is more widely used.

5.  SRAM-based FIB for IPv4

   In this section, I propose a hardware design for an IPv4 FIB function,
   based on a specific already mass-produced Static RAM (SRAM) device,
   the 4ns cycle time, 8M x 9 bit, Samsung K7R640982M [Samsung SRAM].
   While I will not discuss all technical details of this device, I will
   provide enough to enable readers to envisage the physical
   implementation of the system I am proposing.  There are a number of
   other devices, including those of other manufacturers, which could
   provide the same functions, but this particular device is currently
   the best for explaining the design.  My aim is to show that this
   solution is practical and elegant.  By the time the system is built
   into routers, there are likely to be further choices in how to
   implement it.

5.1.  The SRAM chip

   The K7R640982M is part of a "Quad Data Rate II" (QDRII) family of
   devices.  The electrical and physical specifications for this family
   are standardised by the Quad Data Rate Consortium [QDR Consortium],
   of which Samsung is a member.  None of the other members - Cypress,
   Renesas, IDT and NEC - currently make a device with the K7R640982M's
   features: 72 megabits with 9 bit wide data inputs and outputs.  Other
   family members, such as those with 18 bit inputs and outputs, could
   be used for the FIB system, but the 9 bit device is probably most
   convenient.

   The SRAM measures 17 x 15 x 1.3mm when soldered flat to the printed
   circuit board via its 165 solder balls.  The price I was quoted by
   Samsung, in February 2007, for sample quantities was USD$70.  The
   maximum power dissipation is 1.45 watts, but actual power dissipation
   is likely to be less, since even with a 40Gbps stream of packets, the
   device will not be running at its maximum 250 million memory cycles
   per second.  A single such device holds a 4 bit FEC value for every
   unique IPv4 /24 prefix.
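
   The claim that a 40Gbps stream does not need the full 250 million
   cycles per second can be checked with a little arithmetic, given an
   assumed minimum packet size.  The 40 byte figure below (a 20 byte
   IPv4 header plus a 20 byte TCP header, ignoring any link-layer
   framing) is my assumption for this sketch and is not a measured
   value; larger packets or framing overhead would reduce the lookup
   rate further.

```python
LINK_BPS = 40e9            # 40Gbps link, from the text
SRAM_CYCLES_PER_S = 250e6  # 4ns cycle time, from the text
MIN_PACKET_BYTES = 40      # ASSUMPTION: 20 byte IPv4 header + 20 byte TCP header


def lookups_per_second(link_bps, packet_bytes):
    """Worst-case packet rate if every packet has the given size."""
    return link_bps / (packet_bytes * 8)


rate = lookups_per_second(LINK_BPS, MIN_PACKET_BYTES)
utilisation = rate / SRAM_CYCLES_PER_S
print(f"{rate / 1e6:.0f} M lookups/s, {utilisation:.0%} of SRAM cycles")
```

   Under this assumption, one lookup per packet uses about half of the
   available memory cycles, leaving the remainder for CPU writes.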

   The device has 9 data in and 9 data out pins.  This separation of
   input and output pins is convenient since the router's CPU only needs
   to write (except for memory test purposes) and the FIB function only
   needs to read.  I will present the device as if it has 23 address
   pins, but in fact it has 22, corresponding to A22 through A1.  A0 is
   generated internally, in each cycle, and two 9 bit bytes are read on
   every 4ns read cycle.  I will not discuss the straightforward low-
   level arrangements for using this 9 bit plus 9 bit read or write
   memory cycle and will portray it as a simple "8M x 9 bit" SRAM chip,
   with 23 address lines, reading or writing 9 bits of data.

5.2.  Encoding FEC

   The address pins of the SRAM system need to be driven by either the
   FIB hardware - by bits 31 through 8 of an IPv4 packet's destination
   address - or by an address presented by the router's CPU.  At first,
   I will describe the FIB read cycle for a router with up to 14
   interfaces.  The FIB function is implemented identically on every
   interface, with the same data being written to each SRAM, assuming
   there is no need for different interfaces to forward IPv4 packets
   differently.  Relatively straightforward hardware detects the IPv4
   packet, switches its address bits to the SRAM, collects either the
   low 4 bits of the 9 bit read operation or the high 4 bits (one bit is
   unused in this design) and then uses those four bits to determine
   where to switch the packet to for forwarding.  Thus, for each of the
   16,777,216 /24 prefixes, the SRAM reads out a specific value of FEC.

   Table 1 shows an example of the meanings of the values of a 4 bit
   FEC.

                 Example of functions of 4 bit FEC values

              +-----------+--------------------------------+
              | FEC value | Action                         |
              +-----------+--------------------------------+
              |         0 | Drop packet                    |
              |           |                                |
              |         1 | Analyse packet by other means  |
              |           |                                |
              |         2 | Forward packet to interface 0  |
              |           |                                |
              |         3 | Forward packet to interface 1  |
              |           |                                |
              |       ... | ...                            |
              |           |                                |
              |        14 | Forward packet to interface 12 |
              |           |                                |
              |        15 | Forward packet to interface 13 |
              +-----------+--------------------------------+

                                  Table 1

   Where the router has more than 14 interfaces, or where there is a
   requirement to use more of these values to select alternative methods
   of processing, the next obvious option with currently available
   memory chips is to use 9 bits.  This provides for up to 510 output
   interfaces and requires two of the currently available 8M x 9 bit
   chips, with a little hardware to select one or the other as the
   active device.

   It may be attractive to fix the meaning of "2" to be "forward the
   packet by this interface", but that would require different data to
   be written into the SRAMs of each of the router's interfaces.
   Probably, the writing would be done by the interface's CPU rather
   than a central CPU, to allow full flexibility.

   For a "small" router - one with 14 or fewer interfaces - A31 to A9 of
   the packet's destination address drive the SRAM itself (A22 through
   A0) and A8 of the destination address is used by the hardware to
   select the high or low 4 bit nybble from the 9 read data pins.  For a
   large router, two chips would be needed, in parallel, with packet
   address bit A8 selecting one chip or the other to perform a read
   operation.  This yields a 9 bit FEC value.
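
   The "small" router arrangement, with two 4 bit FEC values sharing
   each 9 bit SRAM location, might be modelled as in the Python sketch
   below.  Which nybble corresponds to which state of A8 is not fixed by
   this document; here I assume that A8 = 1 selects the high nybble.
   The names sram, split, write_fec and read_fec are illustrative, and
   each 9 bit word is held in a 16 bit array slot.

```python
from array import array

SRAM_WORDS = 1 << 23                     # 8M locations of 9 bits each
sram = array("H", [0]) * SRAM_WORDS      # 16 bit slots hold the 9 bit words


def split(addr):
    """Destination bits 31..9 -> SRAM address, bit 8 -> nybble select."""
    return addr >> 9, (addr >> 8) & 1


def write_fec(addr, fec):
    """Store a 4 bit FEC for the /24 containing addr."""
    word, high = split(addr)
    shift = 4 if high else 0             # ASSUMPTION: A8 = 1 means high nybble
    kept = sram[word] & ~(0xF << shift) & 0x1FF
    sram[word] = kept | ((fec & 0xF) << shift)


def read_fec(addr):
    """One SRAM read, then select a nybble; bit 8 of the word is unused."""
    word, high = split(addr)
    return (sram[word] >> (4 if high else 0)) & 0xF


write_fec(0xC0000200, 5)     # 192.0.2.0/24 -> example FEC 5
write_fec(0xC0000300, 9)     # 192.0.3.0/24 -> example FEC 9
print(read_fec(0xC00002FF), read_fec(0xC0000301))
```

   The two adjacent /24 prefixes in the example share one SRAM location
   and differ only in the nybble selected by A8.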

   If every packet was known to be an IPv4 packet and all prefixes in
   the routing table were known to be /24 or shorter, then the above
   arrangement would be sufficient.  In practice, some elaborations
   would be necessary.

   Firstly, hardware would probably detect packets addressed to
   224.0.0.0/4 (multicast) or 240.0.0.0/4 (reserved), although the
   latter range is a candidate for global routing in the future.

   Secondly, the router needs to be able to cope quickly with the main
   volume of AS to AS traffic, which will be forwarded according to
   prefixes of /24 to /8, while also being capable of correctly
   forwarding packets which match longer prefixes.  This can be achieved
   by making the SRAM FIB the first stage for all IPv4 packets, with a
   packet being sent to the TCAM section if its FEC result is 1, meaning
   "Analyse packet by other means".  For instance, if any /24 prefix
   includes a longer prefix with a different FEC than the rest of the
   /24, then all packets addressed to this /24's address range should be
   analysed by the conventional TCAM etc. based packet classification
   system.
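
   The two-stage arrangement can be sketched as follows.  This is a
   simplified model under stated assumptions: a dictionary keyed by the
   top 24 address bits stands in for the SRAM, the route tuples and
   function names are hypothetical, prefixes are assumed to have their
   host bits set to zero, and FEC value 0 is assumed to mean "drop".

```python
FEC_DROP, FEC_SLOW_PATH = 0, 1   # reserved values; 2 and up name interfaces

def build_fib(routes):
    """Build the first-stage table from (prefix, length, fec) tuples.

    prefix is a 32-bit integer.  The result maps each covered /24
    (its top 24 bits) to a FEC value.
    """
    fib = {}
    # Shorter prefixes first, so more specific routes overwrite them.
    for prefix, length, fec in sorted(routes, key=lambda r: r[1]):
        first = prefix >> 8
        if length <= 24:
            for p24 in range(first, first + (1 << (24 - length))):
                fib[p24] = fec
        elif fib.get(first, FEC_DROP) != fec:
            # A longer prefix with a different FEC: defer the whole /24.
            fib[first] = FEC_SLOW_PATH
    return fib

def classify(fib, dst):
    """Return the first-stage decision for a 32-bit destination address."""
    fec = fib.get(dst >> 8, FEC_DROP)
    if fec == FEC_DROP:
        return "drop"
    if fec == FEC_SLOW_PATH:
        return "slow-path"
    return "interface-%d" % fec

routes = [
    (0x0A000000, 8, 2),    # 10.0.0.0/8     -> FEC 2
    (0x0A010200, 24, 3),   # 10.1.2.0/24    -> FEC 3
    (0x0A010380, 25, 4),   # 10.1.3.128/25  -> FEC 4, longer than /24
]
fib = build_fib(routes)
```

   Here every address in 10.1.3.0/24 is handed to the slow path, because
   the /25 inside it has a different FEC from the covering /8.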

   The routing table for transit routers (and, I assume, multihomed
   border routers), such as that prepared daily by Geoff Huston [GIH BGP
   prefixes], has a small number of prefixes longer than /24.  These
   prefixes are analysed and listed separately, for each prefix length,
   at [RW BGP prefixes analysis].  These are presumably routes for
   connecting to routers themselves, rather than for handling Internet
   user traffic.  The small number of routes of this nature and the
   relatively small traffic volumes, primarily BGP updates, carried on
   these routes, should not tax the storage capacity or the speed of
   TCAM or of other approaches to forwarding.


5.3.  IPv4 usage and policy

   This proposal requires no change in IPv4 usage or in the current
   policy of accepting /24 prefixes into the global BGP routing system
   and rejecting longer prefixes.  As far as I know, this is not a
   formal policy, but it is widely followed by network operators.
   (There is provision for RIRs requiring limits on the size of prefixes
   added to routing tables in section 4.5 of [RFC3177].)  Ideally, to
   give router manufacturers and purchasers confidence, the proposed
   SRAM FIB approach, or its functional equivalent, would be
   standardised together with some kind of standard or agreement that
   this /24 limit would be retained for a long time, such as fifteen
   years or more.

   This proposal would be more attractive if it were accompanied by
   broad agreement about how IPv4 addresses are to be allocated to ISPs
   and other users with Autonomous Systems, particularly about what, if
   any, expectations there would be regarding how the ISPs and other
   users would split and separately advertise their address space.  The
   proposal is intended to facilitate an explosion in the number of IPv4
   BGP routes, to enable a greater number of users to make more
   efficient use of their address space.  However, this will only be
   practical after a number of years in which the SRAM-equipped routers
   replace those which cannot handle the growing number of RIB entries.

5.4.  BGP performance and stability

   If this proposal is implemented, and the expected growth of
   advertised BGP prefixes occurs, all participating routers will be
   required to handle a much greater number of routes and routing
   changes.  It may be expected that improved router CPU speed and
   memory capacity will be able to cope with this, with suitable
   planning.  However, it cannot be assured that the global BGP system
   will remain stable and responsive enough under this increased load.
   The volume of data transacted as part of the BGP protocols and the
   time delays in transferring it are unlikely to be prohibitive.
   However, the time it takes for the increasing number of routers to
   collectively settle after a change in advertised prefixes is likely
   to be a major challenge.

   Perhaps the regular /24 boundaries of the new hardware architecture,
   and the likely increased advertisement of /24 prefixes, will enable
   BGP communication to be optimised in terms of network efficiency or
   to achieve faster convergence and greater stability.

   If this proposal is implemented, there will probably need to be
   changes to the BGP protocol and to administrative standards for
   advertising prefixes (there are few, if any, at present) in order to
   cope with the greater tasks imposed on the global BGP routing system.
   At present, a great deal of the BGP protocol traffic concerns long
   prefixes (small numbers of addresses) which change relatively often.
   This is documented in the "Adds and Wdls per Prefix Length" section
   of Geoff Huston's CIDR report [GIH CIDR Report].

   If there were disincentives to such rapid changes in advertised
   routes - implemented in the configuration of routers, in RIR
   guidelines and rules, and/or in the BGP protocol itself - then the
   changes which are advertised would be more constrained to those which
   are "most necessary".  This value judgement needs to be according to
   the interests of the majority of operators of DFZ routers, who
   directly bear the burden of each BGP change.  An example of BGP
   changes which might be deemed unreasonable in this framework is the
   multiple changes per day generated by an ISP who uses BGP multihoming
   to achieve traffic engineering, for instance for load balancing as
   traffic moves from a business-hours pattern to a residential pattern.

   Disincentives could be imposed on the ISPs and other AS users who
   make changes which are deemed to be excessive.  For instance, if a
   user often changed how they advertised a prefix - whether
   deliberately or due to instability in their network - then beyond
   some kind of limit, this prefix would have longer periods of
   unreachability due to some or many routers giving this prefix's
   changes a low priority.  I understand that a mechanism of this kind,
   route flap damping, is already implemented within many routers,
   primarily to improve stability by not propagating fluctuating changes
   too rapidly.
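
   A disincentive of this kind, commonly known as route flap damping,
   can be sketched as a per-prefix penalty which grows with each change
   and decays exponentially over time.  The class name and all four
   parameter values below are illustrative assumptions, not values taken
   from this document or from any particular router.

```python
class DampedPrefix:
    """Per-prefix penalty that grows with each change and decays over time."""

    def __init__(self, penalty_per_change=1000.0, suppress_at=2000.0,
                 reuse_below=750.0, half_life_s=900.0):
        self.penalty_per_change = penalty_per_change
        self.suppress_at = suppress_at
        self.reuse_below = reuse_below
        self.half_life_s = half_life_s
        self.penalty = 0.0
        self.last = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Halve the penalty for every half-life elapsed since the last update.
        self.penalty *= 0.5 ** ((now - self.last) / self.half_life_s)
        self.last = now

    def record_change(self, now):
        """Register one advertisement change at time now (seconds)."""
        self._decay(now)
        self.penalty += self.penalty_per_change
        if self.penalty >= self.suppress_at:
            self.suppressed = True

    def is_suppressed(self, now):
        """Return True while the prefix's changes are being ignored."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_below:
            self.suppressed = False
        return self.suppressed
```

   With these example values, a prefix that changes three times within a
   few seconds is suppressed, and is accepted again only after its
   penalty has decayed below the reuse threshold.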

5.5.  Alternative arrangements for IPv4

   For the sake of discussion, if the proposed changes are accepted as
   desirable, we might ask how they could be improved or extended,
   either initially or at some time in the future.

   The most obvious option would be to double the SRAM requirement in
   each interface of each DFZ router and extend the BGP prefix length
   limit to /25.  This would have advantages, including enabling a finer
   use of IP address space, such as allowing ISPs and AS users who only
   require a handful of IP addresses to be multihomed to do so with
   smaller allocations of address space.

   It is my impression [RW BGP prefixes analysis] that there is so much
   unused, or very sparsely used, IPv4 address space at present that the
   changes proposed in this Internet Draft would be sufficient to enable
   much better use of address space.  Assuming IPv4 is widely used in
   the decades to come, a long-range plan might be made in the future to
   allow advertisement of /25 prefixes - once the cost and power
   dissipation of the required SRAM is much lower than it is today.


6.  SRAM-based FIB and addressing changes for IPv6

6.1.  Impractical with current global unicast allocations

   Current IPv6 address management policy [IPv6-Policies] [RFC3177]
   [RFC4291] provides for allocation of global unicast addresses within
   the prefix 2000::/3, which fixes the most significant three bits of
   the address to "001".  In general, a /48 prefix will be assigned to
   each end user.  45 bits - 124 to 80 inclusive - vary in this scheme.
   If the longest prefix admitted to the global BGP IPv6 routing system
   is a /32 (as, I think, is current practice) then this still requires
   routers to classify packets based on 29 bits - bits 124 to 96
   inclusive.

   In principle an SRAM system could be used to map these 29 bits
   directly to four or nine bits of FEC data.  However, for a small
   router (one with up to 14 interfaces) this would require 32 72Mb SRAM
   chips.  This is impractical in terms of cost, space and power
   consumption, unless perhaps a single set could serve the entire
   router, rather than one set per interface.  I think it is unlikely
   that this amount of RAM would be practical to install in each
   interface of tens of thousands of routers in the next ten years - or
   perhaps at any time in the future.
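
   The chip count follows from simple arithmetic, sketched below.  The
   sketch assumes, as in the IPv4 discussion, a 72Mb part organised as
   2^23 words of 9 bits, holding two 4 bit FEC values per word for a
   small router or one 9 bit FEC value per word for a large one.  The
   function name is hypothetical.

```python
WORDS_PER_CHIP = 1 << 23          # 72Mb part as 2**23 words of 9 bits
ENTRIES_PER_WORD = {4: 2, 9: 1}   # two 4 bit FECs, or one 9 bit FEC, per word

def chips_needed(index_bits, fec_bits):
    """Chips required to map index_bits of address to a FEC of fec_bits."""
    entries = 1 << index_bits
    per_chip = WORDS_PER_CHIP * ENTRIES_PER_WORD[fec_bits]
    return -(-entries // per_chip)   # ceiling division

ipv4_small = chips_needed(24, 4)   # every /24, 4 bit FEC
ipv4_large = chips_needed(24, 9)   # every /24, 9 bit FEC
ipv6_small = chips_needed(29, 4)   # 29 varying bits of a /32, 4 bit FEC
```

   Under these assumptions the IPv4 cases come to one and two chips, and
   the 29 bit IPv6 case to 32 chips for a small router.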

6.2.  Reallocating IPv6 global unicast addresses

   Assuming the /32 limit is maintained, an SRAM-based FIB architecture
   would be practical if it could be known that all global unicast
   addresses would fall within a smaller range than is currently the
   case.  A router would first process an IPv6 packet to determine
   whether its destination address was within the restricted range
   covered by the SRAM system, and if so use that system to determine
   its FEC.  As with the IPv4 proposal, one value from the SRAM would
   cause the packet to be dropped, another would cause it to be
   processed by conventional (TCAM etc.) techniques and the rest of the
   possible values would specify which interface the packet should be
   forwarded to.

   There are several options for RAM size and for the range to which
   all global unicast addresses must be constrained.  I propose the
   decision be based on ensuring a long-lasting standard, with hardware
   implementation costs which are practical in the short term.  In the
   event that this space becomes overly restrictive at some time in the
   future, such as in one or two decades, a decision could be made to
   double the address space and therefore the SRAM requirements for
   routers.

   One option is to devote two 72Mb SRAMs for small routers and four for
   larger routers - those with between 15 and 510 interfaces - to the
   IPv6 global unicast FIB function.  If IPv6 global unicast addresses
   were reallocated to 2000::/10, this would provide direct hardware
   support for 33,554,432 /35 prefixes, each of which provides 8192 /48
   user networks.  This scheme would support 4,194,304 /32 allocations
   to each ISP or AS end-user - each of which could be broken into eight
   independently advertisable /35 prefixes.
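
   The figures above, and the way a router might derive the SRAM index
   from an IPv6 destination address, can be checked with the following
   sketch.  It assumes the restricted range is 2000::/10 with /35
   prefixes mapped directly, so that the 25 bits following the fixed
   ten form the index.  The function and constant names are
   hypothetical.

```python
import ipaddress

FIXED_BITS = 10                         # all addresses within 2000::/10
MAPPED_LEN = 35                         # longest prefix the SRAM resolves
INDEX_BITS = MAPPED_LEN - FIXED_BITS    # 25 bits of SRAM index
RANGE = ipaddress.ip_network("2000::/10")

def sram_index_v6(addr):
    """Return the 25 bit SRAM index for addr, or None if outside the range."""
    ip = ipaddress.ip_address(addr)
    if ip not in RANGE:
        return None                     # handled by conventional means
    return (int(ip) >> (128 - MAPPED_LEN)) & ((1 << INDEX_BITS) - 1)

prefixes_35 = 1 << INDEX_BITS           # /35 prefixes in the range
nets_48_per_35 = 1 << (48 - MAPPED_LEN) # /48 user networks per /35
allocs_32 = 1 << (32 - FIXED_BITS)      # /32 allocations in the range
per_32 = 1 << (MAPPED_LEN - 32)         # /35 prefixes per /32
```

   An address outside 2000::/10, such as 2040::, yields None here and
   would fall through to whatever non-SRAM path the router provides.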

   To halve the amount of RAM, several changes could be made to this
   scheme.  One is halving the total global unicast space so that
   2,097,152 /32 allocations could be made.  This might be a reasonable
   choice, if it could be shown that this would probably cover
   demand for 15 years or so.  Another approach would be to map /34
   prefixes, rather than /35.  This reduces the maximum number of
   separately advertisable subnets in each /32 from eight to four.
   Larger organisations would then be required to use more than one /32,
   for instance to robustly multihome more than one site.

   Reallocating currently allocated IPv6 global unicast addresses to a
   range 1/128 the size of their current spread raises some
   difficulties.  Firstly, it would require almost all current users to
   renumber their networks.  However, RIRs have long insisted that all
   users, other than large ISPs, should renumber their networks whenever
   they change their connection to the IPv6 Internet - so it does not
   seem unreasonable to expect the RIRs and ISPs to undergo a once-only
   renumbering at this early stage of IPv6 adoption.

   Secondly, a change will be needed in the minds of users and
   administrators, who formerly saw the vast spaces of IPv6 as an asset.
   The goal of address aggregation was seen as being somewhat easier to
   achieve with a vast and uncluttered address space, because even a
   tiny fraction of the total space provides billions of public IP
   addresses.  The trouble with this approach is that it precludes the
   use of SRAM based FIB techniques, leaving IPv6 routing to costly,
   power-hungry, unwieldy techniques such as TCAMs.

   By a stroke of luck, the number of bits in play in IPv4 addressing
   neatly matches mid-2000s SRAM capacities.  The extra 7 or so bits in
   flux within the current IPv6 allocations preclude the use of SRAM -
   the only cost-effective, power-efficient, hardware routing technology
   which currently seems to be available.  IPv6's current address spread
   in bits 124 to 118 might therefore be considered harmful to the long-
   term routability of the Internet.  Fortunately, even if these bits
   are fixed to zero to enable the SRAM FIB architecture, the vast
   capabilities of bits 0 to 92 should not result in shortages of IP
   addresses or inflexibility in usage for many decades, or perhaps
   forever.


7.  Security Considerations

   None.


8.  IANA Considerations

   If this proposal, or something like it, is adopted for IPv6, then
   significant changes will need to be made to the IPv6 Address
   Allocation and Assignment Policy [IPv6-Policies].  Section 3.4,
   concerning address aggregation, would no longer apply.

   These proposals for an SRAM-based FIB architecture for IPv4 may not
   require any changes to Internet usage or IANA standards.  However,
   once implemented globally, the requirement to distribute addresses
   hierarchically to facilitate routing scalability, as expressed in
   section 1.2 of [RFC2050], would no longer apply.  This RFC, in
   November 1996, anticipated improved routing technologies in the
   future: "In the event that routing or router technology develops to
   the point that adequate routing aggregation can be achieved by other
   means or that routers can deal with larger routing and more dynamic
   tables, it may be appropriate to review these constraints."


9.  Informative References

   [GIH BGP prefixes]
              Huston, G., "Geoff Huston's BGP prefixes.txt", March 2007.

   [GIH CIDR Report]
              Huston, G., "CIDR Report", March 2007.

   [Gartner]  Mingay, S., "The IT Industry Is Part of the Climate Change
              and Sustainability Problem", November 2006.

   [Gupta]    Gupta, P., "Pankaj Gupta's thesis and other material on
              routing and forwarding", August 2006.

   [I-D.iab-raws-report]
              Meyer, D., "Report from the IAB Workshop on Routing and
              Addressing", draft-iab-raws-report-01 (work in progress),
              February 2007.

   [IAB-RAWS-website]
              Meyer, D., "IAB Workshop on Routing and Addressing -
              resources and presentations", December 2006.

   [IPv6-Policies]
              IANA, "IPv6 Allocation and Assignment Policy", June 2005.

   [QDR Consortium]
              QDR Consortium, "QDR Consortium", March 2007.

   [RFC2050]  Hubbard, K., Kosters, M., Conrad, D., Karrenberg, D., and
              J. Postel, "INTERNET REGISTRY IP ALLOCATION GUIDELINES",
              BCP 12, RFC 2050, November 1996.

   [RFC3177]  IAB and IESG, "IAB/IESG Recommendations on IPv6 Address
              Allocations to Sites", RFC 3177, September 2001.

   [RFC4291]  Hinden, R. and S. Deering, "IP Version 6 Addressing
              Architecture", RFC 4291, February 2006.

   [RW BGP prefixes analysis]
              Whittle, R., "Probing the density of ping-responsive-hosts
              in each /8 IPv4 prefix and in different sizes of BGP
              advertised prefix", March 2007.

   [Renesas]  Renesas, "R8A20210BG data sheet", February 2005.

   [Samsung SRAM]
              Samsung, "72Mb QDRII data sheets", February 2006.


   [Shah-Gupta]
              Shah, D. and P. Gupta, "Fast incremental updates on
              Ternary-CAMs for routing lookups and packet
              classification", January 2001.

   [Taylor-Spitznagel]
              Taylor, D. and E. Spitznagel, "On using content
              addressable memory for packet classification", March 2005.

   [Wang-Tzeng]
              Wang, G. and N. Tzeng, "TCAM-Based Forwarding Engine with
              Minimum Independent Prefix Set (MIPS) for Fast Updating",
              February 2006.


Author's Address

   Robin Whittle
   First Principles

   Email: rw@firstpr.com.au
   URI:   http://www.firstpr.com.au/ip/


Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).

