Internet Engineering Task Force                            M. Scott, Ed.
Internet-Draft                                            D. Wagner-Hall
Intended status: Informational                              J. Crowcroft
Expires: April 21, 2011                          University of Cambridge
                                                        October 18, 2010


           Addressing the Scalability of Ethernet with MOOSE
                        draft-malc-armd-moose-00

Abstract

   Ethernet does not scale well to large networks.  The flat MAC address
   space, whilst having obvious benefits for the user and administrator,
   is the primary cause of this poor scalability; other recent efforts
   to improve upon Ethernet's scalability have addressed symptoms,
   rather than this underlying cause.  MOOSE, Multi-level Origin-
   Organised Scalable Ethernet, is an Ethernet switch architecture that
   performs in-place rewriting of MAC addresses in order to impose a
   hierarchy upon the address space without reconfiguration or
   modification of connected devices.  This removes the need for
   switches to maintain large forwarding databases, is of direct use in
   implementing improved routing, and allows for a variety of other
   scalability and security innovations.  MOOSE also includes a
   globally-scalable, distributed and resilient protocol for the
   automatic assignment of addresses to switches, and for detecting and
   cheaply resolving addressing conflicts.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 21, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the


Scott, et al.            Expires April 21, 2011                 [Page 1]

Internet-Draft                    MOOSE                     October 2010


   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
     1.1.  Requirements Language  . . . . . . . . . . . . . . . . . .  5
   2.  Ethernet's Underlying Problem  . . . . . . . . . . . . . . . .  5
   3.  Related Work . . . . . . . . . . . . . . . . . . . . . . . . .  6
   4.  MOOSE Architecture . . . . . . . . . . . . . . . . . . . . . .  8
     4.1.  Shortest Path Routing  . . . . . . . . . . . . . . . . . . 11
     4.2.  Address Selection and Conflict Resolution  . . . . . . . . 11
     4.3.  Broadcast and Multicast  . . . . . . . . . . . . . . . . . 14
     4.4.  Example  . . . . . . . . . . . . . . . . . . . . . . . . . 15
     4.5.  Directory Service  . . . . . . . . . . . . . . . . . . . . 16
     4.6.  Mobility . . . . . . . . . . . . . . . . . . . . . . . . . 16
   5.  Interoperability Considerations  . . . . . . . . . . . . . . . 18
     5.1.  Layer-violating Protocols  . . . . . . . . . . . . . . . . 18
     5.2.  Edge Virtual Bridging  . . . . . . . . . . . . . . . . . . 19
   6.  Prototype Implementation . . . . . . . . . . . . . . . . . . . 20
   7.  Conclusions  . . . . . . . . . . . . . . . . . . . . . . . . . 20
   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 20
   9.  Security Considerations  . . . . . . . . . . . . . . . . . . . 20
   10. Informative References . . . . . . . . . . . . . . . . . . . . 20
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 22


Scott, et al.            Expires April 21, 2011                 [Page 2]

Internet-Draft                    MOOSE                     October 2010


1.  Introduction

   Ethernet has lasted well since its inception in the '70s with
   Ethernet frame-structure and addressing remaining ubiquitous in the
   data centre environment as in many others.  Alongside IP and IP-
   transported services such as iSCSI, it is now commonplace to see
   converged network services such as physical disk interfaces and
   cluster interconnects layered directly over Ethernet (e.g.  ATA-over-
   Ethernet and variants of Infiniband).  However, Ethernet exhibits
   scalability issues on networks of more than a few thousand devices,
   such as costly and energy-dense address table logic and storms of
   broadcast traffic.

   Aside from more physical devices, virtualised infrastructure further
   increases the density of Ethernet addresses in data centres.  Widely-
   used layer-2 virtualisation [Cl05] mandates a unique Ethernet address
   per virtual machine.  This means that each physical machine in a data
   centre may represent many tens of Ethernet devices.

   The traditional method of avoiding such problems is the artificial
   subdivision of a network, but this introduces an administrative
   burden, requires significant routing equipment and also precludes
   seamless migration--a necessity for virtualised infrastructure.
   While IP Mobility [RFC3344] addresses the problem of maintaining
   higher-layer connections when roaming between subnets, it requires
   client support that is neither ubiquitous or reliable.  Common
   practice sees the provision of one physical Ethernet network covering
   an entire data centre, or even an entire WAN of data centres.

   Our approach, Multi-level Origin-Organised Scalable Ethernet (MOOSE),
   provides all the advantages of an Ethernet network without the
   capital and running costs and administrative overhead of a IP router-
   based approach.  MOOSE does this by providing a hierarchical
   addressing scheme without requiring host reconfiguration or
   modification.

   Ethernet's scalability is limited firstly by the forwarding database
   that every switch in an Ethernet [802.1D] network must maintain.  A
   switch's forwarding database contains one entry per source address
   seen in any frame passing through that switch, and stores that MAC
   address together with the learnt location of that address--the port
   on which packets from that address were last seen.  This is later
   used to determine on which port to transmit frames destined for that
   address.  Devices frequently broadcast frames throughout the network
   (e.g.  ARP queries) so active devices on the network are listed in
   most switches' forwarding databases most of the time.

   In modern switches the capacity of this database is generally of the


Scott, et al.            Expires April 21, 2011                 [Page 3]

Internet-Draft                    MOOSE                     October 2010


   order of 16,000 entries.  (Higher-capacity forwarding databases exist
   but are currently constrained to very high-end switches.)  On a
   moderately large network, full databases are a serious risk.  If the
   database becomes full, entries will be discarded; frames for unknown
   addresses are flooded to all ports and the resulting traffic storm
   could cause major problems, especially in the presence of low-
   capacity edge links.

   Traditionally the forwarding database has been stored in a content-
   addressable memory (CAM) as lookups must be very fast, particularly
   as 10 Gbit/s Ethernet becomes ubiquitous.  As networks grow, the
   number of entries in a switch's forwarding database must naturally
   increase; however, increasing the capacity of CAMs without
   sacrificing speed whilst constraining energy consumption is proving
   to be challenging.  Cheaper switches use DRAM in place of a CAM, but
   this is likely to remain slower especially for large tables.

   Secondly, Ethernet's inability to handle networks containing loops
   also presents a scalability problem.  The Rapid Spanning Tree
   Protocol, RSTP, must remove loops by disabling any redundant links.
   On a dense mesh network, RSTP will disable a large proportion of
   links; this constrains frames to suboptimal routes and may introduce
   bottlenecks in the network, particularly around the root of the
   spanning tree.  In a data centre environment, this potentially
   amounts to a very large proportion of capacity being wasted wherever
   redundant fibres are installed, e.g. between cabinet switches and
   between data centres.

   Thirdly, not only does Ethernet flood frames destined for unknown
   hosts, but it also uses--and encourages higher-layer protocols to
   use-- broadcast for control messages.  For example, ARP [RFC0826]
   performs address resolution via broadcast queries, and DHCP [RFC2131]
   uses broadcast messages for automatic configuration.  It is
   impractical to replace these protocols entirely as this would require
   software upgrades to every device, but it would be desirable for the
   network to minimise the amount of broadcast traffic required to be
   forwarded.

   In this document we identify the relevant underlying problems in the
   design of Ethernet, review previous work and present the MOOSE switch
   architecture, which addresses inadequacies in the fundamental
   operation of Ethernet in a novel yet backwards-compatible way.  By
   revisiting the addressing scheme itself, rather than simply
   addressing symptoms of the problem as many previous proposed
   solutions have done, we can go about solving all of the above
   scalability problems and more.


Scott, et al.            Expires April 21, 2011                 [Page 4]

Internet-Draft                    MOOSE                     October 2010


1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.


2.  Ethernet's Underlying Problem

   The original Ethernet was a shared-medium network, where every frame
   was broadcast and no switching took place.  Modern-day wired
   Ethernet-based networks instead consist almost entirely of point-to-
   point links; as a result of this, the distinction between unicast,
   broadcast and multicast has become more important. 802.11 wireless
   LANs are the one remaining vestige of Ethernet operating over shared
   media, where one switch (access point) serves many hosts on the same
   radio channel.

   Ethernet's poor scalability arises in various guises, as outlined
   above.  It would seem at first glance that these are entirely
   distinct and unrelated.  However, there is a common underlying cause:
   that MAC addresses provide no location information.

   Globally-unique MAC addresses are structured such that the first
   three bytes of a device's address contain an organisationally unique
   identifier (OUI) allocated to the device's manufacturer by the IEEE,
   with the remaining three bytes allocated by the manufacturer.  This
   hierarchy exists solely for the purpose of allocating unique
   addresses in a decentralised fashion, and is of no use to Ethernet
   switches, which must treat the unicast address space as flat.

   A flat address space has the advantage that no configuration of
   devices is required; a device can use its unique, manufacturer-
   assigned MAC address anywhere on any network.  However, this leaves
   each switch with the task of discovering and storing the location of
   every addressable device.

   If the MAC address space were not flat, but instead contained enough
   information to locate the device possessing the address, several
   advantages would be gained.  Firstly, large forwarding databases
   would no longer have to be maintained on every switch.  This location
   information could instead be distributed across the network so that
   frames are directed towards their destinations according to
   successive stages of a hierarchy.

   Secondly, a hierarchical MAC address space would also make the
   addition of shortest-path routing considerably easier.  Shortest-path
   routing is clearly a desirable property for a network, yet it is one


Scott, et al.            Expires April 21, 2011                 [Page 5]

Internet-Draft                    MOOSE                     October 2010


   that Ethernet does not provide.  Flat addressing does not lend itself
   to easy routing: any address can be located anywhere on the network,
   which means either advertising every host's MAC address via the
   routing protocol--which scales very poorly--or providing some other
   location lookup service.  The use of hierarchical addresses, with
   each switch handling a block of sequential addresses akin to an IP
   subnet, would reduce the routing problem to the one that routing
   protocols were designed to solve.

   Thirdly, this would allow for reduction of broadcast traffic in a
   variety of different ways.  Hierarchical MAC addresses could, for
   example, be mapped directly and deterministically onto the IP address
   space, if appropriate for the specific deployment.  This would allow
   switches to respond directly and simply to DHCP and ARP queries,
   avoiding the need to forward the most common sources of broadcast
   frames.  Alternatively, a distributed directory service can be used,
   which is less limiting and is thus our preferred approach as detailed
   below.

   The facility for network administrators to assign locally
   administered addresses (LAAs) to devices has existed for as long as
   Ethernet.  However, configuring and maintaining the LAA on every
   device based upon where they are connected would be a considerable
   and unwelcome administrative overhead.  We therefore present MOOSE, a
   system for applying hierarchical addressing to an Ethernet
   transparently and without any configuration to edge devices.


3.  Related Work

   It is well-known that traditional Ethernet scales poorly, and there
   have been various attempts in recent years to rectify this.  The most
   widely-used of these in real-world networks is MPLS-VPLS [RFC3031]
   (Multiprotocol Label Switching--Virtual Private LAN Service).  This
   connects Ethernet islands together through tunnels across a MPLS
   cloud.  MPLS works by adding one or more labels to the start of every
   frame, i.e. encapsulating the frame inside its own protocol.

   In MPLS-VPLS, the label edge routers (LERs) must determine the
   frame's initial label(s) based upon the destination address via a
   lookup table.  Frames follow prenegotiated label-switched paths
   (LSPs) that, unlike Ethernet, are not constrained to follow a
   spanning tree; LSPs are precomputed at connection setup time and the
   relevant next hop is stored in a lookup table on each intermediate
   switch.  Each switch must hence use each frame's label to index into
   this lookup table to determine how to switch the frame.

   The effect, once the connection has been negotiated, is to provide


Scott, et al.            Expires April 21, 2011                 [Page 6]

Internet-Draft                    MOOSE                     October 2010


   what appears to be one or more large Ethernet networks, transparently
   overlaid on the MPLS cloud.  Whilst this solves effectively the
   problem of shortest-path routing across the MPLS cloud, the overlay
   Ethernets are still susceptible to the usual scalability problems--
   and in fact VPLS adds further large lookup tables on every switch
   that can in some configurations scale even worse than Ethernet's
   forwarding databases.  LERs must map every MAC address to a LSP;
   label switch routers (LSRs) must store the next hop for every LSP in
   which they participate, which in the core of the network could scale
   as O(hosts^2).

   A similar scheme is proposed by Hadzic [Ha01], with the difference
   that Ethernet-inside-Ethernet encapsulation is used rather than a new
   protocol.  This has the advantage that less processing is required on
   intermediate switches in the backbone network.  However, routes
   across the backbone are constrained to a spanning tree, and
   encapsulating switches must obtain a new destination address for
   every frame using a lookup table that--like Ethernet's forwarding
   database--must contain every transmitting MAC address.  Due to its
   heavy basis on Ethernet, this shares many of Ethernet's scalability
   problems.

   SmartBridge [Ro00] and RBridges [Pe04] (TRILL [RFC5556]) both
   encapsulate Ethernet frames in a new inter-switch protocol, and run a
   link-state routing protocol between switches.  The link state graph
   includes the location of every MAC address--necessary because the
   address space remains flat and any address could appear
   anywhere--i.e. it again contains every host.  Furthermore, switches
   must perform expensive computation to update routing tables whenever
   a MAC address joins or leaves the network.

   Myers et al [My04] suggest that Ethernet's main failing is its
   broadcast service, and propose a new architecture in which hosts make
   explicit use of directory services operated by switches rather than
   broadcasting queries.  It is clear that switches' participation is
   necessary in order to deal with the broadcast problem; however the
   modifications to Ethernet suggested are not backwards-compatible and
   would require at least software modifications to all connected
   devices.  Ethernet is, perhaps unfortunately, too widespread for this
   to be practical; transparent interception of broadcast frames and
   subsequent local handling or redirection via multicast or unicast
   remains the only practical solution.  The use of hierarchical
   addressing is a useful stepping-stone to such a system, and our
   architecture includes a transparent directory service (ELK) for this
   purpose.

   SEATTLE [Ki08] takes a more scalable approach.  A routing protocol is
   operated between switches, but in contrast to the approaches


Scott, et al.            Expires April 21, 2011                 [Page 7]

Internet-Draft                    MOOSE                     October 2010


   described above and in common with MOOSE, the routing protocol only
   propagates switch location information, rather than every MAC address
   on the network.  Flat MAC addresses are still used, and hence a
   mechanism is required to look up the switch to which a given address
   is connected.  This is achieved by using a distributed hash table
   (DHT) operating on participating switches with local caching to
   alleviate load.  This is certainly a step in the right direction but
   introduces considerable complexity to switches, since they now must
   maintain and update the DHT continually, and it is clear that a
   SEATTLE switch would have a significant software component in the
   data path.  MOOSE alleviates some of the complexity of SEATTLE by a
   combination of hierarchical addresses and delegation to a separate
   directory service.


4.  MOOSE Architecture

   The basic operation of MOOSE is to assign a new hierarchical MAC
   address to each host on the network, assigned dynamically and
   automatically from the unicast LAA space.  This dynamically-assigned
   address is referred to as a MOOSE address to avoid confusion with
   hosts' static, manufacturer-assigned MAC addresses.

   Every frame entering the network has its source address rewritten in-
   place to the sending host's MOOSE address by the first MOOSE-aware
   switch it traverses.  The switch that performs address rewriting for
   a host--i.e. the closest MOOSE switch to that host--is the host's
   home switch and is responsible for assigning a MOOSE address to that
   host.  (If non-MOOSE switches or hubs are in use, a host may have
   more than one "closest" MOOSE switch, in which case an RSTP-like
   protocol must be used to elect a switch to handle each edge segment.)

   The destination address is left intact in the expectation that it
   already is a MOOSE address.  Hosts' ARP caches will already contain
   the MOOSE addresses of any hosts being communicated with as any
   packet received will already have had its source address rewritten; a
   host's manufacturer-assigned MAC address is never seen outside of the
   segment containing that host.  This is a crucial point since
   encapsulation-based technologies such as MPLS do not reveal to the
   destination host the address used for routing; as a result, switches
   must also convert destination as well as source addresses of frames
   entering the network.  In other words, once again switches must
   maintain large tables of remote hosts on the network.  The only
   destination rewriting that MOOSE switches perform, however, is of the
   destination addresses of frames destined for local hosts back to
   their manufacturer-assigned MAC addresses; this is simple as the
   required information is already known, and necessary because
   otherwise that host's network interface card would discard the frame


Scott, et al.            Expires April 21, 2011                 [Page 8]

Internet-Draft                    MOOSE                     October 2010


   as misaddressed.

   A MOOSE address consists of a switch identifier followed by a host
   identifier.  For our examples, we simply use a fixed three-byte
   switch identifier followed by a fixed three-byte host identifier:

   +----------+     +----------+
   |  switch  |_____|  switch  |_ _ _ _ hosts 02:22:22:00:00:01,
   | 02:11:11 |     | 02:22:22 |              02:22:22:00:00:02, etc.
   +----------+     +----------+
                         |
                         |
                    +----------+
                    |  switch  |_ _ _ _ hosts 02:33:33:00:00:01,
                    | 02:33:33 |              02:33:33:00:00:02, etc.
                    +----------+

   Since these two identifiers when concatenated must form a unicast
   LAA, the settings of two bits in the first byte of the switch
   identifier are fixed: the least significant bit must be 0 to indicate
   a unicast address, and the second-least significant bit must be 1 to
   indicate a LAA.  To cater for variable length switch identifiers,
   some means of introducing separation between the switch and host
   identifiers is required.  Two possible implementations would be for:

   1.  the first three bits of the address to indicate how many of the
       following 5-bit blocks make up the switch prefix;

   2.  some constant delimiter to appear between the switch identifier
       and host identifier, with switch identifiers not allowed to
       contain the delimiter.

   The former is simple and gives eight classes of switch identifier.
   Because the size of a MOOSE network is limited by the placement of IP
   routers, these classes should be sufficient.  Additionally, because
   switches are free to change their identifiers, they may trivially
   switch to a larger class if they have too many attached hosts, or if
   a smaller class becomes full.

   The latter removes the fixed classes, allowing for more flexibility
   with the sizes of switch identifiers, at the cost of complexity, and
   a reduction in the available address space.

   Each switch can select for itself a unique switch identifier, as
   identifier conflict resolution is cheap (see below).  When first
   joining the routing protocol, conflict should be very unlikely, as
   the switch will in the process gain an up-to-date list of in-use
   identifiers.  Depending on requirements, the switch identifier may


Scott, et al.            Expires April 21, 2011                 [Page 9]

Internet-Draft                    MOOSE                     October 2010


   itself be a hierarchical address--e.g. six bits to identify a network
   area followed by two bytes to identify a switch within that area--
   which could then be used to aid routing decisions.

   Each host is assigned a host identifier by its home switch from the
   pool of identifiers available to that switch.  Only a host's home
   switch ever bases a switching decision on the host identifier, so the
   detail of how these are allocated can vary from switch to switch.
   Suitable schemes include:

   1.  sequential assignment;

   2.  the port number followed by a sequential portion (to allow for
       multiple hosts connected to one port);

   3.  a hash of the host's real MAC address.

   The latter two approaches are preferable to a simple sequential
   assignment, as they better isolate certain kinds of denial-of-service
   attack in which a malicious host attempts to use up all available
   host identifiers on the switch.  They also require less state to be
   shared between ports.  The third option has the further advantage
   that it is deterministic and hence can be recovered easily in the
   event of a crash.

   It is hence possible to route frames through the network to remote
   hosts by simply inspecting the switch identifier in the destination
   address, and ignoring the host identifier until the frame reaches the
   destination host's home switch.  Switches no longer need to keep a
   table of all MAC addresses seen recently; they only need store the
   locations of other switches and of any directly-connected hosts.

   As well as reducing the amount of data that must be consulted in
   order to make switching decisions, this provides extra resilience by
   making this data much more predictable.  The number of MAC addresses
   in a network can increase unexpectedly in the event of an address
   flooding attack or even under normal operation if the network
   contains open wireless access points; relying on the MAC address list
   for forwarding leads to some of the vulnerabilities of Ethernet.  The
   set of switch identifiers participating in MOOSE switching, on the
   other hand, is kept predictable and manageable by ensuring that
   neighbouring switches (discovered using LLDP [802.1AB]) are
   authenticated before they can participate in the routing protocol.
   This authentication can be achieved at layer 3 using the security
   features found in most popular routing protocols and/or at layer 2
   [802.1X].  As the switch identifier is the only address consulted for
   forwarding decisions, a MOOSE switch is likely to remain reliable in
   the face of attacks that could have brought down a traditional


Scott, et al.            Expires April 21, 2011                [Page 10]

Internet-Draft                    MOOSE                     October 2010


   Ethernet.  Furthermore, any attacks based upon MAC address spoofing
   cannot function on a MOOSE network as the user-provided MAC address
   is translated immediately.

4.1.  Shortest Path Routing

   As described so far, MOOSE switches must still forward frames along a
   spanning tree.  As discussed above, this is an undesirable property
   of Ethernet as it can cause frames to take a highly suboptimal path
   through the network.  The foundations are in place to do much better
   than this using shortest-path routing.

   For the purpose of frame forwarding, a MOOSE switch can be considered
   akin to a layer 3 router; it has one locally-connected subnet--
   containing all addresses starting with its switch identifier--and
   delivers frames to other subnets by passing them to an appropriate
   neighbouring switch.  Bearing this in mind, the switch can run a
   routing protocol of the kind normally used for IP, such as a variant
   of OSPF [RFC2328].  This allows frames to be routed along the
   shortest available path, rather than being constrained to a spanning
   tree.  A multipath variant such as OSPF-OMP may be particularly
   desirable due to its ability to make use of multiple equal-cost
   routing paths in order to improve performance.

4.2.  Address Selection and Conflict Resolution

   For reasons akin to those of the flaws of Ethernet, it is undesirable
   to guarantee universally unique pre-determined MOOSE switch
   identifiers.  Due to the reduced size of the switch ID space compared
   to the MAC address space, this would also be infeasible.  We
   therefore propose that each switch selects an initial address for
   itself during startup.  This could result in more than one switch
   claiming an address, which would be undesirable, so to mitigate the
   potential for MOOSE addresses to find themselves in conflict we
   additionally propose a simple and inexpensive conflict resolution
   protocol.

   Suppose two switches each have the same identifer.  We note that if
   these switches are on separate MOOSE networks (on disconnected
   networks, or separated by an IP router), this situation brings no
   issue.  Should they be on the same MOOSE network, however, a conflict
   exists and must be resolved.  Any routing protocol would require a
   switch to know which port other switches are connected to, for
   instance by OSPF neighbour lists, or simply by receiving frames and
   noting the switch port and source MOOSE address.  When a switch
   receives a MOOSE frame, it looks up the source switch in its
   forwarding database, which is likely in fast Content Addressable
   Memory.  If it finds that source switch to be on a port other than


Scott, et al.            Expires April 21, 2011                [Page 11]

Internet-Draft                    MOOSE                     October 2010


   that which it recognises from its table, one of three situations may
   be possible:

   1.  the source switch may be the same as the known switch, and have
       physically moved, or a topology change has occurred;

   2.  the source switch may be a different one to the known switch, and
       they are in conflict;

   3.  the source switch may be the same as the known switch, but is
       sending frames down a different route to the last used route.

   To avoid disruption to the network in the first case, and to give
   scope for switches to migrate within the network, the switch which
   detected the possible conflict should ascertain whether the known
   switch is still alive and present.  The conflict-resolving switch
   thus attempts to send a unicast frame to the known switch, via the
   port stored in the forwarding database, asking whether it is there at
   a regular interval until a timeout.  This will reach the known switch
   rather than the new switch if it is still present as other switches
   beyond that port must not have detected the conflict yet.  The nature
   of the timeout we leave unspecified, and can be implementation
   specific.  It may, for instance, be a pre-defined constant, or it may
   vary based on QoS information gathered if such capabilities are
   supported.  When a MOOSE switch receives such a frame, it should
   promptly respond with an acknowledgement frame, showing that it is
   alive.

   If, within the timeout period, the conflict resolver finds the known
   host not to be alive, no conflict exists, so the switch updates its
   view of the network by removing the old entry from its forwarding
   database and triggering a routing protocol refresh.

   If, on the other hand, the host is found to be alive, a conflict
   exists.  The conflict resolver then sends a frame to the more
   recently found switch indicating that it is in conflict and should
   change its address.  That switch, upon receiving this frame, changes
   its address and sends a gratuitous ARP for each of its connected
   hosts, so that the rest of the network is aware of the change.  To
   mitigate the risks of a denial of service attack, or faulty equipment
   sending out conflict frames, an exponential backoff algorithm should
   be used when receiving conflict notification frames.

   A switch should have a timer, and counter influencing the maximum
   value of the timer, both initialised to 0.  When a conflict
   notification frame is received, the counter is incremented (subject
   to a saturation value to avoid excessive timeouts).  After a conflict
   has been resolved--i.e. the switch has changed its address--a timer


Scott, et al.            Expires April 21, 2011                [Page 12]

Internet-Draft                    MOOSE                     October 2010


   starts counting down from some time exponential in that counter;
   subsequently the switch will only change its address if the timer has
   returned to 0 by the time the conflict frame is received.  The
   counter should be reset to 0 when the timer reaches 0.  Using this
   scheme the event of true conflict is handled quickly, even in the
   unlikely case that the newly acquired address is also in conflict.
   Any node emitting malicious or erroneous conflict notifications,
   however, is rate-limited enough that their damage potential is much
   restricted, subject to a sufficient timer being chosen.

   Pseudocode: Conflict resolution backoff:

   if timer > 0:
       if counter < counter_max:
           counter = counter + 1
       # Discard conflict notification frame
   else:
       timer = k^counter
       change_address()

   Pseudocode: Conflict resolution timer:

   foreach clock tick do:
       if timer > 0:
           timer = timer - 1
       else:
           counter = 0

   This could be further enhanced by detecting repeated conflicts
   involving the same switch or switches, in a manner similar to BGP
   Route Flap Damping [RFC2439], and performing more aggressive steps to
   avoid further conflicts--for example using a significantly increased
   timeout, and/or having *both* switches in conflict select new
   addresses.

   The conflict resolution algorithm brings a marked improvement on the
   equivilent vulnerability of Ethernet, that MAC addresses can be
   spoofed.  We build in a flexible, well-defined system of recovery.
   The decentralised nature of the system makes it much less open to
   denial of service attack than any centralised directory may be.
   Having every MOOSE switch acting as a barrier to the propagation of
   packets from addresses in conflict provides a strong separation
   between recently bridged networks with conflicting addresses, so that
   communication within the individual networks may continue without
   modification, until bridge-crossing traffic appears, at which point
   resolution quickly happens.  We also remove the possibility for
   forwarding databases to frequenty have to switch their entry for a
   conflicted address, which can happen with MAC conflicts in


Scott, et al.            Expires April 21, 2011                [Page 13]

Internet-Draft                    MOOSE                     October 2010


   traditional Ethernet.  Additionally, in the case of a switch
   identifier spoofing attack, the conflict resolver acts as a hard
   boundary for the effects of such an attack.

   It is possible that the switch performing conflict resolution could
   send a suggested replacement switch address to the switch in
   conflict, known by the conflict resolver to have a low probability of
   being present on the network (because it is not present in its
   forwarding database).  This would reduce the chance of repeated
   collisions, and potentially allow for longer backoff periods, but may
   be premature optimisation.

   Because multi-path routing is often desirable, we could introduce an
   extra datum during the source address rewriting performed by MOOSE
   switches.  When an ingress MOOSE switch rewrites the source address
   of an Ethernet frame to a MOOSE address, it could also prepend some
   hash of its manufacturer-assigned MAC address to the data field, and
   increment the length field as necessary.  The egress switch, when
   rewriting the MOOSE destination address to a host's MAC address, then
   strips out this added datum.  This allows the conflict resolver to
   check whether conflicts actually exist by local lookup, rather than
   probing other switches, at the cost of added memory requirements in
   every switch.  This may push the frame to be larger than Ethernet's
   maximum, so may require fragmenting the packet into two, at small
   added cost.  Alternatively, assuming jumbo frames are permitted by
   the hardware, the maximum frame size could be marginally reduced to
   allow for this in the same manner as for 802.1Q VLAN tags.

   From the cheapness of conflict resolution, certain other address
   management tasks become simple.  A switch is free to choose its
   address when it joins the network however it wishes--attempting to
   re-use its last-used address, from a list of preferred addresses, or
   by generating an address entirely at random.  More intricate
   addressing schemes may be used on managed networks if desired,
   perhaps encapsulating deeper layers of hierarchy.

4.3.  Broadcast and Multicast

   Since Ethernet does still need to support arbitrary broadcast frames,
   these must still be forwarded along a spanning tree in order that
   they reach each host exactly once.  An explicit spanning tree
   protocol is not required however, as the tree can be deduced from the
   routing table via reverse path forwarding in a similar manner to
   Protocol-Independent Multicast (PIM) [RFC3973].  In other words,
   broadcast packets are routed as if they had been sent to the all-
   hosts multicast group.

   More general multicast groups can be implemented using a combination


Scott, et al.            Expires April 21, 2011                [Page 14]

Internet-Draft                    MOOSE                     October 2010


   of IGMP snooping [RFC4541] as used by modern Ethernet switches, and
   participation of the MOOSE switches in PIM routing.

4.4.  Example

   To illustrate the basic behaviour of MOOSE switches, before we go on
   to describe further features, we will offer a simple example.  We
   will describe the steps involved in forwarding a broadcast frame
   containing a query in some higher-layer IPv4-based protocol, and
   subsequent unicast frame containing the response, between two hosts A
   and B via three MOOSE switches 02:11:11, 02:22:22 and 02:33:33.

4.4.1.  Query

   1.  Host A transmits the broadcast query frame as it would on any
       Ethernet network, with its own manufacturer-assigned MAC address
       in the Ethernet header's source field and the broadcast address
       (FF:FF:FF:FF:FF:FF) in the destination field.

   2.  The frame is received by switch 02:11:11, which observes the non-
       MOOSE address in the frame's source field, and rewrites the
       source field into a MOOSE address containing the switch
       identifier and the appropriate host identifier.  As this is Host
       A's first frame, the switch must allocate a host identifier (in
       this case 00:00:01, making Host A's complete MOOSE address 02:11:
       11:00:00:01).

   3.  The three switches broadcast the frame using reverse path
       forwarding away from Host A.

   4.  The frame is received by Host B (and any other hosts on the
       network) in its current form; no further rewriting is performed.

4.4.2.  Response

   1.  Host B looks up Host A's IP address in its ARP cache to determine
       a suitable destination address for the response frame.  Since the
       rewritten query frame arrived at Host B with the source field
       containing the MOOSE address 02:11:11:00:00:01, this is the
       address returned by the cache lookup.

   2.  As above, switch 02:33:33 assigns a MOOSE address to Host B (02:
       33:33:00:00:01) and rewrites the source address of the frame.

   3.  The frame is now routed through the network based solely on the
       destination switch identifier--the host identifier is ignored for
       now.  The routing table is consulted for the location of switch
       02:11:11 and the frame is forwarded accordingly.


Scott, et al.            Expires April 21, 2011                [Page 15]

Internet-Draft                    MOOSE                     October 2010


   4.  On receiving the frame, switch 02:11:11 observes that it is
       destined for a directly-connected host (02:11:11:00:00:01).  It
       prepares the frame for transmission along its final hop by
       rewriting the destination address to Host A's manufacturer-
       assigned MAC address.  The source field of the frame is again
       left as the MOOSE address of Host B in order that this address is
       used for any further communication with Host B.

4.5.  Directory Service

   A directory service, Enhanced Lookup (ELK), runs in conjunction with
   the basic MOOSE switch described so far.  ELK exists to handle ARP
   and DHCP queries in a broadcast-free manner by learning mappings from
   IP addresses to MOOSE addresses.  The master ELK directory is served
   by one or multiple systems for resilience and is reached using an
   anycast MOOSE address; the layer-2 anycast feature is a convenient
   side-effect of running a routing protocol.  Slave copies of the
   directory can be held nearer the edge of the network in order to take
   load away from the masters; slaves can be reached for lookups via a
   separate anycast address, and the entire herd of ELK can be kept
   synchronised via the masters using a combination of multicast and
   unicast.

   MOOSE switches intercept ARP and DHCP packets broadcast by hosts and
   convert them into anycast ELK queries to the nearest slave (for ARP)
   or master (for DHCP).  (DHCP handling could make use of the
   protocol's existing DHCP relay mechanism.)  The ELK slave answers ARP
   queries directly using information in the directory; as it does so,
   if the query is from a host not in the directory, it learns the
   sender's IP address to MOOSE address mapping.  The ELK master can
   also act as a DHCP server, populating the ELK directory as it grants
   IP address leases to clients.

   The one case in which the ELK directory will not contain the answer
   to a query is when answering an ARP request for a host that is not
   configured to use DHCP and that has not yet itself sent an ARP packet
   (i.e. has not yet communicated via IP).  This must be dealt with by
   flooding the query to every active switch port, in a manner akin to
   current Ethernet switches, and caching the result in the ELK
   directory.  Although this is not ideal, it is necessary in order to
   deal with this scenario in a compatible manner, and is unlikely to
   happen frequently.

4.6.  Mobility

   A consequence of introducing location-based hierarchy into MAC
   addresses is the need to explicitly handle host mobility.  In a
   traditional Ethernet, hosts can migrate between switches as the


Scott, et al.            Expires April 21, 2011                [Page 16]

Internet-Draft                    MOOSE                     October 2010


   switches will learn the host's new location as soon as it sends a
   frame.  With MOOSE, if a host relocates to a new switch its address
   changes and any ARP cache entries on other hosts pertaining to the
   migrated host become incorrect; frames will continue to be sent to
   the host's old location for a while.  There are two strategies for
   dealing with this, which can be used separately or in conjunction:

   1.  The previous home switch of the migrated host can forward frames
       sent to the host's old address until outdated ARP cache entries
       expire.  This is similar to IP Mobility: the previous home switch
       essentially becomes a care-of agent for the host.  However,
       unlike IP Mobility, it requires no host support.  A handover
       protocol is necessary for the old and new home switches to set up
       such forwarding: on the arrival of a new host at a switch, that
       switch would ask all other switches (via multicast) whether any
       had seen this host before, identifying it using its manufacturer-
       assigned MAC address, and would instruct such switches to
       redirect frames.

   2.  A broadcast ARP announcement (or "gratuitous ARP") can be sent by
       the new home switch to immediately update remote ARP caches and
       the ELK directory with the new MOOSE address.  This is the
       technique used by Xen when migrating live virtual machines.
       Unlike the previous approach, this works even if the previous
       switch is no longer reachable, for example if this host migration
       was as a result of a switch failure.  This is a simpler approach
       as a handover protocol is not required, but results in additional
       broadcast traffic.

   Unless the frequency of host migrations is very high, the additional
   load introduced by either mobility approach is expected to be
   negligible.


Scott, et al.            Expires April 21, 2011                [Page 17]

Internet-Draft                    MOOSE                     October 2010


   Illustration of the two ways to handle a host A roaming onto another
   switch whilst maintaining communication with another host B:

      (1)           +--------+
   ##============== | Host B | <=== ARP ===##  (2) gratuitous
   ||               +--------+             ||   ARP sent by
   ||                   |                  ||   new home switch
   ||                 +---+                ||
   ||    .------------| X |------------.   ||
   ||   /             +---+             \  ||
   \/  |                                 | ||
     +---+     (1) data forwarded      +---+
     | X | ==========================> | X |
     +---+      by care-of switch    ||+---+
       |                             \/  |
   + - - - +                         +--------+
   |       |- - host relocated to - >| Host A |
   + - - - +       new switch        +--------+


5.  Interoperability Considerations

5.1.  Layer-violating Protocols

   In an ideal world, free from layering violations, all layer 3
   protocols would operate correctly on top of MOOSE in exactly the same
   way that they currently operate on top of Ethernet, with no protocol-
   specific handling necessary in the switch.  In reality, however,
   protocols abound which use hosts' MAC addresses for purposes other
   than layer 2 addressing or which place MAC addresses in the frame
   payload.  DHCP and ARP have already been mentioned as such protocols
   which must be specifically handled by edge switches in order to
   operate; luckily, the rewriting required for these important
   protocols is simple.

   Of particular concern are recent standards for layering on top of
   Ethernet protocols which were previously used solely on dedicated
   hardware interconnects, such as Fibre Channel over Ethernet (FCoE
   [FC-BB-5]).  In order to support FCoE and similar protocols on a
   MOOSE network, each edge switch will need to be able to interpret and
   rewrite individual protocols that are in use.  A production MOOSE
   switch would, therefore, need to be implemented such that it is
   possible to add rewriting support for additional protocols after
   manufacture, for example by loading an additional software or FPGA
   configuration module.

   Ultimately, in the general case, this problem could be addressed more


Scott, et al.            Expires April 21, 2011                [Page 18]

Internet-Draft                    MOOSE                     October 2010


   satisfactorily by extending the Ethernet standard to provide a
   protocol-agnostic method for a layer 2 network to inform hosts of
   their own addresses; LLDP [802.1AB] would make a good basis for this
   extension.  This would allow the use of network-assigned MAC
   addresses for any protocol, with some rewriting performed either
   partially (within the frame payload) or fully by the host itself, and
   furthermore would allow higher-layer protocols to respond to changes
   of the host's network-assigned address (e.g. due to mobility).  Such
   a mechanism could be deployed incrementally as needed, with switches
   able to perform address rewriting for hosts which are not able to do
   this themselves.  This is, however, a very long-term solution, and
   protocol-specific rewriting on the switch is likely to be required
   for the foreseeable future.

   FCoE in particular is unusual, however, as it already does its own
   dynamic allocation of MAC address to devices.  It is conceivable that
   an extension to FCoE could be developed which allows a network-wide
   dynamic address assignment scheme such as MOOSE to be exploited to
   provide addresses directly to fibre channel devices.

5.2.  Edge Virtual Bridging

   The rise of virtualisation has caused an unanticipated proliferation
   of software switches, usually in the host operating system or
   hypervisor which provides network connectivity to multiple virtual
   machines.  Since software switches are almost always neither fast nor
   centrally manageable in the same way as hardware switches, there is
   ongoing work to standardise--by Cisco as Port Extension and by the
   IEEE as Edge Virtual Bridging [P802.1Qbg]--a means of making these
   software switches act merely as additional ports which are logically
   part of a more central hardware switch.  This reduces the work
   required by a virtual edge switch: frames from local virtual edge
   ports can be forwarded straight out via the uplink to a physical
   switch without consideration, and frames from the uplink will arrive
   simply tagged with a virtual edge port identifier.

   (The scope of Port Extension in particular is greater than this, and
   allows for physical port extenders to exist in place of switches
   where a large number of ports but a small amount of processing is
   required, but virtualisation is likely to be the most significant use
   case.)

   Edge Virtual Bridging and Port Extension require very little
   adaptation to be implemented on a MOOSE switch.  It is unlikely,
   although too early in the standardisation process to say for certain,
   that the virtual bridge will need to be MOOSE-aware.  A virtual-
   bridging-aware physical MOOSE switch will thus simply need to take
   into account the possibility that one physical port may hide a large


Scott, et al.            Expires April 21, 2011                [Page 19]

Internet-Draft                    MOOSE                     October 2010


   number of virtual ports when allocating host identifiers, as it would
   if it had an Ethernet switch connected on that port.  If, however,
   the virtual bridge is made MOOSE-aware, the hierarchical addressing
   of MOOSE could be exploited to allow the virtual bridge to allocate
   host identifiers itself, given that it is likely to be aware of the
   exact number and nature of virtual edge ports.  The parent MOOSE
   switch would accordingly allocate an address prefix to each child
   virtual bridge, and hosts' full MOOSE addresses could be formed as:

   SWITCH ID   :   CHILD ID   :   HOST ID
    (parent)      (allocated     (allocated
                   by parent)     by child)


6.  Prototype Implementation

   We have implemented a MOOSE switch in OpenFlow and NOX, which can be
   run on off-the-shelf switches.  Details can be found in our paper
   [Wa10].


7.  Conclusions

   Ethernet remains popular due to its simplicity and ubiquity, but is
   showing its age and exhibits serious scalability issues in large
   deployments.  Previously-proposed improvements address either a few
   of the problems in a simple way, or most of the problems in a highly
   complex or backwards-incompatible way.  We have demonstrated a
   simple, novel and easily-implementable approach for significantly
   boosting the scalability of Ethernet, which has a working prototype
   switch firmware implementation.


8.  IANA Considerations

   This memo includes no request to IANA.


9.  Security Considerations

   Security will be considered in a later revision of this document.


10.  Informative References

   [802.1AB]  IEEE, "802.1AB: Station and Media Access Control
              Connectivity Discovery", 2009.


Scott, et al.            Expires April 21, 2011                [Page 20]

Internet-Draft                    MOOSE                     October 2010


   [802.1D]   IEEE, "802.1D: Standard for Local and Metropolitan Area
              Networks: Media Access Control (MAC)", 2004.

   [802.1X]   IEEE, "802.1X: Port Based Network Access Control", 2004.

   [Cl05]     Clark, C. and others, "Live Migration of Virtual
              Machines", USENIX NSDI 2005, 2005.

   [FC-BB-5]  T11 FC-BB-5 working group, "Fibre Channel Backbone - 5",
              June 2009.

   [Ha01]     Hadzic, I., "Hierarchical MAC Address Space in Public
              Ethernet Networks", IEEE GLOBECOM vol 3, 2001, 2001.

   [Ki08]     Kim, C., Caesar, M., and J. Rexford, "Floodless in
              SEATTLE: A Scalable Ethernet Architecture for Large
              Enterprises", ACM SIGCOMM 2008, 2008.

   [My04]     Myers, A., Ng, E., and H. Zhang, "Rethinking the Service
              Model: Scaling Ethernet to a Million Nodes", ACM SIGCOMM
              Workshop on Hot Topics in Networking 2004, November 2004.

   [P802.1Qbg]
              Jeffree, A., Congdon, P., and J. Pelissier, "P802.1Qbg:
              Edge Virtual Bridging", September 2009.

   [Pe04]     Perlman, R., "RBridges: Transparent Routing", Proc.
              INFOCOM vol 2, 2005, March 2004.

   [RFC0826]  Plummer, D., "Ethernet Address Resolution Protocol: Or
              converting network protocol addresses to 48.bit Ethernet
              address for transmission on Ethernet hardware", STD 37,
              RFC 826, November 1982.

   [RFC2131]  Droms, R., "Dynamic Host Configuration Protocol",
              RFC 2131, March 1997.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

   [RFC2439]  Villamizar, C., Chandra, R., and R. Govindan, "BGP Route
              Flap Damping", RFC 2439, November 1998.

   [RFC3031]  Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol
              Label Switching Architecture", RFC 3031, January 2001.

   [RFC3344]  Perkins, C., "IP Mobility Support for IPv4", RFC 3344,
              August 2002.


Scott, et al.            Expires April 21, 2011                [Page 21]

Internet-Draft                    MOOSE                     October 2010


   [RFC3973]  Adams, A., Nicholas, J., and W. Siadak, "Protocol
              Independent Multicast - Dense Mode (PIM-DM): Protocol
              Specification (Revised)", RFC 3973, January 2005.

   [RFC4541]  Christensen, M., Kimball, K., and F. Solensky,
              "Considerations for Internet Group Management Protocol
              (IGMP) and Multicast Listener Discovery (MLD) Snooping
              Switches", RFC 4541, May 2006.

   [RFC5556]  Touch, J. and R. Perlman, "Transparent Interconnection of
              Lots of Links (TRILL): Problem and Applicability
              Statement", RFC 5556, May 2009.

   [Ro00]     Rodeheffer, T., Thekkath, C., and D. Anderson,
              "SmartBridge: A Scalable Bridge Architecture", ACM
              SIGCOMM 2000, 2000.

   [Wa10]     Wagner-Hall, D., "A Prototype Implementation of MOOSE on a
              NetFPGA/OpenFlow/NOX Stack", First European NetFPGA
              Developers' Workshop Cambridge, September 2010.


Authors' Addresses

   Malcolm Scott (editor)
   University of Cambridge
   15 JJ Thomson Ave
   Cambridge,   CB3 0FD
   UK

   Phone: +44 1223 763500
   Fax:   +44 1223 334678
   Email: Malcolm.Scott@cl.cam.ac.uk
   URI:   http://www.cl.cam.ac.uk/~mas90/MOOSE/


   Daniel Wagner-Hall
   University of Cambridge

   Email: dwh@cantab.net


Scott, et al.            Expires April 21, 2011                [Page 22]

Internet-Draft                    MOOSE                     October 2010


   Jon Crowcroft
   University of Cambridge
   15 JJ Thomson Ave
   Cambridge,   CB3 0FD
   UK

   Phone: +44 1223 763500
   Fax:   +44 1223 334678
   Email: Jon.Crowcroft@cl.cam.ac.uk


Scott, et al.            Expires April 21, 2011                [Page 23]