Network Working Group                                     I. van Beijnum
Internet-Draft                                              June 9, 2006
Expires: December 11, 2006


                     Inter-Domain Link Enumeration
                     draft-van-beijnum-idle-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on December 11, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

This document discusses the architecture of a successor to BGP4. The
problems surrounding inter-domain routing and the BGP protocol have been
studied within the IETF, the IRTF and elsewhere, but what's required to
actually solve some of these issues hasn't been explored much. That's
what this document tries to accomplish.

1 Introduction

Within the IETF, the notion that a problem statement and a list of
requirements must precede all protocol development is very pervasive.
And for good reason: without knowing what problem to solve and which
requirement a solution must address, it's impossible to even determine
success or failure. As such, the effort presented here may seem


Van Beijnum             Expires December 11, 2006               [Page 1]
Internet-Draft        Inter-Domain Link Enumeration            June 2006


premature. On the other hand, "when the only tool you have is a hammer,
every problem looks like a nail."

Currently, there are no efforts within the IETF to make more than fairly
modest evolutionary improvements to inter-domain routing. Given the
facts that anything more ambitious is going to take a lot of time, and
that improving the security of BGP as is now being considered in the
RPSEC and SIDR working groups will require significant efforts
regardless of the size of the changes to the existing BGP protocol, it
makes a lot of sense to entertain the idea of a new inter-domain routing
protocol at this time.

The author is of the opinion that the current state of BGP makes it hard
to impossible to fix one flaw because of all the others. This document
tries to make the case that biting the bullet and fixing all significant
flaws in one go is a better way forward.

The basic idea is that a router or AS has a link state view of part of
the network, and prefixes are injected at the edges of that part of the
network. So when reachability close to a router changes, the router
learns this through link state updates, but when reachability far away
changes, the router learns this through prefix binding updates. This is
not unlike iBGP behavior for a part of the network that's larger than
the local AS.

This is a very early version of this document and should be read as
such. In particular, the notion of super- and sub-ASes hasn't been
worked out yet.

2 Direction

2.1 Problems with BGP

Today, the BGP4 protocol is used for inter-domain routing. There are
several problems with BGP. By design, it is a "path vector" protocol:
basically a distance vector protocol with path information added. As
such, it suffers from some of the problems inherent to distance vector
routing, such as slow convergence and the count to infinity problem
(although the AS path in BGP helps a lot here). Another problem area for
BGP is the fact that all processing happens on a per-prefix basis: there
is no way to communicate reachability or policy changes except to update
all impacted prefixes. BGP is extremely agnostic as to the underlying
path selection algorithm in order to accommodate as much policy control
as possible. Unfortunately, this makes it very hard to predict BGP's
behavior and the default behavior (especially with today's rather flat
AS hierarchy) is more often than not suboptimal. BGP allows harmful
policies that keep the protocol from converging to a stable state. Lack
of workable aggregation mechanisms means that once an address block is
deaggregated, it's almost impossible to get rid of the resulting long
prefixes, leading to excessive growth of the internet's global routing
table. Coarseness of the only available end-to-end metric (the AS path)
pushes operators to deaggregation for traffic engineering purposes. The
way BGP operates within a single AS requires an additional intra-domain
routing protocol and forces suboptimal engineering tradeoffs: either a
full mesh between all BGP routers within the AS, or route reflectors or
a confederation. There is no validation of routing
information beyond the next hop. A BGP speaker only communicates its
best path (if any) to a neighbor, with no way to tie additional
information to the nonexistence of a path and no way to accomplish type
of service routing or install backup paths. Paths must be explicitly
revoked, which in practice requires a BGP speaker to keep track of which
paths were communicated to which peer. BGP requires fairly extensive
configuration (setting up filters) before it's useful.

2.2 BGP's Strengths

BGP has two very important strengths: it imposes very few limitations on
the policies that can be used (the main one is that only the hop-by-hop
forwarding paradigm is supported), and its computation and data
dissemination are distributed. BGP scalability is determined almost
exclusively by
the number of prefixes in the global routing table and the topology of
the AS in question; the topology of the rest of the network has no
impact on scalability to speak of.

2.3 Goals for a New Inter-Domain Routing Protocol

First and foremost, a new inter-domain protocol must be able to replace
BGP with no real loss in functionality and second, it has to provide
benefits in the areas of performance, scalability, security and
features. Last but not least, there must be a viable migration path from
BGP to the new protocol.

3 The Protocol

An important underlying principle of this routing protocol is the
separation between topology and prefix information. Prefixes of a
certain address family are grouped together and such groups are injected
into the network by an AS or sub-AS, regardless of their reachability.
To evaluate reachability and distance, routers must be aware of the
network topology between them and the places where prefixes are
injected. It is infeasible to disseminate complete topology information
throughout the entire network on an internet-wide scale,
so to aid scalability there is aggregation of topology information and
two types of prefix aggregation: aggregation of prefixes into groups,
where a transit AS re-advertises prefixes learned from other ASes, and
aggregation of multiple prefixes into one shorter prefix. In the latter
case, loop detection and (un)reachability information is lost, so this
type of aggregation may not be appropriate under all circumstances. In
the former case, AS path information is retained for loop detection
and/or backward compatibility with BGP. Since one prefix may thus be
advertised by more than one AS, and to avoid having to remove and
reinsert prefixes into a group when reachability changes, local policy,
upstream policy/preference and reachability information are exchanged
separately from AS-prefix bindings. The properties of these four types
of information are as follows:

* Prefix binding information is generated by the advertising or a
  re-advertising AS, and barring further re-advertising, is exchanged
  without modification and independent of reachability. This information
  may be retained across topology events, reboots and even be exchanged
  out of band. Signatures are used to secure the binding between
  prefixes and their source AS. Securing the binding between prefixes
  and a re-advertising AS with permission from the source AS is
  optional and happens out of band. Prefix binding information is
  updated by sending out diff-style updates that include a signature
  over the full set of prefixes after the update is applied. Prefixes
  are listed in order. Prefix binding information is identified by the
  source AS (or sub- or super-AS) and a timestamp.

* Local policy information is local to an AS (or equivalent: a sub-AS or
  super-AS) and may or may not be kept confidential. Local policies
  consist of access list style entries that apply to one or more sets of
  prefix bindings. Policies that can be expressed this way are blocking
  of ASes and prefixes or applying a strong preference value that
  overrides the inter-AS metric or applying a "bonus" to the inter-AS
  metric. Policies must be identical across an AS or sub-AS.

    Discussion:

    The requirement that policy is identical across a (sub-) AS makes it
    impossible to have inconsistent policies within an AS that get in
    the way of convergence to a single stable state. However, ANY policy
    gets in the way of simple link state behavior. The way around this
    is to "black out" paths as unreachable that would otherwise have
    been selected as better than the one selected by policy (see below).
    When this becomes too excessive, it's easier to just re-advertise
    prefixes within the AS itself in order to present a single coherent
    view rather than many fragmented ones. Taken to its extreme, where
    an AS only propagates a full list of prefixes re-advertised to a
    neighbor, this makes the protocol look very similar to BGP to a
    neighboring router.

* Upstream policy information consists of AS paths and metrics for each
  prefix in a set of prefixes grouped together to be (re-)advertised.
  This information is more volatile than the prefix bindings it relates
  to and is only required in places where overlapping sets of prefixes
  with different advertising ASes are seen. Source ASes use this
  information for inbound traffic engineering.

* Reachability information is a bitmap that applies to a set of prefix
  bindings, where a zero bit indicates that the corresponding prefix is
  unreachable, either physically, for policy reasons or because it's
  reachable through some other path even though this fact can't be
  inferred from other information (i.e., this is the result of a policy
  decision). A one bit indicates that the corresponding prefix is
  reachable over the indicated path, and this path is used as per
  (publicly) visible metrics. When no reachability information is
  available, unreachability is assumed. This includes the case where
  prefix binding information is updated by inserting a new prefix but
  the latest reachability bitmap still refers to an older version of the
  prefix-AS binding information.

    Discussion:

    Since reachability information and prefix binding information come
    from different places (and very possibly over different paths) and
    the former depends on the latter, it's important that there are
    robust mechanisms for updating and synchronizing the information.
    Additionally, if information is cryptographically signed, it's
    possible that newer information is rejected for security reasons so
    a situation where different parts of the network use different
    versions of certain information may persist for significant periods
    of time.
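
The rule that missing reachability information means "unreachable" can
be illustrated with a minimal Python sketch. The names PrefixBinding
and reachability() are hypothetical, and the sketch assumes one
possible reading of the text: bits in an older bitmap remain usable for
the prefixes they were generated for, while a prefix inserted later has
no bit and is therefore treated as unreachable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefixBinding:
    source_as: int      # AS (or sub-/super-AS) that generated the binding
    timestamp: int      # together with source_as, identifies this version
    prefixes: tuple     # prefixes in their listed order


def reachability(current, bitmap_version, bits):
    """Return {prefix: bool} for every prefix in the current binding.

    bits[i] describes bitmap_version.prefixes[i]. A prefix without a
    bit, for example one inserted after the bitmap was generated, is
    assumed to be unreachable.
    """
    known = {}
    if bitmap_version is not None and bits is not None:
        known = dict(zip(bitmap_version.prefixes, bits))
    return {p: bool(known.get(p, 0)) for p in current.prefixes}


# The bitmap was built against version 1; version 2 adds a third prefix.
old = PrefixBinding(64500, 1, ("192.0.2.0/24", "198.51.100.0/24"))
new = PrefixBinding(64500, 2, old.prefixes + ("203.0.113.0/24",))
print(reachability(new, old, (1, 0)))
```

With no bitmap at all (both arguments None), every prefix in the
binding is reported as unreachable.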

Local policy and reachability information is not specifically protected
end-to-end, since it may be changed in transit; transport mechanisms are
assumed to protect all information on a hop-by-hop basis. Upstream
policy information is authenticated by the source AS.
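
The diff-style updates described for prefix binding information, where
the signature covers the full set of prefixes after the update is
applied, could be sketched in Python as follows. This is an
illustration only: the function names are hypothetical, and an HMAC
with a shared key stands in for the public-key signature a real design
would need, because the draft does not specify a signature scheme.

```python
import hashlib
import hmac
import ipaddress


def canonical_bytes(prefixes):
    """Serialize the prefixes in a fixed order (prefixes are listed in order)."""
    ordered = sorted(
        prefixes,
        key=lambda p: (p.version, int(p.network_address), p.prefixlen),
    )
    return "\n".join(str(p) for p in ordered).encode("ascii")


def sign(key, prefixes):
    # Stand-in only: a keyed hash, NOT a real public-key signature.
    return hmac.new(key, canonical_bytes(prefixes), hashlib.sha256).hexdigest()


def apply_update(current, added, removed, signature, key):
    """Apply a diff; accept it only if the signature matches the full result."""
    candidate = (set(current) | set(added)) - set(removed)
    if not hmac.compare_digest(sign(key, candidate), signature):
        return set(current), False      # keep the old binding set
    return candidate, True


net = ipaddress.ip_network
key = b"hypothetical-source-as-key"
old = {net("192.0.2.0/24"), net("198.51.100.0/24")}
new = {net("192.0.2.0/24"), net("203.0.113.0/24")}
result, ok = apply_update(
    old, {net("203.0.113.0/24")}, {net("198.51.100.0/24")}, sign(key, new), key
)
print(ok, sorted(str(p) for p in result))
```

Signing the resulting set rather than the diff means a receiver that
missed an earlier update detects the mismatch instead of silently
ending up with a different set of prefixes.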

3.1 Path Selection

The protocol distributes link states between neighboring ASes. The set
of link states may either be complete or incomplete. When the set of
link states is complete the state for all links within an AS or super-AS
is known (the smallest entity the protocol deals with is sub-ASes, not
individual routers). Alternatively, link states that are not part of the
currently selected path for one or more destinations (i.e., the path
with the lowest metric) can be left out of updates to other ASes at the
local AS's discretion. Note that when the set of link states is
complete, only incremental changes need to be communicated to a neighbor
AS, and there is no need for the local AS to converge before updates are
propagated to another AS. When the set of link states is incomplete, the
AS must first converge and determine the new best path so that the
missing link states for the best path can be communicated to the
downstream AS.

    Discussion:

    Is it useful to first revoke link states that have failed as soon as
    possible, and then send out link states for the new best path later,
    or should all of this be done at the same time? In the former case,
    a downstream AS has the opportunity to reroute traffic over another
    AS, but this introduces additional volatility. Having link states
    for a backup path could be a good alternative, but it's not clear
    whether that's feasible.

Within an AS, standard shortest path first path selection is used. But
subsequent ASes must calculate the path to a non-local destination using
the restriction that AS crossings are one-way and an AS may only appear
once in the path to a given destination.

Link state mechanics impose certain policy restrictions, such as the
requirement that all traffic can flow over all paths (within an AS, at
least) while in current BGP it's possible for two destination prefixes
in the same AS to be reachable over different AS paths. Link state
operation also requires a new way to convey policy: when AS Y connects
to ASes X and Z and doesn't want to provide transit between X and Z,
this means that the flow of link state information towards X and Z must
be limited in some way. Things get more complex if Y has an alternative
way to reach Z that X is authorized to use, but which currently has a
higher metric than the path to Z that X isn't authorized to use. (If the
AS supports ToS routing, this problem might be solved by having a
separate ToS for this.)
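
The path selection rule stated earlier in this section (shortest path
first, with the restriction that an AS may only appear once in the
path) can be sketched in Python as a shortest path search whose state
includes the set of ASes that have already been left. The function
name, the graph representation and the node names are assumptions made
for this illustration, and "AS crossings are one-way" is read here as
"an AS that has been left may not be re-entered".

```python
import heapq
from itertools import count


def constrained_shortest_path(links, as_of, src, dst):
    """links: {node: [(neighbor, metric), ...]}, as_of: {node: AS number}.

    Returns (total_metric, path), or None if dst cannot be reached
    without re-entering an AS that the path has already left.
    """
    tie = count()   # tie-breaker so the heap never compares sets or lists
    heap = [(0, next(tie), src, frozenset(), [src])]
    settled = set()
    while heap:
        cost, _, node, left, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if (node, left) in settled:
            continue
        settled.add((node, left))
        for nbr, metric in links.get(node, ()):
            if as_of[nbr] == as_of[node]:
                new_left = left                  # still inside the same AS
            elif as_of[nbr] in left:
                continue                         # would re-enter an AS
            else:
                new_left = left | {as_of[node]}  # crossing into a new AS
            heapq.heappush(
                heap, (cost + metric, next(tie), nbr, new_left, path + [nbr])
            )
    return None


# a1 and a2 are in AS 1. The cheap detour a1 - b1 - a2 leaves and then
# re-enters AS 1, so the direct (more expensive) a1 - a2 link is used.
links = {"a1": [("b1", 1), ("a2", 10)], "b1": [("a2", 1)], "a2": [("c1", 1)]}
as_of = {"a1": 1, "a2": 1, "b1": 2, "c1": 3}
print(constrained_shortest_path(links, as_of, "a1", "c1"))
```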

    Discussion:

    Would it make sense to build in the notion of transit/upstream, peer
    and customer/downstream relationships? This removes the need for
    much configuration, which should be especially helpful for leaf
    ASes. It also makes it possible for many ASes to run without any
    policies imposed, which makes operation of the protocol and its
    predictability much better.

Note: link states contain address family information that indicates
which protocols are supported over the link in question.

3.2 Inter-AS Metric

The primary metric in this protocol is the inter-AS metric, not to be
confused with the BGP Multi Exit Discriminator. There are no MEDs and
the AS path is not used as a metric. I.e., a short AS path is not
considered better than a long AS path. In many cases, no policy will be
applied, so the inter-AS metric (or "metric" for short) determines the
flow of traffic. The metric must be adjusted such that it reflects an
increase of at least 1, at most 127 and default 11 for both input and
output on all intermediate routers within an AS or sub-AS. The maximum
value of the metric is 64897 (255 hops).
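
As a minimal sketch of how the metric could accumulate, the Python
fragment below adds an input and an output increment for each
intermediate router, each between 1 and 127 with a default of 11. The
function name is hypothetical, and saturating at the maximum value is
an assumption; the text only states that 64897 is the maximum.

```python
MIN_STEP, MAX_STEP, DEFAULT_STEP = 1, 127, 11
MAX_METRIC = 64897


def add_hop(metric, in_step=DEFAULT_STEP, out_step=DEFAULT_STEP):
    """Return the metric after crossing one intermediate router."""
    for step in (in_step, out_step):
        if not MIN_STEP <= step <= MAX_STEP:
            raise ValueError("per-interface increment must be 1..127")
    # Assumption: the metric saturates at the maximum instead of wrapping.
    return min(metric + in_step + out_step, MAX_METRIC)


metric = 0
for _ in range(3):          # three routers with default increments
    metric = add_hop(metric)
print(metric)
```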

    Discussion:

    Would allowing different metrics in the two directions of a link
    open the door to loops?

3.3 Type-Of-Service Routing

In addition to the general metric that applies to all traffic of all
supported address families, link states may contain additional metrics
for specific service classes. Service classes are communicated through a
numerical value that identifies a globally defined service class or a
service class relevant to the AS in question. Some service classes may
be implemented with diffserv. Neighboring ASes exchange information
about which diffserv code points correspond to which service classes.
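
One way to picture this is a link state that carries the general
metric plus optional per-class metrics, together with a table mapping
diffserv code points to service classes as learned from a neighbor. The
Python sketch below is an assumption-laden illustration: the class and
field names, the service class numbers and the fallback to the general
metric for classes without their own metric are not specified by this
document.

```python
from dataclasses import dataclass, field


@dataclass
class LinkState:
    metric: int                                        # general metric
    class_metrics: dict = field(default_factory=dict)  # class id -> metric

    def metric_for(self, service_class=None):
        # Assumption: classes without a specific metric use the general one.
        return self.class_metrics.get(service_class, self.metric)


# Hypothetical mapping exchanged with a neighbor: DSCP value -> class id.
dscp_to_class = {46: 1, 0: None}

link = LinkState(metric=22, class_metrics={1: 40})
for dscp in (46, 0):
    print(dscp, link.metric_for(dscp_to_class.get(dscp)))
```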


3.4 Prefix Aggregation

Aggregating a set of prefixes into a shorter prefix is a very powerful
mechanism to reduce the processing and storage requirements for routing
protocols, along with the size of FIB tables. In many cases, the loss of
information that results from aggregation is not problematic, but in
certain cases it is. For instance, consider the case where AS A connects
to ASes B and C, which both aggregate A's prefix into a larger one. If
the link between A and B then goes down, B no longer has any way to
reach A.
In order to avoid this situation, it's important that suppression of
more specifics that are covered by an aggregate only happens when both
the aggregate and the more specific are sourced by an upstream (transit)
or peer AS, NOT the local AS or a downstream (customer) AS.

    Discussion:

    In BGP, an AS aggregates a number of more specifics into an
    aggregate and then propagates the aggregate. The logic behind this
    is probably that the more specifics are all downstream from the
    local AS. However, this doesn't work in the situation when one of
    the more specifics is sourced by an AS that is both a customer of
    the local AS and connects to the network elsewhere, and then the
    direct link to the source of the more specific goes down. Another
    possibility is that another AS also generates the aggregate and if
    their direct link to the AS in question goes down, they wouldn't see
    the alternative path over the first (local) AS if it was aggregated
    away. For these reasons, it's probably necessary to always
    propagate prefixes from customers, even when those are covered by an
    aggregate. All of this doesn't apply to aggregatable prefixes that
    belong to single homed ASes, but those are easily dealt with through
    manual configuration.

Note that suppressing a more specific learned from a peer when the
aggregate is learned from an upstream AS is probably undesirable.
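
The suppression rule for more specifics can be written down as a small
predicate. The Python sketch below uses hypothetical names for the
relationship types and treats the "probably undesirable" case of a
more specific from a peer with an aggregate from an upstream AS as not
allowed; that last choice is an assumption, since the text leaves it
open.

```python
UPSTREAM, PEER, LOCAL, CUSTOMER = "upstream", "peer", "local", "customer"


def may_suppress(more_specific_from, aggregate_from):
    """True if a more specific covered by an aggregate may be suppressed."""
    allowed = {UPSTREAM, PEER}
    # Both must be sourced by an upstream or peer AS, never by the
    # local AS or a downstream (customer) AS.
    if more_specific_from not in allowed or aggregate_from not in allowed:
        return False
    # Assumption: do not hide a peer's more specific behind an aggregate
    # that was learned from an upstream AS.
    if more_specific_from == PEER and aggregate_from == UPSTREAM:
        return False
    return True


print(may_suppress(UPSTREAM, UPSTREAM), may_suppress(CUSTOMER, UPSTREAM))
```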

Aggregates are injected into the protocol with a metric that is the
weighted average of the metrics for all underlying more specific
prefixes. Holes (unreachable parts) in the address space covered by the
aggregate receive a metric of 64897. The weight of a prefix is equal to
the relative part of the aggregate it covers. So a /32 more specific in
a /24 aggregate contributes 1/256th to the metric of the aggregate.
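
The weighted average can be worked through with a short Python sketch.
The function name is hypothetical, the more specifics are assumed not
to overlap, and the result is kept as an exact fraction because the
text does not say how the average should be rounded.

```python
import ipaddress
from fractions import Fraction

UNREACHABLE = 64897   # metric assigned to holes in the aggregate


def aggregate_metric(aggregate, specifics):
    """specifics: {prefix: metric} for non-overlapping more specifics."""
    agg = ipaddress.ip_network(aggregate)
    covered = Fraction(0)
    total = Fraction(0)
    for prefix, metric in specifics.items():
        net = ipaddress.ip_network(prefix)
        if not net.subnet_of(agg):
            raise ValueError(f"{net} is not covered by {agg}")
        # Weight: the share of the aggregate that this prefix covers.
        weight = Fraction(1, 2 ** (net.prefixlen - agg.prefixlen))
        covered += weight
        total += weight * metric
    # Whatever is not covered is a hole and counts as unreachable.
    return total + (1 - covered) * UNREACHABLE


print(aggregate_metric("192.0.2.0/24",
                       {"192.0.2.0/25": 22, "192.0.2.128/25": 44}))
print(aggregate_metric("192.0.2.0/24", {"192.0.2.0/25": 22}))
```

The second call shows how strongly a hole pulls the metric up: with
half of the aggregate unreachable, the result is already about half of
the maximum metric.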

    Discussion:

    Does it make sense to have different metrics for individual
    prefixes, grouped prefixes and aggregates? For instance, an
    individual prefix could receive a metric of 11 per hop, while a
    grouped prefix gets 8 and an aggregate 5. This way, even if the
    aggregate starts with a higher initial metric than the individual
    route, the individual route's metric increases faster, so that after
    a certain number of hops, the metric for the individual route
    becomes higher than that of the aggregate, at which point the
    individual route can be dropped in favor of the aggregate. This
    mechanism makes it possible to limit propagation of individual
    routes and grouped routes to a certain number of hops.
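
The number of hops after which an individual route would be dropped
follows from simple arithmetic, sketched here in Python. The function
name and the starting metrics in the example are hypothetical; the
per-hop values 11 and 5 are the ones suggested in the discussion above.

```python
def crossover_hop(individual_start, aggregate_start,
                  individual_step=11, aggregate_step=5):
    """Smallest hop count at which the individual route's metric is
    strictly higher than the aggregate's, or None if that never happens."""
    if individual_start > aggregate_start:
        return 0
    if individual_step <= aggregate_step:
        return None
    gap = aggregate_start - individual_start
    # Need individual_start + h * 11 > aggregate_start + h * 5,
    # that is h > gap / (11 - 5).
    return gap // (individual_step - aggregate_step) + 1


# Individual route starts at 22, aggregate at 100: after 13 hops both
# are at 165, after 14 hops the individual route is at 176 versus 170.
print(crossover_hop(22, 100))
```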

3.5 Unreachability Propagation

When a link between two (sub-) ASes goes down, it's important that this
information is communicated throughout the entire network as quickly as
possible. However, experience with BGP shows that propagation of routing
changes across a network as large as the internet simply takes time, and
unfortunately, the further away an AS is from the affected link, the
longer it takes for the update to reach that AS, leading to BGP's
version of the "count to infinity" problem inherent to distance vector
protocols. Even though the protocol suggested here contains important
link state aspects, the aggregation of topology information necessary to
scale to the size of the internet would leave this protocol vulnerable
to the same problem without additional mechanisms. The assumption is
that, several ASes removed from the AS where a link goes down, there is
no longer a full view of all links that may be used to route around the
failed link.

In order to avoid flap amplification and "count to infinity" as seen in
BGP, whenever a link goes up or down, an identifying token is generated.
Both affected ASes must generate the same token that is unique for a
certain link within a reasonable time period, and the token must make it
possible to determine the order in which events occurred when multiple
updates become available at the same time or in quick succession.
Intermediate (sub-) ASes add their AS number to updates they propagate.
When a router receives an update over one path, it can now determine
whether another path is also affected (because it shares part of the AS
path that the update travelled through). When an update that announces
reachability becomes available, the tokens can be compared to make sure
that the reachability announcement is newer than the unreachability
announcement.
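
A minimal Python sketch of such tokens is given below. The format is
purely an assumption: the link is identified by the two AS numbers in
sorted order so that both ends derive the same identifier, and a
per-link sequence number provides the ordering. How both ends would
agree on the sequence number is not addressed here, and all names are
hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LinkEvent:
    link: tuple          # (lower AS, higher AS): same on both ends
    sequence: int        # hypothetical per-link counter, higher is newer
    up: bool             # True for reachability, False for unreachability
    as_path: tuple = ()  # ASes the update has travelled through


def make_event(as_a, as_b, sequence, up):
    return LinkEvent(tuple(sorted((as_a, as_b))), sequence, up)


def propagate(event, own_as):
    """Intermediate ASes add their AS number to updates they propagate."""
    return replace(event, as_path=event.as_path + (own_as,))


def link_is_up(events, link):
    """Newest event wins, whatever order the updates arrived in."""
    relevant = [e for e in events if e.link == link]
    if not relevant:
        return None
    return max(relevant, key=lambda e: e.sequence).up


down = propagate(make_event(64501, 64500, 7, up=False), 64502)
up = make_event(64500, 64501, 8, up=True)
print(link_is_up([up, down], (64500, 64501)))   # "up" arrived first
```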

When an unreachability update is generated or propagated, a "grace
period" may be attached to it. This is especially useful when a link is
about to be taken down for maintenance: by first announcing
unreachability, packets that come in because the update hasn't
propagated yet aren't lost. Whenever a router processes an update with a
grace period, it subtracts the time the update was in transit (or a
reasonable estimate) and lowers the grace period by a short time, like
five seconds. The update is processed when the grace period expires.
Because the grace period is reduced as the route propagates, the actual
execution of the update happens "far away" first and then moves closer
to the source of the update. A router or AS should announce a grace
period of zero if it is unable to deliver packets to the affected
destination.
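
The effect of reducing the grace period at every hop can be checked
with a short Python sketch. The function name and the transit times
are made up for the example, processing delay inside a router is
ignored, and the grace period is assumed never to drop below zero.

```python
def execution_times(initial_grace, transit_times, margin=5.0):
    """Seconds after the announcement at which the source and each
    router along a chain execute the update."""
    times = [initial_grace]          # the source itself
    now, grace = 0.0, initial_grace
    for transit in transit_times:
        now += transit               # moment the update arrives
        # Subtract the time in transit and a short margin, floor at 0.
        grace = max(0.0, grace - transit - margin)
        times.append(now + grace)    # executed when the grace expires
    return times


# 60 second grace period, three routers 1, 2 and 1 seconds apart.
print(execution_times(60, [1, 2, 1]))
```

Each router executes the update one margin earlier than the router
before it, so the change takes effect far away from the source first.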

4 Multicast

4.1 Single Source Multicast

Multicast information is carried as both source and listener
information. Single source multicast (SSM) source information consists
of two prefixes that specify the multicast source: the multicast group
prefix and the source address prefix. When a router connects to
listeners interested in a certain multicast stream (clients or other
routers), it sends out a listener update for the specific group/source
combination to the next router on the shortest path towards the source
of the multicast
stream. This way, an intermediate router knows which downstream routers
are interested in the multicast stream so it can replicate packets as
required. Note that source information is aggregated and relatively
stable, while listener information isn't aggregated and more volatile.
Also note that a router must update listener information whenever the
best path to the related source information changes.
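
The listener state a router would keep can be sketched in Python as
follows. Everything in the sketch is hypothetical: the class, the
next_hop_toward hook into unicast path selection, and in particular the
"leave" message sent to the old upstream router when the best path
changes, which is just one way to "update listener information".

```python
from collections import defaultdict


class MulticastRouter:
    def __init__(self, next_hop_toward):
        self.next_hop_toward = next_hop_toward  # callable: source -> neighbor
        self.listeners = defaultdict(set)  # (group, source) -> downstreams
        self.upstream = {}                 # (group, source) -> neighbor
        self.sent = []                     # log of listener updates sent

    def add_listener(self, group, source, downstream):
        key = (group, source)
        self.listeners[key].add(downstream)
        if key not in self.upstream:       # first listener: tell upstream
            self._join(key)

    def _join(self, key):
        hop = self.next_hop_toward(key[1])
        self.upstream[key] = hop
        self.sent.append((hop, "listen") + key)

    def replicate_to(self, group, source):
        """Downstream neighbors that need a copy of each packet."""
        return sorted(self.listeners.get((group, source), ()))

    def best_path_changed(self, source):
        for key in list(self.upstream):
            new_hop = self.next_hop_toward(source)
            if key[1] == source and new_hop != self.upstream[key]:
                self.sent.append((self.upstream[key], "leave") + key)
                self._join(key)


hops = {"2001:db8::1": "C"}
router = MulticastRouter(lambda source: hops[source])
router.add_listener("ff3e::1234", "2001:db8::1", "client1")
router.add_listener("ff3e::1234", "2001:db8::1", "routerX")
hops["2001:db8::1"] = "B"                  # best path to the source moves
router.best_path_changed("2001:db8::1")
print(router.sent)
```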

    Discussion:

    It's possible to optimize for the situation where router A is
    interested in a multicast group available from router F, where the
    shortest path from A to F is A - C - F, while router B is already
    receiving the multicast feed through B - D - E - F. If A and B are
    connected, A could receive the feed from B with only one additional
    hop having to carry the multicast feed, but this isn't the shortest
    path (that would be A - C - F) so it requires the dissemination of
    additional (volatile) information.

4.2 Any Source Multicast

Although ASM doesn't have a single source, the group prefix must be
sourced into the routing protocol the same way as an SSM source to aid
simple discovery. Operation is similar to that for SSM, except that
multicast packets don't exclusively flow downstream: they are duplicated
downstream and also upstream, with the exception of the direction of the
packet's source, of course. Routers must also apply a reverse path
forwarding (RPF) check to all ASM packets to avoid loops and unnecessary
duplication.

    Discussion:

    Do we need a separate address family for ASM sources? This allows
    for disjoint ASM/unicast topologies but may require duplication of
    all unicast routing information.

4.3 Multicast Address Family

SSM and ASM multicast routing information is a single address family.
When the source address/prefix is specified, this indicates SSM. When
the source prefix is the unspecified address, this indicates ASM. The
multicast address family encompasses both source and listener
information; it's not possible for links to support one but not the
other. An AS may support either no multicast, SSM only or both SSM and
ASM. When only SSM is supported, multicast source information with the
unspecified address as the source address for the stream must not be
carried.
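
The distinction between SSM and ASM entries, and the rule for ASes
that support SSM only, can be sketched in Python with the standard
ipaddress module. The function names and the strings used for the
support levels are hypothetical, and "the unspecified address" is
assumed to be encoded as a host prefix (::/128 or 0.0.0.0/32).

```python
import ipaddress


def multicast_kind(source_prefix):
    """Classify multicast source information as "SSM" or "ASM"."""
    net = ipaddress.ip_network(source_prefix)
    unspecified = (net.network_address.is_unspecified
                   and net.prefixlen == net.max_prefixlen)
    return "ASM" if unspecified else "SSM"


def may_carry(source_prefix, support):
    """support is one of "none", "ssm" or "ssm+asm"."""
    if support == "none":
        return False
    if support == "ssm":
        return multicast_kind(source_prefix) == "SSM"
    if support == "ssm+asm":
        return True
    raise ValueError(f"unknown support level: {support}")


print(multicast_kind("2001:db8::1/128"), multicast_kind("::/128"))
print(may_carry("::/128", "ssm"))
```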

5 Security Considerations

To be worked out as the protocol becomes more concrete. Separating
prefix information from topology information makes it easy to
authenticate the source AS, but re-advertising and aggregation may prove
problematic security-wise.

6 IANA Considerations

None at this time.

7 Document and Author Information

This document expires December, 2006. The latest version will always be
available at http://www.muada.com/drafts/. Please direct questions and
comments to the author:

    Iljitsch van Beijnum

    Email: iljitsch@muada.com


Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Copyright Statement

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.


Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.