Network Working Group                                     I. van Beijnum
Internet-Draft                                              June 9, 2006
Expires: December 11, 2006

                     Inter-Domain Link Enumeration
                     draft-van-beijnum-idle-00.txt

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 11, 2006.

Copyright Notice

Copyright (C) The Internet Society (2006).

Abstract

This document discusses the architecture of a successor to BGP4. The problems surrounding inter-domain routing and the BGP protocol have been studied within the IETF, the IRTF and elsewhere, but what's required to actually solve some of these issues hasn't been explored much. That's what this document tries to accomplish.

1 Introduction

Within the IETF, the notion that a problem statement and a list of requirements must precede all protocol development is very pervasive. And for good reason: without knowing what problem to solve and which requirement a solution must address, it's impossible to even determine success or failure.
As such, the effort presented here may seem premature. On the other hand, "when the only tool you have is a hammer, every problem looks like a nail." Currently, there are no efforts within the IETF to make more than fairly modest evolutionary improvements to inter-domain routing. Given that anything more ambitious is going to take a lot of time, and that improving the security of BGP as is now being considered in the RPSEC and SIDR working groups will require significant efforts regardless of the size of the changes to the existing BGP protocol, it makes a lot of sense to entertain the idea of a new inter-domain routing protocol at this time.

The author is of the opinion that the current state of BGP makes it hard to impossible to fix one flaw because of all the others. This document tries to make the case that biting the bullet and fixing all significant flaws in one go is a better way forward.

The basic idea is that a router or AS has a link state view of part of the network, and prefixes are injected at the edges of that part of the network. So when reachability close to a router changes, the router learns this through link state updates, but when reachability far away changes, the router learns this through prefix binding updates. This is not unlike iBGP behavior for a part of the network that's larger than the local AS.

This is a very early version of this document and should be read as such. Especially the notion of super- and sub-ASes hasn't been worked out yet.

2 Direction

2.1 Problems with BGP

Today, the BGP4 protocol is used for inter-domain routing. There are several problems with BGP. By design, it is a "path vector" protocol: basically a distance vector protocol with path information added.
As such, it suffers from some of the problems inherent to distance vector routing, such as slow convergence and the count to infinity problem (although the AS path in BGP helps a lot here).

Another problem area for BGP is the fact that all processing happens on a per-prefix basis: there is no way to communicate reachability or policy changes except to update all impacted prefixes.

BGP is extremely agnostic as to the underlying path selection algorithm in order to accommodate as much policy control as possible. Unfortunately, this makes it very hard to predict BGP's behavior, and the default behavior (especially with today's rather flat AS hierarchy) is more often than not suboptimal.

BGP allows harmful policies that keep the protocol from converging to a stable state.

Lack of workable aggregation mechanisms means that once an address block is deaggregated, it's almost impossible to get rid of the resulting long prefixes, leading to excessive growth of the internet's global routing table.

Coarseness of the only available end-to-end metric (the AS path) pushes operators to deaggregation for traffic engineering purposes.

The way BGP operates within a single AS requires an additional intra-domain routing protocol and forces suboptimal engineering tradeoffs: a full mesh between all BGP routers within the AS, route reflectors, or a confederation.

There is no validation of routing information beyond the next hop.

A BGP speaker only communicates its best path (if any) to a neighbor, with no way to tie additional information to the nonexistence of a path and no way to accomplish type of service routing or install backup paths.

Paths must be explicitly revoked, which in practice requires a BGP speaker to keep track of which paths were communicated to which peer.

BGP requires fairly extensive configuration (setting up filters) before it's useful.

2.2 BGP's Strengths

BGP has two very important strengths: it imposes very few limitations on the policies that can be used (the main one is that only the hop-by-hop forwarding paradigm is supported), and its computation and data dissemination are distributed. BGP scalability is determined almost exclusively by the number of prefixes in the global routing table and the topology of the AS in question; the topology of the rest of the network has no impact on scalability to speak of.

2.3 Goals for a New Inter-Domain Routing Protocol

First and foremost, a new inter-domain protocol must be able to replace BGP with no real loss in functionality. Second, it has to provide benefits in the areas of performance, scalability, security and features. Last but not least, there must be a viable migration path from BGP to the new protocol.

3 The Protocol

An important underlying principle of this routing protocol is the separation between topology and prefix information. Prefixes of a certain address family are grouped together and such groups are injected into the network by an AS or sub-AS, regardless of their reachability. To evaluate reachability and distance, routers must be aware of the network topology between them and the places where prefixes are injected.

It is infeasible to disseminate topology information for the entire network throughout the entire network on an internet-wide scale, so to aid scalability there is aggregation of topology information and two types of prefix aggregation: aggregation of prefixes into groups, where a transit AS re-advertises prefixes learned from other ASes, and aggregation of multiple prefixes into one shorter prefix. In the latter case, loop detection and (un)reachability information is lost, so this type of aggregation may not be appropriate under all circumstances.
In the former case, AS path information is retained for loop detection and/or backward compatibility with BGP.

Since one prefix may thus be advertised by more than one AS, and to avoid having to remove and reinsert prefixes into a group when reachability changes, local policy, upstream policy/preference and reachability information is exchanged separately from AS-prefix bindings. The properties of these four types of information are as follows:

* Prefix binding information is generated by the advertising or a re-advertising AS, and barring further re-advertising, is exchanged without modification and independent of reachability. This information may be retained across topology events and reboots, and may even be exchanged out of band. Signatures are used to secure the binding between prefixes and their source AS. Securing the binding between prefixes and a re-advertising AS with permission from the source AS is optional and happens out of band. Prefix binding information is updated by sending out diff-style updates that include a signature over the full set of prefixes after the update is applied. Prefixes are listed in order. Prefix binding information is identified by the source AS (or sub- or super-AS) and a timestamp.

* Local policy information is local to an AS (or equivalent: a sub-AS or super-AS) and may or may not be kept confidential. Local policies consist of access list style entries that apply to one or more sets of prefix bindings. Policies that can be expressed this way are blocking of ASes and prefixes, applying a strong preference value that overrides the inter-AS metric, or applying a "bonus" to the inter-AS metric. Policies must be identical across an AS or sub-AS.

  Discussion: The requirement that policy is identical across a (sub-) AS makes it impossible to have inconsistent policies within an AS that get in the way of convergence to a single stable state. However, ANY policy gets in the way of simple link state behavior. The way around this is to "black out" paths as unreachable that would otherwise have been selected as better than the one selected by policy (see below). When this becomes excessive, it's easier to just re-advertise prefixes within the AS itself in order to present a single coherent view rather than many fragmented ones. Taken to its extreme, where an AS only propagates a full list of prefixes re-advertised to a neighbor, this makes the protocol look very similar to BGP to a neighboring router.

* Upstream policy information consists of AS paths and metrics for each prefix in a set of prefixes grouped together to be (re-)advertised. This information is more volatile than the prefix bindings it relates to and is only required in places where overlapping sets of prefixes with different advertising ASes are seen. Source ASes use this information for inbound traffic engineering.

* Reachability information is a bitmap that applies to a set of prefix bindings, where a zero bit indicates that the corresponding prefix is unreachable, either physically, for policy reasons or because it's reachable through some other path even though this fact can't be inferred from other information (i.e., this is the result of a policy decision). A one bit indicates that the corresponding prefix is reachable over the indicated path, and this path is used as per (publicly) visible metrics. When no reachability information is available, unreachability is assumed.
  This includes the case where prefix binding information is updated by inserting a new prefix but the latest reachability bitmap still refers to an older version of the prefix-AS binding information.

  Discussion: Since reachability information and prefix binding information come from different places (and very possibly over different paths) and the former depends on the latter, it's important that there are robust mechanisms for updating and synchronizing the information. Additionally, if information is cryptographically signed, it's possible that newer information is rejected for security reasons, so a situation where different parts of the network use different versions of certain information may persist for significant periods of time.

Local policy and reachability information is not specifically protected since it may be changed in transit end-to-end and transport mechanisms are assumed to protect all information on a hop-by-hop basis. Upstream policy information is authenticated by the source AS.

3.1 Path Selection

The protocol distributes link states between neighboring ASes. The set of link states may either be complete or incomplete. When the set of link states is complete, the state for all links within an AS or super-AS is known (the smallest entity the protocol deals with is sub-ASes, not individual routers). Alternatively, link states that are not part of the currently selected path for one or more destinations (i.e., the path with the lowest metric) can be left out of updates to other ASes at the local AS's discretion.

Note that when the set of link states is complete, only incremental changes need to be communicated to a neighbor AS, and there is no need for the local AS to converge before updates are propagated to another AS. When the set of link states is incomplete, the AS must first converge and determine the new best path so that the missing link states for the best path can be communicated to the downstream AS.
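
As an illustration of the reachability bitmap rules listed in Section 3, the following minimal Python sketch shows how a receiver might evaluate a bitmap against versioned prefix bindings. The class and function names (PrefixBinding, ReachabilityBitmap, is_reachable) and the data layout are hypothetical and not defined by this document. The sketch also assumes that an older bitmap stays valid for the prefixes that already existed in the binding version it refers to, which the text above does not state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PrefixBinding:
    """One version of a source AS's ordered prefix set (hypothetical layout)."""
    source_as: int
    timestamp: int        # a binding is identified by source AS and timestamp
    prefixes: List[str]   # prefixes are listed in a fixed order


@dataclass
class ReachabilityBitmap:
    """One bit per prefix, positionally tied to a specific binding version."""
    binding_timestamp: int
    bits: List[bool]


def is_reachable(prefix: str,
                 known_bindings: Dict[int, PrefixBinding],
                 bitmap: Optional[ReachabilityBitmap]) -> bool:
    # No reachability information at all: unreachability is assumed.
    if bitmap is None:
        return False
    # Assumption: the bitmap is read against the binding version it names.
    binding = known_bindings.get(bitmap.binding_timestamp)
    if binding is None:
        return False
    # A prefix inserted after that version has no bit yet: assume unreachable.
    if prefix not in binding.prefixes:
        return False
    index = binding.prefixes.index(prefix)
    return index < len(bitmap.bits) and bitmap.bits[index]
```

With a bitmap generated against binding version 1, a prefix that was only inserted in version 2 evaluates as unreachable until a bitmap for version 2 arrives, which matches the rule above for stale bitmaps.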

Discussion: Is it useful to first revoke link states that have failed as soon as possible, and then send out link states for the new best path later, or should all of this be done at the same time? In the former case, a downstream AS has the opportunity to reroute traffic over another AS, but this introduces additional volatility. Having link states for a backup path could be a good alternative, but it's not clear whether that's feasible.

Within an AS, standard shortest path first path selection is used. But subsequent ASes must calculate the path to a non-local destination using the restriction that AS crossings are one-way and an AS may only appear once in the path to a given destination.

Link state mechanics impose certain policy restrictions, such as the requirement that all traffic can flow over all paths (within an AS, at least), while in current BGP it's possible for two destination prefixes in the same AS to be reachable over different AS paths.

Link state operation also requires a new way to convey policy: when AS Y connects to ASes X and Z and doesn't want to provide transit between X and Z, this means that the flow of link state information towards X and Z must be limited in some way. Things get more complex if Y has an alternative way to reach Z that X is authorized to use, but which currently has a higher metric than the path to Z that X isn't authorized to use. (If the AS supports ToS routing, this problem might be solved by having a separate ToS for this.)

Discussion: Would it make sense to build in the notion of transit/upstream, peer and customer/downstream relationships? This removes the need for much configuration, which should be especially helpful for leaf ASes. It also makes it possible for many ASes to run without any policies imposed, which makes operation of the protocol and its predictability much better.
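
The restriction on inter-AS paths in this section (an AS may only appear once in the path to a given destination) can be sketched as a simple check over the AS numbers of the sub-AS hops along a candidate path. This is only an illustration: the function names and the (metric, hops) tuple layout are hypothetical, and the sketch reads "AS crossings are one-way" as "once a path has left an AS it may not re-enter it".

```python
from typing import List, Optional, Sequence, Tuple


def as_appears_once(hop_as_numbers: Sequence[int]) -> bool:
    """True if every AS on the path forms a single contiguous run of hops."""
    seen = set()
    previous = None
    for asn in hop_as_numbers:
        if asn != previous:
            # Entering a (possibly new) AS: it must not have been left before.
            if asn in seen:
                return False
            seen.add(asn)
            previous = asn
    return True


def best_valid_path(
    candidates: List[Tuple[int, List[int]]]
) -> Optional[Tuple[int, List[int]]]:
    """Pick the lowest-metric candidate that satisfies the restriction.

    Each candidate is (total_metric, AS number of each hop in order).
    """
    valid = [c for c in candidates if as_appears_once(c[1])]
    return min(valid, key=lambda c: c[0], default=None)
```

A path with hops in ASes 1, 1, 2, 2, 3 is acceptable, while 1, 2, 1 is rejected even if its metric is lower, because AS 1 is re-entered after the path has left it.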

Note: link states contain address family information that indicates which protocols are supported over the link in question.

3.2 Inter-AS Metric

The primary metric in this protocol is the inter-AS metric, not to be confused with the BGP Multi Exit Discriminator. There are no MEDs and the AS path is not used as a metric. I.e., a short AS path is not considered better than a long AS path. In many cases, no policy will be applied, so the inter-AS metric (or "metric" for short) determines the flow of traffic.

The metric must be adjusted such that it reflects an increase of at least 1 and at most 127 (default 11) for both input and output on all intermediate routers within an AS or sub-AS. The maximum value of the metric is 64897 (255 hops).

Discussion: Would allowing different metrics in the two directions for a link open the door to loops?

3.3 Type-Of-Service Routing

In addition to the general metric that applies to all traffic of all supported address families, link states may contain additional metrics for specific service classes. Service classes are communicated through a numerical value that identifies a globally defined service class or a service class relevant to the AS in question.

Some service classes may be implemented with diffserv. Neighboring ASes exchange information about which diffserv code points correspond to which service classes.

3.4 Prefix Aggregation

Aggregating a set of prefixes into a shorter prefix is a very powerful mechanism to reduce the processing and storage requirements for routing protocols, along with the size of FIB tables. In many cases, the loss of information that results from aggregation is not problematic, but in certain cases it is. For instance, consider the case where AS A connects to ASes B and C, which both aggregate A's prefix into a larger one. If the link between A and B then goes down, B no longer has any way to reach A.
In order to avoid this situation, it's important that suppression of more specifics that are covered by an aggregate only happens when both the aggregate and the more specific are sourced by an upstream (transit) or peer AS, NOT the local AS or a downstream (customer) AS.

Discussion: In BGP, an AS aggregates a number of more specifics into an aggregate and then propagates the aggregate. The logic behind this is probably that the more specifics are all downstream from the local AS. However, this doesn't work in the situation when one of the more specifics is sourced by an AS that is both a customer of the local AS and connects to the network elsewhere, and then the direct link to the source of the more specific goes down. Another possibility is that another AS also generates the aggregate and if their direct link to the AS in question goes down, they wouldn't see the alternative path over the first (local) AS if it was aggregated away. For these reasons, it's probably unavoidable to always propagate prefixes from customers, even when those are covered by an aggregate. All of this doesn't apply to aggregatable prefixes that belong to single homed ASes, but those are easily dealt with through manual configuration.

Note that suppressing a more specific learned from a peer when the aggregate is learned from an upstream AS is probably undesirable.

Aggregates are injected into the protocol with a metric that is the weighted average of the metrics for all underlying more specific prefixes. Holes (unreachable parts) in the address space covered by the aggregate receive a metric of 64897. The weight of a prefix is equal to the relative part of the aggregate it covers. So a /32 more specific in a /24 aggregate contributes 1/256th to the metric of the aggregate.

Discussion: Does it make sense to have different metrics for individual prefixes, grouped prefixes and aggregates?
For instance, an individual prefix could receive a metric of 11 per hop, while a grouped prefix gets 8 and an aggregate 5. This way, even if the aggregate starts with a higher initial metric than the individual route, the individual route's metric increases faster, so that after a certain number of hops, the metric for the individual route becomes higher than that of the aggregate, at which point the individual route can be dropped in favor of the aggregate. This mechanism makes it possible to limit propagation of individual routes and grouped routes to a certain number of hops.

3.5 Unreachability Propagation

When a link between two (sub-) ASes goes down, it's important that this information is communicated throughout the entire network as quickly as possible. However, experience with BGP shows that propagation of routing changes across a network as large as the internet simply takes time, and unfortunately, the further away an AS is from the affected link, the longer it takes for the update to reach that AS, leading to BGP's version of the "count to infinity" problem inherent to distance vector protocols.

Even though the protocol suggested here contains important link state aspects, the aggregation of topology information necessary to scale to the size of the internet would make this protocol vulnerable to the same problem without additional mechanisms. The assumption is that, several ASes removed from the AS where a link goes down, there is no longer a full view of all links that may be used to route around the failed link.

In order to avoid flap amplification and "count to infinity" as seen in BGP, whenever a link goes up or down, an identifying token is generated.
Both affected ASes must generate the same token that is unique for a certain link within a reasonable time period, and the token must make it possible to determine the order in which events occurred when multiple updates become available at the same time or in quick succession. Intermediate (sub-) ASes add their AS number to updates they propagate.

When a router receives an update over one path, it can now determine whether another path is also affected (because it shares part of the AS path that the update travelled through). When an update that announces reachability becomes available, the tokens can be compared to make sure that the reachability announcement is newer than the unreachability announcement.

When an unreachability update is generated or propagated, a "grace period" may be attached to it. This is especially useful when a link is about to be taken down for maintenance: by first announcing unreachability, packets that come in because the update hasn't propagated yet aren't lost. Whenever a router processes an update with a grace period, it subtracts the time the update was in transit (or a reasonable estimate) and lowers the grace period by a short time, like five seconds. The update is processed when the grace period expires. Because the grace period is reduced as the route propagates, the actual execution of the update happens "far away" first and then moves closer to the source of the update. A router or AS should announce a grace period of zero if it is unable to deliver packets to the affected destination.

4 Multicast

4.1 Single Source Multicast

Multicast information is carried as both source and listener information. Single source multicast source information consists of two prefixes that specify the multicast source: the multicast group prefix and the source address prefix.
When a router connects to listeners interested in a certain multicast stream (clients or other routers), it sends out a listener update for the specific group/source combination to the next router on the shortest path towards the source of the multicast stream. This way, an intermediate router knows which downstream routers are interested in the multicast stream so it can replicate packets as required.

Note that source information is aggregated and relatively stable, while listener information isn't aggregated and more volatile. Also note that a router must update listener information whenever the best path to the related source information changes.

Discussion: It's possible to optimize for the situation where router A is interested in a multicast group available from router F, where the shortest path from A to F is A - C - F, while router B is already receiving the multicast feed through B - D - E - F. If A and B are connected, A could receive the feed from B with only one additional hop having to carry the multicast feed, but this isn't the shortest path (that would be A - C - F) so it requires the dissemination of additional (volatile) information.

4.2 Any Source Multicast

Although ASM doesn't have a single source, the group prefix must be sourced into the routing protocol the same way as an SSM source to aid simple discovery. Operation is similar to that for SSM, except that multicast packets don't exclusively flow downstream: they are duplicated downstream and also upstream, with the exception of the direction of the packet's source, of course. And routers must apply an RPF check to all ASM packets to avoid loops and unnecessary duplication.

Discussion: Do we need a separate address family for ASM sources? This allows for disjoint ASM/unicast topologies but may require duplication of all unicast routing information.
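
The ASM forwarding behavior described in Section 4.2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name asm_forward, the use of interface names as strings, and the idea that a router keeps one set of "interested" interfaces per group (covering both upstream and downstream neighbors) are all hypothetical and not specified by this document.

```python
from typing import List, Set


def asm_forward(in_iface: str, rpf_iface: str, interested: Set[str]) -> List[str]:
    """Return the interfaces on which to replicate an ASM packet.

    in_iface   -- interface the packet arrived on
    rpf_iface  -- interface on the shortest path back to the packet's source
    interested -- interfaces (upstream or downstream) taking part in the group
    """
    # RPF check: only accept packets that arrive from the direction of
    # their source, to avoid loops and unnecessary duplication.
    if in_iface != rpf_iface:
        return []
    # Duplicate in every participating direction except back toward the source.
    return sorted(iface for iface in interested if iface != in_iface)
```

A packet that arrives on the RPF interface is copied to all other participating interfaces; the same packet arriving on any other interface fails the RPF check and is not forwarded at all.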

4.3 Multicast Address Family

SSM and ASM multicast routing information is a single address family. When the source address/prefix is specified, this indicates SSM. When the source prefix is the unspecified address, this indicates ASM. The multicast address family encompasses both source and listener information; it's not possible for links to support one but not the other.

An AS may support either no multicast, SSM only or both SSM and ASM. When only SSM is supported, multicast source information with the unspecified address as the source address for the stream must not be carried.

5 Security Considerations

To be worked out as the protocol becomes more concrete. Separating prefix information from topology information makes it easy to authenticate the source AS, but re-advertising and aggregation may prove problematic security-wise.

6 IANA Considerations

None at this time.

7 Document and Author Information

This document expires December, 2006. The latest version will always be available at http://www.muada.com/drafts/. Please direct questions and comments to the author:

Iljitsch van Beijnum

Email: iljitsch@muada.com

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2006). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the Internet Society.