ISP Column - September 2019

The ISP Column A column on things Internet

Why is Securing BGP just so Damn Hard?
August 2019

Geoff Huston

Stories of BGP routing mishaps span the entire thirty-year period that we’ve been using BGP to glue the Internet together. We’ve experienced all kinds of route leaks from a few routes to a few thousand or more. We’ve seen route hijacks that pass by essentially unnoticed, and we’ve seen others that get quoted for the ensuing decade or longer! There are ghost routes and gratuitous withdrawals. From time to time we see efforts to craft BGP packets of death and efforts to disrupt BGP sessions through the injection of spoofed TCP resets. After some 30 years of running BGP it would be good to believe that we’ve learned from this rich set of accumulated experience, and we now understand how to manage the operation of BGP to keep it secure, stable and accurate. But no. That is not where we are today. Why is the task to secure this protocol just so hard?

Are we missing the silver bullet that would magically solve all these BGP issues? If we looked harder, if we spent more money on research and tried new approaches, then would we find the solution to our problems? I doubt it. It’s often the case that those problems that remain unsolved for such a long time are unsolved because they are extremely hard problems, and they may not even have a solution. I suspect securing BGP falls into this “extremely hard problem” category. Let’s look at this in a bit more detail to explain why I’m so pessimistic about the prospects for securing BGP.

However, perhaps we might start with a more general question: Why are some Internet issues so challenging to solve, while others seem to be effortless and appear to solve themselves? For example, why was the IPv4 Internet an unintended runaway success in the 90’s, yet IPv6 has been a protracted exercise in industry-wide indecision?

Success and Failure Factors in the Internet

Some technologies have enjoyed success from the outset in the Internet. IPv4, of course, would be clearly placed in the runaway success category, but perversely enough IPv6 would not. NATs have been outstandingly successful, and the TCP transport protocol is still with us and it still drives the Internet. The DNS is still largely unchanged after some 30 years. More recently, content distribution systems and streaming protocols have been extremely successful, and most of today’s public Internet service could be characterized as a gigantic video content streaming network.

Why did these technologies succeed? Every case is different, of course, but there are some common success factors in all these technologies.

Piecemeal deployment: One important factor in many aspects of the Internet is the ability to support piecemeal deployment. Indeed this loosely coupled nature of many aspects of the Internet is now so pervasive that central orchestration of many deployed technologies in the Internet is now practically impossible. The Internet is just too big, too diverse, and too loosely coupled to expect that flag days will work. Any activity that requires some general level of coordination of actions across a diversity of networks and operational environments is a forbidding prospect. Instead, we need to be able to deploy these technologies on a piecemeal basis where one network’s decision to adopt a technology does not force others to do the same, and one network’s decision not to adopt a technology does not block others from adoption.
Relative Advantage to Adopters: The Internet is not a command economy and, generally, technologies are not adopted by fiat or regulatory impost. Market economies still operate in the Internet, and adoption is often fuelled by the perception of relative market advantage to early adopters. Technologies that reduce the service cost in some manner, or improve the service offering, or preferably both at the same time, tend to support an early adopter market advantage, and in so doing the technology enjoys rapid market uptake.
Economies of Scale: Technologies where more is cheaper also tend to be adopted. As the number of adopters increases the unit price of the technology and its use should go down, not up. This implies greater market incentives to adopt as adoption increases, creating a positive feedback loop between adoption and service operation costs.
Alignment of Common and Individual benefit: A common question in the Internet context is: What if everyone did it? A technology that generates benefits only when it is used by a few entities, and is less efficient when used by everyone, is less likely to succeed. For example, an aggressive TCP flow management protocol may generate benefits when only one or two users use it, but when everyone uses it, the protocol may be poor at generating a stable equilibrium across all users.

These success factors relate to success in a diverse, widely distributed and loosely coupled environment.

But the Internet has left behind a trail of failures every bit as voluminous, if not more so, than its history of successes. For example, spam in the email space is a massive failure for the Internet, as is our vulnerability to many forms of DDOS attacks. In a similar vein, after more than 20 years of exhortations to network operators, I think we can call spoofed source address filtering (or BCP 38) a failure. It’s very sensible advice and every network operator should do it. But they don’t. Which makes it a failure.

Secure end systems and secure networks are both failures, and the Internet of Trash looks like amplifying these systemic failures by many orders of magnitude. The broader topic of securing our transactions across the Internet also has its elements of failure, particularly in the failure of the public key certification framework to achieve comprehensive robustness. IPv6 adoption is not exactly a runaway success so far. The prospects of the Internet of Things amplifying our common vulnerability to poorly crafted, poorly secured and un-maintained endpoints should create a chilling prospect of truly massive cascading failure.

Again, there appear to be common factors for failure which are the opposite of the previous attributes. These include technologies where there is dependence on orchestration across the entire Internet, and technologies that require universal or near universal adoption. The case where there are common benefits but not necessarily individual benefits, and where there is no clear early adopter advantage lies behind the issues relating to the protracted transition to an IPv6-only Internet.

What makes a technical problem hard in this context?

It might be technically challenging: While we understand what we might want, that does not mean we know how to construct a solution with available technologies.
It might be economically perverse: The costs of a solution are not directly borne by the potential beneficiaries of deploying the solution.
It might be motivated by risk mitigation: We are notorious for undervaluing future risk!

So now let’s look at BGP routing security in this light. After 30 years why are we still talking about securing BGP?

Why is Securing BGP So Hard?

Here is my top ten of the reasons why securing BGP represents such a challenging problem for us.

1. No one is in Charge: There is no single ‘authority model’ for the Internet’s routing environment. We have various bodies that oversee the Internet’s domain name space and IP address space, but the role of a routing authority is still a vacant space. The inter-domain routing space is a decentralised distributed environment of peers. The implication of this characterisation of the routing space is that there is no objective reference source for what is right in routing, and equally no clear way of objectively understanding what is wrong. When two networks set up an eBGP session neither party is necessarily forced to accept the routes advertised by the other. If one party is paying the other party then there may be a clearer motivation to accept their advertised routes, but the listener is under no specific obligation to accept and use advertised routes. No one is in charge and there is no authority that can be invoked to direct anyone to do any particular action in routing. To be glib about this, there is no such thing as the routing police.
2. Routing is by Rumour: We use a self-learning routing protocol that discovers the network’s current inter-AS topology (or part of that topology to be more accurate). The basic algorithm is very simple, in that we tell our immediate eBGP neighbours what we know, and we learn from our immediate BGP neighbours what they know. The assumption in this form of information propagation is that everyone is honest, and everyone is correct in their operation of BGP. But essentially this is a hop-by-hop propagation, and the reachability information is not flooded across the network in the form of an original route reachability advertisement. Instead, each BGP speaker ingests neighbour information, applies local policy constraints, generates a set of advertisements that includes locally applied information and, subject to outbound policy constraints, advertises that information to its neighbours. This is in many ways indistinguishable from any other form of rumour propagation, and as there is no original information that is necessarily preserved in this protocol it is very challenging to determine if a rumour (or routing update) is correct or not, and impossible to determine which BGP speaker was the true origin of the rumour.
3. Routing is Relative not Absolute: Distance Vector protocols (such as BGP) work by passing their view of the best path to each destination to their immediate neighbours. They do not pass all their available paths, just the best path. This is a distinct point of difference to the operation of Shortest Path First (SPF) algorithms, which flood link level reachability information across the entire network, so that each SPF speaker assembles a (hopefully) identical view of the complete topology of the network and each SPF speaker assembles a set of next hop decisions that (hopefully) is consistent with all the other local decisions by each other SPF speaker. What this means is that not only does each BGP speaker only have a partial view of the true topology of the network, it is also the case that each BGP speaker assembles a view that is relative to their own location in the network.

Each eBGP speaker will assemble a different routing table, and that means that there is no single ‘reference’ routing view that could be used to compare with these dynamically assembled local views. In BGP there is no absolute truth about the topology of the network, as there is only a set of relative views that is assembled by each eBGP speaker.
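
To make the "relative view" point a little more concrete, here is a minimal sketch of best-path-only propagation. It is a toy model and not BGP itself: the five-AS topology, the AS numbers and the `propagate` function are all invented for illustration, and the only selection rule is shortest AS path with a fixed tie-break.

```python
from collections import deque

# Hypothetical toy topology: the AS numbers and links are invented.
LINKS = {
    1: [2, 3],
    2: [1, 4],
    3: [1, 4],
    4: [2, 3, 5],
    5: [4],
}

def propagate(origin):
    """Return {asn: best AS path to origin}; shortest path wins (toy rule)."""
    best = {origin: (origin,)}
    queue = deque([origin])
    while queue:
        speaker = queue.popleft()
        for neighbour in LINKS[speaker]:
            if neighbour in best[speaker]:
                continue  # loop detection: neighbour is already in the path
            # The speaker passes on only its single best path, hop by hop.
            candidate = (neighbour,) + best[speaker]
            current = best.get(neighbour)
            if current is None or (len(candidate), candidate) < (len(current), current):
                best[neighbour] = candidate
                queue.append(neighbour)
    return best

views = propagate(origin=5)

# Every link that shows up in at least one speaker's best path.
links_seen = {
    frozenset(pair)
    for path in views.values()
    for pair in zip(path, path[1:])
}
```

In this sketch each AS ends up holding a different path to AS 5, and the link between AS 1 and AS 3 appears in nobody's best path, so no speaker could reconstruct the full topology from what it holds.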
4. Routing is Backwards: Routing works in reverse. When a network advertises reachability information relating to an IP address prefix to a neighbour, the result is that the neighbour may use this link to send traffic to this network. Similarly, if a BGP speaker accepts an inbound routing advertisement from a neighbour it may use this to send outbound traffic to that neighbour. The flow of routing information is the opposite to the consequent flow of traffic in the network.
5. Routing is a Negotiation: Routing has two roles to play. The first is the discovery and maintenance of a usable view of the topology of the network, relative to the local BGP speaker as we’ve already noted. The second is that of routing policy negotiation. When two networks peer using BGP (here I’m using the term peer in the strict protocol sense, in that the two networks are adjacent neighbours rather than describing any business relationship between the two networks) there is a policy negotiation that takes place. Each network has local traffic export preferences and will selectively filter incoming route advertisements so that the preferred outbound routing paths that are selected maximise the local traffic export policy preferences of the network. Similarly, each network has local traffic import preferences, and will attempt to advertise route advertisements that maximise conformance to its preferred traffic import preferences.

Such policies are often entirely logical when viewed as business relationships. Customer routes are preferred to transit and peer routes (peer in a business sense). Customer networks should not re-advertise provider or peer routes to other providers or peers. When given a choice, networks would prefer to use provider paths that present the lowest cost and highest performance, while at the same time would prefer to use customer routes that represent the highest revenue potential. BGP is the protocol that attempts to discover a usable state within this set of route import and export constraints.
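
The business rules above can be sketched as a pair of small functions. This is a hedged illustration only: the preference values, the dictionary layout and the function names are hypothetical, the ranking of peer routes above provider routes is a common convention that I am assuming rather than something stated above, and the AS numbers come from the range reserved for documentation.

```python
# Hypothetical local preference values, keyed by the business relationship
# with the neighbour the route was learned from. Only the ordering matters.
PREFERENCE = {"customer": 300, "peer": 200, "provider": 100}

def best_route(routes):
    """Highest local preference wins; a shorter AS path breaks ties."""
    return max(
        routes,
        key=lambda r: (PREFERENCE[r["learned_from"]], -len(r["as_path"])),
    )

def may_export(learned_from, send_to):
    """Customer routes go to everyone; peer and provider routes go only
    to customers, so they are never re-advertised to peers or providers."""
    return learned_from == "customer" or send_to == "customer"

routes = [
    {"learned_from": "provider", "as_path": [64500, 64496]},
    {"learned_from": "customer", "as_path": [64501, 64502, 64496]},
]
chosen = best_route(routes)
```

Note that in this sketch the customer route is chosen even though its AS path is longer, which is exactly the kind of outcome that makes sense as business and looks odd as topology.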
6. Routing is non-Deterministic: This may sound odd, given that there is an underlying inter-AS topology and a part of BGP’s task is to discover this topology. This part of BGP’s operation is deterministic, in that a stable BGP state represents a subset of this overall topology. BGP (or at least untampered BGP) cannot create fictitious inter-AS links. However, the policy constraints introduce a level of non-determinism. See BGP Wedgies (RFC4264) for a description of one such case of non-determinism.

BGP is able to generate outcomes that can be described as "unintended non-determinism" that can result from unexpected policy interactions. These outcomes do not represent misconfiguration in the standard sense, since all policies may look completely rational locally, but their interaction across multiple routing entities can cause unintended outcomes, and BGP may reach a state that includes such unintended outcomes in a non-deterministic manner.

Unintended non-determinism in BGP would not be so bad if all stable routing states were guaranteed to be consistent with the policy writer's intent. However, this is not always the case. The operation of BGP allows multiple stable states to exist from a single configuration state, where some of these states are not consistent with the policy writer’s intent. These particular examples can be described as a form of route pinning, where the route is pinned to a non-preferred path.
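
A small sketch can show how one configuration admits more than one stable state. To be clear, this is not the scenario described in RFC4264: I have swapped in a simpler two-AS arrangement, often called DISAGREE in the routing policy literature, in which each AS prefers to reach origin AS 0 through the other. The `RANKING` table and both functions are invented for illustration.

```python
# Ranked permitted paths to origin AS 0; earlier means more preferred.
# Each AS would rather go via its neighbour than use its direct link.
RANKING = {
    1: [(1, 2, 0), (1, 0)],
    2: [(2, 1, 0), (2, 0)],
}

def choose(asn, selected):
    """Most preferred path that is currently available to this AS."""
    for path in RANKING[asn]:
        tail = path[1:]
        # A direct path is always available; an indirect path is available
        # only if the neighbour has currently selected exactly that tail.
        if tail == (0,) or selected.get(tail[0]) == tail:
            return path
    return None

def converge(order):
    """Let each AS re-select in the given repeating order until stable."""
    selected = {1: None, 2: None}
    changed = True
    while changed:
        changed = False
        for asn in order:
            new = choose(asn, selected)
            if new != selected[asn]:
                selected[asn] = new
                changed = True
    return selected

state_a = converge(order=(1, 2))
state_b = converge(order=(2, 1))
```

The policy is identical in both runs, yet the stable state depends on which AS happened to select first. Only one of the two ASes gets its preferred path in each outcome, and nothing in the configuration says which one it will be.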
7. There is no Evil Bit: For many years April 1 saw the publication of an April Fool’s RFC. In 2003 RFC3514 described the evil bit: “If the bit is set to 1, the packet has evil intent. Secure systems SHOULD try to defend themselves against such packets. Insecure systems MAY chose to crash, be penetrated, etc.”

In a security framework bad data does not identify itself as being bad. Instead, we use digital signatures and other forms of credential management to allow others to correctly identify good or genuine data. The assumption here is that if all of the good data carries credentials that can be verified, then all that’s left is bad or, at best, untrustworthy. However, there is a major assumption in this assertion, namely one of universal adoption. If we know that only some data has credentials, then the absence of such credentials does not help us in identifying what is trustworthy data.

In some environments, such as TLS, we are not interested in everyone, just the credentials of the remote party we are trying to connect to. In this case partial deployment can be mitigated to some extent by labelling those destinations where TLS validation is required. However, BGP is the entirety of the routing system. A BGP speaker amasses a complete view of reachability of all prefixes. In a scenario of partial deployment, where some routes have associated credentials and some do not, the task of determining which routes to use becomes a significant challenge.
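
The partial deployment problem can be sketched in a few lines. This is a deliberately simplified model, not a description of any deployed validation system: the `SIGNED` table stands in for whatever credentials exist, the `classify` function is hypothetical, and the prefixes and AS numbers are taken from the ranges reserved for documentation.

```python
# Hypothetical credential table: only one prefix has a verifiable origin.
SIGNED = {"192.0.2.0/24": 64496}

def classify(prefix, origin, signed_origins):
    """'valid' or 'invalid' where credentials exist, otherwise 'unknown'."""
    if prefix not in signed_origins:
        return "unknown"
    return "valid" if origin == signed_origins[prefix] else "invalid"

# A prefix with credentials: the genuine and the bogus origin differ.
genuine_signed = classify("192.0.2.0/24", 64496, SIGNED)
bogus_signed = classify("192.0.2.0/24", 64511, SIGNED)

# A prefix without credentials: both announcements look exactly the same.
genuine_unsigned = classify("198.51.100.0/24", 64497, SIGNED)
bogus_unsigned = classify("198.51.100.0/24", 64511, SIGNED)
```

For the prefix without credentials the genuine and the bogus announcement both come back as "unknown". Rejecting "unknown" would discard every route that has not yet adopted credentials, while accepting it leaves those routes exactly as exposed as before.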
8. Risk is Hard: Taking measures to mitigate risk is a bit like buying a ticket in a reverse lottery. In a normal lottery everyone spends money to buy a ticket, and there is only one winner. All the ticket buyers can see that there is a winner, and in some manner this justifies their purchase of a ticket. But in a reverse lottery the winner is rewarded by not being a victim of some malicious attack. Because the attack has been deflected the winner is completely unaware that they are a winner and no one can see the value in buying a ticket in the first place. In such systems of common risk mitigation, where everyone pays, but there are no clear winners, the system is difficult to sustain.
9. Because Business: In the Internet each component network is motivated by conventional business economics, attempting to balance factors of risk and opportunity in their enterprise. Spending resources on security must be seen to either reduce business risk or increase an enterprise’s competitive advantage.

But it’s all too often the case that network enterprises under-appreciate risk. Such investments in risk mitigation do not necessarily translate into a visible differentiator in the market, and in a competitive environment the result is a higher cost of service without some associated service differentiation. Where the risk mitigation results in a common outcome there is little to be had in the way of a competitive advantage.
10. We actually don't know what we want!: It is extremely challenging to identify a ‘correct’ routing system, and it is far easier to understand when and where an anomaly arises and react accordingly. This situation could be characterized as: we know what we don’t want when we see it, but that does not mean that we can recognize what we actually want even when we may be seeing it! This is partially due to the observation that the absence of a recognizable ‘bad’ does not mean that all is ‘good’!
