taylor-mmusic-rtp-failover-problem-01.txt

Internet DRAFT - draft-taylor-mmusic-rtp-failover-problem

Last Version:	draft-taylor-mmusic-rtp-failover-problem-01.txt
Date:	08-Feb-2017
Disposition:	expired
Previous Versions:	draft-taylor-mmusic-rtp-failover-problem-00.txt - 01-Mar-2016

MMUSIC                                                        M. Taylor
Internet Draft                                                N. Larkin
Intended status: Informational                      Metaswitch Networks
Expires: February 28, 2017                              August 31, 2016



                   RTP media failover: problem statement
              draft-taylor-mmusic-rtp-failover-problem-01.txt



Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on February 28, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.





Taylor & Larkin       Expires February 28, 2017                [Page 1]

Internet-Draft  RTP media failover: problem statement       August 2016


Abstract

   Network-based functions that terminate large numbers of RTP media
   streams and that offer high availability, such as session border
   controllers or conference bridges, typically preserve the same IP
   address towards sources of RTP media across a failover event because
   it is impractical to signal a change of IP address towards large
   numbers of RTP sources sufficiently rapidly to keep media
   interruption intervals within acceptable limits.  The need to
   preserve the IP address of RTP media terminating functions across a
   failover event imposes architectural requirements that can be
   difficult or costly to meet, particularly in network function
   virtualization environments.  This document describes the problem,
   outlines the key requirements for a solution, and discusses the
   merits and shortcomings of various existing approaches to solving
   the problem, before arguing that a new solution is needed.

Table of Contents

   1. Introduction...................................................3
   2. Problem Space..................................................4
      2.1. Geographic Redundancy.....................................4
      2.2. Resource Efficiency.......................................5
      2.3. Evolution to Cloud-Centric Virtualized Network Functions..6
      2.4. Absence of Layer 2 Connectivity...........................6
   3. Requirements for Improved Failover of RTP Media Streams........7
      3.1. Upper Limit on Media Interruption Time....................7
      3.2. Geographic Redundancy.....................................8
      3.3. Resource Efficiency.......................................8
      3.4. Network Compatibility.....................................8
      3.5. Backwards Compatibility...................................8
      3.6. Compatibility with Hosted NAT Traversal...................8
   4. Available Solutions and Their Limitations......................9
      4.1. Use of SIP Re-INVITE or UPDATE to Update SDP..............9
      4.2. Restriction of Size of Fault Zone........................10
      4.3. Re-Routing at the IP Layer Using BGP.....................10
      4.4. Re-routing at the IP Layer Using Link-State Protocols....11
      4.5. Anycast..................................................12
      4.6. RTP Proxy / Load Balancer................................14
      4.7. Multipath RTP............................................14
   5. Proposed New Approach to RTP Media Failover...................15
   6. References....................................................15
      6.1. Normative References.....................................15
      6.2. Informative References...................................16
   7. Change Log....................................................17
      7.1. Changes in draft-taylor-mmusic-rtp-failover-problem-01...17




1. Introduction

   Session Description Protocol (SDP) [RFC4566], typically conveyed via
   Session Initiation Protocol (SIP) [RFC3261] requests, provides a
   means for Real-time Transport Protocol (RTP) [RFC3550] endpoints to
   negotiate, via the Offer/Answer Model described in RFC3264
   [RFC3264], the details of media sessions to be established between
   them.  An
   endpoint conveys the specific IP address and port number on which it
   wishes to receive a given media stream via the c= (connection
   information) and m= (media description) lines defined by SDP.  An
   endpoint that wishes to change the IP address and port number on
   which it is to receive a given media stream needs to send updated
   SDP to the transmitter of that media stream.
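
   As an illustration only, the following Python sketch extracts the
   receive address and port from the c= and m= lines of an SDP body.
   The function name receive_targets and the sample SDP (which uses a
   documentation address from 192.0.2.0/24) are hypothetical and are
   not taken from any specification.  The sketch assumes well-formed
   unicast SDP and ignores all other line types.

```python
def receive_targets(sdp: str):
    """Return one (media, address, port) tuple per m= line.

    Assumption: a c= line that follows an m= line applies to that
    media description only; otherwise the session-level c= is used.
    """
    session_addr = None
    targets = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("c="):
            # c=<nettype> <addrtype> <connection-address>
            addr = line[2:].split()[2]
            if targets:
                media, _, port = targets[-1]
                targets[-1] = (media, addr, port)
            else:
                session_addr = addr
        elif line.startswith("m="):
            # m=<media> <port>[/<count>] <proto> <fmt> ...
            fields = line[2:].split()
            port = int(fields[1].split("/")[0])
            targets.append((fields[0], session_addr, port))
    return targets


# Hypothetical SDP body used only for this example.
EXAMPLE_SDP = """v=0
o=- 0 0 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0
"""
```

   For the sample body, the function yields a single audio stream to be
   received at 192.0.2.10, port 49170.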

   Some services that make use of SIP and SDP to negotiate the
   establishment of media sessions for voice, video or real-time
   streaming purposes employ RTP media relay functions in the network,
   for example associated with a SIP back-to-back user agent in the
   form of a session border controller.  A single such RTP media relay
   instance may support the relaying of tens of thousands of concurrent
   media streams.  Likewise, a large-scale conference bridge may
   support many thousands of concurrent RTP sessions.

   With network functions that terminate such large numbers of RTP
   sessions (referred to in the remainder of this document as "RTP-
   terminating network functions"), it is desirable to provide some
   means to protect against hardware or software failures in a manner
   that preserves the RTP sessions, such that failover can be
   accomplished with minimal transient impairment to the audio or video
   streams as perceived by users of the service.  This may be
   accomplished by deploying a second, identical instance of the
   network function to act as a backup.  The two instances work
   together as a pair, with one instance actively performing RTP
   session termination and the other instance standing by, ready to
   take over if the active instance fails.  Some means is provided to
   enable the backup instance to detect failure of the active instance,
   for example by means of a heartbeat protocol between the two
   instances.  On detecting failure of the active instance, the backup
   instance becomes active and can take over the processing of all the
   media streams that were previously handled by the active instance.
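
   The failure detection described above can be sketched as a simple
   timeout on heartbeat arrival.  This is a minimal illustration under
   stated assumptions, not a description of any particular product:
   the class name StandbyMonitor, its fields, and the timeout value
   are all hypothetical, and time is passed in explicitly so that the
   logic is independent of any clock or transport.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Standby instance's view of its active peer (illustrative only)."""

    timeout_s: float          # assumed: longer silence means the peer failed
    last_heartbeat_s: float   # time at which the last heartbeat arrived
    active: bool = False      # True once this instance has taken over

    def on_heartbeat(self, now_s: float) -> None:
        """Record a heartbeat received from the active instance."""
        self.last_heartbeat_s = now_s

    def poll(self, now_s: float) -> bool:
        """Return True exactly once, at the moment takeover is triggered."""
        silent_for = now_s - self.last_heartbeat_s
        if not self.active and silent_for > self.timeout_s:
            self.active = True
            return True   # caller would now take over the media streams
        return False
```

   In a scheme of this kind, the chosen timeout adds directly to the
   interruption seen by users, since no takeover can begin until the
   timeout has elapsed.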

   The handover between active and standby network function instances
   is typically handled in a manner that is transparent to the RTP
   endpoints that are currently sending media to the active instance.
   This may be accomplished by assigning a virtual IP address that is
   shared between the active and standby instances of the network
   function.  It is this IP address that is conveyed over SDP to the
   set of RTP endpoints served by the network function as the
   destination to which they should send their media streams.  Under
   normal operating conditions, the virtual IP address is associated
   with the currently active member of the RTP-terminating network
   function pair.  When the standby member of the network function pair
   detects a failure of the active member, it becomes active and claims
   the virtual IP address, for example by issuing a gratuitous Address
   Resolution Protocol (ARP) [RFC826] message.  By this means, all the
   media streams that are currently being transmitted to the formerly
   active member of the network function pair may be re-directed to the
   newly active member without any of the transmitting endpoints being
   aware of the change.

   Fault tolerance schemes that take advantage of IP address swapping
   in the manner described above are widely employed by network
   functions that terminate large numbers of RTP streams and are often
   embodied in physical appliances such as session border controllers.

2. Problem Space

   The following points describe problematic aspects of highly
   available network functions that terminate large numbers of RTP
   media streams, for which an improved solution (or solutions) is
   sought.  A common theme among these problems is the fact that
   failure recovery of the network function needs to be transparent to
   the sources of the RTP media streams handled by a failed RTP-
   terminating network function instance, in the sense that such
   sources are not aware that failover of the RTP-terminating network
   function instance serving them has taken place, except to the extent
   that they may experience some momentary interruption of received
   media.  In particular, RTP endpoints continue to send media to the
   same IP address before and after an RTP-terminating network function
   failover event.

2.1. Geographic Redundancy

   A pair of RTP-terminating network function instances deployed in the
   same physical location in active-standby mode and sharing the same
   virtual IP address can provide protection against equipment failure
   such as the failure of the active instance itself or the failure of
   network connectivity to the active instance.  However, this
   arrangement does not protect against failure of the site at which
   the RTP-terminating network function is deployed, or failure of
   network connectivity to the site as a whole.

   Network operators typically protect against site failure or site
   connectivity failure by implementing some form of geographic
   redundancy.  This usually involves replicating the equipment needed
   to support a given service on at least two sites, such that in the
   event of the failure of one site, the service can continue to be
   supported by making use of the equipment on one of the other sites.
   Since redundant equipment is deployed within each site to protect
   against equipment failure, protection against site failure requires
   yet more equipment to be deployed, which has obvious cost
   implications.  Note that site failure is considered to be a far less
   frequent event than equipment failure, and typically no effort is
   made to preserve active real-time media sessions across a site
   failover, unlike the case of equipment failover.

   Network operators can potentially reduce the cost of meeting service
   availability targets by protecting against both equipment failure
   and site failure with a single common failure recovery mechanism.
   For example, a pair of RTP-terminating network function instances
   could be deployed with one member of the pair located in one site
   and the other member located in another site.  If the network
   operator determines that real-time media sessions must be preserved
   across equipment failure, then we need to be able to switch all of
   the media streams addressed to the failed RTP-terminating network
   function instance to the standby instance located in the backup site
   sufficiently quickly that users experience no more than brief
   transient interruption to their incoming media streams.

   This can be accomplished quite efficiently at Layer 2 (by swapping a
   virtual IP address from one member of the network function pair to
   the other), but this approach requires the establishment of a Layer
   2 connection between sites, which can be complex and inconvenient to
   accomplish.  Other methods for preserving real-time media streams
   across geographic failover are discussed below in Section 4.

2.2. Resource Efficiency

   While it is common practice to deploy RTP-terminating network
   functions as active-standby pairs to provide high availability, this
   arrangement is relatively wasteful of hardware resources because, at
   any one time, only half the hardware supporting the RTP-terminating
   network function is doing useful work.  The cost of hardware to
   support RTP processing can be relatively high, particularly if the
   function is required to perform compute-intensive work on media
   streams such as encryption/decryption of Secure Real-time Transport
   Protocol (SRTP) [RFC3711], or audio or video transcoding.

   The amount of hardware resources required to support any given
   capacity of RTP-terminating network function could be very
   considerably reduced if it were possible to provide protection
   against hardware or software failure by means of a pooling
   arrangement.  This could be in the form of a group of RTP-
   terminating network function instances, all of which are active all
   of the time, and where their total aggregate capacity exceeds the
   maximum expected load by a sufficient margin that the load carried
   by any given instance can be successfully load-balanced across the
   remaining instances in the event of the failure of this instance. An
   alternative approach would be to deploy a small number of standby
   instances to protect a much larger number of active instances, and
   to switch all of the RTP sessions carried by a failed active
   instance over to one of the standby instances.

   This type of high availability scheme is often known as N+k
   redundancy.  While the latter example above of N+k redundancy (N x
   active, k x standby) is compatible with the swapping of virtual IP
   addresses, the former example (active-active load-balanced) is not.
   Most network operators express a strong preference for active-active
   N+k schemes, regardless of any consideration as to whether active-
   active N+k can actually be shown to deliver higher availability than
   active-standby N+k.
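
   The difference in hardware requirement between paired active-standby
   deployment and N+k redundancy can be illustrated with simple
   arithmetic.  The sketch below is a rough model only: the function
   names and all numeric figures are hypothetical, every instance is
   assumed to have the same capacity, and the N+k pool is assumed to
   tolerate the simultaneous failure of up to k instances.

```python
import math


def instances_active_standby(load: int, per_instance: int) -> int:
    """Instances needed when every active instance has its own standby."""
    n = math.ceil(load / per_instance)
    return 2 * n


def instances_n_plus_k(load: int, per_instance: int, k: int) -> int:
    """Instances needed for N+k redundancy.

    The count is the same whether the k spare instances are idle
    standbys or the load is spread across all N+k active instances,
    because in both cases N instances' worth of capacity must survive
    the loss of k instances.
    """
    n = math.ceil(load / per_instance)
    return n + k


# Hypothetical figures: 100,000 sessions, 10,000 sessions per instance.
PAIRED = instances_active_standby(100_000, 10_000)   # 20 instances
POOLED = instances_n_plus_k(100_000, 10_000, k=2)    # 12 instances
```

   With these assumed figures, N+k redundancy with k = 2 needs 12
   instances where paired active-standby deployment needs 20.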

2.3. Evolution to Cloud-Centric Virtualized Network Functions

   Many network operators are embracing network functions
   virtualization (NFV), whereby network functions that would
   previously have been embodied as physical appliances are now
   embodied as software components deployed in a virtualized cloud
   computing environment.  With the move to NFV, network operators are
   expressing a strong preference for cloud-centric approaches to
   network function design. This tends to imply the deployment of
   relatively large numbers of relatively small instances of network
   functions, where all instances are active, and protection against
   failures at any level from individual instance through physical host
   up to a complete site is provided by means of active-active N+k
   redundant pools of virtualized network function instances.

   It is difficult in practice to architect highly available solutions
   for RTP-terminating network functions based on active-active N+k
   redundancy that meet the requirement that failover must be
   transparent to sources of RTP media.  Possible solutions and their
   limitations are discussed later in this document.

2.4. Absence of Layer 2 Connectivity

   Widely used active-standby techniques for RTP-terminating network
   functions that involve the sharing and swapping of a virtual IP
   address typically require that the active and standby members in a
   high availability arrangement are directly connected via a Layer 2
   network segment.

   As discussed in section 2.1 above, this can be problematic if the
   active and standby RTP-terminating network function instances are
   located in different geographic sites, although this problem is
   soluble, for example with the aid of a Layer 2 Virtual Private
   Network.

   A more intractable problem arises when a network operator chooses to
   design a network functions virtualization infrastructure with a
   Layer 3 centric fabric that does not provide L2 connectivity between
   virtualized workloads.  While this is not yet a common approach to
   cloud network design, scaling issues with L2-centric fabrics are
   expected to drive increasing popularity of L3-centric approaches in
   the future.  In L3-centric cloud network fabrics, failover of RTP-
   terminating network functions based on virtual IP address swapping
   cannot be supported with the usual approach based on gratuitous ARP
   [RFC826].

   Approaches based on Network Address Translation (NAT) [RFC3022] such
   as OpenStack's Floating IP Address concept could potentially address
   this need, but the insertion of additional network elements into the
   RTP path to perform NAT introduces additional failure scenarios that
   need to be protected against.  Also, such approaches require that
   the infrastructure management plane is capable of responding
   very quickly to a NAT re-configuration request, such that the
   interruption in incoming media streams experienced by users is
   perceived as no more than momentary.  Practical experience suggests
   that this cannot currently be achieved with real-world cloud
   infrastructure solutions.

3. Requirements for Improved Failover of RTP Media Streams

   For the reasons described in section 2 above, it is considered
   desirable to specify new behaviors of RTP endpoints so as to provide
   an improved method for failover of RTP media streams that supports
   high availability of RTP-terminating functions in the network.

   When considering any new solution for failing over large numbers of
   RTP media streams, the following requirements should be met.

3.1. Upper Limit on Media Interruption Time

   A new solution designed to preserve RTP media in the face of failure
   of an RTP-terminating network function instance MUST successfully
   re-establish a viable RTP media path for each and every flow that
   was previously handled by the failed instance within a maximum
   elapsed time of two seconds, and SHOULD re-establish all media flows
   within 500 milliseconds.

3.2. Geographic Redundancy

   A new solution for failover of RTP media streams MUST be capable of
   preserving media sessions across the failure of a physical site or
   the failure of network connectivity to a physical site, even when
   the two sites are separated by hundreds of miles.

3.3. Resource Efficiency

   A new solution for failover of RTP media streams MUST support N+k
   redundancy of RTP-terminating network functions, where k << N.

3.4. Network Compatibility

   A new solution for failover of RTP media streams MUST NOT assume the
   existence of Layer 2 connectivity between RTP-terminating network
   function instances that are protecting each other, and MUST NOT
   assume the existence of any network capabilities beyond basic IP
   unicast connectivity.

3.5. Backwards Compatibility

   It will take time to upgrade the installed base of RTP endpoints to
   embody any new behaviors required to support a new solution for RTP
   media failover.  RTP-terminating network functions that embody a new
   solution for failover of RTP streams MUST remain compatible with RTP
   endpoints that do not support the new behaviors.  RTP-terminating
   network functions that support a new solution for failover of RTP
   media streams MAY continue to support legacy methods for failover of
   RTP media streams, but are not required to do so.

3.6. Compatibility with Hosted NAT Traversal

   A new solution for failover of RTP media streams MUST be compatible
   with the method of Hosted NAT Traversal described in RFC7362
   [RFC7362].  If the solution requires that, following failover, the
   RTP endpoint is to transmit RTP media streams to an RTP-terminating
   network function at an IP address and port number different from
   those used prior to failover, the RTP endpoint MUST commence
   transmission of RTP packets towards the new IP address and port
   number without waiting to receive RTP media packets from the new IP
   address and port number.




4. Available Solutions and Their Limitations

   In this section, we discuss alternative ways of supporting high
   availability of RTP-terminating network functions without any change
   to the existing behavior of SIP- and SDP-signaled RTP endpoints.  It
   will be seen that none of these methods meets the full set of
   requirements identified in Section 3 above.

4.1. Use of SIP Re-INVITE or UPDATE to Update SDP

   A SIP User Agent in an active session state associated with a
   currently active RTP transmitter can be instructed to transmit RTP
   to a different destination IP address and port number by sending it
   an in-dialog re-INVITE or UPDATE request that includes SDP with the
   new connection details.

   This use of a re-INVITE or UPDATE request to update SDP within an
   active session may be leveraged to manage failover of an RTP-
   terminating network function instance in the network.  The SIP User
   Agent instance that is associated with the RTP-terminating network
   function instance, upon detecting the failure of said instance,
   could send a re-INVITE or UPDATE request to each and every SIP UA
   that is in an active session and sending RTP media to the failed
   RTP-terminating network function instance, with an SDP body that
   directs each RTP endpoint to send RTP media to a different RTP-
   terminating network function instance.

   In practice, it is found that the processing resources required to
   transmit the required number of re-INVITE or UPDATE requests and
   process all of the responses so as to achieve resumption of all
   active RTP media flows within an acceptable elapsed time far exceed
   the processing resources that would normally be required to support
   the SIP signaling load associated with that number of concurrent
   sessions.  It is therefore very costly to support RTP media failover
   by means of this technique.
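
   The scale of the mismatch can be illustrated with a back-of-envelope
   calculation.  In the sketch below, the session count and mean
   session duration are hypothetical figures chosen for the example;
   the two-second budget is the limit stated in Section 3.1.  The
   sketch assumes one re-INVITE or UPDATE transaction per session, and
   estimates the steady-state session setup rate using Little's law
   (arrival rate = concurrent sessions / mean session duration).

```python
def failover_rate(sessions: int, budget_s: float) -> float:
    """Transactions per second needed to re-signal every session in time."""
    return sessions / budget_s


def steady_state_rate(sessions: int, mean_duration_s: float) -> float:
    """Session setups per second in steady state, by Little's law."""
    return sessions / mean_duration_s


# Hypothetical figures: 50,000 concurrent sessions lasting 200 s on average.
BURST = failover_rate(50_000, 2.0)          # 25,000 transactions/s
NORMAL = steady_state_rate(50_000, 200.0)   # 250 session setups/s
RATIO = BURST / NORMAL                      # 100x the steady-state rate
```

   Under these assumptions, failover signaling would need to run at
   about one hundred times the steady-state session setup rate for the
   duration of the failover.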

   One use case for RTP-terminating network functions is in peering
   arrangements for the connection of large numbers of concurrent RTP
   sessions between different networks.  In this situation, if a SIP UA
   associated with an RTP-terminating network function were to send
   large numbers of in-dialog re-INVITE or UPDATE requests in a short
   elapsed time to its peer SIP UA in the other network so as to
   request that a large number of incoming RTP streams be sent to a
   different IP address and port number, the receiving SIP UA might
   easily be overwhelmed by the incoming load of SIP message traffic.
   This could have the doubly deleterious effect of failing to achieve
   the failover of many of the RTP streams in a timely fashion, and
   failing to complete requests for the establishment of new sessions
   while the signaling overload condition persists.

4.2. Restriction of Size of Fault Zone

   In a network functions virtualization environment, it is possible to
   terminate large numbers of RTP sessions by deploying large numbers
   of small scale RTP-terminating network function instances.  These
   instances could be deployed without any form of redundancy, such
   that the failure of any instance causes the complete loss of all RTP
   media sessions currently being handled by it.

   With this type of arrangement it could be argued that, if the
   maximum number of sessions that are handled by a single RTP-
   terminating network function instance is low enough, then the
   failure of one instance and the consequent loss of all the media
   sessions that it is currently handling represents a relatively minor
   impact to the service as a whole.

   Some network operators may take the view that this approach meets
   their criteria for an acceptable quality of service.  However, it
   should be pointed out that, with a reasonably efficient
   implementation of the RTP-terminating function, a minimally-sized
   instance occupying just a single virtual CPU could be handling
   several hundred concurrent sessions.  For most network operators,
   the loss of several hundred concurrent media sessions arising from
   the failure of an unprotected network element would be unacceptable.

   It is also worth pointing out that deploying large numbers of small
   instances of a network function may restrict the size of the fault
   zone as it relates to failure of small-scale resources such as
   virtual machines, hypervisors or compute nodes, but it does not
   restrict the size of the fault zone as it relates to failure of
   large-scale resources such as an availability zone, an entire cloud
   instance or an entire site.  Protection is still required in the
   event of these resources failing.

4.3. Re-Routing at the IP Layer Using BGP

   It is possible to cause IP packets to be delivered to a different
   host system by means of appropriate interaction with the routing
   protocols of the IP network control plane.  This capability can be
   exploited to support a highly available RTP-terminating network
   function.

   In an IP network that employs Internal Border Gateway Protocol
   (IBGP) [RFC4271], one way to accomplish this is to add a BGP speaker
   function to the RTP-terminating network function.  The RTP-
   terminating network function uses BGP to advertise a route to the
   RTP service address via its own host address.  The IP infrastructure
   to which the RTP-terminating network function instance is connected
   effectively treats the host address of this instance as the next hop
   towards the RTP service address, and routes IP packets addressed to
   the RTP service address towards that RTP-terminating network
   function instance.

   In the event of the failure of such an RTP-terminating network
   function instance, another RTP-terminating network function instance
   that is providing protection for the failed instance issues a BGP
   message that withdraws the original RTP service route via the host
   address of the failed instance, and advertises a new route via its
   own host address.  The IP infrastructure will now route all IP
   packets addressed to the RTP service address towards the protecting
   RTP-terminating network function instance.

   This approach places a number of demands on the IP routing
   infrastructure to which the active and standby RTP-terminating
   network function instances are connected which it may be difficult
   to meet in practice.  In particular, the routing infrastructure must
   be able to respond to the withdrawal of a route and the
   advertisement of a new route to the RTP service address sufficiently
   rapidly to meet the requirement described in Section 3.1 on the
   upper limit for media interruption time.

   It also requires that the routing policy prevailing in the
   infrastructure allows for individual host routes (e.g. IPv4 /32 or
   IPv6 /128 routes) to be installed in routing tables.

   In many cases it may not be practicable or even possible to meet
   these demands.

4.4. Re-routing at the IP Layer Using Link-State Protocols

   In IP networks that employ Interior Gateway Protocols other than
   IBGP, for example OSPF [RFC2328] or IS-IS [RFC1142], it may be
   possible to re-route RTP media at the IP layer using methods
   conceptually similar to that described in section 4.3.  However,
   link-state protocols rely on the detection of a link failure to
   initiate re-routing of IP traffic, and it is unlikely that the
   failure of an RTP-terminating network function instance could always
   be detected as a link failure by neighboring routers sufficiently
   quickly to meet the requirement on the upper limit for media
   interruption time described in section 3.1.




4.5. Anycast

   Anycast [RFC4786] is a routing scheme whereby multiple host systems
   share a single address, and IP packets destined for that address are
   routed to the host that is "nearest" the sender.

   Anycast techniques can be employed to implement a scheme that is
   conceptually similar to that described in Section 4.3 above, but
   which relies on the active and standby members of an RTP-terminating
   network function pair to advertise different route weights such that
   IP traffic is routed to the active member.  Failover requires that
   the advertised route weights are adjusted to ensure that IP traffic
   is routed to the standby member.

   Anycast techniques can also be employed to support a form of load-
   balancing.  If multiple RTP-terminating network function instances
   are advertised to be reachable at the same address and with equal
   distance, the IP routing infrastructure can distribute load across
   the instances using Equal Cost Multi Path (ECMP) routing.
   Furthermore, if some means is provided for the detection of the
   failure of any given RTP-terminating network function instance and
   subsequent transmission of a BGP message withdrawing the route to
   that instance, then ECMP should act to re-distribute the load across
   the remaining instances.

   This use of Anycast appears to address the N+k active-active use
   case very effectively, although it should be noted that, in the case
   of an RTP-terminating network function that is acting as a media
   relay, for example as a component of a session border controller, it
   is not generally possible to ensure that the two streams that make
   up a bi-directional RTP session are handled by the same media relay
   function instance.  This may well add considerably to the complexity
   of the design of the media relay function.

   A more serious problem with using Anycast in this way is that, in a
   virtualized environment, it becomes extremely challenging to manage
   the placement of the RTP-terminating network function instances.
   These challenges arise because, at each router supporting ECMP that
   sees multiple available routes to the Anycast address with the same
   distance, the router splits the traffic evenly between all these
   routes.  If there is more than one router between the source of the
   traffic and the set of RTP-terminating network function instances
   that are the destination of the traffic, these instances must be
   arranged so as to create a symmetrical routing tree in order to
   ensure that each instance receives a similar share of the overall
   traffic load.



Taylor & Larkin       Expires February 28, 2017               [Page 12]

Internet-Draft  RTP media failover: problem statement       August 2016


   To illustrate this, consider the following scenario, shown in the
   diagram below.  All RTP media traffic from a given set of RTP
   endpoints transits via Router A (which might be, for example, an
   end-of-rack L3 switch), and then via either Router B or Router C
   (which might be, for example, top-of-rack L3 switches) to RTP-
   terminating network function instances M1 through M5.  The routes to
   instances M1 and M2 are via Router B, while the routes to instances
   M3, M4 and M5 are via Router C.  All RTP-terminating network
   function instances are advertising the same RTP service address.


                                 +--------+
                                 |        |
                                 |        |
                                 |        +---> M1
                                 | Router |
                  +--------+     |    B   |
                  |        |     |        +---> M2
                  |        +-----+        |
                  |        |     |        |
   RTP flows -----> Router |     +--------+
                  |    A   |     +--------+
                  |        |     |        |
                  |        +-----+        +---> M3
                  |        |     |        |
                  +--------+     | Router +---> M4
                                 |    C   |
                                 |        +---> M5
                                 |        |
                                 |        |
                                 +--------+


   From the point of view of Router A, there are two possible routes to
   the RTP service address, via Router B and Router C respectively.  It
   therefore sends half of the RTP flows to Router B, and half to
   Router C.  Router B will distribute half of the RTP flows that it
   receives from Router A to each of M1 and M2, while Router C will
   distribute one third of the flows it receives from Router A to each
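   of M3, M4 and M5.  M1 and M2 therefore each receive one quarter of
   the total RTP flows, while M3, M4 and M5 each receive only one
   sixth, so the load is not evenly balanced over the population of
   RTP-terminating network function instances.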
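
   The arithmetic behind this example can be reproduced with a short
   sketch.  The dictionary encoding of the diagram and the shares
   helper are assumptions made for illustration; the only behavior
   modeled is that each router splits its incoming traffic evenly
   across its next hops.

```python
from fractions import Fraction

# Topology from the diagram: Router A feeds Routers B and C, which feed
# the instances.  The dictionary layout is an assumption for this sketch.
topology = {
    "A": ["B", "C"],
    "B": ["M1", "M2"],
    "C": ["M3", "M4", "M5"],
}


def shares(node, incoming=Fraction(1)):
    """Split `incoming` evenly over the next hops of `node`, recursively."""
    next_hops = topology.get(node)
    if next_hops is None:          # a leaf, i.e. an instance
        return {node: incoming}
    result = {}
    for hop in next_hops:
        result.update(shares(hop, incoming / len(next_hops)))
    return result


for instance, share in shares("A").items():
    print(instance, share)
```

   Moving one instance from Router C to Router B in the topology table
   changes every share, which hints at why placement matters so much
   in this scheme.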

   In the general case, placing the instances of RTP-terminating
   network functions so as to form a symmetrical routing tree presents
   an extremely difficult problem for the workload scheduling algorithm
   in a virtualized environment, particularly if the intention is to
   spread the load between RTP-terminating network function instances
   on two or more separate sites.  Topology-aware scheduling is not a
   capability offered by current generations of cloud orchestration
   software, and even if it were, dynamically scaling the population of
   RTP-terminating network function instances while maintaining a
   symmetric routing tree would be cumbersome and inflexible.

4.6. RTP Proxy / Load Balancer

   It is possible to imagine a solution based on an RTP proxy or load
   balancer which sits between RTP-terminating network functions and a
   population of RTP endpoints that are sending RTP media towards those
   RTP-terminating network functions.  The RTP proxy or load balancer
   presents a single IP address towards the population of RTP
   endpoints.  In the event that an instance of an RTP-terminating
   network function fails, the RTP proxy or load balancer can detect
   the failure of the instance, and re-direct incoming RTP media to a
   different instance of an RTP-terminating network function which has
   been configured so as to receive and correctly process the incoming
   RTP media streams that were previously being sent to the failed
   instance.
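
   The re-direction behavior described above can be sketched as a toy
   model.  The RtpProxy class, its method names, and the session
   identifiers are hypothetical; failure detection and the transfer of
   session state to the replacement instance are deliberately not
   modeled.

```python
class RtpProxy:
    """Toy model of a proxy that fronts several instances (hypothetical)."""

    def __init__(self, instances):
        self.healthy = set(instances)
        self.assignment = {}          # session id -> instance name

    def route(self, session_id):
        """Return the instance that should receive this session's media."""
        current = self.assignment.get(session_id)
        if current not in self.healthy:
            if not self.healthy:
                raise RuntimeError("no healthy instance available")
            # New session, or its instance has failed: choose a healthy
            # instance (the choice is arbitrary in this sketch).
            current = sorted(self.healthy)[0]
            self.assignment[session_id] = current
        return current

    def mark_failed(self, instance):
        """Record a detected failure; how it is detected is out of scope."""
        self.healthy.discard(instance)


proxy = RtpProxy(["M1", "M2"])
first = proxy.route("session-1")
proxy.mark_failed(first)
second = proxy.route("session-1")
print(first, "->", second)
```

   The RTP endpoints continue to send to the proxy's single address
   throughout; only the proxy's internal assignment changes.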

   The problem with this approach is that the RTP proxy or load
   balancer itself represents a single point of failure that must be
   protected by some means in order to provide a high availability
   service.  Deploying an RTP proxy or load balancer merely moves the
   RTP failover problem from the RTP-terminating network functions to
   an RTP-proxying function.  The fundamental problem remains the
   same: a population of RTP endpoints expects to be able to transmit
   RTP media streams to the IP address and port number that were
   negotiated when the session was set up, and this address must be
   preserved across a failover of the RTP proxy or load balancer in
   order to ensure session continuity.

4.7. Multipath RTP

   Multipath RTP [I-D.ietf-avtcore-mprtp] (MPRTP) is a proposed
   extension to RTP which splits a single RTP stream into multiple
   sub-flows that are transmitted over different network paths.  It is
   primarily intended to leverage pooling of the resource capacity of
   multiple network paths to improve user experience by enabling higher
   bit-rate and higher quality codecs to be used.

   It is possible to imagine using MPRTP to support failover of
   individual RTP streams, by defining two MPRTP sub-flows at session
   establishment time and then sending all media over one of the sub-
   flows. If an RTP-terminating network function involved in such an
   MPRTP session were to fail, media could then be transmitted and
   received via the other sub-flow.
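
   The idea can be sketched as follows.  This is not an MPRTP
   implementation and uses no MPRTP header extensions or signaling:
   the FailoverSender class, its methods, and the addresses are
   hypothetical, and the trigger for switching sub-flows is left
   entirely to the caller.

```python
class FailoverSender:
    """Toy sender that holds two sub-flows but uses one (hypothetical)."""

    def __init__(self, primary, backup):
        # Each sub-flow is modeled as a (destination IP, port) pair that
        # is assumed to have been agreed at session establishment time.
        self.subflows = [primary, backup]
        self.active = 0

    def send(self, packet):
        """Return (destination, packet) for the currently active sub-flow."""
        return self.subflows[self.active], packet

    def fail_over(self):
        """Switch all media to the other sub-flow."""
        self.active = 1 - self.active


sender = FailoverSender(("198.51.100.10", 40000), ("203.0.113.10", 40000))
dest_before, _ = sender.send(b"rtp-packet-1")
sender.fail_over()
dest_after, _ = sender.send(b"rtp-packet-2")
print(dest_before, "->", dest_after)
```

   What the sketch leaves open is how the sender learns that it should
   call fail_over at all.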


   There are a number of concerns about the use of MPRTP to support the
   simple case of failover.  MPRTP is primarily concerned with the
   support of multiple simultaneous sub-flows that must be merged by
   the receiver.  This needs additional RTP header information which
   would require extensive enhancements to the RTP stack in each
   endpoint.  This additional RTP header information would not be
   required for the simple failover case.  Furthermore, MPRTP mandates
   that endpoints keep alive sub-flows on which no media is being sent.
   This would result in the unnecessary consumption of resources in
   RTP-terminating network functions. Finally, MPRTP does not support
   any mechanism for signaling to a transmitting RTP endpoint that it
   should stop sending media on one sub-flow and start sending it on
   another.  Thus any solution for RTP failover based on the use of
   MPRTP would require further protocol extensions to address this
   requirement.

5. Proposed New Approach to RTP Media Failover

   This document has argued that currently available solutions for RTP
   media failover are inadequate because they are inefficient from a
   hardware resources standpoint and not well suited to the evolving
   environment of network functions virtualization.  It has also
   pointed out that many of the challenges faced by RTP media failover
   solutions arise from the need to preserve the destination IP address
   of the RTP-terminating network function across a failover event.

   Existing standards address the need for robust and flexible high
   availability solutions for SIP User Agents by permitting SIP UAs to
   establish multiple flows over which SIP signaling messages can be
   sent and received [RFC5626].

   This document proposes that an analogous scheme be defined for RTP
   endpoints.  The details of such a proposed scheme will be described
   in another Internet-Draft.

6. References

6.1. Normative References

   [RFC4566] Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
             Description Protocol", RFC 4566, July 2006.

   [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G.,
             Johnston, A., Peterson, J., Sparks, R., Handley, M., and
             E. Schooler, "SIP: Session Initiation Protocol", RFC 3261,
             June 2002.

   [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
             Jacobson, "RTP: A Transport Protocol for Real-Time
             Applications", STD 64, RFC 3550, July 2003.

   [RFC3264] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
             with Session Description Protocol (SDP)", RFC 3264, June
             2002.

   [RFC826]  Plummer, D., "Ethernet Address Resolution Protocol: Or
             Converting Network Protocol Addresses to 48.bit Ethernet
             Address for Transmission on Ethernet Hardware", STD 37,
             RFC 826, November 1982.

   [RFC3711] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and
             K. Norrman, "The Secure Real-time Transport Protocol
             (SRTP)", RFC 3711, March 2004.

   [RFC3022] Srisuresh, P. and K. Egevang, "Traditional IP Network
             Address Translator (Traditional NAT)", RFC 3022, January
             2001.

   [RFC7362] Ivov, E., Kaplan, H., and D. Wing, "Latching: Hosted NAT
             Traversal (HNT) for Media in Real-Time Communication", RFC
             7362, September 2014.

   [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
             Border Gateway Protocol 4 (BGP-4)", RFC 4271, January 2006.

   [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

   [RFC1142] Oran, D., Ed., "OSI IS-IS Intra-domain Routing Protocol",
             RFC 1142, February 1990.

   [RFC4786] Abley, J. and K. Lindqvist, "Operation of Anycast
             Services", BCP 126, RFC 4786, December 2006.

   [RFC5626] Jennings, C., Ed., Mahy, R., Ed., and F. Audet, Ed.,
             "Managing Client-Initiated Connections in the Session
             Initiation Protocol (SIP)", RFC 5626, October 2009.

6.2. Informative References

   [I-D.ietf-avtcore-mprtp]
            Singh, V., Ott, J., Karkkainen, T., Ahsan, S., Eggert, L.,
            "Multipath RTP (MPRTP)", draft-ietf-avtcore-mprtp-03 (work
            in progress), July 2016.

7. Change Log

7.1. Changes in draft-taylor-mmusic-rtp-failover-problem-01

   Corrected missing section header "Re-Routing at the IP Layer Using
   BGP"

   Added new section 4.7 on MPRTP

Authors' Addresses

   Martin Taylor
   Metaswitch Networks
   100 Church St
   Enfield EN2 6BQ
   UK

   Email: martin.taylor@metaswitch.com

   Nic Larkin
   Metaswitch Networks
   100 Church St
   Enfield EN2 6BQ
   UK

   Email: nic.larkin@metaswitch.com