                                                        S. Van den Bosch
Internet Draft                                                 M. Buchli
Document: draft-vandenbosch-nsis-resilience-00.txt               Alcatel
Expires: December 2002                                         June 2002


                        NSIS resilience analysis


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   The Next Steps in Signaling (NSIS) working group is chartered to
   develop requirements, architecture and protocols for the next IETF
   steps in signaling Quality-of-Service (QoS). Although the NSIS
   protocol may eventually be used for other applications in addition
   to QoS, it is expected that these, in much the same way as QoS, will
   depend on predictable and high-availability service delivery. This
   document attempts to identify, list and classify potential
   resilience issues for the NSIS protocol.

Table of Contents

   Status of this Memo
   Abstract
   Conventions used in this document
   1. Introduction
   2. Terminology
   3. Assumptions and non-assumptions
   4. Design principles impacting resilience
   5. Failure description
   6. Recovery issues
   7. Reversion issues
   8. Conclusion
   9. Security Considerations
   References
   Author's Addresses
   Full Copyright Statement

   Van den Bosch, et al.      Informational - Expires December 2002

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC-2119 [1].

1. Introduction

   The Next Steps in Signaling (NSIS) working group is chartered to
   develop requirements, architecture and protocols for the next IETF
   steps in signaling Quality-of-Service (QoS). Although the NSIS
   protocol may eventually be used for other applications in addition
   to QoS, it is expected that these, in much the same way as QoS, will
   depend on predictable and high-availability service delivery. This
   particularly mandates an investigation into resilience issues.

   Resilience typically involves two main cycles: a recovery cycle
   following the occurrence of a failure and a reversion cycle
   following a repair. The recovery cycle involves three major steps:
   failure detection (potentially enhanced with failure correlation and
   identification), failure notification and recovery actions. The
   reversion cycle involves repair, fault indication clearing and the
   reversion actions. This document attempts to identify several issues
   related to NSIS signalling following this two-cycle structure.

   The remainder of the document is structured as follows. Section 2
   summarises important terminology.
   Section 3 lists assumptions and non-assumptions made in this
   document, including a rationale. Section 4 describes the NSIS design
   choices and principles that will impact protocol resilience.
   Section 5 gives an overview of potential failure conditions.
   Section 6 describes potential issues for the recovery cycle.
   Section 7 describes potential issues for the reversion cycle.
   Section 8 formulates some conclusions.

2. Terminology

   Rerouting: A recovery mechanism in which the recovery path or path
   segments are created dynamically after the detection of a fault on
   the working path. In other words, a recovery mechanism in which the
   recovery path is not pre-established.

   Protection Switching: A recovery mechanism in which the recovery
   path or path segments are created prior to the detection of a fault
   on the working path. In other words, a recovery mechanism in which
   the recovery path is pre-established.

   Control plane: Aggregate of network functionalities including
   entities such as routing protocols, admission control, and
   signaling.

   Data plane: Aggregate of network functionalities where per-packet
   activities such as packet forwarding, queuing, conditioning and
   header editing occur (per-flow packet conditioning may require
   interaction with the control plane).

   QoS domain: Subnetwork under a single administrative control and/or
   using a single QoS technology.

   QoS Initiator (QI): NSIS entity responsible for generating the QSCs
   for traffic flow(s) based on user or application requirements and
   signaling them to the network as well as invoking local QoS
   provisioning mechanisms. This can be located in the end system, but
   may reside elsewhere in the network.

   QoS Controller (QC): NSIS entity responsible for interpreting the
   signaling carrying the user QoS parameters, optionally
   inserting/modifying the parameters according to local network QoS
   management policy, and invoking local QoS provisioning mechanisms.
   Note that the QoS Controller might have very different functionality
   depending on where in the network and in what environment it is
   implemented.

   QoS Receiver (QR): NSIS entity responsible for terminating the QSCs
   for traffic flow(s) based on user or application requirements and
   responding to them towards the network as well as invoking local QoS
   provisioning mechanisms. This can be located in the end system, but
   may reside elsewhere in the network.

   QoS Service Classes (QSC): Specification of the QoS requirements of
   a traffic flow or aggregate. Can be further sub-divided into user
   specific and network related parameters.

3. Assumptions and non-assumptions

   The discussion about several design choices for the NSIS protocol is
   still ongoing. The resilience discussion in this document, however,
   is intended to cover any useful combination of design choices made
   for the eventual protocol or protocols. Therefore, care has been
   taken not to presuppose any solution regarding the following
   subjects:

   - routing of signaling
   - reservation state

   Instead, these issues are described in section 4 and their impact on
   protocol resilience is investigated where appropriate.

   However, in order to carry out this analysis, the following minimal
   assumptions were needed:

   - terminology regarding the entities involved in the NSIS
     signalling. We have adopted the terminology used in the
     requirements and framework draft. Note that this model does not
     imply any choice with respect to routing of signalling.
   - indication of the type of messages that are used in the signalling
     protocol.
     From [2] we assume that at least the following messages are
     present:
     o NEW: Request for the set up of a new reservation
     o REFRESH: Refresh of an existing reservation (for soft-state)
     o TEAR: (Explicit) teardown of an existing reservation
     o ACK: Acknowledgement of an NSIS message
     o NACK: Negative acknowledgement of an NSIS message

4. Design principles impacting resilience

4.1. In-band versus out-of-band signalling

   In case of in-band signaling, the data path is equal to the
   signaling path. In this case, two options can be used for routing
   the signaling messages between two NSIS entities:

   - End-to-end routing (like RSVP) requires QI, QC and QR to be on the
     data path (e.g. in the end host or router). The source address is
     that of the QI, the destination address that of the QR. The
     routing of the signalling messages is the same as for data
     packets.
   - Hop-by-hop routing can be used to explicitly send signaling
     messages to the next NSIS hop (which may be one or more IP hops
     away). This can be done by means of direct addressing, in which
     case the destination address for the signalling message is the
     next NSIS hop. This means that the source and destination address
     of the data flow need to be present in the payload of the
     signaling message. Another alternative is the use of (IPsec)
     tunnels between adjacent NSIS entities. In this case, the QC needs
     to determine the next NSIS hop, which may imply a non-standard L3
     routing decision for routing of signaling.

   In-band signaling implies fate sharing between control and data
   plane.

   With out-of-band signaling, the data and signalling paths may be
   different. This means that the QC needs to ensure route alignment
   between data plane and control plane. In this case, only hop-by-hop
   routing of the signaling messages is possible. It also requires
   determination of the NSIS next hop, which may be more than one
   domain away.

   In case out-of-band signaling is used there is no intrinsic fate
   sharing between control and data plane.
   There may be a failure in the data plane while the control plane is
   working properly, or the other way around. However, a failure in the
   data plane may cause the control plane to fail as well in certain
   cases (the signaling and data packets use the same network).

4.2. Reservation state

   The type of reservation state that is kept in QI, QC and QR will
   have critical impact on the resilience of the NSIS protocol.
   Essentially, two decisions regarding the reservation state have to
   be made. The protocol can keep per-flow or per-class state and
   hard-state or soft-state.

   Hard state requires that reservations will be made at the time of
   call setup and will remain in place until they are explicitly
   released. This requires route pinning for the data flow in order to
   ensure that the reserved resources match with the data plane path.
   Keeping hard-state strongly reduces the amount of signalling traffic
   in the network but involves the risk of leaving stale state in the
   network after failures.

   Soft state uses periodic refreshes to update the state of existing
   reservations in the network. This means that old state is removed
   automatically when a reservation is not refreshed within a certain
   time interval and that standard L3 routing can be used for the
   reservation setup.

   Per-flow reservation state means that state for each individual
   accepted reservation is maintained.

   Per-class state means that the admission control entity only
   maintains the sum of the reservations of all flows with identical
   QoS requirements for each of the links or trunks in the network. It
   is the minimum amount of state needed for admission control.

   It is envisaged that the QI and QR will always keep per-flow state.
   For scalability reasons, it may be more appropriate for the QC to
   keep only per-class state. If no per-flow state is kept, the QC
   cannot relate network events to affected flows.
   It is therefore dependent on the frequency of the signaling messages
   originated by the QI/QR to provide feedback.

4.3. NSIS hierarchy

   In principle, the NSIS protocol or protocols can be used in a nested
   hierarchy. A new NSIS signalling session could be triggered by a
   higher layer of NSIS signalling. This situation is depicted in the
   figure below, where (a) indicates global NSIS operation and (b)
   refers to local NSIS operation.

             +--+ (a) +--+       (a)       +--+ (a) +---+
          +->|QI|---->|QC|---------------->|QC|---->| QR|-+
          |  +--+    /+--+\               /+--+\    +---+ |
          |         /      \             /      \         v
   +--+(b)+--+ +--+(b)+--+(b)+--+   +--+(b)+--+(b)+--+ +--+(b) +--+
   |QI|-->|QR| |QI|-->|QC|-->|QR|   |QI|-->|QC|-->|QR| |QI|----|QR|
   +--+   +--+ +--+   +--+   +--+   +--+   +--+   +--+ +--+    +--+

   When a failure happens in such a situation, care must be taken to
   avoid a race condition between both protection actions. It is
   generally preferable to allow the lower layer to apply protection
   first. This means that hold-off timers need to be foreseen in NSIS
   indicating the time that needs to elapse before a protection action
   may be taken.

4.4. Reliable transport of NSIS messages

   It is a requirement for NSIS signalling to be reliable. Reliability
   can be achieved by providing reliable transport of NSIS signalling
   messages and/or by including some acknowledgement and retransmission
   capabilities within the NSIS protocol itself.

4.5. Peer discovery or backup entities

   Peer discovery of NSIS entities is seen as a preliminary phase prior
   to NSIS protocol operation. The requirements for the peer discovery
   protocol, however, will differ significantly when backup NSIS
   entities are foreseen compared to when new peering relations are set
   up online after a failure condition.

5. Failure description

5.1. Internal data plane failures

   Data plane failures can result from the breakdown or
   misconfiguration of data path entities such as routers or links.
   In case of a router, the failure can be due to a power outage, a
   software crash or a hardware failure, e.g. of interface cards. Link
   failures can be due to defective interface cards or to lower layer
   failures such as cable cuts.

   We make a distinction between data plane failures that are internal
   to a QoS domain and failures at the edges of a QoS domain. (In case
   of a QoS domain consisting of a single router, an internal failure
   can occur when a component link in a link bundle attached to the
   router fails.) Internal data plane failures should be solved locally
   and should not involve NSIS signalling when the local recovery is
   successful.

5.2. External data plane failures

   Edge failures can for instance occur when an Autonomous System
   Border Router (ASBR) or the link between two ASBRs goes down.
   Recovery from this type of failure requires a coordinated action
   between two or more QoS domains and will therefore impact NSIS
   signalling.

5.3. Control plane failures

   Two main types of control plane failures are distinguished:

   - Failure of an NSIS entity, e.g. a QI, QC or QR failure. If the QI
     is located at the end host, the end host is also considered to be
     part of the control plane of the NSIS signalling.
   - Loss of signalling messages between NSIS entities

5.4. QoS degradation

   A failure does not necessarily have to equal a loss of connectivity.
   In a QoS-enabled environment, a degradation of the perceived
   performance on the data path can also be categorised as a failure.
   It may be important to distinguish these failures from a 'hard'
   failure (involving loss of connectivity) in order to support
   hierarchical protection.

6. Recovery issues

6.1. Failure detection

   In the data plane, node failures can be detected from the absence of
   hello messages from the Interior Gateway Protocol (IGP). Link
   failure indications can usually be derived from lower-layer failure
   indications (loss of light, loss of signal).
   In case of logical links, failure detection can be more problematic
   and must either take place via the hello protocol or via a specific
   OAM procedure.

   In the control plane, NSIS failure detection involves the following
   issues:

   - When the QI is different from the end host, QI and end host need
     to be aware of each other's failures. Two factors can complicate
     this problem. First, the QI may keep hard-state related to the
     end-host requests because the overhead of refreshes is deemed
     excessive. In this case, state refreshes cannot be used as a
     failure detection mechanism. Second, the end host is not
     necessarily an NSIS speaker, which means that the detection
     mechanism must be supported outside of NSIS signalling. One
     potential solution is the use of higher- and/or lower-layer
     triggers in the QI. This means that the QI should have
     higher/lower layer information available.
   - When an NSIS entity is protected by a backup entity, a
     synchronisation failure between the two entities needs to be
     detected prior to the protection switch. Since both entities will
     be NSIS speakers, this functionality could be implemented in the
     NSIS protocol.
   - In case of hop-by-hop routing of the NSIS signalling messages, a
     failure detection mechanism is required in NSIS, even when the
     communication between NSIS entities is done with a reliable
     transport mechanism. This is caused by the fact that the transport
     signalling is terminated at each NSIS hop. This means that
     messages can be lost in an NSIS entity without the transport layer
     noticing it.

6.2. Failure notification

   Any failure notification needed in the data plane, e.g. from lower
   layers, should normally be transparent to NSIS. An issue can arise
   for external failures.

   - If a failure of an external link or node is detected, a route
     change might be required in order to perform the recovery action.
     In that case, the failure should be notified to the QI if only
     per-class state is kept in the QC.
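
   The dependency between the kind of reservation state held in the QC
   and the need to notify the QI can be sketched as follows. This is
   only an illustration under stated assumptions, not part of any NSIS
   specification: the names QoSController, StateKind and
   handle_external_failure, the returned action labels and the link and
   flow identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class StateKind(Enum):
    PER_FLOW = "per-flow"
    PER_CLASS = "per-class"


@dataclass
class QoSController:
    """Hypothetical QC model; not an NSIS-defined data structure."""
    state_kind: StateKind
    # Per-flow state only: failed element (link or node id) -> flow ids
    # whose reservations cross that element.
    flows_by_element: dict = field(default_factory=dict)

    def handle_external_failure(self, element: str) -> dict:
        """Decide what the QC can do when an external link or node fails."""
        if self.state_kind is StateKind.PER_FLOW:
            # The QC can relate the network event to individual flows.
            affected = sorted(self.flows_by_element.get(element, []))
            return {"action": "reroute_affected_flows", "flows": affected}
        # With per-class state only an aggregate is known, so the affected
        # flows cannot be identified and the QI has to be notified.
        return {"action": "notify_qi", "flows": None}


per_flow_qc = QoSController(StateKind.PER_FLOW,
                            {"asbr-link-1": {"flow-b", "flow-a"}})
per_class_qc = QoSController(StateKind.PER_CLASS)
print(per_flow_qc.handle_external_failure("asbr-link-1"))
print(per_class_qc.handle_external_failure("asbr-link-1"))
```

   In this sketch the per-class QC never sees flow identifiers at all,
   which is why the only action left to it is a notification towards
   the QI.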

   For control plane failures, an issue might arise when only per-class
   state is kept.

   - When only per-class state is kept, which might for instance occur
     in a QC in order to reduce information storage, the QC cannot
     reroute the LSP without a message from the QI. This means that
     this situation is only feasible in combination with soft-state. In
     that case, the QC needs to notify the QI about the failure in
     response to the next refresh message, which therefore needs to
     contain an identification of the QI.
   - The failing reservation may be part of a service consisting of
     multiple such flows. An obvious example is a bi-directional
     service. It would seem appropriate for each flow to be protected
     separately and for the service not to be impacted when the
     protection action is successful. However, it would make little
     sense to keep one direction up when the other goes down. This
     would imply notification of the QI/QR of the flows making up the
     service in case of unsuccessful protection.

6.3. Recovery action

   The definition of a recovery strategy typically involves a decision
   on:

   - the scope of the protection action: this can be end-to-end path
     protection or local segment protection. We have already seen that
     the use of per-class state in the QC will prevent the application
     of local segment protection for external failures.
   - the timing of the protection action: protection paths can be
     pre-computed and pre-established, in which case only a switch from
     primary to protection path is needed when a failure occurs. If the
     computation and/or establishment of the protection path is started
     after failure detection (rerouting), an additional recovery delay
     is introduced.
   - resource allocation on the protection paths: resource reservation
     on protection paths prior to failure is very inefficient and leads
     to excessive usage of network capacity for protection purposes.
     This can be avoided by not reserving capacity on the protection
     path. In this case, however, the availability of the protection
     path cannot be guaranteed when a failure occurs. Alternatively,
     protection capacity can be shared between protection paths that
     are unlikely to fail simultaneously. For this, a limited number of
     failure scenarios (e.g. single node or single link failures) is
     assumed. Sharing protection capacity is highly complicated by the
     interdomain scope of the NSIS signalling because of the lack of
     detailed information on the failure states and the primary path.

   Recovery on the control plane may involve any of the following
   issues:

   - If a failure can only be solved by means of a change of the
     interdomain path of the flow, the QC might be required to force a
     route change from the QI. Note that this can occur without any
     data plane failure when the QC of a QoS domain fails and the QCs
     of neighbouring domains want to bring up peering relations with
     QCs that are not related to the next hop on the data path.
   - When only per-class state is kept in a QC and a setup message is
     lost, retransmission of the setup message may cause duplicate
     reservations in QoS domains upstream of the point where the loss
     occurred. In case of soft-state, this situation will last for
     several refresh intervals. Note that the impact of this type of
     failure can be restricted by the use of reliable transport between
     NSIS entities.
   - In case of a QI or QR failure, the QC immediately following or
     preceding the failing entity may need to take up some proxy
     functionality for the failing entity. This functionality is
     optional for sending out teardown messages when soft-state is
     used, but mandatory for hard-state. It is also mandatory for
     replying to teardown messages, e.g. originating from the QI.
   - If the QoS Initiator does not receive a response to a signaling
     message within a certain time interval, it will consider the
     message lost. It may decide to retransmit the signaling message.
     Therefore, the QoS Initiator must be able to make sure that a
     received ACK is not a delayed ACK for the previously transmitted
     signaling message. The signaling messages should therefore include
     a sequence number in order to associate a request (NEW, TEAR,
     REFRESH) with a response (ACK, NACK).

7. Reversion issues

7.1. Repair detection

   No issues are identified for the data plane.

   For the control plane, repair detection will most likely occur by
   means of discovery messages initiated by the repaired NSIS entity.
   It is proposed that this discovery process could be a preliminary
   phase prior to the NSIS protocol and can therefore remain out of its
   scope. It is nevertheless crucial that such a discovery protocol is
   standardised in order to allow NSIS entities from different vendors
   to discover each other.

7.2. Repair notification

   Repair notification will normally take place by means of the IGP in
   the data plane.

   In the control plane, repair notification may be needed when a QC
   has taken up some proxy functionality for the QI or QC. However, no
   special action should be needed for this since this information can
   be inferred from the repair detection.

7.3. Reversion action

   On the data plane, reversion is unavoidable when in-band signaling
   is used because the signaling messages will follow the new data
   plane route. This might cause issues because resources are typically
   not reserved on the new route, which might cause (temporary) QoS
   degradation on the new route.

   In case of out-of-band signaling, reversion may be avoided by means
   of route pinning. However, it will usually be advantageous to move
   to a more favourable route when one becomes available, both from a
   performance and resource utilisation point of view.
   In that case, the separation of data plane and control plane might
   help to delay the route change until sufficient resources are
   reserved on the new path.

   From a control plane perspective, reversion makes little sense when
   backup NSIS entities are used. In that case, the reversion action
   will be limited to a synchronisation with the currently active NSIS
   entity. The repaired entity will then become the backup.

8. Conclusion

   This document identified, listed and classified potential resilience
   issues for the NSIS protocol.

9. Security Considerations

   TBC

References

   1  Bradner, S., "Key words for use in RFCs to Indicate Requirement
      Levels", BCP 14, RFC 2119, March 1997.

   2  Braden, B. and B. Lindell, "A Two-Level Architecture for Internet
      Signaling", work in progress,
      draft-braden-2level-signal-arch-00.txt.

Author's Addresses

   Sven Van den Bosch
   Alcatel
   Francis Wellesplein 1         Phone: 32-3-240-8103
   B-2018 Antwerpen              Email: sven.van_den_bosch@alcatel.be
   Belgium

   Maarten Buchli
   Alcatel
   Francis Wellesplein 1         Phone: 32-3-240-7081
   B-2018 Antwerpen              Email: maarten.buchli@alcatel.be
   Belgium

Full Copyright Statement

   Copyright (C) The Internet Society (date). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.