CCAMP Working Group                                    R. Rabbat (Ed.)
Internet Draft                                 Fujitsu Labs of America
Expires: July 2004                                Toshio Soumiya (Ed.)
                                              Fujitsu Laboratories Ltd
                                                          January 2004


       Optical Transport Network Failure Recovery Requirements
              draft-rabbat-optical-recovery-reqs-01.txt


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026 [1].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
        http://www.ietf.org/ietf/1id-abstracts.txt
   The list of Internet-Draft Shadow Directories can be accessed at
        http://www.ietf.org/shadow.html.

Abstract

   This document focuses on requirements for control-plane based
   recovery from data-plane failures in optical transport networks
   that use an IP-based (GMPLS) control plane.  It aims to gather and
   systematically lay out the requirements so that they can serve as a
   coherent basis for work on solution and protocol enhancements and
   developments.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC-2119 [2].


Rabbat & Soumiya (Eds.)     Expires - July 2004               [Page 1]

draft-rabbat-optical-recovery-reqs-01.txt                 January 2004

Table of Contents

   1. Introduction
   2. Glossary of Terms Used
   3. Failure Recovery Requirements
      3.1 Overview of Recovery Requirements
      3.2 Shared Mesh-based Recovery
      3.3 Failure Notification Mechanisms
      3.4 Optical Network Failure Recovery Requirements
   4. Security Considerations
   5. Conclusions
   6. Intellectual Property Considerations
   7. References
   8. Acknowledgments
   9. Editors' Address
   10. Authors' Addresses
   Full Copyright Statement

1. Introduction

   This document describes requirements for control plane-based
   recovery from data-plane failures in optical networks.  We focus on
   optical networks that use a Generalized Multi-Protocol Label
   Switching (GMPLS)-based [3] control plane and various data plane
   technologies.

   Service recovery from failures, using either a protection or
   restoration scheme, is an important feature of these transport
   networks to ensure high-reliability and uninterrupted service.
   Protection and restoration algorithms may be used either for local
   repair (around failed spans or nodes) or edge-to-edge recovery of
   an LSP.  Shared mesh-based recovery is desirable to reduce spare
   capacity requirements and enable flexible service recovery
   scenarios.  While edge-to-edge based recovery has the potential to
   be more resource-efficient than link-based protection, it also
   entails the (potentially lengthy) delay incurred in notifying all
   nodes along the recovery path of the failure of a remote resource
   on the working path.
   For many applications, recovery paths must be chosen carefully to
   meet strict recovery time requirements (e.g., in the range of a few
   tens to a few hundred ms).

   Several documents within the CCAMP WG currently relate to recovery
   in GMPLS networks.  They cover terminology and functional
   specifications [4, 5] and analysis [6] for recovery in GMPLS-based
   networks, and survivability requirements and considerations for
   traffic engineered or hierarchical networks [7].  As a set, these
   documents provide detailed discussions of the concepts and
   mechanisms used in network recovery.  The requirements for control
   plane-based recovery in transport networks have not, however, been
   specifically detailed in any one document.  This is the objective
   of the current document.

2. Glossary of Terms Used

   The following acronyms are used in this document:

   o  LMP: Link Management Protocol [8]
   o  LSP: Label Switched Path
   o  OADM: Optical Add/Drop Multiplexer
   o  OXC: Optical Cross-Connect
   o  RSVP-TE: Resource Reservation Protocol-Traffic Engineering [9]

   The terminology for GMPLS-based recovery is documented in [4].
   These terms are borrowed from the generic protection switching
   document at the ITU-T [10].  We use the following terms from that
   document:

   o  Detecting Entity (Failure Detection).
   o  Reporting Entity (Failure Correlation and Notification).
   o  Deciding Entity (part of the failure recovery decision process).
   o  Recovery Entity (part of the failure recovery activation
      process).
   o  Bridge, which could be Permanent Bridge, Broadcast Bridge, or
      Selector Bridge.
   o  Selector, which could be a "Selective selector" or a "Merging
      Selector".
   o  Recovery phases:
      1. Failure Detection,
      2. Failure Localization and Isolation,
      3. Failure Notification,
      4. Recovery (Protection or Restoration),
      5. Reversion (Normalization)

3. Failure Recovery Requirements

   Even though some requirements for fault recovery have been
   discussed in the CCAMP, MPLS, and TE WGs, several additional
   aspects need to be examined in the context of recovery in optical
   networks.  In this section, we describe the fault recovery
   requirements that we have collected based on discussions with
   several carriers.

3.1 Overview of Recovery Requirements

   This subsection summarizes the survivability requirements for
   optical networks.  Greater detail on the requirements is provided
   in the subsequent subsections.

   The following classes (types) of recovery are required for span,
   LSP segment, and LSP recovery:

   o  Protection
      - pre-computed route and pre-established (i.e., cross-connected)
        resources
   o  Restoration
      - pre-computed route and on-demand establishment of resources,
        or
      - on-demand route and on-demand establishment of resources

   A recovery scheme uses either protection or restoration (or both),
   together with failure detection and notification mechanisms.

   Depending on the service specification, the timing bounds for
   recovery may range from 50 ms (e.g., to repair services carrying
   PSTN voice) to less strict bounds of, say, several hundred ms (for
   low priority data).

   For multi-layered networks, hold-off timers are required to allow
   recovery at lower layers to proceed before higher layers take
   action (if needed).  Of course, escalation to higher layers should
   be possible when necessary.  Support for horizontal hierarchy must
   also be included, because large networks are usually segmented [7].

   In general, recovery schemes must operate in a stable and
   cooperative manner to maximize the network's reliability and
   availability.  Recovery schemes should also be resource efficient
   and flexible with respect to types of failures, service classes,
   and the network operator policies that they can support.
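
   The interaction between hold-off timers and escalation described
   above can be sketched as follows.  This is only an illustrative
   model under stated assumptions, not part of the requirements: the
   Layer type, the escalate function, and every timer value in the
   usage lines are hypothetical.  The model assumes that a layer
   starts its own recovery only if the fault is still present when
   its hold-off timer expires.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Layer:
    """Hypothetical model of one network layer's recovery behaviour."""
    name: str
    hold_off_ms: int            # delay after fault detection before the layer may act
    recovery_ms: Optional[int]  # time to recover once started; None = cannot recover


def escalate(layers: List[Layer]) -> Tuple[List[str], Optional[int]]:
    """Return (names of layers that took action, time service is restored).

    A layer starts recovery only if the fault is still present when its
    hold-off timer expires, so layers with shorter timers get the first
    chance and layers with longer timers act only when needed.
    """
    acted: List[str] = []
    restored_at: Optional[int] = None
    for layer in sorted(layers, key=lambda l: l.hold_off_ms):
        if restored_at is not None and restored_at <= layer.hold_off_ms:
            break  # fault already cleared; all later timers see it cleared too
        acted.append(layer.name)
        if layer.recovery_ms is not None:
            done = layer.hold_off_ms + layer.recovery_ms
            restored_at = done if restored_at is None else min(restored_at, done)
    return acted, restored_at


if __name__ == "__main__":
    # Assumed values: the optical layer recovers in 40 ms, so the packet
    # layer's 100 ms hold-off timer expires with nothing left to do.
    optical = Layer("optical", hold_off_ms=0, recovery_ms=40)
    packet = Layer("packet", hold_off_ms=100, recovery_ms=200)
    print(escalate([optical, packet]))   # (['optical'], 40)

    # If the optical layer cannot recover, recovery escalates to the
    # packet layer once its hold-off timer expires.
    broken = Layer("optical", hold_off_ms=0, recovery_ms=None)
    print(escalate([broken, packet]))    # (['optical', 'packet'], 300)
```

   In this model, a hold-off value shorter than the lower layer's
   recovery time causes both layers to act on the same fault, which is
   the situation the hold-off timers are meant to avoid.
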

   As has been identified in [4], a critical component in guaranteeing
   the time constraints for service recovery is the Failure
   Notification phase.

3.2 Shared Mesh-based Recovery

   Today's Synchronous Optical Network / Synchronous Digital Hierarchy
   (SONET/SDH) networks use recovery techniques based on linear and
   ring topologies.  Linear protection may include 1+1 and 1:N
   protection, while ring protection usually involves uni-directional
   path switched ring (UPSR) and bi-directional line switched ring
   (BLSR) protection.  Linear 1+1 protection and ring-based protection
   both require 100% redundancy in spare resources for every working
   path.  Even with 1:N based link protection, it may be difficult to
   select different routes flexibly.  Therefore, shared mesh-based
   recovery has emerged as a flexible and efficient option for optical
   network recovery.

   Shared mesh recovery allows for the possibility of sharing recovery
   capacity among multiple working paths.  This increases flexibility,
   by allowing for more options when routing both the working and the
   recovery paths.  Furthermore, this flexibility allows faster
   recovery because the shared mesh provides for a greater number of
   suitable/feasible intermediate nodes for routing the recovery
   paths.  However, it also means that failure notification and
   reconfiguration may have to be performed at multiple nodes along
   the protection path, as illustrated by the following simple
   example.

                         +---+
                    .....| E |..............
                    :    +---+             :
                    :                      :
         +---+    +---+   \ /   +---+    +---+
      ===| A |====| B |====X====| C |====| D |===
         +---+    +---+   / \   +---+    +---+
           :                               :
           :      +---+         +---+      :
           :......| F |.........| G |......:
                  +---+         +---+

   Figure 1. Multiple (partial) recovery paths protecting against the
   failure of link BC.

   Figure 1 illustrates how, for shared mesh recovery, different
   network nodes may need to be informed of a network fault/failure.
   Suppose a failure occurs on link BC.  Here, the working LSPs follow
   the route ABCD, and the recovery paths have been reserved along the
   two dotted routes.  The nodes along the recovery paths have not
   been activated, however.  Recovery paths BED and AFGD are each
   responsible for recovering a portion of the working capacity on
   link BC.  In this case, nodes A, B, D, E, F, and G must all receive
   a notification of the failure and perform reconfiguration actions
   before the backup paths can carry traffic from the working path.

3.3 Failure Notification Mechanisms

   To effect recovery in a timely fashion, both the failure
   correlation/aggregation time (that is, the time spent on the
   computations performed at the reporting entity) and failure
   notification time (the time that elapses prior to all entities
   involved in the recovery receiving a failure notification signal)
   must be minimized.  The failure correlation time is required
   regardless of the restoration scheme used.

   Since shared-mesh restoration potentially requires the
   reconfiguration of nodes along the protection path, merely using
   data plane notification techniques to notify the end points of an
   LSP of a failure is not sufficient to effect recovery.  Rather,
   there needs to be a means for the control plane to inform nodes on
   the backup path of a failure/fault in the network (which can be
   viewed as control-plane based failure notification).

   There are, in general, two alternatives for control-plane based
   failure notification:

   o  Failure notification messages dispatched using GMPLS signaling
   o  Controlled flooding of failure notification messages

   The GMPLS signaling protocol, RSVP-TE [9], supports notification
   using a Notify message.  Under this scheme, the deciding entity
   pre-arranges to receive the notifications by sending a Notify
   Request object in the Path or Resv messages.
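
   As a rough illustration of the Figure 1 scenario and of the
   notification-time definition given earlier in this subsection, the
   sketch below derives the set of nodes that must be notified and
   counts the control-plane hops needed to reach the farthest of them
   when notifications spread hop by hop.  It assumes, purely for
   illustration, that the control-plane topology mirrors the links
   drawn in Figure 1, that the control channel on link BC fails
   together with the link, and that node B is the reporting entity.
   The names LINKS, nodes_to_notify, and hops_from are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

# Links drawn in Figure 1 (assumed here to carry control channels too).
LINKS: List[Tuple[str, str]] = [
    ("A", "B"), ("B", "C"), ("C", "D"),   # working route ABCD
    ("B", "E"), ("E", "D"),               # recovery path BED
    ("A", "F"), ("F", "G"), ("G", "D"),   # recovery path AFGD
]


def nodes_to_notify(recovery_paths: Iterable[str]) -> Set[str]:
    """Every node on any recovery path must reconfigure, hence be notified."""
    return {node for path in recovery_paths for node in path}


def hops_from(source: str, failed: Tuple[str, str]) -> Dict[str, int]:
    """Breadth-first hop counts from `source`, skipping the failed link."""
    adjacency: Dict[str, Set[str]] = {}
    for a, b in LINKS:
        if {a, b} == set(failed):
            continue  # the failed link carries no notifications
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops


targets = nodes_to_notify(["BED", "AFGD"])
hops = hops_from("B", failed=("B", "C"))
print(sorted(targets))                  # ['A', 'B', 'D', 'E', 'F', 'G']
print(max(hops[n] for n in targets))    # 3 (node G is three hops from B)
```

   Under these assumptions, the failure notification time is governed
   by the node that is reached last, here node G, rather than by the
   end points of the working LSP alone.
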
   The recovery process therefore requires two or three phases.  The
   reporting entity first sends notification of the failure to the
   deciding entity.  The deciding entity then begins a one- or
   two-phase signaling process on the recovery LSP (which requires
   either signaling down the recovery LSP or signaling down and back).

   The controlled flooding of failure notification messages in the
   control plane is another alternative for failure notification.
   Flooding supports recovery schemes that require reconfiguration, or
   policy or priority-based decisions to be made at multiple decision
   entities distributed within the network, off the working path.

3.4 Optical Network Failure Recovery Requirements

   o  Requirements on the efficiency of bandwidth use

      1. A recovery scheme SHOULD allow efficient use of working LSP
         bandwidth using such measures as route optimization, taking
         into account route dependencies between a working path and
         its recovery path.

      2. A recovery scheme SHOULD allow efficient use of recovery LSP
         bandwidth using such measures as route optimization, taking
         into account route dependencies between a working path and
         its recovery path.

      3. A recovery scheme SHOULD, when possible, allow sharing of
         recovery bandwidth among multiple recovery paths to enable
         efficient use of recovery bandwidth.

   o  Requirements on recovery actions

      4. A recovery scheme SHOULD allow suppression of fault
         notification messages, so that spurious fault notification
         messages and recovery action messages are not transmitted
         within the network, ensuring scalability of the fault
         recovery mechanism.

      5. A recovery scheme SHOULD ensure reliable transmission of
         fault notification messages, provided the control plane is
         connected.

      6. A recovery scheme SHOULD allow the network operator to choose
         whether or not reversion actions are to be performed.

      7. A recovery scheme SHOULD allow testing and verification of
         the availability of the recovery path before its actual use.
         This testing may occur when the recovery path is provisioned,
         or after it is provisioned but before actual recovery action
         occurs.

      8. A recovery scheme SHOULD make sure that recovery actions
         correctly move traffic from failed paths to their respective
         recovery paths, such that the recovery actions do not result
         in long-term misconnections.

   o  Requirements on recovery schemes

      9. A recovery scheme SHOULD provide mechanisms that can be used
         to support generally used recovery schemes such as 1+1, 1:1,
         1:N, M:N, and unprotected.

      10. A recovery scheme SHOULD support priority-based recovery of
          failed LSPs.  This means that recovery should be ordered
          according to each LSP's recovery priority.

   o  Requirements on recovery priority of service classes

      11. A recovery scheme SHOULD take into consideration the
          recovery priority of LSPs.

      12. A recovery scheme SHOULD allow support of service classes
          with different recovery time guarantees.

   o  Requirements on recovery granularity

      13. A recovery scheme SHOULD allow recovery of traffic on an
          aggregated basis, for scalability.

   o  Requirements on fault notification

      14. A recovery scheme SHOULD have a failure notification
          mechanism that guarantees prompt and reliable delivery of
          notification of data plane faults to a deciding entity in
          charge of recovering from the fault.

      15. A recovery scheme SHOULD support recovery within bounded
          time constraints and MAY be compliant with generally used
          recovery times like 50 ms for SONET/SDH protection.

   o  Requirements on graceful degradation and network stability

      16. A recovery scheme SHOULD allow for graceful degradation of
          performance in the presence of a fault class that was not
          anticipated.

      17. A recovery scheme SHOULD allow fallback operations of its
          recovery actions.
          For example, when the system encounters a fault class that
          was not anticipated, the system should execute a best-effort
          recovery, such that as many working paths as possible are
          restored under the circumstances.

      18. A recovery scheme SHOULD NOT compromise the stability of the
          network when the network encounters a fault class that was
          not anticipated (such as multiple, independent, simultaneous
          failures).

4. Security Considerations

   This draft does not introduce any new security issues.

5. Conclusions

   This draft described requirements for control plane-based recovery
   from data plane failures in optical transport networks.  We
   identified some important requirements for enabling flexible
   recovery schemes, facilitating the efficient use of resources, and
   meeting the potentially strict recovery times in such networks.

6. Intellectual Property Considerations

   This section is taken from Section 10.4 of RFC2026 [1].

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights.  Information on
   the IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11.  Copies of
   claims of rights made available for publication and any assurances
   of licenses to be made available, or the result of an attempt made
   to obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification
   can be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights, which may cover technology that may be required to practice
   this standard.  Please address the information to the IETF
   Executive Director.

7. References

   [1]  Bradner, S., "The Internet Standards Process -- Revision 3",
        BCP 9, RFC 2026, October 1996.

   [2]  Bradner, S., "Key words for use in RFCs to Indicate
        Requirement Levels", BCP 14, RFC 2119, March 1997.

   [3]  Mannie, E. (Ed.), "Generalized Multi-Protocol Label Switching
        (GMPLS) Architecture", Internet Draft, work in progress,
        draft-ietf-ccamp-gmpls-architecture-07.txt, May 2003.

   [4]  Mannie, E. and D. Papadimitriou (Eds.), "Recovery (Protection
        and Restoration) Terminology for GMPLS", Internet Draft, work
        in progress,
        draft-ietf-ccamp-gmpls-recovery-terminology-02.txt, May 2003.

   [5]  Lang, J.P. and B. Rajagopalan (Eds.), "Generalized MPLS
        Recovery Functional Specification", Internet Draft, work in
        progress, draft-ietf-ccamp-gmpls-recovery-functional-01.txt,
        September 2003.

   [6]  Papadimitriou, D. and E. Mannie (Eds.), "Analysis of
        Generalized MPLS-based Recovery Mechanisms (including
        Protection and Restoration)", Internet Draft, work in
        progress, draft-ietf-ccamp-gmpls-recovery-analysis-02.txt,
        September 2003.

   [7]  Lai, W.S., and D. McDysan (Eds.), "Network Hierarchy and
        Multilayer Survivability", RFC 3386, November 2002.

   [8]  Lang, J. (Ed.), "Link Management Protocol (LMP)", Internet
        Draft, draft-ietf-ccamp-lmp-10.txt, October 2003.

   [9]  Berger, L. (Ed.), "Generalized MPLS Signaling - RSVP-TE
        Extensions", RFC 3473, January 2003.

   [10] "Generic Protection Switching: Linear Trail and Sub-Network
        Protection", ITU-T Recommendation G.808.1, November 2003.

8. Acknowledgments

   The authors would like to thank Peter Czezowski and Takafumi Chujo
   of Fujitsu Labs of America, Inc., Norihiko Shinomiya and Akira
   Chugo of Fujitsu Laboratories, Ltd for various inputs, Jonathan
   Lang for valuable review and feedback, and Adrian Farrell for his
   feedback.

9. Editors' Address

   Richard Rabbat
   Fujitsu Labs of America, Inc.
   1240 E. Arques Ave., MS 345
   Sunnyvale, CA 94085
   United States of America
   Phone: +1-408-530-4537
   Email: rabbat@alum.mit.edu

   Toshio Soumiya
   Fujitsu Laboratories Ltd.
   1-1, Kamikodanaka 4-Chome
   Nakahara-ku, Kawasaki
   211-8588, Japan
   Phone: +81-44-754-2765
   Email: soumiya.toshio@jp.fujitsu.com

10. Authors' Addresses

   Kohei Shiomoto
   NTT Network Innovation Laboratories
   Midori-machi 3-9-11, Musashino-shi
   Tokyo, Japan 180-8585
   Phone: +81-422-59-4402
   Email: Shiomoto.Kohei@lab.ntt.co.jp

   Shoichiro Seno
   Mitsubishi Electric Corporation
   5-1-1 Ofuna, Kamakura
   Kanagawa, Japan 247-8501
   Phone: +81-467-41-2430
   Email: senos@isl.melco.co.jp

Full Copyright Statement

   "Copyright (C) The Internet Society (2003).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE."