Network Working Group                                             L. Guo
Internet-Draft                                                     CAICT
Intended status: Informational                                   Y. Feng
Expires: 10 September 2023                                  China Mobile
                                                                 J. Zhao
                                                           China Telecom
                                                                  F. Qin
                                                            China Mobile
                                                                 L. Zhao
                                                                 H. Wang
                                                                  Huawei
                                                                 W. Quan
                                             Beijing Jiaotong University
                                                            9 March 2023


        Requirement of Fast Fault Detection for IP-based Network
                      draft-guo-ffd-requirement-01

Abstract

   Distributed systems and application-layer software on IP networks
   often use heartbeats to maintain network topology status.  However,
   the heartbeat interval is typically long, which prolongs fault
   detection time.  This document describes the requirements for a fast
   fault detection solution for IP-based networks.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

Guo, et al.             Expires 10 September 2023               [Page 1]

Internet-Draft              Abbreviated-Title                 March 2023

   This Internet-Draft will expire on 10 September 2023.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Use Cases
     3.1.  IP-based NVMe
     3.2.  Distributed Storage
     3.3.  Cluster Computing
   4.  Requirement
   5.  Security Considerations
   6.  IANA Considerations
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Authors' Addresses

1.  Introduction

   As data volumes keep growing, even a powerful single-server system
   cannot meet the requirements of data analysis and storage.  At the
   same time, as Ethernet bandwidth and scale increase, distributed
   systems that communicate over the network have emerged and developed
   rapidly.

   Heartbeat is a common technique for maintaining network topology
   status in distributed systems and application-layer software.
   However, if the heartbeat interval is set too short, transient
   network congestion may cause a healthy peer to be misjudged as
   faulty.  If the interval is set too long, fault detection is slow.
   In practice, the interval has to be tuned to balance these two
   effects under the prevailing conditions.  IP-based NVMe, distributed
   storage, and cluster computing are used in core application
   scenarios.
   The performance requirements of these scenarios, and the impact of
   faults on their services, are increasing.  This document describes
   application scenarios and capability requirements for fast fault
   detection in scenarios such as IP-based NVMe, artificial
   intelligence, and distributed storage.

2.  Terminology

   AI:  Artificial intelligence

   FC:  Fibre Channel

   HPC:  High-performance computing

   NVMe:  Non-Volatile Memory Express

   IP-based NVMe:  using RDMA or TCP to transport NVMe through Ethernet

   NoF:  NVMe over Fabrics

3.  Use Cases

3.1.  IP-based NVMe

   For a long time, key storage applications with high performance
   requirements have mainly been built on FC networks.  As transmission
   rates have increased, the storage medium has evolved from HDDs to
   solid-state storage and the protocol has evolved from SATA to NVMe.
   The emergence of new NVMe technologies brings new opportunities.  As
   the NVMe protocol has developed, its application scenario has been
   extended from PCIe to other fabrics, which addresses the limits on
   NVMe scalability and transmission distance.  The block storage
   protocol uses NoF to replace SCSI, reducing the number of protocol
   interactions between application hosts and storage systems.  The
   end-to-end NVMe protocol greatly improves performance.

   The fabrics used for NoF include Ethernet, Fibre Channel, and
   InfiniBand.  Comparing FC-NVMe with Ethernet-based or
   InfiniBand-based alternatives generally comes down to the advantages
   and disadvantages of the underlying networking technologies.  Fibre
   Channel fabrics are noted for their lossless data transmission,
   predictable and consistent performance, and reliability.  Large
   enterprises tend to favor FC storage for mission-critical workloads.
   However, Fibre Channel requires special equipment and storage
   networking expertise to operate, and can be more costly than IP-based
   alternatives.
   Like FC, InfiniBand is a lossless network requiring special hardware.
   IP-based NVMe storage products tend to be more plentiful than
   FC-NVMe-based options, and most storage startups focus on IP-based
   NVMe.  However, unlike an FC switch, an Ethernet switch does not
   notify endpoints of changes in device status.  When a device is
   faulty, the host has to rely on the NVMe link heartbeat mechanism and
   takes tens of seconds to complete service failover.

              +--------------------------------------+
              |          NVMe Host Software          |
              +--------------------------------------+
              +--------------------------------------+
              |   Host Side Transport Abstraction    |
              +--------------------------------------+
                 /\      /\      /\      /\      /\
                /  \    /  \    /  \    /  \    /  \
                 FC      IB     RoCE    iWARP   TCP
                \  /    \  /    \  /    \  /    \  /
                 \/      \/      \/      \/      \/
              +--------------------------------------+
              |Controller Side Transport Abstraction |
              +--------------------------------------+
              +--------------------------------------+
              |            NVMe SubSystem            |
              +--------------------------------------+

                        Figure 1: NVMe SubSystem

   This section describes the application scenarios and capability
   requirements of IP-based NVMe storage that implements fast fault
   detection similar to FC.

   An NVMe over RDMA or IP-based storage network includes three types of
   roles: an initiator (referred to as a host), a switch, and a target
   (referred to as a storage device).  Initiators and targets are also
   referred to as endpoint devices.

                   +--+      +--+      +--+      +--+
       Host        |H1|      |H2|      |H3|      |H4|
    (Initiator)    +/-+      +-,+      +.-+      +/-+
                    |         | '.   ,-`|         |
                    |         |   `',   |         |
                    |         | ,-`  '. |         |
                  +-\--+    +--`-+    +`'--+    +-\--+
                  | SW |    | SW |    | SW |    | SW |
                  +--,-+    +---,,    +,.--+    +-.--+
                      `.          `'.,`         .`
                        `.   _,-'`    ``'.,   .`
       IP                 +--'`+        +`-`-+
       Network            | SW |        | SW |
                          +--,,+        +,.,-+
                        .`    `'.,  ,.-``   ',
                      .`      _,-'`           `.
                  +--`-+    +--'`+    `'---+    +-`'-+
                  | SW |    | SW |    | SW |    | SW |
                  +-.,-+    +-..-+    +-.,-+    +-_.-+
                    | '.   ,-` |        | `., .'   |
                    |   `',    |        |   '.`    |
                    | ,-`  '.  |        | ,-` `',  |
       Storage     +-`+      `'\+      +-`+      +`'+
       (Target)    |S1|      |S2|      |S3|      |S4|
                   +--+      +--+      +--+      +--+

                  Figure 2: NVMe over IP-based Network

   Hosts and storage devices are connected to the network separately.
   To achieve high reliability, each host and each storage device is
   connected to two network planes simultaneously.  Once an NVMe
   connection is established between a host and a storage device, the
   host can read and write data.

   When a storage device link fails during operation, the host cannot
   detect the fault status of the indirectly connected device at the
   transport layer.  With IP-based NVMe, the host uses the NVMe
   heartbeat to detect the status of the storage device.  The heartbeat
   interval is 5 s, so it takes tens of seconds to determine that the
   storage device is faulty and to switch services over using the
   multipath software.  This exceeds the failure tolerance time of core
   applications.  To meet customer experience and business reliability
   requirements, fault detection and failover for IP-based NVMe need to
   be enhanced.

   Storage systems offer an active-active solution.  A proposal in
   progress in NVMe uses the second active path to convey fault
   information and drive the switchover at the source node.  However,
   this only addresses local link faults of the storage node; it does
   not address faults that the network cannot converge around.  Storage
   deployments may use independent dual-plane networking.  In such a
   deployment, a device in a single plane may fail, and the network
   cannot fully converge.  This document therefore proposes a fast fault
   detection solution in which switches participate.
   This scheme uses the ability of switches to detect faults quickly at
   the physical layer and link layer.  It allows a switch to synchronize
   the detected fault information across the IP network and then notify
   the endpoint devices of the fault status.

   Fault detection procedure:

   With this procedure, the host can detect the fault status of the
   storage device and quickly switch to the standby path.

   1.  If a storage fault occurs, the access switch detects the fault at
       the storage network layer or link layer.

   2.  The switch synchronizes the status to other switches on the
       network.

   3.  The switch notifies the hosts of the storage fault information.

   4.  The host quickly disconnects from the storage device and triggers
       the multipathing software to switch services to the redundant
       path.

   The fault should be detected within 1 s.

   +----+      +-------+    +-------+    +-------+
   |Host|      |Switch |    |Switch |    |Storage|
   +----+      +-------+    +-------+    +-------+
     |             |            |-+          |
     |             |            |1|          |
     |             |            |-+          |
     |             |<----2------|            |
     |             |            |            |
     |<----3-------|            |            |
     |             |            |            |
     |<----4-------|------------|----------->
     |             |            |            |

       Figure 3: Switches interact with hosts and storage devices

3.2.  Distributed Storage

   Distributed storage cluster devices are interconnected through a
   network (the back-end IP network) to form a cluster.  When a node or
   one of its links fails, the other nodes in the storage cluster cannot
   detect the fault status of indirectly connected devices through the
   transport layer.  Over IP, the management or master nodes in a
   storage cluster use heartbeats to detect the status of storage nodes.
   It takes 10 seconds or more to determine that a storage device is
   faulty and to switch services to another healthy storage node.
   Services cannot be accessed during the fault.  To meet customer
   experience and service reliability requirements, the fault detection
   and failover of IP-based cluster nodes need to be enhanced.
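
   To make the cost of heartbeat-based detection concrete, the Python
   sketch below estimates how long a timeout-based heartbeat monitor
   takes to notice a failure.  It is an illustration under stated
   assumptions, not part of any protocol: the function name
   heartbeat_detection_window and the missed-heartbeat threshold of 3
   are hypothetical, and the 5 s interval is borrowed from the NVMe
   example in Section 3.1.

```python
def heartbeat_detection_window(interval_s: float,
                               missed_threshold: int) -> tuple[float, float]:
    """Return (best_case_s, worst_case_s) for a timeout-based detector.

    Assumed model (not taken from any specification): the monitor
    declares a peer faulty once `missed_threshold` heartbeat intervals
    have passed since the last heartbeat it received.
    """
    if interval_s <= 0 or missed_threshold < 1:
        raise ValueError("interval must be positive, threshold at least 1")
    timeout_s = interval_s * missed_threshold
    # Peer fails just before its next heartbeat is due: almost one full
    # interval of the timeout has already elapsed when the fault occurs.
    best_case_s = timeout_s - interval_s
    # Peer fails right after sending a heartbeat: the whole timeout
    # still has to elapse before the monitor reacts.
    worst_case_s = timeout_s
    return best_case_s, worst_case_s


if __name__ == "__main__":
    # Hypothetical values: 5 s interval, 3 missed heartbeats.
    best, worst = heartbeat_detection_window(interval_s=5.0, missed_threshold=3)
    print(f"heartbeat detection takes between {best:.0f} s and {worst:.0f} s")
```

   Under these assumptions detection takes between 10 s and 15 s, which
   is in line with the "10 seconds or more" figure above.  Shortening
   the interval narrows the window but raises the risk of misjudgment
   under congestion.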

       Storage     +--+      +--+      +--+      +--+
       cluster     |S1|      |S2|      |S3|      |S4|
                   +--+      +--+      +--+      +--+
                    | '.                      ,-` |
                    |            .`',_            |
                    |  _ ..--`          `'--.._   |
                  +-\--+                        +-\--+
                  | SW |                        | SW |
                  +--,-+_                      _+-.--+
                      `.  `'--..._  _ .. -- '`_.`
                        `.   _,-'`  -._       .`
       BACK Storage       +--'`+        +`-`-+
       IP Network         | SW |        | SW |
                          +----+        +----+

                     Figure 4: Distributed storage

   The fast fault detection solution in this proposal can be used in
   this scenario.  The solution takes advantage of the switch's ability
   to quickly detect faults at the physical layer and link layer, and
   allows the switch to synchronize the detected fault information
   across the IP network.  The switch then notifies the storage cluster
   management node or master node of the fault status.

   Fault detection procedure:

   1.  If a storage fault occurs, the access switch detects the fault at
       the storage network layer or link layer.

   2.  The switch synchronizes the status to other switches on the
       network.

   3.  The switch notifies the storage management or master node of the
       storage fault information.

   The fault should be detected within 1 s.

  +------+       +-------+    +-------+    +-------+
  |master|       |Switch |    |Switch |    |Storage|
  +------+       +-------+    +-------+    +-------+
     |               |            |-+          |
     |               |            |1|          |
     |               |            |-+          |
     |               |<----2------|            |
     |               |            |            |
     |<----3---------|            |            |
     |               |            |            |

              Figure 5: Switches interact with controller

3.3.  Cluster Computing

   In cluster computing scenarios, such as HPC and AI cluster
   applications, faults may occur on any node at any time.  For a
   high-performance computing task, once a fault occurs the entire task
   needs to be rescheduled.  However, it takes several minutes for the
   management node to detect the fault status of a node.  During this
   period, new jobs may be scheduled onto the faulty node and fail.  The
   fast fault detection solution in this proposal can be used in this
   scenario, so that the fault can be detected within seconds.
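
   The detect, synchronize, and notify steps used in Sections 3.2 and
   3.3 can be sketched as a small in-memory Python model.  Everything in
   it is hypothetical and for illustration only: the class names Switch
   and Endpoint, the subscribe call, and the direct method calls that
   stand in for real synchronization and notification messages.  This
   document does not define any message format or API.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Hypothetical endpoint, e.g. a management or scheduling node."""
    name: str
    faulty_peers: set = field(default_factory=set)

    def on_fault(self, failed: str) -> None:
        # A real endpoint would fail over or stop scheduling jobs here.
        self.faulty_peers.add(failed)


@dataclass
class Switch:
    """Hypothetical switch that detects, synchronizes, and notifies."""
    name: str
    peers: list = field(default_factory=list)        # other switches
    subscribers: dict = field(default_factory=dict)  # watched node -> endpoints
    known_faults: set = field(default_factory=set)

    def subscribe(self, endpoint: Endpoint, watched: str) -> None:
        self.subscribers.setdefault(watched, []).append(endpoint)

    def link_down(self, failed: str) -> None:
        # Step 1: the access switch detects the fault on its own port.
        self._learn(failed)

    def _learn(self, failed: str) -> None:
        if failed in self.known_faults:
            return  # already handled; this stops synchronization loops
        self.known_faults.add(failed)
        # Step 2: synchronize the fault to the other switches.
        for peer in self.peers:
            peer._learn(failed)
        # Step 3: notify locally attached endpoints that subscribed.
        for endpoint in self.subscribers.get(failed, []):
            endpoint.on_fault(failed)


if __name__ == "__main__":
    sw_a, sw_b = Switch("SW-A"), Switch("SW-B")
    sw_a.peers.append(sw_b)
    sw_b.peers.append(sw_a)

    manager = Endpoint("management-node")
    sw_a.subscribe(manager, watched="compute-1")  # manager attaches to SW-A

    sw_b.link_down("compute-1")                   # compute-1 attaches to SW-B
    print(manager.faulty_peers)
```

   In this sketch the management node learns about the failed node from
   its own access switch as soon as the fault has been synchronized, so
   detection no longer depends on a heartbeat timeout.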

+-----------------+            +-------+    +-------+    +----------+
|   Management/   |            |Switch |    |Switch |    | Computer |
| Scheduling node |            |       |    |       |    |   node   |
+-----------------+            +-------+    +-------+    +----------+
        |                          |            |-+           |
        |                          |            |1|           |
        |                          |            |-+           |
        |                          |<----2------|             |
        |                          |            |             |
        |<----3--------------------|            |             |
        |                          |            |             |

              Figure 6: Switches interact with HPC cluster

   The fault detection procedure is similar to that of distributed
   storage, as shown in Figure 6.

4.  Requirement

   In distributed Ethernet systems and cross-network connection
   scenarios, the following requirements are raised to accelerate
   failover:

   1.  A network device can detect link or network failure.

   2.  A network device can synchronize the failure to other network
       devices.

   3.  A network device can notify local access endpoints of local or
       remote failure information.

   4.  A network device sends a notification to an endpoint when it
       detects, or is notified of, any failure to which that endpoint
       has subscribed.

5.  Security Considerations

   The functions described in these requirements are mainly intended
   for limited networks.  They need to be deployed by the operator, who
   controls their scope of use.

   The requirements involve network devices sending notification
   messages to endpoint devices, which requires the cooperation of the
   endpoint devices.  To limit the reach of notification messages, it is
   recommended that network devices use L2 messages to implement the
   notification function, so that each notification is confined to the
   access range of the access node and does not cause a flood of
   notification messages.  Given this scope, notification messages
   should only be generated by access network devices and should not be
   forwarded by network devices, so network devices also need to control
   how these messages are received and published.
   Synchronization messages between network devices are exchanged over
   a session between the devices.  Message encryption and authentication
   can be applied to this session using mature, existing technology.

6.  IANA Considerations

   This document has no IANA actions.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

7.2.  Informative References

Authors' Addresses

   Liang Guo
   CAICT
   No.52, Hua Yuan Bei Road, Haidian District,
   Beijing
   100191
   China
   Email: guoliang1@caict.ac.cn

   Yi Feng
   China Mobile
   12 Chegongzhuang Street, Xicheng District
   Beijing
   China
   Email: fengyiit@chinamobile.com

   Jizhuang Zhao
   China Telecom
   South District of Future Science and Technology, Changping District
   Beijing
   China
   Email: zhaojzh@chinatelecom.cn

   Fengwei Qin
   China Mobile
   12 Chegongzhuang Street, Xicheng District
   Beijing
   China
   Email: qinfengwei@chinamobile.com

   Lily Zhao
   Huawei
   No. 3 Shangdi Information Road, Haidian District
   Beijing
   China
   Email: Lily.zhao@huawei.com

   Haibo Wang
   Huawei
   No. 156 Beiqing Road
   Beijing
   P.R. China
   Email: rainsword.wang@huawei.com

   Wei Quan
   Beijing Jiaotong University
   3 Shangyuan Cun, Haidian District
   Beijing
   P.R. China
   Email: weiquan@bjtu.edu.cn