DECADE Xin.Wang Internet Draft Fudan University Intended status: Informational Jin.Zhao Expires: April 2011 Fudan University Tiegang.Zeng Fudan University Jun.Li Fudan University Lei.Liu Fudan University Shihui.Duan CATR October 18, 2010 Router-supported Data Regeneration for In-network Storage Systems draft-wang-decade-data-regeneration-00.txt Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html This Internet-Draft will expire on April 18, 2009. Copyright Notice Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved. Wang, et al. Expires April 18, 2011 [Page 1] Internet-Draft data regeneration October 18, 2010 This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Abstract In-network storage systems store redundancy to compensate for the data loss incurred by hardware failures. To maintain the amount of redundancy, in-network storage systems should regenerate the lost redundancy on a replacement server. This document introduces a practical work of router-supported data regeneration in in-network storage systems. Supporting routers in such systems can help to reduce the traffic incurred during the regeneration and save the regeneration time. This document presents a regeneration process that exploits the bandwidth diversity in the network, and the corresponding protocol for storage servers and supporting routers such that supporting routers can work transparently to support the regeneration process. Wang, et al. Expires April 18, 2011 [Page 2] Internet-Draft data regeneration October 18, 2010 Table of Contents 1. Introduction ................................................ 4 2. Key problems ................................................ 4 3. Overview of the Router-Supported Regeneration Process ......... 6 3.1. Regeneration Process .................................... 6 3.2. File Regeneration Protocol .............................. 7 3.3. System Implementation and Components .................... 9 3.4. Key Technologies ....................................... 10 4. Experiment ................................................. 11 5. Security Considerations ..................................... 12 6. IANA Considerations ........................................ 12 7. Conclusions ................................................ 12 8. References ................................................. 13 8.1. Normative References ................................... 13 8.2. Informative References ................................. 13 9. Acknowledgments ............................................ 13 Wang, et al. Expires April 18, 2011 [Page 3] Internet-Draft data regeneration October 18, 2010 1. Introduction In-network storage systems store a substantial volume of data in an overlay network containing a large number of storage servers, which can be used for online storage service, file sharing, CDN, and etc. In such systems, peer churns and hardware failures are unavoidable, such that some data may not be accessible. Thus, the system should maintain a certain ratio of redundancy such that a subset of data is enough for data recovery. Storing coded data of original files rather than their replicas [1] can maintain higher data integrity [2], such that any k of n storage servers can retrieve the original data. On the other hand, if a storage server fails, a replacement server should regenerate the data stored in the failed server. For coded data that guarantee that any k servers can retrieve the original file, the replacement server, which we call as "newcomer" in this document, should contact at least k storage servers, which we call as "providers" in this document. The efficiency of such regeneration is influenced by the topology of the network. First, since the newcomer should contact multiple providers, several flows of data from providers may converge at some links, and thus incur a significant bottleneck in the transmission. Second, the topology may have an influence on the available bandwidth between two servers. For example, two servers in a subnet may probably have higher available bandwidth between them than two servers in two different subnets. As shown in [3], the diversity of bandwidth incurred by network topology can be exploited by letting providers relay the data flow from other providers to form a tree- structured topology during the regeneration, where network coding naturally resides, such that some slow links can be bypassed. In this document, we show that the devices that relay the traffic during the regeneration can be not only providers, but routers as well. Routers can support the regeneration, as it can encode data when multiple data flows converge and reduce the overall traffic during the regeneration. Since the redundant maintenance is independent of data access, the scheme we present is compatible with the protocol to access in-network storage [4]. We show that routers can support the regeneration process transparently such that no storage servers should be aware of such routers or the network topology. 2. Key problems In order to support data regeneration in in-network storage systems, routers should be able to work transparently such that storage Wang, et al. Expires April 18, 2011 [Page 4] Internet-Draft data regeneration October 18, 2010 servers participating in the regeneration do not need any information about routers in the network. To satisfy this goal, the following key problems should be considered: a) Bandwidth: During the regeneration, since providers are allowed to relay the traffic from other providers, available bandwidth between servers should be measured to determine the optimal routing. However, since supporting routers are supposed to be transparent to storage servers, we can only measure the end-to-end available bandwidth between each two servers participating in the regeneration process. b) Routing: Since we can only get the table of the available bandwidth between each pair of participating servers, we first determine the routing on the overlay network covering the participating servers, such that the transmission rate during the regeneration is optimized. c) Mapping: To make the supporting routers work, we need a mapping mechanism to make supporting routes aware of the regeneration process and know how to act during the regeneration. After the routing in the overlay network has been determined, participating servers may send data to their next-hop servers, to make supporting routers between them know that there will be traffic of a regeneration process coming soon. Supporting routers should be able to determine how many flows will come, whether it is necessary to encode such flows and where the next-hop is. After mapping, the data transmission can start. d) Reliability: Data transmission is unreliable in the network due to packet loss, disorder or error. Supporting routers should have a mechanism to make sure the data transmission is reliable. If the transmission between two servers is based on TCP, supporting routers should maintain the TCP state of incoming flows. If the transmission is based on UDP, there should be application-level retransmission scheme to guarantee the data reliability. e) Congestion control: TCP provides a mechanism to control the congestion in the network. However, if multiple flows are encoded by a supporting router, the router should control the congestion in place of the destination server. However, if the transmission is based on UDP, participating servers should perform end-to-end congestion control. Wang, et al. Expires April 18, 2011 [Page 5] Internet-Draft data regeneration October 18, 2010 3. Overview of the Router-Supported Regeneration Process Apart from data access, data regeneration is a part of functions provided by in-network storage systems. Our scheme requires that coded data are stored in the systems, such that a file is divided into k blocks and n encoded blocks are produced after encoding in which any k blocks can recover the original file. The coding technique should be decentralized such that the coding operations are not necessary to perform on a single server, such as random linear coding [5]. The n encoded blocks are stored in n storage servers, i.e., each storage server stores one encoded block. When a server failure occurs, a replacement server, called newcomer, should contact at least k encoded blocks, and reconstruct a new encoded block by re- encoding these encoded blocks. 3.1. Regeneration Process Data regeneration process should be carried out as follows. 1. A server failure is detected and then a regeneration process is triggered. One newcomer and at least k providers are selected. A weighted complete graph covering all these k + 1 servers is made in which the weight of each edge denotes the available bandwidth measured between each pair of the k + 1 servers. A maximum spanning tree, called regeneration tree, is constructed on such a complete graph. 2. The newcomer obtains IP addresses of all providers and the regeneration tree. It then sends a NOTIFICATION messages to each provider. 3. Each provider replies an ACK message when it receives a NOTIFICATION message to its parent in the regeneration tree. 4. When an ACK message goes through a supporting router, the supporting router forwards this message and stores IP addresses of the source and the destination of the ACK message. An operation table should be constructed on the supporting router that contains the sources and destinations of the received ACK message and the number of hops to the corresponding destinations. 5. Non-leaf providers modify the type of the received ACK message to a DACK message and forward it to the newcomer. If the newcomer has received ACK or DACK message from all providers, it sends a DETECT message to each provider. Wang, et al. Expires April 18, 2011 [Page 6] Internet-Draft data regeneration October 18, 2010 6. When a provider receives a DETECT message, it replies a RE-DETECT message to its parent in the regeneration tree. When a RE-DETECT message goes through a supporting router, the supporting router select the destination with the minimum number of hops in the operation table as the new destination of the RE-DETECT message and then forwards it. The supporting router selects IP addresses of sources of all received RE-DETECT messages and construct an encoding table. 7. Non-leaf providers modify the type of the received RE-DETECT message to a DRE-DETECT message and forward it to the newcomer. If the newcomer has received RE-DETECT or DRE-DETECT messages from all providers, it sends a START message to each provider. 8. All providers begin to send data in DATA messages that contain fixed number of bits of data to its parent node when it has received a START message. A provider that has incoming flow(s) has to wait to send its first DATA message until the first DATA message of each incoming flow has arrived and it has encoded the received data with the data it stores. 9. When a DATA message goes through a supporting router, the supporting router stores it until it has received corresponding DATA messages from all entries in its encoding table. It then encodes the data in the received DATA message and sends a new DATA message that contains the encoded data to the destination with the minimum number of hops in the operation table. 10.The newcomer stores the received data. If the newcomer has multiple incoming flows, it encodes the received data and stores them. The regeneration finishes when it has received all data. 3.2. File Regeneration Protocol The File Regeneration Protocol (FRP) runs on the application layer, as shown in Figure 1. +-----------------------------------------------------------+ | MAC | IP | TCP/UDP | FR | PAYLOAD | | HEADER | HEADER | HEADER | HEADER | | +-----------------------------------------------------------+ Figure 1 The overall structure of the File Regeneration Protocol The structure of the FR head is shown as Figure 2. Wang, et al. Expires April 18, 2011 [Page 7] Internet-Draft data regeneration October 18, 2010 0 1 16 +-----------------------------------------------------------+ |CMD | SOURCE | |TYPE| IP ADDRESS | +-----------------------------------------------------------+ | | | FILE BLOCK NAME | +-----------------------------------------------------------+ |RESERVE|PROVIDER |PACKET | | |NUMBER |NUMBER | +--------------------------+ 0 2 4 6 Figure 2 The structure of the FR header "CMD TYPE" indicates the type of the message that includes: NOTIFICATION: The newcomer sends a NOTIFICATION messages to providers to start a regeneration process. ACK: A provider replies an ACK message to the newcomer when it receives a NOTIFICATION message. An ACK message makes supporting routers that it goes through be aware of the regeneration process. DACK: A DACK message is no different from a ACK message except for the CMD TYPE. Supporting routers will not process the DACK message. DETECT: The newcomer sends DETECT messages to providers when it has received ACK messages that contain addresses of all providers. RE-DETECT: A provider replies an DETECT message to the newcomer when it receives a DETECT message. A RE-DETECT message makes supporting routers be aware of the number of incoming flows during the upcoming data transmission. DRE-DETECT: A DRE-DETECT message is no different from a RE-DETECT message except for the CMD TYPE. Supporting routers will not process the DRE-DETECT message. START: The newcomer sends START messages to all providers, indicating all servers are ready. DATA: Providers send DATA messages that contain fixed number of bits of data to its parent in the regeneration tree. DATA messages may be encoded at supporting routers and are finally forwarded to the newcomer. Wang, et al. Expires April 18, 2011 [Page 8] Internet-Draft data regeneration October 18, 2010 "SOURCE IP ADDRESS" represents the IP address of the last encoding device, which may be a provider or a supporting router. "FILE BLOCK NAME" represents the name of the file block in the transmission. "RESERVE" represents the segment that is reserved for future applications. "PROVIDER NUMBER" represents the identifier number of the provider. "PACKET NUMBER" represents the sequential number of the file block in the transmission. 3.3. System Implementation and Components Router-supported data regeneration is an independent component in in- network storage systems. Since there are a large number of storage servers in the system, the server failure occurs frequently. To maintain the data integrity, a high-efficient mechanism is necessary to regenerate the lost data when a server fails. The implementation of our proposed in-network storage system contains two parts: storage servers and supporting routers. From the perspective of functions, storage servers are composed of three functional parts: dispatcher, newcomer and provider. Figure 3 illustrates the system architecture and components of the router-supported data regeneration. +----+ +----+ |---| SR |---| PR | | +----+ +----+ +----+ +----+ +----+ | | | DI |---| NC |---| SR |---| | +----+ +----+ +----+ | | | +----+ +----+ |---| SR |---| PR | +----+ +----+ Figure 3 The architecture and components of the router-supported data regeneration DISPATCHER (DI): It manages the whole system, including the selection of the newcomer and providers and the detection of server failures. NEWCOMER (NC): It starts the regeneration process with providers and accepts data from providers. It encodes the received data and stores the encoded data as a regenerated block. Wang, et al. Expires April 18, 2011 [Page 9] Internet-Draft data regeneration October 18, 2010 PROVIDER (PR): Providers are storage server that provider data to the newcomer in the regeneration process. SUPPORTING ROUTER (SR): Supporting routers are routers with computing and cache capabilities and can support regeneration. Before the data transmission during the regeneration, it collects information from ACK and RE-DETECT message to be aware of the incoming flows and the destination in the regeneration process. 3.4. Key Technologies Some key technologies are presented in this section, such that data transmission rate during the regeneration can be improved, the bandwidth consumption is saved and the time spent during the regeneration can be reduced. 1. When the newcomer and providers have been selected, we measure the available bandwidth between each pair of the newcomer and providers. A complete graph covering the newcomer and providers can be constructed and the weight of each edge is the corresponding available bandwidth between two servers. A maximum spanning tree, i.e., a regeneration tree, then is constructed on this graph, in which the newcomer is the root. All non-root servers in the regeneration tree send their data to its parent. When a flow of data transferred goes through a supporting router, it may be encoded with other flows and forwarded to another server. Compared with conventional regeneration process, the method we propose utilizes the link with higher available bandwidth in the network, reduces the communication cost and thus increases the transmission rate during the regeneration. 2. Another key technology in the router-supported data regeneration is data encoding on the supporting router. During the regeneration, supporting router detects File Regeneration Protocol (FRP) by processing all IP packets that go through it. Supporting routers recognizes the file block being regenerated by analyzing the FRP. If multiple flows of the same block come during the regeneration, a supporting router should encode the received data. The header of FRP enables the supporting router to know whether encoding operations should be performed. Utilizing the computing capability, encoding operations that should have been performed on the newcomer or providers are partially transferred to supporting routers, such that supporting routers sends out only one data flow even if it receives multiple data flows. Therefore, multiple data flows sharing the same physical link are eliminated or at least partially eliminated, and transmission rate during the regeneration can be significantly improved. Wang, et al. Expires April 18, 2011 [Page 10] Internet-Draft data regeneration October 18, 2010 4. Experiment We present our evaluation results of our proposed scheme here. We implement supporting routers on servers running Linux (kernel 2.6.30). In the network, there are four supporting routers and four storage servers, among which one is selected as both the dispatcher and the newcomer and others are providers. The storage servers and supporting routers can connect in a topology shown in Figure 4. +-----+ +-----+ +-----+ +-----+ |DI+NC|----| SR1 |----| SR4 |----| PR3 | +-----+ +-----+ +-----+ +-----+ | | | | | | +-----+ +-----+ +-----+ +-----+ | PR1 |----| SR2 |----| SR3 |----| PR2 | +-----+ +-----+ +-----+ +-----+ Figure 4 the network topology in the experiment Each link in the network topology refers to a fast Ethernet (100BASE- TX, specifically). Actually we control the available bandwidth on each link as follows. Table 1 The available bandwidth in the network topology +-------------------------------------------------------+ | Node 1 | Node 2 | Available Bandwidth (MBps/S) | +-------------------------------------------------------+ | DI+NC | SR1 | 60 | | SR1 | SR4 | 50 | | SR4 | PR3 | 80 | | SR1 | SR2 | 40 | | SR1 | SR3 | 25 | | SR4 | SR3 | 20 | | PR1 | SR2 | 80 | | SR2 | SR3 | 50 | | SR3 | PR2 | 80 | +-------------------------------------------------------+ We regenerate a coded block with a size of 6,000,000 bytes. We compare the time spent in the regeneration process and the bandwidth consumed between the conventional regeneration process in which the newcomer receives data directly from providers and the regeneration process we propose above. The experiment is repeated for 100 times and the result is the average value. Table 2 The average bandwidth consumption Wang, et al. Expires April 18, 2011 [Page 11] Internet-Draft data regeneration October 18, 2010 +-------------------------------------+ | conventional | router-supported | | regeneration | regeneration | +-------------------------------------+ | 58241047 byte | 45606648 bytes | +-------------------------------------+ Table 2 shows the bandwidth consumption on average. We count the total number of bytes all providers and supporting routers send out. We can see that router-supported regeneration process is able to reduce the bandwidth consumption by 21.7%, since supporting routers can encode data from multiple devices. Table 3 The average regeneration time +-------------------------------------+ | conventional | router-supported | | regeneration | regeneration | +-------------------------------------+ | 95.4 sec. | 49.4 sec. | +-------------------------------------+ Table 3 shows the average regeneration time. Router-supported regeneration can reduce the regeneration time by 48.3%, because it can not only reduce the bandwidth consumption, but also utilize the network topology by bypassing links with low available bandwidth. 5. Security Considerations This draft does not introduce any new security issues. 6. IANA Considerations This memo includes no request to IANA. 7. Conclusions We propose a topology-aware regeneration process for in-network storage system such that bandwidth diversity in the network can be exploited and routers may support the regeneration process by encoding the incoming data flows. We present the corresponding protocol to configure the regeneration process adaptively in which supporting routers and servers do not need to know the network topology and make decisions by their local information. System architecture is presented and related key technologies are discussed. Wang, et al. Expires April 18, 2011 [Page 12] Internet-Draft data regeneration October 18, 2010 8. References 8.1. Normative References [1] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck, "Network File System (NFS) version 4 Protocol", RFC 3510, 2003. [2] H. Weatherspoon and J. Kubiatowicz, "Erasure Coding vs. Replication: A Quantitative Comparison," Peer-to-Peer Systems, vol. 2429/2002, pp. 328-337, 2002. [3] J. Li, S. Yang, X. Wang, and B. Li, "Tree-structured Data Regeneration in Distributed Storage Systems with Regenerating Codes," in Proc. IEEE INFOCOM, 2010. [4] H. Song, N. Zong, Y. Yang, and R. Alimi, "DECoupled Application Data Enroute (DECADE) Problem Statement," http:// http://tools.ietf.org/id/draft-ietf-decade-problem-statement- 00.txt. [5] T. Ho, R. Koetter, M. Medard, D. Karger, and M. Effros, "The Benefits of Coding over Routing in a Randomized Setting," in Proc. International Symp. Inform. Theory, pp. 442, 2003. 8.2. Informative References 9. Acknowledgments Wang, et al. Expires April 18, 2011 [Page 13] Internet-Draft data regeneration October 18, 2010 Authors' Addresses Xin Wang Fudan University Shanghai 201203, China Phone: 86-13917836410 Email: xinw@fudan.edu.cn Jin Zhao Fudan University Shanghai 201203, China Phone: 86-15900825171 Email: jzhao@fudan.edu.cn Tiegang Zeng Fudan University Shanghai 201203, China Phone: 86-21-51355526 Email: 09210240087@fudan.edu.cn Jun Li Fudan University Shanghai 201203, China Phone: 86-21-51355526 Email: 0572222@fudan.edu.cn Lei Liu Fudan University Shanghai 201203, China Phone: 86-21-51355526 Email: 09210240117@fudan.edu.cn Shihui Duan CATR Beijing 100045, China Phone: 86-10-63200068 Email: duanshihui@catr.cn Wang, et al. Expires April 18, 2011 [Page 14]