Network Working Group                                       V. Narayanan
Internet-Draft                                                    S. Das
Intended status: Standards Track                          A. Swaminathan
Expires: January 1, 2010                                  Qualcomm, Inc.
                                                           June 30, 2009

                  Deterministic Replication for RELOAD
                   draft-vidya-p2psip-replication-00

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 1, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before
   November 10, 2008.  The person(s) controlling the copyright in some
   of this material may not have granted the IETF Trust the right to
   allow modifications of such material outside the IETF Standards
   Process.

Narayanan, et al.        Expires January 1, 2010                [Page 1]

Internet-Draft            Replication for RELOAD               June 2009
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Abstract

   RELOAD [1] provides successor replication of data to protect against
   data loss caused by churn on the overlay.  It specifies storing two
   redundant copies of data on the two immediate successors of a
   particular node.  This provides basic replication and is essential
   to the stability of the overlay.  However, it does not address
   replicating data to meet the availability requirements for a
   particular piece of data, or making the replicas useful for load
   balancing on the overlay.  This document specifies a mechanism that
   provides application-agnostic, deterministic replication on the
   overlay to meet these needs.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Replication Mechanism
     3.1.  Determining Number of Replicas
       3.1.1.  Changes to Number of Replicas
     3.2.  Determining Replica Locations on the Overlay
   4.  Storing Replicas
     4.1.  Data Authorization for Replicas
     4.2.  Replica Consistency
   5.  Retrieving Replicas
   6.  Configuration Document Extension
   7.  Security Considerations
   8.  IANA Considerations
   9.  Acknowledgments
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   Replication is a basic mechanism for providing useful services on a
   peer-to-peer overlay, and is necessary for robust and available data
   discovery.  If data is stored only once in the overlay, it is
   susceptible to loss when the node on which it was stored leaves the
   network.  In a peer-to-peer network, churn (nodes joining and
   leaving the overlay due to mobility, devices being turned on and
   off, and so on) is common, and stored data items need sufficient
   robustness against loss due to churn.

   Replication on successors, as done in [1], provides some basic
   protection against data loss due to churn, but it does not take any
   availability metrics of the data itself into account.  Since the
   availability requirements of different types of data may differ, it
   is not feasible to account for them generically in successor
   replication.  Further, successor replication does not itself
   contribute to load balancing of data retrieval, since the resource
   id owner remains responsible for handling all requests for the data
   item.

   This document describes a scheme that accommodates churn and makes
   data fetches more robust by replicating data items over a certain
   number of nodes in the overlay, taking an availability metric into
   account.  The aim of the replication scheme is to let application
   designers store and retrieve data with some basic guarantees.
   This design inherently provides some load balancing of data
   retrieval as well, thereby reducing the load on the primary storing
   node.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119 [2].

3.  Replication Mechanism

   This section describes a basic replication mechanism for RELOAD.
   The goal of this simple scheme is to provide basic availability for
   important data items stored in the overlay.  The mechanism is meant
   to deal only with data items of a size supported by the overlay;
   RELOAD defines a size restriction for data stored on the overlay,
   specified by the overlay configuration document.  Dealing with
   larger data items would require more complicated mechanisms for
   consistency and concurrent access, as well as system-wide management
   of the bandwidth used to maintain replicas; such mechanisms are out
   of scope for this design.

   The basic mechanism proposed here takes a target availability A for
   the data item being stored and determines the number of replicas k
   required to attain that target availability, given the overlay
   network characteristics.  It then defines which k nodes should be
   used to store these replicas.  The scheme is based on research in
   [3].

3.1.  Determining Number of Replicas

   The number of replicas depends on two factors: (1) the target
   availability for the data item (A) and (2) the average node
   availability in the overlay network (H).  The higher the value of A,
   the larger the number of replicas that is likely to be required; the
   higher the value of H, the smaller that number is likely to be.  The
   replication mechanism takes values of A and H and finds a number of
   replicas k that satisfies the availability A.
   The relationship between A, H, and k can be characterized by the
   following equations:

      A = 1 - (1 - H)^k

      k = log(1 - A) / log(1 - H)

   The first equation characterizes the availability of an object as
   the probability that at least one replica of the object survives.
   Since H is the average host availability, the probability that a
   particular replica fails is 1 - H, and (1 - H)^k is the probability
   that all k replicas have failed.  This assumes that replica failures
   are independent, which is a valid assumption since we aim to choose
   the replica locations randomly (as discussed in the next section).
   Subtracting this value from 1 gives the probability that at least
   one replica of the data item is still available.  The second
   equation is obtained by solving the first equation for k.

   Given this equation, the number of replicas is determined by the A
   level for a data item and the H level of the overlay on which the
   data item is going to be stored.  The A level is obtained from the
   application requesting to store a data item.  This value is
   typically expected to be specified as a number of nines; for
   example, 3 nines of availability means that A is 0.999.  Studies on
   Internet availability have shown that it can be on the order of 3
   nines, and hence it makes sense to use a value of A lower than or
   equal to 3 nines.  However, we do not limit the values of A in this
   document.

   Determining the H value of an overlay network is a bigger challenge.
   Essentially, H is a number characterizing the average node dynamics
   in the overlay network.  H is the average node availability on the
   overlay and hence depends on measuring the availability of each node
   in the overlay.  We propose to measure the availability of a single
   node in the overlay using a model proposed in [3].
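   The relationship between A, H, and k above can be sketched in a few
   lines of code.  This is an illustrative sketch only; the function
   name is hypothetical and not part of RELOAD:

```python
import math

def replicas_needed(target_availability: float, host_availability: float) -> int:
    """Smallest integer k such that 1 - (1 - H)^k >= A.

    Follows the draft's model, which assumes independent replica
    failures.  A is the target availability, H the average node
    availability.
    """
    if not (0.0 < target_availability < 1.0 and 0.0 < host_availability < 1.0):
        raise ValueError("A and H must lie strictly between 0 and 1")
    k = math.log(1.0 - target_availability) / math.log(1.0 - host_availability)
    return math.ceil(k)

# 3 nines of availability on an end-user p2p overlay (H = 0.5):
print(replicas_needed(0.999, 0.5))   # -> 10
# The same target on an enterprise overlay (H = 0.85):
print(replicas_needed(0.999, 0.85))  # -> 4
```

   As the example shows, the replica count is driven entirely by A and
   H, not by the size of the overlay.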
   Each node has a membership lifetime, which is the time from when the
   node first joins the overlay until it leaves the overlay
   permanently.  During its membership lifetime, a node may have many
   sessions, with periods of disconnection due to mobility or to the
   node being turned on and off.  The node availability is the fraction
   of time the node is reachable, i.e., the sum of all its session
   times divided by its overall membership lifetime.  To distinguish
   temporary departure from permanent departure, the model uses a
   membership timeout: if a node is unreachable for more than the
   membership timeout Tau, it is assumed to have permanently left.

   Studies show that the H value can vary based on the type of overlay
   being considered.  Thus, the number of replicas k needed for 2 nines
   of availability would be higher for a p2p overlay in the "wild" than
   for an enterprise overlay network, for instance.  Based on studies
   done in this field, we propose that a static H value be configured
   for a network based on the type of overlay: 0.99 for managed
   distributed systems, 0.85 for enterprise overlays, and 0.5 for
   end-user based p2p overlays.  We also propose that the H value for
   local area overlays initially be 0.5 as well, because of the high
   churn expected due to mobility in such networks.  The H value must
   be specified in the overlay configuration document for use by all
   the nodes.

   Active measurement of the H value is possible, but is left for
   further study.  Nodes could conceivably arrive at their own
   determination of the value of H based on knowledge of their
   neighbors and fingers; however, this is not specified in this
   version of the document.  Note that the amount of replication does
   not depend on the number of nodes in the network, but on the
   dynamics of the network and the availability required.

3.1.1.  Changes to Number of Replicas

   The number of replicas needed for a data item may change due to
   changes in either the A or H values.
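   The per-node availability model above (total session time divided by
   membership lifetime) can be sketched as follows.  The function name
   and the interval representation are illustrative assumptions; the
   draft does not mandate any particular measurement procedure:

```python
def node_availability(sessions):
    """Availability of one node under the model of [3].

    `sessions` is a chronological list of (join_time, leave_time)
    pairs covering one membership lifetime, i.e. every gap between
    sessions is shorter than the membership timeout Tau.  The result
    is the fraction of the lifetime the node was reachable.
    """
    up_time = sum(end - start for start, end in sessions)
    lifetime = sessions[-1][1] - sessions[0][0]
    return up_time / lifetime

# A node online for two 4-hour sessions over a 10-hour lifetime:
print(node_availability([(0, 4), (6, 10)]))  # -> 0.8
```

   Averaging this quantity over all nodes yields the H value used by
   the replication algorithm.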
   The data owner is responsible for dealing with any changes to the
   number of replicas.  If the availability needed for a particular
   data item changes, existing replicas may be removed or new replicas
   added by the node.  The average availability of a node is not
   expected to be a volatile number; however, it is conceivable that it
   may change occasionally.  In such a case, the replicas for a data
   item may be re-determined any time the data is refreshed.
   Alternately, the data owner may actively change the number of
   replicas when there is a change in the H value.

3.2.  Determining Replica Locations on the Overlay

   Using the value of H for an overlay network and the application-
   specified value of A, the number of replicas k can be determined.
   The next issue is where to store the k replicas of the data item.
   The scheme proposed for determining the k nodes to replicate on is
   as follows.

   Each chosen replica node MUST replicate the data item on its
   successor (in Chord).  This allows for automatic recovery from
   transient failures when a node suddenly leaves, since requests for
   the data are automatically routed to the successor.  Thus the
   problem reduces to choosing k/2 storing nodes, each of which also
   replicates on its successor.  We propose to take the resource name
   of the data item being stored and append well-known modifiers such
   as "replica1", "replica2", "replica3", and so on, generating
   k/2 - 1 additional resource ids for the same data item.  Thus, if an
   object with the resource name "alice@exampledht.org" is being
   replicated and the k value determined is 6, the resource id
   generated by an overlay-specific hash of the primary resource name
   accounts for 2 replicas (the node responsible for that resource id
   and the successor of that node).
   Subsequently, resource ids MUST be generated by hashing
   "alice@exampledht.org:replica1" and "alice@exampledht.org:replica2",
   and the data item is stored under those 2 resource ids to create the
   remaining 4 replicas.

   The benefits of this method of replica selection are:

   o  The cost of replication is borne by the owner of the data.

   o  A random node in the overlay knows where to expect a replica of
      the data item to reside, and can look for a replica when a query
      to the original resource id fails.

   o  For certain applications, it may be necessary to optimize lookup
      time for a resource id; this scheme allows any node in the
      overlay to perform parallel lookups over all the resource ids of
      an object and its replicas.

   Note that when a node (say A) suddenly leaves the network, the node
   following it (say B) becomes the primary node for handling queries
   for the items stored on A.  The new data that B obtained due to A's
   departure then needs to be propagated to B's successor to maintain
   continuity.

4.  Storing Replicas

   A Store request for a replica MUST indicate that the data being
   stored is a replica and not the primary resource.  It does so by
   including the replica information in the Store request message: for
   a replica, it sets the IsReplica value to "TRUE" and provides the
   replica instance of the data being stored.

   struct {
      boolean  IsReplica;
      uint8    ReplicaInstance;
   } ReplicaInfo;

   The modified StoreReq message is as follows:

   struct {
      ResourceId     resource;
      uint8          replica_number;
      StoreKindData  kind_data<0..2^32-1>;
      ReplicaInfo    replica_info;
   } StoreReq;

   ToDo: Resolve the terminology clash between RELOAD's "replica" and
   the replica used in this draft.  RELOAD uses "replica_number" to
   indicate a replica stored on a successor - I'm not sure why this is
   needed, but it causes confusion to use the same terminology.

   The storing node MUST verify authorization of the replica store as
   described below.

4.1.  Data Authorization for Replicas

   Nodes must be able to verify authorization on the resource name
   after excluding the replication-specific string.  Hence, the
   authorization semantics for replicas are identical to those required
   for the particular kind id of the data being stored.  For example,
   if the data being stored requires authorization in which the
   resource id must equal the hash of the user name, the storing node
   of a replica MUST ensure that the hash of (user name || ":replica"
   || replica_instance), where || denotes concatenation, equals the
   resource id.  Here, "replica" is simply that string, and
   replica_instance can be determined from the Store request.  Hence,
   the replication model requires no changes in how authorization is
   applied for the various kind ids.

4.2.  Replica Consistency

   Maintaining consistency across replicas is the responsibility of
   the data owner.  There are no provisions in the replication
   mechanism to synchronize across replicas.  Any such synchronization
   mechanism would require more complexity at the storing node and is
   highly undesirable.

5.  Retrieving Replicas

   Applications that have knowledge of the availability of the stored
   data items can make use of it to retrieve data from the closest
   replica.  Given that the locations of these replicas are
   deterministic, nodes can directly issue a Fetch request to one of
   the replica owners.  This also effectively reduces the load on the
   primary resource owner.  Nodes may determine which replica to
   retrieve based on any information about the topological proximity of
   the nodes; alternately, they may determine this based on proximity
   in the overlay identity space.  The primary resource owner may also
   redirect a fetch request to other replicas when it is highly loaded;
   the behavior for such a redirect is not defined in this version of
   the document.
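   The deterministic replica naming of Section 3.2 and the
   authorization check of Section 4.1 can be sketched as follows.
   SHA-1 stands in for the overlay-specific hash, and the function
   names are hypothetical; this is an illustration of the scheme, not
   normative protocol code:

```python
import hashlib

def resource_id(resource_name: str) -> bytes:
    """Overlay-specific hash of a resource name (SHA-1 assumed here)."""
    return hashlib.sha1(resource_name.encode()).digest()

def replica_ids(resource_name: str, k: int) -> list[bytes]:
    """All k/2 resource ids for an item replicated k times.

    Each id accounts for two replicas (the responsible node and its
    successor), so the primary id plus k/2 - 1 ids derived from the
    ":replicaN" modifiers cover all k replicas.
    """
    ids = [resource_id(resource_name)]
    for i in range(1, k // 2):
        ids.append(resource_id(f"{resource_name}:replica{i}"))
    return ids

def authorized(stored_id: bytes, user_name: str, replica_instance: int) -> bool:
    """Section 4.1 check: the stored id must equal the hash of the user
    name with the replica-specific suffix re-appended (instance 0
    denotes the primary resource)."""
    name = user_name if replica_instance == 0 else f"{user_name}:replica{replica_instance}"
    return stored_id == resource_id(name)

ids = replica_ids("alice@exampledht.org", 6)
print(len(ids))                                       # -> 3 ids, 6 replicas
print(authorized(ids[2], "alice@exampledht.org", 2))  # -> True
```

   Because any node can recompute these ids, a reader can issue
   parallel Fetch requests to all replica locations, as noted in
   Section 3.2.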
6.  Configuration Document Extension

   Inside each "configuration" element, we propose to add an "H-value"
   element that defines the H value (a floating point number) for a
   given overlay and drives the replication algorithm.  The enrollment
   server may choose to update the H-value in a revision of the
   document, based on overlay operational information that may be
   obtained from diagnostics.

7.  Security Considerations

   TBD

8.  IANA Considerations

   TBD

9.  Acknowledgments

   TBD

10.  References

10.1.  Normative References

   [1]  Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and H.
        Schulzrinne, "REsource LOcation And Discovery (RELOAD) Base
        Protocol", draft-ietf-p2psip-base-02 (work in progress),
        March 2009.

   [2]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

10.2.  Informative References

   [3]  Rodrigues, R. and B. Liskov, "High Availability in DHTs:
        Erasure Coding vs. Replication", Peer-to-Peer Systems IV, 4th
        International Workshop (IPTPS 2005), 2005.

Authors' Addresses

   Vidya Narayanan
   Qualcomm, Inc.
   5775 Morehouse Dr
   San Diego, CA
   USA

   Phone: +1 858-845-2483
   Email: vidyan@qualcomm.com

   Saumitra M. Das
   Qualcomm, Inc.
   3195 Kifer Road
   Santa Clara, CA
   USA

   Phone: +1 408-533-9529
   Email: saumitra@qualcomm.com

   Ashwin Swaminathan
   Qualcomm, Inc.
   5775 Morehouse Dr
   San Diego, CA
   USA

   Phone: +1 858-845-8775
   Email: sashwin@qualcomm.com