Network Working Group                                       V. Narayanan
Internet-Draft                                                    S. Das
Intended status: Standards Track                          A. Swaminathan
Expires: January 1, 2010                                  Qualcomm, Inc.
                                                           June 30, 2009

                  Deterministic Replication for RELOAD
                   draft-vidya-p2psip-replication-00

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 1, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before
   November 10, 2008.  The person(s) controlling the copyright in some
   of this material may not have granted the IETF Trust the right to
   allow modifications of such material outside the IETF Standards
   Process.

Narayanan, et al.        Expires January 1, 2010                [Page 1]

Internet-Draft            Replication for RELOAD               June 2009
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Abstract

   RELOAD [1] provides successor replication of data to protect against
   data loss caused by churn on the overlay.  It specifies storing two
   redundant copies of data on the two immediate successors of a
   particular node.  This provides basic replication and is essential
   to the stability of the overlay.  However, it does not address
   replicating data to meet the availability requirements for a
   particular piece of data, or making the replicas useful for load
   balancing on the overlay.  This document specifies a mechanism that
   provides application-agnostic, deterministic replication on the
   overlay to meet these needs.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Replication Mechanism
     3.1.  Determining Number of Replicas
       3.1.1.  Changes to Number of Replicas
     3.2.  Determining Replica Locations on the Overlay
   4.  Storing Replicas
     4.1.  Data Authorization for Replicas
     4.2.  Replica Consistency
   5.  Retrieving Replicas
   6.  Configuration Document Extension
   7.  Security Considerations
   8.  IANA Considerations
   9.  Acknowledgments
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   Replication is a basic mechanism for providing useful services on a
   peer-to-peer overlay, and is necessary for robust and available data
   discovery.  If data is stored only once in the overlay, it is
   susceptible to loss when the node on which it was stored leaves the
   network.  In a peer-to-peer network, churn (nodes joining and
   leaving the overlay due to mobility, devices being turned on and
   off, and so on) is common, and stored data items need sufficient
   robustness against loss due to churn.

   Replication on successors, as done in [1], provides some basic
   protection against data loss due to churn, but it does not take any
   availability metrics of the data itself into account.  Since the
   availability requirements of different types of data may differ, it
   is not feasible to account for them generically in successor
   replication.  Further, successor replication does not itself
   contribute to load balancing of data retrieval, since the resource
   id owner remains responsible for handling all requests for the data
   item.

   This document describes a scheme that accommodates churn and makes
   data fetches more robust by replicating data items over a certain
   number of nodes in the overlay, taking an availability metric into
   account.  The aim of the replication scheme is to let application
   designers store and retrieve data with some basic guarantees.
   This design inherently provides some load balancing of data
   retrieval as well, thereby reducing the load on the primary storing
   node.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119 [2].

3.  Replication Mechanism

   This section describes a basic replication mechanism for RELOAD.
   The goal of this simple scheme is to provide basic availability for
   important data items stored in the overlay.  The mechanism is meant
   to deal only with data items of a size supported by the overlay;
   RELOAD defines a size restriction for data stored on the overlay,
   specified by the overlay configuration document.  Dealing with
   larger data items would require more complicated mechanisms for
   consistency and concurrent access, as well as system-wide management
   of the bandwidth used to maintain replicas; such mechanisms are out
   of scope for this design.

   The basic mechanism proposed here takes a target availability A for
   the data item being stored and determines the number of replicas k
   required to attain that target availability, given the overlay
   network characteristics.  It then defines which k nodes should be
   used to store these replicas.  The scheme is based on research in
   [3].

3.1.  Determining Number of Replicas

   The number of replicas depends on two factors: (1) the target
   availability for the data item (A) and (2) the average node
   availability in the overlay network (H).  The higher the value of A,
   the larger the number of replicas that is likely to be required; the
   higher the value of H, the smaller that number is likely to be.  The
   replication mechanism takes values of A and H and finds a number of
   replicas k that satisfies the availability A.
   The relationship between A, H, and k can be characterized by the
   following equations:

      A = 1 - (1 - H)^k

      k = log(1 - A) / log(1 - H)

   The first equation characterizes the availability of an object as
   the probability that at least one replica of the object survives.
   Since H is the average host availability, the probability that a
   particular replica fails is 1 - H, and (1 - H)^k is the probability
   that all k replicas have failed.  This assumes that replica failures
   are independent, which is a valid assumption since we aim to choose
   the replica locations randomly (as discussed in the next section).
   Subtracting this value from 1 gives the probability that at least
   one replica of the data item is still available.  The second
   equation is obtained by solving the first equation for k.

   Given this equation, the number of replicas is determined by the A
   level for a data item and the H level of the overlay on which the
   data item is going to be stored.  The A level is obtained from the
   application requesting to store a data item.  This value is
   typically expected to be specified as a number of nines; for
   example, 3 nines of availability means that A is 0.999.  Studies on
   Internet availability have shown that it can be on the order of 3
   nines, and hence it makes sense to use a value of A lower than or
   equal to 3 nines.  However, we do not limit the values of A in this
   document.

   Determining the H value of an overlay network is a bigger challenge.
   Essentially, H is a number characterizing the average node dynamics
   in the overlay network.  H is the average node availability on the
   overlay and hence depends on measuring the availability of each node
   in the overlay.  We propose to measure the availability of a single
   node in the overlay using a model proposed in [3].
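   The relationship between A, H, and k above can be sketched in a few
   lines of code.  This is an illustrative sketch only; the function
   name is hypothetical and not part of RELOAD:

```python
import math

def replicas_needed(target_availability: float, host_availability: float) -> int:
    """Smallest integer k such that 1 - (1 - H)^k >= A.

    Follows the draft's model, which assumes independent replica
    failures.  A is the target availability, H the average node
    availability.
    """
    if not (0.0 < target_availability < 1.0 and 0.0 < host_availability < 1.0):
        raise ValueError("A and H must lie strictly between 0 and 1")
    k = math.log(1.0 - target_availability) / math.log(1.0 - host_availability)
    return math.ceil(k)

# 3 nines of availability on an end-user p2p overlay (H = 0.5):
print(replicas_needed(0.999, 0.5))   # -> 10
# The same target on an enterprise overlay (H = 0.85):
print(replicas_needed(0.999, 0.85))  # -> 4
```

   As the example shows, the replica count is driven entirely by A and
   H, not by the size of the overlay.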
   Each node has a membership lifetime, which is the time from when the
   node first joins the overlay until it leaves the overlay
   permanently.  During its membership lifetime, a node may have many
   sessions, with periods of disconnection due to mobility or to the
   node being turned on and off.  The node availability is the fraction
   of time the node is reachable, i.e., the sum of all its session
   times divided by its overall membership lifetime.  To distinguish
   temporary departure from permanent departure, the model uses a
   membership timeout: if a node is unreachable for more than the
   membership timeout Tau, it is assumed to have permanently left.

   Studies show that the H value can vary based on the type of overlay
   being considered.  Thus, the number of replicas k needed for 2 nines
   of availability would be higher for a p2p overlay in the "wild" than
   for an enterprise overlay network, for instance.  Based on studies
   done in this field, we propose that a static H value be configured
   for a network based on the type of overlay: 0.99 for managed
   distributed systems, 0.85 for enterprise overlays, and 0.5 for
   end-user based p2p overlays.  We also propose that the H value for
   local area overlays initially be 0.5 as well, because of the high
   churn expected due to mobility in such networks.  The H value must
   be specified in the overlay configuration document for use by all
   the nodes.

   Active measurement of the H value is possible, but is left for
   further study.  Nodes could conceivably arrive at their own
   determination of the value of H based on knowledge of their
   neighbors and fingers; however, this is not specified in this
   version of the document.  Note that the amount of replication does
   not depend on the number of nodes in the network, but on the
   dynamics of the network and the availability required.

3.1.1.  Changes to Number of Replicas

   The number of replicas needed for a data item may change due to
   changes in either the A or H values.
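   The per-node availability model above (total session time divided by
   membership lifetime) can be sketched as follows.  The function name
   and the interval representation are illustrative assumptions; the
   draft does not mandate any particular measurement procedure:

```python
def node_availability(sessions):
    """Availability of one node under the model of [3].

    `sessions` is a chronological list of (join_time, leave_time)
    pairs covering one membership lifetime, i.e. every gap between
    sessions is shorter than the membership timeout Tau.  The result
    is the fraction of the lifetime the node was reachable.
    """
    up_time = sum(end - start for start, end in sessions)
    lifetime = sessions[-1][1] - sessions[0][0]
    return up_time / lifetime

# A node online for two 4-hour sessions over a 10-hour lifetime:
print(node_availability([(0, 4), (6, 10)]))  # -> 0.8
```

   Averaging this quantity over all nodes yields the H value used by
   the replication algorithm.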
   The data owner is responsible for dealing with any changes to the
   number of replicas.  If the availability needed for a particular
   data item changes, existing replicas may be removed or new replicas
   added by the node.  The average availability of a node is not
   expected to be a volatile number; however, it is conceivable that it
   may change occasionally.  In such a case, the replicas for a data
   item may be re-determined any time the data is refreshed.
   Alternately, the data owner may actively change the number of
   replicas when there is a change in the H value.

3.2.  Determining Replica Locations on the Overlay

   Using the value of H for an overlay network and the application-
   specified value of A, the number of replicas k can be determined.
   The next issue is where to store the k replicas of the data item.
   The scheme proposed for determining the k nodes to replicate on is
   as follows.

   Each chosen replica node MUST replicate the data item on its
   successor (in Chord).  This allows for automatic recovery from
   transient failures when a node suddenly leaves, since requests for
   the data are automatically routed to the successor.  Thus the
   problem reduces to choosing k/2 storing nodes, each of which also
   replicates on its successor.  We propose to take the resource name
   of the data item being stored and append well-known modifiers such
   as "replica1", "replica2", "replica3", and so on, generating
   k/2 - 1 additional resource ids for the same data item.  Thus, if an
   object with the resource name "alice@exampledht.org" is being
   replicated and the k value determined is 6, the resource id
   generated by an overlay-specific hash of the primary resource name
   accounts for 2 replicas (the node responsible for that resource id
   and the successor of that node).
   Subsequently, resource ids MUST be generated by hashing
   "alice@exampledht.org:replica1" and "alice@exampledht.org:replica2",
   and the data item is stored under those 2 resource ids to create the
   remaining 4 replicas.

   The benefits of this method of replica selection are:

   o  The cost of replication is borne by the owner of the data.

   o  A random node in the overlay knows where to expect a replica of
      the data item to reside, and can look for a replica when a query
      to the original resource id fails.

   o  For certain applications, it may be necessary to optimize lookup
      time for a resource id; this scheme allows any node in the
      overlay to perform parallel lookups over all the resource ids of
      an object and its replicas.

   Note that when a node (say A) suddenly leaves the network, the node
   following it (say B) becomes the primary node for handling queries
   for the items stored on A.  The new data that B obtained due to A's
   departure then needs to be propagated to B's successor to maintain
   continuity.

4.  Storing Replicas

   A Store request for a replica MUST indicate that the data being
   stored is a replica and not the primary resource.  It does so by
   including the replica information in the Store request message: for
   a replica, it sets the IsReplica value to "TRUE" and provides the
   replica instance of the data being stored.

   struct {
      boolean  IsReplica;
      uint8    ReplicaInstance;
   } ReplicaInfo;

   The modified StoreReq message is as follows:

   struct {
      ResourceId     resource;
      uint8          replica_number;
      StoreKindData  kind_data<0..2^32-1>;
      ReplicaInfo    replica_info;
   } StoreReq;

   ToDo: Resolve the terminology clash between RELOAD's "replica" and
   the replica used in this draft.  RELOAD uses "replica_number" to
   indicate a replica stored on a successor - I'm not sure why this is
   needed, but it causes confusion to use the same terminology.

   The storing node MUST verify authorization of the replica store as
   described below.

4.1.  Data Authorization for Replicas

   Nodes must be able to verify authorization on the resource name
   after excluding the replication-specific string.  Hence, the
   authorization semantics for replicas are identical to those required
   for the particular kind id of the data being stored.  For example,
   if the data being stored requires authorization in which the
   resource id must equal the hash of the user name, the storing node
   of a replica MUST ensure that the hash of (user name || ":replica"
   || replica_instance), where || denotes concatenation, equals the
   resource id.  Here, "replica" is simply that string, and
   replica_instance can be determined from the Store request.  Hence,
   the replication model requires no changes in how authorization is
   applied for the various kind ids.

4.2.  Replica Consistency

   Maintaining consistency across replicas is the responsibility of
   the data owner.  There are no provisions in the replication
   mechanism to synchronize across replicas.  Any such synchronization
   mechanism would require more complexity at the storing node and is
   highly undesirable.

5.  Retrieving Replicas

   Applications that have knowledge of the availability of the stored
   data items can make use of it to retrieve data from the closest
   replica.  Given that the locations of these replicas are
   deterministic, nodes can directly issue a Fetch request to one of
   the replica owners.  This also effectively reduces the load on the
   primary resource owner.  Nodes may determine which replica to
   retrieve based on any information about the topological proximity of
   the nodes; alternately, they may determine this based on proximity
   in the overlay identity space.  The primary resource owner may also
   redirect a fetch request to other replicas when it is highly loaded;
   the behavior for such a redirect is not defined in this version of
   the document.
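   The deterministic replica naming of Section 3.2 and the
   authorization check of Section 4.1 can be sketched as follows.
   SHA-1 stands in for the overlay-specific hash, and the function
   names are hypothetical; this is an illustration of the scheme, not
   normative protocol code:

```python
import hashlib

def resource_id(resource_name: str) -> bytes:
    """Overlay-specific hash of a resource name (SHA-1 assumed here)."""
    return hashlib.sha1(resource_name.encode()).digest()

def replica_ids(resource_name: str, k: int) -> list[bytes]:
    """All k/2 resource ids for an item replicated k times.

    Each id accounts for two replicas (the responsible node and its
    successor), so the primary id plus k/2 - 1 ids derived from the
    ":replicaN" modifiers cover all k replicas.
    """
    ids = [resource_id(resource_name)]
    for i in range(1, k // 2):
        ids.append(resource_id(f"{resource_name}:replica{i}"))
    return ids

def authorized(stored_id: bytes, user_name: str, replica_instance: int) -> bool:
    """Section 4.1 check: the stored id must equal the hash of the user
    name with the replica-specific suffix re-appended (instance 0
    denotes the primary resource)."""
    name = user_name if replica_instance == 0 else f"{user_name}:replica{replica_instance}"
    return stored_id == resource_id(name)

ids = replica_ids("alice@exampledht.org", 6)
print(len(ids))                                       # -> 3 ids, 6 replicas
print(authorized(ids[2], "alice@exampledht.org", 2))  # -> True
```

   Because any node can recompute these ids, a reader can issue
   parallel Fetch requests to all replica locations, as noted in
   Section 3.2.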
6.  Configuration Document Extension

   Inside each "configuration" element, we propose to add an "H-value"
   element that defines the H value (a floating point number) for a
   given overlay and drives the replication algorithm.  The enrollment
   server may choose to update the H-value in a revision of the
   document, based on overlay operational information that may be
   obtained from diagnostics.

7.  Security Considerations

   TBD

8.  IANA Considerations

   TBD

9.  Acknowledgments

   TBD

10.  References

10.1.  Normative References

   [1]  Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and H.
        Schulzrinne, "REsource LOcation And Discovery (RELOAD) Base
        Protocol", draft-ietf-p2psip-base-02 (work in progress),
        March 2009.

   [2]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

10.2.  Informative References

   [3]  Rodrigues, R. and B. Liskov, "High Availability in DHTs:
        Erasure Coding vs. Replication", Peer-to-Peer Systems IV, 4th
        International Workshop (IPTPS 2005), 2005.

Authors' Addresses

   Vidya Narayanan
   Qualcomm, Inc.
   5775 Morehouse Dr
   San Diego, CA
   USA

   Phone: +1 858-845-2483
   Email: vidyan@qualcomm.com

   Saumitra M. Das
   Qualcomm, Inc.
   3195 Kifer Road
   Santa Clara, CA
   USA

   Phone: +1 408-533-9529
   Email: saumitra@qualcomm.com

   Ashwin Swaminathan
   Qualcomm, Inc.
   5775 Morehouse Dr
   San Diego, CA
   USA

   Phone: +1 858-845-8775
   Email: sashwin@qualcomm.com