Network Working Group J. Lei Internet-Draft X. Fu Expires: December 22, 2006 X. Yang D. Hogrefe Univ. Goettingen June 20, 2006 DMMP: Dynamic Mesh-based Overlay Multicast Protocol draft-lei-samrg-dmmp-00.txt Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on December 22, 2006. Copyright Notice Copyright (C) The Internet Society (2006). Abstract This document describes a Dynamic Mesh-based overlay Multicast Protocol (DMMP) to support multicast data delivery applications without relying on classic IP multicast, including multicast group management, overlay hierarchy establishment, multicast tree construction and data forwarding scheme from the source to a number of receivers. The DMMP framework builds on control plane functions Lei, et al. Expires December 22, 2006 [Page 1] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 which dynamically manage an overlay core and a multicast tree layer. The key idea is a small number of selected end hosts self-organizing into an overlay mesh, and dynamically maintaining such a mesh. Based on the constructed mesh, some core-based clusters are built with capacity-aware trees inside. Then, a multicast tree consists of DMMP-aware end hosts (and/or specific routers) is built on the top of the overlay core for efficient delivery of the multicast data. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Features of DMMP . . . . . . . . . . . . . . . . . . . . . . . 4 3. Terminology and Abbreviations . . . . . . . . . . . . . . . . 6 4. DMMP Overview . . . . . . . . . . . . . . . . . . . . . . . . 7 4.1. Control plane in DMMP . . . . . . . . . . . . . . . . . . 8 4.2. Data plane in DMMP . . . . . . . . . . . . . . . . . . . . 11 5. DMMP Messages . . . . . . . . . . . . . . . . . . . . . . . . 12 6. DMMP: Protocol Details . . . . . . . . . . . . . . . . . . . . 15 6.1. Initialization . . . . . . . . . . . . . . . . . . . . . . 15 6.2. Super Node Selection . . . . . . . . . . . . . . . . . . . 16 6.3. Member Join . . . . . . . . . . . . . . . . . . . . . . . 17 6.4. Data Delivery Control . . . . . . . . . . . . . . . . . . 18 6.5. Refresh Information . . . . . . . . . . . . . . . . . . . 19 6.6. Capacity specification . . . . . . . . . . . . . . . . . . 19 6.7. Member Leave . . . . . . . . . . . . . . . . . . . . . . . 20 6.8. Failure Recovery . . . . . . . . . . . . . . . . . . . . . 21 7. Security Considerations . . . . . . . . . . . . . . . . . . . 22 8. Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . 22 9. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 23 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 23 11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 23 11.1. Normative References . . . . . . . . . . . . . . . . . . . 23 11.2. Informative References . . . . . . . . . . . . . . . . . . 23 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 25 Intellectual Property and Copyright Statements . . . . . . . . . . 26 Lei, et al. Expires December 22, 2006 [Page 2] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 1. Introduction Until now, a wealth of research has been focused on moving multicast support out of the network core since the deployment of IP multicast has met many difficulties(i.e. technical and marketing reasons). Such research includes ESM [3], NICE [4], OMNI [5], and TOMA [6]. Among them, ESM can be only deployed well into small or medium-sized groups [7]. In ESM, all end hosts join the overlay construction and multiple connections exist between any two nodes. The main advantages of constructing a full mesh are the easy deployment and being relatively stable, e.g. it can recover quickly from faults. Unfortunately, ESM causes very high control overhead because each node must maintain the information of the whole group or at least most of the group members. In addition, ESM does not consider the heterogeneities of end hosts, such as different computation power, available bandwidth and access possibility. Comparatively, NICE introduces a hierarchical management scheme that each member must lay at the lowest layer and a distribution clustering protocol at each layer partitions these members into a set of clusters. Only one node of each cluster can be elected as the leader to join into the next higher layer. The layered design simplifies the application level multicast and helps it scale better. Nevertheless, the joining procedure in NICE must estimates end-to-end latency from the top layer to the lowest layer resulting a very high control overhead, which not only prolongs the packet delivery, but also is likely vulnerable to single node failure (e.g. failure caused by the node at the highest layer). The reason why application layer multicast solutions are less efficient is that the overlay topology is "randomly" connected without considerations on the underlying network. Therefore, some proxy-based overlay multicast approaches such as OMNI and TOMA have been proposed as alternatives. In some sense, OMNI and TOMA "play a game" between IP multicast and application layer multicast. They mainly depend on several strategically deployed intermediate nodes like proxies and Multicast Service Nodes (MSNs). These nodes are set as comparatively stable nodes to implement the functionalities similar to IP multicast routers to obtain and utilize the knowledge about underlying network topology. On the one hand, they employ these fixed nodes or long- term nodes to simplify membership management and multicast tree construction. This advantage can become a weakness, too, since the assumption of these fixed nodes limits the extensibility of deployment. For example, the construction of overlay network in TOMA is based on long-term measurement of user requests, not constructed on-demand. On-demand overlays are likely to have good performance because they are built upon the current network conditions, while predefined overlays may have degraded performance even if they can be Lei, et al. Expires December 22, 2006 [Page 3] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 correctly maintained. On the other hand, the locations of these intermediate nodes and the overlay links among them are crucial to the overlay construction and maintenance. However, it is not realistic to conjecture the locations of newly joining group members so that the locations of fixed nodes may not be adaptive to dynamic changes in network traffic and overlay topology. Different from these protocols, this draft describes a Dynamic Mesh- based overlay Multicast Protocol (DMMP), supporting large-scale groups without relying on any predetermined intermediate nodes. The overlay multicast tree is solely constructed by end hosts, optimizing both the bandwidth and the delay. Firstly, DMMP constructs an on- demand overlay core by which it could achieve the optimal performance. Secondly, DMMP distributes the burden of group management and data delivery to a few nodes instead of the source. Thirdly, the self-organizing protocol scales well to at least thousands of nodes without sacrificing the quality of the overlay network. Fourthly, DMMP achieves scalability by limiting group management within locality so that it can dramatically reduce the overhead and complexity of the overlay maintenance. 2. Features of DMMP DMMP is organized in a two-level hierarchy and the mechanisms of DMMP are introduced to manage and maintain the hierarchy dynamically. The key idea behind DMMP is to let a few end hosts selected during the multicast initialization phrase and also when group member changes self-organize into an overlay mesh, and dynamically maintain such a mesh. Although routers may also be manually designated (e.g. by ISPs) to construct the mesh, this document initially discusses the approach via end hosts. Generally, DMMP has the following features: o Support end hosts with heterogeneity -- It is particularly well- suited to a heterogeneous environment like the Internet when the upper level nodes in the hierarchy are chosen carefully. In DMMP, it is possible that only a small number of high-capacity end hosts are selected to construct the overlay mesh when there are a large proportion of free-riders which have no possibility to provide extra support for other members. o Dynamic mesh-based approach -- Tree-based overlays are regarded as the most efficient solution for data distribution in a stable network. Nevertheless, lacking of redundant links, an essential characteristic of the tree structure, is not efficient for dynamic scenarios. To address the inherent fragility of trees in dynamic networks, an overlay mesh which connects nodes by multiple paths is introduced in this draft to support efficient and reliable applications. Lei, et al. Expires December 22, 2006 [Page 4] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 o Efficient data distribution tree -- Because overlay multicast networks are built on top of a general unicast infrastructure, the problem of efficiently managing network resources is somehow different from that of networks that have their own links, depending on how they are best configured and operated. By distributing these responsibilities to mesh members, an efficient data delivery tree is constructed upon the mesh. o Adaptive and resilient to dynamic network changes -- The transient nature of the end hosts introduces some problems into reliable services in an overlay multicast tree. The ungraceful departure of nodes or unexpected network failures may result in data outages on the downstream nodes or even crashes to the whole multicast system. Nevertheless, DMMP is robust to handle these problems by preventing incapable or transient nodes from staying at the center of the multicast tree. Hence, the failure of a single node may result only in a transient instability in a small subset of participants, but no single-node failure would lead to a catastrophe in any part of the overlay multicast tree. It can be argued that pure tree structure has fundamental limitations both for high bandwidth multicast and for high reliability. Meanwhile, one difficulty with trees is that bandwidth will be monotonically declined along the tree; for instance, a member in OMNI receives data only from its upstream node and the data reception rate of this member can not be greater than its upstream node. It is even more difficult in the core network where each MSN is responsible for data delivery to a whole cluster. Any data loss high up the tree will reduce the bandwidth available to receivers lower down the tree. Moreover, a tree is less robust than a mesh in that a single node failure or a loop can partition the tree and disable communications among the members. To increase the throughput of the whole overlay network, the dynamic mesh as one of multiple forwarding solutions is introduced in this draft. Such a mesh will allows the reception of packets from multiple nodes rather than from only the upstream node. Specifically, two phrases are involved in the DMMP architecture: (1) An overlay hierarchy is established for group management and multicast tree configuration; (2) Based on the structured mesh, several clusters are shaped to connect with mesh members, namely, super nodes. Network situation changes (such as multicast members joining/leaving) within a local cluster will not have any impact on other clusters. In this way, the total control overhead can be alleviated. Basically, a source-based DMMP architecture consists of a sender, several receivers, one or many Rendezvous Points (RPs) and Dynamic Name Servers (DNSs). Compared with existing application level multicast approaches, DMMP is designed to be more robust to support a large-scale group without relying on predetermined intermediate nodes in the network and may get better performance. Lei, et al. Expires December 22, 2006 [Page 5] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 DMMP now supports source-specific multicast [1] and any-source multicast is for the future study. 3. Terminology and Abbreviations The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL", in this document are to be interpreted as described in BCP 14, RFC 2119 [2]. Other terminologies and abbreviations used in this document are used as follows: o Overlay Multicast - A multicast data delivery scheme depending on end hosts to form an overlay core for message control and a multicast tree for data delivery. o Rendezvous Point (RP) - It can be a server or proxy to assist managing group members and to store some required information. o Source - The sender could be a video stored server or some video distributed servers in one service domain, which delivers the data traffic to the source-based multicast group members; DMMP in this document only provides the source-specific mechanism to realize the single source-based Overlay Multicast. o Super nodes - Some end hosts are chosen to manage the multicast group and relay data from the mesh to receivers within clusters. Currently, only end hosts can serve as super nodes; future version of this document may specify the case when some routers (e.g., first-hop routers) are used as super nodes. o Receivers - They are all end hosts or group members as well, who want to receive the data from the source. o Mesh - Some selected super nodes organize themselves into an overlay core which is responsible for group member management and multicast tree configuration. o Clusters - Based on a selected super node, end hosts organize themselves into a core-based multicast tree [8] within each cluster. o Stress - The number of identical copies of a packet carried by a physical link called the stress of a physical link. o Out-degree - Residual degree, namely, the residual number of connections that a node can establish. o Uptime - The time duration from a node joining in the multicast session to its leaving the multicast session. Within each cluster, there are some terminologies and abbreviations which are used as follows: o Parent - the direct upstream node of a node is called the parent of that node, e.g. end host 1.2 is the parent of end host 1.2.1 in Figure 1. Lei, et al. Expires December 22, 2006 [Page 6] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 o Parent level nodes (PLN)- nodes (exclusive parent) at the same level as the parent of some node, e.g. 1.1 and 1.3 are parent level nodes of 1.2.1 in Figure 1. o Child - the direct downstream node of some node, e.g. in Figure 1, 1.2.1 is the child of 1.2. o Children level nodes (CLN)- nodes (exclusive children) at the same level as children of some node, e.g. 1.1.1, 1.1.2, 1.3.1 are children level nodes of 1.2 in Figure 1. o Siblings - nodes at the same level of a node are called siblings, e.g. in Figure 1, 1.1.1, 1.1.2, 1.2.2, 1.2.3 and 1.3.1 are siblings of 1.2.1. /-------------------------------\ / 1.1.1 1.1.1.1 | / 1.1 / End host- End host | / End host 1.1.2 | / / \ End host | / / 1.2.1 | / / / End host 1.2.2.1 | / / / End host | / / 1.2 / 1.2.2 / | Super node--- End host -- End host | \ \ \ \ End host | \ \ \ 1.2.3 1.2.2.2 | \ \ \ End host | \ \ | \ \ 1.3 1.3.1 | \ End host - End host | \ | \--------------------------------/ Cluster Figure 1: An example of local Cluster 4. DMMP Overview DMMP is an application level protocol used for realizing and managing the overlay multicast data delivery. Specifically, the framework consists of two types of functionalities: control plane and data plane. The control plane consists of one overlay mesh and some core- based clusters. The source and a set of super nodes construct themselves into an overlay mesh, based on which each super node supervises one cluster. During the mesh construction, the selection of the super nodes is to ensure that a newly joining member is able to quickly find its appropriate position in the multicast tree using a very small number of queries such as join message. Data plane is, Lei, et al. Expires December 22, 2006 [Page 7] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 then, built on the top of the structured overlay hierarchy. In DMMP, a respective data channel between two entities is established by exploiting the existing protocol stacks such as UDP/IP or TCP/IP. The data channels utilize IP unicast according to the underlying IP transport scheme. 4.1. Control plane in DMMP In this source-specific overlay multicast, control plane is mainly in charge of controlling the overlay hierarchy and completing the multicast tree configuration. Additionally, available bandwidth [9] is chosen as the first optimal metric because the DMMP mesh is mainly designed to real-time media streaming applications. However, media streaming are bandwidth-constraint services and available bandwidth resources possessed by a multicast group may be insufficient. It is reason why DMMP needs to consider the degree bound in streaming applications, which can be easily observed in the available bandwidth. For example, on the assumption that the bit rate of media is B and the outbound bandwidth of an end host i is b(i), the total number of connections it can establish is b(i)/B which is also the maximum degree of the end host. o Step 1: After initialization, RP will calculate the out-degree of each host and distribute them into two categories: leaf nodes (whose out-degree is less than 1) and non-leaf nodes. If one's maximum degree is less than 2, the end host can only perform as leaf node because it can only receive data from the incoming connection. o Step 2: The information of leaf nodes and non-leaf nodes are respectively stored at the RP. Meanwhile, all non-leaf nodes are placed in the order of out-degree and reported to the source. On receiving the list of ordered non-leaf nodes, the source selects a number of super nodes with higher capacities (i.e. out-degree), as defined in Section 6.2. The capacities of each super node are also stored at the source and the RP. o Step 3: After being selected, the super nodes organize themselves into a mesh with multiple overlay links between any two of them. DMMP prevents incapable hosts from staying at the center of multicast tree and hence super nodes which are able to and willing to make more contributions to the network are likely to get better performances. To be more efficient, the number of active super nodes is no more than one hundred, otherwise it may cause high control overhead and high stress [3]. Assuming that each super node can manage, in average, hundreds of cluster members, it is sufficient to support totally more than thousands of end hosts. Otherwise, multiple sources will be deployed into media streaming by multiple overlay multicast sessions. To balance the tradeoff between efficiency and Lei, et al. Expires December 22, 2006 [Page 8] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 reliability, it is reasonable to select a limited number of super nodes to construct the overlay mesh. In this framework, only super nodes need to keep the full knowledge among themselves so that mesh can quickly detect and recover from failures. Non-super nodes only need to keep the knowledge of a small part of the group within each cluster. Thus, the control overhead of the whole multicast tree could be reduced dramatically in contrast to that each member keeps the full knowledge of the entire group while preserving resiliency. Furthermore, it is possible to improve the overlay mesh performances by allowing dynamically adding and deleting links within the mesh, which will be clarified in a future version of this draft. Based on selected super nodes, some core-based clusters will be established to complete the overlay hierarchy. o Step 1: After constructing the overlay mesh, the next step is to form core-based clusters. Having received a list of super node candidates from the RP, each non-super node caches their capacities, e.g. out-degree, as part of the selection criteria, and measures the e2e latency to each of them. o Step 2: Each end host chooses one super node who can provide the best service according to the e2e latency, as the first contact point to join in the cluster. If there are multiple super nodes which can provide similar e2e latency for the node, one of them with higher out-degree will be chosen. o Step 3: Those non-super nodes sharing the same super node will form a cluster for message control and data delivery. Within each cluster, the super node selects no more than K (defined in 6.2) end hosts whose capacity is comparatively high, as its immediate children. This operation guarantees that the multicast tree within each cluster satisfies bandwidth constraints of media streaming applications. o Step 4: Afterwards, immediate children in each cluster choose some nodes with higher capacities (i.e. out-degree, e2e latency) as their children. This selection will expedite the convergence of the tree and alleviate the average latency in some senses. The iteration will continue until all cluster members are attached into the tree. When an end host joins in the overlay multicast tree, its uptime starts to calculate from 0 until its leaving. For the sake of resiliency, each node in the local cluster should keep some information of its relatives in its local cache. In this draft, these entities (parent, PLN, child, CLN and siblings) are denoted as relatives (see Section 2). Figure 2 illustrates a basic example for the overlay construction. Assuming that it is the first time to construct the overlay hierarchy, then: Lei, et al. Expires December 22, 2006 [Page 9] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 o Step 1: When obtaining the list of non-leaf members from the RP, the source will select a number of super nodes. After selection, super nodes self-organize into a mesh rooted at the source. o Step 2: Each non-super node first consults its local cache for super node candidates. If there are no suitable candidates, it queries the RP immediately. Then, the requestor caches these new received candidates, from which it chooses the best one based on the e2e latency. Those non-super nodes sharing the same super node will then form a local cluster. o Step 3: According to the super node's capacity, no more than K end hosts with larger out-degree are selected as its immediate children. Here, three end hosts, namely, 1.1, 1.2 and 1.3 are chosen as the immediate children of the super node. o Step 4: Once the capacity of the super node is exhausted, it responses to other requestors with its immediate children and an indication of rejection. For example, upon the receipt of rejections and a list of candidate parents (i.e. 1.1, 1.2 and 1.3) from the super node, end hosts 1.1.1, 1.1.2 and 1.3.1 send Join request to them. In this case, requestors with higher out-degree are likely to be selected as the children. If there are multiple acceptances, the end host attaches to the one which is "near" to it due to the e2e latency. For example, 1.1.1 and 1.1.2 accept the request and joins the cluster as children of 1.1. Then 1.1 will update the associated information to 1.2, 1.3 and to the super node after completing its children selection. o Step 5: If there are still some available children left, they will receive a rejection and need to re-send Join request to the children at the lower level, i.e. 1.1.1, 1.1.2 and 1.3.1. When all receivers confirm their positions in the cluster, the control plane is initially finished constructing for the overlay multicast group. Lei, et al. Expires December 22, 2006 [Page 10] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 Communication Source <========> Rendezvous Point | Channel | | ---------------\ v / 1.1.1 | /---------Super Node---------\ / 1.1 End host| | / \ \ | / End host/ | | / \ \ | / / \ 1.1.2 | v / \ \ v / / 1.2 End host| Cluster -> Super Node \ Super Node -- End host | | \ \ / | \ \ 1.3 1.3.1 | | \ \/ | \ End host-End host| v \ /\ v \-----------------/ Cluster -> Super Node \ /Super Node <- Cluster | \ \ / / | | \ \ / / | | \ \ / / | \-----------Super Node ------/ | | Cluster Figure 2: Control hierarchy in DMMP Although the objective of latency minimization is already taken into consideration in the multicast tree construction, the quality of application can be enhanced by self-improving mechanisms. Periodically executing local transformations as proposed in [10] can reduce the average latency of the cluster members. Compared with existing overlay multicast protocols, the overlay hierarchy of DMMP incurs a low delay penalty, makes effective use of network bandwidth, and considers the heterogeneities of end hosts during cluster construction. 4.2. Data plane in DMMP Based on the structured control plane in DMMP, data plane can be directly built on top of it. Corresponding to control plane, the data delivery tree can be divided into two parts. The first part, which is within the overlay mesh, is built through the reverse shortest path between each super node and the source, identical to DVMRP [11]. For example, a super node A receives the packet from the source through its neighbor B only if B is the next hop on the shortest path from A to the source. In this case, super node A will replicate and forward the packets directly to super node B. Likewise, B will replicate and forward the packet to the next super node as Lei, et al. Expires December 22, 2006 [Page 11] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 well. When receiving the data, the super node will replicate and forward the data to its direct children in the cluster. Therefore, data can be delivered throughout the whole cluster. The second part is data delivery within core-based clusters. In each cluster, data are firstly forwarded from the super node to its direct children. Then, the receivers will replicate the data and forward them to its children at the lower level. The whole iteration terminates with success if all receivers within each cluster receive the data. Figure 3 specified a simple example for data delivery within a DMMP cluster. As shown in the figure, data are firstly replicated into three copies, respectively delivered from the super node to its direct children 1.1, 1.2 and 1.3 using unicast. Similarly, 1.1 replicates copies of the data according to the number of their children (e.g. two copies), sending separately to 1.1.1 and 1.1.2. In the next iteration, the receiver will similarly make copies and deliver to its children. /-----------------------------\ / 1.1.1 | / 1.1 / End host | / End host 1.1.2 | / / \ End host | / / 1.2.1 | / / / End host | / / / | -----> / 1.2 / 1.2.2 1.2.2.1 | Data -----> End host -- End host - End host| -----> \ \ | \ \ \ 1.2.3 | \ \ \ End host | \ \ | \ \ 1.3 1.3.1 | \ End host - End host | \ | \ | \-----------------------------/ Cluster Figure 3: Data delivery in DMMP 5. DMMP Messages This section describes the DMMP messages that are used to control the Lei, et al. Expires December 22, 2006 [Page 12] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 overlay hierarchy and to configure the multicast tree. Currently, there are ten pairs of control messages in DMMP as shown in Table 1. Each pair of control messages will be exchanged between the DMMP- aware entities in a request-and-response way. o Subscription Request and Response - Group members get the address of RP from Dynamic Name Server (DNS). o Ping_RP Request and Response - During bootstrapping, each member of the group gets a list of available super nodes from the RP, containing at least one active node. o Source Request and Response - Before joining the group, a new member uses Source Request message to obtain the address of the source. RP sends back Source Response with different indications, i.e. Success, Failure. o Cluster Request and Response - During the cluster construction, Invite Request and Response Messages are used for non-super nodes to find the best super node. o Join Request and Response - A newly joining member gets information from active cluster members about who is appropriate to be its parent in the overlay cluster. o Setup Request and Response - Mesh members use this pair of messages to manage and maintain the mesh. o Status Report and Response - To differentiate a "dead" or "leaving" node from a "living" node. o Probe Request and Response - To confirm whether the target node is still alive or not. o Leave Report and Response - To inform the other group members that the node is leaving the group. o Refresh Request and Response - To maintain the overlay hierarchy, they are used to periodically update the capacities (such as uptime, out-degree) of group members. To adapt to situation changes, each end host maintains the overlay core by periodically updating its capacities. It periodically exchanges Refresh message to maintain the overlay hierarchy. If node A cannot receive this message from node B within the refresh timeout duration, node A will send a Probe request to node B. Node B will be confirmed to be "dead" or "leaving" if node A still cannot receive any Probe response from B. Then the Status report, indicating node B being "dead", will be used to inform the rest of group members. Table 1 lists the DMMP messages according to the associated DMMP operational phases. Lei, et al. Expires December 22, 2006 [Page 13] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 Table 1: DMMP Messages +-----------------+------------+------------+------------+ | Messages | Operation | From | To | +-----------------+------------+------------+------------+ | Subscription Rq | Initializ- |Group Member| DNS server | +-----------------+ ation +------------+------------+ | Subscription Res| | DNS server |Group Member| +-----------------+------------+------------+------------+ | Ping_RP Request | Bootstrap |Group Member| RP | +-----------------+ +------------+------------+ | Ping_RP Response| | RP |Group Member| +-----------------+------------+------------+------------+ | Source Request | Member |new End Host| RP | +-----------------+ +------------+------------+ | Source Response | Join | RP |new End Host| +-----------------+------------+------------+------------+ | Cluster Request | Construct |Cluster Mem.| Super Node | +-----------------+ +------------+------------+ | Cluster Response| Clusters | Super Node |Cluster Mem.| +-----------------+------------+------------+------------+ | Join Request | Member | End Host |Cluster Mem.| +-----------------+ +------------+------------+ | Join Response | Join |Cluster Mem.| End Host | +-----------------+------------+------------+------------+ | Setup Request | Mesh | Super Node | Super Node | +-----------------+ +------------+------------+ | Setup Response | Management | Super Node | Super Node | +-----------------+------------+------------+------------+ | Status Report | Cluster |Group Member|Group Member| +-----------------+ Member +------------+------------+ | Status Response | Monitoring |Group Member|Group Member| +-----------------+------------+------------+------------+ | Probe Request | Probe |Group Member|Group Member| +-----------------+ +------------+------------+ | Probe Response | Members |Group Member|Group Member| +-----------------+------------+------------+------------+ | Leave Report | Member |Leaving Node|Group Member| +-----------------+ +------------+------------+ | Leave Response | Leave |Group Member|Leaving Node| +-----------------+------------+------------+------------+ |Refresh Request | Update |Group Member|Group Member| +-----------------+ +------------+------------+ |Refresh Response |Information |Group Member|Group Member| +-----------------+------------+------------+------------+ Legend: SN - Super Node Cluster Mem. - Cluster Member Lei, et al. Expires December 22, 2006 [Page 14] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 Although TCP provides a generic protocol for a guaranteed, in-order delivery of stream-based messages, this reliability comes at a price in performance. Besides, the communication pattern in DMMP is strictly in a request-response mode, and most messages have a small fixed maximum size. Thus, it is preferred to encapsulate all DMMP messages over UDP to provide the required delivery guarantees without extra network burdens. Under this situation, an error handling of DMMP messages may be required, which will be defined in the future studies. 6. DMMP: Protocol Details All DMMP nodes and the source are assumed to be able to know the address of the RP. Also, once a source node starts, a direct communication channel will be established between the source node and the RP, so that the necessary information like the capacities of group members and active super nodes could be exchanged between them during the lifetime of an overlay multicast session. Furthermore, a DNS namespace is required to maintain the RP information for a specified multicast group. Before initialization, each group member and the source send out the subscription request containing the specified group name and domain name, to the DNS server for the address of certain Rendezvous Point (RP). Since RP does not participate in data forwarding, the location of RP has no significant impact on the performance of data distribution. If there is no existing RP who is serving for this multicast group, DNS server will allocate a new one for this multicast group based on application requirements. Otherwise, DNS server sends the address back and stores group related information. It is possible that multiple RPs serve for the same multicast group, e.g. for the purpose of load balancing and fault tolerance. For simplicity, this document initially considers the case where there is only one RP involved. 6.1. Initialization During the initialization phase, certain application related software has been distributed to the prospective DMMP-aware entities with the DMMP control and data transport models. If the application is a commercial application program, which will be considered by choosing available protocol stacks according to different requirements like reliability (e.g. TCP/IP) and media types (e.g. video, audio, etc) or media codec. So it also depends on the service provider or the source to distribute application related software to group members before this service can really starts. Lei, et al. Expires December 22, 2006 [Page 15] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 Before the service starts, the source and RP must be ready to give response to DMMP request messages from DMMP-aware end hosts. The source and RP will take no further reactions to any DMMP requests once the service stops. During the service, the active time should be the period from the service starts until it stops. Then the out- of-band channel between the RP and source should be active during this active service so that the source can monitor the current status of memberships. However, the detailed mechanism for implementing this out-of-band bootstrapping is out of the scope of this document. Moreover, service-related information should be obtained before the service starts and all prospective group members use out-of-band bootstrapping mechanism to get necessary information, for instance, Group ID and location of RP including the port number serving for certain services before the application begins. Then DMMP-aware entities can start receiving data after they join the overlay hierarchy. 6.2. Super Node Selection It is required to ensure that the mesh, the most important part of DMMP, can be resilient to dynamic network changes and recovery from any failures. As the super node selection is the first step towards mesh establishment, this section gives more details. To select better super nodes for overlay mesh while maintain scalability, the following distribution requirements need to be taken into account [12]. o Connections: Super nodes have more power and are comparatively stable to perform additional tasks such as resources control, load balance and fault tolerance. o Number: The proportion of super nodes to non-super nodes, usually required to be limited within a threshold, is fit for application- specific requirements. For example, in the case of 110 end hosts, it may need 10 super nodes if the ratio is 10%. In this case, 10 end hosts with larger capacity will be chosen as super nodes. o Downstream: To adapt to bandwidth requirements, super nodes should not serve more than K non-super nodes, where K is respectively determined by the out-degree of each super node and service specifications. In order to deal with factors from a large-scale and dynamic network environment, the following three conditions are outlined in addition to above requirements. o Heterogeneity: Systems contain a large number of heterogeneous nodes with different powers and resources. Previous research has shown that a large proportion of free-riders (i.e. zero out-degree Lei, et al. Expires December 22, 2006 [Page 16] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 members) may exist in the network [13], [14]. o Resilience: Super nodes are responsible for detecting dynamic changes and for handling them quickly, e.g. one super node leaves the group ungracefully, which should be detected by at least one of other active nodes. Afterwards, a new super node should be quickly selected to replace the leaving super node in the position. The time for detection and recovery process is also constrained by the service requirements. o Security: Super nodes should be fundamentally invulnerable to attacks; otherwise, they will easily disrupt the multicast service by forwarding wrong messages or failing to accept information. End host based overlay multicast is more susceptive to dynamic network changes since end hosts may join or leave the group at will. It would be even harder for DMMP to manage and maintain such an overlay mesh because super nodes may leave the group ungracefully as well. To address the instability of mesh, uptime is chosen as an assisted criterion to strengthen its maintenance. Once a node joins into the group, its uptime starts to be calculated and is recognized as one of its capacities. Like the rule of "the nature of selection", high-capacity nodes will be periodically pushed to the higher level of the tree by capacity comparison mechanism. Then, it is very likely that super nodes and their immediate children are higher capacity nodes after a certain time due to their comparatively higher out-degree and an indication of being more stable than low- level nodes. So, nodes at the bottom level are either transient nodes or leaf nodes. The main goal of doing this is to reducing the impacts of frequent changes in the overlay so that only a small part of the overlay multicast tree will be affected and needs to be re- constructed after dynamic changes. Detailed description will be augmented in the future version of this draft. 6.3. Member Join DMMP is resilient to dynamic network changes, for examples, events like a member joining/leaving. The newcomer first checks its local cache for super node candidates. If there are no suitable candidates in the cache, it requests the RP for the addresses of the source and super node candidates. If the RP cannot find the requested source, it sends a Source_Failure message back. Otherwise, the new member gets the source address and a list of active super nodes with their capacities, i.e. uptime, out-degree. Then, it will measure the e2e latency between them and itself, and sends the Join Request message to some super node which can provide smaller e2e latency. On receiving Join Request from a newcomer, the super node will check Lei, et al. Expires December 22, 2006 [Page 17] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 its current out-degree. If it is possible to accept the newcomer to as its immediate child, the super node will respond with an indication of acceptance. In this case, the information about newly joining nodes will only be propagated to the existing children of this super node since no child of the new member exists. When the super node cannot accept it as its immediate child, it will redirect this Join message to its active children with the largest out-degree. If one of them responses to the super node, this response will be relayed to the new member. If there are more potential parents, the new member selects the one with smallest tree depth as its parent. If there are multiple potential parents at the same depth, it chooses the best one in terms of their uptime. Once finding the appropriate parent, the new member starts data delivery. At this time, the information about the new member will be propagated from the parent to its PLN, siblings and the super node. The process will be terminated until the new member finds its position and accordingly updates its related information at corresponding nodes. Suppose the newcomer fails to find an appropriate position in any cluster to satisfy application requirements, it can sell itself as a potential super node and report its own capacities to the RP. Regarding its capacity and the current number of super nodes, it is possibly entitled to be a super node. In this way, end hosts have more flexibility to get optimal services except that it can only attach to a super node. Further operation details will be specified in a future version of this document. 6.4. Data Delivery Control After the multicast tree configuration, the new member will ask its immediate parent to send the data. Generally, parent nodes will delete the data from their local cache after they forward to their children. If the parent still holds the data, the new member can get the data from it. If the parent has not received data yet, either it waits until the parent forwards the data after receiving, or it directly inquires the super node to transfer the data. The former option is preferred in DMMP as its overhead is likely lower. On receiving the data, the new member will firstly forward them to its parent if the parent hasn't received the data yet. Besides, the new member can forward the received data to its siblings on the condition that all its PLNs haven't received the data yet, which facilitates the data spread all over the cluster within a short time and alleviate the burden of its PLNs. However, this technique may cause redundant transmissions over the multicast tree. Thus, some preventive mechanism will be explicitly defined in the next version of the draft. Different from joining as a cluster member, the new member may act as Lei, et al. Expires December 22, 2006 [Page 18] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 a super node. For this case, it will firstly ask its neighbors in the overlay mesh to send the data. Alternatively, it may query these data from its children when they are already in its local cluster. If any of them receives the data, this new super node will also get the data. A third possibility is to let the new member directly ask the source to send the data. To alleviate the control overhead, it is recommended that this new node waits for a certain while until one of its mesh neighbors receives the data. 6.5. Refresh Information In DMMP, each member is responsible for maintaining the overlay hierarchy, by periodically sending Refresh message. The Refresh mechanism has a little difference in the overlay mesh and in the local clusters. To efficiently manage the overlay hierarchy, both active and passive models are utilized in DMMP. Within the cluster, each end host starts to exchange Refresh message with its PLNs, siblings and CLNs once it joins the cluster. Except that each member has to periodically update its information, members in the local cluster are able to request refresh message from their relatives, e.g., PNLs or CNLs. Therefore, it guarantees the reliability and comparable stability of the overlay hierarchy. For the Refresh message in the mesh, each super node sends its update information to all mesh members including the source. Once receiving updated information, the source will correspondingly update the information at the RP. If one mesh member stops receiving Refresh message from another beyond the Mesh_Refresh_Timer, it assumes this neighbor to be either "dead" or "leaving". In order to confirm the status, it still needs to initiate a Probe message as stated in Section 5. 6.6. Capacity specification There are several metrics used in DMMP as the criteria of the capacity, which will be specified in this section. Lei, et al. Expires December 22, 2006 [Page 19] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 Table 2: Capacity specification +-------------+-------------------------------------+ | Metric | Operation | +-------------+-------------------------------------+ | | Differentiation non-leaf nodes from | | | leaf nodes | | Out-degree +-------------------------------------+ | | Super nodes selection | | +-------------------------------------+ | | Tree construction within clusters | | +-------------------------------------+ | | New member joins the group | | +-------------------------------------+ | | Failure recovery mechanism | | +-------------------------------------+ | | Self-improving mechanism | +-------------+-------------------------------------+ | | Non-super nodes attach to super | | E2E latency | nodes to form clusters | | +-------------------------------------+ | | New member joins the group | +-------------+-------------------------------------+ | | New member joins the group | | Uptime +-------------------------------------+ | | Self-improving mechanism | +-------------+-------------------------------------+ As shown in Table 2, out-degree is the main criterion to select the super nodes from end hosts. Based on the selected super nodes, non- super nodes select one which locates "near" to it due to the estimated e2e latency. During the tree construction within clusters, nodes with higher out-degree are likely to join in the tree at the high level. Regarding new members joining procedure, out-degree, e2e latency and uptime are all taken into considerations. To keep the stability of the overlay hierarchy, out-degree and uptime are chosen as comparison metric to self-improve the overlay multicast tree. Furthermore, out-degree is regarded as the main selection criterion for alternative node during the failure recovery. 6.7. Member Leave In most cases, two situations for a member leaving the group, either gracefully or ungracefully, are distinguished from each other. In each cluster, the leaving member should at least send a Leave Request message to its parent or one of its children. After receiving the confirmation, it can leave the group gracefully. Then the notified node will propagate this Leave message to its relatives so that they Lei, et al. Expires December 22, 2006 [Page 20] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 can update their service membership tables. In the second case, the Leaving status will be detected by periodically exchanging Refresh messages. If any member within the cluster, say p, fails to receive a Refresh Report message from one of its required relatives, say q, within the refresh timeout Refresh_Timer, then p sends a redundant Probe Request message to q. If there is still no Probe Response message returned, p assumes q to be "dead" and propagates this Status_Dead message throughout the whole cluster. Nevertheless, ungraceful leaving may cause the crash of whole multicast tree. DMMP is able to handle different situations by detecting the failures and recovering quickly from them as shown in Section 6.8. Compared with the handling in the local cluster, the handling will be tougher in the core mesh. When there is no end hosts connecting to the leaving super node, no further change to the overlay is required. Otherwise, the overlay may be broken. Before a super node gracefully leaves the group, it must first elect a replacement leader for the cluster it owns and inform other super nodes in the overlay mesh before leaving. To detect unannounced leaves, DMMP relies on the periodic Refresh message exchanges. If the failed peer happens to be a super node, the overlay hierarchy has to be repaired. Therefore, either one of the other super nodes with available connections will take over the its functionalities or a new super node will be selected. Usually, the source will select one of the victim's children with largest out-degree as the new super node and updates the information to the RP. Accordingly, the neighbors in the same cluster adjust their positions. Taken Figure 1 as an example, 1.1 is supposed to replace the leaving Super Node. If 1.1 still has available connections for 1.2 and 1.3, 1.2 and 1.3 will directly swap to be the children of 1.1. If there is only one available connection left, either 1.2 or 1.3 can be the child of 1.1, depending on their e2e latencies. Otherwise, they contact with the children of 1.1, i.e. 1.1.1 and 1.1.2 for joining the tree. The iteration will continue until they find their positions. After the adjustment, some self-improvement mechanisms could be used to achieve better performance, but this is left for further study. 6.8. Failure Recovery Recovery from failures upon a member crash is similar to handling a member leaves. The difference is that surviving members usually do not receive prior notification of a crash. Thus, the node failure is generally detected by noticing periodically missing REFRESH or UPDATE message. Maintenance of such a multicast tree in DMMP faces a key problem that non-leaf nodes in the tree are end hosts who are more likely to fail than routers and may join/leave the tree at will. It does not happen in IP multicast since non-leaf nodes in the delivery tree are routers which do not leave the multicast tree without Lei, et al. Expires December 22, 2006 [Page 21] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 notification. So, one challenge in DMMP is to reconstruct the overlay multicast tree after a node's departure. It is even more difficult for super nodes to maintain the cluster since all of its local members are partitioned from the multicast tree and can not receive the multicast data until it is repaired. In order to improve the performance of DMMP, especially when there are high packet losses or host failures, two techniques are respectively implemented in proactive and reactive way. The proactive approach is used in the overlay mesh to recovery from dynamic changes. Each immediate child of the super node find a backup parent, either the source or a group member, in advance. Once the super node leaves the group, its children try to contact with their alternative parents to re-join the multicast tree. This approach can facilitate the recovery process and strengthen the reliability of the overlay hierarchy. In each local cluster, every end host periodically estimates its relatives (i.e. PNLs, CNLs) within the cluster and evaluates the number of losses that it shares with these nodes. Besides, each end host periodically detects failures in a random way like PRM [15], where it cooperates Randomized forwarding with Triggered NAKs to handle data losses and network congestion. While in the core mesh, each super node maintains state information about all other mesh members, no additional discovery of nodes is necessary. Using this mechanism, packet delivery ratios can be increased with a high probability. For the reactive method, the detailed technique has been applied for loss repair in RMTP [16]. To handle different scenarios of failures, more mechanisms need to be defined in the near future. 7. Security Considerations Security should be considered during the super nodes selection. In DMMP, the current preference is to use an authority center that qualifies the trust level of end hosts. Only when the end host obtains a security certificate from the authority center, it can be selected as a super node. Within each cluster, Cluster key, Group key and Private key are proposed as the security scheme to manage the cluster members. Details and discussions related to other security issues are to be explored in a future version of this document. 8. Open Issues DMMP framework will study necessary extensions or open issues, where Lei, et al. Expires December 22, 2006 [Page 22] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 security, large-scale efficiency, end-to-end quality-of-service (QoS) provisioning will be likely related. Moreover, initial DMMP does not include support for NATs and firewalls since they impose fundemental restrictions on pair-wise connectivity of hosts on the overlay. 9. Contributors Ruediger Geib contributed a great effort to this document. 10. Acknowledgements We thank Nicolai Leymann for his helpful suggestions. 11. References 11.1. Normative References [1] Bhattacharyya, S., "An Overview of Source-Specific Multicast (SSM)", RFC 3569, July 2003. [2] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. 11.2. Informative References [3] Chu, Y., Rao, S., and et. al, "A Case for End System Multicast", IEEE JSAC Vol. 20, No. 8, October 2002. [4] Banerjee, S., Bhattacharjee, B., and et. al, "Scalable Application Layer Multicast", SIGCOMM 2002, August 2002. [5] Banerjee, S., Kommareddy, C., and et. al, "OMNI: An Efficient Infrastructure for Real-time Applications", Computer Networks, Special Issue on Overlay Distribution Structures and their Applications, Vol. 50, No. 6, April 2006. [6] Lao, L., Cui, J., and et. al, "TOMA: A Viable Solution for Large-scale Multicast Service Support", IFIP Networking 2005, May 2005. [7] Banerjee, S. and B. Bhattacharjee, "Analysis of the NICE Application Layer Multicast Protocol", UMIACS Technical report TR 2002-60 and CS-TR 4380, June 2002. [8] Ballardie, A. and J. Crowcroft, "Core based trees (CBT)", ACM Lei, et al. Expires December 22, 2006 [Page 23] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 SIGCOMM Computer Communication Review, October 1993. [9] Hu, N. and P. Steenkiste, "Evaluation and Characterization of Available Bandwidth Probing Techniques", IEEE JSAC Special Issue in Internet and WWW Measurement, Vol. 21, No. 6, August 2003. [10] Zhi, L. and P. Mohapatra, "HostCast: A New Overlay Multicasting Protocol", IEEE ICC'03 Proceedings of IEEE ICC 2003, June 2003. [11] Deering, S., "Multicasting routing in internetworks and extended LANs", ACM SIGCOMM 1988, August 1988. [12] Lo, V., Zhou, D., and et. al, "Scalable Supernode Selection in Peer-to-Peer Overlay Networks", HOT-P2P'05 Proceedings of International Workshop on Hot Topics in Peer-to-Peer Systems 2005, July 2005. [13] Saroiu, S., Gummadi, P., and S. Gribble, "A Measurement Study of Peer-to-Peer File Sharing Systems", MMCN'02 Proceedings of Multimedia Computing and Networking, July 2002. [14] Sripanidkulchai, K., Maggs, B., and H. Zhang, "An analysis of live streaming workloads on the Internet", SIGCOMM IMC Proceedings of the 4th ACM SIGCOMM IMC, Oct. 2004. [15] Banerjee, S., Lee, S., and et. al, "Resilient Multicast using Overlays", ACM SIGMETRICS 2003, June 2003. [16] Paul, S. and K. Sabnani, "Reliable multicast transport protocol (RTMP)", IEEE JSAC, Vol. 15, No. 3, April 1997. [17] Handler, G. and P. Mirchandani, "Location on Networks: Theory and Algorithms", The MIT Press Cambridge Massachusetts, Mar. 1979. [18] Lynch, A., "Distributed Algorithms", Morgan Kaufmann San Francisco, April 1997. Lei, et al. Expires December 22, 2006 [Page 24] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 Authors' Addresses Jun Lei University of Goettingen Institute for Informatics Lotzestr. 16-18 Goettingen 37083 Germany Email: lei@cs.uni-goettingen.de Xiaoming Fu University of Goettingen Institute for Informatics Lotzestr. 16-18 Goettingen 37083 Germany Email: fu@cs.uni-goettingen.de Xiaodong Yang University of Goettingen Institute for Informatics Lotzestr. 16-18 Goettingen 37083 Germany Email: yang@cs.uni-goettingen.de Dieter Hogrefe University of Goettingen Institute for Informatics Lotzestr. 16-18 Goettingen 37083 Germany Email: hogrefe@cs.uni-goettingen.de Lei, et al. Expires December 22, 2006 [Page 25] Internet-Draft Dynamic Mesh Overlay Multicast Protocol June 2006 Intellectual Property Statement The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org. Disclaimer of Validity This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Copyright Statement Copyright (C) The Internet Society (2006). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. Acknowledgment Funding for the RFC Editor function is currently provided by the Internet Society. Lei, et al. Expires December 22, 2006 [Page 26]