Internet-Draft                                       Grenville Armitage
                                                                Bellcore
                                                         June 13th, 1996


                 Redundant MARS architectures and SCSP


Status of this Memo

   This document was submitted to the IETF Internetworking over NBMA
   (ION) WG. Publication of this document does not imply acceptance by
   the ION WG of any ideas expressed within. Comments should be
   submitted to the ion@nexen.com mailing list. Distribution of this
   memo is unlimited.

   This memo is an internet draft. Internet Drafts are working
   documents of the Internet Engineering Task Force (IETF), its Areas,
   and its Working Groups. Note that other groups may also distribute
   working documents as Internet Drafts.

   Internet Drafts are draft documents valid for a maximum of six
   months. Internet Drafts may be updated, replaced, or obsoleted by
   other documents at any time. It is not appropriate to use Internet
   Drafts as reference material or to cite them other than as a
   "working draft" or "work in progress". Please check the
   1id-abstracts.txt listing contained in the internet-drafts shadow
   directories on ds.internic.net (US East Coast), nic.nordu.net
   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
   Rim) to learn the current status of any Internet Draft.

Abstract

   The Server Cache Synchronisation Protocol (SCSP) has been proposed
   as a general mechanism for synchronising the databases of NHRP Next
   Hop Servers (NHSs), MARSs, and MARS Multicast Servers (MCSs). All
   these entities are different parts of the IETF's ION solution. This
   document is INFORMATIONAL, RAMBLING, and REALLY HACKY. It is
   intended as a catalyst for discussions aimed at identifying the
   realistic MARS scenarios to which the SCSP may find itself applied.
   This document does not deal with NHS and MCS scenarios.

1. Introduction.

   SCSP [1] was proposed to the ROLC and IP over ATM working groups as
   a general solution for synchronizing distributed databases such as
   distributed Next Hop Servers [2] and MARSs [3]. It is now being
   developed within the newly formed Internetworking over NBMA (ION)
   working group. This document attempts to describe possible
   redundant/distributed MARS architectures, and how SCSP would aid
   their implementation.

1.1 MARS Client support for backup MARSs.

   The current MARS draft already specifies a set of MARS Client
   behaviours associated with MARS-failure recovery (Section 5.4 of
   [3]). MARS Clients expect to regularly receive a MARS_REDIRECT_MAP
   message on ClusterControlVC, which lists the current and backup
   MARSs. When a MARS Client detects a failure of its MARS, it steps
   to the next member of this list and attempts to re-register. If the
   re-registration fails, the process repeats until a functional MARS
   is found.

   Sections 5.4.1 and 5.4.2 of [3] describe how a MARS Client, after
   successfully re-registering with a MARS, re-issues all the
   MARS_JOIN messages that it had sent to its previous MARS. This
   causes the new MARS to build a group membership database reflecting
   that of the failed MARS prior to the failure. (This behaviour is
   required for the case where there is only one MARS available and it
   suffers a crash/reboot cycle. Cluster members represent a
   distributed cache 'memory' that imposes itself onto the newly
   restarted MARS.)
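   As an illustration only (this sketch is not part of [3]), the
   client-side recovery behaviour just described might be expressed in
   Python-like pseudocode as follows. The helper names try_register
   and send_join are hypothetical placeholders for the real
   registration and MARS_JOIN exchanges.

      # Hypothetical sketch of MARS Client failover (section 5.4 of [3]).
      # try_register and send_join stand in for real MARS exchanges.

      def recover_from_mars_failure(redirect_map, joined_groups,
                                    try_register, send_join):
          """Step through the MARS list from the last MARS_REDIRECT_MAP,
          re-register with the first MARS that responds, then re-issue
          MARS_JOIN for every group this client is a member of."""
          while True:
              for mars_addr in redirect_map:       # e.g. [M1, M2, M3, M4]
                  cmi = try_register(mars_addr)    # returns a CMI, or None
                  if cmi is not None:
                      for group in joined_groups:  # rebuild new MARS's view
                          send_join(mars_addr, group)
                      return mars_addr, cmi
              # No MARS in the list answered; keep cycling through it.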
1.2 Structure of this document.

   This document is currently structured in a semi-rambling fashion.
   I've put together sequences of ideas to see if I can lead people to
   certain conclusions, highlighting my reasoning along the way so that
   the issues (or lack thereof) may be evident to readers. As of the
   first release there are few conclusions or solutions.

2. Why a distributed database?

   In the current MARS model [3] a Cluster consists of a number of MARS
   Clients (IP/ATM interfaces in routers and/or hosts) utilizing the
   services of a single MARS. This MARS is responsible for tracking the
   IP group membership information across all Cluster members, and
   providing on-demand associations between IP multicast group
   identifiers (addresses) and multipoint ATM forwarding paths. It is
   also responsible for allocating Cluster Member IDs (CMIs) to Cluster
   members (inserted into outgoing data packets, to allow reflected
   packet detection when Multicast Servers are placed in the data
   path).

   Two different, but significant, goals motivate the distribution of
   the MARS functionality across a number of physical entities. These
   might be summarized as:

   Fault tolerance
             If a client discovers the MARS it is using has failed, it
             can switch to another MARS and continue operation where it
             left off.

   Load sharing
             The component MARSs of a distributed, logically single
             MARS handle a subset of the control VCs from the clients
             in the Cluster.

   Each goal has some characteristics that it does not share with the
   other, so it would be wrong to believe that any solution to one is a
   solution to the other. However, a general solution to the Load
   sharing model may well provide fault tolerance as a by-product.

   Some additional terminology is introduced to describe the
   distributed MARS options. These terms reflect the differing
   relationships the MARSs have with each other and the Cluster members
   (clients).

   Fault tolerant model:

   Active MARS
             The single MARS serving the clients, which allocates CMIs
             and tracks group membership changes by itself. It is the
             sole entity that constructs replies to MARS_REQUESTs.

   Backup MARS
             An additional MARS that tracks the information being
             generated by the Active MARS. Cluster members may
             re-register with a Backup MARS if the Active MARS fails,
             and they'll assume the Backup has sufficiently up to date
             knowledge of the Cluster's state to take the role of
             Active MARS.

   Load sharing model:

   Active Sub-MARS
             At its most basic, load sharing involves breaking the
             Active MARS into a number of simultaneously active MARS
             entities that each manage a subset of the Cluster.
             Sub-MARS entities must co-ordinate their activities so
             that they appear to be interchangeable to cluster members
             - each one capable of allocating CMIs and tracking group
             membership information within the Cluster. Together they
             act as a distributed, logically single Active MARS.
             MARS_REQUESTs sent to a single Active Sub-MARS return
             information covering the entire Cluster.

   Backup Sub-MARS
             A MARS entity that tracks the activities of an Active
             Sub-MARS, and is able to become a member of the active
             Sub-MARS group when failure occurs.

   The next two sections discuss the Fault tolerance and Load sharing
   models in further detail. (Editorial note: it is not yet clear how
   to map these to the Server Group concept in SCSP, which appears to
   consist solely of what I would term 'active sub-servers'.
   Terminology will be cleaned up as this becomes clearer.)
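   Purely to make concrete the data that a single MARS tracks, the
   following Python-style sketch shows the state described above. The
   class and method names are illustrative only and are not drawn from
   [3].

      # Illustrative state held by a single, non-distributed MARS.

      class SingleMars:
          def __init__(self):
              self.cmi_of = {}       # cluster member ATM address -> CMI
              self.members_of = {}   # IP group -> set of ATM addresses
              self.next_cmi = 1

          def register(self, atm_addr):
              # Allocate a cluster-wide unique CMI to a new member.
              if atm_addr not in self.cmi_of:
                  self.cmi_of[atm_addr] = self.next_cmi
                  self.next_cmi += 1
              return self.cmi_of[atm_addr]

          def join(self, atm_addr, group):
              # Track a group membership change signalled by MARS_JOIN.
              self.members_of.setdefault(group, set()).add(atm_addr)

          def request(self, group):
              # Answer a MARS_REQUEST: the ATM leaf nodes for this group.
              return sorted(self.members_of.get(group, set()))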
3. Architectures for fault tolerance.

   This is the simpler of the two models. The Active MARS is a single
   entity, and only requires a one-way flow of information to the one
   or more Backup MARSs to keep their databases up to date. The
   relationship between cluster members, an Active MARS and 3 backup
   MARSs might be represented as:

         C1            C2            C3
          |             |             |
          ------------- M1 ------------
                        |
                    M2--M3--M4

   In this case the Cluster members (C1, C2, and C3) use M1, the Active
   MARS. M2, M3, and M4 are the Backup MARSs. The communication between
   M1, M2, M3, and M4 is completely independent of the communication
   between M1 and C1, C2, and C3. The Backup MARSs are essentially
   slaved off M1. (The lines represent associations, rather than actual
   VCs. M1 has pt-pt VCs between itself and the cluster members, in
   addition to ClusterControlVC spanning out to the cluster members.)

   As noted in section 1.1, M1 would be regularly transmitting a
   MARS_REDIRECT_MAP on ClusterControlVC specifying {M1, M2, M3, M4} as
   the set of MARSs for the cluster. If M1 were to fail, and M2 was
   fully operational, the cluster would rebuild itself to look like
   this:

         C1            C2            C3
          |             |             |
          ------------- M2 ------------
                        |
                      M3--M4

   As noted in section 1.1, each cluster member re-issues its
   outstanding MARS_JOINs to M2.

   If M2 had also failed, clients would then have tried to re-register
   with M3, then M4, then cycled back to M1. This sequence would repeat
   until one of the MARSs listed in the last heard MARS_REDIRECT_MAP
   allowed the clients to re-register. A further complication is that
   transient failures of 2 or more of the backup MARSs may lead,
   through race conditions in client re-registration, to C1
   re-registering with a different MARS than C2 and C3. It is clear
   that the backup MARSs must elect their own notion of the Active
   MARS, and redirect clients to this Active MARS if clients attempt to
   re-register with a MARS that considers itself not to be the Active
   MARS for the Cluster. (This needs to be clarified further.)

3.1 MARS_REDIRECT_MAPs and post-recovery reconfiguration.

   Cluster members assume that the members of the MARS_REDIRECT_MAP are
   capable of taking on the role of Active MARS. Any inter-MARS
   protocol for dynamically adding and removing Backup MARSs must
   ensure this is true. In the preceding example, once M2 takes over as
   the Active MARS it should start sending MARS_REDIRECT_MAPs that
   carry the reduced list {M2, M3, M4} until such time as M1 has
   recovered.

   A couple of options exist once M1 recovers, and these must be
   addressed by a distributed MARS protocol. A simple approach
   relegates M1 to be a Backup MARS. Thus M2 might begin issuing
   MARS_REDIRECT_MAPs with the list {M2, M3, M4, M1} once M1 is known
   to be available again. The picture might eventually look like:

         C1            C2            C3
          |             |             |
          ------------- M2 ------------
                        |
                    M3--M4--M1

   (If M1 has some characteristics that make it more desirable than M3
   or M4, then M2 might instead start sending {M2, M1, M3, M4}.)

   However, it is possible that M1 has characteristics that make it
   preferable to any of the Backup MARSs whenever it is available.
   (This might include throughput, attachment point in the ATM network,
   fundamental reliability of the underlying hardware, etc.) Ideally,
   once M1 has recovered from whatever problem caused the move to M2,
   M2 will force the cluster members to shift back to M1.
   This functionality is also already included in the cluster member
   behaviour defined by the MARS draft (Section 5.4.3 of [3]). Once M1
   was known to be available and synchronised with M2, M2 would stop
   sending MARS_REDIRECT_MAPs with {M2, M3, M4}. It would then start
   sending MARS_REDIRECT_MAPs listing {M1, M2, M3, M4}, with bit 7 of
   the mar$redirf flag reset. Cluster members would compare the
   identity of their Active MARS (M2) with the first one listed in the
   MARS_REDIRECT_MAP (M1) and initiate a redirect.

   Bit 7 of mar$redirf being reset indicates a soft redirect. Cluster
   members re-register with M1, but do not re-join the multicast groups
   they are members of - by indicating a soft redirect, M2 is claiming
   that M1 has a current copy of M2's database. This reduces the amount
   of MARS signalling traffic associated with redirecting the cluster
   back to M1. (If synchronization of M1 with M2's database is not
   available, a hard redirect back to M1 can be performed - with a
   consequent burst of MARS control traffic as the clients leave M2 and
   re-join all their groups with M1.)

3.2 Impact of cluster member re-registration.

   As noted earlier, the MARS draft requires that cluster members
   re-registering after an Active MARS failure MUST re-issue MARS_JOINs
   for all groups of which they consider themselves members. This has
   an interesting implication - it may not be necessary for an
   inter-MARS protocol to ensure that Backup MARSs have up to date
   group membership maps. Take the preceding example. During the
   transition from M1 to M2, cluster members C1, C2, and C3 will
   re-issue to M2 a sequence of MARS_JOINs. This would result in M2
   building a group membership database that reflected M1's just before
   the failure, even if M2's database was initially empty.

   One piece of information that is not supplied by cluster members
   during re-registration/re-joining is their CMI - this must be
   supplied by the new Active MARS. It is highly desirable that when a
   cluster member re-registers with M2 it be assigned the same CMI that
   it obtained from M1. To ensure this, the Active MARS MUST ensure
   that the Backup MARSs are aware of the ATM addresses and CMIs of
   every cluster member.

   (If the CMIs are not re-assigned to the same cluster members, data
   packets flowing out of a given cluster member will suddenly have a
   different CMI embedded in them. During the transition from M1 to M2,
   some cluster members may transition earlier than others. If they are
   assigned the same CMI as a pre-transition cluster member to whom
   they are currently sending IP packets, the recipient will discard
   these packets as though they were reflections from an MCS. Once all
   cluster members have transitioned to M2 this problem will go away,
   but it represents a short period where some data packets might fall
   into a black hole.)
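   To illustrate (and only to illustrate) the CMI requirement above,
   the following sketch shows an Active MARS's {ATM address, CMI} table
   being copied to a Backup, and the Backup handing back the same CMI
   when a known member re-registers after takeover. The class and
   method names are invented for this sketch and do not come from [3]
   or [1].

      # Sketch of CMI preservation across an Active -> Backup transition.
      # "Synchronisation" here is reduced to copying two data structures.

      class BackupMars:
          def __init__(self):
              self.cmi_of = {}         # ATM address -> CMI, learnt from Active
              self.mcs_groups = set()  # groups known to be MCS supported

          def sync_from_active(self, cmi_table, mcs_groups):
              # The two sub-caches the Active MARS must propagate
              # (see section 3.4).
              self.cmi_of = dict(cmi_table)
              self.mcs_groups = set(mcs_groups)

          def reregister(self, atm_addr):
              # On takeover, give a re-registering member the CMI it
              # already held, so its outgoing packets keep the same CMI.
              if atm_addr in self.cmi_of:
                  return self.cmi_of[atm_addr]
              # Unknown member: allocate a CMI not currently in use.
              new_cmi = max(self.cmi_of.values(), default=0) + 1
              self.cmi_of[atm_addr] = new_cmi
              return new_cmi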
3.3 Multicast Servers.

   For the purposes of this document we look at Multicast Servers
   (MCSs) as clients of the Active MARS. They adhere to the same rules
   as cluster members do - listen to MARS_REDIRECT_MAP, and redirect to
   a Backup MARS when the Active MARS fails. In the same way that
   cluster members re-join their groups after re-registration, MCSs
   also re-register for groups that they are configured to serve.
   Unlike Cluster members there is no equivalent of the CMI for MCSs.
   However, it is important for the Active MARS to keep Backup MARSs
   informed of what groups are MCS supported.

   The reason for this can be understood by considering what would
   happen if the Backup MARS had no knowledge of what groups had
   members, and which of those groups were MCS supported, when the
   Active MARS failed. Consider the following sequence:

   -  The Active MARS fails.

   -  Cluster members and MCSs gradually detect the failure, and begin
      re-registering with their first available Backup MARS.

   -  Cluster members re-MARS_JOIN all groups they were members of. As
      the Backup (now Active) MARS receives these MARS_JOINs it
      propagates them on its new ClusterControlVC.

   -  Simultaneously, each MCS re-MARS_MSERVs all groups it was
      configured to support. If a MARS_MSERV arrives for a group that
      already has cluster members, the Backup (now Active) MARS
      transmits an appropriate MARS_MIGRATE on its new
      ClusterControlVC.

   Assume that group X was MCS supported prior to the Active MARS's
   failure. Each cluster member had a pt-mpt VC out to the MCS (a
   single leaf node). MARS failure occurs, and each cluster member
   re-registers with the Backup MARS. The pt-mpt VC for group X is
   unchanged. Now cluster members begin re-issuing MARS_JOINs to the
   Backup (now Active) MARS. If the MCS for group X has not yet
   re-MARS_MSERVed for group X, the Backup MARS thinks the group is VC
   Mesh based, so it propagates the MARS_JOINs on ClusterControlVC.
   Other cluster members then update their pt-mpt VC for group X to add
   the (apparently) new leaf nodes. This results in cluster members
   forwarding their data packets to the MCS and some subset of the
   cluster members directly. This is not good.

   When the MCS finally re-registers, and re-MARS_MSERVs group X, the
   MARS will issue a MARS_MIGRATE, which will fix every cluster
   member's pt-mpt VC for group X. But the transient period is
   potentially dangerous. If the Backup MARSs are aware of what groups
   are MCS supported, they can appropriately suppress the cluster
   members' MARS_JOINs for a period of time while waiting for the MCS
   to explicitly re-register and re-MARS_MSERV. This would avoid the
   transient period where cluster members are reacting to MARS_JOINs
   erroneously sent across the new ClusterControlVC.

3.4 Inter-MARS protocol requirements.

   For the purely fault-tolerant model, the requirements are:

   -  For the architecture discussed in this section, the key pieces of
      information (or sub-caches, described in Appendix B.4 of [1])
      that must be propagated by the Active MARS to Backup MARSs are
      the CMI to Cluster Member mapping table and the list of groups
      currently MCS supported.

   -  It is valuable to enable a previous Active MARS to be returned to
      the group of MARSs listed in a MARS_REDIRECT_MAP after it
      recovers from its failure.

   -  If a failed Active MARS restarts, and is preferable to any of the
      Backup MARSs for long term cluster operation, then it is
      desirable that some mechanism exists for synchronising the entire
      database of the current Active MARS with the restarted MARS. This
      allows a transition back to the restarted MARS using a soft
      redirect.

   -  No special additions are required to handle client requests (e.g.
      MARS_REQUEST or MARS_GROUPLIST_QUERY), since there is only a
      single Active MARS.
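   As a purely illustrative sketch of the suppression behaviour
   suggested in section 3.3 (holding back re-issued MARS_JOINs for
   groups that the backup knows to be MCS supported, until the MCS has
   had a chance to re-issue MARS_MSERV), something like the following
   could be imagined. The hold-down interval and the function names are
   assumptions for the sketch; neither [1] nor [3] specifies them.

      import time

      # Rough sketch of the MARS_JOIN hold-down suggested in section 3.3.
      # HOLDDOWN is an assumed value; [3] does not specify one.
      HOLDDOWN = 10.0   # seconds to wait for an MCS to re-issue MARS_MSERV

      def propagate_join(group, mcs_groups, mserv_seen, takeover_time,
                         send_on_ccvc):
          """Decide whether a re-issued MARS_JOIN may go out on the new
          ClusterControlVC during the takeover transient."""
          if group in mcs_groups and group not in mserv_seen:
              if time.time() - takeover_time < HOLDDOWN:
                  return False      # still waiting for the MCS; suppress
              # Hold-down expired with no MARS_MSERV seen: treat the
              # group as VC mesh based from here on.
          send_on_ccvc(group)
          return True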
4. Architectures for load sharing.

   Creating a physically distributed, but logically single MARS is a
   non-trivial task. A number of issues arise:

   -  ClusterControlVC is partitioned into a number of sub-CCVCs, one
      hanging off each Active Sub-MARS. Their leaf nodes are those
      cluster members that make up the cluster partition served by an
      Active Sub-MARS.

   -  MARS_JOIN/LEAVE traffic to one Active Sub-MARS must propagate out
      on each and every sub-CCVC to ensure Cluster wide distribution.
      This propagation must occur immediately.

   -  Allocation of CMIs across the cluster must be co-ordinated
      amongst the Active Sub-MARSs to ensure no CMI conflicts within
      the cluster.

   -  Each sub-CCVC must carry MARS_REDIRECT_MAP messages with an
      appropriate MARS list that perpetuates the illusion to cluster
      members that there is only a single MARS.

   -  Each Active Sub-MARS must be capable of answering a MARS_REQUEST
      or MARS_GROUPLIST_QUERY with information covering the entire
      Cluster.

   Load sharing configurations take on a range of forms. At the
   simplest end multiple MARS entities are simultaneously operational,
   and subdivide the Cluster. No fault tolerance is provided - if a
   MARS fails, its clients are 'off air' until the MARS restarts. A
   more complex model would allow each partition of the cluster to be
   supported by a MARS with its own dedicated set of Backup MARSs.
   Finally, the most complex model requires a set of MARS entities from
   which a subset may at any one time be actively supporting the
   cluster, while the remaining entities wait as Backups. The
   partitioning of the cluster is ideally dynamic and variable. The
   following subsections touch on these different models.

4.1 Simple load sharing.

   In a simple load sharing model each Active Sub-MARS has no backups,
   and clients only know of one Sub-MARS. Consider a cluster with 4
   MARS Clients, and 2 Active Sub-MARSs. The following picture shows
   one possible configuration, where the cluster members are split
   evenly between the sub-MARSs:

         C1    C2           C3    C4
          |     |            |     |
         ----- M1 ------    ----- M2 -----
                |                  |
                --------------------

   C1, C2, C3, and C4 all consider themselves to be members of the same
   Cluster. M1 manages a sub-CCVC with {C1, C2} as leaf nodes, while M2
   manages a sub-CCVC with {C3, C4} as leaf nodes. M1 and M2 must have
   some means to exchange cluster co-ordination information.

   When C1 issues MARS_JOIN/LEAVE messages they must be sent to
   {C1, C2} and also {C3, C4} via M2. When C3 issues MARS_JOIN/LEAVE
   messages they must be sent to {C3, C4} and also {C1, C2} via M1. One
   side-effect is that M1 and M2 are forced to be aware of group
   membership changes from all parts of the cluster (through the
   exchange of MARS messages needing cluster wide propagation).

   M2 must be able to answer a MARS_REQUEST from C3 or C4 that covers
   its own database and that of M1. Conversely, M1 must be able to draw
   upon M2's knowledge and its own when answering a MARS_REQUEST from
   C1 or C2. Two solutions exist - either M1 and M2 attempt to ensure
   they both share complete knowledge of the cluster's membership
   lists, or they query each other 'on demand' when building the
   answers to a client's MARS_REQUEST. Given that each Active Sub-MARS
   will see the MARS_JOIN/LEAVE messages generated by clients of other
   Active Sub-MARSs, it seems more effective for each Active Sub-MARS
   to keep its own view of the Cluster using this message flow, and so
   build replies to MARS_REQUESTs from local knowledge.

   When new cluster members register with either M1 or M2 there must be
   some mechanism to ensure CMI allocation is unique within the scope
   of the entire cluster. There must be some element to the inter-MARS
   protocol that allows them to detect the possible loss of messages
   from the other MARS(s).

   If no backups exist, and no mechanism for dynamically re-arranging
   the partitioning of the cluster, the MARS_REDIRECT_MAP message from
   M1 lists {M1}, and from M2 lists {M2}.
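   Neither [1] nor [3] says how cluster wide CMI uniqueness across
   Active Sub-MARSs would actually be achieved. One obvious, purely
   hypothetical approach is to carve the CMI space into disjoint
   blocks, one per Sub-MARS, so that no per-registration handshake is
   required; the class name and block size below are arbitrary
   assumptions made for the sketch.

      # Hypothetical CMI allocation for load-sharing Sub-MARSs: each
      # Sub-MARS owns a disjoint block of the CMI space. This scheme is
      # NOT part of [1] or [3]; it only illustrates the requirement.

      class SubMarsCmiAllocator:
          BLOCK_SIZE = 4096                 # assumed block size

          def __init__(self, sub_mars_index):
              self.low = sub_mars_index * self.BLOCK_SIZE + 1
              self.high = self.low + self.BLOCK_SIZE - 1
              self.next_cmi = self.low
              self.cmi_of = {}              # ATM address -> CMI

          def allocate(self, atm_addr):
              if atm_addr in self.cmi_of:
                  return self.cmi_of[atm_addr]
              if self.next_cmi > self.high:
                  raise RuntimeError("CMI block exhausted; "
                                     "inter-MARS coordination needed")
              self.cmi_of[atm_addr] = self.next_cmi
              self.next_cmi += 1
              return self.cmi_of[atm_addr]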
4.2 Simple load sharing with backups.

   A slightly more complex model would evolve if each Active Sub-MARS
   had its own list of one or more Backup Sub-MARSs. The picture might
   become:

         C1    C2           C3    C4
          |     |            |     |
         ----- M1 ------    ----- M2 -----
        /       |                  |      \
       M3       --------------------       M4

   In this case M3 is a Backup for M1, and M4 is a Backup for M2.
   Initially we'll assume that there is no requirement for M3 and M4 to
   be shareable between M1 and M2. The MARS_REDIRECT_MAP from M1 would
   list only {M1, M3}, and from M2 would list only {M2, M4}.

   This situation implies the fault-tolerant model (section 3) between
   each Active Sub-MARS and its local group of Backup Sub-MARSs.
   However, it also implies that when a Backup Sub-MARS is promoted to
   Active Sub-MARS it must have some means to know who the other Active
   Sub-MARSs are. Thus the protocol managing load sharing among the
   Active Sub-MARSs needs augmentation to support Backup Sub-MARSs.

   For example, if M1 failed, the picture might become:

         C1    C2           C3    C4
          |     |            |     |
         ----- M3 ------    ----- M2 -----
                |                  |      \
                --------------------       M4

   The MARS_REDIRECT_MAP from M3 would list only {M3}, and from M2
   would continue to list {M2, M4}. (Assuming M1 never recovers. If M1
   recovers, a number of options exist for M1 and M3 to decide who will
   continue supporting their part of the cluster.)

4.3 Load sharing with dynamic reconfiguration.

   The preceding examples are significantly limited. Ideally the set of
   individual sub-MARSs should be capable of managing a variable sized
   partition, all the way up to the full cluster. The size of each
   MARS's partition should be dynamically changeable. If such
   flexibility exists, each Active Sub-MARS can effectively become each
   other's Backup Sub-MARS. Shifting clients from a failed Active
   Sub-MARS to another Active Sub-MARS is load reconfiguration from the
   perspective of the Sub-MARSs, but is fault tolerant MARS service
   from the perspective of the clients.

   For example, assume this initial configuration:

         C1    C2           C3    C4
          |     |            |     |
         ----- M1 ------    ----- M2 -----
                |                  |
                --------------------

   M1 lists {M1, M2} in its MARS_REDIRECT_MAPs, and M2 lists {M2, M1}.
   The cluster members neither know nor care that the Backup MARS
   listed by their Active MARS is actually an Active MARS for another
   subset of the Cluster.

   If M1 failed, its partition of the cluster should collapse. C1 and
   C2 should re-register with M2, and the picture becomes:

         C1    C2           C3    C4
          |     |            |     |
         --------------------------- M2 -----

   All cluster members start receiving MARS_REDIRECT_MAPs from M2,
   listing {M2} as the sole MARS. Currently missing from this model is
   a mechanism for re-partitioning the cluster once M1 has recovered.
   M2 needs to get C1 and C2 to perform a soft-redirect (or hard, if
   appropriate) to M1, without losing C3 and C4.

   One way of avoiding this scenario is to provision enough Active
   Sub-MARSs for the desired load sharing, and then provide a pool of
   shared Backup Sub-MARSs such that the number of Active Sub-MARSs
   never changes and the cluster partitions never alter.
   The picture from section 4.2 might be redrawn:

         C1    C2           C3    C4
          |     |            |     |
         ----- M1 ------    ----- M2 -----
                |                  |
                --------------------
                |                  |
                M3                 M4

   In this case M1 lists {M1, M3, M4} in its MARS_REDIRECT_MAPs, and M2
   lists {M2, M3, M4}. If M1 fails, the cluster configures to:

         C1    C2           C3    C4
          |     |            |     |
         ----- M3 ------    ----- M2 -----
                |                  |
                --------------------
                                   |
                                   M4

   Now, if M3 stays up while M1 is recovering from its failure, there
   will be a period within which M3 lists {M3, M4} in its
   MARS_REDIRECT_MAPs, and M2 lists {M2, M4}. This implies that the
   failure of M1, and the promotion of M3 into the Active Sub-MARS set,
   causes M2 to re-evaluate the list of available Backup Sub-MARSs too.

   Then, when M1 is detected to be available again, M1 might be placed
   on the list of Backup Sub-MARSs. The cluster would be configured as:

         C1    C2           C3    C4
          |     |            |     |
         ----- M3 ------    ----- M2 -----
                |                  |
                --------------------
                |                  |
                M1                 M4

   M3 lists {M3, M1, M4} in its MARS_REDIRECT_MAPs, and M2 lists
   {M2, M4, M1}.

   Alternatively, as discussed in section 3, the failed MARS M1 may
   have some characteristics that make it preferred any time it is
   alive. So, M3 should only manage {C1, C2} until such time as M1 is
   detected alive again. M3 and M1 should then swap places, and inform
   the other Active Sub-MARSs.

   The difference between this scheme, and that described in section
   4.2, is that M3 and M4 are actually available to support either M1
   or M2's partitions. For example, if M1 and M2 failed simultaneously
   the cluster should rebuild itself to look like:

         C1    C2           C3    C4
          |     |            |     |
         ----- M3 ------    ----- M4 -----
                |                  |
                --------------------

   M1 and M2 must be careful to list a different sequence of Backup
   Sub-MARSs in their MARS_REDIRECT_MAPs. For example, if M1 listed
   {M1, M3, M4} and M2 listed {M2, M3, M4} the cluster would look like
   this after a simultaneous failure of M1 and M2:

         C1    C2           C3    C4
          |     |            |     |
         --------------------------- M3 -----
                                     |
                                     M4

   This is a bad situation, since (as noted earlier) we have no obvious
   mechanism to re-partition the cluster between the two available
   Sub-MARSs.

   (Another solution that is not entirely foolproof would be for the
   Active MARS to issue specifically targeted MARS_REDIRECT_MAP
   messages on the pt-pt VCs that each client has open to it. If C1 and
   C2 still had their pt-pt VCs open, e.g. after re-registration, M3
   could send them private MARS_REDIRECT_MAPs listing {M4, M3} as the
   list, forcing only C1 and C2 to re-direct. This approach requires
   further thought.)

4.4 Multicast Server interactions?

   One of the more complex aspects of a single MARS is its filtering of
   MARS_JOIN/LEAVE messages on ClusterControlVC in the presence of MCS
   supported groups (Section 6 of [3]). For an Active Sub-MARS to
   correctly filter the MARS_JOIN/LEAVE messages it may want to
   transmit on its local Sub-CCVC, it MUST know what groups are,
   cluster wide, being supported by an MCS. Since the MCS in question
   may have registered with another Active Sub-MARS, this implies that
   the Active Sub-MARSs must exchange timely information on MCS
   registrations and supported groups.
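   To make the dependency in section 4.4 concrete, the following sketch
   (illustrative only; all names are invented here) has each Active
   Sub-MARS keep a cluster wide set of MCS supported groups, updated
   both by its own MCS registrations and by those relayed from peer
   Sub-MARSs, and consult that set before propagating a MARS_JOIN on
   its local Sub-CCVC.

      # Illustrative Sub-MARS filtering of MARS_JOIN traffic (section 4.4).
      # Peer relaying is reduced to calling a method on the other Sub-MARSs.

      class ActiveSubMars:
          def __init__(self, peers=None):
              self.peers = list(peers or [])  # other Active Sub-MARSs
              self.mcs_groups = set()         # cluster wide MCS supported groups

          def local_mserv(self, group):
              # An MCS registered with *this* Sub-MARS for 'group'.
              self.mcs_groups.add(group)
              for peer in self.peers:
                  peer.remote_mserv(group)    # timely propagation to peers

          def remote_mserv(self, group):
              # Learnt from a peer Sub-MARS.
              self.mcs_groups.add(group)

          def handle_join(self, group, send_on_sub_ccvc):
              # Per Section 6 of [3], joins for MCS supported groups are
              # not propagated to cluster members on the (Sub-)CCVC.
              if group not in self.mcs_groups:
                  send_on_sub_ccvc(group)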
4.5 Key issues?

   Since MARS_JOIN/LEAVE traffic must propagate through every Active
   Sub-MARS, the 'load' being shared across the set of Active Sub-MARSs
   is VCC load rather than message processing load.

   Re-partitioning that involves increasing the number of Active
   Sub-MARSs has no obvious solution at this point.

   Since MARS_JOIN/LEAVE traffic must propagate through every Active
   Sub-MARS, a separate server cache synchronisation protocol covering
   group membership changes is probably not needed between Active
   Sub-MARSs.

   As for the purely fault tolerant models in section 3, CMI
   information needs to be propagated amongst Active and Backup
   Sub-MARSs.

   To ensure each Active Sub-MARS can filter the JOIN/LEAVE traffic it
   propagates on its Sub-CCVC, information on what groups are MCS
   supported MUST be distributed amongst them.

   Active Sub-MARSs should be aware at all times what the cluster wide
   group membership is for any given group, so they can answer
   MARS_REQUESTs from locally held information.

XX. Tradeoffs and simplifications.

   TBD. [i.e. why do one or the other; summarize the difficulties in
   doing both. Is there value in doing only one? Is fault tolerance
   more important than load sharing?]

XX. So how does SCSP help?

   TBD.

XX. The relationship between MARS and NHS entities.

   TBD. [e.g. they're not required to be co-resident; don't restrict
   your architecture to assume they will be, even if NHSs exist in your
   LIS for unicast. MARS has _no_ IP level visibility (except perhaps
   for SNMP access - not clear on how this would work).]

XX. Open Issues.

Security Considerations

   Security considerations are not addressed in this document.

Acknowledgments

   Jim Rubas and Anthony Gallo of IBM have helped clarify some points
   in this initial release, and will be co-authors on future releases.

Author's Address

   Grenville Armitage
   Bellcore, 445 South Street
   Morristown, NJ, 07960
   USA

   Email: gja@thumper.bellcore.com
   Ph. +1 201 829 2635

References

   [1] J. Luciani, G. Armitage, J. Halpern, "Server Cache
       Synchronization Protocol (SCSP) - NBMA", INTERNET DRAFT,
       draft-luciani-rolc-scsp-02.txt, April 1996.

   [2] J. Luciani, et al., "NBMA Next Hop Resolution Protocol (NHRP)",
       INTERNET DRAFT, draft-ietf-rolc-nhrp-08.txt, June 1996.

   [3] G. Armitage, "Support for Multicast over UNI 3.0/3.1 based ATM
       Networks.", Bellcore, INTERNET DRAFT,
       draft-ietf-ipatm-ipmc-12.txt, February 1996.