Internet-Draft                                      Grenville Armitage
                                                              Bellcore
                                                       July 12th, 1996


                  Issues affecting MARS Cluster Size


Status of this Memo

   This document was submitted to the IETF Internetworking over NBMA
   (ION) WG. Publication of this document does not imply acceptance by
   the ION WG of any ideas expressed within. Comments should be
   submitted to the ion@nexen.com mailing list. Distribution of this
   memo is unlimited.

   This memo is an Internet Draft. Internet Drafts are working
   documents of the Internet Engineering Task Force (IETF), its Areas,
   and its Working Groups. Note that other groups may also distribute
   working documents as Internet Drafts.

   Internet Drafts are draft documents valid for a maximum of six
   months. Internet Drafts may be updated, replaced, or obsoleted by
   other documents at any time. It is not appropriate to use Internet
   Drafts as reference material or to cite them other than as a
   "working draft" or "work in progress".

   Please check the 1id-abstracts.txt listing contained in the
   internet-drafts shadow directories on ds.internic.net (US East
   Coast), nic.nordu.net (Europe), ftp.isi.edu (US West Coast), or
   munnari.oz.au (Pacific Rim) to learn the current status of any
   Internet Draft.

Abstract

   IP multicast over ATM currently uses the MARS model [1] to manage
   the use of ATM pt-mpt SVCs for IP multicast packet forwarding. The
   scope of any given MARS service is the MARS Cluster - typically the
   same as an IPv4 Logical IP Subnet (LIS). Current IP/ATM networks
   are usually architected with unicast routing and forwarding issues
   dictating the sizes of individual LISes. However, as IP multicast
   is deployed as a service, the size of a LIS will be limited by how
   large a MARS Cluster can be. This document looks at the issues that
   will constrain MARS Cluster size, and at why large scale IP over
   ATM networks might preferably be built with many small Clusters
   rather than a few large Clusters.

1. Introduction

   A MARS Cluster is the set of IP/ATM interfaces that are willing to
   engage in direct, ATM level pt-mpt SVCs to perform IP multicast
   packet forwarding [1]. Each IP/ATM interface (a MARS Client) must
   keep state information regarding the ATM addresses of each leaf
   node (recipient) of each pt-mpt SVC it has open. In addition, each
   MARS Client receives MARS_JOIN and MARS_LEAVE messages from the
   MARS whenever Clients around the Cluster need to update their
   pt-mpt SVCs for a given IP multicast group.

   The definition of Cluster 'size' can mean two things - the number
   of MARS Clients using a given MARS, and the geographic distribution
   of MARS Clients. The number of MARS Clients in a Cluster impacts on
   the amount of state information any given Client may need to store
   while managing outgoing pt-mpt SVCs. It also impacts on the average
   rate of JOIN/LEAVE traffic that is propagated by the MARS on
   ClusterControlVC, and on the number of pt-mpt VCs that may need
   modification each time a MARS_JOIN or MARS_LEAVE appears on
   ClusterControlVC.

   The geographic distribution of Clients impacts on the latency
   between a Client issuing a MARS_JOIN and that Client finally being
   added onto the pt-mpt VCs of the other MARS Clients transmitting
   to the specified multicast group. (This latency is made up of both
   the time to propagate the MARS_JOIN, and the delay in the
   underlying ATM cloud's reaction to the subsequent ADD_PARTY
   messages.)
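   As a purely illustrative aside (not part of [1]), the following
   Python sketch shows the kind of per-group state a MARS Client
   might hold for an outgoing pt-mpt SVC, and how a MARS_JOIN or
   MARS_LEAVE seen on ClusterControlVC maps onto ADD_PARTY and
   DROP_PARTY activity. All names and the two signaling stubs are
   hypothetical.

      # Hypothetical sketch of per-group MARS Client state. The two
      # stubs stand in for the host's UNI 3.0/3.1 signaling interface.

      def signal_add_party(group, atm_addr):
          pass   # placeholder: issue ADD_PARTY to the local switch

      def signal_drop_party(group, atm_addr):
          pass   # placeholder: issue DROP_PARTY to the local switch

      class OutgoingGroupSVC:
          def __init__(self, group):
              self.group = group    # IP multicast group address
              self.leaves = set()   # ATM address of every current leaf node

          def on_mars_join(self, atm_addr):
              # Every leaf added here consumes state in the Client, the
              # ATM NIC, and the local switch (see section 2).
              if atm_addr not in self.leaves:
                  self.leaves.add(atm_addr)
                  signal_add_party(self.group, atm_addr)

          def on_mars_leave(self, atm_addr):
              if atm_addr in self.leaves:
                  self.leaves.discard(atm_addr)
                  signal_drop_party(self.group, atm_addr)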
2. Limitations on state storage

   A Cluster should not contain more MARS Clients than the maximum
   number of leaf nodes supportable by the most limited member of the
   Cluster. Two items are affected by this limitation:

   -  ClusterControlVC from the MARS. It has a leaf node per Cluster
      member (MARS Client). This limitation applies only to the node
      supporting the MARS itself.

   -  Packet forwarding SVCs out of each MARS Client for each IP
      multicast group being sent to. The number of MARS Clients that
      may choose to be members of a given group may encompass every
      MARS Client in the Cluster.

   Under UNI 3.0/3.1 the most obvious limit on the size of a Cluster
   is the 2^15 leaf nodes that can be added to a pt-mpt SVC. However,
   in practice most ATM NICs (and probably switches) are going to
   impose a limit much lower than this - a function of how much
   per-leaf node state information they need to store (and are
   capable of storing) for pt-mpt SVCs. A MARS Client may impose its
   own state storage limitations, such that the combined memory
   consumption of a MARS Client and the ATM NIC in a given host limits
   the Client to fewer leaf nodes than the ATM NIC alone might have
   been able to support. Limitations of the switch to which a MARS or
   MARS Client is directly attached may also impose a lower limit on
   leaf nodes than that of the MARS, MARS Client, or ATM NIC. Cluster
   size is limited by the most constraining of these limits.

   It may be possible to work around leaf node limits by distributing
   the leaf nodes across multiple pt-mpt SVCs operating in parallel.
   However, such an approach requires further study, and is unlikely
   to be a useful workaround for Client or NIC based limitations.

   A related observation is that the number of MARS Clients in a
   Cluster may also be limited by the memory constraints of the MARS
   itself. The MARS is required to keep state on all the groups that
   every one of its MARS Clients has joined. For a given memory limit,
   the maximum number of MARS Clients must drop if the average number
   of groups joined per Client rises. Depending on the level of group
   memberships, this limitation may be more severe than the pt-mpt
   leaf node limits.

3. Signaling load

   In any given Cluster there will be an 'ambient' level of
   MARS_JOIN/LEAVE activity. What that level actually is depends on
   the types of multicast applications running on the majority of the
   hosts in the Cluster. It is reasonable to assume that as the number
   of MARS Clients in a given Cluster rises, so does the ambient level
   of MARS_JOIN/LEAVE activity that the MARS receives and propagates
   out on ClusterControlVC.

   The existence of MARS_JOIN/LEAVE traffic also has a consequential
   impact on signaling activity at the ATM level (across the UNI and
   {P}NNI boundaries). For groups that are VC Mesh supported, each
   MARS_JOIN or MARS_LEAVE propagated on ClusterControlVC will result
   in an ADD_PARTY or DROP_PARTY message sent across the UNIs of all
   MARS Clients that are transmitting to the given group. As a
   Cluster's membership increases, so does the average number of MARS
   Clients that trigger ATM signaling activity in response to each
   MARS_JOIN or MARS_LEAVE.
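   This scaling can be made concrete with a rough, back-of-the-
   envelope calculation, sketched below. Every figure in it is an
   assumption chosen purely for illustration, not a measured or
   recommended value.

      # Illustrative estimate of ambient signaling load for VC Mesh
      # supported groups. All input figures are assumptions.

      n_clients        = 1000    # MARS Clients in the Cluster
      joins_per_client = 0.01    # MARS_JOIN/LEAVEs per Client per second
      avg_senders      = 50      # Clients transmitting to a typical group

      # Messages the MARS propagates on ClusterControlVC each second.
      mars_msgs_per_sec = n_clients * joins_per_client

      # Each propagated MARS_JOIN/LEAVE triggers one ADD_PARTY or
      # DROP_PARTY at every Client currently transmitting to the group.
      uni_msgs_per_sec = mars_msgs_per_sec * avg_senders

      print(mars_msgs_per_sec, "msgs/sec on ClusterControlVC")
      print(uni_msgs_per_sec, "ADD_PARTY/DROP_PARTYs per sec cluster-wide")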
   The size of a Cluster needs to be chosen to provide some level of
   containment of this ambient level of MARS and UNI/NNI signaling.

   Some refinements to MARS Client behaviour may also be explored to
   smooth out UNI signaling transients. The MARS spec currently
   requires that revalidation of group memberships occurs only when
   the Client starts sending new packets to an invalidated group SVC.
   A Client could apply a similar algorithm to decide when it should
   issue ADD_PARTYs after seeing a MARS_JOIN - wait until it actually
   has a packet to send, send the packet, then initiate the ADD_PARTY.
   As a result, actively transmitting Clients would update their SVCs
   sooner than intermittently transmitting Clients. This requires
   careful implementation of the Client state machine.

4. Group change latencies

   The group change latency can be defined as the time it takes for
   all the senders to a group to have correctly updated their
   forwarding SVCs after a MARS_JOIN or MARS_LEAVE is received from
   the MARS. This is affected by both the number of Cluster members
   and the geographical distribution of Cluster members.

   The number of Cluster members affects the ATM level signaling load
   offered as soon as a MARS_JOIN or MARS_LEAVE is seen. If the load
   is high, the ATM cloud itself may suffer slow processing of the
   various SVC modifications that are being requested.

   Wide geographic distribution of Cluster members delays the
   propagation of MARS_JOIN/LEAVE and ATM UNI/NNI messages. The
   further apart various members are, the longer it takes for them to
   receive MARS_JOIN/LEAVE traffic on ClusterControlVC, and the longer
   it takes for the ATM network to react to ADD_PARTY and DROP_PARTY
   requests. If the long distance paths are populated by many ATM
   switches, delays due to per-switch processing will add
   substantially to delays due to the speed of light.

   Unfortunately, some of the mechanisms for smoothing out the
   transient ATM signaling load described in section 3 have the
   consequence of increasing the group change latency (since the goal
   is for some of the senders to deliberately delay updating their
   forwarding SVCs).

   A related effect will also be felt by the MARS itself. The larger
   the MARS database, the longer it may take to process
   MARS_JOIN/LEAVE messages (which involve locating and updating
   individual group entries). Whilst this issue may not be important
   for conferencing applications (with group membership changes on a
   human time frame), high speed simulation environments may find
   such considerations important.

5. Large IP/ATM networks using Mrouters

   Building a large scale, multicast capable IP over ATM network is a
   tradeoff between Cluster sizes and numbers of Mrouters. For a given
   number of hosts across the entire IP/ATM network, as Cluster sizes
   drop, more Clusters are needed. Clusters must be interconnected by
   Mrouters, so the number of Mrouters rises. (The actual rise in the
   number of Mrouters depends largely on the logical IP topology you
   choose to implement, since a single physical Mrouter may
   interconnect more than two Clusters at once.) It is a local
   deployment question as to what the optimal mix of Clusters and
   Mrouters will be.
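   The tradeoff can be illustrated with some simple arithmetic. The
   figures and the two topology cases below are hypothetical; the
   point is only that the Cluster count, and hence the Mrouter count,
   rises as the chosen Cluster size falls.

      # Hypothetical Cluster size / Mrouter count tradeoff.
      import math

      total_hosts      = 10000   # hosts across the whole IP/ATM network
      max_cluster_size = 2500    # chosen Cluster size (see sections 2-4)

      n_clusters = math.ceil(total_hosts / max_cluster_size)

      # The Mrouter count depends on the logical IP topology chosen: a
      # simple chain of Clusters needs one Mrouter per adjacent pair,
      # while one physical Mrouter attached to every Cluster needs one.
      mrouters_chain = n_clusters - 1
      mrouters_hub   = 1

      print(n_clusters, "Clusters; between", mrouters_hub,
            "and", mrouters_chain, "Mrouters")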
   A constructive way to view conventional Mrouters is as aggregation
   points for signaling and data plane loads. An Mrouter hides group
   membership changes in one Cluster from senders within other
   Clusters, and protects local group members from being swamped by
   SVCs from senders in other Clusters. MARS_JOIN/LEAVE traffic in one
   Cluster is hidden from the members of all other Clusters. (The
   consequential UNI signaling load is localized to the source Cluster
   too.) Group members in a Cluster are fed packets from an SVC
   originating on the MARS Client residing in their local Mrouter,
   rather than terminating multiple SVCs originating on the actual
   senders in remote Clusters.

   As a side effect of the Mrouter's role in aggregating data path
   flows, it reduces the impact of SVC leaf-node limits. A
   hypothetical 10000 node Cluster could be broken into two 5000 node
   Clusters, or four 2500 node Clusters. In each case the individual
   Cluster members need only source pt-mpt SVCs with maximums of 5000
   or 2500 leaf nodes respectively.

6. Large IP/ATM networks using Cell Switch Routers (CSRs)

   A Cell Switch Router (CSR) may act as a conventional Mrouter, and
   provide all the benefits described in the previous section.
   However, one of the useful characteristics of the CSR is the
   ability to internally 'short-cut' cells from an incoming VCC to an
   outgoing VCC. Once the CSR has identified a flow of IP traffic, and
   associated it with an inbound and outbound VCC, it begins to
   function as an ATM cell level device rather than a packet level
   device.

   Even when operating in 'short-cut' mode the CSR is still able to
   protect Clusters from the MARS_JOIN/LEAVE activities of surrounding
   Clusters. From the perspective of the Clusters to which the CSR is
   directly attached, the CSR terminates and originates pt-mpt SVCs.
   It acts as the path out of a source Cluster, and the entry point
   into a target Cluster. It remains unnecessary for senders in one
   Cluster to issue ADD_PARTY or DROP_PARTY messages in response to
   group membership changes in other Clusters - the CSR tracks these
   changes, and updates the pt-mpt trees rooted on its own ATM ports
   as needed.

   However, there is one significant point of difference from a
   conventional Mrouter - a simple CSR cannot aggregate the packet
   flows from multiple senders in one Cluster onto a single SVC into
   an adjacent Cluster. Within a Cluster with multiple sources, the
   CSR is a leaf node on an individual SVC per source (just like a
   conventional Mrouter). But if it chooses to 'short-cut' traffic at
   the cell level to group members in another Cluster, it must
   construct a separate forwarding SVC into the target Cluster to
   match each VCC from each sender in the source Cluster. This
   requirement stems from the need to maintain AAL_SDU boundaries at
   the ultimate recipients - the group members in the target Cluster.
   If the cells from individual senders in the source Cluster were
   FIFO merged onto a single outgoing SVC into the target Cluster,
   recipients in the target Cluster would have a hard time
   reconstructing individual AAL_SDUs from the interleaved cells.
   (This is mostly a consequence of our use of AAL5. AAL3/4 could
   provide a solution using the MID field, although we would be
   limited to 2^10 senders per Cluster and would introduce a MID
   management problem.)

   Interestingly, this problem can magnify the UNI signaling load
   offered within the target Cluster whenever a new group member
   arrives. If there are N senders in the source Cluster, the CSR
   will have built N identical pt-mpt SVCs out to the group members
   within the target Cluster. If a new MARS_JOIN is issued within the
   target Cluster, the CSR must issue N ADD_PARTYs to update its SVCs
   into the target Cluster. (Under similar circumstances a
   conventional Mrouter would have issued only one ADD_PARTY for its
   single SVC into the target Cluster.)
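   The difference in target-Cluster signaling load can be summarized
   with a trivial, purely illustrative comparison; the sender count
   below is an assumption.

      # Hypothetical count of ADD_PARTYs triggered in the target
      # Cluster by one new group member joining there.

      n_senders = 20   # senders in the source Cluster (assumed)

      # A short-cutting CSR keeps one pt-mpt SVC per source-side sender
      # (to preserve AAL_SDU boundaries), so every SVC needs its own
      # ADD_PARTY when the new member joins.
      add_parties_csr = n_senders

      # A conventional Mrouter re-originates traffic on a single SVC,
      # so the same join costs exactly one ADD_PARTY.
      add_parties_mrouter = 1

      print("CSR short-cut:", add_parties_csr,
            "ADD_PARTYs; conventional Mrouter:", add_parties_mrouter)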
   A possible solution is for the CSR's underlying cell switching
   fabric to provide AAL_SDU-aware cell forwarding. If segmented
   AAL_SDUs arriving from the source Cluster could be buffered and
   forwarded in groups of cells representing entire AAL_SDUs, the CSR
   would need only a single SVC into the target Cluster. Its impact on
   the Clusters it was attached to would then be the same as that of a
   conventional Mrouter. (This does not necessarily imply full
   re-assembly followed by segmentation. It would be sufficient for
   the incoming cells to be buffered in sequence, and then fed onto
   the outbound SVC. The CSR's switch fabric would not be performing
   any AAL level checks other than detecting AAL_SDU boundaries.)

7. The impact of Multicast Servers (MCSs)

   The MCS has an intra-Cluster effect somewhat analogous to the
   inter-Cluster effect of the Mrouter. It aggregates AAL_SDU flows
   around the Cluster into a single pt-mpt SVC. This single pt-mpt SVC
   is the only one that needs to be updated when an intra-Cluster
   group membership change occurs. It also reduces the amount of
   MARS_JOIN/LEAVE traffic on ClusterControlVC - such messages for MCS
   supported groups are propagated out on ServerControlVC, thus
   interrupting only the (presumably smaller) set of MCSes attached to
   the MARS.

   One way to look at an MCS is as a stripped-down Mrouter, operating
   intra-Cluster and performing minimal (if any) forwarding decisions
   based on IP level information. Whether the use of MCSs allows
   larger Clusters to be deployed depends on the mix of MCS supported
   groups and VC Mesh supported groups within a given Cluster.

8. Conclusion

   This short document has provided a high level overview of the
   parameters affecting the size of MARS Clusters within multicast
   capable IP/ATM networks. Limitations on the number of leaf nodes a
   pt-mpt SVC may support, the size of the MARS database, the
   propagation delays of MARS and UNI messages, and the frequency of
   MARS and UNI control messages are all identified as issues that
   will constrain Clusters. Mrouters (either conventional or in Cell
   Switch Router form) were identified as useful aggregators of IP
   multicast traffic and signaling information. Large scale IP
   multicasting over ATM requires a combination of Mrouters and
   appropriately sized MARS Clusters.

Security Considerations

   Security considerations are not addressed in this document.

Acknowledgments

Author's Address

   Grenville Armitage
   Bellcore, 445 South Street
   Morristown, NJ, 07960
   USA

   Email: gja@thumper.bellcore.com
   Ph. +1 201 829 2635

References

   [1] G. Armitage, "Support for Multicast over UNI 3.0/3.1 based ATM
       Networks", Bellcore, Internet Draft,
       draft-ietf-ipatm-ipmc-12.txt, February 1996.