NVO3 working group                                  L. Dunbar
Internet Draft                                    D. Eastlake
Category: Standards Track                              Huawei
Expires: November 2016                           Tom Herbert
                                                      Google

                                            October 19, 2015


           NVA Address Mapping Distribution (NAMD) Protocol

           draft-dunbar-nvo3-nva-mapping-distribution-02.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79. This document may not be
   modified, and derivative works of it may not be created, except
   to publish it as an RFC and to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-
   Drafts as reference material or to cite them other than as "work
   in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on April 19, 2016.






Dunbar, et al         Expires November 2016           [Page 1]

Internet-Draft    NVA mapping distribution



Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided
   without warranty as described in the Simplified BSD License.


Abstract

   This draft describes a mechanism for an NVA to promptly and
   incrementally distribute the inner (TS) to outer (NVE) address
   mapping and VN Context to the relevant NVEs.

Table of Contents


   1. Introduction
   2. Terminology
   3. Overall Requirement for NVE<->NVA Control Plane
   4. Terminologies and Assumptions
   5. Overview of NVA Address Mapping Distribution (NAMD) Protocol
   6. TLV for NVE reachable addresses
   7. Push Mechanism
      7.1. Requesting Push Service
      7.2. Incremental Push Service
   8. Pull Mechanism
      8.1. Pull Query Format
      8.2. Pull Response
      8.3. Cache Consistency
      8.4. Update Message Format
      8.5. Acknowledge Message Format
      8.6. Pull Request Errors
      8.7. Redundant Pull NVAs
   9. Hybrid Mode
   10. Redundancy
   11. Inconsistency Processing
   12. Protocols to consider to carry NAMD messages
   13. Security Considerations
   14. IANA Considerations
   15. Acknowledgements
   16. References
      16.1. Normative References
      16.2. Informative References
   Authors' Addresses


1. Introduction

   Section 4.5 of [nvo3-problem-statement] describes the back-end
   Network Virtualization Authority (NVA) that is responsible for
   distributing the mapping information for the entire overlay
   system. [nvo3-nve-nva-cp-req] defines the requirements for the
   control plane between the NVA and the NVE.

   This draft describes a mechanism for an NVA to promptly and
   incrementally distribute the inner (TS) to outer (NVE) address
   mapping and VN Context to the relevant NVEs.



   For ease of description, the term "NAMD" is used to represent the
   NVA Address Mapping Distribution protocol.

2. Terminology

   The following terms are used interchangeably in this document:

            - The terms "Subnet" and "VLAN", because it is common to
               map one subnet to one VLAN.
            - The terms "Directory" and "Network Virtualization
               Authority (NVA)".
            - The terms "NVE" and "Edge".


   Bridge:  IEEE Std 802.1Q-2011 compliant device [802.1Q]. In this
            draft, Bridge is used interchangeably with Layer 2
            switch.

   NAMD Timeout:  The time interval after which an NVE can assume
            that the NVA is not reachable if the NVE hasn't received
            any update from the NVA during that interval. NAMD
            Timeout is an unsigned byte that gives the amount of
            time, in seconds, during which the NVA will send at
            least three update PDUs. An empty update is used as a
            keep-alive. It defaults to 30 seconds.

   DA:      Destination Address

   DC:      Data Center

   EoR:     End of Row switches in data center. Also known as
            aggregation switches.



   End Station:      Guest OS running on a physical server or on a
            virtual machine. An end station in this document has at
            least one IP address and at least one MAC address, which
            could be in DA or SA field of a data frame.

   LISP:    Locator/ID Separation Protocol

   NVA:     Network Virtualization Authority

   NVE:     Network Virtualization Edge

   SA:      Source Address

   Station: A node, or a virtual node, with IP and/or MAC addresses,
            which could be in the DA or SA of a data frame.

   ToR:     Top of Rack switch in a data center. Also known as an
            access switch in some data centers.

   TS:      Tenant System

   VM:      Virtual Machine

   VN:      Virtual Network

   VNID:    Virtual Network Instance Identifier



3. Overall Requirement for NVE<->NVA Control Plane

   Section 3.1 of [nvo3-cp-req] describes the basic requirement of
   inner address to outer address mapping for NVO3.  An NVE needs to
   know the mapping of the Tenant System destination (inner) address
   to the (outer) IP address, on the Underlying Network, of the
   egress NVE.

   Section 3.1 of [nvo3-cp-req] states that a protocol is needed to
   provide this inner to outer mapping and VN Context to each NVE
   that requires it, and to keep the mapping updated in a timely
   manner. Timely updates are important for maintaining connectivity
   between Tenant Systems.








4. Terminologies and Assumptions

   NVAs can be centralized or distributed, with each NVA holding the
   mapping information for a subset of VNs. Saying that an NVA holds
   mapping information for a VN means that the NVA has mapping
   information for all the TSs in the VN.

   A centralized NVA holds mapping information for all the VNs in
   the administrative domain. There could be multiple instances of a
   centralized NVA for redundancy purposes.

   An NVA could be instantiated on a server/VM attached to an NVE,
   very much like a TS attached to an NVE, or could be integrated
   within an NVE. When an NVA is a standalone server/VM attached to
   an NVE, it has to be reachable by other NVEs via the attached
   NVE. An NVA can also be instantiated on an NVE that doesn't have
   any TSs attached. An NVA that is attached to an NVE (like a VM)
   requires more NVE-NVA control plane functions on the NVEs than an
   NVA that is embedded in an NVE.

   An NVA should have at least the following information for each
   TS:

      . Inner Address: the TS (host) address and its address family
        (IPv4/IPv6, MAC, virtual network identifier such as
        MPLS/VLAN, etc.).

      . Outer Address: the list of locally attached edges (NVEs).
        Normally one TS is attached to one edge. A TS can also be
        attached to 2 edges for redundancy (dual homing). A TS is
        rarely attached to more than 2 edges, though it is possible.

      . VN Context (VN ID and/or VN Name).

      . Timer for NVEs to keep the entry when it is pushed down to,
        or pulled by, NVEs.

      . Optionally, the list of interested remote edges (NVEs). This
        information allows the NVA to promptly update the relevant
        edges (NVEs) when there is any change to this TS's
        attachment to edges (NVEs). However, this information
        doesn't have to be kept per TS. It can be kept per VN.
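
   As an illustration only, the per-TS information listed above
   could be held in a record such as the following Python sketch.
   The class name, field names, and the 300 second default timer are
   assumptions made for this example and are not defined by this
   document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TsMappingEntry:
    """Hypothetical per-TS record kept by an NVA (names are illustrative)."""
    inner_address: str                 # TS (host) address, e.g. a MAC or IP
    address_family: str                # e.g. "IPv4", "IPv6", "MAC"
    outer_addresses: List[str]         # attached NVEs; two when dual-homed
    vn_id: Optional[int] = None        # VN Context: VN ID ...
    vn_name: Optional[str] = None      # ... and/or VN Name
    cache_timer_s: int = 300           # assumed default, not from the draft
    interested_nves: List[str] = field(default_factory=list)  # optional

    def __post_init__(self) -> None:
        # The VN Context is a VN ID and/or a VN Name, so one must be present.
        if self.vn_id is None and self.vn_name is None:
            raise ValueError("VN Context needs a VN ID and/or a VN Name")

    @property
    def is_multihomed(self) -> bool:
        return len(self.outer_addresses) > 1


entry = TsMappingEntry("00:11:22:33:44:55", "MAC",
                       ["192.0.2.1", "192.0.2.2"], vn_id=5000)
```

   Keeping the interested-NVE list per VN instead of per TS, as the
   text allows, would simply move that field to a per-VN structure.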

   Saying that an NVE is participating in a VN, or that the VN is
   active on the NVE, means that the VN is enabled on the NVE and
   that at least one TS of the VN is attached to the NVE.





5. Overview of NVA Address Mapping Distribution (NAMD) Protocol

   The inner-outer address mapping can change as TSs move from one
   NVE to another. In any given period, probably only a small set of
   TSs move, resulting in changes to a small portion of the
   inner-outer address mapping. Therefore, it is important to have a
   mechanism for the NVA to send incremental updates for the changes
   to NVEs, instead of the entire database of mapping entries. This
   document specifies the incremental update messages (TLVs) from
   NVAs to NVEs, to maintain data consistency between NVAs and NVEs.

   The NAMD mechanism requires messages to distribute NVA content to
   all the NVEs, to inform the relevant NVEs of incremental changes,
   and to maintain database consistency between the NVA and NVEs.
   This document specifies the structures (a.k.a. TLVs) of those
   messages, which are referred to as NAMD messages throughout this
   document.  The NAMD TLVs can be included in BGP or IGP protocol
   messages. How they are integrated with BGP or the IGP will be
   further specified in the corresponding working groups.

   An NVA can offer services in a Push model, a Pull model, or a
   combination of the two.

   In the Push model, the NVE, upon restart or initialization, sends
   requests for all its interested VNs as a multicast to all the
   NVAs. NVAs with the requested VNs use NAMD messages to distribute
   the mapping entries to the requesting NVEs. Whenever there are
   changes in the mapping entries, the NVA uses NAMD messages to
   send only the changed portion of the entries.
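
   The idea of sending only the changed portion can be sketched as
   a comparison of two snapshots of the mapping table. The function
   name and the dictionary representation, keyed by (VN ID, inner
   address), are assumptions made for this example; the draft does
   not prescribe how an NVA computes its incremental updates.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, str]          # (VN ID, inner address), illustrative only


def diff_mappings(old: Dict[Key, str],
                  new: Dict[Key, str]) -> Tuple[Dict[Key, str], List[Key]]:
    """Return (entries to add or update, entries to withdraw)."""
    # New or changed inner-to-outer mappings.
    upserts = {k: v for k, v in new.items() if old.get(k) != v}
    # Mappings that existed before and are gone now.
    withdrawals = [k for k in old if k not in new]
    return upserts, withdrawals


old = {(1, "ts-a"): "nve-1", (1, "ts-b"): "nve-2"}
new = {(1, "ts-a"): "nve-3", (1, "ts-c"): "nve-2"}
upserts, withdrawals = diff_mappings(old, new)
```

   In this example ts-a moved to another NVE, ts-c was added, and
   ts-b was removed, so only those three changes would be carried
   in NAMD messages.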

   In the Pull model, an NVA periodically sends VN scoped broadcast
   messages to all NVEs. An NVE, upon receiving an unknown unicast
   frame or an ARP/ND request with an unknown target NVE, sends a
   pull request to the NVA that supports the VN that the target
   belongs to.



6. TLV for NVE reachable addresses

   The Reachable Interface Addresses (IA) TLV is used to advertise a
   set of addresses within a VN being attached to (or reachable by)
   a specific NVE, and optionally the NVE Virtual Access Point.

   These addresses can be in different address families. For
   example, it can be used to declare that a particular interface
   with specified IPv4, IPv6, and 48-bit MAC addresses in some
   particular VN is reachable from a particular NVE.



   This document suggests using the Interface Addresses APPsub-TLV
   defined by [IA], except that the fourth field, shown below,
   carries an NVE Address subTLV:

      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Type = TBD                    |  (2 bytes)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Length                        |  (2 bytes)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Addr Sets End                 |  (2 bytes)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | NVE Address subTLV   ...         (variable)
      +-+-+-+-+-+-+-+-+-+-+-+-
      | Flags         |                  (1 byte)
      +-+-+-+-+-+-+-+-+
      | Confidence    |                  (1 byte)
      +-+-+-+-+-+-+-+-+-+-
      | Template ...                     (variable)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+
      | Address Set 1    (size determined by Template)    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+
      | Address Set 2    (size determined by Template)    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+
      |   ...
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+
      | Address Set N    (size determined by Template)    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...-+
      | optional sub-sub-TLVs ...
      +-+-+-+-+-+-+-+-+-+-+-+-...

     Figure 1. The Interface Addresses APPsub-TLV

   Addr Sets End: The unsigned integer offset of the byte, within
   the IA APPsub-TLV [IA] value part, of the last byte of the last
   Address Set. This will be the byte just before the first sub-sub-
   TLV if any sub-sub-TLVs are present (see Section 3). If this is
   equal to Length, there are no sub-sub-TLVs. If this is greater
   than Length or points to before the end of the Template, the IA
   APPsub-TLV is corrupt and MUST be discarded. This field is always
   two bytes in size.
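
   The checks on Addr Sets End described above can be sketched as
   follows. The function name is hypothetical, and the sketch
   assumes that all three arguments are byte offsets within the
   value part of the APPsub-TLV, with template_end being the offset
   of the last byte of the Template.

```python
def has_sub_sub_tlvs(length: int, addr_sets_end: int,
                     template_end: int) -> bool:
    """Validate Addr Sets End; return True if sub-sub-TLVs follow.

    Raises ValueError when the APPsub-TLV is corrupt and has to be
    discarded.
    """
    # Greater than Length, or pointing before the end of the Template.
    if addr_sets_end > length or addr_sets_end < template_end:
        raise ValueError("corrupt IA APPsub-TLV: discard")
    # Equal to Length means the Address Sets run to the end of the value.
    return addr_sets_end != length
```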



7. Push Mechanism

   Under this mode, the NVA pushes the inner-outer mapping for all
   the TSs of the VNs to the relevant NVEs. This service is scoped
   by VN. A Push NVA also advertises whether or not it believes it
   has pushed complete mapping information for a VN. It might be
   pushing only a subset of the mapping and/or reachability
   information for a VN. The Push Model uses the NAMD messages as
   its distribution mechanism.

   With the Push model, if the destination of a data frame arriving
   at the Ingress NVE can't be found in the inner-outer mapping
   database pushed down from the NVA, the Ingress NVE can be
   configured with one or more of the following policies:

          - simply drop the data frame,
          - flood the data frame to other NVEs that have the VN
             enabled, or
          - start the "pull" process to get information from a Pull
             NVA. While the NVE is waiting for the reply from the
             Pull process, it can either drop or queue the packet.
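
   A minimal sketch of these policies is shown below. It handles a
   single configured policy per lookup miss, although the text
   allows more than one; the enum, the function name, and the list
   used as a queue are assumptions made for this example.

```python
from enum import Enum
from typing import List, Tuple


class MissPolicy(Enum):
    DROP = "drop"
    FLOOD = "flood"
    PULL = "pull"


def handle_miss(policy: MissPolicy, frame: bytes, vn_nves: List[str],
                pull_queue: List[bytes],
                queue_while_pulling: bool = True) -> List[Tuple[str, bytes]]:
    """Return the (NVE, frame) pairs to transmit right away."""
    if policy is MissPolicy.FLOOD:
        # Flood to every other NVE that has the VN enabled.
        return [(nve, frame) for nve in vn_nves]
    if policy is MissPolicy.PULL and queue_while_pulling:
        # Hold the frame until the Pull NVA replies; the pull itself
        # would be started by the caller.
        pull_queue.append(frame)
    # DROP, or PULL without queuing: nothing is sent now.
    return []
```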


   One drawback of the Push Mode is that it pushes more mapping
   entries to an NVE than the NVE needs. Under the normal process of
   edge cache aging and unknown destination address flooding, rarely
   used entries would have been removed. With the Push Mode, it is
   difficult for the NVA to predict the communication patterns
   from/to the TSs within one VN. Therefore, it is likely that the
   NVA will push down all the entries for all the VNs that are
   enabled on the NVE.

   Another drawback of the Push model is that there can't be any
   source-based policy: it is all or nothing.

7.1. Requesting Push Service

   When an NVE is initialized or restarted, it needs to send a
   request to the relevant NVAs to push down the mapping information
   for the active VNs on the NVE. The NVE can use a Virtual Network
   scoped message to announce all the Virtual Networks in which it
   is participating to the NVAs that have the mapping information
   for those VNs. A new subTLV (Enabled-VN TLV) is specified here
   for the NVE to indicate all its interested VNs in the NAMD
   message. The new subTLV can be included in an IGP protocol
   message or a BGP message.

   For a 24-bit VN ID, there can be up to 16 million VNs. The
   interested VNs can be expressed in several ways:

  - Starting VN, End VN, and a bit map for the VNs in between.
  - Starting VN and End VN (for VNs that are contiguous).
  - Individual VN listing (for a small number of VNs that are not
     contiguous).


   Therefore, 3 different types of subTLV are specified:



      +-+-+-+-+-+-+-+-+
      |INT-VN-TYPE-1  |                  (1 byte)
      +-+-+-+-+-+-+-+-+
      |   Length      |                  (1 byte)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |          Start VN ID          |  (4 bytes)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | VNID bit-map....
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Figure 2. Enabled-VN TLV using bit map



      +-+-+-+-+-+-+-+-+
      | INT-VN-TYPE-2 |                  (1 byte)
      +-+-+-+-+-+-+-+-+
      |   Length      |                  (1 byte)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |          Start VN ID          |  (4 bytes)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |          End VN ID            |  (4 bytes)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   Figure 3. Enabled-VN TLV using Range


      +-+-+-+-+-+-+-+-+
      | INT-VN-TYPE-3 |                  (1 byte)
      +-+-+-+-+-+-+-+-+
      |   Length      |                  (1 byte)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                VN ID          |  (4 bytes)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                VN ID          |  (4 bytes)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                VN ID          |  (4 bytes)
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |     .   .   .
      +-+-+-+-+-+-+-+-+-+-+-+-+
   Figure 4. Enabled-VN TLV using list


   -  Type: indicates how the VNs in which the NVE is participating
   are expressed: INT-VN-TYPE-1 uses a bit map to express the
   interested VNs; INT-VN-TYPE-2 uses a range to express the
   interested VNs (if the interested VNs are contiguous);
   INT-VN-TYPE-3 uses an individual VN list to express the
   interested VNs.

   -  Length: Variable.

   -  RESV: 4 reserved bits that MUST be sent as zero and ignored on
   receipt.

   -  Start VN ID: The 24-bit VN-ID that is represented by the high-
   order bit of the first byte of the VN-ID bit-map.

   -  VN-ID bit-map: The highest-order bit indicates the VN equal to
   the start VN ID, the next highest bit indicates the VN equal to
   start VN ID + 1, continuing to the end of the VN bit-map field.

   If this sub-TLV occurs more than once in a NAMD message, the set
   of enabled VNs is the union of the sets of VNs indicated by each
   of the Enabled-VN sub-TLVs in the message.
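
   The three encodings can be sketched in Python as shown below.
   The numeric type codes are placeholders, since this document does
   not assign values to INT-VN-TYPE-1/2/3. The sketch also assumes
   that Length counts the bytes that follow the Length field, that
   the range form is inclusive of both ends, and that each VN ID is
   carried in a 4-byte field in network byte order.

```python
import struct
from typing import Iterable, Set

INT_VN_TYPE_1 = 1   # bit map (placeholder code point)
INT_VN_TYPE_2 = 2   # range   (placeholder code point)
INT_VN_TYPE_3 = 3   # list    (placeholder code point)


def _tlv(tlv_type: int, value: bytes) -> bytes:
    if len(value) > 0xFF:
        raise ValueError("value does not fit the 1-byte Length field")
    return struct.pack("!BB", tlv_type, len(value)) + value


def encode_bitmap(vn_ids: Iterable[int]) -> bytes:
    ids = set(vn_ids)
    start, end = min(ids), max(ids)
    bitmap = bytearray((end - start) // 8 + 1)
    for vn in ids:
        offset = vn - start
        # The highest-order bit of the first byte is the Start VN ID.
        bitmap[offset // 8] |= 0x80 >> (offset % 8)
    return _tlv(INT_VN_TYPE_1, struct.pack("!I", start) + bytes(bitmap))


def encode_range(start: int, end: int) -> bytes:
    return _tlv(INT_VN_TYPE_2, struct.pack("!II", start, end))


def encode_list(vn_ids: Iterable[int]) -> bytes:
    body = b"".join(struct.pack("!I", v) for v in sorted(set(vn_ids)))
    return _tlv(INT_VN_TYPE_3, body)


def decode(tlv: bytes) -> Set[int]:
    tlv_type, length = struct.unpack_from("!BB", tlv, 0)
    value = tlv[2:2 + length]
    if tlv_type == INT_VN_TYPE_1:
        (start,) = struct.unpack_from("!I", value, 0)
        return {start + i * 8 + bit
                for i, byte in enumerate(value[4:])
                for bit in range(8) if byte & (0x80 >> bit)}
    if tlv_type == INT_VN_TYPE_2:
        start, end = struct.unpack("!II", value)
        return set(range(start, end + 1))
    if tlv_type == INT_VN_TYPE_3:
        return {v for (v,) in struct.iter_unpack("!I", value)}
    raise ValueError("unknown Enabled-VN TLV type")


def enabled_vns(tlvs: Iterable[bytes]) -> Set[int]:
    # Several sub-TLVs in one message: the result is the union.
    return set().union(*(decode(t) for t in tlvs))
```

   With a 1-byte Length, the bit map form in this sketch covers at
   most 251 bytes of bit map, i.e. 2008 consecutive VN IDs, so a
   sparse set spread over a wide range is better sent as a list or
   as several sub-TLVs.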

   When the NVA is distributed, there can be multiple NVAs, each
   hosting mapping information for a subset of VNs.

   Each NVA advertises its availability to push mapping information
   for a particular virtual network to all the NVEs that participate
   in the VN. NVEs subscribe to the relevant NVAs.



   The subscription is VN scoped, so that an NVA doesn't need to
   push down the entire set of mapping entries. Each Push NVA also
   has a priority. For robustness, the one or two NVAs with the
   highest priority are considered Active in pushing information for
   the VN to all the NVEs that have subscribed to that VN.
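
   One way to picture the selection of Active Push NVAs for a VN is
   the short sketch below. The function name is hypothetical, and
   breaking priority ties by NVA name is an assumption of this
   example; the draft does not define a tie-breaker.

```python
from typing import Dict, List


def select_active(nva_priority: Dict[str, int],
                  max_active: int = 2) -> List[str]:
    """Return up to max_active NVA names, highest priority first."""
    ranked = sorted(nva_priority.items(),
                    key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:max_active]]


active = select_active({"nva-a": 10, "nva-b": 30, "nva-c": 20})
```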

7.2. Incremental Push Service

   Whenever there is any change in a TS's association with an NVE,
   which can be triggered by a TS being added, removed, or
   decommissioned, an incremental update has to be sent to the NVEs
   that are impacted by the change. Therefore, proper sequence
   numbers have to be maintained by the NVA and the NVEs. The NAMD
   incremental message is used to update and maintain the database
   consistency between NVAs and NVEs. We assume that the NVA gets
   notification from an authoritative source, such as a VM
   management system, when TS-NVE attachment changes occur.

   A new TLV is needed to carry the NAMD Timeout value and a flag
   for the NVA to indicate that it has completed all updates.

   If the Push NVA is configured to believe it has complete mapping
   information for VN X, then, after it has actually transmitted all
   of its messages for VN X, it sets the Complete Push (CP) bit to
   one. It then maintains the CP bit as one as long as it is Active.

               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
               | Type                          |   (2 bytes)
               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
               | Length                        |   (2 bytes)
               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
               |R| Priority    |                   (1 byte)
               +-+-+-+-+-+-+-+-+
               | NAMD Timeout  |                   (1 byte)
               +-+-+-+-+-+-+-+-+
               | Flags         |                   (1 byte)
               +---------------+
               | Reserved for expansion            (variable)
               +-+-+-+-...
      Figure 3. NAMD Complete TLV


   Flags: A byte of flags defined as follows:

                     0   1   2   3   4   5   6   7
                  +---+---+---+---+---+---+---+---+
                  | UN|CP |       RESV            |
                  +---+---+---+---+---+---+---+---+


   The UN flag indicates that the NVA will accept and properly
   process NVA PDUs sent by unicast.

   The CP flag indicates that the NVA has completed its update.
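
   The NAMD Complete TLV shown above could be packed and parsed as
   in the following sketch. The Type value is a placeholder because
   the draft does not assign one. The sketch assumes that bit 0 in
   the flags diagram is the most significant bit, that R is the top
   bit of the Priority byte and is sent as zero, that Length counts
   the bytes after the Length field, and that no expansion bytes are
   present.

```python
import struct
from typing import Dict, Union

NAMD_COMPLETE_TYPE = 0xFFFF   # placeholder, not an assigned value
FLAG_UN = 0x80                # bit 0: NVA accepts unicast PDUs
FLAG_CP = 0x40                # bit 1: Complete Push


def pack_namd_complete(priority: int, timeout_s: int = 30,
                       unicast_ok: bool = False,
                       complete: bool = False) -> bytes:
    if not 0 <= priority <= 0x7F:
        raise ValueError("Priority is a 7-bit field")
    if not 0 <= timeout_s <= 0xFF:
        raise ValueError("NAMD Timeout is an unsigned byte")
    flags = (FLAG_UN if unicast_ok else 0) | (FLAG_CP if complete else 0)
    value = struct.pack("!BBB", priority, timeout_s, flags)
    return struct.pack("!HH", NAMD_COMPLETE_TYPE, len(value)) + value


def unpack_namd_complete(data: bytes) -> Dict[str, Union[int, bool]]:
    tlv_type, length = struct.unpack_from("!HH", data, 0)
    if length < 3 or len(data) < 4 + length:
        raise ValueError("truncated NAMD Complete TLV")
    prio_byte, timeout_s, flags = struct.unpack_from("!BBB", data, 4)
    return {
        "type": tlv_type,
        "priority": prio_byte & 0x7F,      # mask off the R bit
        "timeout_s": timeout_s,
        "unicast_ok": bool(flags & FLAG_UN),
        "complete": bool(flags & FLAG_CP),  # RESV bits are ignored
    }
```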


8. Pull Mechanism

   Under this mode, an NVE pulls the mapping entries from the NVA
   when its cache doesn't have the mapping entries.

   The main advantage of Pull Mode is that the mapping is stored
   only where it needs to be stored and only when it is required. In
   addition, in the Pull Mode, NVEs can age out mapping entries if
   they haven't been used for a certain period of time. Therefore,
   each NVE will only keep the entries that are frequently used, so
   its mapping table size will be smaller than a complete table
   pushed down from NVA.
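
   The aging behavior described above can be illustrated with a
   small cache sketch. The class name, the method names, and the 300
   second idle timeout are assumptions made for this example; the
   draft does not specify an aging interval.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class MappingCache:
    """Hypothetical NVE-side cache that ages out unused entries."""

    def __init__(self, idle_timeout_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._idle = idle_timeout_s
        self._clock = clock
        # (VN ID, inner address) -> [outer address, last-used time]
        self._entries: Dict[Tuple[int, str], List] = {}

    def learn(self, vn_id: int, inner: str, outer: str) -> None:
        self._entries[(vn_id, inner)] = [outer, self._clock()]

    def lookup(self, vn_id: int, inner: str) -> Optional[str]:
        entry = self._entries.get((vn_id, inner))
        if entry is None:
            return None            # miss: the NVE would pull from the NVA
        entry[1] = self._clock()   # a hit refreshes the idle timer
        return entry[0]

    def age_out(self) -> int:
        """Remove entries idle for longer than the timeout."""
        now = self._clock()
        stale = [k for k, (_, used) in self._entries.items()
                 if now - used > self._idle]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

   Passing the clock as a parameter is only a convenience for
   testing the sketch with a simulated time source.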

   The drawback of Pull Mode is that it might take some time for
   NVEs to pull the needed mapping from the NVA. Until the NVE gets
   the response from the NVA, it has to buffer subsequent data
   frames destined to the same target. The buffer could overflow
   before the NVE gets the response from the NVA. However, this
   scenario should not happen very often in a data center
   environment, because the TSs are most likely end systems that
   have to wait for a (TCP) acknowledgement before sending
   subsequent data frames.  Another option is to forward, not flood,
   subsequent frames to a default location, if the NVE is configured
   with a default node that has the ability to forward data frames
   when the NVE doesn't have the mapping information. This node can
   be the gateway, or a re-encapsulating NVE in the NAMD context.

   It is worth noting that the practice of an edge waiting and
   dropping packets upon receiving an unknown DA is not new. Most
   deployed routers today drop packets while waiting for target
   addresses to be resolved, because it is too expensive to queue
   subsequent packets while resolving the target address. Upon
   receiving a packet with a DA that is not in its ARP/ND cache, a
   router sends ARP/ND requests to the target and waits for an ARP
   or ND response. This practice minimizes flooding when targets
   don't exist in the subnet. When the target doesn't exist in the
   subnet, routers generally re-send an ARP/ND request a few more
   times before dropping the packets. The time that routers wait for
   an ARP/ND response when the target doesn't exist in the subnet
   can be longer than the time taken by the Pull Mode to get the
   mapping from the NVA.

8.1. Pull Query Format

      Here are some events that can trigger the pulling process:

       o An NVE receives a data frame from the attached TSs with a
          destination whose attached NVE is unknown, or
       o The NVE receives an ingress ARP/ND request for a target
          whose link address (MAC) or attached NVE is unknown.

     Each Pull request can have queries for multiple inner-outer
     mapping entries. The message format is defined below:


       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  Ver  | Type  | Flags | Count |      Err      |    SubErr     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                        Sequence Number                        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | QUERY 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...
      | QUERY 2
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...
      | ...
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...
      | QUERY K
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...
   Figure 4. Pull Query TLV


     Type: 1 for Query. Queries received by an NVE that is not a
     Pull NVA result in an error response unless inhibited by rate
     limiting.

     Flags, Err, and SubErr: MUST be sent as zero and ignored on
     receipt.

     Count: Number of QUERY Records present. A Query message Count
     of zero is explicitly allowed, for the purpose of pinging a
     Pull NVA server to see if it is responding. On receipt of such
     an empty Query message, a Response message that also has a
     Count of zero is sent unless inhibited by rate limiting.

     QUERY: Each QUERY Record within a Pull Query message is
     formatted as follows:

             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
           +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
           |        SIZE           |    RESV   |   QTYPE   |
           +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
             If QTYPE = 1
           +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
           |                      AFN                      |
           +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
           |  Query address ...
           +--+--+--+--+--+--+--+--+--+--+--...
             If QTYPE = 2, 3, 4, or 5
           +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
           |  Query frame ...
           +--+--+--+--+--+--+--+--+--+--+--...

     SIZE: Size of the QUERY record in bytes as an unsigned integer
     starting after the SIZE field and following byte. Thus the
     minimum legal value is 2. A value of SIZE less than 2 indicates
     a malformed QUERY record. The QUERY record with the illegal
     SIZE value and any subsequent QUERY records MUST be ignored and
     the entire Query message MAY be ignored.

     RESV: A block of reserved bits. MUST be sent as zero and
     ignored on receipt.

     QTYPE: There are several types of QUERY Records currently
     defined in two classes as follows: (1) a QUERY Record that
     provides an explicit address and asks for all addresses for the
     interface specified by the query address and (2) a QUERY Record
     that includes a frame. The fields of each are specified below.
     Values of QTYPE are as follows:

                  QTYPE   Description
                  -----   -----------
                     0    reserved
                     1    address query
                     2    ARP query frame
                     3    ND query frame
                     4    RARP query frame
                     5    Unknown unicast MAC query frame
                  6-14    assignable by IETF Review
                    15    reserved

     AFN: Address Family Number of the query address.

     Address Query: The query is asking for any other addresses, and
     the address of NVE from which they are reachable, that
     correspond to the same interface, within the VN of the query.
     Typically that would be either (1) a MAC address with the
     querying NVE primarily interested in the NVE by which that MAC
     address is reachable, or (2) an IP address with the querying
     NVE interested in the corresponding MAC address and the NVE by
     which that MAC address is reachable. But it could be some other
     address type.

     Query Frame: Where a QUERY Record is the result of an ARP, ND,
     RARP, or unknown unicast MAC destination address, the ingress
     NVE MAY send the frame to a Pull NVA if the frame is small
     enough that the resulting Query message does not exceed the
     MTU.
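
     The Query message layout above can be exercised with the
     following sketch, which builds a message carrying address
     queries (QTYPE 1) and parses QUERY records back while applying
     the SIZE rule. The Ver value of 0 and the function names are
     assumptions made for this example. AFN 1 is the IANA Address
     Family Number for IPv4.

```python
import struct
from typing import List, Tuple

VERSION = 0          # assumed; the draft does not state the value
TYPE_QUERY = 1
QTYPE_ADDRESS = 1
AFN_IPV4 = 1         # IANA Address Family Number for IPv4


def encode_address_query(afn: int, address: bytes) -> bytes:
    body = struct.pack("!H", afn) + address
    if len(body) > 0xFF:
        raise ValueError("record does not fit the 8-bit SIZE field")
    # SIZE counts the bytes after the SIZE byte and the RESV/QTYPE byte.
    return struct.pack("!BB", len(body), QTYPE_ADDRESS) + body


def encode_pull_query(seq: int, records: List[bytes]) -> bytes:
    if len(records) > 0xF:
        raise ValueError("Count is a 4-bit field")
    ver_type = (VERSION << 4) | TYPE_QUERY
    flags_count = len(records)            # Flags nibble is sent as zero
    header = struct.pack("!BBBBI", ver_type, flags_count, 0, 0, seq)
    return header + b"".join(records)


def decode_query_records(msg: bytes) -> Tuple[int, List[Tuple[int, bytes]]]:
    """Return (sequence number, [(QTYPE, record body), ...])."""
    _, flags_count, _, _, seq = struct.unpack_from("!BBBBI", msg, 0)
    count = flags_count & 0x0F
    offset, records = 8, []
    for _ in range(count):
        if offset + 2 > len(msg):
            break
        size, resv_qtype = struct.unpack_from("!BB", msg, offset)
        if size < 2 or offset + 2 + size > len(msg):
            break   # malformed: ignore this and all later records
        records.append((resv_qtype & 0x0F, msg[offset + 2:offset + 2 + size]))
        offset += 2 + size
    return seq, records


query = encode_pull_query(7, [encode_address_query(AFN_IPV4, bytes([10, 0, 0, 1]))])
ping = encode_pull_query(8, [])   # Count of zero: liveness check
```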

     If no response to a Pull Query message is received within a
     timeout, configurable in milliseconds, that defaults to 200,
     the Query message should be re-transmitted with the same
     Sequence Number up to a configurable number of times that
     defaults to three. If there are multiple QUERY Records in a
     Query message, responses to various subsets of these QUERY
     Records can be received before the timeout. In that case, the
     remaining unanswered QUERY Records should be re-sent in a new
     Query message with a new sequence number.  If an NVE is not
     capable of handling partial responses to queries with multiple
     QUERY Records, it MUST NOT send a Query message with more than
     one QUERY Record in it.
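
     The re-transmission rule can be sketched as a small loop. The
     two callables, send and wait_for_response, are hypothetical
     hooks standing in for whatever transport carries NAMD messages;
     only the default values of 200 milliseconds and three
     re-transmissions come from the text above. Partial responses
     are not handled in this sketch.

```python
from typing import Callable, Optional


def query_with_retry(send: Callable[[bytes], None],
                     wait_for_response: Callable[[float], Optional[bytes]],
                     message: bytes,
                     timeout_ms: int = 200,
                     max_retransmits: int = 3) -> Optional[bytes]:
    """Send a Query and re-send the identical message on timeout.

    The same bytes are re-sent, so the Sequence Number is unchanged.
    Returns the response, or None when every attempt timed out.
    """
    # One initial transmission plus up to max_retransmits re-sends.
    for _ in range(1 + max_retransmits):
        send(message)
        response = wait_for_response(timeout_ms / 1000.0)
        if response is not None:
            return response
    return None
```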

8.2. Pull Response

      There are several possibilities for the Pull Response:

      1. A valid inner-outer address mapping, coupled with a valid
        timer indicating how long the entry can be cached by the
        NVE. The cache timer should be short in an environment
        where VMs move frequently. The cache timer can also be
        configured.

      2. The target being queried is not available. The response
        should include the policy on whether the requester should
        forward the data frame in the legacy way or drop it.

      3. The requester is administratively prohibited from getting
        an informative response.


      Pull NVA Response messages are sent as unicast to the
      requesting NVE. A Response is sent with the same VN as the
      corresponding Query. The specific data format is as follows:


       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  Ver  | Type  | Flags | Count |      Err      |    SubErr     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                        Sequence Number                        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | RESPONSE 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...
      | RESPONSE 2
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...
      | ...
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...
      | RESPONSE K
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-...
   Figure 5. Pull Response TLV



      Type: 2 = Response.

      Flags: MUST be sent as zero and ignored on receipt.

      Count: Count is the number of RESPONSE Records present in the
      Response message.

      Sequence Number: Each Pull Query from an NVE carries a
      different Sequence Number. The Sequence Number in a Pull
      Response is that of the Query being answered, so that the
      NVE can match the Response to its Query.

      Err, SubErr: A two-part error code. Zero unless there was an
      error in the Query message, in which case see Section 8.6.
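
      As an illustration of the header layout in Figure 5, the
      following sketch unpacks the 8-byte message header. It assumes
      network byte order and that the leftmost field in each byte of
      the diagram occupies the most significant bits, as is
      conventional in IETF packet diagrams. The function and key
      names are hypothetical.

```python
import struct

# Two bytes of 4-bit fields, Err, SubErr, then a 32-bit Sequence Number.
HEADER = struct.Struct("!BBBBI")


def parse_header(data):
    """Split the fixed 8-byte header into its fields."""
    b0, b1, err, suberr, seq = HEADER.unpack_from(data)
    return {
        "ver": b0 >> 4,       # high nibble of byte 0 (assumed bit order)
        "type": b0 & 0x0F,    # 2 = Response
        "flags": b1 >> 4,     # ignored on receipt for a Response
        "count": b1 & 0x0F,   # number of RESPONSE Records that follow
        "err": err,
        "suberr": suberr,
        "seq": seq,
    }
```

      The RESPONSE Records start at offset HEADER.size in the
      message.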




      RESPONSE: Each RESPONSE record within a Pull NVA Response
      message is formatted as follows:


           0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
         +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
         |         SIZE          |OV|  RESV  |   Index   |
         +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
         |                   Lifetime                    |
         +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
         |                Response Data ...
         +--+--+--+--+--+--+--+--+--+--+--...

      SIZE: Size of the RESPONSE Record in bytes, not counting the
      SIZE field and the byte that follows it. Thus the minimum
      value of SIZE is 2 (the Lifetime field alone). If SIZE is less
      than 2, that RESPONSE Record and all subsequent RESPONSE
      Records in the Response message MUST be ignored and the entire
      Response message MAY be ignored.

      OV: The overflow flag. Indicates, as described below, that
      there was too much Response Data to include in one Response
      message.

      RESV: Three reserved bits that MUST be sent as zero and ignored
      on receipt.

      Index: The relative index of the QUERY Record in the Query
      message to which this RESPONSE Record corresponds. The index
      will always be one for Query messages containing a single
      QUERY Record. If the Index is larger than the Count that was
      in the corresponding Query, that RESPONSE Record MUST be
      ignored and subsequent RESPONSE Records or the entire Response
      message MAY be ignored.

      Lifetime: The length of time for which the response should be
      considered valid in units of 200 milliseconds except that the
      values zero and 2**16-1 are special. If zero, the response can
      only be used for the particular query from which it resulted
      and MUST NOT be cached. If 2**16-1, the response MAY be kept
      indefinitely but not after the Pull NVA goes down or becomes
      unreachable. The maximum definite time that can be expressed
      is a little over 3.6 hours.
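
      The Lifetime encoding can be checked with a short calculation.
      The sketch below is illustrative only; the constant and
      function names are assumptions, and the choice of returning 0
      for "do not cache" and None for "indefinite" is made for the
      example.

```python
LIFETIME_UNIT_MS = 200
INDEFINITE = 2**16 - 1


def lifetime_ms(field):
    """Return the cache lifetime in milliseconds.

    Returns 0 when the response must not be cached and None when it may
    be kept indefinitely (while the Pull NVA remains reachable).
    """
    if not 0 <= field <= INDEFINITE:
        raise ValueError("Lifetime is a 16-bit field")
    if field == INDEFINITE:
        return None
    return field * LIFETIME_UNIT_MS  # a field of 0 yields 0: do not cache


# Largest definite value: 65534 units of 200 ms.
longest_hours = lifetime_ms(INDEFINITE - 1) / 3_600_000
```

      The largest definite value is 65534 x 200 ms = 13106.8
      seconds, or about 3.64 hours, which matches the "little over
      3.6 hours" stated above.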

      Response Data: There are various types of RESPONSE Records.

      -  If the Err field is non-zero, then the Response Data is a
      copy of the corresponding QUERY Record data, that is, either
      an AFN followed by an address or a query frame.


      -  If the Err field is zero and the corresponding QUERY Record
      was an address query, then the Response Data is the contents
      of an Interface Addresses APPsub-TLV [IA]. The maximum size of
      such contents is 253 bytes in the case when SIZE is 255.

      -  If the Err field is zero and the corresponding QUERY Record
      was a frame query, then the Response Data consists of the
      response frame for an ARP, ND, or RARP query, or a copy of the
      frame for an unknown unicast destination MAC address.
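
      The RESPONSE Record layout described above can be walked as
      in the following sketch. It assumes network byte order, with
      OV as the most significant bit of the second byte and Index in
      the low four bits, following the diagram. Stopping at a record
      that runs past the end of the buffer is an extra safety check
      assumed for this example and is not stated in the text. All
      names are hypothetical.

```python
import struct


def parse_response_records(buf):
    """Return the RESPONSE Records found in `buf` (header already removed)."""
    records, pos = [], 0
    while pos + 2 <= len(buf):
        size, bits = buf[pos], buf[pos + 1]
        # SIZE counts the bytes after the first two, so it must cover at
        # least the 2-byte Lifetime. A bad record ends the walk, which
        # ignores it and all subsequent records.
        if size < 2 or pos + 2 + size > len(buf):
            break
        (lifetime,) = struct.unpack_from("!H", buf, pos + 2)
        records.append({
            "ov": bool(bits & 0x80),   # overflow flag
            "index": bits & 0x0F,      # reserved bits (0x70) are ignored
            "lifetime": lifetime,      # in units of 200 milliseconds
            "data": bytes(buf[pos + 4:pos + 2 + size]),
        })
        pos += 2 + size
    return records
```

      With SIZE at its maximum of 255, the data slice is 253 bytes
      long, which agrees with the limit given for Interface
      Addresses APPsub-TLV contents.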

      Multiple RESPONSE Records can appear in a Response message
      with the same Index if the answer to a QUERY Record consists
      of multiple Interface Addresses APPsub-TLV contents. This
      would be necessary if, for example, a MAC address within a
      Data Label appears to be reachable by multiple NVEs. However,
      all RESPONSE Records to any particular QUERY Record MUST occur
      in the same Response message. If a Pull NVA holds more
      mappings for a queried address than will fit into one Response
      message, it selects which to include by some method outside
      the scope of this document and sets the overflow flag (OV) in
      all of the RESPONSE Records responding to that queried
      address.
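
      On the receiving side, records that share an Index can be
      collected per QUERY Record, as in this minimal sketch. The
      record dictionaries and the function name are assumptions made
      for illustration.

```python
from collections import defaultdict


def group_by_index(records):
    """Collect RESPONSE Records per QUERY Record index.

    Each record is a dict with "index", "ov", and "data" keys. The result
    notes whether the Pull NVA signalled that more mappings exist than
    were returned.
    """
    grouped = defaultdict(lambda: {"mappings": [], "overflow": False})
    for rec in records:
        entry = grouped[rec["index"]]
        entry["mappings"].append(rec["data"])
        entry["overflow"] = entry["overflow"] or rec["ov"]
    return dict(grouped)
```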

      If no response is received from a Pull request within a
      configurable timeout, the request should be re-transmitted
      with the same Sequence Number up to a configurable number of
      times that defaults to three.

8.3. Cache Consistency

      It is important that the cached information be kept consistent
      with the actual placement of VMs. Therefore, it is highly
      desirable to have a mechanism that prevents NVEs from using
      stale mapping entries.

      When there is any change in a Pull NVA, such as an entry being
      deleted or a new entry being added, and there may be unexpired
      stale information at some NVEs, the Pull NVA MUST send an
      unsolicited Update message to the relevant NVEs.

      To achieve this goal, a Pull NVA server MUST maintain one of
      the following, in order of increasing specificity:

      1. An overall record per VN of when the last returned query
      data will expire at a requester and when the last
      QUERY-Record-specific negative response will expire.





      2. For each unit of data (IA APPsub-TLV Address Set) held by
      the NVA and each address about which a negative response was
      sent, when the last expected response with that unit or
      negative response will expire at a requester.

      Note: Caching negative replies is especially important because
      invalid address queries are common. A study has shown that for
      each valid ND query there are hundreds of invalid address
      queries.

      3. For each unit of data held by the NVA and each address
      about which a negative response was sent, a list of NVEs that
      were sent that unit as the response or sent a negative
      response to the address, with the expected time to expiration
      at each of them.
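
      The most specific option (3) can be sketched as a table of
      expiry times kept per unit of data and per NVE. This is only
      an illustration; the class and method names are hypothetical,
      and times are plain numbers of seconds supplied by the caller.

```python
class ExpiryTracker:
    """Per-unit, per-NVE record of when a returned response expires."""

    def __init__(self):
        self._sent = {}  # unit -> {nve: expiry time}

    def record_response(self, unit, nve, now, lifetime_s):
        """Note that `unit` (or a negative response for it) went to `nve`."""
        per_nve = self._sent.setdefault(unit, {})
        # Keep the latest expiry if the same NVE asked more than once.
        per_nve[nve] = max(per_nve.get(nve, 0.0), now + lifetime_s)

    def nves_to_update(self, unit, now):
        """NVEs that may still hold an unexpired copy of `unit`."""
        per_nve = self._sent.get(unit, {})
        return sorted(nve for nve, expiry in per_nve.items() if expiry > now)
```

      When a mapping changes, the Pull NVA would send an Update only
      to the NVEs returned by nves_to_update(). Options 1 and 2
      keep less state but cause Updates to be sent more widely.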



8.4. Update Message Format

      An Update message is formatted as a Response message except
      that the Type field in the message header is a different
      value.

      Update messages are initiated by a Pull NVA. The Sequence
      Number space used for Update messages is controlled by the
      originating Pull NVA. It is distinct from the Sequence Number
      space used for a Query and its corresponding Response, which
      is controlled by the querying NVE.

      The Flags field of the message header for an Update message is
      as follows:

            +---+---+---+---+
            | F | P | N | R |
            +---+---+---+---+

      F: The Flood bit. If zero, the Update message is unicast. If
      F=1, it is multicast to the relevant NVEs.

      P, N: Flags used to indicate positive or negative Update
      messages. P=1 indicates positive. N=1 indicates negative. Both
      may be 1 for a flooded all-addresses Update.

      R: Reserved. MUST be sent as zero and ignored on receipt.
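
      The four flag bits can be handled as in the sketch below. It
      assumes that F, the leftmost bit in the diagram, is the most
      significant bit of the 4-bit Flags field. The constant and
      function names are hypothetical.

```python
FLAG_F = 0x8  # Flood
FLAG_P = 0x4  # Positive
FLAG_N = 0x2  # Negative
FLAG_R = 0x1  # Reserved


def encode_update_flags(flood=False, positive=False, negative=False):
    """Build the Flags nibble; the reserved bit is always sent as zero."""
    return ((FLAG_F if flood else 0)
            | (FLAG_P if positive else 0)
            | (FLAG_N if negative else 0))


def decode_update_flags(nibble):
    """Read the Flags nibble; the reserved bit is ignored on receipt."""
    return {
        "flood": bool(nibble & FLAG_F),
        "positive": bool(nibble & FLAG_P),
        "negative": bool(nibble & FLAG_N),
    }
```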



8.5. Acknowledge Message Format

      An Acknowledge message is sent in response to an Update to
      confirm receipt or indicate an error unless response is
      inhibited by rate limiting. It is also formatted as a Response
      message.

      If there are no errors in the processing of an Update message,
      the message is essentially echoed back with the Type changed
      to Acknowledge.

      If there was an overall or header error in an Update message,
      it is echoed back as an Acknowledge message with the Err and
      SubErr fields set appropriately.

      If there is a RESPONSE Record level error in an Update
      message, one or more Acknowledge messages may be returned.

8.6. Pull Request Errors

      If errors occur at the query level, they MUST be reported in a
      response message separate from the results of any successful
      queries. If multiple queries in a request have different
      errors, they MUST be reported in separate response messages.
      If multiple queries in a request have the same error, this
      error response MAY be reported in one response message.
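
      One way a Pull NVA might apply these rules is to partition
      the per-query results by error code before building Response
      messages. The sketch below is illustrative; the function name
      and the (Err, SubErr) values used with it are hypothetical.

```python
def plan_responses(results):
    """Partition query results into Response messages.

    `results` is an iterable of (index, (err, suberr), data). Successful
    results share one message; each distinct error code gets its own.
    """
    ok = []
    by_error = {}
    for index, code, data in results:
        if code[0] == 0:
            ok.append((index, data))
        else:
            by_error.setdefault(code, []).append((index, data))
    messages = []
    if ok:
        messages.append({"err": (0, 0), "records": ok})
    for code, records in by_error.items():
        # Queries with the same error are reported together, which the
        # text permits but does not require.
        messages.append({"err": code, "records": records})
    return messages
```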



8.7. Redundant Pull NVAs

      There could be multiple NVAs holding mapping information for a
      particular VN for reliability or scalability purposes. Pull
      NVAs advertise themselves by having the Pull Directory flag on
      in their Interested VNs sub-TLV [rfc6326bis].

      A pull request can be sent to any of them that is reachable,
      but it is RECOMMENDED that pull requests be sent to an NVA
      that is least cost from the requesting NVE.
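
      The recommendation above amounts to a simple selection among
      the reachable Pull NVAs, sketched here. How cost and
      reachability are learned is outside this example; the
      dictionary keys and the function name are assumptions.

```python
def choose_pull_nva(candidates):
    """Pick the reachable Pull NVA with the lowest cost, or None."""
    reachable = [c for c in candidates if c["reachable"]]
    if not reachable:
        return None
    return min(reachable, key=lambda c: c["cost"])["name"]
```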


9. Hybrid Mode

      For some edge nodes that have a great number of VNs enabled,
      and for which the combined number of TSs under all those VNs
      is large, managing the inner-outer address mapping for TSs
      under all those VNs can be a challenge. This is especially
      true for Data Center gateway nodes, which need to communicate
      with a majority of VNs, if not all.

      For those NVE nodes, a hybrid mode should be considered: Push
      Mode is used for some VNs and Pull Mode for the others. The
      network operator decides, by configuration, which VNs' mapping
      entries are pushed down from the NVA and which VNs' mapping
      entries are pulled.

      In addition, the NVA can instruct the NVE to forward in the
      legacy way if the NVA does not have the mapping information,
      or can inform the NVE that it is administratively prohibited
      from forwarding data frames to the requested target.



10. Redundancy

   For redundancy, there should be multiple NVAs that hold mapping
   information for each VN. At any given time, only one or a small
   number of Push NVAs is considered active for a particular VN.
   Each NVA should announce its capability and priority to all the
   edges.

11. Inconsistency Processing

   If an NVE notices that a Push NVA is no longer reachable, it MUST
   ignore any mapping entries from that NVA because it is no longer
   being updated and may be stale.

   There may be transient conflicts between mapping information from
   different Push NVAs or conflicts between locally learned
   information and information received from a Push NVA. An NVA may
   attach a confidence level to address table information so that,
   in case of such conflicts, information with a higher confidence
   value is preferred over information with a lower confidence. In
   case of equal confidence, Push NVA information is preferred to
   locally learned information and, if information from Push NVAs
   conflicts, the information from the higher priority Push NVA is
   preferred.
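
   The preference order above can be written as a comparison
   between two conflicting entries. This is a sketch under stated
   assumptions: the entry fields and the function name are
   hypothetical, and a numerically larger priority is assumed to
   mean a higher priority.

```python
def preferred(a, b):
    """Return whichever of two conflicting mapping entries wins.

    Each entry is a dict with "confidence" (int), "source" ("push" or
    "local"), and "priority" (int, meaningful for Push NVA entries).
    """
    # 1. Higher confidence wins.
    if a["confidence"] != b["confidence"]:
        return a if a["confidence"] > b["confidence"] else b
    # 2. At equal confidence, Push NVA information beats local learning.
    if a["source"] != b["source"]:
        return a if a["source"] == "push" else b
    # 3. Between Push NVAs, the higher priority NVA wins.
    if a["source"] == "push" and a["priority"] != b["priority"]:
        return a if a["priority"] > b["priority"] else b
    return a  # no rule separates them; keep the first
```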









12. Protocols to consider to carry NAMD messages

   NAMD messages can be carried by an IGP, BGP, or even OVSDB. The
   NVO3 WG focuses only on specifying the NAMD message structure.
   How NAMD TLVs are integrated with BGP or IGP messages will be
   discussed in the corresponding WGs, e.g., the BESS WG.

   OVSDB (the Open vSwitch Database Management Protocol, RFC 7047,
   an individual submission) is used to bootstrap a vSwitch with
   the needed configuration (e.g., number of flow tables, the
   pipeline among those flow tables, path/link cost, Spanning Tree
   timer, Hello timer, enabling multicast snooping, etc.). After
   OVSDB bootstraps a vSwitch, OpenFlow is used to dynamically pass
   down the flow entries.

   In theory, some components of OVSDB could be adopted (with
   updates) to implement the control plane between NVA and NVE.
   For example, changes to OVSDB are needed to address:

       - How do Edge nodes request a Push?

       - How do Edge nodes express the VNs in which they
          participate?

       - How does an NVA express the supported VN ranges or lists?

       - How do Edge nodes feed back newly discovered attached TSs
          to the NVA?

       - How do Edge nodes exchange mappings among themselves?



13. Security Considerations

   Incorrect information in an NVA can result in a variety of
   security threats, including the following:

   Incorrect directory mappings can result in data being delivered
   to the wrong hosts/VMs, or set of hosts in the case of
   multi-destination packets, violating security policy.

   Missing or incorrect data in an NVA can result in denial of
   service due to sending data packets to black holes or discarding
   data on ingress due to incorrect information that their
   destinations are not reachable.





   Push NVA data messages can be authenticated by including an
   Authentication TLV. See [RFC5304] and [RFC5310].



14. IANA Considerations

   This section gives IANA allocation and registry considerations.

15. Acknowledgements

   Special thanks to David Black, Dino Farinacci, Mingui Zhang, and
   XiaoHu Xu for valuable suggestions and comments on this draft.

16. References

   16.1. Normative References

   [RFC4971] Vasseur, J., et al., "Intermediate System to
             Intermediate System (IS-IS) Extensions for Advertising
             Router Information", RFC 4971, July 2007.

   [nvo3-nve-nva-cp-req] Kreeger, et al., "Network Virtualization
             NVE to NVA Control Protocol Requirements",
             draft-ietf-nvo3-nve-nva-cp-req-00, July 31, 2013.

   [IA]      Eastlake, D., L. Yizhou, R. Perlman, "TRILL: Interface
             Addresses APPsub-TLV", draft-ietf-trill-ia-appsubtlv,
             work in progress.



   16.2. Informative References

   [802.1Q] IEEE Std 802.1Q-2011, "IEEE Standard for Local and
             metropolitan area networks - Virtual Bridged Local Area
             Networks", May 2011.

   [802.1Qbg] IEEE Std 802.1Qbg-2012, "Media Access Control (MAC)
          Bridges and Virtual Bridged Local Area Networks-Edge
          Virtual Bridging", July 2012.

   [RFC826] Plummer, D., "An Ethernet Address Resolution Protocol",
             RFC 826, November 1982.

   [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
             "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
             September 2007.


 Authors' Addresses

   Linda Dunbar
   Huawei Technologies
   5430 Legacy Drive, Suite #175
   Plano, TX 75024, USA
   Phone: (469) 277 5840
   Email: linda.dunbar@huawei.com


   Donald Eastlake
   Huawei Technologies
   155 Beaver Street
   Milford, MA 01757 USA
   Phone: 1-508-333-2270
   Email: d3e3e3@gmail.com


   Tom Herbert
   Google
   Email: therbert@google.com


























