ForCES Working Group                                    Jamal Hadi Salim
Internet Draft                                             Znyx Networks
Expires: December 2003                                       Robert Haas
                                                                     IBM
                                                            Steven Blake
                                                                Ericsson
                                                               June 2003

                      Netlink2 as ForCES Protocol

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document describes Netlink2, which is an extension of Linux
Netlink [Netlink]. This document is intended as a proposal for the
ForCES IETF working group protocol. ForCES attempts to define a clear
separation between the two entities of the NE (the control element and
the forwarding element) so that they can evolve separately, as opposed
to the current monolithic evolution.

Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC-2119].

Salim/Haas/Blake          Expires December 2003                 [Page 1]
Internet-Draft         Netlink2 as ForCES Protocol             June 2003

1. Introduction

The concept of IP control and forwarding separation was first
introduced in the early 1980s by the BSD 4.4 routing sockets [Stevens].
The focus at that time was to provide a simple IP(v4) forwarding
service and allow the control plane, either via a command line
configuration tool or a dynamic route daemon, to control forwarding
tables for that IPv4 forwarding service.

The IP world has evolved considerably since then. Linux Netlink
[Netlink], when observed from a service provisioning and management
point of view, takes routing sockets one step further by breaking the
narrow focus on IPv4 forwarding. Since the Linux 2.1 kernel, Netlink
has been providing the IP service abstraction for a few additional
services other than classical RFC 1812 IPv4 forwarding.

Netlink was designed with the goal of solving the forwarding and
control separation. This means that many of the main issues have been
thought through and resolved over the years. In other words, Netlink
is proven as a protocol addressing separation of forwarding and
control.

Netlink is also network-ready because it uses packet formatting
techniques and concepts (e.g., multicast addressing). This, together
with the availability of publicly running, tested, and widely deployed
code, is a major motivation for basing Netlink2 on Netlink.

Netlink2 extends Linux Netlink to meet the requirements of the ForCES
working group charter for a protocol. Netlink is extended to have a
distributed addressing and transport scheme, and missing mechanisms
are added to make Netlink2 meet the ForCES protocol requirements
[ForCES_REQ].

Netlink2 operates in a mode where knowledge of the NE, its topology,
and modeling MAY have already been discovered, or is discovered within
the Netlink2 protocol.

2. Definitions

We use the definitions provided in [ForCES_REQ], as well as the
following:

Logical Functional Block (LFB): same as Forwarding Engine Components
as defined in [Netlink]. This is a forwarding datapath component in
the FE driven by the ForCES protocol in order to achieve a certain
service.

Control Element Component (CEC): same as Control Plane Component as
defined in [Netlink]. This is a component in the CE that drives LFB(s)
in order to achieve a certain service.

3. Netlink2 Overview

An IP forwarding service accomplished by an FE is represented as a
logical functional block (LFB) in the FE. CE components (CECs) in the
CE interact with LFBs over a Netlink2 bundle (described in Section
6.2) to execute a certain service. The interactions between LFBs and
CECs are specific to each service and are defined using templates as
presented in [Netlink].

The Netlink2 message is used to communicate between the FE and CEC for
configuration of the LFBs, asynchronous notification of LFB events to
the CECs, and statistics querying/gathering (typically by a CEC).
Other activities include transfer of control packets between FE and
CEC.

For instance, the IPv4 Forwarding service (called NETLINK_ROUTE)
defines a message template for handling IP routes and the message
types to insert, remove, or query a route. The routing CEC(s) and the
IPv4 Forwarding LFB(s) interact using these message templates and
message types over the Netlink2 bundle to execute the IPv4 Forwarding
service. The message types in Netlink2 messages allow the FE to
demultiplex messages to the appropriate LFB.

Messages of a certain service destined to an LFB can travel on
different Netlink2 wires within the same bundle. Note that an LFB can
process messages from different bundles.

Netlink2 by itself does not constitute a protocol, but rather a set of
base mechanisms that can be utilized depending on service
requirements.

The interaction between the LFB and the CEC, as in the Netlink
context, would define a protocol. Netlink2 provides mechanisms for the
CE Component and the FE Component to define their own protocol.
The LFB might continuously get updates from the control-element
component on how to operate the service (e.g., route additions or
deletions for IPv4 forwarding).

Netlink2 messages and mechanisms are used to derive the protocol. For
example, the LFB and CEC may choose to define a reliable or
semi-reliable protocol between each other. By default, however,
Netlink2 transactions are unreliable.

4. Netlink2 Modifications to Netlink

To conform to the ForCES requirements [ForCES_REQ], the Netlink
protocol [Netlink] is extended in the following respects:

1) Base header modifications.

2) Feature expandability extensions by means of optional header TLVs
   to accommodate current generic ForCES requirements and to make it
   possible to add more in the future. This facilitates adding such
   features as authentication, checksumming, etc., when required.

3) IP and Transport encapsulations to carry Netlink messages.

With these complementary changes to the existing Netlink
functionality, Netlink2 fulfills the requirements to become the ForCES
protocol.

4.1. Header Modifications

1) PID field redefinition and addition

   In Netlink, PID 0 referred to the equivalent of the FE (kernel).
   The equivalent of the CE (user process) was referred to by its OS
   process id.

   In Netlink2 a PID of the unicastPID type is assigned to each FE and
   CE in the pre-association phase. In this way the CE uniquely
   identifies the FE and avoids any collision. We maintain the name
   PID for historical purposes.

   - Destination PID: the PID field is redefined as the Destination
     PID field. This field identifies the parties on the wire that
     must process the message.

   - Source PID: this field is introduced in the header to identify
     the source of the message.

   Different types of PIDs are discussed in Section 6.3.

2) The Length field has been reduced to 16 bits, with length 0 being
   reserved.
   The rest of the old 32-bit Length field is now split between a new
   version field and a new extended flags field.

3) A Version field is introduced in the Netlink2 header. This 8-bit
   field consists of a 4-bit major number and a 4-bit minor number in
   the form major:minor. For Netlink2, this becomes 0x20.

4) A new Extended Flags field is introduced to take over the remaining
   8 of the 16 bits taken from the original 32-bit Length field in
   Netlink. Turning different bits on enables additional new features,
   such as announcing the presence of extended TLVs. Extended Flags
   also introduce the concept of a SYN message, which is issued by the
   FE as the first message after the pre-association phase to indicate
   its presence. Also, a FIN flag is issued last to indicate the
   departure of the FE.

5) Netlink2-specific TLVs follow directly after the Netlink2 base
   header. They are optional and their presence is indicated only by
   an extended flag bit. A typical use of Netlink2-specific TLVs is to
   compensate for capabilities lacking in an underlying transport. For
   example, in an IP network not deployed with IPSEC, the
   Netlink2-specific authentication TLV could be used to emulate the
   features provided by IPSEC-AH.

Other than these changes, all mechanisms provided by Netlink are
sufficient to meet the requirements for ForCES. The reader is
encouraged to refer to [Netlink] as a companion to this document.

4.2. Addressing and Transport Extensions

1) Support for UDP/TCP/SCTP transport over unicast/multicast IP
   (Section 6.1).

2) Support for bundles (Section 6.2).

3) Message recipient scoping using the Destination PID (Section 6.3).

4) Support for both local scope and global scope addressing (Sections
   6.4 and 6.5).

5. Netlink2 Message Format

There are three mandatory levels to a Netlink2 message: the general
Netlink2 message header, the IP-service-specific template, and the
IP-service-specific data. Netlink2-specific TLVs and
IP-service-specific TLVs are optional.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                    Netlink2 message header                    |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |               Netlink2-specific TLVs (optional)               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                      IP Service Template                      |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |               IP-Service-specific data in TLVs                |
   |                          (optional)                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The Netlink2 message header is generic for all services, whereas the
IP Service Template header is specific to a service. Each IP Service
then carries configuration parameters (CEC->LFB direction) or query
responses (LFB->CEC direction). These parameters are in
Type-Length-Value (TLV) format and are unique to the particular
service.

Note that we maintain the same IP Service Templates as in Netlink,
i.e., nothing has changed here.

5.1. Netlink2 Message Header

Netlink2 messages are laid out in the same way as Netlink messages.
Each Netlink2 message contains a byte stream with a Netlink2 header
followed by its associated payload.

A single PDU may contain more than one Netlink2 message. This is
referred to as batching. Netlink batching is reused in Netlink2 and
allows for messages with different commands (such as adding routes and
deleting a QoS policy) to be carried in the same batch message.

A Netlink2 message may be split across multiple PDUs if it does not
fit into the PDU.
This is referred to as a multipart Netlink2 message and is also
inherited from Netlink.

For multipart messages, the first and all following headers have the
NLM_F_MULTI Netlink header flag set, except for the last header, which
has the Netlink header type NLMSG_DONE.

The Netlink2 message header is shown below.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |    Flags_E    |            Length             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             Type              |             Flags             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Source PID                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Destination PID                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Optional TLVs                         |
   ~                                                               ~
   ~                                                               ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields in the header are:

Version: 8 bits
   The version field is split into major:minor (4:4 bits) subfields.
   The value for Netlink2 is 0x20.

Flags_E: 8 bits
   These are extended flags:

   NLM_F_SYN    Set on the first message. Interpreted as a boot
                message.
   NLM_F_FIN    Set on the last message. Interpreted as a departure
                message.
   NLM_F_ETLV   Set to indicate presence of extended TLVs.
   NLM_F_PRIO   Message priority: 1 for high and 0 for low.
                Additional QoS level set in QOS TLV.
   NLM_F_ASTR   Set the ACK strategy: 1 for partial ACKs and 0 for
                full ACKs.

Length: 16 bits
   The length of the Netlink2 message in bytes, including the header.

Type: 16 bits
   This field describes the message content.
   It can be one of the standard message types:

   NLMSG_NOOP    The message is ignored.
   NLMSG_ERROR   The message signals an error and the payload
                 contains a nlmsgerr structure. This can be looked at
                 as a NACK and typically it is from LFB to CEC.
   NLMSG_DONE    The message terminates a multipart message.

   Individual IP Services specify more message types; e.g., the
   NETLINK_ROUTE Service specifies several types such as RTM_NEWLINK,
   RTM_DELLINK, RTM_GETLINK, RTM_NEWADDR, RTM_DELADDR, RTM_NEWROUTE,
   RTM_DELROUTE, etc.

Flags: 16 bits
   The standard flag bits used in Netlink are:

   NLM_F_REQUEST   Must be set on all request messages (typically
                   from CE to FE).
   NLM_F_MULTI     Indicates the message is part of a multipart
                   message terminated by NLMSG_DONE.
   NLM_F_ACK       Request for an acknowledgment on success. Typical
                   direction of request is from CEC to LFB.
   NLM_F_ECHO      Echo this request. Typical direction of request is
                   from CEC to LFB.

   Additional flag bits for GET requests on config information in the
   LFB:

   NLM_F_ROOT      Return the complete table instead of a single
                   entry.
   NLM_F_MATCH     Return all entries matching the criteria passed in
                   the message content.
   NLM_F_ATOMIC    Return an atomic snapshot of the table being
                   referenced. This may require special privileges
                   because it has the potential to interrupt service
                   in the FE for a longer time.

   Convenience macros for flag bits:

   NLM_F_DUMP      This is NLM_F_ROOT or'ed with NLM_F_MATCH.

   Additional flag bits for NEW requests:

   NLM_F_REPLACE   Replace existing matching config object with this
                   request.
   NLM_F_EXCL      Do not replace the config object if it already
                   exists.
   NLM_F_CREATE    Create config object if it does not already exist.
   NLM_F_APPEND    Add to the end of the object list.
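
The header layout and flag bits described above can be sketched in a
few lines of Python. This is only an illustration under stated
assumptions, not part of the protocol definition: the draft does not
fix a byte order at this point (network byte order is assumed), the
numeric flag and message-type values are taken from the Linux headers
`linux/netlink.h` and `linux/rtnetlink.h` on the assumption that
Netlink2 leaves them unchanged, and the names `HEADER`, `pack_message`,
and `unpack_header` are hypothetical.

```python
import struct

# Field order follows the header diagram: Version (8), Flags_E (8),
# Length (16), Type (16), Flags (16), Sequence Number (32),
# Source PID (32), Destination PID (32).
# ASSUMPTION: network byte order; the draft does not state one here.
HEADER = struct.Struct("!BBHHHIII")

NETLINK2_VERSION = 0x20  # major 2, minor 0 (major:minor, 4:4 bits)

# ASSUMPTION: values as in Linux <linux/netlink.h>, reused unchanged.
NLM_F_REQUEST = 0x001
NLM_F_ACK = 0x004
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400

# ASSUMPTION: value as in Linux <linux/rtnetlink.h>.
RTM_NEWROUTE = 24


def pack_message(msg_type, flags, seq, src_pid, dst_pid,
                 payload=b"", flags_e=0):
    """Build one Netlink2 message: base header followed by payload."""
    length = HEADER.size + len(payload)
    # Length is 16 bits and includes the header; length 0 is reserved.
    if length > 0xFFFF:
        raise ValueError("message does not fit the 16-bit Length field")
    header = HEADER.pack(NETLINK2_VERSION, flags_e, length,
                         msg_type, flags, seq, src_pid, dst_pid)
    return header + payload


def unpack_header(data):
    """Decode the base header into a dict of named fields."""
    (version, flags_e, length, msg_type,
     flags, seq, src_pid, dst_pid) = HEADER.unpack_from(data)
    return {
        "major": version >> 4,
        "minor": version & 0x0F,
        "flags_e": flags_e,
        "length": length,
        "type": msg_type,
        "flags": flags,
        "seq": seq,
        "src_pid": src_pid,
        "dst_pid": dst_pid,
    }


# A request that creates a route only if it does not exist yet and
# asks for an acknowledgment. The PIDs and payload are made-up values.
msg = pack_message(
    RTM_NEWROUTE,
    NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL,
    seq=1, src_pid=0x00010001, dst_pid=0x00020001,
    payload=b"\x00" * 4,
)
print(unpack_header(msg))
```

With the 8-bit Version and Flags_E fields sharing a 32-bit word with
the 16-bit Length field, the base header in this sketch is 20 bytes
long.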

   For those familiar with the BSD-style use of such operations in
   route sockets, the equivalent translations are:

   - The BSD ADD operation equates to NLM_F_CREATE or'ed with
     NLM_F_EXCL.
   - The BSD CHANGE operation equates to NLM_F_REPLACE.
   - The BSD Check operation equates to NLM_F_EXCL.
   - The BSD APPEND equivalent is actually mapped to NLM_F_CREATE.

Sequence Number: 32 bits
   The sequence number of the message.

Source PID: 32 bits
   The PID of the sender of the message (unicast or logical PID).

Destination PID: 32 bits
   The PID of the destination of the message (unicast, logical, or
   broadcast PID).

5.2. Netlink2-specific TLVs

5.2.1. Authentication

[TBD]

5.2.2. Checksum

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | TLV Type =12  | TLV Length =2 |      Checksum (16 bits)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This TLV is optional. To compute the correct checksum, an
implementation MUST add the optional checksum TLV to the Netlink2
message with the initial checksum value of 0 and compute the checksum
over such a Netlink2 message. Refer to [RFC3358] for details on the
Checksum TLV.

5.2.3. Message Priority

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | TLV Type =13  | TLV Length =2 |      Priority (16 bits)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This TLV is optional. It is used if the network does not support
prioritization. This field is used to indicate priorities to the
remote end.

6. Addressing and Transport Extensions

We extend Netlink to make it distributed. The focus is on giving
Netlink2 a strong local scope view of the world while fitting well
into a global scope when the hop distance between the FE and CE
increases.

If the network interconnecting the FE(s) and CE(s) is completely
hidden from the outside (black-box view), for instance an internal
Ethernet segment or a switching fabric in which CE(s) and FE(s) are
connected within physical proximity, then communications between FE
and CE are assumed to be of a local scope. On the other hand, if
communications between FE and CE cross parts of the network that are
not hidden from the outside, communications are considered to be of
global scope.

6.1. Transport Methods

The ideal environment for Netlink2 is considered to be a
multicast-capable medium with IP above it and with UDP/TCP/SCTP
running over IP.

Netlink2 can also run over non-IP, non-multicast-capable environments;
however, this requires extra processing and messaging by the ForCES
layer to compensate for services that IP already offers.

6.1.1. Why Multicast?

Multicast is considered important to facilitate one-to-many/some
communication. For example, a single command from a CE can be
multicast to multiple FEs, which eases the scalability requirements
mentioned in [ForCES_REQ]. This is discussed in later sections.

When running Netlink2 over non-multicast-capable media, it is expected
that mechanisms similar to those used in OSPF NBMA [RFC2328] networks
will be put in place.

6.1.2. Why IP?

IP runs on virtually every link layer. Leveraging this fact alone
helps the protocol to be deployed more widely and more quickly.

IP also provides numerous services such as fragmentation and
reassembly, prioritization, and security, which are inherent
requirements for the ForCES protocol. This means that successfully
running over an alternative to IP requires that similar services be
provided by the underlying layer in order to meet the requirements.

Netlink2-specific optional TLVs can be used to compensate for lacking
functionality when running on a network transport other than IP or
directly on the link layer.

Netlink already allows the definition of multipart messages, with IP
segmenting/reassembling when the path MTU is exceeded. When running on
top of non-IP media, the Netlink2 message can be limited so that it
does not exceed the MTU; the multipart messages facility can then be
used to provide framing for segmenting/reassembling.

The Netlink2-specific Authentication TLV can be used to carry
authentication signatures in a medium that does not have this
capability.

The Netlink2-specific Checksum TLV can be used to carry checksums in a
medium that does not have this capability.

The Netlink2-specific Message Priority TLV can be used to carry
prioritization if transports are not capable of expressing priorities
in their headers.

6.1.3. Why UDP/TCP/SCTP?

On a local scope, it is assumed that multicast UDP over IP is the
preferred mode of operation. On a global scope it is expected that TCP
or SCTP would be used for enhanced reliability and Internet congestion
friendliness.

All three protocols provide 16-bit ports, which are further
address-demultiplexing points. Also, all three protocols provide
checksum capability to enhance integrity of the Netlink2 message. In
the case of UDP, the checksum is optional (which fits the model that
the local scope is less error-prone than global scope and hence the
integrity check could be turned on only when needed).

6.2. The Netlink2 wire and bundle

A Netlink2 wire displays the same behavior as a Netlink wire. It
interconnects FEs and CEs in order to support services they jointly
offer. The only conceptual difference between a Netlink2 wire and a
Netlink wire is that whereas the Netlink wire is localized, the
Netlink2 wire is distributed.

We also introduce the concept of a Netlink2 bundle.
A Netlink2 bundle interconnects a set of FE(s) and/or CE(s) by means
of one or more Netlink2 wires. Note that a Netlink2 bundle does not
necessarily mean a full-mesh interconnection (see the examples in
Section 6.2.1). Parties (FEs and CEs) on a Netlink2 bundle share
common configuration, provisioning, and event-notification end goals.

A Netlink2 wire MAY be constructed using a multicast connection, a
unicast connection, or multiple multicast and unicast connections.

A wire MUST belong to only one bundle. A bundle may have only a single
wire (unicast or multicast). In most cases we believe there will only
be one multicast address for a bundle, although scalability issues
could require the use of unicast connections in addition.

When a multicast IP address is used, a Netlink2 wire MUST run over
UDP; a UDP port is used to uniquely identify the wire. There MAY be
multiple wires using the same multicast address as long as they run
over different UDP ports.

When a unicast IP address is used, the description of how to connect
to an endpoint (CE/FE) is subject to the agreement between the CE and
FE. The connection could be directly over IP (do we need an IP
protocol number?) or via transport-layer ports (TCP/UDP/SCTP).

In both unicast and multicast wires, the necessary parameters (such as
IP address and port numbers) can be discovered through the involvement
of the FE and CE Managers.

6.2.1. What wires go in a bundle?

Netlink2 provides the flexibility to have a bundle of purely unicast
wires, purely multicast wires, or a hybrid of both. The decision of
what goes into a bundle can be made in the pre-association phase.

A good analogy is to think of a multicast wire as a broadcast link (as
is done in Netlink) in which CE(s) and FE(s) are parties attached to
that broadcast link.

Depending on the number of FEs and CEs on an NE, a single multicast
wire in the bundle may be sufficient. Multicast allows one-to-some
messaging. A single message sent by an originator is seen by all
parties on the wire. This simplifies synchronization in an HA
environment as well as implementation of the protocol.

The fact that multicast messages are seen by all parties could cause
scalability issues as the number of nodes grows. Parties need to
filter out messages for which they are not the destination. This can
take compute resources, or table resources if filtering is done in
hardware. The extra messages also consume unnecessary bandwidth for
FE(s) and CE(s) not interested in seeing these messages.

Unicast wires could be used to create point-to-point connections
between the parties; when every party is connected to every other
party, this becomes a full mesh.

A full unicast mesh topology removes the need to filter the
unnecessary messages but introduces scalability concerns, as the
number of connections required grows quadratically with the number of
parties (FEs and CEs) present. This requires a lot more compute and
state information to be maintained at each party. A pure mesh topology
also complicates HA because more state must be maintained (for
instance, the IP addresses of the CEs and FEs that are active and what
their backups are), and each party therefore needs to perform extra
processing to achieve failover. This remains transparent if multicast
is used among all parties.

Netlink2 allows a bundle to have a hybrid of unicast and multicast
connections. Note that this is a model used by other protocols such as
OSPF over broadcast links, where the Hello protocol is multicast but
responses to LSA updates are unicast.
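
The quadratic growth mentioned above can be made concrete with a small
calculation. The sketch below assumes that each unicast wire is one
bidirectional point-to-point connection between two parties, which the
draft does not state explicitly; the function name `full_mesh_wires`
is hypothetical.

```python
def full_mesh_wires(parties):
    """Unicast wires needed so that every party reaches every other.

    ASSUMPTION: one bidirectional wire per pair of parties, which
    gives n * (n - 1) / 2 wires for n parties.
    """
    if parties < 0:
        raise ValueError("number of parties cannot be negative")
    return parties * (parties - 1) // 2


# Compare against a bundle made of a single multicast wire.
for n in (2, 4, 8, 16, 32):
    print(f"{n:>2} parties: {full_mesh_wires(n):>3} unicast wires "
          f"in a full mesh, 1 wire with multicast")
```

Each party in the full mesh also has to keep state for n - 1 peers,
whereas on a single multicast wire it joins one group and filters on
the Destination PID.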

We present some examples of Netlink2 bundles:

1) A trivial case is a Netlink2 bundle consisting of a single unicast
   wire between the CE and FE it interconnects.

2) Multiple FEs and a CE could be interconnected with a Netlink2
   bundle using a single multicast connection.

3) In the same example as 2) above, the unicast address of the CE
   could in addition be used, for instance, to deliver acknowledgments
   or notifications from the FEs to the CE without their being seen by
   all other FEs. The unicast addresses of the FEs could also be used,
   for instance, to deliver certain messages only to a specific FE,
   such as a retransmission of a message in a two-phase commit only to
   an FE that did not respond.

4) Multiple FEs and CEs could use a wire with two multicast
   connections: one for all FEs, the other for all CEs, so that
   messages only relevant to FEs are not seen by CEs and vice versa.

6.3. Redefining the Netlink PID Semantics

We maintain the name PID for historical purposes and introduce a
Destination PID and a Source PID as mentioned earlier.

For every message received by each party on the wire, the Destination
PID field indicates the recipient of the message. The addressed party
could be either an FE or a CE, or respectively an LFB or a CEC.

In addition to Netlink2 wires (unicast or multicast) defining the
destination of a particular message delivered, the PID types provide
further control, namely to define which entity actually has to process
the message.

So if the bundle uses only a single multicast wire, messages will be
heard by all parties on the wire, but only those with a matching PID
will actually process these messages. We introduce special-purpose
PIDs addressed to specific listeners on the wire.

The following types of PIDs are defined and can be used in the
Netlink2 messages.
The actual values for the PID of an FE or CE must be the same across
all wires of the same bundle and must be established during the
pre-association phase. Default values are given. PIDs must be unique
within a Netlink2 wire. They may also be unique within the NE. PIDs
are subdivided into two 16-bit subfields named wire and party, in the
form wire:party.

1) unicastPID: allows one to uniquely address an FE or CE. Each FE/CE
   must have such a unicast PID. Only the FE or CE assigned to this
   PID must process an incoming message with such a Destination PID.
   Other parties MAY silently discard the message. The wire subfield
   is a unique identifier of the FE or CE. The party subfield acts as
   a port number: it can for instance be used to further demultiplex a
   message to the appropriate process in a CE (CEC) or the appropriate
   LFB in an FE.
   Default value: none.

2) logicalPID: in addition to its unicastPID, an FE/CE MAY have zero
   or more logical PIDs assigned to it. A logicalPID can be used for
   active-backup pairs of FEs: for instance, the active and the backup
   FE have the same logical PID or at least the same wire subfield.
   The wire subfield is an identifier of the group of FEs and/or CEs
   participating in the group. Pre-association configuration ensures
   that the same party identifier is not assigned twice to different
   CECs or LFBs on the same wire.
   Default value: none.

3) broadcastPID: all parties on all wires must process an incoming
   message with such a Destination PID. An example of a message that
   might be broadcast is a notification that a CE is being brought
   down for maintenance.
   Default value: 0xffffffff

4) FEbroadcastPID: all FEs on all wires must process an incoming
   message with such a Destination PID. A typical example is a route
   update from the CE to all FEs. Other parties (CEs) can silently
   discard the message.
   Default value: 0xffffefff

5) CEbroadcastPID: all CEs on all wires must process an incoming
   message with such a Destination PID. Other parties (FEs) can
   silently discard the message.
   Default value: 0xffffdfff

A Netlink2 message must have as its Destination PID one of the PID
types defined above. The Source PID of a Netlink2 message must be of
the unicastPID or logicalPID type.

In addition, if the NLM_F_ACK flag is set, then every party processing
the message MUST reply with an acknowledgment after processing the
message, unless the NLM_F_ASTR flag is used to prevent ACK implosion.

Pre-configured translation tables are used to map a given PID into the
underlying wire in a bundle, i.e., an IP unicast or multicast address.

6.4. Local Scope Addressing and Encapsulation

At a local scope, the addressing used for a wire is a UDP port on top
of a multicast IP address.

Multiple wires can run on one multicast address, with a further
demultiplexing level based on the UDP port.

The wire addressing parameters MAY be discovered during the
pre-association phase.

6.5. Global Scope Addressing and Encapsulation

When addressing a non-local scope, the Netlink2 message is
encapsulated in a transport header and shuttled to the remote end,
where it is decapsulated and processed as if it originated from the
local scope of that remote end. Global scope addressing could use any
configured transport protocol (SCTP, UDP, or TCP) as agreed upon in
the pre-association phase.

This can be viewed as an extension of the local scope wires.

7. Protocol Architecture

7.1. Protocol Phases

ForCES in relation to NEs involves three phases: the pre-association
phase, the association phase where the ForCES protocol operates, and a
termination phase where a party in the relationship leaves a bundle.

7.1.1. The Pre-Association Phase

In a simple setup, this phase is static.
All the parameters for the association phase are well known (for
example, the multicast groups for each Netlink2 bundle and its wires).

In the case of dynamic discovery, the FE Manager and the CE Manager
agree on all the parameters and clearly articulate topology and other
information to each other.

Vendors may use their own proprietary service discovery protocol. As a
minimum, we assume a static configuration.

On completion of the Service Discovery phase, the FEM will have
established contact with the appropriate CEM component.
Initialization and Authentication will be complete at this point. An
FE is issued a service identifier which will be used for accounting,
identification, and authentication purposes. The identifier is
translated into the PID in the association phase. The multicast and
unicast addresses for communication are also known at this point. All
capabilities may also have been discovered at this point.

7.1.2. The Association Phase

In this phase, the FE and control-plane (CP) components cooperate to
deliver the IP service. The CP component might be registered (in the
pre-association phase) to receive FE-specific services (such as link
events). Essentially, in this phase, the IP service is provisioned and
executing. The FE component might continuously get updates from the
control plane component on how to operate the service (for example,
IPv4 forwarding route additions or deletions).

The association phase is where Netlink2 operates as the ForCES
protocol.

On startup, a SYN Netlink2 message with an ACK flag set is issued by
the FE on the bundle(s) to which the FE is connected. The controlling
CE will respond (given the ACK flag in the request) with either an
ACK, to imply that the FE has been accepted by the CE, or a NACK,
which is interpreted as a rejection of the FE by the CE. If no
response is received within a timeout period, a retry is attempted.
After a configurable number of retries without response, it is assumed that a CE does not exist and control is handed to the FEM.

The SYN state is followed by the synchronization phase, where the FE is loaded with updates to tables.

7.1.3. Service Termination

Service termination could be issued by either component of the service abstraction. Normally it will be issued by the FE component so that the latter does not continue to get billed for services. The FE component may also issue the termination message if it wants to change to a comparatively better CP service provider. The FE or CE initiating the termination will issue a BOOT command with a FIN extended flag. An ACK flag may be set if a response to the FIN is required.

7.2. Protocol Logical Model

In the diagram below we show a simple LFB<->CEC logical relationship. We use the IPv4 Forwarding LFB as an example.

[Figure: a CE hosting CEC-1 (ospfd) and CEC-2 (COPS PEP), connected through the NETLINK2 BUNDLE to an FE whose IPv4 Forwarding LFBs are ingress police, IPv4 Forward, and Egress QoS Sched.]

Netlink2 logically models LFBs and CECs in the form of service blocks interconnected to each other via a Netlink2 bundle.

Acknowledgements and responses to messages do not have to be sent onto the same wire from which the triggering messages came, but MUST be sent on the same bundle to the same originating PID.
For instance, a wire interconnecting a CE with multiple FEs using a multicast address could be used to send route updates from the CE. On the other hand, independent unicast wires from each FE to the CE could be used to send back route events or acknowledgments. Note that sequencing is done per wire and Source PID, and ACKs can travel back on any wire of a bundle.

The Netlink2 wire can be shared or be specific to a service. There can be multiple Netlink2 wires grouped in a bundle carrying messages of the same service.

To reduce the messaging a block must handle (for example, to avoid extra processing), or to restrict the messaging accessible to it for partitioning or security reasons, additional Netlink2 wires can be used. A possible partitioning is a Netlink2 bundle per service. In the example above the IPv4 Forwarding LFB would be considered a service.

Assuming capabilities have been discovered during the pre-association phase (between the FEM and CEM), blocks (CECs or LFBs as illustrated above) connect to the agreed wires on the Netlink2 bundle, and listen to receive specific messages. CECs may connect to multiple Netlink2 wires if it helps them to control the service better.

All blocks (CECs and LFBs) send packets onto the Netlink2 wires. LFBs or CECs join Netlink2 wires and listen to messages of interest for processing or monitoring purposes.

All messages addressed to the LFB (for example, the IPv4 forwarding LFB illustrated above) will have the FE PID agreed upon by both the CE and the FE at the pre-association phase. LFBs (as well as CECs) also process messages with the broadcast PIDs. They may also process messages destined to other LFBs (as well as CECs) for availability synchronization purposes.

A further demultiplexing point is the command type in the Netlink2 message.
Each of the LFBs (e.g., the ingress police LFB above) knows how to respond to a specific command set as defined by the Netlink2 message type.

7.3. Service Addressing

Both the CEC and the LFB connect to a service by connecting to a defined Netlink2 bundle. This Netlink2 bundle is derived in the pre-association phase.

A service would typically be related to a specific Netlink2 bundle. Command types would be used to configure different LFBs. This allows reuse of the 16-bit command type with every new bundle.

Connecting to a service is followed (at any point during the lifetime of the connection) by issuing service-specific commands from the CEC to the LFB, mostly for configuration purposes or for statistics collection. The LFB could also send event announcements to the CEC, or respond to or acknowledge queries issued by the CEC.

7.4. IP Service Templates

IP services are defined by using service templates. Refer to the Netlink document [Netlink] for the different templates used for IP services that fit within the current scope of the ForCES charter.

7.5. Mechanisms for Creating Protocols

Mechanisms for creating reliable or unreliable protocols are provided. In addition, mechanisms for facilitating availability are embedded in Netlink2.

7.5.1. Building Reliable Protocols

By default the Netlink2 header flags NLM_F_PRIO and NLM_F_ACK are not set, so Netlink2 messages are sent at lower priority and do not require acknowledgements.

One could create a reliable protocol between an LFB and a CEC by using the combination of sequence numbers, ACKs and retransmit timers. Both sequence numbers and ACKs are provided by Netlink2. Timers are provided by the operating system or hardware.

Prioritization is an orthogonal mechanism to reliability.
When a node runs out of resources, a message sent with a higher priority will get preferential treatment. For instance, if an FE has only enough memory to allocate one message in response to a message from the CE, and it has to choose between two messages to respond to, then it will use that memory for the request which was sent with the higher priority. This also applies to other resources such as computing cycles and bandwidth. In other words, NLM_F_PRIO covers more than the classical bandwidth prioritization of packets on a link.

Another orthogonal mechanism provided by Netlink2 is the ACK strategy, which is selected by the NLM_F_ASTR flag. We define two types of acknowledgement strategies:

1) partial ACKs (using multicast ACK slotting and damping techniques [XTP]): receivers multicast an ACK after a random time if they have not yet seen an ACK sent by another receiver. This limits the number of ACKs returned to the source of the message and improves performance. For messages which a CE sends to a group of FEs, partial ACKs imply that an ACK from any one of the FEs is sufficient to deem the message delivered.

2) full ACKs: each receiver sends an ACK back to the source. This allows the source to immediately detect problems with receivers. In two-phase commits it is important that all FEs respond, so the full ACK strategy should be used.

7.5.2. Building Availability

A protocol component or an application could passively listen to Netlink2 commands and events within one or several Netlink2 wires. Doing so allows a very simple way of building complex applications which are aware of all service components that affect them for HA reasons.
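The passive listening described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the Wire and ForwardingBlock classes, the dictionary message shape, and the PID values 0x11 and 0x12 are all hypothetical and are not defined by Netlink2. A standby block attached to the same wire applies the messages addressed to the active block's PID, so its table tracks the active one.

```python
class Wire:
    """Toy stand-in for a Netlink2 wire: every attached block sees every message."""

    def __init__(self):
        self.blocks = []

    def attach(self, block):
        self.blocks.append(block)

    def send(self, msg):
        for block in self.blocks:
            block.receive(msg)


class ForwardingBlock:
    """Hypothetical LFB holding a prefix -> next-hop table."""

    def __init__(self, pid, shadow_pid=None):
        self.pid = pid
        self.shadow_pid = shadow_pid  # PID of the active block being shadowed, if any
        self.routes = {}

    def receive(self, msg):
        # Act on messages for our own PID and, as a passive listener,
        # on messages addressed to the block we shadow.
        if msg["dst"] not in (self.pid, self.shadow_pid):
            return
        if msg["cmd"] == "add":
            self.routes[msg["prefix"]] = msg["nexthop"]
        elif msg["cmd"] == "del":
            self.routes.pop(msg["prefix"], None)


wire = Wire()
active = ForwardingBlock(pid=0x11)                   # hypothetical PID
backup = ForwardingBlock(pid=0x12, shadow_pid=0x11)  # listens for the active PID too
wire.attach(active)
wire.attach(backup)

wire.send({"dst": 0x11, "cmd": "add", "prefix": "10.0.0.0/8", "nexthop": "192.0.2.1"})
wire.send({"dst": 0x11, "cmd": "add", "prefix": "10.1.0.0/16", "nexthop": "192.0.2.2"})
wire.send({"dst": 0x11, "cmd": "del", "prefix": "10.0.0.0/8"})
```

After the three messages both tables hold the same single route, although no message was ever addressed to the backup's own PID.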
To ensure transparent CE or FE redundancy for certain services, it is sufficient to ensure that the backup CEC/LFB is always attached to the same wires to which the active CEC/LFB is attached, so that the backup CEC/LFB receives all messages destined to the active CEC/LFB (whatever PID they are sent to) as well as all messages originating from the active CEC/LFB.

One could create a heartbeat protocol between the LFB and CEC by using the ECHO flags and the NLMSG_NOOP message. The heartbeat, in addition to listening to FE or CE events, could be used to facilitate takeover. This topic is beyond the scope of ForCES and will not be discussed further here. Note, however, that Netlink2 has the mechanisms needed to enable this.

7.5.3. The ACK Netlink2 Message

This message is actually used to denote both an ACK and a NACK. Typically the direction is from LFB to CEC (in response to an ACK request message). However, a CEC should be able to send ACKs back to an LFB when requested. The semantics for this are IP service specific.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Netlink2 message header                    |
|                      type = NLMSG_ERROR                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          error code                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  OLD Netlink2 message header                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Error code: integer (typically 32 bits)

An error code of zero indicates that the message is an ACK response. An ACK response message contains the original Netlink2 message header, which can be used to compare against (sent sequence numbers, etc.).

A non-zero error code message is equivalent to a Negative ACK (NACK).
In such a situation, the Netlink2 data that was sent down to the kernel is returned appended to the original Netlink2 message header.

7.5.4. Batching, Atomicity and Ordering of Transactions

As mentioned earlier (and repeated here for clarity), standard Netlink multi-message batching looks as follows:

   NLMSG:NLMSG:NLMSG....

where NLMSG is a Netlink2 header and its associated payload.

This has the advantage of allowing intermixing of multiple commands (for example, adds and deletes), generally in a request from the CE to the FE. It is also useful for batching multiple events from the FE to the CE.

In a two-phase commit, messages are bound into a relationship. Typically, the first and all following headers have the NLM_F_MULTI Netlink2 header flag set, except for the last header, which has the Netlink2 header type NLMSG_DONE. Typically, in Netlink, the NLMSG_DONE shows up in separate PDUs to define a commit.

Atomicity of a transaction, including that of a batch, is achieved by using the NLM_F_ATOMIC flag. Use of NLM_F_ATOMIC is expensive because it may necessitate the locking of access to tables (depending on the implementation).

8. Putting together the base protocol for WG charter

[TBF]

9. References

[RFC1633] R. Braden, D. Clark, and S. Shenker, "Integrated Services in the Internet Architecture: an Overview", RFC 1633, ISI, MIT, and PARC, June 1994.

[RFC1812] F. Baker, "Requirements for IP Version 4 Routers", RFC 1812, June 1995.

[RFC2475] M. Carlson, W. Weiss, S. Blake, Z. Wang, D. Black, and E. Davies, "An Architecture for Differentiated Services", RFC 2475, December 1998.

[RFC2748] J. Boyle, R. Cohen, D. Durham, S. Herzog, R. Rajan, A. Sastry, "The COPS (Common Open Policy Service) Protocol", RFC 2748, January 2000.

[RFC2328] J. Moy, "OSPF Version 2", RFC 2328, April 1998.

[RFC2844] T. Przygienda, P. Droz, R. Haas, "OSPF over ATM and Proxy-PAR", RFC 2844, May 2000.

[RFC3358] T.
Przygienda, "Optional Checksums in Intermediate System to Intermediate System (ISIS)", RFC 3358, August 2002.

[RFC1157] J.D. Case, M. Fedor, M.L. Schoffstall, C. Davin, "Simple Network Management Protocol (SNMP)", RFC 1157, May 1990.

[RFC3036] L. Andersson, P. Doolan, N. Feldman, A. Fredette, B. Thomas, "LDP Specification", RFC 3036, January 2001.

[Stevens] G.R. Wright, W. Richard Stevens, "TCP/IP Illustrated Volume 2, Chapter 20", June 1995.

[Netfilter] http://netfilter.samba.org

[Diffserv] http://diffserv.sourceforge.net

[Netlink] J. H. Salim, H. Khosravi, A. Kleen, A. Kuznetsov, "Netlink as an IP Services Protocol", draft-ietf-forces-netlink-03.txt, June 2002.

[ForCES_REQ] H. Khosravi, T. Anderson, "Requirements for Separation of IP Control and Forwarding", draft-ietf-forces-requirements-07.txt, October 2002.

[XTP] XTP Forum, "Xpress Transport Protocol Specification, XTP Revision 4.0", March 1995.

10. Authors' Addresses

Jamal Hadi Salim
Znyx Networks
Ottawa, Ontario
Canada
hadi@znyx.com

Robert Haas
IBM Research
Zurich Research Laboratory
Saeumerstrasse 4
CH-8803 Rueschlikon
Switzerland
rha@zurich.ibm.com

Steven Blake
Ericsson IP Infrastructure
920 Main Campus Drive, Suite 500
Raleigh, NC 27606
steven.blake@ericsson.com

11. Appendix 1: Sample Service Hierarchy

In the diagram below we show a simple IP service, foo, and the interaction between its CP and FE components (labels 1-3).

The diagram is also used to demonstrate CP<->FE addressing. In this section we illustrate only the addressing semantics. In Appendix 2, the diagram is referenced again to define the protocol interaction between service foo's CEC and LFB (labels 4-10).

[Figure: within the CP, a CLI and the CP protocol component for IP service foo; below them the Netlink2 layer; within the FE, the FE component/module for IP service foo, with packets flowing through it. The arrows between the CP protocol component and the Netlink2 layer are labeled 1,4,6,8,9; 2,5,10; and 3,7.]

The control plane protocol for IP service foo does the following to connect to its FE counterpart. The steps below are also numbered in the diagram above.

1) Connect to IP service foo through a socket connect. A typical connection would be via a call to:

   socket(AF_NETLINK, SOCK_RAW, NETLINK_FOO)

2) Bind to listen to specific asynchronous events for service foo

3) Bind to listen to specific asynchronous FE events

Note that a wrapper socket can be created on top of the real sockets: depending on the Destination PID given, it chooses the most appropriate socket to send the packet onto (if there are two multicast groups, one for all FEs, and one for all FEs and CEs, a packet from the CE to the FEs will use the first multicast group). The wrapper socket basically maps a message to the most appropriate wire in the bundle.

12. Appendix 2: Sample Protocol for the foo IP Service

Our proverbial IP service "foo" is used again to demonstrate how one can deploy a simple IP service control using Netlink2.

These steps are continued from Appendix 1 (hence the numbering).
4) query for current config of FE component

5) receive response to 4) via channel on 3)

6) query for current state of IP service foo

7) receive response to 6) via channel on 2)

9) register the protocol specific packets you would like the FE to forward to you

10) send specific service foo commands and receive responses for them if needed

12.1. Interacting with Other IP Services

The diagram in Appendix 1 shows another control component configuring the same service. In this case, it is a proprietary Command Line Interface. The CLI may or may not be using the Netlink protocol to communicate with the foo component. If the CLI should issue commands that will affect the policy of the LFB for service "foo", then the "foo" CEC is notified. It could then make algorithmic decisions based on this input. For example, if an FE allowed one service to delete policies installed by a different service, and a policy that foo installed was deleted by service bar, there might be a need to propagate this to all the peers of service "foo".

13. Appendix 3: Examples

In this example we show a simple configuration Netlink2 message sent from a TC CEC to an egress TC FIFO queue. This queue algorithm is based on packet counting and drops packets when the queue length exceeds the limit of 100 packets. We assume the queue is in a hierarchical setup with a parent 100:0 and a classid of 100:1, and that it is to be installed on the device with an ifindex of 4.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    Version    |    Flags_E    |            Length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Type (RTM_NEWQDISC)      | Flags (NLM_F_EXCL |           |
|                               | NLM_F_CREATE | NLM_F_REQUEST) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              Sequence Number (arbitrary number)               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Source PID                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Destination PID                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Family(AF_INET)|   Reserved1   |           Reserved1           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Interface Index (4)                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Qdisc handle (0x1000001)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Parent Qdisc (0x1000000)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         TCM Info (0)                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        Type (TCA_KIND)        |           Length(4)           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Value ("pfifo")                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Type (TCA_OPTIONS)       |           Length(4)           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Value (limit=100)                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
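As an illustration only, the message in the figure above could be assembled as follows. The numeric constants are the Linux values for the names shown in the figure. Everything else is an assumption made for this sketch and is not stated in this section: the 8/8/16-bit split of Version, Flags_E and Length; network byte order; a TLV Length that counts the unpadded value bytes, with the value padded to a 32-bit boundary; the version number 1; and the PID values 0x10 and 0x20.

```python
import struct

# Linux values for the names used in the figure.
RTM_NEWQDISC = 36
NLM_F_REQUEST = 0x001
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400
AF_INET = 2
TCA_KIND = 1
TCA_OPTIONS = 2

# Assumed layouts (network byte order, field widths as described above).
# Header: version, flags_e, length, type, flags, seq, source PID, dest PID.
HDR = struct.Struct("!BBHHHIII")
# Traffic-control body: family, reserved, reserved, ifindex, handle, parent, info.
TCMSG = struct.Struct("!BBHIIII")


def tlv(attr_type, value):
    """Type/Length/Value attribute; value padded to a 32-bit boundary (assumed)."""
    pad = (-len(value)) % 4
    return struct.pack("!HH", attr_type, len(value)) + value + b"\x00" * pad


def build_pfifo_request(seq, src_pid, dst_pid, ifindex=4, limit=100):
    """Build the 'install a pfifo queue' request from the figure."""
    body = TCMSG.pack(AF_INET, 0, 0, ifindex, 0x01000001, 0x01000000, 0)
    body += tlv(TCA_KIND, b"pfifo\x00")
    body += tlv(TCA_OPTIONS, struct.pack("!I", limit))
    flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL
    version = 1  # hypothetical version number
    header = HDR.pack(version, 0, HDR.size + len(body),
                      RTM_NEWQDISC, flags, seq, src_pid, dst_pid)
    return header + body


# Hypothetical PIDs; real values come from the pre-association phase.
msg = build_pfifo_request(seq=1, src_pid=0x10, dst_pid=0x20)
```

Under these assumptions the header and the traffic-control body are 20 bytes each and the two attributes add 12 and 8 bytes, so the Length field carries 60.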