ForCES Working Group                                 Jamal Hadi Salim
Internet Draft                                          Znyx Networks
                                                          Robert Haas
                                                                  IBM
                                                        December 2002

                     Netlink2 as ForCES protocol
                 draft-jhsrha-forces-netlink2-00.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC-2119].

1. Abstract

This document describes Netlink2, which is an extension of Linux Netlink [Netlink]. This document is intended as a proposal for the ForCES IETF working group protocol.

ForCES attempts to define a clear separation between the two entities of the NE in order to have them evolve separately, as opposed to the current monolithic evolution.

2. Introduction

The concept of IP control and forwarding separation was first introduced in the early 1980s by the BSD 4.4 routing sockets [stevens]. The focus at that time was a simple IP(v4) forwarding service and how the control plane, either via a command line configuration tool or a dynamic route daemon, can control forwarding tables for that IPv4 forwarding service.

The IP world has evolved considerably since then. Linux Netlink [Netlink], when observed from a service provisioning and management point of view, takes routing sockets one step further by breaking the narrow focus on IPv4 forwarding. Since the Linux 2.1 kernel, Netlink has been providing the IP service abstraction to a few services other than classical RFC 1812 IPv4 forwarding.

Netlink2 extends Linux Netlink to meet the requirements of the ForCES working group charter for a protocol. Netlink is extended to have a distributed addressing and transport scheme, and missing mechanisms are added to make Netlink2 meet the ForCES protocol requirements [forces_req].

We chose Netlink as the base set because it is freely available. Netlink is also already proven because it has been widely deployed with the Linux operating system since the 2.1 kernel.

Netlink2 operates in a mode where knowledge of the NE, its topology and modeling MAY have already been discovered, or is discovered within the Netlink2 protocol.

2.1. Why Netlink-derived?

Netlink was designed with the goal of solving the forwarding and control separation. This means that many of the main issues have been thought through and resolved over the years. In other words, Netlink is proven as a protocol addressing separation of forwarding and control. Netlink is also network-ready because it uses packet formatting techniques and concepts (e.g., multicast addressing). This and the availability of publicly running and tested code form a major motivator to base Netlink2 on Netlink.

2.2. Definitions

We use the definitions provided in [forces_req], as well as the following:

Forwarding Element Component (FEC): same as Forwarding Engine Components as defined in [Netlink]. This is a component in the FE driven by the ForCES protocol in order to achieve a certain service.

Control Element Component (CEC): same as Control Plane Component as defined in [Netlink]. This is a component in the CE that drives FEC(s) in order to achieve a certain service.

3. Extensions to the Netlink Message Format

To conform to the ForCES requirements [forces_req], the Netlink protocol [Netlink] is extended in the following respects:

1) IP and Transport encapsulations to carry Netlink messages.

2) Feature expandability extensions to accommodate current generic ForCES requirements and make it possible to add more in the future. This facilitates aspects such as authentication, checksumming, etc., when required.

With these changes to complement the existing Netlink functionality, Netlink2 fulfills the requirements to become the ForCES protocol.

3.1. Netlink Header Extensions

1) PID redefinition and addition

In Netlink, PID 0 referred to the equivalent of the FE (kernel). The equivalent of the CE (user process) was referred to by its OS process id.

In Netlink2 a PID of the unicastPID type is assigned to each FE and CE in the pre-association phase. Different types of PIDs are discussed further below. In this way the CE uniquely identifies the FE and avoids any collision. We maintain the name PID for historical purposes.

- destination PID: the PID field is redefined as the destination PID field. This field identifies the parties on the wire that must process the message.

- source PID: this field is introduced in the header to identify the source of the message.

Different types of PIDs are discussed further below.

2) The length has been reduced to 16 bits, with length 0 being reserved. The rest of the old 32-bit length field is now split between a new version field and a new additional flags field.

3) A Version field is introduced in the Netlink2 header. This 8-bit field is 4 bits major number and 4 bits minor number in the form of major:minor.
For Netlink2, this becomes: 0x20.

4) A new Extended Flags field is introduced to take over the remaining 8 bits of the 16 bits taken from the Length field. Turning different bits on enables additional new features, such as proclaiming the presence of extended TLVs. Extended Flags also introduce the concept of a SYN message, which is issued by the FE as the first message after the pre-association phase to indicate its presence. Similarly, a FIN flag is set on the last message to indicate the departure of the FE.

5) Netlink2-specific TLVs come right after the older Netlink header (refer to the diagram further below). They are optional, and their presence is signalled by the Extended Flags. A typical use of Netlink2-specific TLVs is to compensate for capabilities lacking in a transport. For example, in an IP network not deployed with IPSEC, the Netlink2-specific authentication TLV could be used to emulate IPSEC-AH.

Other than these changes, all mechanisms provided by Netlink are sufficient to meet the requirements for ForCES. The reader is encouraged to refer to [Netlink] as a companion to this document.

3.1.1. Netlink2-specific TLVs

1) Authentication

[TBD]

2) Checksum

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | TLV Type =12  | TLV Length =2 |      Checksum (16 bits)       |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This TLV is optional. To compute the correct checksum, an implementation MUST add the optional checksum TLV to the Netlink2 message with the initial checksum value of 0 and compute the checksum over the resulting Netlink2 message. Refer to [RFC3358] for details on the Checksum TLV.

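The zero-then-fill procedure for the Checksum TLV can be sketched as follows. This is only an illustration under stated assumptions: the RFC 1071 ones-complement checksum is used as a stand-in (it is not claimed to be the algorithm of [RFC3358]), the TLV is placed directly after an opaque header of even length, and the helper names are hypothetical.

```python
import struct

TLV_CHECKSUM = 12  # TLV Type shown in the Checksum TLV diagram


def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones-complement checksum (stand-in algorithm, an assumption)."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def add_checksum_tlv(header: bytes, rest: bytes) -> bytes:
    """Add the TLV with checksum 0, compute over the message, then fill it in."""
    if len(header) % 2:
        raise ValueError("header must keep the checksum field 16-bit aligned")
    zeroed = header + struct.pack("!BBH", TLV_CHECKSUM, 2, 0) + rest
    csum = internet_checksum(zeroed)
    return header + struct.pack("!BBH", TLV_CHECKSUM, 2, csum) + rest


def verify(message: bytes) -> bool:
    """With this stand-in algorithm, a valid message sums to zero."""
    return internet_checksum(message) == 0
```

A receiver using the same stand-in would call verify() on the whole message; any other checksum algorithm would need its own verification rule.
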
3) Message Priority

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | TLV Type =13  | TLV Length =2 |      Priority (16 bits)       |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This TLV is optional. It is used if the network does not support prioritization. This field is used to indicate priorities to the remote end.

3.2. Addressing and Transport Extensions

We extend Netlink to make it distributed. The focus is on making Netlink2 have a strong local scope view of the world while fitting well into a global scope when the hop distance between the FE and CE increases.

If the network interconnecting the FE(s) and CE(s) is completely hidden from the outside (black-box view), for instance an internal Ethernet segment or a switching fabric in which CE(s) and FE(s) are connected within physical proximity, then communications between FE and CE are assumed to be of a local scope. On the other hand, if communications between FE and CE cross parts of the network that are not hidden from the outside, communications are considered to be of global scope.

3.2.1. Transport Methods

The ideal environment for Netlink2 is considered to be a multicast-capable medium with IP above it and with UDP/TCP/SCTP running over IP.

Netlink2 will run over non-IP, non-multicast-capable environments; however, it will require extra processing and messaging by the ForCES layer to compensate for services that IP already offers.

3.2.1.1. Why Multicast?

Multicast is considered important to facilitate one-to-many/some communication. For example, a single command from a CE can be multicast to multiple FEs, which eases the scalability requirements mentioned in [forces_req]. This is discussed in later sections.

When running Netlink2 over non-multicast-capable media, it is expected that mechanisms similar to those used in OSPF NBMA [RFC2328] networks will be put in place.

3.2.1.2. Why IP?

IP runs on virtually every link layer. Leveraging this fact alone helps deploy the protocol more widely and more quickly.

IP also provides numerous services such as fragmentation and reassembly, prioritization, and security, which are inherent requirements for the ForCES protocol. This means that successfully running over an alternative to IP requires that similar services be provided by whatever is underneath in order to meet the requirements.

Netlink2-specific optional TLVs can be used to compensate for lacking functionality if running on a network transport other than IP or directly on the link layer.

Netlink already allows the definition of multipart messages, with IP segmenting/assembling when the path MTU is exceeded. When running on top of non-IP media, the Netlink2 message can be limited to not exceed the MTU; the multipart messages facility can then be used to provide framing for assembling/segmenting.

The Netlink2-specific Authentication TLV can be used to carry authentication signatures in a medium that does not have this capability.

The Netlink2-specific Checksum TLV can be used to carry checksums in a medium that does not have this capability.

The Netlink2-specific Message Priority TLV can be used to carry prioritization if transports are not capable of expressing priorities in their headers.

3.2.1.3. Why UDP/TCP/SCTP?

On a local scope, it is assumed that multicast UDP over IP is the preferred mode of operation. On a global scope it is expected that TCP or SCTP would be used for enhanced reliability and Internet congestion friendliness.

All three protocols provide 16-bit ports, which are further address-demultiplexing points.
Also, all three protocols provide checksum capability to enhance the integrity of the Netlink2 message. In the case of UDP, the checksum is optional (which fits the model that the local scope is less error-prone than the global scope, and hence the integrity check could be turned on only when needed).

3.2.2. The Netlink2 wire and bundle

A Netlink2 wire displays the same behavior as a Netlink wire. It interconnects FEs and CEs in order to support services they jointly offer.

The only conceptual difference between a Netlink2 wire and a Netlink wire is that whereas the Netlink wire is localized, the Netlink2 wire is distributed.

We also introduce the concept of a Netlink2 bundle. A Netlink2 bundle interconnects a set of FE(s) and/or CE(s) by means of one or more Netlink2 wires. Note that a Netlink2 bundle does not necessarily mean a full-mesh interconnection (see examples later on).

Parties (FEs and CEs) on a Netlink2 bundle share common configuration, provisioning, and event-notification end goals.

A Netlink2 wire MAY be constructed using a multicast connection, a unicast connection, or multiple multicast and unicast connections. A wire MUST belong to only one bundle.

A bundle may have only a single wire (unicast or multicast). In most cases we believe there will only be one multicast address for a bundle, although scalability issues could require the use of unicast connections in addition.

When a multicast IP address is used, a Netlink2 wire MUST run over UDP - a UDP port is used to uniquely identify the wire. There MAY be multiple wires using the same multicast address as long as they run over different UDP ports.

When a unicast IP address is used, the description of how to connect to an endpoint (CE/FE) is subject to the agreement between the CE and FE. The connection could be directly over IP (do we need an IP protocol number?)
or via transport-layer ports (TCP/UDP/SCTP).

In both unicast and multicast wires, the necessary parameters (such as IP address and port numbers) can be discovered by the involvement of the FE and CE Managers.

3.2.2.1. What wires go in a bundle?

Netlink2 provides flexibility to have a bundle of purely unicast wires or multicast wires or a hybrid of both. The decision of what goes into a bundle can be made in the pre-association phase.

A good analogy is to think of a multicast wire as a broadcast link (as is done in Netlink) in which CE(s) and FE(s) are parties attached to that broadcast link.

Depending on the number of FEs and CEs on an NE, a single multicast wire in the bundle may be sufficient. Multicast allows one-to-some messaging. A single message sent by an originator is seen by all parties on the wire. This simplifies synchronization in an HA environment as well as implementation of the protocol.

The fact that multicast messages are seen by all parties could cause scalability issues as the number of nodes grows. Parties need to filter out messages for which they are not the destination. This can take compute or table resources if filtering is done in hardware. The extra messages also consume unnecessary bandwidth for FE(s) and CE(s) not interested in seeing these messages.

Unicast wires could be used to create point-to-point connections between the parties; when every party is connected to every other party, this becomes a full mesh.

A full unicast mesh topology removes the need to filter unnecessary messages but introduces scalability concerns, as the number of connections required grows quadratically with the number of parties (FEs and CEs) present. This requires a lot more compute and state information to be maintained at each party.
A pure mesh topology also complicates HA because more state must be maintained (for instance, the IP addresses of the CEs and FEs that are active and what their backups are), and extra processing is therefore needed to achieve failover. This remains transparent if multicast is used among all parties.

Netlink2 allows a bundle to have a hybrid of unicast and multicast connections. Note that this is a model used by other protocols such as OSPF over broadcast links, where the Hello protocol is multicast but responses to LSA updates are unicast.

We present some examples of Netlink2 bundles:

1) A trivial case is a Netlink2 bundle consisting of a single unicast wire between the CE and FE it interconnects.

2) Multiple FEs and a CE could be interconnected with a Netlink2 bundle using a single multicast connection.

3) In the same example as 2) above, the unicast address of the CE could in addition be used, for instance, to deliver acknowledgments or notifications from the FEs to the CE without their being seen by all other FEs. The unicast addresses of the FEs could also be used, for instance, to deliver certain messages only to a specific FE, such as a retransmission of a message in a two-phase commit only to an FE that did not respond.

4) Multiple FEs and CEs could use a wire with two multicast connections: one for all FEs, the other for all CEs, so that messages only relevant to FEs are not seen by CEs and vice versa.

3.2.3. Redefining the Netlink PID Semantics

We maintain the name PID for historical purposes and introduce a destination PID and a source PID as mentioned earlier.

For every message received by each party on the wire, the destination PID field indicates the recipient of the message. The addressed party could be either an FE or a CE.

In addition to Netlink2 wires (unicast or multicast) defining the destination of a particular message delivered, the PID types provide further control, namely to define which entity actually has to process the message. So if the bundle uses only a single multicast wire, messages will be heard by all parties on the wire, but only those with a matching PID will actually process these messages.

We introduce special-purpose PIDs addressed to specific listeners on the wire. The following types of PIDs are defined and can be used in Netlink2 messages. The actual values for the PID of an FE or CE must be the same across all wires of the same bundle and must be established during the pre-association phase. Default values are given. PIDs must be unique within a Netlink2 wire. They may also be unique within the NE.

1) unicastPID: allows one to uniquely address an FE or CE. Each FE/CE must have such a unicast PID. Only the FE or CE assigned to this PID must process an incoming message with such a destination PID. Other parties MAY silently discard the message. Default value: none.

2) logicalPID: in addition to the unicastPID, an FE/CE MAY have zero or more logical PIDs assigned to it. A logicalPID can be used for active-backup pairs of FEs: for instance, the active and the backup FE have the same logical PID. Default value: none.

3) broadcastPID: all parties on the wire must process an incoming message with such a destination PID. An example of a message that might be broadcast is when a CE is brought down for maintenance. Default value: 0xffffffff

4) FEbroadcastPID: all FEs on the wire must process an incoming message with such a destination PID. A typical example is a route update from the CE to all FEs. Other parties (CEs) can silently discard the message. Default value: 0xefffffff

5) CEbroadcastPID: all CEs on the wire must process an incoming message with such a destination PID.
Other parties (FEs) can silently discard the message. Default value: 0xdfffffff

A Netlink2 message must have as its destination PID one of the PID types defined above. The source PID of a Netlink2 message must be of the unicastPID or logicalPID type.

In addition, if the NLM_F_ACK flag is set, then every party processing the message MUST reply with an acknowledgment after processing the message.

3.2.4. Local Scope Addressing and Encapsulation

At a local scope, the addressing used for a wire is a UDP port on top of a multicast IP address.

Multiple wires can run on one multicast address, with a further demultiplexing level based on the UDP port.

The wire addressing parameters MAY be discovered during the pre-association phase.

3.2.5. Global Scope Addressing and Encapsulation

When addressing a non-local scope, the Netlink2 message is encapsulated in a transport header and shuttled to the remote end, where it is decapsulated and run as if originating from the local scope of that remote end. The global scope addressing could use any transport protocol configured (SCTP, UDP or TCP) as agreed upon in the pre-association phase. This can be viewed as an extension of the local scope wires.

4. Netlink2 Architecture

An IP service accomplished by an FE is represented as an FE component (FEC) in the FE. CE components (CEC) in the CE interact with FECs over a Netlink2 bundle to execute a certain service.

The interactions between FECs and CECs are specific to each service and are defined using templates as presented in [Netlink]. For instance, the IPv4 Forwarding service (called NETLINK_ROUTE) defines a message template for handling IP routes and the message types to insert, remove, or get a route. The routing CEC(s) and the IPv4 Forwarding FEC(s) interact using these message templates and message types over the Netlink2 bundle to execute the IPv4 Forwarding service.

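The two demultiplexing steps, first on the destination PID (Section 3.2.3) and then on the message type, can be sketched as follows. This is a minimal illustration, not part of the protocol definition: the Party class and its handler table are hypothetical, the broadcast values are the defaults listed in Section 3.2.3, and RTM_NEWROUTE uses the numeric value from the Linux rtnetlink headers.

```python
BROADCAST_PID = 0xFFFFFFFF     # default broadcastPID
FE_BROADCAST_PID = 0xEFFFFFFF  # default FEbroadcastPID
CE_BROADCAST_PID = 0xDFFFFFFF  # default CEbroadcastPID
RTM_NEWROUTE = 24              # NETLINK_ROUTE message type (Linux value)


class Party:
    """Hypothetical FE or CE attached to a Netlink2 wire."""

    def __init__(self, role, unicast_pid, logical_pids=()):
        self.role = role                      # "FE" or "CE"
        self.unicast_pid = unicast_pid
        self.logical_pids = set(logical_pids)
        self.handlers = {}                    # message type -> FEC/CEC callable

    def must_process(self, dst_pid):
        """Apply the destination PID rules of Section 3.2.3."""
        if dst_pid == BROADCAST_PID:
            return True
        if dst_pid == FE_BROADCAST_PID:
            return self.role == "FE"
        if dst_pid == CE_BROADCAST_PID:
            return self.role == "CE"
        return dst_pid == self.unicast_pid or dst_pid in self.logical_pids

    def receive(self, dst_pid, msg_type, payload):
        """Filter on PID, then hand the message to the component for its type."""
        if not self.must_process(dst_pid):
            return None                       # silently discard
        handler = self.handlers.get(msg_type)
        return handler(payload) if handler else None


# An FE whose IPv4 Forwarding FEC handles route insertions.
fe = Party("FE", unicast_pid=0x10, logical_pids=[0x100])
fe.handlers[RTM_NEWROUTE] = lambda route: ("route added", route)
```

With a single multicast wire, every party would run must_process() on every message it hears, which is the filtering cost discussed in Section 3.2.2.1.
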
The message types in Netlink2 messages allow the FE to demultiplex messages to the appropriate FEC.

Messages of a certain service destined to an FEC can travel on different Netlink2 wires within the same bundle. Note that an FEC can process messages from different bundles.

Netlink2 by itself does not constitute a protocol, but rather a set of base mechanisms that can be picked up depending on service requirements.

The interaction between the FEC and the CEC, as in the Netlink context, would define a protocol. Netlink2 provides mechanisms for the CE Component and the FE Component to define their own protocol. The FEC might continuously get updates from the CEC on how to operate the service (e.g., for IPv4 forwarding, route additions or deletions).

Netlink2 messages and mechanisms are used to derive the protocol. For example, the FEC and CEC may choose to define a reliable or semi-reliable protocol between each other. By default, however, Netlink2 provides unreliable communication.

4.1. Protocol Logical Model

In the diagram below we show a simple FEC<->CEC logical relationship. We use the IPv4 Forwarding FEC as an example.

    CE
    .-------------------------------------------.
    |    /^^^^^\        /^^^^^\                 |
    |   |       |      / CEC-2 \                |
    |   | CEC-1 |     |  COPS   |               |
    |   | ospfd |     |   PEP   |               |
    |    \_____/       \_______/                |
    |       |              |                    |
    *********************************************
    ************** NETLINK2 BUNDLE **************
    *********************************************
    FE
    .-------------------------------------------.
    |  IPv4 Forwarding FEC                      |
    |  .-------------------------------------.  |
    |  |  .-------.   .-------.   .------.   |  |
    |  |  |ingress|   | IPv4  |   |Egress|   |  |
    |  |  |police |   |Forward|   | QoS  |   |  |
    |  |  |_______|   |_______|   |Sched |   |  |
    |  |                          '------'   |  |
    |  '-------------------------------------'  |
    '-------------------------------------------'

Netlink2 logically models FECs and CECs in the form of service blocks interconnected to each other via a Netlink2 bundle.

Acknowledgements and responses to messages do not have to be sent onto the same wire on which the triggering messages arrived, but MUST be sent on the same bundle to the same originating PID. For instance, a wire interconnecting a CE with multiple FEs using a multicast address could be used to send route updates from the CE. On the other hand, independent unicast wires from each FE to the CE could be used to send back route events or acknowledgments. Note that sequencing is done per wire and source PID, and ACKs can travel back on any wire of a bundle.

The Netlink2 wire can be shared or be specific to a service. There can be multiple Netlink2 wires in a bundle carrying messages of the same service.

Additional Netlink2 wires can be used to reduce messaging (for example, to avoid extra processing) or to restrict which messages are accessible, for partitioning or security reasons.

A possible partitioning is a Netlink2 bundle per service. In the example above the IPv4 Forwarding FEC would be considered a service.

Assuming capabilities have been discovered during the pre-association phase (between the FEM and CEM), blocks (CECs or FECs as illustrated above) connect to the agreed wires on the Netlink2 bundle and listen to receive specific messages. CECs may connect to multiple Netlink2 wires if it helps them to control the service better. All blocks (CECs and FECs) send packets onto the Netlink2 wires. FECs or CECs join Netlink2 wires and listen to messages of interest for processing or monitoring purposes.

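Joining a local-scope wire, which Section 3.2.4 addresses as a UDP port on top of a multicast IP address, might look as follows on a host with a standard sockets API. This is a sketch only: the function names are hypothetical, and the group address and port would come from the pre-association phase rather than from anything defined in this document.

```python
import socket
import struct


def membership_request(group: str, iface: str = "0.0.0.0") -> bytes:
    """Build the ip_mreq structure: multicast group plus local interface."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))


def join_wire(group: str, port: int) -> socket.socket:
    """Open a UDP socket bound to the wire's port and join its multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Allow several blocks (FECs/CECs) on one host to listen on the same wire.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock
```

Two wires sharing one multicast address would simply be two calls to join_wire() with the same group and different ports, for example join_wire("239.1.2.3", 5000) and join_wire("239.1.2.3", 5001), where both values are placeholders.
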
All messages addressed to the FEC (for example the IPv4 forwarding FEC illustrated above) will have the FE PID agreed upon by both the CE and the FE at the pre-association phase. FECs (as well as CECs) also process messages with the broadcast PIDs. They may also process messages destined to other FECs (as well as CECs) for availability synchronization purposes.

A further demultiplexing point is the command type in the Netlink2 message. Each of the blocks in an FEC (e.g., the ingress police block above) knows how to respond to a specific command set as defined by the Netlink2 message type (refer to the Netlink2 message format and messaging further below).

4.2. The Message Format

There are three mandatory levels to a Netlink2 message: the general Netlink2 message header, the IP-service-specific template, and the IP-service-specific data. Netlink2-specific TLVs and IP-service-specific TLVs are optional.

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                                                               |
    |                    Netlink2 message header                    |
    |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                                                               |
    |               Netlink2-specific TLVs (optional)               |
    |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                                                               |
    |                      IP Service Template                      |
    |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                                                               |
    |               IP-Service-specific data in TLVs                |
    |                          (optional)                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The Netlink2 message header is generic for all services, whereas the IP Service Template header is specific to a service. Each IP Service then carries parameterization data (CEC->FEC direction) or a response (FEC->CEC direction). These parameterizations are in TLV (Type-Length-Value) format and unique to the service. Note that we maintain the same IP Service Templates as in Netlink, i.e., nothing has changed here.

4.3. Protocol Model

This section expands on how Netlink2 provides the mechanism for service-oriented FEC and CEC interaction.

4.3.1. General Messaging

The Netlink2 message is used to communicate between the FEC and CEC for parameterization of the FECs, asynchronous event notification of FEC events to the CECs, and statistics querying/gathering (typically by a CEC). Other activities include transfer of control packets between FEC and CEC.

4.3.2. Service Addressing

Connecting to a service is achieved by both the CEC and the FEC connecting to a defined Netlink2 bundle. This Netlink2 bundle is derived in the pre-association phase.

A service would typically be related to a specific Netlink2 bundle. Command types would be used to configure different FECs (and blocks). This allows reuse of the 16-bit command type with every new bundle.

Connecting to a service is followed (at any point during the lifetime of the connection) by issuing a service-specific command, mostly for configuration purposes (from the CEC to the FEC) or for statistics collection. The FEC could also send event announcements to the CEC, or respond to or ACK queries issued by the CEC.

4.3.3. Netlink2 Message Header

Netlink2 messages are laid out exactly the same as Netlink messages. Each Netlink2 message contains a byte stream with a Netlink2 header followed by its associated payload.

A single PDU may contain more than one Netlink2 message. This is referred to as batching. Netlink batching is reused in Netlink2 and allows for messages with different commands (such as adding routes and deleting a QoS policy) to be carried in the same batch message.

A Netlink2 message may be split across multiple PDUs if it does not fit into the PDU. This is referred to as a multipart Netlink2 message and is also inherited from Netlink.

For multipart messages, the first and all following headers have the NLM_F_MULTI Netlink header flag set, except for the last header, which has the Netlink header type NLMSG_DONE.

The Netlink2 message header is shown below.

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |            Length             |            flags_e            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |             Type              |             Flags             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                        Sequence Number                        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                          source PID                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                        destination PID                        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         Optional TLVs                         |
    ~                                                               ~
    ~                                                               ~
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields in the header are:

Length: 16 bits
    The length of the Netlink2 message in bytes, including the header.

flags_e: 16 bits
    These are extended flags.

    NLM_F_SYN     Set on the first message. Interpreted as a boot message.
    NLM_F_FIN     Set on the last message. Interpreted as a departure message.
    NLM_F_ETLV    Set to indicate the presence of extended TLVs.
    NLM_F_PRIO    Message priority: 1 for high and 0 for low. Additional QoS level set in QOS TLV.
    NLM_F_ASTR    Sets the ACK strategy: 1 for partial ACKs and 0 for full ACKs.

Type: 16 bits
    This field describes the message content. It can be one of the standard message types:

    NLMSG_NOOP    The message is ignored.
    NLMSG_ERROR   The message signals an error and the payload contains an nlmsgerr structure. This can be looked at as a NACK, and typically it is from FEC to CEC.
    NLMSG_DONE    The message terminates a multipart message.

    Individual IP Services specify more message types; e.g., the NETLINK_ROUTE Service specifies several types such as RTM_NEWLINK, RTM_DELLINK, RTM_GETLINK, RTM_NEWADDR, RTM_DELADDR, RTM_NEWROUTE, RTM_DELROUTE, etc.

Flags: 16 bits
    The standard flag bits used in Netlink are:

    NLM_F_REQUEST   Must be set on all request messages (typically from CE to FE).
    NLM_F_MULTI     Indicates the message is part of a multipart message terminated by NLMSG_DONE.
    NLM_F_ACK       Request for an acknowledgment on success. Typical direction of the request is from CEC to FEC.
    NLM_F_ECHO      Echo this request. Typical direction of the request is from CEC to FEC.

    Additional flag bits for GET requests on config information in the FEC:

    NLM_F_ROOT      Return the complete table instead of a single entry.
    NLM_F_MATCH     Return all entries matching the criteria passed in the message content.
    NLM_F_ATOMIC    Return an atomic snapshot of the table being referenced. This may require special privileges because it has the potential to interrupt service in the FE for a longer time.

    Convenience macros for flag bits:

    NLM_F_DUMP      This is NLM_F_ROOT or'ed with NLM_F_MATCH.

    Additional flag bits for NEW requests:

    NLM_F_REPLACE   Replace the existing matching config object with this request.
    NLM_F_EXCL      Do not replace the config object if it already exists.
    NLM_F_CREATE    Create the config object if it does not already exist.
    NLM_F_APPEND    Add to the end of the object list.

    For those familiar with BSDish use of such operations in route sockets, the equivalent translations are:

    - BSD ADD operation equates to NLM_F_CREATE or-ed with NLM_F_EXCL
    - BSD CHANGE operation equates to NLM_F_REPLACE
    - BSD Check operation equates to NLM_F_EXCL
    - BSD APPEND equivalent is actually mapped to NLM_F_CREATE

Sequence Number: 32 bits
    The sequence number of the message.

Source PID: 32 bits
    The PID of the sender of the message (unicast or logical PID).
Destination PID: 32 bits
   The PID of the destination of the message (unicast, logical, or
   broadcast PID).

4.3.4. Mechanisms for Creating Protocols

Mechanisms for creating reliable or unreliable protocols are provided.
In addition, mechanisms for facilitating availability are embedded in
Netlink2.

4.3.4.1. Building Reliable Protocols

By default the Netlink2 header flags NLM_F_PRIO and NLM_F_ACK are not
set, so Netlink2 messages are sent with low priority and do not
require acknowledgements.

One could create a reliable protocol between an FEC and a CEC by using
the combination of sequence numbers, ACKs and retransmit timers. Both
sequence numbers and ACKs are provided by Netlink2. Timers are provided
by the operating system or hardware.

Prioritization is an orthogonal mechanism to reliability. When a node
runs out of resources, a message sent with a higher priority will get
preferential treatment. For instance, if an FE has only enough memory
to allocate one message in response to a message from the CE and it has
to choose between two messages to respond to, then it will use that
memory for the request which was sent with the higher priority. This
also applies to other resources such as computing cycles and bandwidth.
In other words, NLM_F_PRIO covers more than the classical bandwidth
prioritization of packets on a link.

Another orthogonal mechanism provided by Netlink2 is the ACK strategy,
which is selected by the NLM_F_ASTR flag. We define two types of
acknowledgement strategies:

1) Partial ACKs (using multicast ACK slotting and damping techniques
   [xtp]): receivers multicast an ACK after a random time if they have
   not yet seen an ACK sent by another receiver. This limits the number
   of ACKs returned to the source of the message and improves
   performance.
   For messages which a CE sends to a group of FEs, partial ACKs imply
   that an ACK from any one of the FEs is sufficient to deem the
   message delivered.

2) Full ACKs: each receiver sends an ACK back to the source. This
   allows the source to immediately detect problems with receivers. In
   two-phase commits it is important that all FEs respond, so the full
   ACK strategy should be used.

4.3.4.2. Building Availability

A protocol component or an application could passively listen to
Netlink2 commands and events within one or several Netlink2 wires.
This provides a simple way to build complex applications that, for HA
purposes, are aware of all service components that affect them.

To ensure transparent CE or FE redundancy for certain services, it is
sufficient to ensure that the backup CEC/FEC is always attached to the
same wires to which the active CEC/FEC is attached, so that the backup
CEC/FEC receives all messages destined to the active CEC/FEC (whatever
PID they are sent to) as well as all messages originating from the
active CEC/FEC.

One could create a heartbeat protocol between the FEC and CEC by using
the ECHO flag and the NLMSG_NOOP message. The heartbeat, in addition
to listening to FE or CE events, could be used to facilitate takeover.

This topic is beyond the scope of ForCES and will not be discussed
further here. Note, however, that Netlink2 has the mechanisms required
to enable this when required.

4.3.4.3. The ACK Netlink2 Message

This message is actually used to denote both an ACK and a NACK.
Typically the direction is from FEC to CEC (in response to an ACK
request message). However, a CEC should be able to send ACKs back to
an FEC when requested. The semantics for this are IP service specific.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Netlink2 message header                    |
|                      type = NLMSG_ERROR                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          error code                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  OLD Netlink2 message header                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Error code: integer (typically 32 bits)

An error code of zero indicates that the message is an ACK response.
An ACK response message contains the original Netlink2 message header,
which can be used to compare against (sent sequence numbers, etc.).

A non-zero error code message is equivalent to a Negative ACK (NACK).
In such a situation, the Netlink2 data that was sent down to the kernel
is returned appended to the original Netlink2 message header.

4.3.4.4. Batching, Atomicity and Ordering of Transactions

As mentioned earlier (repeated here for clarity), standard Netlink
multi-message batching looks as follows:

   NLMSG:NLMSG:NLMSG....

where NLMSG is a Netlink2 header and its associated payload.

This has the advantage of allowing the intermixing of multiple commands
(for example, adds and deletes), generally in a request from CE->FE. It
is also useful for batching multiple events from the FE->CE.

In a two-phase commit, messages are bound into a relationship.
Typically, the first and all following headers have the NLM_F_MULTI
Netlink2 header flag set, except for the last header, which has the
Netlink2 header type NLMSG_DONE. Typically, in Netlink, the NLMSG_DONE
shows up in separate PDUs to define a commit.

Atomicity of a transaction, including that of a batch, is achieved by
using the NLM_F_ATOMIC flag. Use of NLM_F_ATOMIC is expensive because
it may necessitate locking access to tables (depending on the
implementation).

5. Protocol Architecture

IP services are defined by using service templates. Refer to the
Netlink document [Netlink] for the different templates used for IP
services that fit within the current scope of the ForCES charter.

ForCES in relation to NEs involves three phases: the pre-association
phase, the association phase where the ForCES protocol operates, and a
termination phase where a party in the relationship leaves a bundle.

1) The Pre-Association Phase

In a simple setup, this phase is static. All the parameters for the
association phase are well known (for example, the multicast groups for
each Netlink2 bundle and its wires).

In the case of dynamic discovery, the FE Manager and the CE Manager
agree on all the parameters and clearly articulate topology and other
information to each other. Vendors may use their own proprietary
service discovery protocol. As a minimum, we assume a static
configuration.

On completion of the Service Discovery phase, the FEM will have
established contact with the appropriate CEM component. Initialization
and authentication will be complete at this point. An FE is issued a
service identifier which will be used for accounting, identification
and authentication purposes. The identifier is translated as the PID in
the association phase. The multicast and unicast addresses for
communication are also known at this point. All capabilities may also
have been discovered at this point.

2) The Association Phase

In this phase, the FE and CP components cooperate to deliver the IP
service. The CP component might be registered (in the pre-association
phase) to receive FE-specific services (such as link events).

Essentially, in this phase, the IP service is provisioned and
executing. The FE component might continuously get updates from the
control plane component on how to operate the service (for example, the
V4 forwarding route additions or deletions).
The association phase is where Netlink2 operates as the ForCES
protocol.

On startup, a SYN Netlink2 message with an ACK flag set is issued by
the FE on the bundle(s) to which the FE is connected. The controlling
CE will respond (given the ACK flag in the request) with either an ACK,
which implies that the FE has been accepted by the CE, or a NACK, which
is interpreted as a rejection of the FE by the CE. If no response is
received within a timeout period, a retry is attempted. After a
configurable number of retries without response, it is assumed that a
CE does not exist and control is handed to the FEM.

The SYN state is followed by the synchronization phase, where the FE is
loaded with updates to tables.

3) Service Termination

Service termination could be issued by either component of the service
abstraction. Normally it will be issued by the FE component so that the
latter does not continue to get billed for services. The FE component
may also issue the termination message if it wants to change to a
comparatively better CP service provider.

The FE or CE initiating the termination will issue a BOOT command with
a FIN extended flag. An ACK flag may be set if a response to the FIN is
required.

6. Putting together the base protocol for WG charter

7. References

[RFC1633]     R. Braden, D. Clark, and S. Shenker, "Integrated Services
              in the Internet Architecture: an Overview", RFC 1633,
              ISI, MIT, and PARC, June 1994.

[RFC1812]     F. Baker, "Requirements for IP Version 4 Routers",
              RFC 1812, June 1995.

[RFC2475]     M. Carlson, W. Weiss, S. Blake, Z. Wang, D. Black, and
              E. Davies, "An Architecture for Differentiated Services",
              RFC 2475, December 1998.

[RFC2748]     J. Boyle, R. Cohen, D. Durham, S. Herzog, R. Rajan,
              A. Sastry, "The COPS (Common Open Policy Service)
              Protocol", RFC 2748, January 2000.

[RFC2328]     J. Moy, "OSPF Version 2", RFC 2328, April 1998.

[RFC2844]     T. Przygienda, P. Droz, R.
              Haas, "OSPF over ATM and Proxy-PAR", RFC 2844, May 2000.

[RFC3358]     T. Przygienda, "Optional Checksums in Intermediate System
              to Intermediate System (ISIS)", RFC 3358, August 2002.

[RFC1157]     J.D. Case, M. Fedor, M.L. Schoffstall, C. Davin, "Simple
              Network Management Protocol (SNMP)", RFC 1157, May 1990.

[RFC3036]     L. Andersson, P. Doolan, N. Feldman, A. Fredette,
              B. Thomas, "LDP Specification", RFC 3036, January 2001.

[stevens]     G.R. Wright, W. Richard Stevens, "TCP/IP Illustrated
              Volume 2, Chapter 20", June 1995.

[netfilter]   http://netfilter.samba.org

[diffserv]    http://diffserv.sourceforge.net

[Netlink]     J. H. Salim, H. Khosravi, A. Kleen, A. Kuznetsov,
              "Netlink as an IP Services Protocol",
              draft-ietf-forces-netlink-03.txt, June 2002.

[forces_req]  H. Khosravi, T. Anderson, "Requirements for Separation
              of IP Control and Forwarding",
              draft-ietf-forces-requirements-07.txt, October 2002.

[xtp]         XTP Forum, "Xpress Transport Protocol Specification, XTP
              Revision 4.0", March 1995.

8. Authors' Addresses

Jamal Hadi Salim
Znyx Networks
Ottawa, Ontario
Canada
hadi@znyx.com

Robert Haas
IBM Research
Zurich Research Laboratory
Saeumerstrasse 4
CH-8803 Rueschlikon
Switzerland
rha@zurich.ibm.com

9. Appendix 1: Sample Service Hierarchy

In the diagram below we show a simple IP service, foo, and the
interaction it has between CP and FE components for the service
(labels 1-3).

The diagram is also used to demonstrate CP<->FE addressing. In this
section we illustrate only the addressing semantics. In Appendix 2, the
diagram is referenced again to define the protocol interaction between
service foo's CEC and FEC (labels 4-10).

 CP
[--------------------------------------------------------.
|   .-----.                                              |
|  |       |                 . --------.                 |
|  |  CLI  |                /           \                |
|  |       |               | CP protocol |               |
|          /->> -.         |  component  | <-.           |
|    __ _/       |         |     For     |   |           |
|                |         | IP service  |   ^           |
|                Y         |     foo     |   |           |
|                |          \___________/    ^           |
|                Y  1,4,6,8,9  /  ^ 2,5,10   | 3,7       |
 --------------- Y------------/---|----------|-----------
                 |            ^   |          ^
               **|************|***|**********|**********
               ************* Netlink2 layer ************
               **|************|***|**********|**********
 FE              |            |   ^          ^
       .-------- Y------------Y---|--------- |----.
       |                      |             /     |
       |                      Y            /      |
       |            . --------^-------.   /       |
       |            |FE component/module|/        |
       |            | for IP Service    |         |
--->---|------>-----|        foo        |----->---|------>--
       |             -------------------          |
       |                                          |
       |                                          |
        ------------------------------------------

The control plane protocol for IP service foo does the following to
connect to its FE counterpart. The steps below are also numbered in the
diagram above.

1) Connect to IP service foo through a socket connect. A typical
   connection would be via a call to:

      socket(AF_NETLINK, SOCK_RAW, NETLINK_FOO)

2) Bind to listen to specific async events for service foo.

3) Bind to listen to specific async FE events.

Note that a wrapper socket can be created on top of the real sockets:
depending on the destination PID given, it chooses the most appropriate
socket to send the packet onto (if there are two multicast groups, one
for all FEs and one for all FEs and CEs, a packet from the CE to the
FEs will use the first multicast group). The wrapper socket basically
maps a message to the most appropriate wire in the bundle.

10. Appendix 2: Sample Protocol for the foo IP Service

Our proverbial IP service "foo" is used again to demonstrate how one
can deploy a simple IP service control using Netlink2.

These steps are continued from Appendix 1 (hence the numbering).
4) Query for the current config of the FE component.

5) Receive the response to 4) via the channel on 3).

6) Query for the current state of IP service foo.

7) Receive the response to 6) via the channel on 2).

9) Register the protocol-specific packets you would like the FE to
   forward to you.

10) Send specific service foo commands and receive responses for them
    if needed.

10.1. Interacting with Other IP Services

The diagram in Appendix 1 shows another control component configuring
the same service. In this case, it is a proprietary Command Line
Interface. The CLI may or may not be using the Netlink protocol to
communicate with the foo component. If the CLI should issue commands
that will affect the policy of the FEC for service "foo", then the
"foo" CEC is notified. It could then make algorithmic decisions based
on this input (for example, if an FE allowed another service to delete
policies installed by a different service and a policy that foo
installed was deleted by service bar, there might be a need to
propagate this to all the peers of service "foo").

11. Appendix 3: Examples

In this example we show a simple configuration Netlink2 message sent
from a TC CEC to an egress TC FIFO queue. This queue algorithm is based
on packet counting and drops packets when the queue exceeds the limit
of 100 packets. We assume the queue is in a hierarchical setup with a
parent of 100:0 and a classid of 100:1, and that it is to be installed
on a device with an ifindex of 4.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |            flags_e            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Type (RTM_NEWQDISC)      |      Flags (NLM_F_EXCL |      |
|                               | NLM_F_CREATE | NLM_F_REQUEST) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              Sequence Number (arbitrary number)               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          source PID                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        destination PID                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Family(AF_INET)|   Reserved1   |           Reserved1           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Interface Index (4)                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Qdisc handle (0x1000001)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Parent Qdisc (0x1000000)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         TCM Info (0)                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        Type (TCA_KIND)        |           Length(4)           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Value ("pfifo")                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Type (TCA_OPTIONS)       |           Length(4)           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Value (limit=100)                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
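As an illustration only, the fixed-size part of the message above (the
Netlink2 header followed by the TC message body) could be packed as in
the following Python sketch. It assumes network byte order for all
fields and leaves flags_e at zero, since this section does not assign
bit values to the extended flags; both choices are assumptions. The
numeric values of NLM_F_REQUEST, NLM_F_EXCL, NLM_F_CREATE and
RTM_NEWQDISC are those of the Linux Netlink headers. The helper names
(tc_handle, build_newqdisc) and the PID values are hypothetical, and
the TCA_KIND and TCA_OPTIONS TLVs are passed through as opaque bytes
rather than encoded here.

```python
import socket
import struct

# Flag and type values as defined by Linux Netlink / rtnetlink.
NLM_F_REQUEST = 0x001
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400
RTM_NEWQDISC = 36
AF_INET = int(socket.AF_INET)

# Assumption: every field is carried in network byte order ("!").
# Netlink2 header: Length, flags_e, Type, Flags, Sequence Number,
# source PID, destination PID (20 bytes in total).
NL2_HDR = struct.Struct("!HHHHIII")
# TC message body: Family, Reserved1, Reserved1, Interface Index,
# Qdisc handle, Parent Qdisc, TCM Info (20 bytes in total).
TCMSG = struct.Struct("!BBHiIII")


def tc_handle(major, minor):
    """Combine a major:minor pair into a 32-bit TC handle."""
    return (major << 16) | minor


def build_newqdisc(seq, src_pid, dst_pid, ifindex, handle, parent, tlvs=b""):
    """Pack a Netlink2 RTM_NEWQDISC request (hypothetical helper)."""
    body = TCMSG.pack(AF_INET, 0, 0, ifindex, handle, parent, 0) + tlvs
    length = NL2_HDR.size + len(body)  # Length includes the header
    flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL
    flags_e = 0  # assumption: no extended flags set
    header = NL2_HDR.pack(length, flags_e, RTM_NEWQDISC, flags,
                          seq, src_pid, dst_pid)
    return header + body


# Parent 100:0, classid 100:1, ifindex 4; the PIDs are made-up values.
msg = build_newqdisc(seq=1, src_pid=0x10, dst_pid=0x20, ifindex=4,
                     handle=tc_handle(0x100, 1),
                     parent=tc_handle(0x100, 0))
```

With no TLVs appended, the resulting message is 40 bytes long: 20 bytes
of Netlink2 header followed by 20 bytes of TC message body.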